#ceph IRC Log


IRC Log for 2013-02-19

Timestamps are in GMT/BST.

[0:00] <ShaunR> it'd actually be nice to see a list of connections too, is that possible?
[0:01] <joshd1> you can see in-flight requests with the admin socket 'objecter dump_requests' command
[0:02] <joshd1> io can be tracked at the hypervisor level too
[0:02] <darkfader> i could push the monitoring software i worked on for the last 2 years but i know our iops calculation from /proc/diskstats was buggy :)
[0:03] <darkfader> joshd1: how do you track from the hypervisors? i know in xen you have libxenstat - how about others?
[0:03] <iggy> qemu has it built in... info blockstats
[0:03] <iggy> or some such
[0:04] <darkfader> cool
[0:04] <darkfader> thanks :)
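The `info blockstats` output iggy mentions is a flat line of `name=value` counters per device, which lends itself to simple scraping. A minimal sketch of a parser — the sample line and counter names here are assumptions based on typical qemu output, so check your qemu version:

```python
import re

def parse_blockstats(line):
    """Split one line of qemu's 'info blockstats' monitor output
    (e.g. "virtio0: rd_bytes=... wr_operations=...") into a dict."""
    dev, _, stats = line.partition(":")
    return dev.strip(), {key: int(val) for key, val in re.findall(r"(\w+)=(\d+)", stats)}

# Sample line in the shape qemu prints; exact counter names vary by version.
sample = "virtio0: rd_bytes=12582912 wr_bytes=4096000 rd_operations=3072 wr_operations=1000"
dev, stats = parse_blockstats(sample)
```

Sampling these counters twice and dividing the deltas by the interval gives per-VM IOPS and bandwidth without touching /proc/diskstats.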
[0:18] * dosaboy (~gizmo@host86-164-229-186.range86-164.btcentralplus.com) has joined #ceph
[0:19] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[0:23] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[0:23] <infernix> why does it take about a minute to delete an rbd device?
[0:24] <infernix> *8GB rbd device?
[0:28] * dosaboy (~gizmo@host86-164-229-186.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[0:28] <infernix> and is there any way to make that go faster, maybe outside python?
[0:29] <infernix> i'm going to create and teardown TB sized rbd disks
[0:30] <Robe> infernix: I guess because you need to delete lots and lots of rados keys
[0:30] <Robe> dunno how parallelized the task is
[0:34] <joshd1> not at all parallelized. if someone would like to do so, that'd be great (http://tracker.ceph.com/issues/2256)
[0:37] <Robe> lol.
[0:38] <Robe> well, there's your answer.
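Since deletion walks the image's backing objects one at a time (per the tracker issue joshd1 links), a client-side workaround is to issue the per-object removes in parallel. A sketch of the idea only — the object-naming scheme and the `remove` callable are assumptions, the real thing would wrap librados, and `ThreadPoolExecutor` is Python >= 3.2 (the 2.6 boxes mentioned later in the log would need a backport or plain threads):

```python
from concurrent.futures import ThreadPoolExecutor

def delete_image_objects(remove, prefix, size_bytes, order=22, workers=16):
    """Issue the per-object deletes for an rbd image in parallel.

    remove     -- callable that deletes one rados object by name
                  (e.g. a wrapper around librados' remove; stubbed in tests)
    prefix     -- the image's block-name prefix (naming scheme assumed here)
    size_bytes -- provisioned image size
    order      -- objects are 2**order bytes each (22 -> 4 MiB)
    """
    num_objs = -(-size_bytes // (1 << order))  # ceil division
    names = ["%s.%012x" % (prefix, i) for i in range(num_objs)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(remove, names))
    return num_objs
```

An 8 GB image at the default 4 MiB object size is 2048 objects, which is why serial deletion takes on the order of a minute.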
[0:40] * darkfader (~floh@ Quit (Ping timeout: 480 seconds)
[0:47] * darkfader (~floh@ has joined #ceph
[0:53] <phantomcircuit> infernix, does it really matter?
[0:54] <junglebells> Oh boy :) I think my distributed dd just caused my cluster to crash.
[0:56] <junglebells> 2/3 nodes aren't even pinging at this point...
[1:00] * bylzz (~bylzz@hostname.se) has joined #ceph
[1:02] * bylzz (~bylzz@hostname.se) has left #ceph
[1:02] <infernix> phantomcircuit: very much. i will be deleting and creating up to 20-30TB of rbd devices daily
[1:03] * ScOut3R (~scout3r@5400CAE0.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[1:03] <infernix> so with the time it takes to delete 8gb
[1:03] <infernix> it will take a few days to delete 30TB worth of rbd devices i think
[1:03] <infernix> effectively killing my whole plan
[1:04] <junglebells> Hey all: Thanks for all the assistance today! I'll be online a bit later
[1:04] * junglebells (~bloat@0001b1b9.user.oftc.net) Quit (Quit: HAZAAH)
[1:05] <ShaunR> there is no way to control/limit IOPS or IO bandwidth per client i'm assuming? If you were going to offer customers access to your cluster, how would you go about keeping them from abusing resources?
[1:22] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:24] <iggy> ShaunR: if you are talking VMs, qemu has some knobs to control that (but it's not perfect)
[1:27] <phantomcircuit> iggy, it actually is pretty perfect
[1:28] <phantomcircuit> infernix, why are you planning on creating/deleting that much data daily
[1:28] <phantomcircuit> on rbd's
[1:28] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has left #ceph
[1:28] <iggy> somehow I doubt that... the words qemu and perfect don't often go together
[1:28] <phantomcircuit> iggy, it works very very well
[1:29] <phantomcircuit> the problem is a lot of stuff assumes that issuing read iops sequentially is free
[1:29] <phantomcircuit> grub issues thousands of individual reads when it loads the kernel
[1:30] <phantomcircuit> so yeah the qemu iops limits work
[1:30] <phantomcircuit> but it tends to break poorly written things
[1:32] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:36] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[1:39] <infernix> phantomcircuit: backups
[1:40] <infernix> i could overwrite and/or resize the rbd backup disk if it already exists, but it's less ideal
[1:43] <darkfader> phantomcircuit: wow, thats interesting info
[1:50] <phantomcircuit> infernix, backups of what?
[1:50] <phantomcircuit> cause that seems like a very very inefficient way to do that
[1:52] <infernix> phantomcircuit: basically i will be dd-ing many TBs of data from SSDs to Ceph on a daily basis
[1:55] <ShaunR> iggy: i plan on testing that actually since i doubt cgroups will work with rbd... but i was mainly talking about giving customers access to their own storage space where you dont have control over their server (fuse/rbd mount)
[1:57] <ShaunR> phantomcircuit: i'm glad to hear somebody is using it, i talked with the guy who was developing it about a month ago and it sounded like the project kinda fell off the importance table. His last words about it were that the algorithm needed some optimizing but nobody took on that project.
[1:59] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[2:04] <phantomcircuit> ShaunR, what it needs is to detect sequential access and merge them
[2:04] <phantomcircuit> normally the os does a good job of that
[2:04] <phantomcircuit> but the specific case of grub is a real nuisance
[2:05] <phantomcircuit> it means i have all the vms setup with several orders of magnitude more read iops than i would otherwise
[2:09] <ShaunR> Whats the issue with grub?
[2:10] <phantomcircuit> it issues 1 read io for every 4KB of kernel image
[2:11] <phantomcircuit> so
[2:11] <phantomcircuit> 20 MB initrd
[2:11] <phantomcircuit> 4 MB kernel
[2:12] <phantomcircuit> so about 6k read ios to start a ubuntu system
[2:12] <phantomcircuit> with a conventional disk that doesn't matter since it's all sequential
[2:12] <phantomcircuit> the disk merges them
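The back-of-the-envelope arithmetic above, spelled out with the sizes quoted in channel:

```python
KiB = 1024
initrd = 20 * 1024 * KiB                # 20 MB initrd
kernel = 4 * 1024 * KiB                 # 4 MB kernel
reads = (initrd + kernel) // (4 * KiB)  # grub issues one read per 4 KB
```

24 MB at one read per 4 KB is 6144 requests, i.e. "about 6k read ios" — and with an iops cap of, say, 200/s that alone is half a minute of boot time.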
[2:13] <lurbs> Does the page cache of the host OS catch it?
[2:13] <phantomcircuit> lurbs, sure but that doesn't help at all with the rate limiting :)
[2:17] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[2:18] <lurbs> Yeah, of course, because they're getting limited after that.
[2:21] <lurbs> You could try the weighted limits via blkiotune instead of the absolute ones in blkdeviotune, I guess.
[2:22] <lurbs> That's still per VM host, though, not on the Ceph side.
[2:23] * LeaChim (~LeaChim@b0fac1c4.bb.sky.com) Quit (Read error: Connection reset by peer)
[2:41] <infernix> Writes with 16 processes took 18.458577s at 887.609050253 MB/s
[2:41] <infernix> yay
[2:41] <infernix> Reads with 16 processes took 12.278051s at 1334.41374368 MB/s
[2:41] <infernix> even more yay
[2:42] <infernix> let's try this on a faster box
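Working infernix's figures backwards: the rate is just total MB over wall-clock seconds, and both lines are consistent with 16 GiB moved per pass (an inference from the numbers, not something stated in channel):

```python
total_mib = 16 * 1024                # 16 GiB total per pass (inferred)
write_rate = total_mib / 18.458577   # ~887.609 MB/s, matching the log
read_rate = total_mib / 12.278051    # ~1334.414 MB/s, matching the log
```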
[2:43] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[2:52] * MK_FG (~MK_FG@00018720.user.oftc.net) Quit (Ping timeout: 480 seconds)
[3:02] <iggy> phantomcircuit: what mgmt tool are you using? seems like something that could be handled at that layer... i.e. don't impose i/o limits until the guest has been "started" for 180s or something
[3:04] * MK_FG (~MK_FG@00018720.user.oftc.net) has joined #ceph
[3:04] <iggy> although, I'm honestly not sure if that's something that can be changed after the guest is started
[3:08] <iggy> yeah, there it is... block_set_io_throttle
[3:08] <ShaunR> phantomcircuit: has this been brought up to the dev guys, especially Ryan Harper?
[3:13] <phantomcircuit> iggy, the qemu block io throttle limits can only be set once at start on the command line
[3:13] <phantomcircuit> maybe there's a way to change them through the command interface
[3:13] <iggy> phantomcircuit: I posted
[3:16] <phantomcircuit> hmm?
[3:16] <iggy> there's a monitor command
[3:17] <phantomcircuit> oh there is one?
[3:17] <phantomcircuit> yeah i dont think that's the right one
[3:17] <phantomcircuit> let me double check though
[3:17] <iggy> block_set_io_throttle device bps bps_rd bps_wr iops iops_rd iops_wr "Change I/O throttle limits for a block drive to bps bps_rd bps_wr iops iops_rd iops_wr "
[3:17] <iggy> seems pretty spot on to me
[3:17] <phantomcircuit> ooh that's newish
[3:18] <phantomcircuit> or maybe it's not
[3:18] <phantomcircuit> i must have missed it :(
[3:19] <iggy> I'm looking at the generated docs on weilnetz.de... I assume they follow master
[3:19] <iggy> so I dunno when it originated
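The monitor command iggy quotes can also be driven over QMP from a management layer, which fits iggy's earlier idea of only clamping a guest some time after boot. A sketch that only builds the JSON message — the argument names follow the HMP signature quoted above, so verify them against your qemu version's QMP schema before relying on this:

```python
import json

def qmp_block_set_io_throttle(device, bps=0, bps_rd=0, bps_wr=0,
                              iops=0, iops_rd=0, iops_wr=0):
    """Build a QMP 'block_set_io_throttle' message; 0 means unlimited.

    'device' must be qemu's internal drive name (from 'info block'),
    not the guest-visible one.
    """
    return json.dumps({
        "execute": "block_set_io_throttle",
        "arguments": {"device": device,
                      "bps": bps, "bps_rd": bps_rd, "bps_wr": bps_wr,
                      "iops": iops, "iops_rd": iops_rd, "iops_wr": iops_wr}})
```

A management tool could send the message with generous limits at boot and tighten them a couple of minutes later, letting grub's burst of sequential 4 KB reads through before the cap bites.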
[3:27] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[3:36] <phantomcircuit> {"id":"libvirt-26160","error":{"class":"DeviceNotFound","desc":"Device 'sdb' not found"}}
[3:36] <phantomcircuit> huh
[3:39] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[3:42] <phantomcircuit> nvm
[3:47] <iggy> I take it you figured out it's the qemu internal name (i.e. gotten from "info block")
[3:47] <phantomcircuit> yeah
[4:04] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 18.0.2/20130201065344])
[4:29] * The_Bishop (~bishop@e177089147.adsl.alicedsl.de) has joined #ceph
[4:41] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[4:56] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[5:06] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[5:12] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[5:13] * al (d@niel.cx) Quit (Ping timeout: 480 seconds)
[5:17] * al (d@niel.cx) has joined #ceph
[5:18] * ananthan_RnD (~ananthan@ has joined #ceph
[5:19] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[5:24] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:29] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[6:07] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:10] * Svedrin (svedrin@glint.funzt-halt.net) Quit (Ping timeout: 480 seconds)
[6:12] * chftosf (uid7988@hillingdon.irccloud.com) Quit (Ping timeout: 480 seconds)
[6:12] * jochen (~jochen@laevar.de) Quit (Remote host closed the connection)
[6:12] * stefunel- (~stefunel@static. Quit (Ping timeout: 480 seconds)
[6:12] * Tribaal (uid3081@tooting.irccloud.com) Quit (Ping timeout: 480 seconds)
[6:13] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[6:14] * scalability-junk (uid6422@id-6422.richmond.irccloud.com) Quit (Ping timeout: 480 seconds)
[6:14] * jefferai (~quassel@quassel.jefferai.org) Quit (Ping timeout: 480 seconds)
[6:18] * ivoks (~ivoks@jupiter.init.hr) Quit (Ping timeout: 480 seconds)
[6:19] * jefferai (~quassel@quassel.jefferai.org) has joined #ceph
[6:19] * Svedrin (svedrin@glint.funzt-halt.net) has joined #ceph
[6:19] <- *Svedrin* "I am currently away. Please leave a message, I will see it as soon as I get back."
[6:20] * stefunel (~stefunel@static. has joined #ceph
[6:23] * ivoks (~ivoks@jupiter.init.hr) has joined #ceph
[6:23] * jochen (~jochen@laevar.de) has joined #ceph
[8:02] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) has joined #ceph
[8:10] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:17] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:42] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[8:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[8:50] * gerard_dethier (~Thunderbi@ has joined #ceph
[9:02] * ScOut3R (~scout3r@1F2EAE7E.dsl.pool.telekom.hu) has joined #ceph
[9:04] * ScOut3R (~scout3r@1F2EAE7E.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[9:06] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[9:08] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[9:11] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:19] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[9:27] * Morg (b2f95a11@ircip2.mibbit.com) has joined #ceph
[9:31] * leseb (~leseb@2001:980:759b:1:5823:7c25:8ad6:af7a) has joined #ceph
[9:31] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:37] * l0nk (~alex@ has joined #ceph
[9:46] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[9:56] * eschnou (~eschnou@ has joined #ceph
[10:00] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) has joined #ceph
[10:01] * low (~low@ has joined #ceph
[10:01] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:03] * Tribaal (uid3081@id-3081.hillingdon.irccloud.com) has joined #ceph
[10:04] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[10:09] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[10:13] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[10:20] * LeaChim (~LeaChim@b0fac1c4.bb.sky.com) has joined #ceph
[10:25] * hvn (~hvn@ has joined #ceph
[10:43] * hvn (~hvn@ Quit (Quit: leaving)
[10:48] * andret (~andre@2a02:2528:ff65:0:129a:ddff:feae:7fe5) has joined #ceph
[10:50] * leseb (~leseb@2001:980:759b:1:5823:7c25:8ad6:af7a) Quit (Remote host closed the connection)
[10:57] * andret (~andre@2a02:2528:ff65:0:129a:ddff:feae:7fe5) Quit (Remote host closed the connection)
[11:11] <lxo> is it normal for a CDir to contain entries with snapid (unsigned long)-2 ?
[11:13] <lxo> I've been looking into why the mds crashes when I move files or dirs out of some dirs that may have been corrupted by earlier versions of ceph. CDir::add_remote_dentry fails the initial assertion because lookup finds an entry for the entry that's being moved out
[11:14] <lxo> oddly enough, restarting the mds enables the move to complete successfully, but then it crashes again when moving the next subdir
[11:15] <lxo> the directory in question used to have snapshots, with similar names even, but it hasn't had any snapshots left for a long time
[11:16] <absynth> or "snapshits", as someone stated here earlier
[11:16] <lxo> heh. that sounds right, unfortunately ;-)
[11:17] <absynth> it was a typo, but a very freudian one
[11:17] <lxo> indeed
[11:19] <absynth> what you are doing sounds a bit like a remediation that sage had for our cluster - but there, the OSDs started crashing when they touched files that weren't supposed to be there anymore
[11:19] <absynth> what version is that, current?
[11:19] <lxo> my ceph filesystem seems to have piled up a large number of issues because of snap-o-shits; I've recently found out I overdid the PG counts for my home cluster by about an order of magnitude, so I'll probably rebuild the cluster eventually
[11:20] <lxo> I'm running 0.56.3 now
[11:20] <lxo> but this cluster has run very many earlier releases too
[11:21] <lxo> it's my longest-living ceph filesystem ever. I've managed to put nearly *all* of my backups in it already
[11:21] * BManojlovic (~steki@ has joined #ceph
[11:21] <absynth> i`d say "file a ticket", but you probably don't have an inktank support contract for the cluster with the crashing MDSes, do you?
[11:22] <lxo> nah, I've been doing self support out of interest in ceph. had a few patches going in, and various other bugs I reported got fixed over the past couple of years :-)
[11:23] <lxo> I'll try to remove the dirs and re-create them once I'm done moving data out of them
[11:24] <lxo> hopefully the removal will succeed. though I've had corruption before (long ago) in which dirs seemed to be empty but couldn't be removed ;-)
[11:31] <absynth> maybe if you have some free time, you can fix the massive memleak in OSDs during deepscrub?
[11:31] <absynth> pretty please? :)
[11:43] <lxo> you got any details?
[11:45] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[11:50] * chegk (~devnull@ Quit (Quit: Leaving)
[11:51] <absynth> uh, it's on the mailing list
[11:51] <absynth> we don`t have a reliable trigger yet, but there are numerous people who are seeing scrub leaks
[11:51] <absynth> for us, it was more like an iceberg-sized hole, not a small leak
[11:52] <lxo> aah, the (unsigned)-2 is the actual mainline entry. the oddity is the presence of another entry, probably created last time I moved stuff out of the dir by cow, and that cow is now trying to create again. and since I didn't create any snapshots since then, the snapid for the cow is the same, and so the assertion fails
[12:01] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:05] * scalability-junk (uid6422@id-6422.tooting.irccloud.com) has joined #ceph
[12:09] <lxo> filed bug 4188
[12:22] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[12:32] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[12:44] * nz_monkey (~nz_monkey@ Quit (Remote host closed the connection)
[12:44] * nz_monkey (~nz_monkey@ has joined #ceph
[12:44] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:54] * diegows (~diegows@ has joined #ceph
[12:57] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:58] * andreask (~andreas@93-189-29-152.rev.ipax.at) has joined #ceph
[13:01] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[13:15] * andreask (~andreas@93-189-29-152.rev.ipax.at) Quit (Ping timeout: 480 seconds)
[13:16] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:20] * itamar (~itamar@ has joined #ceph
[13:20] * itamar (~itamar@ Quit ()
[13:20] * itamar (~itamar@ has joined #ceph
[13:22] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[13:27] * Ul (~Thunderbi@135.219-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[13:32] * itamar (~itamar@ Quit (Remote host closed the connection)
[13:33] * itamar (~itamar@ has joined #ceph
[13:33] * itamar (~itamar@ Quit (Remote host closed the connection)
[13:33] * itamar (~itamar@ has joined #ceph
[13:34] <itamar> hi
[13:34] <itamar> I have an argonaut setup using cephfs that suffered from a power failure.
[13:35] <Ul> hello everybody. I can't get qemu-kvm to use the rbd image as a disk. I've configured the kvm xml file to use the monitor, I've created a virsh secret and I've added the <auth> tag to define the authentication. when I do a virsh create of the xml file, it says "error connecting" to the monitors. I'm following the steps shown here http://wiki.skytech.dk/index.php/Ceph_-_howto,_rbd,_lvm,_cluster#KVM_-_add_secret.2Fauth_for_use_with_ce
[13:35] * itamar (~itamar@ Quit ()
[13:35] * itamar (~itamar@ has joined #ceph
[13:36] <absynth> itamar: and?
[13:36] <absynth> (we have, too)
[13:36] <itamar> thanks
[13:36] <itamar> now, the mds showd as up:replay
[13:36] <itamar> and doesn't serve metadata
[13:37] <itamar> actually I had two..
[13:37] <itamar> upgraded one of them to 0.56.3 to see if it helps
[13:37] <itamar> it doesn't :(
[13:37] <Ul> i'm running ubuntu 12.04 LTS. libvirt-bin is 0.9.8-2ubuntu17 which contains a bugfix for that I think. Any ideas what i'm doing wrong?
[13:37] <absynth> you had 2 MDSes?
[13:38] <itamar> actually I have an MDS on every server in the cluster, but only one active
[13:39] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:39] <absynth> are your osds etc. up?
[13:39] <itamar> all is up, health OK
[13:39] <absynth> the replay can take a while, i personally don't use cephfs though
[13:39] <itamar> are two days enough? :(
[13:39] <absynth> oh.
[13:39] <absynth> that is, errm... weird
[13:39] <absynth> can the MDSes see each other?
[13:40] <itamar> yup, all on one happy 10Gig network
[13:40] <absynth> hm... probably you should stay in here for a couple more hours
[13:40] <absynth> as in, 5 or 6 more hours, until the US guys are awake
[13:40] <itamar> :) already opened a ticket to Inktank, thought I will try to beat them to it..
[13:41] <absynth> oh, you have a support contract?
[13:41] <itamar> I'm on GMT+2 I'll be asleep when they will start warming up..
[13:41] * fghaas (~florian@ has joined #ceph
[13:41] <itamar> I do..
[13:41] <absynth> yesterday was a holiday in the US
[13:42] <itamar> yes, so I was told..
[13:42] <absynth> so they will probably start working on your ticket today, i guess
[13:42] <absynth> IMHO, it`s probably a good idea to leave the system as it is if you don't have a business critical application on it
[14:01] * verwilst (~verwilst@d5152D6B9.static.telenet.be) has joined #ceph
[14:05] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Remote host closed the connection)
[14:06] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[14:08] * fghaas (~florian@ Quit (Ping timeout: 480 seconds)
[14:16] * mgalkiewicz (~mgalkiewi@staticline-31-182-128-35.toya.net.pl) has joined #ceph
[14:17] * l0nk (~alex@ Quit (Remote host closed the connection)
[14:18] * l0nk (~alex@ has joined #ceph
[14:39] * gaveen (~gaveen@ has joined #ceph
[14:40] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:46] <infernix> ugh, is there a more recent ceph kernel available for precise?
[14:46] <infernix> i'm getting OOPSes for bonding in 3.5.0
[14:48] <absynth> you'll have to build one yourself, i guess
[14:48] <nhm> infernix: I think we have 3.6.3 builds
[14:49] <infernix> nhm: o hey
[14:49] <infernix> i have a multithreaded benchmark tool for you
[14:49] <nhm> infernix: what does it benchmark?
[14:50] <infernix> writes random generated data to an rbd device, then rereads it
[14:50] <nhm> huh, I guess we don't have the 3.6.3 debs up on gitbuilder anymore. We've got 3.8
[14:50] <infernix> can specify number of threads, rbd size
[14:50] <infernix> all in userspace, e.g. python
[14:50] <infernix> 3.8 is fine
[14:50] <nhm> infernix: http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/v3.8/
[14:51] <infernix> nhm: i have just one thing to add to it; right now it'll allocate the total disk size in ram, i need to make that into a specifyable buffer
[14:51] <infernix> but it works nicely. been pushing 1.2GB rbd writes
[14:51] <nhm> infernix: cool!
[14:51] <infernix> which is about the limit with 59 OSDs
[14:52] <nhm> infernix: right now I'm just using fio on rbd devices.
[14:52] <infernix> yeah but that goes into kernel
[14:52] <infernix> as i have to work on centos 5 boxes, i have to stay in userspace
[14:52] <infernix> so it's all librados
[14:52] <nhm> infernix: I'm also topping out at about 1.2-1.4GB/s with a single rbd device.
[14:52] <nhm> infernix: ah, interesting!
[14:53] <infernix> but currently messing with bonding and infiniband
[14:54] <infernix> looks like i have a way to utilize both ipoib links for transmit
[14:54] <wer> what pg_num/pgp_num did you guys use for your pools?
[14:54] <infernix> so that's 40gbit of ip
[14:54] <nhm> infernix: btw, I'm doing some testing on centos5 boxes that have new kernels but otherwise are ancient. Ended up building GCC, probably about 20 different dependencies, finally got Ceph to compile from github.
[14:54] <Teduardo> does rbd connect to multiple hosts?
[14:54] <infernix> nhm: i have packages
[14:54] <nhm> infernix: Don't have root
[14:54] <infernix> didn't need to build gcc
[14:55] <infernix> aha
[14:55] <nhm> infernix: well, that's not entirely true
[14:55] <nhm> infernix: I have root, but am not supposed to make permanent changes.
[14:55] <infernix> nhm: well i am testing against centos 5.9
[14:55] <nhm> infernix: the old version of GCC on the boxes was causing very strange build problems.
[14:55] <infernix> so all default, but with python2.6 from EPEL
[14:55] <nhm> infernix: this is likely 5.5 or 5.0 or something.
[14:55] <infernix> hm. maybe i did pull in a newer gcc
[14:55] * infernix checks
[14:56] <nhm> infernix: also using ipoib on those boxes. I can do about 2.0GB/s per client max with rados bench.
[14:57] <infernix> nhm: gcc 4.4.6
[14:57] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[14:57] <infernix> nhm: eh. 2GB/s?
[14:58] <infernix> how many OSDs
[14:58] <nhm> infernix: that test was with 24 OSDs
[14:59] <infernix> o_O
[14:59] <infernix> i have 59 and i have to push to get to 1.2GB writes
[14:59] <nhm> On 4 nodes.
[14:59] <infernix> are you on FDR?
[14:59] <infernix> 5 nodes here
[14:59] <nhm> infernix: QDR
[15:00] <nhm> infernix: the SC847a chassis I have does better. I was able to do 2GB/s over bonded 10GbE from another node to that 1 chassis.
[15:00] <infernix> hrm
[15:00] <nhm> infernix: internally I've hit up to 2.7GB/s to localhost when using ext4.
[15:00] <infernix> maybe i need to go and disable power management on these boxes
[15:00] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[15:01] <wer> nhm: how many pg's did you define for your pool at 59 osd's?
[15:01] <nhm> infernix: one thing that you might want to do is check the admin socket for each osd periodically as you are running tests and see if ops are backing up on any specific OSDs or nodes.
[15:01] <wer> err infernix I meant...
[15:01] <nhm> wer: I didn't do 59 osds, but in my case with 24 OSDs I think I was using 4096 PGs.
[15:02] <wer> nhm: and you are getting 2.7GB/s writes?
[15:03] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[15:04] <nhm> wer: on that cluster there are 4 nodes and the max I got is 3.3GB/s so far with RADOS bench. I'm trying to hit about 4-5GB/s.
[15:04] <infernix> ok why does ceph not start at bootup when i'm on that 3.8.0 kernel
[15:04] <jmlowe> I'm having trouble with backfill
[15:04] <wer> yeah me too, and I am no where close. nhm what's your replication factor?
[15:05] <jmlowe> specifically it's glacial
[15:05] <nhm> wer: 1
[15:05] <nhm> wer: well, no replication
[15:05] <infernix> nhm: aha :D
[15:05] <wer> ok. That makes more sense.
[15:05] <infernix> i'm on 2x
[15:05] <nhm> ah, ok
[15:05] <wer> yeah, I need 3x and can sustain about 1GB ish...
[15:06] <wer> I think I overshot the number of PG's and do better with less....
[15:06] <wer> But I was sizing for the production number of osd's..... make things kind of a pain.
[15:07] <nhm> wer: I've been doing some work trying to pin down how performance and data distribution changes with the number of PGs, but the results so far seem to oscillate. It's strange.
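For wer's sizing question, the usual heuristic from the Ceph docs of this era (not from anyone in channel) is on the order of 100 PGs per OSD divided by the replica count, rounded up to a power of two:

```python
def suggest_pg_num(num_osds, replicas, pgs_per_osd=100):
    """Rule-of-thumb pg_num: ~100 PGs per OSD over the replica count,
    rounded up to a power of two. A starting point only -- it does not
    guarantee even data distribution (see nhm's caveat above)."""
    target = num_osds * pgs_per_osd // replicas
    pg_num = 1
    while pg_num < target:
        pg_num <<= 1
    return pg_num
```

With 24 OSDs and no replication this lands on 4096, matching what nhm reports using; 59 OSDs at 2x also rounds to 4096.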
[15:07] <absynth> oh, btw
[15:07] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:07] <absynth> does anyone have scripts for OSD memory consumption monitoring in munin or similar?
[15:07] <absynth> we really need to keep close tabs on that memory
[15:08] <wer> nhm: yeah. I have been questioning that as well. Or the number of osd's.... since you have to predetermine all that stuff. I guess I can always give radosgw multiple pools.. but I am not really clear on how that works either.
[15:08] <nhm> absynth: I seem to remember both some munin and nagios work being done. Not sure if it includes memory monitoring. I think DH uses collectd and there's some ceph collectd stuff. I just use collectl (it can do per process monitoring)
[15:09] <infernix> urgh, how do I debug cephs init script not doing anything at all?
[15:09] <infernix> like, not even -v outputs anything
[15:10] <nhm> infernix: hrm, check the paths?
[15:10] <absynth> nagios plugins exist, they basically look at ceph -s
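For the per-OSD memory graphing absynth wants, scraping VmRSS out of /proc is usually enough for a munin-style plugin. A sketch of a hypothetical helper (not an existing plugin) that parses the status text, kept file-free so the parsing is testable:

```python
def osd_rss_kib(status_text):
    """Return VmRSS (resident set size, KiB) from the contents of
    /proc/<pid>/status, or None if the field is absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None

# A real plugin would read open("/proc/%d/status" % pid).read()
# for each running ceph-osd pid and emit one graph series per OSD.
```

Sampled every few minutes, this makes a slow scrub-driven leak show up as a staircase rather than a surprise OOM.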
[15:13] <infernix> get_name_list is empty
[15:17] <infernix> ceph-conf -l mon returns zip, same for osd
[15:17] <infernix> does ceph-deploy create a valid config file?
[15:19] <infernix> it seems to me like it doesn't since there's no [osd.x] or [mon] entries in ceph.conf
[15:19] * junglebells (~junglebel@0001b1b9.user.oftc.net) has joined #ceph
[15:19] <infernix> or maybe i broke it
[15:23] * mib_r0srk9 (522f4fbe@ircip2.mibbit.com) has joined #ceph
[15:23] <mib_r0srk9> http://www.fanteamz.com/bloodstock/cc/6/97
[15:30] * joao sets mode +b *!*522f4fbe@*.mibbit.com
[15:30] * mib_r0srk9 was kicked from #ceph by joao
[15:31] <infernix> so i probably overwrote my ceph.conf files with ceph-deploy admin
[15:31] * joao sets mode -b *!*522f4fbe@*.mibbit.com
[15:31] * joao sets mode +b mib_r0srk9!*@*
[15:32] * eschnou (~eschnou@ has joined #ceph
[15:32] <infernix> that, or ceph-deploy isn't compatible with the initscript
[15:33] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) has joined #ceph
[15:33] <infernix> because there's no osd or mon entries in ceph.conf. what do?
[15:33] <dilemma> I have an interesting problem with a single incomplete PG
[15:33] <dilemma> hopefully someone can point me in the right direction
[15:34] * itamar (~itamar@ Quit (Quit: Leaving)
[15:34] <dilemma> failures occurred to all OSDs handling this PG in such a way that the only newest copy of one object in this PG ended up on an OSD that had a failed disk
[15:35] <dilemma> all other objects in this PG exist on another OSD, but that one object is irrecoverable
[15:35] <dilemma> however, I can't convince the cluster to use the PG as it exists on other OSDs
[15:35] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[15:35] <dilemma> I've marked the OSD with the bad disk as lost
[15:36] <dilemma> but I still have "1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean"
[15:36] <dilemma> hmm... possibly too early for west coast guys
[15:38] <absynth> yup
[15:38] <absynth> wait another 2 hours
[15:38] <joao> sjust would certainly be the right guy to talk to
[15:39] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:40] <jmlowe> anybody know anything about stuck recovery?
[15:41] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[15:42] <mikedawson> jmlowe: what are you seeing?
[15:43] <jmlowe> failed drive yesterday, removed the osd from the cluster, still backfilling
[15:43] <mikedawson> jmlowe: "glacial"? ha
[15:43] <jmlowe> will go 10 minutes between recovery ops
[15:44] <absynth> how many recovery threads?
[15:44] <absynth> do you see slow requests?
[15:44] <jmlowe> I see slow requests
[15:44] <jmlowe> where would I find the recovery thread count?
[15:45] <infernix> ok so is it me or is ceph-deploy wholly inadequate for managing a production cluster? i see zero code generating the [osd.X] and [mon] ceph.conf entries?
[15:45] <absynth> ceph --admin-daemon $i config show | grep osd_recovery_max_active
[15:45] <absynth> with $i = /var/run/ceph/ceph-osdblah.asok
[15:45] <absynth> can you identify which OSD is causing them (sometimes it says "waiting for subops on X, with x being an osd number")?
[15:46] <absynth> or does the slow request even say "osd.x slow request"?
[15:46] <mikedawson> jmlowe: I have seen very slow backfill with a low # of healthy OSDs, but as OSDs increase, I've seen pretty decent performance. My experience has been the number of OSDs matters more than the amount of data
[15:46] <jmlowe> currently waiting for subops from [0,8]
[15:46] * diegows (~diegows@ has joined #ceph
[15:46] <jmlowe> so is that 0 or 8?
[15:46] <absynth> both, i think
[15:46] <absynth> do you have replica count = 3?
[15:46] <jmlowe> should be 2
[15:47] <absynth> maybe you should check both, to see if they are i/o saturated
[15:47] <absynth> slow requests piling up are usually not really a very good sign
[15:50] <jmlowe> osd.0 does appear to be choking on io
[15:54] <mikedawson> jmlowe: what do you have for osd_max_backfills and osd_backfill_scan_max?
[15:55] <mikedawson> jmlowe: and osd_disk_threads?
[15:56] <mikedawson> jmlowe: see conversation between me and Sam starting at 22:00 http://irclogs.ceph.widodh.nl/index.php?date=2013-01-09
[15:58] <jmlowe> Well, of course, I look away for a few minutes to do something else and I clear 2k objects
[15:58] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[15:59] <absynth> mikedawson: 100 recovery threads per osd? you are not colocating VMs on the nodes, are you?
[16:01] <mikedawson> absynth: that was just a test cluster with nothing else going on
[16:01] <absynth> ah, okay
[16:01] <absynth> because we start seeing slow requests when we put it above 2 ;)
[16:02] * PerlStalker (~PerlStalk@ has joined #ceph
[16:02] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:09] <jmlowe> absynth: ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok isn't the correct syntax
[16:10] <ron-slc> absynth: I've seen this, when size is 3 or above as well. I traced this down to overwhelming the 1Gb/s ethernet port. Multiple OSD hosts were bursting traffic to another OSD-replica, and the switch must OutputDrop anything faster than 1Gb/s
[16:10] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[16:10] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[16:11] <ron-slc> I would think the solution to this is a higher host count, and possibly ethernet bonding.
[16:12] <jmlowe> absynth: nm, got it
[16:12] <absynth> jmlowe: you are missing the "config show"
[16:12] <absynth> k
[16:12] <jmlowe> I was
[16:13] <absynth> ron-slc: no, we currently do not have a network or disk i/o throttle, we think it might be the load put on the OSD machines by the colocated VMs
[16:13] <absynth> we have 10gbe in the backend
[16:13] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[16:13] * ircolle1 (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[16:13] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[16:13] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Read error: Connection reset by peer)
[16:13] <ron-slc> absynth: good, probably not that then, I would glance at your switch's port error counters just for fun.
[16:14] * ircolle1 (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit ()
[16:14] <jmlowe> "osd_max_backfills": "10", "osd_backfill_scan_max": "512", "osd_disk_threads": "1", "osd_recovery_max_active": "5",
[16:16] <absynth> that looks like the default values
[16:16] <absynth> but if you are seeing slow requests already, you probably shouldnt increase them
[16:16] <absynth> maybe you want to lower max_active to 1, just for shits&giggles
[16:16] <jmlowe> I would hope they would be the defaults, I haven't changed them
[16:17] <absynth> ceph osd tell \* injectargs '--osd-recovery-max-active 0'
[16:17] <absynth> this would effectively stop recovery altogether
[16:17] <absynth> and might help to get rid of the slow requests
[16:17] <absynth> how many do you see?
[16:17] <absynth> and how old is the oldest?
[16:17] <jmlowe> 333.390766 secs
[16:17] <absynth> could be worse
[16:18] <absynth> means that there _is_ i/o ongoing
[16:18] <absynth> my approach would be: stop recovery by injecting said parameters to all OSDs
[16:18] <absynth> wait if slow requests disappear
[16:18] <absynth> then ramp up the threads, first to 1, then maybe to 2 or 3
[16:18] <absynth> until you start seeing slow requests regularly, then go down one until you have a stable state
[16:19] <absynth> (NB: i am not a inktank employee or something)
[16:21] <jmlowe> that sounds sensible, back off the recovery and start it up again in less of a free for all situation
[16:22] <absynth> maybe it helps
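absynth's throttle-then-ramp procedure above could be scripted roughly as follows. This is a sketch, not an Inktank-blessed recipe; it is dry-run by default (`CEPH` defaults to `echo ceph`, so the commands are printed rather than executed — set `CEPH=ceph` to run it against a real cluster):

```shell
#!/bin/sh
# Sketch of the approach discussed above: stop recovery, wait for slow
# requests to clear, then ramp osd-recovery-max-active back up one step
# at a time until slow requests reappear, then back off by one.
CEPH=${CEPH:-"echo ceph"}   # dry-run by default

# 1. Stop recovery on all OSDs.
$CEPH osd tell \* injectargs '--osd-recovery-max-active 0'

# 2. (Watch 'ceph -w' here until the slow requests disappear.)

# 3. Ramp back up gradually; stop at the last value with no slow requests.
for n in 1 2 3; do
    $CEPH osd tell \* injectargs "--osd-recovery-max-active $n"
done
```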
[16:24] * eschnou (~eschnou@ Quit (Ping timeout: 480 seconds)
[16:35] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[16:38] <jmlowe> hmm, 1 seems to be too many
[16:42] * fghaas (~florian@zux188-117.adsl.green.ch) has joined #ceph
[16:46] * aliguori (~anthony@ has joined #ceph
[16:47] * Ul (~Thunderbi@135.219-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[16:47] <absynth> then you have an I/O throttle somewhere
[16:47] <absynth> slow node (probably osd 0 or 8?)
[16:48] <absynth> check with nmon or iotop where the i/o goes
[16:48] <jmlowe> mikedawson: do you think that disk_threads bug sjust mentioned is in the latest point release?
[16:49] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has left #ceph
[16:50] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[16:52] * Morg (b2f95a11@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[16:57] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Quit: Leaving)
[16:58] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[17:01] * gerard_dethier (~Thunderbi@ Quit (Quit: gerard_dethier)
[17:01] * fghaas (~florian@zux188-117.adsl.green.ch) has left #ceph
[17:03] * low (~low@ Quit (Quit: Leaving)
[17:04] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[17:12] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:13] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[17:17] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[17:17] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[17:18] * gaveen (~gaveen@ has joined #ceph
[17:18] * ananthan_RnD (~ananthan@ Quit (Remote host closed the connection)
[17:19] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[17:24] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[17:31] <mattch> Ceph networking: What advantages are there to splitting it using the cluster and public network options? Isolation of clients from replication, allows you to have different network speeds/setups for the different tasks - anything else?
[17:32] <scuttlemonkey> jmlowe: you talking about the disk_threads setting hanging recovery?
[17:33] <jmlowe> yeah
[17:33] <scuttlemonkey> looks like it's set to "Can't Reproduce"
[17:33] <scuttlemonkey> http://tracker.ceph.com/issues/3772
[17:34] <absynth> mattch: ceph is very sensitive to latency and packet loss
[17:34] <absynth> so you want it in an isolated layer2 cloud that cannot be influenced by anything that happens in the rest of your network
[17:35] * mgalkiewicz (~mgalkiewi@staticline-31-182-128-35.toya.net.pl) Quit (Ping timeout: 480 seconds)
[17:36] <mattch> absynth: I was wondering about that - my decision is 1 4xbonded or 2 2xbonded interfaces. I guess the advantage of the former is that you never end up with 'unused capacity' but the latter means that replication gets its 'fair share' even if clients are saturating the public network to a particular node?
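For reference, the client/replication split mattch describes maps onto the `public network` and `cluster network` options in ceph.conf. A minimal sketch — the subnets are placeholders:

```ini
[global]
    ; client <-> monitor/OSD traffic (placeholder subnet)
    public network = 192.168.0.0/24
    ; OSD <-> OSD replication, recovery and heartbeat traffic (placeholder subnet)
    cluster network = 10.0.0.0/24
```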
[17:38] <infernix> can you make the same osd reachable on multiple IPs in the same cluster?
[17:38] <infernix> and does it offer any benefit?
[17:46] * verwilst (~verwilst@d5152D6B9.static.telenet.be) Quit (Quit: Ex-Chat)
[17:48] * Cube (~Cube@ has joined #ceph
[17:49] <mattch> infernix: Not multiple ips I don't think - but you can bond interfaces together for more bandwidth/redundancy if that helps
[17:49] * diegows (~diegows@ has joined #ceph
[17:52] <infernix> yeah, already set up with balance-xor layer3+4
[17:52] <infernix> older kernels don't like it with infiniband though, they panic
[17:53] <infernix> -13/43630 degraded (-0.030%) - how can i have a negative degraded percentage?
[17:56] * vata (~vata@2607:fad8:4:6:11f4:cedb:50b7:4c14) has joined #ceph
[17:58] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[17:58] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[18:01] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[18:02] * leseb (~leseb@mx00.stone-it.com) Quit (Remote host closed the connection)
[18:04] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:04] <mikedawson> infernix: http://tracker.ceph.com/issues/3720 Sage says this is a bug, but they haven't fixed it. I have seen it a few times when changing replication levels or deleting pools
[18:05] <absynth> infernix: argonaut?
[18:07] <mikedawson> jmlowe: I think #3772 was for my report, but Sam can't reproduce it, so there is no fix
[18:17] <jmlowe> If I wait long enough I think it will eventually recover
[18:21] <jtangwk> just read - http://ceph.com/community/deploying-ceph-with-comodit/#more-2880
[18:21] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:22] <jtangwk> wouldnt it be nice to have a 'ceph config' which spits out the whole config for the running cluster
[18:22] * jtangwk mumbles
[18:23] <Cube> ceph --admin-daemon /path/to/admin/socket config show
[18:23] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[18:23] <jtangwk> was that added recently?
[18:24] <Cube> Been around pre argonaut I believe.
[18:25] <darkfader> that's very similar to ceph config
[18:25] <jtangwk> i need to fire up a vm to test it
[18:26] * The_Bishop_ (~bishop@e179004124.adsl.alicedsl.de) has joined #ceph
[18:26] <jtangwk> while growing and shrinking a ceph cluster dynamically is nice, i'd like to be able to dump the config to disk for recovery purposes
[18:27] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:27] * leseb_ (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:32] <jtangwk> in fact im going to bring up a vm to test it now
[18:33] * loicd (~loic@128-79-136-150.hfc.dyn.abo.bbox.fr) has joined #ceph
[18:33] * The_Bishop (~bishop@e177089147.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[18:35] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[18:35] * ScOut3R (~scout3r@1F2EAE7E.dsl.pool.telekom.hu) has joined #ceph
[18:35] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:40] * loicd (~loic@128-79-136-150.hfc.dyn.abo.bbox.fr) Quit (Quit: Leaving.)
[18:42] <jtangwk> hmmm the --admin-socket show config isn't quite what i was hoping for
[18:43] <jtangwk> but its close
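jtangwk's wish — dumping the running config to disk in a reusable form — can be approximated by post-processing the admin-socket JSON. A sketch; the sample string stands in for real `ceph --admin-daemon <sock> config show` output, and the flat key = value rendering is an assumption about what a "recovery" dump should look like:

```python
import json

def config_to_conf(config_json, section="global"):
    """Render the JSON emitted by 'config show' as a ceph.conf-style
    section.  Keys come back underscore-separated; ceph.conf accepts
    underscores, so they are written through unchanged."""
    opts = json.loads(config_json)
    lines = ["[%s]" % section]
    for key in sorted(opts):
        lines.append("\t%s = %s" % (key, opts[key]))
    return "\n".join(lines)

# Sample standing in for real admin-socket output:
sample = '{"osd_max_backfills": "10", "osd_recovery_max_active": "5"}'
print(config_to_conf(sample, section="osd.0"))
```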
[18:46] * loicd (~loic@128-79-136-150.hfc.dyn.abo.bbox.fr) has joined #ceph
[18:48] * The_Bishop_ (~bishop@e179004124.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[18:56] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[18:58] * jlogan1 (~Thunderbi@2600:c00:3010:1:39ff:a0a8:8669:4dd6) has joined #ceph
[19:00] * loicd (~loic@128-79-136-150.hfc.dyn.abo.bbox.fr) Quit (Quit: Leaving.)
[19:08] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:08] * Ryan_Lane (~Adium@ has joined #ceph
[19:12] * rturk-away is now known as rturk
[19:13] * rturk is now known as rturk-away
[19:13] * rturk-away is now known as rturk
[19:13] <infernix> absynth: bobtail
[19:14] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:22] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[19:32] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[19:32] * chutzpah (~chutz@ has joined #ceph
[19:33] <wido> todin: Indeed, fixing it. rsync script stalled (again).....
[19:35] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[19:36] * Cube (~Cube@ has joined #ceph
[19:41] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: It's a dud! It's a dud! It's a du...)
[19:42] * The_Bishop (~bishop@2001:470:50b6:0:8deb:bafd:156e:3462) has joined #ceph
[19:44] <todin> wido: okay
[19:44] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has joined #ceph
[19:50] * noahmehl (~noahmehl@cpe-75-186-45-161.cinci.res.rr.com) Quit (Quit: noahmehl)
[19:54] * stxShadow1 (~Jens@jump.filoo.de) has joined #ceph
[19:57] * stxShadow1 (~Jens@jump.filoo.de) has left #ceph
[19:58] * junglebells (~junglebel@0001b1b9.user.oftc.net) Quit (Quit: leaving)
[19:58] * junglebells (~junglebel@0001b1b9.user.oftc.net) has joined #ceph
[19:59] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) Quit (Ping timeout: 480 seconds)
[20:01] * dmick (~dmick@2607:f298:a:607:d116:49ec:8642:4c24) has joined #ceph
[20:01] <dmick> identify nickpass
[20:04] <dmick> ah well, there's another one shot by my window manager... seems to be the week of recycling passwords
[20:05] <janos> lol
[20:13] <yehudasa> gregaf1: can you peak at wip-4177 (pretty trivial)?
[20:14] <yehudasa> peek
[20:14] <yehudasa> I mean
[20:14] <gregaf1> in the middle of Sam's watch-notify stuff (sorry it's delayed, Sam!), but put it on the queue
[20:17] * ScOut3R (~scout3r@1F2EAE7E.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[20:25] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[20:28] * eschnou (~eschnou@146.108-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:38] * Ul (~Thunderbi@ip-83-101-40-159.customer.schedom-europe.net) has joined #ceph
[20:40] * wschulze1 (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[20:43] <sstan> today I learned that hosts' clocks must be in sync
[20:44] <sstan> not sure if the documentation insists enough on that
[20:46] <junglebells> sstan: I thought it did :)
[20:47] <sstan> maybe I missed it : )
[20:47] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[20:47] <junglebells> sstan: Careful with updating time if you have a large offset too. I was considering setting up sntp with frequent updates to ensure sync
[17:47] <sstan> I have a problem ... ceph osd tree shows some osd is UP. However, that osd is actually down, does anyone know why ?
[20:48] <junglebells> Have you looked at that particular OSD's log? Or do you have any output from ceph -w perhaps?
[20:49] <sstan> I'll look into that, thanks. But it should be simple ... monitors check if some osd is running or not
[20:50] <dmick> there is a certain amount of time before the monitors believe/accept the harsh reality that an OSD isn't responding
[20:50] <dmick> IIRC it works out to about 30s by default, or did (some of those times have changed recently IIRC)
[20:50] * Ul (~Thunderbi@ip-83-101-40-159.customer.schedom-europe.net) Quit (Ping timeout: 480 seconds)
[20:51] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[20:51] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:51] * terje (~joey@184-96-132-37.hlrn.qwest.net) has joined #ceph
[20:53] <sstan> why is it talking about "mon.1" and"mon.c""2013-02-19 14:35:33.160660 mon.1 [INF] mon.c calling new monitor election
[20:53] * terje_ (~joey@97-118-121-147.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[20:54] <dmick> I'm...guessing because you named your monitors 1 and c?
[20:54] <absynth> sjust: around?
[20:54] <sjust> yeah
[20:54] <absynth> sjust: we are not sure if it was RSS or VIRT
[20:54] <absynth> we were too panicky
[20:54] <sjust> hmm
[20:54] <sjust> ok
[20:54] <absynth> it showed up as "used" in Cacti, so i think it was probably RSS
[20:54] <absynth> i'd say "let's reproduce", but i'm a bit reluctant
[20:55] <sjust> were you also seeing slow requests?
[20:55] <absynth> lots
[20:55] <sjust> how slow?
[20:55] <absynth> i think in the 480sec range, at latest
[20:55] <sjust> ok
[20:55] <absynth> we eventually restarted the affected OSDs
[20:56] <absynth> apart from that, i have to say we're currently quite happy with our cluster :)
[20:56] <sjust> absynth: I think it may not be a leak, which is actually not good news
[20:57] <absynth> because it's some unexpected inconsistency that drives the OSD crazy?
[20:57] <sjust> no, might be legit memory use due to backed up requests
[20:57] <absynth> yikes
[20:58] <sjust> in your case, it's actually workaroundable
[20:58] <absynth> but that would mean that the deep scrub is taking a massive toll on production i/o?
[20:58] <sjust> well, it should take a massive toll on production i/o, just not for very long
[20:58] <sjust> I could easily be way off, looking around
[21:00] <absynth> either way, we left a question in #231 for either you or sage. it's a trivial thing, but we can't find concise documentation and the disk is nearing its EOL rapidly
[21:01] <sstan> Ah .. I had to mark my osd as down by _myself_ : ceph osd down osd.3
[21:01] <absynth> oliver just tells me that the node was already swapping heavily when we killed the OSD
[21:01] <absynth> so it was most definitely rss
[21:02] <absynth> right?
[21:02] <dmick> absynth: different number? #231 hasn't been updated for a year
[21:03] <absynth> sorry, not a bug ticket, a zendesk ticket
[21:03] <dmick> gotcha
[21:04] <absynth> right number, wrong scope ;)
[21:05] <sjust> absynth: hmm, I think my theory is wrong, I'll think of something else
[21:05] <absynth> we'll refrain from doing any scrub-related activities (other than watching an episode on TV) for the time being
[21:11] * jlogan2 (~Thunderbi@2600:c00:3010:1:2dd0:c27d:7304:3f9b) has joined #ceph
[21:17] * jlogan1 (~Thunderbi@2600:c00:3010:1:39ff:a0a8:8669:4dd6) Quit (Ping timeout: 480 seconds)
[21:17] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[21:21] <phantomcircuit> with 2/3 monitors up ceph -w doesn't show anything
[21:21] <phantomcircuit> does ceph -w have to connect to all the monitors?
[21:26] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[21:30] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:31] <lxo> so, I have a snapshot of an osd from a couple of days ago that I'd like to use to speed up its recovery (its current got messed up and I removed it), but its osdmap is older than the latest available in the mons (and in other osds AFAICT); any hope of bringing that osd back up from that point rather than from scratch?
[21:31] <lxo> (I removed more recent snap_*s too, I didn't expect the saved snapshot to be too old)
[21:32] <ShaunR> how are you guys handling expanding and shrinking of VM's diskimages via rbd?
[21:37] <dmick> phantomcircuit: I don't think so, but it will try to connect to everything mentioned in ceph.conf, and it ought to time out after 10s or so
[21:39] * DLange (~DLange@dlange.user.oftc.net) Quit (Quit: kernel upgrade)
[21:40] * Ul (~Thunderbi@ip-83-101-40-159.customer.schedom-europe.net) has joined #ceph
[21:42] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[21:46] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) Quit (Quit: Leaving)
[21:47] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Leaving.)
[21:48] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[21:55] <phantomcircuit> ShaunR, dont do it
[21:55] <phantomcircuit> ShaunR, if you reallly reallllly want to then you need to create a new smaller volume and copy then delete the original
[21:55] <phantomcircuit> but the odds that the vm has correctly resized everything is ermm small
[21:57] <lurbs> Speaking of shrinking, does anyone have an up to date method/howto of getting trim to work from inside a VM backed by RBD?
[21:59] <lurbs> The docs claim that only the IDE libvirt driver works, not virtio. Is this still the case?
[22:00] <dmick> lurbs: that would have been a limitation in libvirt; might wanna check its source/docs
[22:02] <dmick> elder: yt?
[22:02] <elder> y
[22:03] <dmick> only because I know you were messing with udev lately: any idea why plana02 would be in a state of, on boot, dumping crazy amounts of udevd debug, and then dropping to (initramfs)?
[22:03] <dmick> if not, no worries, just perplexed
[22:03] <elder> I don't know.
[22:04] <elder> Is there something in the grub config that tells it to do that or something?
[22:05] <absynth> boot device changed its name?
[22:05] <absynth> vda/sda/hda/whateverda?
[22:06] <dmick> dunno. can't get to grub menu. bleh. I'll let somenoe else fight this one
[22:06] <ShaunR> phantomcircuit: so your saying dont use the qemu-img resize option?
[22:07] <dmick> huh. this time it booted to a shell anyway. <perplexed>
[22:12] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) has joined #ceph
[22:13] * BManojlovic (~steki@121-173-222-85.adsl.verat.net) has joined #ceph
[22:20] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[22:20] <mgalkiewicz> hi guys can anybody help with cluster recovery? http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-February/000166.html
[22:21] <mgalkiewicz> it looks to me like some kind of osd bug
[22:21] <todin> lurbs: I have trim working in the vm, for now you have to use the ide driver
[22:22] <scuttlemonkey> mgalkiewicz: can you pastebin your crushmap?
[22:23] <mgalkiewicz> sure
[22:23] <todin> is it known that qemu 1.4.0 has a quite big memory leak in the rbd cache?
[22:25] <cashmont> it looks like there is a conflict with ceph-libs and ceph rpm versions on fedora 18 - I see this http://pastebin.com/nJJejtxZ
[22:26] <janos> cashmont: i recently did a clean f18 install
[22:26] <janos> got that
[22:26] <janos> yum remove'd ceph-libs
[22:26] <janos> then added the ceph repo
[22:26] <janos> installed ceph
[22:27] <janos> was fine
[22:27] <cashmont> ok. I will try that. Initially I tried, but lots of crying about other deps
[22:27] <janos> i did a minimal install - i don't recall many deps
[22:27] <mgalkiewicz> scuttlemonkey: https://gist.github.com/maciejgalkiewicz/4990116
[22:27] <janos> but then again, minimal doesn't usually even have wget
[22:27] <cashmont> hehe yeah, let me try. thanks
[22:28] <janos> that was less than a week ago - hopefully still relevant ;)
[22:29] <cashmont> it looks like things are being placed back on. Seems like ceph-libs was tied to lots of virt packages now. thanks again
[22:30] <janos> yeah that sounds familiar
[22:30] <dmick> mgalkiewicz: you have only one host
[22:30] <dmick> and you're choosing type host first
[22:30] <dmick> which means you'll only ever have one OSD per PG
[22:30] <yehudasa> gregaf1: also if you could look at wip-4150, Bob Tail will thank you
[22:31] <janos> cashmont: i have - on that ceph host - since run qemu-img against rbd and fired up a f18 guest to be sure it worked
[22:32] <dmick> try changing those "step chooseleaf firstn 0 type host" to "step chooseleaf firstn 0 type dev"
[22:32] <cashmont> cool, that part of my new testing plan. I had at one point gotten ceph and ovirt 3.1 to play nice (posix fs layer not rbd)
[22:32] <lurbs> I'm guessing that the "If you set rbd_cache=true, you must set cache=writeback or risk data loss" in the docs is still correct?
[22:33] <janos> lurbs, no idea, but i did it anyway
[22:33] <scuttlemonkey> dmick: thanks, stepped away a sec
[22:33] <dmick> nw
[22:33] <lurbs> Don't suppose you'd know how to force OpenStack to set cache='writeback'? :)
[22:33] <scuttlemonkey> dmick: I would rather spin up another host if it were my testing env though :)
[22:34] <janos> lurbs, sadly no
[22:34] <dmick> scuttlemonkey: sure that works too. It's easier to lift a few keys than a rack server :)
[22:35] <mgalkiewicz> dmick: if I understand correctly osds should be added to different hosts?
[22:35] <scuttlemonkey> pssht...you and your opportunity costs :P
[22:35] <dmick> mgalkiewicz: well it depends
[22:36] <dmick> you can put multiple OSDs on one host; of course that means if the host fails all the OSDs fail
[22:36] <dmick> but
[22:36] <dmick> the way your crushmap is written, it's trying to choose multiple hosts first, and then OSDs within the host
[22:36] <mgalkiewicz> yeah I mean the situation when both osds are on separate physical servers
[22:36] <dmick> if...they're on different hosts, then the crushmap should reflect that, yes, of course
[22:37] <mgalkiewicz> ok thx for help
[22:40] <mgalkiewicz> dmick: I have checked two other ceph clusters and all of them have all osds on a single host but pgs are in active+clean
[22:41] <scuttlemonkey> mgalkiewicz: he wasn't saying they _couldn't_ live on the same machine
[22:41] <scuttlemonkey> it's how you are telling Ceph where they are
[22:42] <mgalkiewicz> ok so pgs state is not because osds are on the same host
[22:42] <scuttlemonkey> that's why he suggested changing your crushmap from 'type host' to 'type dev' (device)
[22:42] <scuttlemonkey> then it would try to iterate PGs across devices instead of hosts
[22:42] <scuttlemonkey> since you only have one host
[22:46] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:48] <junglebells> 'ceph osd pause' and ceph osd tell \* injectargs '--osd-recovery-max-active 0' doesn't seem to be pausing my rebuild. Anyone have any ideas?
[22:50] * ScOut3R (~scout3r@1F2EAE7E.dsl.pool.telekom.hu) has joined #ceph
[22:51] * eschnou (~eschnou@146.108-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[22:54] <mgalkiewicz> scuttlemonkey: not sure what should be changed I will take a look in docs
[22:54] <absynth> is the config change correctly propagated, i.e. do you see it in config show | grep max?
[22:54] <scuttlemonkey> mgalkiewicz: can you just update the crush map per dmick's suggestion and see if that fixes it?
[22:55] <scuttlemonkey> so: step chooseleaf firstn 0 type host -> step chooseleaf firstn 0 type dev
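The edit under discussion, shown in decompiled-crushmap syntax: on a one-host cluster the chooseleaf step has to descend to the device level rather than the host level. A sketch only — the rule name and surrounding values are generic defaults, and the valid leaf type name depends on the types declared in your own map (later in this log, `osd` is what actually worked):

```
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        # was: step chooseleaf firstn 0 type host
        step chooseleaf firstn 0 type osd
        step emit
}
```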
[22:56] * ScOut3R_ (~ScOut3R@1F2EAE7E.dsl.pool.telekom.hu) has joined #ceph
[22:58] <absynth> how can i dump my current crush map again? i forgot
[22:59] <scuttlemonkey> absynth: http://ceph.com/docs/master/rados/operations/crush-map/
[22:59] <mgalkiewicz> ceph osd getcrushmap -o file
[22:59] * Ul1 (~Thunderbi@ip-83-101-40-5.customer.schedom-europe.net) has joined #ceph
[23:00] <absynth> ah yeah, thanks
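The full dump/edit/reinject cycle around `ceph osd getcrushmap`, for anyone following along. A sketch, dry-run by default (commands are printed, not executed); set `CEPH=ceph` and `CRUSHTOOL=crushtool` to run it for real:

```shell
#!/bin/sh
# Dump the CRUSH map, decompile it for editing, recompile, and inject.
CEPH=${CEPH:-"echo ceph"}           # dry-run by default
CRUSHTOOL=${CRUSHTOOL:-"echo crushtool"}

$CEPH osd getcrushmap -o crush.bin        # dump the compiled map
$CRUSHTOOL -d crush.bin -o crush.txt      # decompile to editable text
# ... edit crush.txt ...
$CRUSHTOOL -c crush.txt -o crush.new      # recompile
$CEPH osd setcrushmap -i crush.new        # inject the new map
```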
[23:00] <Vjarjadian> hows the work on the geo replication coming? have they decided on how it's going to work?
[23:01] <sstan> is there a command to add a host to the crush map?
[23:01] * Ul (~Thunderbi@ip-83-101-40-159.customer.schedom-europe.net) Quit (Ping timeout: 480 seconds)
[23:02] <absynth> i'm not sure, Vjarjadian. from reading the mailing list, it sounds a bit like the use-case is not something Ceph can currently cater for very well
[23:02] <scuttlemonkey> Vjarjadian: last I heard rev1 was going to use RGW since that's the low-hanging fruit
[23:02] <absynth> but i really only skimmed the mails
[23:02] <scuttlemonkey> but a much more robust solution will mature over time
[23:02] <scuttlemonkey> the first rev is designed so people can have DR stuff asap
[23:03] <Vjarjadian> i'm sure after a few versions it will be very nice
[23:03] <Vjarjadian> as soon as it's in i'll be trialing it :)
[23:03] * ScOut3R_ (~ScOut3R@1F2EAE7E.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[23:04] <scuttlemonkey> sstan: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ is best doc I know of
[23:05] <mgalkiewicz> scuttlemonkey: changed it to osd instead of dev but it still works!
[23:05] <mgalkiewicz> thx for help
[23:05] <scuttlemonkey> mgalkiewicz: awesome, glad we could help
[23:05] <dmick> mgalkiewicz: oops, sorry, my bad.
[23:05] <scuttlemonkey> sstan: specifically, http://ceph.com/docs/master/rados/operations/crush-map/#addosd
[23:06] <absynth> LA people: http://www.tugg.com/events/3107
[23:06] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) Quit (Quit: Ex-Chat)
[23:06] <wer> I have a ceph cluster that I have removed and reformatted all the osd's.... will mkcephfs make the mons forget about the old cluster? The cluster reports all kinds of issues... but it is essentially all gone.
[23:07] <wer> I want it to be deader.
[23:08] <ShaunR> whats the deal with 'rados bench -p pool 900 seq' everytime i try to run it i get the following error... 'Must write data before running a read benchmark!'
[23:08] <ShaunR> trying to run benches simular to http://ceph.com/w/index.php?title=Benchmark&oldid=5733
[23:08] <wer> ShaunR: throw the no cleanup flag on a write test first I think.
[23:10] <dmick> ShaunR: yep. it reads a well-known set of objects, but doesn't write them
[23:10] <ShaunR> k
[23:11] <wer> dmick: will running a mkcephfs create a clean cluster from a dirty one?
[23:12] <junglebells> wer: you need to remove each of the mon's data if you're using the same directories with your 'new' cluster
[23:12] <wer> just wipe the mon's dir?
[23:12] <junglebells> Wipe the mon's, wipe your osd's (which you said you did) and delete your journals if you keep them somewhere else and that should be fairly safe
[23:13] <wer> ok sweet! ty!
[23:16] <absynth> night everyone
[23:17] * jlogan2 (~Thunderbi@2600:c00:3010:1:2dd0:c27d:7304:3f9b) Quit (Ping timeout: 480 seconds)
[23:20] <wer> looks like mkcephfs is creating journals even though I didn't specify them in the config. I am playing with btrfs and thought I could run without and improve write speeds?
[23:21] <wer> was that an incorrect assumption on my part?
[23:22] <dmick> no, you must have journals
[23:22] <wer> noooo all my throughput :)
[23:22] <wer> shoot. They may be a little small.... whatever the default is I guess :)
[23:23] <dmick> 5GB
[23:26] <wer> dmick: think that is OK for 2TB drives?
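On wer's question: per the rule of thumb in the Ceph docs, journal size is driven by the disk's sustained write throughput, not its capacity, so a 2TB drive per se doesn't change the answer. A sketch — the 120 MB/s figure is an assumed example for a spinning disk:

```python
# Rule of thumb from the Ceph docs for sizing an OSD journal:
#   journal size >= 2 * expected_throughput * filestore_max_sync_interval
# Drive capacity doesn't enter into it; sustained write speed does.

def journal_size_mb(throughput_mb_s, filestore_max_sync_interval_s=5):
    """Minimum journal size in MB for a given sustained write rate."""
    return 2 * throughput_mb_s * filestore_max_sync_interval_s

# e.g. a spinner doing ~120 MB/s sustained (assumed figure):
print(journal_size_mb(120))  # -> 1200 (MB), so a 5GB default is ample
```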
[23:28] <wer> hmm. two of my mons did not get the magic and refuse to start. Do I need to add extra mons only after running mkcephfs?
[23:29] <wer> or maybe start my osd's first :P
[23:44] * BManojlovic (~steki@121-173-222-85.adsl.verat.net) Quit (Quit: Ja odoh a vi sta 'ocete...)
[23:53] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[23:54] * Ryan_Lane (~Adium@ has joined #ceph
[23:55] * jjgalvez (~jjgalvez@ has joined #ceph
[23:56] * Ryan_Lane (~Adium@ Quit ()

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.