#ceph IRC Log


IRC Log for 2013-01-15

Timestamps are in GMT/BST.

[0:04] <nhm> heya guys, anyone compiled ceph on RHEL6 with tcmalloc?
[0:07] * alram (~alram@ has joined #ceph
[0:12] <gregaf> I believe people have, but I dunno if anybody who hangs out here does so
[0:13] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[0:16] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) Quit (Quit: Leaving.)
[0:21] * tnt (~tnt@216.186-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:23] * jlogan1 (~Thunderbi@2600:c00:3010:1:9cc3:821f:978c:5b0b) Quit (Quit: jlogan1)
[0:26] <dmick> there's a centos6 build that doesn't say notcmalloc on the gitbuilders...
[0:28] * jlogan (~Thunderbi@2600:c00:3010:1:a5f6:9d83:61c8:8617) has joined #ceph
[0:35] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:36] <amichel> dmick: did you find anything in my crushmap?
[0:36] <dmick> adding Spirit debug
[0:36] <dmick> I can't see anything wrong by comparing to good maps
[0:36] <dmick> parsers. /me spits on ground
[0:37] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[0:38] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[0:38] * Cube (~Cube@ has joined #ceph
[0:41] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[0:42] <gregaf> amichel: dmick: I don't see anything that ought to be wrong in a quick skim, but it does have varying use of whitespace and I don't remember how the parser handles that
[0:42] <gregaf> oh, also missing "firstn" in the data rule
[0:42] <amichel> Does it read like the ravings of a madman or does it seem sensical?
[0:43] <gregaf> and rbd is "choose" instead of "chooseleaf" for some reason
[0:43] <gregaf> I'm betting it's the missing firstn, but the rbd rule isn't going to work since you've got it emitting hosts rather than devices as it currently stands
[0:44] <amichel> Should it emit devices?
[0:44] <amichel> I didn't realize
[0:44] <gregaf> yes, that's what CRUSH is supposed to return — a list of devices to store data on
[0:44] <amichel> The example looks very similar
[0:44] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[0:44] <gregaf> the "chooseleaf" rules are choosing leaves from underneath n hosts
[0:45] <gregaf> but if you just do "choose" then you choose n hosts and then need to do something with those hosts to select devices inside of them
[0:45] <amichel> Oh
[0:45] <amichel> I see
[0:45] <amichel> Yeah, I just forgot to type leaf
[0:45] <gregaf> but the data rule is broken since it says "step chooseleaf 0 type host" and it should be "step chooseleaf firstn 0 type host"
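
The difference gregaf describes between "choose" and "chooseleaf" can be sketched in a few lines of Python. This is only a toy model, not the real CRUSH algorithm: the hierarchy, the host and OSD names, and the SHA-256 ranking that stands in for CRUSH's hashing are all assumptions made for illustration.

```python
import hashlib

# Toy hierarchy (hypothetical names): hosts are buckets, OSDs are leaf devices.
HIERARCHY = {
    "host-a": ["osd.0", "osd.1"],
    "host-b": ["osd.2", "osd.3"],
    "host-c": ["osd.4", "osd.5"],
}


def _rank(key, item):
    """Deterministic pseudo-random score; a stand-in for CRUSH's hash."""
    return hashlib.sha256(f"{key}:{item}".encode()).hexdigest()


def choose_hosts(key, n):
    """Roughly 'step choose firstn n type host': yields n hosts, not devices."""
    return sorted(HIERARCHY, key=lambda h: _rank(key, h))[:n]


def chooseleaf_hosts(key, n):
    """Roughly 'step chooseleaf firstn n type host': one device under each of n hosts."""
    return [
        min(HIERARCHY[host], key=lambda dev: _rank(key, dev))
        for host in choose_hosts(key, n)
    ]


print(choose_hosts("some-object", 2))      # host names: not yet usable for storing data
print(chooseleaf_hosts("some-object", 2))  # OSD names, each from a different host
```

A rule that stops after choose_hosts would emit hosts, which is the problem gregaf points out in the rbd rule.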
[0:46] <amichel> Yeah, I fixed that also
[0:46] <amichel> Trying to compile again
[0:46] <amichel> same error
[0:50] <dmick> error has moved to end of file for me now tho
[0:50] <amichel> Oh, not for me
[0:50] <amichel> Huh.
[0:50] <amichel> still line 225 when I try it
[0:53] <dmick> just noticed min_size 0 rather than 1 in rule rbd
[0:53] <dmick> no change
[0:53] <amichel> min_size 0 was straight from the example
[0:54] <amichel> I don't entirely understand what I'm asking it to do with min_size so I left it alone
[0:58] <gregaf> you're specifying a range — "use this rule for sizes between min_size and max_size (inclusive)"
[0:58] <gregaf> you could have different rules for rbd sizes of 1-3 and 4-6, for instance
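
A minimal sketch of the range check gregaf describes, assuming a hypothetical table of rules with inclusive min_size and max_size bounds. The rule names and the lookup function are invented for illustration and are not how Ceph stores or selects rules internally.

```python
# Hypothetical rule table: (name, min_size, max_size), both bounds inclusive.
RULES = [
    ("rbd-small", 1, 3),
    ("rbd-large", 4, 6),
]


def pick_rule(size, rules=RULES):
    """Return the first rule whose [min_size, max_size] range contains size."""
    for name, min_size, max_size in rules:
        if min_size <= size <= max_size:
            return name
    return None  # no rule covers this size


print(pick_rule(2))  # rbd-small
print(pick_rule(4))  # rbd-large
print(pick_rule(7))  # None
```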
[0:59] <gregaf> I'm not sure if min_size 0 is valid or not though
[0:59] <dmick> changing to 1 doesn't help, so it's not the problem anyway
[1:08] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[1:15] <elder> gregaf, is there a message that the kernel client might send or receive that might contain a block of 64 bytes of random-ish data?
[1:16] <gregaf> sure, an OSD read...
[1:16] <gregaf> what're you looking at elder?
[1:16] <elder> I'm trying to find the source of a memory leak. The leak is of something that's around (but less than) 192 bytes.
[1:17] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[1:17] <elder> I'm looking at some raw memory, and I see a pattern of about 64 bytes of raw data surrounded by what might be some lengths, for example.
[1:18] <gregaf> is it in fact inside a message? or you have no idea?
[1:18] <elder> I don't know.
[1:18] <elder> What struct should I look at?
[1:18] <gregaf> umm
[1:19] <gregaf> it could be almost anything in any message
[1:19] <elder> ceph_msg is too big.
[1:19] <gregaf> or, from the description you're giving me, something completely unrelated to Ceph
[1:20] <elder> osd message header of some kind? Why would an osd read begin with 64 bytes of random data?
[1:20] <gregaf> what makes you think it's random?
[1:20] <elder> 98220bdad2c50b1a 8c6118a0827114e5
[1:21] <elder> Doesn't look like an address. An address would look like ffff8801b02c82b0
[1:21] <gregaf> all that means is it isn't ones or zeros, right?
[1:21] <elder> Doesn't look like a count or size, that would look like 0000000000000001
[1:21] * sjust (~sam@ has joined #ceph
[1:21] <gregaf> I guess you could look at the checksum infrastructure
[1:21] <elder> And it's not ASCII
[1:21] <gregaf> I don't remember what sizes those are
[1:22] <elder> I doubt the checksums are 64 bytes. Anyway, if it's not familiar that's fine.
[1:22] <gregaf> it could be some portion of an RBD header
[1:22] <elder> I don't think so.
[1:22] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:22] <gregaf> but no, I cannot with my mystical powers tell you which piece of encoded information that Ceph generates or sends over the wire a random 64 bytes might be :p
[1:23] <elder> I'll let you know if I am able to later.
[1:23] <gregaf> authentication handshakes would be the other one to check (besides checksums), if you really do think it's random
[1:24] <elder> It could be a UUID, but those are normally only 16 bytes.
[1:24] <elder> Two consecutive UUID's maybe.
[1:25] <elder> (I mean 4) CRC's are normally 32 bits. And looking at cephx it has something with a block for 128 bytes of random payload but not 64.
[1:26] <dmick> sure it's initialized?
[1:26] <elder> Message footer has 3 32-bit CRC's plus a flag byte. Hmm.
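
The size arithmetic elder is doing by hand can be checked with Python's struct and uuid modules. The footer layout below (three 32-bit CRCs plus one flag byte, packed without padding) is taken from elder's description in the chat and has not been checked against the kernel headers, so treat it as an assumption.

```python
import struct
import uuid

TARGET = 64  # size in bytes of the unexplained block

UUID_BYTES = len(uuid.uuid4().bytes)     # a UUID is 16 bytes
FOOTER_BYTES = struct.calcsize("<IIIB")  # 3 x 32-bit CRC + 1 flag byte, no padding

for name, size in [("UUID", UUID_BYTES), ("footer as described", FOOTER_BYTES)]:
    whole, rest = divmod(TARGET, size)
    print(f"{name}: {size} bytes, fits {whole} times with {rest} bytes left over")
```

Four UUIDs fill 64 bytes exactly, which matches elder's "(I mean 4)" correction; the footer as described does not divide 64 evenly.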
[1:26] <elder> dmick, It looks that way. I just see a pattern and thought the structure might be telling.
[1:27] * jlogan1 (~Thunderbi@ has joined #ceph
[1:27] <elder> (Anybody see "A Beautiful Mind"? )
[1:27] <dmick> I have, and patterns do sometimes jump out, sure
[1:27] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[1:27] <elder> No, I was referring to the thought that maybe I'm insane.
[1:27] <dmick> heh
[1:28] <rturk> maybe it's an alien race trying to communicate with you, like in contact
[1:28] <elder> I'll check into that rturk
[1:28] <rturk> that could be Ceph's true purpose
[1:28] <dmick> "now up a perfect fifth"
[1:28] <dmick> "This *means* something. I'm *sure* of it."
[1:28] <rturk> :P
[1:30] * jlogan (~Thunderbi@2600:c00:3010:1:a5f6:9d83:61c8:8617) Quit (Ping timeout: 480 seconds)
[1:36] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[1:38] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[1:39] <phantomcircuit> uh
[1:39] <phantomcircuit> disk mon is on is dead
[1:39] <phantomcircuit> wat do
[1:39] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:39] <dmick> phantomcircuit: fix it :)
[1:40] <phantomcircuit> i dont have the monmap
[1:40] <phantomcircuit> it's dead dead
[1:41] <gregaf> only had one mon?
[1:41] <phantomcircuit> yeah
[1:41] <phantomcircuit> screwed?
[1:42] <dmick> can probably just create a new monitor, give it the right fsid, and hook the OSDs up to it, right, gregaf?..
[1:42] <gregaf> in theory it's something like that, but also with the OSDMap history that the OSDs require
[1:43] <gregaf> which is the harder part, although not necessarily impossible to fix up that one state machine
[1:43] <dmick> ah
[1:43] <gregaf> probably go look at your OSDs to see what maps they have, pull together a full set, put them in the directory with the right filenames, and then edit the first_committed, last_committed, and latest files appropriately
[1:44] <gregaf> where first_committed and last_committed are the epochs of the maps, and latest (if that's the filename, I think it is) is a full map instead of an incremental
[1:45] <gregaf> it's probably the version and then the map in one file, sequentially
[1:48] <phantomcircuit> ceph_fsid is the cluster id so i got that
[1:48] * gohko (~gohko@natter.interq.or.jp) Quit (Quit: Leaving...)
[1:52] <phantomcircuit> i dont see anything in the osd that looks like a map
[1:53] <phantomcircuit> :(
[1:55] <gregaf> phantomcircuit: they'll be in the osd/current/meta directory (and subdirectories) with names like osdmap.1__0_FD6E49B1__none
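
To work out which epochs a set of recovered map files covers (the values gregaf says belong in first_committed and last_committed), the epoch can be pulled out of each filename. This sketch assumes only the "osdmap.<epoch>__" prefix visible in gregaf's example name; the rest of the filename layout is not interpreted, and the helper names are invented here.

```python
import re

# Matches names like 'osdmap.1__0_FD6E49B1__none'; only the epoch is captured.
_OSDMAP_NAME = re.compile(r"^osdmap\.(\d+)__")


def osdmap_epoch(filename):
    """Return the epoch encoded in an osdmap filename, or None if it does not match."""
    match = _OSDMAP_NAME.match(filename)
    return int(match.group(1)) if match else None


def epoch_range(filenames):
    """Return (lowest, highest) epoch among the matching filenames, or None."""
    epochs = sorted(e for e in map(osdmap_epoch, filenames) if e is not None)
    return (epochs[0], epochs[-1]) if epochs else None


print(osdmap_epoch("osdmap.1__0_FD6E49B1__none"))  # 1
```

Feeding epoch_range the result of os.listdir on the meta directory would give candidate first and last epochs, but it says nothing about whether the set is contiguous, so gaps still need to be checked by hand.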
[1:55] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[1:56] <phantomcircuit> ah ok yeah i see them
[2:06] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[2:06] * agh (~agh@www.nowhere-else.org) has joined #ceph
[2:11] * sagelap (~sage@65-123-3-106.dia.static.qwest.net) has joined #ceph
[2:14] <phantomcircuit> gregaf, you think it'll work with just latest?
[2:18] <gregaf> phantomcircuit: umm, if none of your OSDs ask for an older one you should be able to manage that
[2:18] <gregaf> were all your PGs active+clean?
[2:19] <gregaf> there might be some code assumptions about how many older ones you have, though, so you might need to create dummy files even if they're never read
[2:19] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[2:19] <gregaf> and you will still need to put it in both the "latest" (with correct format) and a named-for-the-epoch file, with matching [first|last]_committed
[2:21] <phantomcircuit> and they all need to be in osdmap_full right
[2:21] <phantomcircuit> osdmap/* seem to be a different format than what osdmaptool recognizes
[2:23] <gregaf> osdmap/* is the incrementals…wow, I don't remember which of those you'll need to include, actually
[2:23] * korgon (~Peto@isp-korex- has joined #ceph
[2:24] <phantomcircuit> i'll just do both
[2:26] * amichel (~amichel@salty.uits.arizona.edu) Quit ()
[2:29] * sagelap (~sage@65-123-3-106.dia.static.qwest.net) Quit (Ping timeout: 480 seconds)
[2:32] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[2:34] * rturk is now known as rturk-away
[2:34] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[2:38] <phantomcircuit> gregaf, lol why is there incrementals and full maps
[2:39] <gregaf> because you need both of them at various times
[2:39] <gregaf> I guess you could rebuild the incrementals on every request, but that'd get expensive
[2:39] <gregaf> and currently the monitors actually only replicate the incrementals, to save marginally on network bandwidth
[2:40] * dpippenger (~riven@ Quit (Remote host closed the connection)
[2:41] <phantomcircuit> gregaf, yeah i meant why would they need to be on disk
[2:41] <phantomcircuit> calculating the incrementals once and keeping them in memory would be cheap
[2:41] * alpharender (~alpharend@ has joined #ceph
[2:41] <alpharender> Hi
[2:41] <gregaf> it actually speaks incrementals when generating the updates, is why
[2:41] <phantomcircuit> also --osdmap to ceph-mon with --mkfs seems to build it for me
[2:42] <gregaf> anyway I'm out for the day, good luck!
[2:42] <phantomcircuit> is there any harm in starting the mon and osd without being 100% sure?
[2:42] <gregaf> I would leave the OSD off and just turn on the monitor and see if it goes and lets clients connect
[2:42] <phantomcircuit> ok
[2:43] <gregaf> if you turn both on you could get the OSD taking actions based on monitor requests that cause divergent paths and other bad things
[2:45] * alram (~alram@ Quit (Quit: leaving)
[2:46] <alpharender> Is ceph capable of being used as a clustered - share disk san network? I can setup iscsid on a monitor to a rbd device?
[2:49] <phantomcircuit> gregaf, i think i'll backup the osds first...
[2:50] <dmick> alpharender: you can hook up iscsid to a kernel-mounted rbd device now, yes
[2:51] <dmick> I've written a prototype backend for tgt that connects without the kernel path just last week
[2:51] <dmick> mostly as a proof of concept
[2:52] <alpharender> how to HA the iscsid, and multi path?
[2:52] <alpharender> into ceph
[2:52] <dmick> alpharender: not sure how the iscsid HA side works, but
[2:53] <dmick> ceph is sorta multipath by design
[2:53] <dmick> I would imagine there are ways to mount multiple target instances as one failover set; I just haven't done it personally
[2:53] <alpharender> I'm thinking of a storage solution for virtual machines to hypervisor
[2:54] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[2:54] <dmick> http://pve.proxmox.com/wiki/ISCSI_Multipath may be useful
[2:55] <alpharender> oh wow
[2:56] <alpharender> we are talking about another iscsid on another monitor…. I like to think the worst with spof and it would be my luck something happens to the single iscsid/mon
[2:56] <alpharender> ?
[2:57] <alpharender> Guess it depends what my application layer is and how I'm using the lun
[2:57] <dmick> alpharender: you can have more than one monitor, yes
[2:57] <alpharender> that is each rbd is an instance or I store the instances as files
[2:58] <dmick> not sure what you mean; if you want multipath access to one image, then there's only one image file
[2:58] <dmick> stored, redundantly, in the cluster
[2:58] <dmick> and accessed, redundantly, through multiple instances of iscsi
[2:59] <alpharender> yes multiple instances of iscsi on multiple machines/monitors ?
[3:00] <alpharender> I'm just wondering if I would be missing another clustering layer in iscsi
[3:00] <dmick> if you want to distribute iscsi instances across multiple machines, I believe you can do so, yes
[3:00] <alpharender> hmm
[3:00] <dmick> the number and location of monitors is independent; each iscsi daemon would see "a cluster connection"
[3:01] <alpharender> cluster connection eh
[3:04] <dmick> in the case of kernel rbd, each iscsi daemon would layer on a kernel rbd device
[3:04] <dmick> that kernel rbd device connects to the cluster in a failover-redundant way (if there are multiple monitors, which you'd have)
[3:05] <dmick> so you can blackbox the cluster connection as "a non-spof box"
[3:07] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:08] <alpharender> sounds awesome. I really like this concept… raid, and even zfs is really annoying for me
[3:12] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[3:14] * LeaChim (~LeaChim@b0fadd12.bb.sky.com) Quit (Ping timeout: 480 seconds)
[3:19] * lx0 is now known as lxo
[3:23] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[3:24] * via (~via@smtp2.matthewvia.info) Quit (Quit: updating)
[3:26] <elder> Found it.
[3:27] <dmick> leak?
[3:27] <elder> The structure that I'm pretty sure is leaking is a bio.
[3:27] <elder> I recognize the structure of what I'm looking at, and it's a bio.
[3:27] <elder> Now I'll find the leak.
[3:27] <elder> But first, a bite to eat.
[3:31] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[3:33] * jlogan1 (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[3:36] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Read error: Connection reset by peer)
[3:36] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[3:59] <phantomcircuit> ok so im guessing i wasn't successful
[4:00] <phantomcircuit> i was able to get the osdmap fixed (at least it seems right, the osd uuid's are right)
[4:00] <phantomcircuit> but i cant get the osds to connect to the mon
[4:01] <elder> joshd?
[4:01] <elder> dmick, is he still around?
[4:01] <phantomcircuit> joshd?
[4:02] <phantomcircuit> elder, the drive my one and only mon was on died
[4:02] <phantomcircuit> i wasn't able to recover anything from it
[4:02] <elder> (Sorry phantomcircuit I doubt I can help you...)
[4:02] <phantomcircuit> ah
[4:02] <phantomcircuit> ok then
[4:04] <phantomcircuit> 2013-01-15 03:04:05.504319 2cb260a9780 20 osd.0 pg_epoch: 454 pg[2.47( v 454'154108 (454'153107,454'154108] local-les=454 n=1123 ec=1 les/c 454/454 453/453/449) [] r=0 lpr=0 (info mismatch, log(454'153107,0'0]) (log bound mismatch, actual=[454'153108,454'153719]) lcod 0'0 mlcod 0'0 inactive] read_log 556184 454'153720 (454'153713) modify ccaa4947/rb.0.1415.74b0dc51.0000000034ec/head//2 by client.7000.0:110276 2013-01-14 21:58:
[4:04] <phantomcircuit> 46.052859
[4:04] <phantomcircuit> that's not good
[4:05] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.89 [Firefox 17.0.1/20121128204232])
[4:09] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[4:13] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:31] * Kioob (~kioob@luuna.daevel.fr) Quit (Ping timeout: 480 seconds)
[4:31] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[4:32] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[4:34] <elder> gregaf, FYI the random data was *not* a clue about the content of the memory I was looking at. But I was able to use my mystical powers to figure it out anyway.
[4:34] * alpharender (~alpharend@ Quit (Quit: alpharender)
[4:46] <ghbizness> has anyone successfully deployed ceph using a chef cookbook while modifying some of the deployment scripts ?
[4:48] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[4:51] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[4:53] <dmick> elder: joshd is gone. Can I help?
[4:54] <elder> If you know what the meaning of snapid 0 is, maybe.
[4:54] <elder> David Zafman was asking about it. I believe it's invalid, but that it maybe takes the place of CEPH_NOSNAP in spots where that isn't (yet) used.
[4:54] <dmick> I....think it's just the first snapshot, but I could be high.
[4:55] <elder> That's low for a first snapshot, not high.
[4:55] <elder> I think the first snapshot is 1.
[4:55] <elder> But I may be mistaken, hence was looking for a second opinion.
[4:56] * benner (~benner@ Quit (Read error: Connection reset by peer)
[4:59] <dmick> it's complicated by the fact that there are two different snapshot types
[5:01] <dmick> the first rbd snapshot in a new pool is id 2
[5:01] * benner (~benner@ has joined #ceph
[5:04] <dmick> and indeed that looks like the first unmanaged snapid
[5:04] <dmick> er, selfmanaged I mean
[5:04] <dmick> pool snapshots look more like they could start at 0
[5:04] <dmick> but those are largely useless IMO, as they refer to the entire pool
[5:09] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[5:09] * agh (~agh@www.nowhere-else.org) has joined #ceph
[5:25] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[5:30] <phantomcircuit> :/
[5:31] <phantomcircuit> i dont think im going to get this to work
[5:40] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[5:40] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit ()
[5:46] <dec> phantomcircuit: still having performance problems?
[5:46] <phantomcircuit> no much worse
[5:46] <phantomcircuit> the disk my only mon was on died
[5:46] <dec> ... oh
[5:46] <phantomcircuit> stuck trying to reconstruct it
[5:47] <phantomcircuit> didn't realize how important it was to have more than 1
[5:48] <phantomcircuit> dec, any ideas? :)
[5:49] <dec> nope, sorry
[5:49] <dec> beyond my knowledge level
[5:49] <dec> I just know to have multiple mons :)
[5:53] <phantomcircuit> the docs have a recommended number of things section
[5:53] <phantomcircuit> i didn't even consider that the info on the monitor was vital to operation
[5:53] <phantomcircuit> but that seems sort of obvious now
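
Ceph monitors need a strict majority to form a quorum, so the number of monitor failures a cluster can ride out follows from simple arithmetic. The helper names below are invented for this sketch. It covers availability only; with a single monitor there is also just one copy of the monitor's data, which is the situation phantomcircuit is in.

```python
def quorum_size(n_mons):
    """Smallest strict majority of n_mons monitors."""
    return n_mons // 2 + 1


def tolerated_failures(n_mons):
    """How many monitors can be lost while a majority still remains."""
    return n_mons - quorum_size(n_mons)


for n in (1, 2, 3, 5):
    print(f"{n} mon(s): quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Two monitors tolerate no more failures than one, which is why odd counts such as 3 or 5 are the usual choice.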
[6:00] <dec> I'm busy moving VMs off of RBD
[6:00] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[6:00] <dec> 0.56.1 performance is killing everything :(
[6:06] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:08] <phantomcircuit> that sucks
[6:27] * sagelap (~sage@ has joined #ceph
[6:46] <dec> can I force "rbd export" to overwrite an existing file?
[6:46] <dec> (because I'm either actually writing to a block device or named pipe, or something)
[6:47] <dmick> not by using filename
[6:47] <dmick> you can export to stdout and then do what you will
[6:47] <dmick> (use '-' as filename)
[6:51] * mmgaggle_ (~kyle@alc-nat.dreamhost.com) has joined #ceph
[6:51] * mdxi_ (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[6:51] * sleinen (~Adium@2001:620:0:26:d0b7:fdfa:b61a:4fdc) has joined #ceph
[6:52] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[6:52] <dec> dmick: oh, cool. :)
[6:52] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[6:52] * ChanServ sets mode +o elder
[6:52] <phantomcircuit> bah i think im screwed :|
[6:53] * mmgaggle (~kyle@alc-nat.dreamhost.com) Quit (Ping timeout: 480 seconds)
[6:53] <dec> dmick: nope. just ended up with a file called '-' (unless I'm doing something stupid)
[6:53] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[6:54] * sleinen (~Adium@2001:620:0:26:d0b7:fdfa:b61a:4fdc) Quit ()
[6:56] <dmick> dec: what version?
[6:58] <dec> 0.56.1
[6:58] <dec> stdin *import* works; seems export does not
[6:58] <dmick> doesn't make any sense
[6:58] <dmick> you sure the rbd is also updated?
[6:58] <dmick> rbd -V
[6:58] <dec> but again, I might be doing something stupid :)
[6:58] <dmick> sorry -v
[6:59] <dec> oh, hold up - this client is still old: ceph version 0.53 (commit:2528b5ee105b16352c91af064af5c0b5a7d45d7c)
[6:59] <dmick> that'll do it :)
[6:59] <dec> :)
[6:59] <dec> my bad
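
dmick's suggestion of exporting to stdout by passing '-' as the filename could be scripted roughly as follows. Per the exchange above this needs an rbd client newer than 0.53. The image spec and destination path are hypothetical placeholders, and the only rbd usage assumed is the "rbd export <image> -" form mentioned in the chat.

```python
import subprocess


def export_cmd(image):
    """Argument list for exporting an image to stdout ('-' as the destination)."""
    return ["rbd", "export", image, "-"]


def export_to(image, dest_path):
    """Stream the export into dest_path, which may be an existing file,
    a block device, or a named pipe; regular files are overwritten."""
    with open(dest_path, "wb") as dest:
        subprocess.run(export_cmd(image), stdout=dest, check=True)


# Hypothetical usage; both names are placeholders:
# export_to("mypool/myimage", "/dev/some-block-device")
```

Because the destination is opened by the script rather than by rbd, the refusal to overwrite an existing file never comes into play.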
[7:02] <dec> qemu-img convert -f rbd -O raw ... seems to be working
[7:03] <dmick> is that different than it had been? I've lost track of what qemu-img status is
[7:05] * Exstatica (~exstatica@ Quit (Quit: leaving)
[7:06] <dec> I don't know
[7:09] * The_Bishop (~bishop@e177088104.adsl.alicedsl.de) has joined #ceph
[7:12] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[7:17] * sagelap (~sage@141.sub-70-196-194.myvzw.com) has joined #ceph
[7:30] <phantomcircuit> this isn't good
[7:37] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[7:38] <xiaoxi> Hi, I have seen a lot of "slow request" warning in the OSD log, Why such request will be blocked for so long?
[7:49] * yoshi (~yoshi@p29244-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[7:51] * tnt (~tnt@216.186-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:07] * fghaas (~florian@178-190-195-172.adsl.highway.telekom.at) has joined #ceph
[8:10] * jrisch (~Adium@ has joined #ceph
[8:10] <phantomcircuit> yeah the pgmap is obviously totally wrong
[8:15] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[8:17] <iggy> xiaoxi: you should check the list, and if there isn't anything bring it up
[8:18] <iggy> I say that because I've seen it mentioned more than once the last few days
[8:20] <xiaoxi> you mean the maillist?
[8:21] * sleinen (~Adium@2001:620:0:25:8473:a0ee:2ded:7ebc) has joined #ceph
[8:28] <phantomcircuit> it's hard to even make sense of the lower level debug messages
[8:28] * jrisch (~Adium@ Quit (Read error: Connection reset by peer)
[8:28] <phantomcircuit> 2013-01-15 07:26:00.905121 2edb6949780 20 osd.0 pg_epoch: 454 pg[2.b( v 454'278189 (454'277188,454'278189] local-les=454 n=1082 ec=1 les/c 454/454 453/453/449) [] r=0 lpr=0 (info mismatch, log(454'277188,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log 249906 454'277169 (454'277167) modify f909088b/rb.0.1415.74b0dc51.0000000025fe/head//2 by client.7000.0:86606 2013-01-14 13:06:44.163808
[8:29] <phantomcircuit> osd.0 placement group 2.b has a log bound mismatch
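
Reading these debug lines by eye is tedious, so a small parser that pulls out the OSD, the epoch, the placement group id, and the mismatch markers can help when scanning a large log. The regular expression is an assumption based only on the two lines pasted in this log, not on any documented log format.

```python
import re

# Captures e.g. 'osd.0 pg_epoch: 454 pg[2.b(' from an OSD debug line.
_PG_LINE = re.compile(r"(osd\.\d+) pg_epoch: (\d+) pg\[([0-9a-f]+\.[0-9a-f]+)\(")


def summarize_pg_line(line):
    """Return a small dict describing a pg debug line, or None if it does not match."""
    match = _PG_LINE.search(line)
    if match is None:
        return None
    osd, epoch, pgid = match.groups()
    return {
        "osd": osd,
        "epoch": int(epoch),
        "pg": pgid,
        "info_mismatch": "info mismatch" in line,
        "log_bound_mismatch": "log bound mismatch" in line,
    }


sample = (
    "2013-01-15 07:26:00.905121 2edb6949780 20 osd.0 pg_epoch: 454 "
    "pg[2.b( v 454'278189 (454'277188,454'278189] local-les=454 n=1082 ec=1 "
    "les/c 454/454 453/453/449) [] r=0 lpr=0 (info mismatch, "
    "log(454'277188,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive]"
)
print(summarize_pg_line(sample))
```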
[8:29] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[8:30] * agh (~agh@www.nowhere-else.org) has joined #ceph
[8:31] * hijacker (~hijacker@ Quit (Quit: Leaving)
[8:42] * cephalobot` (~ceph@ds2390.dreamservers.com) Quit (Ping timeout: 480 seconds)
[8:42] * loicd (~loic@ has joined #ceph
[8:43] * rturk-away (~rturk@ds2390.dreamservers.com) Quit (Remote host closed the connection)
[8:46] * ninkotech (~duplo@ip-94-113-217-68.net.upcbroadband.cz) Quit (Quit: Konversation terminated!)
[8:50] * fc__ (~fc@ has joined #ceph
[8:54] * fghaas1 (~florian@213162068057.public.t-mobile.at) has joined #ceph
[8:58] * sagelap (~sage@141.sub-70-196-194.myvzw.com) Quit (Ping timeout: 480 seconds)
[8:58] * fghaas (~florian@178-190-195-172.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[9:02] * fghaas1 (~florian@213162068057.public.t-mobile.at) Quit (Ping timeout: 480 seconds)
[9:03] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) has joined #ceph
[9:05] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[9:09] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:11] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[9:13] * hijacker (~hijacker@ has joined #ceph
[9:13] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) Quit (Quit: Leaving.)
[9:22] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) has joined #ceph
[9:27] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[9:29] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Remote host closed the connection)
[9:30] * ScOut3R (~ScOut3R@ has joined #ceph
[9:32] * tnt (~tnt@216.186-67-87.adsl-dyn.isp.belgacom.be) Quit (Read error: Operation timed out)
[9:47] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[9:48] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[9:48] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: A fine is a tax for doing wrong. A tax is a fine for doing well)
[9:56] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[10:03] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[10:04] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Read error: No route to host)
[10:04] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[10:10] * Leseb (~Leseb@ has joined #ceph
[10:18] * schlitzer_ (~schlitzer@ has joined #ceph
[10:18] * schlitzer|work (~schlitzer@ Quit (Read error: Connection reset by peer)
[10:20] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[10:23] * LeaChim (~LeaChim@b0fadd12.bb.sky.com) has joined #ceph
[10:26] * cephalobot (~ceph@ds2390.dreamservers.com) has joined #ceph
[10:32] * gucki (~smuxi@80-218-125-247.dclient.hispeed.ch) has joined #ceph
[10:34] * rturk-away (~rturk@ds2390.dreamservers.com) has joined #ceph
[10:39] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:39] * BManojlovic (~steki@ has joined #ceph
[10:40] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[10:40] * JohansGlock (~quassel@kantoor.transip.nl) Quit (Remote host closed the connection)
[10:43] * JohansGlock (~quassel@kantoor.transip.nl) has joined #ceph
[10:48] * Morg (d4438402@ircip2.mibbit.com) has joined #ceph
[10:51] <gucki> hi there :)
[10:52] <agh> hi
[10:53] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[10:54] <gucki> i just upgraded one of my servers (2 osds) from 0.48.2argonaut to the latest 0.48.3argonaut and now one of the osds is taking 100% cpu for around 8 minutes. i never had this with the previous argonaut before. what shall i do? :)
[10:54] <gucki> shall i simply try to restart the osd?
[10:55] <gucki> memory usage is only 30mb rss (the other osd seems to be running normally, taking almost no cpu and around 300 mb rss)
[10:55] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Remote host closed the connection)
[10:56] <gucki> ceph -w reports the cluster is healthy and everything is up...but it seems the one osd is hanging and not really doing anything?
[10:56] * ScOut3R (~ScOut3R@ has joined #ceph
[10:57] <gucki> here's the logfile of the osd: http://pastie.org/pastes/5687380/text
[10:58] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[10:58] <scalability-junk> hey I'm looking to test out ceph and play with it in a half production environment.
[10:59] <scalability-junk> primarily or in the beginning I just want to use the block storage with 1-2 replicas.
[10:59] <scalability-junk> actually I will start with one server and wanna have some offsite backup.
[11:00] <scalability-junk> is there a way to use ceph with replication 2 one used for good performance and one only used for backup (another datacenter with not the best link)
[11:00] <agh> scalability-junk: I'm not sure that Ceph is made for that
[11:00] <agh> scalability-junk: or, not yet. I've read that a feature was planned to do async replication. But it's not done yet
[11:01] <scalability-junk> agh any eta on this feature?
[11:01] <agh> scalability-junk: I don't know
[11:01] <scalability-junk> probably best to start with ceph replica 1 or lvm with drbd as offsite storage...
[11:03] <scalability-junk> a better question could be how to best backup a whole ceph cluster?
[11:04] <gucki> strace is here (not really much): http://pastie.org/5687401/text
[11:05] <agh> scalability-junk: if you find the answer, i'm interested!
[11:05] <scalability-junk> agh, yeah it seems like a strange thing to just rely on replication, which isn't even geo distributed...
[11:06] <agh> scalability-junk: the problem is the latency i suppose.
[11:06] <scalability-junk> and with big deployments the data size :)
[11:07] <scalability-junk> I imagine dreamhost is not really fond of doing weekly backups of their whole ceph cluster...
[11:07] <agh> scalability-junk: if you're sure that between your 2 sites you have a low latency link, with a good throughput, then replica should be ok
[11:07] <scalability-junk> agh, no unfortunately not, I will probably go with drbd or hdfs as offsite async "backup"
[11:07] <gucki> restarting the osd doesn't help....it shuts down cleanly, but after starting it it takes again 100% cpu :-(
[11:10] <zerthimon> is there a way to temporarily disable deep-scrub in 0.56.1 ?
[11:12] <tnt> scalability-junk: The replication here is mostly to handle the 'failed machine' case, not the 'datacenter hit by an asteroid' case :)
[11:14] <gucki> bug report is here: http://tracker.newdream.net/issues/3797
[11:18] <scalability-junk> tnt, yeah but what to do in the 'asteroid hits the datacenter' case?
[11:18] <gucki> too bad, the second osd is taking 100% cpu now too. is it safe to downgrade to 0.48.2argonaut?
[11:21] <madkiss> gucki: what does iostat report for your devices?
[11:21] <tnt> scalability-junk: well, in my case, there would be so many more problems if that happened that I would just look for another job ...
[11:22] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[11:22] <gucki> madkiss: here's the output of dstat http://pastie.org/5687470
[11:22] * agh (~agh@www.nowhere-else.org) has joined #ceph
[11:23] <gucki> madkiss: not sure why it is writing so much now, before the update there was almost no disk activity
[11:23] <scalability-junk> tnt, haha true, but dreamhost must have some sort of backup or?
[11:25] <madkiss> gucki: try "iostat -xdm 5 /dev/sd*"
[11:27] <darkfaded> scalability-junk: my prognosis is there's another century passing by until a meteor strike is more probable than a logical software error causing data loss
[11:27] <gucki> madkiss: http://pastie.org/pastes/5687479/text
[11:27] <darkfaded> and that is almost negligible in comparison to accidental wiping
[11:27] <darkfaded> thats why backups are called backups and not replicas :)
[11:28] <scalability-junk> darkfaded, yeah still no idea how a whole cluster snapshot/backup is done :)
[11:28] <madkiss> gucki: nothing really suspicious there — where do you host your OSDs' metadata?
[11:29] <darkfaded> do you really need one atomic backup of the whole cluster?
[11:29] <tnt> scalability-junk: in my case, I use a script that just check the files in the radosgw and syncs new ones and keeps an history of changes ...
[11:29] <darkfaded> imagine if you had a big san array or 20 of them, normally people don't snapshot all of that at once either
[11:29] <gucki> madkiss: i really doubt it's related to the storage, as the .2 argonaut was running fine without any problems...
[11:30] <scalability-junk> tnt, ok sort of a versioned filesystem with replication/backup to some other storage array
[11:30] <gucki> madkiss: or is .3 changing the on-disk format and so has to rewrite everything? but this still wouldn't explain, why osd.8 hangs right after starting it, and osd.9 jumped to 100% cpu after around 15 minutes...
[11:30] <scalability-junk> darkfaded, true, but continuous backup with a versioned filesystem seems wrong when only ceph accessible objects are stored...
[11:31] <madkiss> gucki: i'm not saying that it's the hardware or something, I am just trying to figure out what exactly your Ceph is doing right now. There were some reported regressions for performance in 0.48.3 over the last days here (although none was officially confirmed just yet)
[11:31] <tnt> scalability-junk: yes. But it's actually in the same data center, just on a big nas, and it only backs up 'important files' (so for example, the original video, not the 5 different recompressed/rescaled versions of it that we could recompute in case of big problems).
[11:32] * zerthimon (~zerthimon@sovintel.iponweb.net) Quit (Quit: Leaving)
[11:32] <darkfaded> scalability-junk: ok i think my problem is that i mostly thought about ceph fs and you're more considering rbd
[11:33] <gucki> madkiss: mh, do you know if it's safe to downgrade to .2?
[11:33] <darkfaded> if i still come up with some idea for you i'll say it
[11:34] <madkiss> guck
[11:34] <madkiss> er
[11:34] <madkiss> r
[11:34] <tnt> gucki: AFAIK, there are a bunch of safeties that will prevent the osd from launching at all if the filestore is not at a compatible version. So I would just try ...
[11:34] <scalability-junk> darkfaded, tnt just to be clear I assume data is backed up anyway, but as a storage provider having the ability to restore a cluster or something seems like an essential thing
[11:34] <madkiss> gucki: i have not seen any contrary reports whatsoever
[11:35] <scalability-junk> tnt, so you backup individual files by accessing the objects/fs directly and then backup one replica?
[11:35] <darkfaded> scalability-junk: you're looking for a snapmirror of sorts
[11:35] <darkfaded> idk if that would be possible as of now
[11:36] <scalability-junk> mhh
[11:36] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has left #ceph
[11:36] <tnt> scalability-junk: well, I use ceph as S3-like storage. I have RBD VM image as well but those are designed to not store any data (so we can rebuild them from scratch without much impact), all data is either on DB servers, or in the S3 storage.
[11:36] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[11:36] <scalability-junk> damn wrong key ;)
[11:37] <scalability-junk> let's assume one issue in an upgrade brings down the whole cluster with the data, makes a data wipe, usually you would have a hard time recovering...
[11:38] <scalability-junk> or am I assuming something not worth thinking about?
[11:39] <scalability-junk> If I were dreamhost or another storage provider, having some disaster backup should be something good. let's say retaining a daily snapshot and a weekly one or so. or is that really something too expensive and not needed?
[11:41] <darkfaded> scalability-junk: depends on sla, risk and business model. someone has to run the numbers and might even come up with expensive/notneeded. but i guess that's not what they'd do, since it's also a reputation issue
[11:41] <darkfaded> write it up as a real nice mail and send it to the ceph list
[11:41] <darkfaded> because it's a good question
[11:42] <darkfaded> also, snapshots, even remote can be corrupted
[11:42] <darkfaded> so there's more to think about, and one really has to do some cost / safety calculation
[11:43] <scalability-junk> yeah I'll write up a mail.
[11:43] <darkfaded> <- very much looking forward to the thread
[11:44] <scalability-junk> darkfaded, me too
[11:45] <darkfaded> in my past, people have always opted for a backup mechanism that is independent of the mirror/replication
[11:45] <gucki> ok, i just downgraded....now both osds seem to be running normally again.
[11:45] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[11:45] <darkfaded> so even where they had a filer metrocluster plus a third filer pair for snapmirror
[11:45] <darkfaded> the backups were done independently
[11:46] <darkfaded> i'm trying to say: keep in mind that a backup is only safe if it's independent of the original
[11:47] <darkfaded> splitting out a bunch of OSDs+MONs would be the nicest cheap solution by that end
[11:47] <darkfaded> i.e. you run at replication 3 all time, then "attach" a 4th replica and split it after synching
[11:47] <darkfaded> (d'oh, but you'd need two sets for split/attach at least)
[11:48] <liiwi> pretty often people forget that backups need to be made unreachable by rm or delete commands
[11:48] <darkfaded> so you're looking at 40% overhead in my :cheap: scenario
[11:48] <darkfaded> liiwi: probably people like me being unable to communicate stuff in simple terms is something that adds to that
[11:48] <darkfaded> and cleverness, of course
[11:49] <darkfaded> liiwi: backups without a readable format are also great
[11:49] <darkfaded> i.e. "bup" the git-based backup
[11:49] <darkfaded> which is totally not a viable backup if you go by "needs a structured on-disk format"
[11:50] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[11:54] * ScOut3R (~ScOut3R@ has joined #ceph
[11:58] <scalability-junk> darkfaded, git annex is alright :)
[12:14] <scalability-junk> darkfaded, am I allowed to name you in the mail?
[12:14] <darkfaded> sure but i have no more function than lurking here for two years
[12:15] <darkfaded> but of course no problem
[12:17] <scalability-junk> darkfaded, good
[12:18] <darkfaded> scalability-junk: if it cheers you up, what you're worried about is a very common problem and is also tricky to solve if you even just use md raid or lvm
[12:19] <darkfaded> (at least if you also wanna keep 2 copies at any time plus the one you're pulling off the server)
[12:20] <scalability-junk> darkfaded, yeah I love such stuff :P
[12:20] <scalability-junk> darkfaded, but your special replica idea is something I like
[12:23] <scalability-junk> mailinglist approvals mehhh
[12:24] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[12:24] * darkfaded chuckles thinking about names like "splitbrain crush backup"
[12:24] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[12:27] * maswan (maswan@kennedy.acc.umu.se) has joined #ceph
[12:27] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[12:28] * scalability-junk finished the mail \o/
[12:28] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:29] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[12:29] <scalability-junk> darkfaded, splitbrain?
[12:35] <darkfaded> since you'd need to split out all layers (osd, mon) it's really a split brain of sorts
[12:35] <darkfaded> but all of them can be declared "out" on the original side so it's quite safe
[12:36] <darkfaded> (just in case something went wrong and they would stay visible to the orig side
[12:36] <darkfaded> after splitting)
[12:38] * Leseb (~Leseb@ Quit (Quit: Leseb)
[12:39] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[12:39] <scalability-junk> darkfaded, kk hopefully it will be an interesting thread.
[12:40] <scalability-junk> another question, how easy is it to go from 1 replica to 3?
[12:59] * low (~low@ has joined #ceph
[13:05] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[13:08] * The_Bishop (~bishop@e177088104.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[13:12] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:16] * xdeller (~xdeller@broadband-77-37-224-84.nationalcablenetworks.ru) Quit (Quit: Leaving)
[13:20] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[13:23] * Leseb (~Leseb@ has joined #ceph
[13:23] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[13:26] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:39] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:42] <scalability-junk> reading through the docs I wonder what happens when adding a lot of osds to the recommended placement group number?
[13:43] <scalability-junk> when starting with 2 osds and therefore 2*100/2 --> 100 as number for placement groups
[13:43] <scalability-junk> wouldn't adding 98 osds result in a bad number of placement groups?
[13:43] <scalability-junk> 100 vs the recommended 100*100/2 --> 5000 ?
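The arithmetic in the question above can be checked with a minimal sketch. It assumes the rule of thumb as scalability-junk quotes it (OSD count times 100, divided by the replica count); the function name and the default of 100 PGs per OSD are illustrative, not taken from any Ceph tool.

```python
def recommended_pg_count(num_osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    """Rule of thumb quoted in the log: (OSDs * 100) / replicas."""
    return num_osds * pgs_per_osd // replicas


# The two cases from the discussion: a 2-OSD cluster and the same
# cluster after growing to 100 OSDs, both with 2 replicas.
small = recommended_pg_count(num_osds=2, replicas=2)
large = recommended_pg_count(num_osds=100, replicas=2)
print(small, large, large // small)
```

A pool created with the small value would sit at one fiftieth of the suggested count once the cluster reaches 100 OSDs, which is the mismatch being asked about.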
[13:51] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:53] <JohansGlock> I'm looking into radosgw now, and was wondering, is there support for versioning like in S3?
[13:54] <JohansGlock> I found that old versions don't support it
[13:54] <JohansGlock> but can't find any current information about whether it got supported :)
[13:54] <tnt> not that I know.
[13:55] <tnt> scalability-junk: yes
[13:55] <JohansGlock> oh nvm, i'm blind... not supported indeed
[13:55] <scalability-junk> tnt, ok changing the pool size considerably seems bad then?
[13:56] <tnt> scalability-junk: yes, multiplying your OSD count by 50 isn't ideal ...
[13:56] <tnt> however pg splitting / merging is an upcoming feature.
[13:56] <scalability-junk> what is upcoming? eta?
[13:57] <tnt> experimental split support is in 0.56.1
[13:59] * schlitzer_ is now known as schlitzer|work
[13:59] <scalability-junk> ok oh damn ubuntu is way behind in terms of versions :P
[14:00] * Morg (d4438402@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[14:00] <scalability-junk> 0.41.1 omg
[14:01] <tnt> official distro repos are useless for this
[14:02] <scalability-junk> tnt, for ceph or for general production environments?
[14:02] <tnt> ceph
[14:03] <scalability-junk> what happens when I upgrade ceph on only half of my stack would that result in issues?
[14:03] <tnt> if you stick to 'stable' release (argonaut / bobtail) then it's fine.
[14:04] * tnt (~tnt@212-166-48-236.win.be) Quit (Quit: leaving)
[14:06] * ThomasOpod (~thomas@LLagny-156-35-38-195.w217-128.abo.wanadoo.fr) has joined #ceph
[14:07] <ThomasOpod> Hello
[14:09] <ThomasOpod> I have a question about ceph OSD, is there someone able to help ?
[14:09] <dennis> don't ask to ask, just ask
[14:10] <ThomasOpod> ok thanks, is it possible to change journal size in a running cluster ?
[14:11] <ThomasOpod> ceph 0.56.1
[14:13] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) Quit (Quit: Leaving.)
[14:14] <janos> ThomasOpod: i was able to change the location of my journals live, i would think size could be done. but not sure personally
[14:15] <janos> live cluster, not while each particular osd was live
[14:15] * ninkotech (~duplo@ has joined #ceph
[14:18] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[14:20] <ThomasOpod> janos: ok thx, seems I'm not able to shrink it
[14:21] <janos> many of the really knowledgeable folks are not awake yet
[14:21] <janos> but i would think that if i could move it - from the osd to an ssd, you could possibly play some tricks by moving it to a new location with new size
[14:22] <janos> and then back, if you really felt like it
[14:22] <janos> the procedure is basically -
[14:22] <janos> you have to shut down the osd,
[14:23] <janos> run ceph-osd -i N --flush-journal,
[14:23] <janos> change config for that osd, and
[14:23] <janos> ceph-osd -i N --mkjournal
[14:23] <janos> actually, you may be able to avoid moving it with that procedure
[14:23] <janos> shut down that osd, flush it, change size in config for that one, and mkjournal it
[14:23] <janos> start it back up
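The procedure janos describes can be written down as an ordered command list. This is only a dry-run sketch: the two `ceph-osd` invocations are the ones quoted in the log, while the `service ceph stop/start osd.N` lines and the helper name are assumptions about the init system in use, and nothing here is executed.

```python
from typing import List


def journal_resize_commands(osd_id: int) -> List[str]:
    """Return the steps for recreating one OSD's journal, in order."""
    return [
        # Assumption: sysvinit-style service control; adjust for your init system.
        f"service ceph stop osd.{osd_id}",
        # From the log: flush the existing journal into the object store.
        f"ceph-osd -i {osd_id} --flush-journal",
        # Manual step: change 'osd journal size' (and/or the journal path)
        # for this OSD in ceph.conf before continuing.
        "# edit ceph.conf for this OSD",
        # From the log: create a fresh journal with the new settings.
        f"ceph-osd -i {osd_id} --mkjournal",
        f"service ceph start osd.{osd_id}",
    ]


for step in journal_resize_commands(3):
    print(step)
```

The ordering matters: the journal is flushed while the OSD is stopped and before the config changes, so no pending writes are left in the old journal when the new one is created.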
[14:24] * agh (~agh@www.nowhere-else.org) has joined #ceph
[14:24] * thelan (~thelan@paris.servme.fr) has joined #ceph
[14:27] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) has joined #ceph
[14:40] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[14:42] <scalability-junk> tnt, thanks for your help
[14:43] * The_Bishop (~bishop@e177088104.adsl.alicedsl.de) has joined #ceph
[14:43] <scalability-junk> one question which came up while reading the docs: all data except the ceph.conf is replicated with the help of pools such as rbd etc right?
[14:47] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[14:49] <jmlowe> yes, the only thing you need to keep in sync yourself is the conf
[14:50] <scalability-junk> jmlowe, any recommended way? or is rsync or git pull alright
[14:50] <jmlowe> should be, it shouldn't change very much over time
[14:50] <ScOut3R> scalability-junk: or use a configuration management system, like chef or puppet
[14:51] <scalability-junk> yeah ok I think I go with chef + git :P
[14:51] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:52] <ThomasOpod> janos: thx I'll try that, currently my cluster is in reconstruction
[14:52] <jmlowe> most of the work is in setting up and mounting the filesystem for the osd and making sure you have ssh keys distributed, the config is the least of your worries
[14:52] <tnt> well, you really only need the mon config and the local osd config ...
[14:53] <scalability-junk> another question I would point my dns to radosgw, rbd or fs and keep that pointed to the ips for the servers running these services right?
[14:53] <scalability-junk> if I need more throughput I would add another radosgw ip/server to the dns for round robin right?
[14:53] <scalability-junk> tnt, jmlowe kk
[14:56] <jmlowe> you are correct as far as radosgw, rbd and fs take care of themselves if the clients have the config
[15:01] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[15:03] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[15:06] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[15:36] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[15:39] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:41] * vata (~vata@2607:fad8:4:6:221:5aff:fe2a:d1dd) has joined #ceph
[15:41] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[16:00] * jmlowe (~Adium@ has joined #ceph
[16:04] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[16:07] * xdeller (~xdeller@ has joined #ceph
[16:10] * xiaoxi (~xiaoxiche@jfdmzpr04-ext.jf.intel.com) has joined #ceph
[16:11] <sstan> Good morning
[16:12] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[16:16] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[16:16] * agh (~agh@www.nowhere-else.org) has joined #ceph
[16:27] * dosaboy (~user1@host86-164-232-154.range86-164.btcentralplus.com) has joined #ceph
[16:27] * dosaboy (~user1@host86-164-232-154.range86-164.btcentralplus.com) has left #ceph
[16:28] * The_Bishop (~bishop@e177088104.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[16:30] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Quit: Leaving.)
[16:34] * jjgalvez (~jjgalvez@ec2-54-235-219-17.compute-1.amazonaws.com) has joined #ceph
[16:38] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[16:39] <loicd> joshd: thanks for the hints about teuthology and coverage, very helpful :-D
[16:41] * jmlowe (~Adium@ Quit (Read error: Connection reset by peer)
[16:44] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[16:48] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[16:49] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[16:55] * low (~low@ Quit (Quit: Leaving)
[16:56] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has left #ceph
[16:59] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) has joined #ceph
[17:00] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[17:01] * aliguori (~anthony@cpe-70-112-157-151.austin.res.rr.com) Quit (Quit: Ex-Chat)
[17:10] <JohansGlock> http://ceph.com/docs/master/radosgw/purge-temp/ - deprecated, does anyone know how to purge now?
[17:12] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:12] * xiaoxi (~xiaoxiche@jfdmzpr04-ext.jf.intel.com) Quit (Remote host closed the connection)
[17:12] * jlogan (~Thunderbi@2600:c00:3010:1:a9fc:bead:751e:61d9) has joined #ceph
[17:13] * jtangwk1 (~Adium@2001:770:10:500:4b1:7be0:532e:4e6c) has joined #ceph
[17:14] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[17:14] * jtangwk (~Adium@2001:770:10:500:fc72:f03a:59a8:3b3f) Quit (Read error: Connection reset by peer)
[17:20] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[17:22] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) has joined #ceph
[17:22] * Dr_O (~owen@00012c05.user.oftc.net) Quit (Remote host closed the connection)
[17:23] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:28] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[17:31] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) has joined #ceph
[17:38] * Leseb (~Leseb@ Quit (Read error: No route to host)
[17:39] * tnt (~tnt@216.186-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:39] * aliguori (~anthony@ has joined #ceph
[17:40] * Leseb (~Leseb@ has joined #ceph
[17:44] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[17:48] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[17:52] <scuttlemonkey> johansglock: not sure, but that may be built in. yehudasa would know better
[17:52] <scuttlemonkey> either way the doc should reflect what's up
[17:52] <scuttlemonkey> let our doc writer know
[17:55] <JohansGlock> scuttlemonkey: thx, where can i contact him? (afk now)
[17:56] <scuttlemonkey> yeah, the west coast folks are just starting their day. John (doc writer) is gonna check with him
[17:56] <scuttlemonkey> I'll have him mention to also put the answer in irc when he gets in, if that suits you
[17:56] <scuttlemonkey> else just email me (patrick@inktank.com) and I'll follow up when I hear back
[18:02] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[18:03] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[18:06] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[18:06] * agh (~agh@www.nowhere-else.org) has joined #ceph
[18:07] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[18:14] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[18:16] * The_Bishop (~bishop@2001:470:50b6:0:9b6:b9a7:942f:f769) has joined #ceph
[18:17] * alram (~alram@ has joined #ceph
[18:26] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:30] <yehudasa> JohansGlock: starting at ver 0.52 there's a garbage collection process that runs automatically within rgw, no need to run external tools
[18:36] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:39] * sbadia (~sbadia@yasaw.net) Quit (Quit: WeeChat 0.3.8)
[18:46] * Pagefaulted (~AndChat73@c-67-168-132-228.hsd1.wa.comcast.net) Quit (Ping timeout: 480 seconds)
[18:48] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) Quit (Read error: Operation timed out)
[18:48] * guilhemfr (~guilhem@AMontsouris-652-1-174-162.w92-163.abo.wanadoo.fr) has joined #ceph
[18:50] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:52] * ThomasOpod (~thomas@LLagny-156-35-38-195.w217-128.abo.wanadoo.fr) Quit (Quit: Lost terminal)
[18:57] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[18:59] * jjgalvez (~jjgalvez@ec2-54-235-219-17.compute-1.amazonaws.com) Quit (Quit: Leaving.)
[19:01] <phantomcircuit> http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/
[19:01] <phantomcircuit> I extremly recommend to use the following mount options for your hard drive disks: user_xattr,rw,noexec,nodev,noatime,nodiratime,data=writeback,barrier=0
[19:01] <phantomcircuit> wat
[19:02] * AaronSchulz (~chatzilla@ Quit (Remote host closed the connection)
[19:02] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[19:02] <guilhemfr> Hi
[19:03] <guilhemfr> I have a problem with my osd after a reboot
[19:03] <guilhemfr> Osd crash here : In function 'OSDMapRef OSDService::get_map(epoch_t)'
[19:03] <guilhemfr> osd/OSD.cc: 4436: FAILED assert(_get_map_bl(epoch, bl))
[19:07] * davidz1 (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:07] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Read error: Connection reset by peer)
[19:08] * davidz1 (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit ()
[19:09] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:11] * sjustlaptop (~sam@2607:f298:a:607:e5ad:d6f0:f7a2:1819) has joined #ceph
[19:18] * chutzpah (~chutz@ has joined #ceph
[19:19] <elder> glowell1, how long should I need to wait for gitbuilder to notice I updated a branch?
[19:19] * jjgalvez (~jjgalvez@ has joined #ceph
[19:20] <glowell1> Depends on the gitbuilder. The kernel gitbuilder is slow. Sometimes longer than 15 minutes.
[19:20] <elder> OK.
[19:20] <elder> I just don't see anything building, and it hasn't noticed something changed either.
[19:20] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[19:21] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[19:21] <gregaf> guilhemfr: that's an unfortunate bug which sjust has queued up fixes for, but there's not a way to fix it retroactively right now; assuming your data is fully replicated it's best to just delete the PG in question from that OSD
[19:22] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[19:22] <mikedawson> guilhemfr: http://tracker.newdream.net/issues/3770
[19:23] <guilhemfr> thanks
[19:23] <guilhemfr> we just saw it 10 minutes ago
[19:23] <guilhemfr> We are compiling right now
[19:24] * Cube (~Cube@ has joined #ceph
[19:24] <guilhemfr> (debian-testing isn't up-to-date )
[19:24] <mikedawson> guilhemfr: As gregaf said Sam's commit will not retroactively fix the bug, it is designed to prevent it in the future
[19:25] <elder> glowell1, it looks like it's not doing anything, but might be waiting for something with "git remote prune linus"
[19:25] <scheuk> what is the best way of performing an upgrade on a running cluster from 0.48.2 to 0.56.1?, after doing an apt-get upgrade, should I restart the osd's first(one at time) or the monitors first, then the OSDs?
[19:26] <glowell1> elder: I'll have a look.
[19:26] <elder> Thank you.
[19:26] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[19:27] <guilhemfr> michaeltchapman, gregaf : how to find the bad PG ?
[19:28] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[19:29] <mikedawson> scheuk: if you are looking for a stable branch, my experience is 0.56.1 is not quite ready yet
[19:32] <scheuk> mikedawson: what's wrong with 0.56.1?
[19:32] <mikedawson> among other things, several are struggling with http://tracker.newdream.net/issues/3770
[19:32] <scheuk> currently in my 0.48.2 cluster, the OSDs have a memory leak
[19:32] * sjustlaptop (~sam@2607:f298:a:607:e5ad:d6f0:f7a2:1819) Quit (Ping timeout: 480 seconds)
[19:33] <mikedawson> scheuk: a fix has been committed, so I'm hoping for 0.56.2 relatively soon
[19:35] * portante (~user@ has joined #ceph
[19:36] <scheuk> mikedawson: so what you're suggesting is to wait for 0.56.2 :)
[19:37] <mikedawson> scheuk: based on my experience, I'm waiting for 0.56.2 (or later)
[19:37] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:38] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[19:38] <elder> Building now glowell1, thanks a lot.
[19:39] * agh (~agh@www.nowhere-else.org) has joined #ceph
[19:40] <scheuk> sounds good
[19:41] <xdeller> yehudasa, could you please advise on the ext4-large-xattr patch: is it worth using instead of xfs today?
[19:41] * rlr219 (43c87e04@ircip1.mibbit.com) has joined #ceph
[19:41] * scalability-junk (~stp@188-193-202-99-dynip.superkabel.de) has joined #ceph
[19:41] <yehudasa> xdeller: no, it's not worth it
[19:42] <yehudasa> xdeller: don't remember the details, but it didn't provide what we needed
[19:42] <yehudasa> xdeller: instead you can use the omap xattrs
[19:44] <xdeller> yehudasa: on bare ext4? A few days ago I ran into the impossibility of restoring a slightly damaged mixed ext4-xfs cluster; the ext4 instances died after a few minutes, unable to check the next heartbeat
[19:46] <yehudasa> xdeller: did your ext4 osds use omap xattrs or regular xattrs? did it fail due to xattrs issues?
[19:47] <xdeller> yehudasa: omap, of course
[19:48] <yehudasa> xdeller: well,in any case the large xattr patch wouldn't solve anything
[19:49] <mikedawson> nhm: time for a few performance questions?
[19:50] <xdeller> yehudasa: ok, thanks. I'm trying to solve the strange issue I mentioned on the ml - xfs-backed nodes ate all available cpu in context switches, so I suspect xfs in the first place and am looking for a possible replacement :)
[19:52] <sjust> guilhemfr: easiest thing would be for me to push a branch with a work around
[19:53] <sjust> guilhemfr: would that work for you?
[19:53] <mikedawson> Testing performance, when I write 100% sequential, I see roughly equivalent write throughput on the SSD journal partition and the SSD-backed OSD drive. When I write 100% random, I see very low throughput on the journal, and the OSD has higher throughput than in the sequential case. Any idea?
[19:53] * benpol (~benp@garage.reed.edu) has joined #ceph
[19:54] <sjust> mikedawson: sounds like the filestore and filesystem are causing some write amplification which for small random io is to be expected
[19:55] <mikedawson> sjust: this is 16K
[19:55] <sjust> hmm
[19:55] <sjust> how are you measuring the IO?
[19:56] <mikedawson> iops are about 8x better for 100% sequential vs 100% random at 16K, 100% write. Test is iometer on a KVM/libvirt instance using RBD
[19:57] <sjust> is this against the raw rbd device or on an fs?
[19:57] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[19:57] <mikedawson> sjust: raw rbd without a filesystem
[19:58] <sjust> the sequential was also 16k?
[19:58] <mikedawson> yes
[19:58] <sjust> rbd striping?
[19:58] <mikedawson> don't know
[19:58] <mikedawson> how do I check?
[19:58] <sjust> 56.1?
[19:58] <sjust> joshd: how should mikedawson check for rbd striping?
[20:00] <phantomcircuit> hehe
[20:00] <joshd> rbd info will show it
[20:00] <phantomcircuit> xfs logdev set to an ssd partition plus xattr
[20:00] <phantomcircuit> nice performance jump
[20:01] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[20:03] <mikedawson> root@node1:~# rbd -p volumes info volume-6d368682-14ff-4bd4-af12-f25969314c93
[20:03] <mikedawson> rbd image 'volume-6d368682-14ff-4bd4-af12-f25969314c93':
[20:03] <joshd> sjust: mikedawson: if you don't specify fancy striping at creation time, you're just using plain striping mapping object-sized chunks of the image to a single object
[20:03] <mikedawson> size 102400 MB in 25600 objects
[20:03] <mikedawson> order 22 (4096 KB objects)
[20:04] <mikedawson> block_name_prefix: rbd_data.11432ae8944a
[20:04] <mikedawson> format: 2
[20:04] <mikedawson> features: layering, striping
[20:04] <mikedawson> stripe unit: 4096 KB
[20:04] <mikedawson> stripe count: 1
[20:04] <guilhemfr> sjust, ok for us. you can send the link to guilhem.lettron _ at _ youscribe.com
[20:04] <sjust> guilhemfr: I'll push a branch to the ceph git repo
[20:05] <joshd> sjust: mikedawson: yeah, so that's the default 4MiB simple striping
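The figures in the `rbd info` output above are consistent with each other, as a short sketch shows. It assumes only that `order` is the base-2 logarithm of the object size in bytes, which is what "order 22 (4096 KB objects)" indicates; the helper names are illustrative.

```python
def object_size_bytes(order: int) -> int:
    """Object size implied by the image order (2**order bytes)."""
    return 1 << order


def object_count(image_size_mb: int, order: int) -> int:
    """Number of objects needed to back an image of the given size."""
    image_bytes = image_size_mb * 1024 * 1024
    size = object_size_bytes(order)
    return -(-image_bytes // size)  # ceiling division


print(object_size_bytes(22) // 1024)  # object size in KB
print(object_count(102400, 22))       # objects backing the 102400 MB image
```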
[20:05] <phantomcircuit> joshd, what's the point of fancy striping?
[20:05] <guilhemfr> sjust, which name ?
[20:05] <joshd> phantomcircuit: it's really mainly useful for small sequential i/o workloads, like a db's journal
[20:06] <sjust> guilhemfr: working on it
[20:06] * xdeller (~xdeller@ Quit (Quit: Leaving)
[20:06] <mikedawson> joshd: if I want a more performant volume for small writes, how do I create a volume with fancy striping?
[20:07] <joshd> mikedawson: use --stripe-count, --stripe-unit, and --order (for object size) that the picture at the bottom of this page explains: http://ceph.com/docs/master/dev/file-striping/
[20:07] * Teduardo (~DW-10297@dhcp92.cmh.ee.net) has joined #ceph
[20:08] <joshd> mikedawson: it won't matter so much for random i/o though. just using smaller objects may help if too many of your writes end up going to the same object (and thus waiting for each other)
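To make the difference between simple and fancy striping concrete, here is a sketch of how an image offset maps to an object, following the scheme described on the file-striping page joshd links. The function name is hypothetical, and the 64 KiB stripe unit with a stripe count of 4 is an illustrative choice, not a recommendation from the log.

```python
KIB = 1024
MIB = 1024 * KIB


def locate(offset: int, stripe_unit: int, stripe_count: int, object_size: int):
    """Map a byte offset in the image to (object number, offset in object)."""
    stripes_per_object = object_size // stripe_unit
    block = offset // stripe_unit            # which stripe unit overall
    stripe_no = block // stripe_count        # which row of the stripe
    stripe_pos = block % stripe_count        # which object within the set
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    object_off = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, object_off


# Default layout from the log: stripe unit == object size, stripe count 1.
# Consecutive 64 KiB ranges all land in object 0 until 4 MiB is reached.
simple = [locate(i * 64 * KIB, 4 * MIB, 1, 4 * MIB)[0] for i in range(5)]

# Illustrative fancy layout: 64 KiB units round-robined over 4 objects.
fancy = [locate(i * 64 * KIB, 64 * KIB, 4, 4 * MIB)[0] for i in range(5)]

print(simple)
print(fancy)
```

With the default layout a small sequential writer keeps hitting the same object, while the striped layout spreads neighbouring writes over several objects, which is why joshd singles out small sequential workloads as the main beneficiary.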
[20:09] <Zethrok> Hi, just curious - is it possible to move a rbd image between different pools?
[20:09] <Zethrok> Or would that equate something like an export and import?
[20:09] * guilhemfr (~guilhem@AMontsouris-652-1-174-162.w92-163.abo.wanadoo.fr) Quit (Quit: Quitte)
[20:10] <joshd> Zethrok: you can 'rbd cp' between pools, but it still has to copy all the data
[20:10] <joshd> Zethrok: or you can clone between pools
[20:10] <mikedawson> joshd: random or sequential, I can't get >150iops/drive with OSDs on SSD right now from Folsom, Cinder, KVM, and RBD. Something is killing performance
[20:11] <mikedawson> joshd: to be clear there are no spinners anywhere
[20:11] * sbadia (~sbadia@aether.yasaw.net) has joined #ceph
[20:12] <Zethrok> joshd: Thanks - was asked a question if it was possible to migrate a running rbd image between pools. So maybe a clone and then flatten or would that not work live?
[20:13] <joshd> Zethrok: it works live in 0.56 at least
[20:13] <phantomcircuit> joshd, maybe you can answer this for me, why is there a significant performance difference between 4MB blocks and 1KB blocks with rados bench
[20:13] <Zethrok> joshd: Great, thanks!
[20:13] <phantomcircuit> rados bench doesn't issue a flush afaict so it should hypothetically be about the same minus processing overhead
[20:15] <joshd> phantomcircuit: there's no flush operation specifically - writes only complete when they're acked by all replicas. I'd guess 1kb is taking more advantage of the journal, but I'm not sure
[20:16] <phantomcircuit> it doesn't seem to be which is what im saying :)
[20:16] <phantomcircuit> end up with logs full of 2013-01-15 19:16:07.610137 osd.1 [WRN] slow request 79.994703 seconds old, received at 2013-01-15 19:14:47.615392: osd_op(client.4102.0:43273 benchmark_data_ns238708_7065_object18335 [delete] 2.4ab4df19) v4 currently waiting for sub ops
[20:18] <joshd> mikedawson: do you have filestore threads and osd op threads set higher than default?
[20:19] <mikedawson> joshd: I have a ceph.conf with hardly anything in it (no performance tuning attempted yet). I did however have to set the tunables to get a rebalance to complete.
[20:20] * benner_ (~benner@ has joined #ceph
[20:20] * benner (~benner@ Quit (Read error: Connection reset by peer)
[20:21] <phantomcircuit> gentoo has 0.49 set as stable
[20:21] <phantomcircuit> is that even a valid point release?
[20:21] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:22] <Teduardo> Has anyone tested using ceph as the backend for OnApp per chance? Onapp seems to really want a block device presented via ISCSI that it then creates a bunch of LVMs on top of, I suppose it doesn't rightly matter how the block device is presented, as long as it can go girl gone LVM wild on the block device?
[20:26] * scalability-junk (~stp@188-193-202-99-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[20:32] <Teduardo> or what I like to call, a whole lot of PVs
[20:32] <joshd> mikedawson: there are a bunch of tunings that help with iops on the mailing list - disabling debug logging entirely is unfortunately one that seems to help a bunch
[20:32] * scalability-junk (~stp@188-193-202-99-dynip.superkabel.de) has joined #ceph
[20:35] * yehudasa (~yehudasa@2607:f298:a:607:10dc:fb8:97ed:714f) Quit (Quit: Ex-Chat)
[20:35] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[20:37] <gregaf> Teduardo: I don't know about OnApp, but people are doing even crazier things like re-exporting RBD volumes over iSCSI
[20:38] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[20:39] <Teduardo> Well, you have to be a little bit crazy to use OnApp anyway
[20:39] <Teduardo> or you signed up for a 1 year free trial and then the free trial expired so you actually have to implement it
[20:52] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[20:55] <fghaas> gregaf: thanks for calling my blog posts crazy :P
[20:59] <dmick> fghaas: not the blog posts, just the tasks they describe :)
[20:59] <fghaas> dmick: yeah thanks, that's even worse ;)
[21:02] <scalability-junk> why use onapp when you have ceph?
[21:02] <scalability-junk> onapp is just another san provider...
[21:05] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[21:05] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[21:05] * agh (~agh@www.nowhere-else.org) has joined #ceph
[21:11] <Teduardo> Onapp is a virtualization "thingy"
[21:11] <Teduardo> like Cloudstack
[21:13] <scalability-junk> openstack kvm ftw...
[21:13] <Teduardo> yeahhhhhhhhhhhhh if you have an army of software developers =)
[21:15] <scalability-junk> Teduardo, onapp is so much better?
[21:16] <Teduardo> no, but it works out of the box so at least you can launch something while your army of software developers are working on an openstack deployment
[21:16] <Teduardo> and it has support when something breaks
[21:16] <scalability-junk> Teduardo, you can just launch an openstack cluster without an army too
[21:16] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[21:16] <scalability-junk> stackops.org for example is an out of the box + support implementation of openstack
[21:17] <scalability-junk> same for swiftstack.org
[21:17] <phantomcircuit> hmm
[21:17] <scalability-junk> and for rackspace private cloud software
[21:17] <scalability-junk> with your own hardware there is other stuff too
[21:17] <phantomcircuit> i have journals on ssd (20+k IOPS easy) and filestore on a normal hdd
[21:17] <dmick> but the higher the interoperability, the better for all
[21:18] <phantomcircuit> using flashcache for the filestore increased iops from 120 to 500
[21:18] <phantomcircuit> (writeback cache)
[21:18] <phantomcircuit> this is all with rados bench and 1k block size
[21:18] <phantomcircuit> seems like there's room for improvement in the filestore code somewhere
[21:19] <phantomcircuit> theoretically flashcache shouldn't have helped at all with a write only work load
[21:19] <Teduardo> Openstack is moving fairly fast with features/releases, if you're going to deploy it now wouldn't you want to completely control all aspects of it instead of waiting for someone else to package new features?
[21:20] <scalability-junk> Teduardo, you wanted an all in one and out of the box solution
[21:20] <scalability-junk> Teduardo, I'm working on my chef scripts for my own cluster.
[21:20] <Teduardo> No, I didnt =) I wasn't complaining about not using openstack I just explained why I wasn't currently using it
[21:20] <scalability-junk> ^^ and I gave some great suggestions :P
[21:21] <Teduardo> I wasted about 3 weeks trying to get crowbar to magically build me an openstack deployment and then realized that it was a time sink
[21:21] <scalability-junk> Teduardo, why crowbar?
[21:22] <Teduardo> It seems like a good idea, and it also has ceph barclamps and since i was going to try and use ceph for the storage backend it seemed like an OK route
[21:22] <Teduardo> but it was anything but an OK route =)
[21:23] <scalability-junk> mh I will probably go the openstack + ceph with chef script route
[21:24] <Teduardo> sure sure, technically you could probably do it all in Ubuntu just using kickstart
[21:24] <scalability-junk> minimal server setup with 1 or 2 servers onsite and one offsite server for ceph backup, still no good idea on how to achieve offsite replication in a good way
[21:24] <scalability-junk> Teduardo, kickstart is crap in ubuntu
[21:24] <scalability-junk> doesn't support raid1 + encryption
[21:25] <Teduardo> ah, we use hw raid so that hasn't been an issue for us
[21:25] <scalability-junk> mh yeah the biggest issue is encryption
[21:25] <scalability-junk> and I'm still not sure how to deal with that on ceph
[21:25] <Teduardo> you encrypt your nova boxes?
[21:25] <scalability-junk> nova and ceph boxes
[21:25] <scalability-junk> actually I will start with 1 server for nova, control and ceph :P
[21:26] <scalability-junk> probably miserable performance, but I'll see
[21:26] <Teduardo> yeah, I have 6 dell c6100s Im using for playing around with openstack
[21:27] <scalability-junk> Teduardo, yeah I'm still not sure how to do the disk setup
[21:27] <Teduardo> can you just encrypt the VMs themselves or are you in financial services/healthcare?
[21:27] <scalability-junk> I have 2 disks and want raid1 for nova and no raid for ceph, but still have encryption for both... not the easiest to setup :D
[21:28] <scalability-junk> Teduardo, the thing is not the data while running, but data safety after disk exchange etc.
[21:28] <Teduardo> dban =]
[21:29] <Teduardo> also i think razor may end up being a lot better than crowbar
[21:29] <Teduardo> (puppet-razor) that is
[21:31] <scalability-junk> why puppet?
[21:33] <Teduardo> im not so much concerned with puppet/chef as I am taking 5 racks of baremetal servers and turning them into something useful automatically (which is what both crowbar and razor might eventually do)
[21:33] <scalability-junk> Teduardo, if the disk is broken, dban isn't really usable; encryption will save data on disk...
[21:33] <scalability-junk> Teduardo, any switch with os available?
[21:34] <scalability-junk> if you have arista switches a great solution from pistoncloud.com
[21:34] <scalability-junk> usb + arista -> easy cloud
[21:35] <Teduardo> ah, we can pxeboot/install any of the ports on our network
[21:36] <Teduardo> so anywhoo! im stuck with Onapp for now =)
[21:37] <Teduardo> if version 3 is as bad as version 2 that will be enough discomfort to get us to migrate away
[21:38] <dennis> Teduardo: what's bad about onapp?
[21:43] <scalability-junk> Teduardo, mh pxeboot is it really something that useful? not sure if I really want to read into it for my one server setup :D
[21:44] <janos> if my ceph performance has dropped (using rbd) and i see quite a few "currently waiting for sub ops" when doing ceph -w, is that an indicator of any particular bottleneck?
[21:45] <gregaf> annoying filesystem aging, maybe
[21:46] <janos> it's a fairly new cluster at home that i've started filling up to see how it does
[21:46] <janos> using btrfs
[21:46] <scalability-junk> anyone from the mailinglist team here? still waiting for my approval...
[21:48] <gregaf> scalability-junk: mailing list team? approval?
[21:49] <dmick> you mean majordomo/ceph-devel?
[21:49] <dmick> that's vger, not us
[21:49] <gregaf> it's just hosted on vger; we don't have any access to it and you don't need manual approval to subscribe or post
[21:50] <scalability-junk> anyone from the mailinglist team here? still waiting for my approval...
[21:51] <scalability-junk> dmick, mh kk
[21:51] <dmick> ?
[21:51] <scalability-junk> gregaf, it said I need approval...
[21:52] <gregaf> in reply to an email? are you subscribed, and did it include attachments?
[21:52] <gregaf> that's all I can think of, but I've never heard of that happening before on ceph-devel
[21:52] <scalability-junk> has been forwarded to the owner of the "ceph-devel" list for approval.
[21:53] <dmick> there's a relatively-small limit on message size. Is that *all* the bounce message said?
[21:53] <scalability-junk> This could be for any of several reasons:
[21:53] * jmlowe (~Adium@ has joined #ceph
[21:54] <scalability-junk> dmick, I just try it again
[21:54] <jmlowe> just out of curiosity, is there an easy way to tell how many objects are being stored in a ceph cluster?
[21:55] <scalability-junk> ok worked on second try
[21:55] <dmick> jmlowe: rados df
[21:56] <jmlowe> dmick: thanks
[21:57] <scalability-junk> let's see if my mail goes through this time and I hopefully haven't double sent it...
[21:58] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[21:59] <jmlowe> sjust: may have cried wolf earlier, recreated my cluster over the weekend and I haven't been able to break it yet
[22:00] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[22:01] * jmlowe1 (~Adium@ has joined #ceph
[22:01] * jmlowe (~Adium@ Quit (Read error: Connection reset by peer)
[22:08] <dmick> scalability-junk: I just sent a test message from an account I know is not subscribed; it was accepted
[22:10] <benpol> A few days ago I boosted the replication level in my test cluster from 2 to 3. Seemed to go pretty smoothly, but I still have 3 PGs stuck in an "active+degraded" state (only 2 copies). The whole cluster's been rebooted, but those 3 PGs are still stuck.
[22:11] <scalability-junk> dmick, to which address?
[22:11] <benpol> Here's the output of 'ceph pg 0.a0 query' http://pastebin.com/raw.php?i=FUGHn4Yy
[22:11] <dmick> scalability-junk: ceph-devel@vger.kernel.org. Is that not what we're talking about?
[22:11] * jmlowe1 (~Adium@ Quit (Quit: Leaving.)
[22:12] <scalability-junk> dmick, yes it is; did you get a mail about backup stuff?
[22:12] <elder> dmick any idea why I can easily download a file from github here but teuthology has trouble with the same URL?
[22:12] <dmick> scalability-junk: no
[22:12] <elder> (Happening at this instant)
[22:12] <dmick> elder: no
[22:12] <elder> Okey-doke.
[22:12] <dmick> but I will note that I cannot contact teuthology atm
[22:13] <dmick> so clearly something's wrong
[22:13] <elder> I can get into teuthology...
[22:13] <elder> Oh, and it looks like the other problem now has gone away.
[22:13] <dmick> hm. I have no openvpn.
[22:13] <scalability-junk> dmick, ok strange
[22:14] <dmick> that's better.
[22:21] * scalability-junk (~stp@188-193-202-99-dynip.superkabel.de) Quit (Quit: Leaving)
[22:21] * scalability-junk (~stp@188-193-202-99-dynip.superkabel.de) has joined #ceph
[22:25] <joshd> benpol: it's only being mapped to two osds. enabling crush tunables (http://ceph.com/docs/master/rados/operations/crush-map/#tunables) if you're not using kernel clients older than 3.5 may fix it
[22:25] * John (~john@astound-66-234-218-187.ca.astound.net) has joined #ceph
[22:27] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[22:28] <John> There is a basic set of definitions for states here: http://ceph.com/docs/master/rados/operations/pg-states/
[22:28] <John> However, I see some combinations that could use some clarification.
[22:29] <John> active+clean, stale+active+clean, stale+peering, stuck inactive, stuck stale, stuck unclean, degraded+unclean, stale+down+remapped+peering, down+peering
[22:30] <John> active vs. inactive seems like a boolean, but we also have "down".
[22:30] <sjust> stale means that the most recent report received by the monitors is not from the currently mapped master
[22:30] <John> How does "active" compare to "inactive" and "down"
[22:30] <sjust> down means that there may be updates which only exist on down osds
[22:31] <John> I have the basic definition of stale. How would we know if something is active if it is also stale?
[22:31] <sjust> we really don't, that's why we have stale+active
[22:31] <sjust> active is the last we heard
[22:31] <sjust> stale means that it's probably not current
[22:31] <John> ahh... that's what I thought...
[22:32] <John> so same with stale+peering? "Last time we checked, they were peering, but they haven't updated us in awhile?"
[22:32] <sjust> right
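(A small sketch of how the combined state strings in this exchange can be read: split on "+" and treat each flag separately, with "stale" meaning the other flags are only the last known status. The wording of the notes paraphrases sjust's explanations above; the helper itself is hypothetical and not part of Ceph.)

```python
def describe_pg_state(state):
    """Split a PG state string such as 'stale+active+clean' into flags and notes."""
    flags = set(state.split("+"))
    notes = []
    if "stale" in flags:
        # Per sjust: the latest report the monitors have is not from the
        # currently mapped primary, so the remaining flags are last-known only.
        notes.append("stale: remaining flags are the last reported status")
    if "down" in flags:
        # Per sjust: there may be updates that exist only on OSDs that are down.
        notes.append("down: some updates may exist only on down OSDs")
    return flags, notes


for s in ("active+clean", "stale+active+clean", "stale+down+remapped+peering"):
    flags, notes = describe_pg_state(s)
    print(s, sorted(flags), notes)
```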
[22:33] <John> So down means an OSD has the most recent update, but it's down. What does "inactive" mean compared to "active"?
[22:33] <benpol> joshd: hmm thanks, but I've already enabled tunables, odd.
[22:37] <joshd> benpol: anything else special about your crushmap? do you have many down osds?
[22:39] <absynth_47215> evening
[22:39] <absynth_47215> any inktankers around?
[22:39] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:39] <benpol> joshd: no, nothing other than the tunables and the replication values have been added to the crushmap.
[22:40] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[22:40] <fghaas> absynth_47215: sjust and joshd have been talking in the last 10 minutes, so I guess that's a yes :)
[22:40] <benpol> I shut down the first of the two listed OSDs and let the cluster rebuild, then started the OSD again and now I have 3 PGs in active+remapped state.
[22:40] <absynth_47215> they went into hiding 90 secs ago ;)
[22:43] <sjust> absynth_47215: what's up?
[22:44] <absynth_47215> is sage in the office this week?
[22:47] <sjust> not today anyway
[22:47] * gucki (~smuxi@80-218-125-247.dclient.hispeed.ch) Quit (Remote host closed the connection)
[22:48] <absynth_47215> i wish i wasn't either. ;)
[22:49] <absynth_47215> two weeks into the year and i'd like a holiday
[22:50] <absynth_47215> well, he's writing e-mails at least
[22:50] * agh (~agh@www.nowhere-else.org) Quit (Read error: Connection reset by peer)
[22:50] <absynth_47215> to the list
[22:51] * yehudasa (~yehudasa@2607:f298:a:607:39fc:6c7f:2c27:2458) has joined #ceph
[22:53] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) has joined #ceph
[22:56] <dmick> absynth_47215: sage never really stops working
[22:56] <dmick> I'm pretty sure he commits code from his dreams
[22:56] <joshd> benpol: if you could attach your osdmap to a bug in the tracker along with a pg dump, we can see what's going on
[22:57] <benpol> joshd: ok thanks, will do
[22:58] * xdeller (~xdeller@broadband-77-37-224-84.nationalcablenetworks.ru) has joined #ceph
[23:00] <phantomcircuit> gregaf, hey no luck on recovering, the osd's report issues with the pg log
[23:01] <gregaf> you mean getting your monitor back?
[23:08] * jjgalvez1 (~jjgalvez@ec2-54-235-219-17.compute-1.amazonaws.com) has joined #ceph
[23:10] <phantomcircuit> yeah
[23:10] <phantomcircuit> im pretty sure both the monmap and osdmap were right
[23:10] <phantomcircuit> but the pgmap was all wrong
[23:12] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[23:16] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[23:18] <benpol> joshd: http://tracker.newdream.net/issues/3806
[23:18] <benpol> (and thanks for looking at it!)
[23:20] <phantomcircuit> gregaf, yeah that's what i mean
[23:20] <phantomcircuit> t
[23:20] <gregaf> yeah; it's definitely possible but it would be easiest for somebody who can read the code and we don't have any tools to make it easier :(
[23:21] <phantomcircuit> so im pretty sure i have the monmap & osdmap right
[23:21] <phantomcircuit> how would the pgmap be restored though?
[23:21] <phantomcircuit> it seems impossible
[23:23] <phantomcircuit> i backed up the osd's and am rebuilding with 0.48
[23:23] <phantomcircuit> im seeing weird performance issues with rados bench where write throughput will be ~120 MB/s (expected) and then drop to 0
[23:24] <phantomcircuit> the journals are 10 GB so should make it at least the 30 seconds the test runs before becoming full
[23:25] <nhm> phantomcircuit: you might have to tweak some of the journal settings. You might be hitting a max bytes or max ops limit resulting in journal flushes happening regularly.
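(The arithmetic behind phantomcircuit's expectation, as a quick sketch. It assumes a single 10 GB journal absorbing the full ~120 MB/s with nothing drained to the filestore in the meantime, and takes 1 GB as 1024 MB. Under those assumptions the journal would take well over 30 seconds to fill, which is why a stall points at some other limit, as nhm suggests.)

```python
def seconds_to_fill(journal_gb, write_mb_per_s):
    """Seconds until a journal of journal_gb fills at a steady write rate.

    Assumes no data leaves the journal while it fills (worst case).
    """
    return journal_gb * 1024 / write_mb_per_s


fill = seconds_to_fill(10, 120)
print(f"journal full after about {fill:.0f} s; bench run lasts 30 s")
```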
[23:25] <gregaf> phantomcircuit: I don't think you need the pgmap as long as you declare it correctly on the monitor
[23:26] <gregaf> although I guess last time I looked at this I created an empty one with the right epochs because it wasn't a full data loss
[23:27] <phantomcircuit> nhm, yeah i did i thought i had increased them to ludicrous settings but maybe not http://pastebin.com/raw.php?i=tiLrF5Xk
[23:29] <nhm> phantomcircuit: if you check the admin socket for the osd when the writes stall, you can tell what the state of the journal was.
[23:34] <phantomcircuit> i see what it is
[23:34] <phantomcircuit> the journal blocks every time the filestore is writing out
[23:35] <nhm> huh
[23:35] <phantomcircuit> yeah im watching iostat
[23:35] <nhm> that hardly seems productive!
[23:36] <phantomcircuit> current throughput drops to zero at the same time the filestore goes from 0 activity to 130 MB/s
[23:36] <phantomcircuit> it's definitely blocking
[23:36] <phantomcircuit> sort of defeats the purpose of the journal huh
[23:36] <nhm> It will do that under some circumstances if it's hit a limit
[23:37] <gregaf> yeah, but it shouldn't hit that limit…sjust, things to check?
[23:37] <phantomcircuit> i cant imagine what limit it would be hitting
[23:37] <nhm> phantomcircuit: IE if it doesn't think there is space left.
[23:37] <phantomcircuit> i'll do a perf dump of idle writing and blocking
[23:37] <phantomcircuit> this is idle
[23:37] <phantomcircuit> http://pastebin.com/raw.php?i=qm53zmw8
[23:38] <phantomcircuit> running
[23:38] <phantomcircuit> http://pastebin.com/raw.php?i=mrL5UStf
[23:39] <phantomcircuit> blocking
[23:39] <phantomcircuit> http://pastebin.com/raw.php?i=xfrbhwL2
[23:39] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[23:42] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:42] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:44] <xdeller> http://imgur.com/wIRVn,k0QCS#0 http://imgur.com/wIRVn,k0QCS#1 kinda deadlock with ipoib and 3.8-rc3, starting nodes
[23:45] * vata (~vata@2607:fad8:4:6:221:5aff:fe2a:d1dd) Quit (Quit: Leaving.)
[23:45] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) has joined #ceph
[23:46] <phantomcircuit> http://pastebin.com/raw.php?i=7rx7Ab7G
[23:48] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:48] <phantomcircuit> the only significant difference i see is filestore.committing
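(A sketch of how two perf dumps, such as the "running" and "blocking" pastes above, could be compared mechanically instead of by eye. It assumes each dump is JSON that loads into nested dicts. The sample values below are invented for illustration; only the `filestore.committing` name comes from the log, and `some_other_counter` is a made-up placeholder.)

```python
def flatten(d, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} form."""
    out = {}
    for key, value in d.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out


def diff_dumps(a, b):
    """Return {counter: (value_in_a, value_in_b)} for counters that differ."""
    fa, fb = flatten(a), flatten(b)
    return {
        k: (fa.get(k), fb.get(k))
        for k in sorted(set(fa) | set(fb))
        if fa.get(k) != fb.get(k)
    }


# Invented sample data; real dumps would come from json.loads() on the output.
running = {"filestore": {"committing": 0, "some_other_counter": 5}}
blocking = {"filestore": {"committing": 1, "some_other_counter": 5}}
print(diff_dumps(running, blocking))
```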
[23:49] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[23:49] <sjust> phantomcircuit: do you have the flusher turned on?
[23:49] <sjust> in some annoying cases, the fs will delay flushing until the sync leading to this kind of behavior
[23:49] <phantomcircuit> no i dont should i try again with it on?
[23:49] <sjust> yeah, worth a shot
[23:49] <sjust> are you running 48 or 56?
[23:50] <phantomcircuit> 0.49 actually it's the stable branch in gentoo
[23:50] <sjust> oh man, I don't even know what's in 49
[23:50] <sjust> :(
[23:50] <phantomcircuit> it's a development release
[23:50] <phantomcircuit> so it's kind of bizarre that it's marked stable
[23:50] <sjust> yeah, but we backported many important things to 48.*
[23:50] <phantomcircuit> ah
[23:51] <sjust> (48 being argonaut)
[23:53] <sjust> the flusher also got a bit smarter somewhere in 5*
[23:54] <phantomcircuit> after turning on the flusher the deadlock appears to have disappeared
[23:56] <phantomcircuit> just trading one evil for another
[23:56] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[23:56] <phantomcircuit> throughput is now consistent
[23:56] <phantomcircuit> but latency is consistently terrible
[23:56] <phantomcircuit> 8.09365 seconds
[23:56] <phantomcircuit> oh wait no that's unrelated
[23:57] <phantomcircuit> teach me to change multiple variables during performance testing
[23:58] <phantomcircuit> i need more threads :)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.