#ceph IRC Log


IRC Log for 2013-02-22

Timestamps are in GMT/BST.

[0:00] <sstan> If you're testing or don't need 100% performance, then there is no problem though
[0:00] <scalability-junk> sstan: sort of a production system...
[0:01] <sstan> if you can have 1 disk for OS and 1 for OSD it would be great, but else, it will work
[0:02] <scalability-junk> it will work with a total drawback or considering the limited machines it could be a good starting point
[0:02] <scalability-junk> ?
[0:06] <sstan> it will work 90% as fast as a two-disk solution
[0:06] <sstan> more or less
[0:07] <scalability-junk> sstan: sounds promising
[0:08] <scalability-junk> probably everyone would kill me for my "production" setup :P
[0:08] <sstan> actually, I think it's all about where you put the OSD journal
[0:08] <scalability-junk> kk
[0:09] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[0:09] <sstan> I put mine in /dev/shm, and the speed became decent only then. The drawback is that you lose the journal if the machine reboots
[0:10] <sstan> so ideally, you would have perhaps one SSD drive for journal, and an ordinary one for the OSD.
[0:10] <sstan> at that point, I'd install the OS on the ssd instead of having three drives per machine
[0:11] <lurbs> sstan: You're running btrfs on the OSDs?
[0:11] <sstan> yeah
[0:11] <scalability-junk> kk so if i have 2 disks non ssd the best option would be ram for journal...
[0:11] <scalability-junk> mhhh
[0:11] <sstan> we had a conversation in this channel 1 or 2 days ago about that
[0:13] <lurbs> scalability-junk: Even considering that, for anything past a test setup, means you'd need to use btrfs on the OSDs. You can't guarantee that any of the others will be in a consistent state if the journal dies.
[0:13] <scalability-junk> ok
[0:13] <lurbs> Likelihood is high that they won't.
[0:13] <sstan> ^ true .. journal on /dev/shm implies BTRFS
[0:15] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[0:15] <scalability-junk> mh I will probably go with journal not on ram first and then see how it goes
[0:15] * Jasson (~Adium@bowser.gs.washington.edu) has left #ceph
[0:16] <sstan> what's your OS, scalability-junk ?
[0:16] <lurbs> Personally I'd just put it there for comparative purposes, to see how it differs from running the journal on SSD or whatever.
[0:16] <scalability-junk> sstan: was thinking about ubuntu.
[0:17] <scalability-junk> wanted to start with 1-2 servers running ubuntu, diskencryption, openstack and rbd with each machine 2 disks... but not sure how it will turn out
[0:17] * vata (~vata@2607:fad8:4:6:5473:1c84:4a88:dd00) Quit (Quit: Leaving.)
[0:18] * KevinPerks1 (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[0:18] <lurbs> scalability-junk: http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/ <-- I found that useful, for OpenStack and Ceph/RBD.
[0:19] <phantomcircuit> sstan, you can have the journal on /dev/shm with btrfs and not end up with a corrupted filestore? wat
[0:19] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Read error: Connection reset by peer)
[0:19] <sstan> I started doing that 2 days ago, I didn't try to corrupt anything
[0:21] <scalability-junk> lurbs: thanks sounds interesting
[0:23] <lurbs> phantomcircuit: I believe that if Ceph detects btrfs it can (will?) operate the journal in a parallel mode, where it can guarantee that the btrfs OSD is in a consistent state.
[0:24] <lurbs> Disclaimer: I use XFS, and SSD journals. There be dragons.
[0:24] <sstan> losing the journal on xfs/ext4 kills the osd ..
[0:24] <scalability-junk> lurbs: network between nodes? 1gbit or more?
[0:24] <lurbs> And I'm not sure what steps are required to get the btrfs OSD up to date and rebuild a journal if you lose one.
[0:26] <sstan> ceph-osd -i <osd_id> --mkjournal
[0:26] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[0:26] <lurbs> scalability-junk: 10 Gb currently. My test setup has Ceph and Openstack running on the same boxes, due to lack of hardware. The plan is, at least, to keep 10 Gb between the Ceph nodes and the compute nodes.
[0:27] <phantomcircuit> lurbs, thar be dragons indeed, i run journal on ssd with the filestore on xfs on flashcache in writeback mode
[0:27] <phantomcircuit> nice and fast :)
[0:27] <scalability-junk> mhh yeah I would love that, but I not only have lack of hardware, but also lack of 10Gb networks...
[0:28] <scalability-junk> phantomcircuit: probably able to saturate the network quite good?
[0:28] <lurbs> Heh. I don't even have a 10 Gb switch yet, still investigating vendors. It's a ring, with bridges and spanning tree. :)
[0:28] <phantomcircuit> only on 1gbps so yeah
[0:28] <phantomcircuit> :)
[0:28] <phantomcircuit> easily
[0:29] <sstan> lurbs : i'm trying to do exactly that right now! please tell me more about that ring
[0:29] <lurbs> It's difficult to *not* saturate 1 Gb.
[0:29] <phantomcircuit> mostly the point is to absorb random io on the hdd as much as possible
[0:29] <phantomcircuit> this setup is all vms which has a tendency to result in a huge amount of io to very different parts of the disk
[0:29] <phantomcircuit> so flashcache reordering the writes is very important
[0:29] <lurbs> sstan: http://paste.nothing.net.nz/d5aeba
[0:30] <scalability-junk> phantomcircuit: ok got it
[0:30] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[0:30] <scalability-junk> how much ram and clockspeed per disk/osd?
[0:30] <lightspeed> I'm using a ring too, but routed, rather than bridged
[0:30] <lightspeed> so all links are active
[0:31] <phantomcircuit> scalability-junk, http://ceph.com/docs/master/install/hardware-recommendations/
[0:31] <sstan> lightspeed , lurbs : how many computers in the ring ?
[0:31] <lurbs> It's three machines. Machine 1 eth4 -> machine 2 eth5, machine 2 eth4 -> machine 3 eth5, machine 3 eth4 -> machine 1 eth5.
[0:31] <lightspeed> I only have 3
[0:31] <lightspeed> each one has a dual port infiniband adapter, with a direct connection to each other system
[0:32] <scalability-junk> phantomcircuit: was more asking about your setup ;)
[0:32] <phantomcircuit> lightspeed, that's a ring
[0:32] <phantomcircuit> lightspeed, it just doesn't look like it until you add the 4th node :)
[0:32] <sstan> lurbs, lightspeed : since you're using a spanning tree ... there will always be a link that is NOT used
[0:32] <phantomcircuit> scalability-junk, oh well i have the osd's on the same machines as the qemu instances
[0:33] <scalability-junk> on different disks?
[0:33] <phantomcircuit> so there is a large amount of memory available for page cache
[0:33] <lurbs> sstan: I'm okay with that, for the time being. I'm not saturating the network, yet.
[0:33] <phantomcircuit> qemu is running off of rbd :)
[0:33] <lightspeed> I have Ceph using IPs assigned to loopback adapters, with routing in place so that they know how to reach each other
[0:33] <scalability-junk> phantomcircuit: ah kk
[0:33] <phantomcircuit> all the qemu processes operate at reduced priority to avoid issues with ceph-osd and cpu time
[0:34] <phantomcircuit> (both ionice and nice)
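Running the qemu processes at reduced CPU and I/O priority, as phantomcircuit describes, can be sketched as a small command builder. This is only a minimal sketch and not his actual setup: the niceness of 10, the best-effort I/O class at level 7, the helper name `low_priority_argv`, and the qemu arguments are all assumptions. Only the use of `nice` and `ionice` comes from the conversation.

```python
import shlex


def low_priority_argv(command, niceness=10, io_class=2, io_level=7):
    """Prefix `command` so it runs at reduced CPU and I/O priority.

    The default values are illustrative assumptions, not taken from the log.
    """
    if not -20 <= niceness <= 19:
        raise ValueError("niceness must be between -20 and 19")
    if io_class not in (1, 2, 3):
        raise ValueError("ionice class must be 1 (realtime), 2 (best-effort) or 3 (idle)")
    argv = ["ionice", "-c", str(io_class)]
    # Only the realtime and best-effort classes take a priority level (0-7).
    if io_class in (1, 2):
        if not 0 <= io_level <= 7:
            raise ValueError("ionice level must be between 0 and 7")
        argv += ["-n", str(io_level)]
    argv += ["nice", "-n", str(niceness)]
    return argv + list(command)


if __name__ == "__main__":
    # Hypothetical qemu invocation; binary name and arguments are placeholders.
    print(shlex.join(low_priority_argv(["qemu-system-x86_64", "-name", "vm1"])))
```

Building the argument list rather than a shell string keeps the sketch safe to pass to `subprocess.run` without quoting problems.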
[0:34] <phantomcircuit> but in general it isn't really an issue the hosts aren't even close to capacity
[0:34] <phantomcircuit> maybe like 30%
[0:34] <sstan> lurbs : what is the maximum cpu% observed ? 1 -> 2 -> 3 computer #2 will work really hard to transmit packets ?
[0:35] <scalability-junk> phantomcircuit: cool
[0:35] <scalability-junk> anyone running ceph on top of encrypted disks perhaps?
[0:36] <phantomcircuit> no but i cant imagine doing so will change anything
[0:36] <phantomcircuit> small performance penalty
[0:37] <lurbs> sstan: It's lost in the noise.
[0:37] <phantomcircuit> lurbs, at higher throughput it might not be :P
[0:37] <sstan> so bridging isn't cpu intensive?
[0:38] <phantomcircuit> sstan, generally no
[0:38] <lurbs> At higher throughput I'll have a switch, this ring is just until we decide on switches.
[0:38] <scalability-junk> why would someone link machines in a ring?
[0:38] <phantomcircuit> because you dont have a switch yet
[0:38] <phantomcircuit> heh
[0:38] <scalability-junk> ah to spare the switch first
[0:38] <scalability-junk> ;)
[0:38] <sstan> more reliable also
[0:39] <nwl> ls
[0:39] <scalability-junk> sstan: but could get a bit unscalable later on...
[0:40] <sstan> true, I feel like it's worth it only if you have 2 or maybe three computers
[0:51] <sstan> is it possible to make an OSD listen at TWO ip addresses
[0:53] <lightspeed> the public/cluster network options offer that, but whether that meets your needs depends what you're trying to achieve
[0:53] <lightspeed> which is why I have mine listening on IPs assigned to loopback interfaces
[0:55] <sstan> lightspeed: I was thinking about assigning one IP per NIC for all 6 nics that are on 3 computers
[0:56] <sagewk> joao monitor changes are merged! yay!
[0:56] <joao> woohoo!
[0:56] <joao> :D
[0:57] <gregaf> sagewk: joao: single paxos ones?
[0:57] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:57] <sagewk> none other
[0:57] <joao> gregaf, yeah
[0:57] <gregaf> oooooo
[0:57] <sjustlaptop> :)
[0:57] <sagewk> it's exciting if only because we can remove all these branches from ceph.git :)
[0:57] <sjustlaptop> joao: grats!
[0:57] <joao> sagewk, yeah
[0:58] <dmick> pruning time!
[0:58] <joao> sjustlaptop, ty, both gregaf and sagewk also poured a lot of reviewing into it :)
[0:58] <gregaf> I didn't realize; I'd heard it was a big rebase effort and thought it would take more time
[0:58] <sagewk> mostly gregaf :)
[0:58] <gregaf> haven't looked at it lately, I hope you tested the rebase :p
[0:58] <joao> I did
[0:58] <joao> a lot
[0:58] <joao> forgot one of the tasks running for a couple of days
[0:59] <joao> slowly eating up my bw
[0:59] <sagewk> joao: next up is to get any of your tests into ceph-qa-suite.git
[0:59] <joao> sagewk, and the task changes into teuthology
[0:59] <scalability-junk> phantomcircuit: so you don't run any local storage for qemu and use rbd for persistent storage or for all vm storage?
[0:59] <scalability-junk> what replication level?
[0:59] <sagewk> yeah, altho your upgrade one sadly won't work with the deb stuff
[0:59] <sjustlaptop> I hope the merge commit included "One paxos to rule them all, one paxos to find them, one paxos to bring them all and in the darkness commit them"
[0:59] <joao> sagewk, that's not an issue
[0:59] <sagewk> sjustlaptop: not too late to force push, altho gregaf will frown at you
[1:00] <sjustlaptop> heh
[1:00] <joao> if we happen to have a test for cluster upgrade it will work just fine; otherwise, I'll look into how to get it done
[1:02] <sagewk> joao: i suspect it's a task called 'upgrade', or a param to task install, that will wait some period of time and then upgrade one (or more) nodes. similar to thrasher.
[1:02] <sagewk> at least, that would be one such test.
[1:02] <sagewk> probably all starts with install.py functions that upgrade_packages(version)
[1:02] <sagewk> and either we do it randomly or via a timer, or some other test explicitly calls it.
[1:02] <sagewk> like your mon test
[1:03] <joao> cool
[1:03] <joao> that will work just fine
[1:03] <joao> the sole purpose would be to silently convert the store
[1:04] <joao> as long as the monitor version changes, it serves the same purpose as those other changes I made
[1:04] <gregaf> sagewk: I think trigger- rather than time-based upgrades
[1:04] <joao> we should have both options if possible
[1:04] <joao> time-based upgrades are neat to test the monitor store conversion
[1:04] <phantomcircuit> scalability-junk, for all vm storage
[1:05] <phantomcircuit> scalability-junk, 2 copies on 2 hosts
[1:06] <phantomcircuit> step chooseleaf firstn 0 type host
[1:06] <scalability-junk> using caching on local machines to not run into performance issues?
[1:08] <phantomcircuit> scalability-junk, rbd cache is setup but a lot of my clients are running bitcoind nodes
[1:08] <phantomcircuit> which tends to send hard flush operations fairly regularly
[1:08] <phantomcircuit> so the journal being on an ssd and flashcache running in writeback mode are enormously important
[1:08] <scalability-junk> network is probably something more than 1Gb I assume
[1:09] <phantomcircuit> nope
[1:09] <phantomcircuit> pretty much none of my clients do a lot of throughput
[1:09] <phantomcircuit> it's almost entirely small io
[1:10] <phantomcircuit> and they're capped in qemu to 200/50 MBps read/write
[1:10] <scalability-junk> ok sounds not too bad
[1:12] <sagewk> yeah we'll eventually want to do both. construct actual proper tests that are explicit, then also do random upgrades across a large set of random jobs.
[1:13] <lightspeed> sstan: I put together a little diagram showing how I have my networking configured, in case you're interested... http://www.lspeed.org/tmp/ceph/ceph-net.png
[1:13] <sstan> thanks, I'll look into that right now
[1:14] <lightspeed> ceph listens on the 192.168 addresses, and the hosts have routes in place to allow them to route to each other over the point-to-point 10.x networks
[1:14] <lightspeed> so all links are active at all times
[1:15] <sstan> that looks better than bridges with simple Spanning Trees
[1:15] <sstan> lightspeed: what does ib stand for ?
[1:15] <lightspeed> infiniband (specifically, IPoIB)
[1:16] <lightspeed> but those could be standard ethernet
[1:18] <sstan> lightspeed : how did you assign a loopback interface to two networks?
[1:18] <sstan> with a bridge?
[1:20] <lightspeed> when you say "assign to two networks", do you mean because they still have the standard address as well?
[1:21] <sstan> ib0 and ib1 have different IPs, but the OSD must listen to both at the same time, at one IP
[1:21] <sstan> I feel like routing is unnecessary
[1:24] <sstan> I'm trying to figure out if there's a way that packets are transmitted on the right network adapter , in function of the destination IP
[1:24] <lightspeed> if so, "ip addr add" handles multiple IPs per interface quite happily
[1:24] <sstan> but at the same time, every pair of adapters has to have the same IP
[1:24] <lightspeed> there's no particular association between the loopback interfaces and the physical interfaces
[1:24] <lightspeed> by default linux happily accepts packets sent towards an IP assigned to any of its interfaces received over any of its other interfaces
[1:24] <lightspeed> so this pretty much "just works" once the routes are in place
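The routed ring that lightspeed describes (a service IP on the loopback interface of each host, point-to-point links between hosts, and a host route to each peer's loopback IP over the direct link) can be sketched as a command generator. This is only a sketch under assumptions: every address, host name, interface name, and the /30 link prefix below are hypothetical, and the real layout is the one in his diagram.

```python
import ipaddress

# Hypothetical address plan; none of these values come from the log.
LOOPBACKS = {"node1": "192.168.0.1", "node2": "192.168.0.2", "node3": "192.168.0.3"}
# One tuple per direct cable: (host_a, iface_a, ip_a, host_b, iface_b, ip_b).
LINKS = [
    ("node1", "ib0", "10.0.12.1", "node2", "ib1", "10.0.12.2"),
    ("node2", "ib0", "10.0.23.1", "node3", "ib1", "10.0.23.2"),
    ("node3", "ib0", "10.0.31.1", "node1", "ib1", "10.0.31.2"),
]


def ring_commands(loopbacks, links):
    """Return {host: [iproute2 commands]} for a fully routed ring."""
    cmds = {host: [] for host in loopbacks}
    for host, addr in loopbacks.items():
        ipaddress.ip_address(addr)  # raises ValueError on a malformed address
        # The service IP lives on the loopback, independent of any physical NIC.
        cmds[host].append(f"ip addr add {addr}/32 dev lo")
    for host_a, if_a, ip_a, host_b, if_b, ip_b in links:
        # Each end gets its link address and a host route to the peer's loopback.
        cmds[host_a].append(f"ip addr add {ip_a}/30 dev {if_a}")
        cmds[host_a].append(f"ip route add {loopbacks[host_b]}/32 via {ip_b} dev {if_a}")
        cmds[host_b].append(f"ip addr add {ip_b}/30 dev {if_b}")
        cmds[host_b].append(f"ip route add {loopbacks[host_a]}/32 via {ip_a} dev {if_b}")
    return cmds


if __name__ == "__main__":
    for host, lines in ring_commands(LOOPBACKS, LINKS).items():
        print(f"# {host}")
        print("\n".join(lines))
```

Because every peer is reached over its own direct link, no link sits idle the way a blocked port does under spanning tree.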
[1:25] <sstan> is it possible to assign the same IP to two interfaces?
[1:26] <dmick> sstan: your routers will get upset with you, I bet
[1:26] <sstan> dmick : but it's on a ring of 3 computers
[1:27] <sstan> no routing required because it should be one closed network
[1:28] <dmick> s/router/kernel/ then
[1:28] * cocoy (~Adium@ has joined #ceph
[1:28] <dmick> "I have a packet for IPx. Which MAC should I give it? What, there are two?" I suspect that's a problem, but maybe things will round-robin these days?...
[1:29] * LeaChim (~LeaChim@b01bd511.bb.sky.com) Quit (Ping timeout: 480 seconds)
[1:30] <mjevans> So mkcephfs is giving me 'errors while parsing config file!' 'unexpected character while parsing putative key value' I've checked every setting and thing I can think of (and did get ntp started on the hosts during that time). What am I doing wrong? http://pastebin.com/WZFGy8bU Oh, version is 0.56.3
[1:30] * jlogan1 (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[1:31] <mjevans> I also tried the config file with only one mon commented out (the local one) with similar errors.
[1:31] <mjevans> Ah drat I think I just noticed the problem
[1:31] <mjevans> Or at least another one... I forgot to alter the cluster address mask
[1:32] <sstan> dmick: from an external view angle, a computer asks, who has ip .. that request is answered with one only mac address, in the context of the ring topology I'm imagining
[1:32] <mjevans> Though... that didn't fix it.
[1:32] <dmick> mjevans: it does explain which character it thinks is the problem, so you can focus
[1:32] <mjevans> dmick: in which file?
[1:33] <dmick> from your pastebin:
[1:33] <mjevans> In the ceph.conf file or in one it generates?
[1:33] <dmick> 2013-02-21 16:21:30.554530 7fc846693780 -1 unexpected character while parsing putative key value, at char 41, line 51
[1:33] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[1:33] <mjevans> Yeah, it's balking at an IPv6 address
[1:33] <dmick> I would assume that's ceph.conf, but I dunno for sure; -v can help understand which command is saying it
[1:34] <mjevans> It matches up too well, but I didn't have a good example to provide me the formatting it desires
[1:34] <mjevans> Line 51 in the file with comments == line 28 in my paste
[1:36] <dmick> so I'm just guessing that's the /
[1:37] <mjevans> How else would you specify a network?
[1:37] <mjevans> It actually seems to be erroring out for each and every address. Is there an example IPv6 file?
[1:37] <dmick> not saying it's wrong, now investigating the theory
[1:37] <dmick> I'm not very hip to v6 addrs; it surprised me to see trailing ::
[1:37] <dmick> but it seems to be legal
[1:37] <mjevans> dmick: that is, it means 'expand this to all 0s'
[1:38] <dmick> yeah, I know it does embedded, but I was surprised to see it at the end
[1:38] <mjevans> You're allowed to use that once per address
[1:39] <mjevans> Yeah, I manually assigned addresses to each server... I could assign more and end it with ::1 if that's an issue, but ssh/ping6/etc work so it should be valid since it's not the subnet ::0
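The two `::` rules mjevans mentions (it expands to a run of zero groups, and it may appear only once per address) can be checked with Python's standard `ipaddress` module. This is a small illustration using the address prefix from the conversation; the helper names `expand` and `is_valid_ipv6` are made up for the example.

```python
import ipaddress


def expand(addr):
    """Return the fully written-out form of an IPv6 address."""
    return ipaddress.IPv6Address(addr).exploded


def is_valid_ipv6(addr):
    """True if `addr` parses as an IPv6 address, False otherwise."""
    try:
        ipaddress.IPv6Address(addr)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # A trailing '::' is legal and fills the remaining groups with zeros.
    print(expand("fd00:89ab:cdef:0200::"))
    # '::' may be used only once, so these two are rejected.
    print(is_valid_ipv6("fd00::1::2"))
    print(is_valid_ipv6("fd00::::"))
```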
[1:39] <dmick> not saying it's invalid; just that it threw me momentarily. it's probably not connected.
[1:40] <mjevans> It actually starts to error at ::, is ceph expecting fully expanded addresses? It should not need that
[1:40] <dmick> I see that it's complaining about # or ;
[1:40] <dmick> I bet one of those colons is really a semi?...
[1:41] <lightspeed> sstan: it might be possible to avoid specifying routes, but I specifically wanted to use routing as it adds flexibility (ie the routing determines the traffic flow, and you could have multiple alternate less preferable routes, such as a fallback via the "main" (non-ring) network)
[1:41] <dmick> mj: nope
[1:41] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:44] <dmick> I've cut your file and reproduced it; it's ceph-conf complaining
[1:44] <dmick> seems to be complaining about the end of the line
[1:44] <dmick> I wonder if you quote the addrs...
[1:45] <mjevans> dmick: I have a tab in front, do those not parse well?
[1:45] <mjevans> inside [] ? ok
[1:45] <dmick> no, I meant
[1:45] <dmick> public network "fd00:89ab:cdef:0200::/49"
[1:45] <dmick> trying
[1:45] <dmick> nope
[1:46] <mjevans> Yeah I also am trying [fd00::::]/49 like that
[1:46] <mjevans> the [] is how you give an ipv6 address and port in a uri
[1:46] <dmick> ok lemme actually read this parser
[1:46] <dmick> (and yeh)
[1:47] <mjevans> Maybe it's tab sensitive? I'll replace them too
[1:47] <dmick> tab should be fine
[1:47] <sstan> brb ttyl
[1:48] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Remote host closed the connection)
[1:49] <dmick> sigh I'm an idiot
[1:49] <dmick> you're missing =
[1:49] <mjevans> The page didn't have that
[1:50] <dmick> yeah, that's better
[1:50] <mjevans> http://ceph.com/docs/master/rados/configuration/ceph-conf/#networks lacks it
[1:51] <dmick> yep, but
[1:51] <dmick> *bug
[1:51] <mjevans> Yeah, that explains the mds doesn't it?
[1:51] <mjevans> rather the mons
[1:51] <mjevans> /they/ had it because I followed what was already there for addresses
[1:51] <dmick> kind of a crappy error message too
[1:51] <mjevans> Really though, it should have errored at the missing =
[1:52] <dmick> it was trying to accumulate all the text as the var name
[1:52] <mjevans> Like 'error near ^ in "context pre ^ keyword I don't know"'
[1:52] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[1:53] <dmick> it's treating # ; and \0 as all the same, as in "why the hell is this in a key name"
[1:53] <dmick> when really \0 should be handled differently. eh. If you wanna file a bug please do
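The root cause found above, a setting line with no `=`, is easy to catch before running mkcephfs. The following is a deliberately simplified sketch and not Ceph's parser: it assumes one setting per line, treats `#` and `;` as comment markers, and ignores line continuations. The function name and the sample section contents are made up for illustration; only the two network values come from the conversation.

```python
def find_missing_equals(text):
    """Return (line_number, stripped_line) for setting lines that lack '='."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw
        # Drop anything after a comment marker.
        for marker in "#;":
            line = line.split(marker, 1)[0]
        line = line.strip()
        # Skip blank lines and [section] headers.
        if not line or (line.startswith("[") and line.endswith("]")):
            continue
        if "=" not in line:
            problems.append((lineno, raw.strip()))
    return problems


SAMPLE = """\
[global]
\tpublic network fd00:89ab:cdef:0200::/49
\tcluster network = fd00:89ab:cdef:8200::/49
; hypothetical monitor section below
[mon.a]
\thost = h1
"""

if __name__ == "__main__":
    for lineno, line in find_missing_equals(SAMPLE):
        print(f"line {lineno}: missing '=' in: {line}")
```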
[1:54] <mjevans> All RIGHT down to just one small error
[1:54] * diegows (~diegows@ has joined #ceph
[1:54] <mjevans> It's almost time for me to go home and I'd like to get this running before I do; what URL can I leave up for when I get in tomorrow?
[1:56] <dmick> leave..up?..eh?
[1:59] <mjevans> Where is your bugtracker?
[1:59] <dmick> ceph.com/tracker
[1:59] <dmick> fixing the missing = in the doc right now
[1:59] <mjevans> logical, but I have no idea how I'd have found that from the website
[2:00] <mjevans> Thanks, I'll file a bug about the parser tomorrow then
[2:00] <dmick> Resources/Development
[2:00] <mjevans> Or maybe tonight if I get this done and still need to wait out the commute
[2:01] <mjevans> Do I need to do anything special to clean up after a failed mkcephfs? that other error is: provided osd id 0 != superblock's -1
[2:03] <dmick> possibly. it's complaining about the daemon's directory having garbagey things in it; if you can wipe the daemon data dirs that should fix it
[2:04] <mjevans> is that the /var/lib/ceph area?
[2:04] <dmick> most likely
[2:04] <dmick> unless you've set them somewhere else
[2:05] <mjevans> There are some existing directories I made since the directions indicated that; the only stale files were under /data on the hosts where the OSDs and journals should be stored... I've removed the files but not the folders.
[2:06] <dmick> 'osd data' and 'mon data' is what I'm talking about
[2:06] <dmick> (and I suppose 'mds data' if you're using mds's)
[2:06] <mjevans> Now it's... trying to mount the data store as if it should have existed before. I think I might need to do something else
[2:06] <mjevans> I have mds in the config but only plan to use this as an RBD storage backend for libvirt usage.
[2:07] <elder> dmick, did you really insert that other "udevadm settle" before the unmap command in the rbd CLI?
[2:07] <dmick> no, Sage did, hence my latest update
[2:07] <dmick> mjevans: http://ceph.com/docs/master/rados/configuration/ceph-conf/#networks better now
[2:08] <elder> Oh. I see, it was assigned to you...
[2:08] <elder> I get it
[2:08] <mjevans> yay, less confusion
[2:08] <jmlowe> hmm, my cluster seems sick
[2:09] <mjevans> Ah I think I know what's wrong but not how to fix it... I don't want ceph managing the filesystem mounts at all.
[2:11] <mjevans> I guess I just comment out more stuff
[2:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[2:13] <dmick> it shouldn't be doing that unless you're supplying --mkfs, I would think
[2:15] * davidz1 (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Ping timeout: 480 seconds)
[2:15] <mjevans> yeah that wasn't it; mkcephfs just really sucks at blowing away prior failed attempts
[2:16] <mjevans> It does anything, and then even though I'm using the 'yes new data' setting it refuses to clean up its own past messes and start fresh
[2:16] <jmlowe> any dev's on?
[2:17] <mjevans> jmlowe: no, these are all filesystems in my /etc/fstab, partly because I'm forcing all the journals to live on an SSD that will otherwise be mostly reads
[2:18] <jmlowe> sorry, I meant any inktank guys around?
[2:18] <jmlowe> any suggestions for this http://pastebin.com/DP32hWze
[2:18] <jmlowe> ?
[2:19] <mjevans> http://ceph.com/docs/master/rados/operations/monitoring/#checking-cluster-health
[2:19] * gregorg (~Greg@ Quit (Ping timeout: 480 seconds)
[2:19] <mjevans> 1) are your cluster nodes running time synchronization and is it working? 2) are you using either of the versions in the topic?
[2:20] <mjevans> that reminds me... hwclock -w done
[2:20] <jmlowe> I'm running 0.56.3
[2:22] <jmlowe> 1 yes and yes
[2:23] <mjevans> Then I can't really give you any more help than the above, http://ceph.com/docs/master/rados/configuration/ceph-conf/#logs-debugging and that to tell you that the last time I was in here (a week or so ago) a dev gave me instructions for modifying debug values on the fly and that this room is logged.
[2:24] <mjevans> You now know everything that I do which I think might be useful. Good luck and hopefully someone else can help further if poking at health status and info doesn't work out.
[2:24] <jmlowe> mjevans: thanks
[2:29] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[2:30] <mjevans> Ok, found another problem. I have a setup which presently (limited hardware) requires the third mon only be on the public network; does this mean I just give its private network bogus info like ::1 ?
[2:30] * alram (~alram@ Quit (Quit: leaving)
[2:31] <dmick> ah, you'd like to have the global settings, but like to explicitly disable a cluster address for one host?
[2:32] <dmick> I think you could just set public and cluster addrs the same for that host
[2:32] <mjevans> dmick: actually I want the OSDs to exist on public and private networks, and just have a mon that exists on the public to bring me up to 3 mons.
[2:32] <dmick> (and btw, if it's not obvious, there's no reason to have the global settings if you're assigning explicit addresses to everyone, so, that would be simpler)
[2:32] <mjevans> That... was not obvious
[2:33] <mjevans> I wonder if it'll fix the error message
[2:33] <dmick> yeah, that's also really not very clear in the doc
[2:33] <mjevans> -1 unable to find any IP address in networks: fd00:89ab:cdef:8200::/49
[2:34] <mjevans> That's the mon complaining it can't talk to the private stuff
[2:34] <dmick> [global] settings allow you to not configure individual addresses (i.e. it looks at all the configured NICs and chooses one on the proper net)
[2:34] <dmick> but if you set addrs on the individual daemon nodes, it doesn't do that search
[2:35] <mjevans> A lot of duplicated lines, even with using the other sections... BUT mkcephfs completed! at last x.x
[2:36] <dmick> duplicated lines?...
[2:36] <mjevans> Tomorrow, or later tonight if I really feel like it, I add it to systemd and rc-update as applicable.
[2:36] <mjevans> dmick: I really hate having one config option everywhere, a macro or variable system would remove a lot of risk of error.
[2:36] <dmick> which line are you talking about?
[2:37] <mjevans> dmick: Things like arbitrary addresses or netmasks... IE moving that out of global caused it to duplicate about 10 times.
[2:37] <mjevans> or I should say, to a count of 10, ( not 2^10)
[2:38] <dmick> but...either you have a global subnet address, and daemons choose appropriately, or you have a configured address for a daemon
[2:38] <dmick> if most of the daemons can "use what NIC is appropriate", you only need override on the one that's a special case
[2:38] <mjevans> dmick: Yeah, however when I say 'subnet' is this and then the server can't reach that subnet, it should say 'I am not on this subnet, therefore it is a bad choice'
[2:39] <mjevans> Not 'can't reach subnet, error, die'
[2:39] <dmick> ok, but that doesn't duplicate lines, and moreover
[2:39] <dmick> what should the daemon do in that case?
[2:40] <mjevans> dmick: It shouldn't expect that because I tell it subnets exist that it must be part of them.
[2:40] <dmick> but....this is the cluster configuration. You're telling it the cluster is made up of hosts on these nets
[2:40] <mjevans> The duplication comes from moving the public/cluster subnet definitions to the places it should be (everywhere except one mon)
[2:41] <mjevans> Right, and if a mon is only on a public net that is OK, it must simply use that net for communications
[2:41] <mjevans> That mon has no network redundancy to this cluster
[2:41] <dmick> still confused. can you not set global public = P, cluster = C
[2:41] <dmick> and then all hosts use P and C as appropriate
[2:41] <dmick> except the exception
[2:41] <dmick> for which you specify public addr = P, cluster addr = P?
[2:41] <mjevans> dmick: it errored out with the message above during mkcephfs: 20130221-17:33:43 < mjevans> -1 unable to find any IP address in networks: fd00:89ab:cdef:8200::/49
[2:42] <dmick> yes, because at that point it was searching for the nets you told it it should be on
[2:42] <dmick> but it won't do that if you specify the per-daemon addrs
[2:42] <mjevans> dmick: right and as long as it's on either one it should be ok
[2:42] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[2:43] <mjevans> Oh, you're saying make an address for each of the mons like I did for the OSDs except not the cluster on that mon
[2:43] <dmick> not seeing that. if you've said there are two nets, and not specified otherwise, it's trying to segregate traffic; if it can't, because it's not really on that net, then your wishes are being ignored; I'd rather have an error in that case
[2:44] <dmick> maybe I should look at your current conf file
[2:44] <dmick> I thought this one mon was the only exception?
[2:44] <mjevans> dmick: My wishes are to establish a priority for sending traffic over a various path first, then another, then maybe another (OK I don't have this but others very well can), and not failing if a given path isn't in the list as long as A valid path is in the list.
[2:45] <dmick> yeah. that's not what this scheme is for.
[2:46] <mjevans> dmick: http://pastebin.com/DP32hWze But with address fixes as discovered earlier. Use case is two main servers with a crossover cable between, and a gigabit switch shared by a small business on the 'public' side. The third mon runs on a different system which will probably be replaced or moved at some point in the future and is only there for when maintenance or some failure takes out one of the main servers.
[2:47] <dmick> that pastebin looks like logs of slow requests
[2:47] <mjevans> Oops lastlog fails me
[2:47] <mjevans> time for grep
[2:48] <mjevans> http://pastebin.com/WZFGy8bU dmick
[2:48] <jmlowe> well, I fixed my problem by restarting some of my osd's
[2:49] <jmlowe> seems like I was able to overwhelm my cluster by constantly writing to cephfs
[2:49] <mjevans> jmlowe: sounds like your use case needs more network bandwidth between servers.
[2:50] <jmlowe> 10GigE isn't enough?
[2:50] <jmlowe> actually I think my disks were too slow and I need more osd's
[2:50] <mjevans> That could be the case too
[2:50] <mjevans> Or you need a dedicated journal SSD
[2:50] <mjevans> Have you done any tuning yet following that guide?
[2:51] <dmick> mjevans: why are you setting individual addresses for every daemon?
[2:51] <mjevans> dmick: The example had it
[2:51] <dmick> (and how is that working without =?)
[2:51] <mjevans> dmick: it isn't, that too was a bug in the docs that was found about an hour ago
[2:51] <dmick> mjevans: yes, but we've discussed here how the example isn't correct
[2:51] * scuttlemonkey changes topic to 'v0.56.3 has been released -- http://goo.gl/f3k3U || argonaut v0.48.3 released -- http://goo.gl/80aGP || Deploying Ceph with Juju http://t.co/TspsYBeTej'
[2:52] <dmick> mjevans: clearly I know about the = bug, since I just fixed it...
[2:52] <mjevans> dmick: Right, so this is my old config not quite updated but still close enough to talk about what it looks like
[2:52] <dmick> ok. so do you understand why you don't need addr lines in each daemon section?
[2:53] <mjevans> dmick: the host info should have keyed it, though the example had it so I added it
[2:53] <mjevans> That isn't what I was complaining about though
[2:53] <mjevans> It's moving the network masks from global to the mds and osd sections and then to the two mons that need it.
[2:54] <dmick> why would you do that?
[2:56] <dmick> (and are the global network addresses supposed to be the same?)
[2:56] <mjevans> dmick: because mkcephfs complained about one of the mons not being on the cluster network, even though it could still reach all of the other mons and osds via the public network.
[2:56] <mjevans> dmick: they're not, that's a third thing repaired (literally 5 seconds after I posted the link to the pastebin file)
[2:56] <dmick> ok.
[2:57] <mjevans> It just happened like 2 hours ago now so I forgot that too was wrong.
[2:57] <dmick> I've recommended setting the nets in the global section and not setting them in the per-daemon section for everything that's got NICs on both nets
[2:57] <dmick> and setting addrs only for the exception monitor daemon
[2:57] <dmick> so far I don't see why that won't work.
[2:58] <mjevans> Oh that should work?
[2:58] <dmick> but I'm now lost in what you do and don't know and what your current thinking is. The reason for asking for your current ceph.conf was to get a handle on what your current thinking is, but since it's N hours out of date, it's confusing me greatly for that task
[2:58] <mjevans> I guess I can test that... I don't have anything IN the storage yet
[2:58] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[2:59] <dmick> so maybe you're unclear on how this works, since the doc doesn't help: if you configure a global public net, and all hosts have NICs on that net, nothing else need be done.
[2:59] <dmick> if you configure public and cluster nets, and all hosts have NICs on both, same.
[2:59] <dmick> in both those cases the daemons will find the right NIC and use the addr it's configured with.
[2:59] <mjevans> ok, so step by step setting this up how you want, for mons on both nets, specify two sets of mon addr = [fd00:6959:d45d:0210::]:6789 ?
[3:00] <dmick> If you don't want the "auto-pick-from-the-network" behavior, then and only then do you need to set individual daemon addresses.
[3:00] <dmick> if a daemon has a specific address setting, no NIC search is performed.
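(Editor's sketch) The address selection dmick describes above can be illustrated roughly as follows. This is a minimal Python sketch under stated assumptions, not Ceph's actual implementation: the function name `pick_address`, the subnets and the host addresses are all hypothetical and chosen only for illustration.

```python
import ipaddress

def pick_address(network, local_addrs, explicit_addr=None):
    """Return the address a daemon would use on `network` (illustrative only).

    If an explicit per-daemon address is given, no NIC search is performed.
    Otherwise return the first local address that falls inside the subnet,
    or None when the host has no NIC on that network.
    """
    if explicit_addr is not None:
        return ipaddress.ip_address(explicit_addr)
    net = ipaddress.ip_network(network)
    for addr in local_addrs:
        ip = ipaddress.ip_address(addr)
        if ip in net:
            return ip
    return None

# Hypothetical subnets and hosts, not taken from the pastebin configs.
PUBLIC_NET = "fd00:1::/64"
CLUSTER_NET = "fd00:2::/64"
both_nets_host = ["fd00:1::10", "fd00:2::10"]
public_only_host = ["fd00:1::11"]

print(pick_address(PUBLIC_NET, both_nets_host))     # NIC found on the public net
print(pick_address(CLUSTER_NET, both_nets_host))    # NIC found on the cluster net
print(pick_address(CLUSTER_NET, public_only_host))  # None: no NIC on the cluster net
```

The last call is the situation of the public-only monitor host discussed in this log: a search on the cluster network finds nothing, so anything that insists on a cluster address for that daemon will fail unless an address is given explicitly.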
[3:01] <mjevans> dmick: fails with: /sbin/mkcephfs: monitor mon.h2 has no address defined.
[3:02] <mjevans> So mon daemons, 2 with public and cluster nics and 1 with a public nic only.
[3:05] <mjevans> It really doesn't help that there isn't a mon addr section either: http://ceph.com/docs/master/rados/configuration/mon-config-ref/
[3:07] <dmick> it's not a mon option; it's a {public|cluster} addr option
[3:07] <dmick> and I might have some theories for what's failing if I knew what your config was
[3:09] * jmlowe1 (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[3:10] <mjevans> dmick: http://pastebin.com/FpRq4p90
[3:12] <mjevans> dmick: mon.vh1 does not exist on the fd00:::8290::/49 network, only h-2 and h-3 do.
[3:13] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[3:15] <mjevans> dmick: If I make the config look like this, mkcephfs completes: http://pastebin.com/i3VxGAC0
[3:16] * xmltok (~xmltok@pool101.bizrate.com) Quit (Ping timeout: 480 seconds)
[3:17] <mjevans> afk
[3:18] * darkfader (~floh@ Quit (Read error: Connection reset by peer)
[3:19] <dmick> looking at the monitor source, it looks as though it only ever tries to be on the public net
[3:21] * rturk is now known as rturk-away
[3:22] * darkfader (~floh@ has joined #ceph
[3:23] <mjevans> dmick: then why would mkcephfs care if it can reach the cluster net directly at all?
[3:23] <dmick> however because mkcephfs wants to create the monmap, it demands that "mon addr" be present in the mon sect so that it can create the monmap as an entry point. The daemon itself could fall back to picking a public address, but
[3:23] <dmick> mjevans: do we have any evidence that it does care?
[3:23] <mjevans> dmick: yeah, when I run it it FAILS with that very error message
[3:24] <dmick> have you shown me that error message?
[3:24] <mjevans> dmick: top of paste -2
[3:24] <mjevans> IE the paste before the last
[3:25] <mjevans> Everything, even the other two monitors, are created correctly before it.
[3:28] <mjevans> I guess I'll file a test case config and bug about this tomorrow as well.
[3:28] <dmick> mjevans: it appears as though it does that because there's a generic "pick addresses" that tries to fill in both, even though the monitor only needs one. That would seem to be a bug.
[3:29] <dmick> and critical to your case.
[3:29] <mjevans> dmick: There is a workaround, using the example provided after it (:15 after the hour)
[3:30] <dmick> yep
[3:30] <dmick> semi-global subnet definitions. (Clearly specifying each address individually would also work.)
[3:30] <mjevans> This problem only exists because it's trying to use a binary feature, instead of generalizing the design that the crush map seems to imply.
[3:31] <dmick> don't see the connection to the crush map
[3:32] <dmick> inter-daemon communication is independent of data placement. This is just a somewhat nonstandard networking config that is difficult to express with the 'fewer explicit settings' scheme we have, and complicated by a bug in that scheme
[3:32] <mjevans> The gist of the crush map that I got is that it designs a routing scheme for how to file away incoming blobs of data. You get to design your topology across any segmentation map you desire
[3:33] <dmick> the daemons have to all be able to talk to one another symmetrically regardless of the crush placement in use.
[3:33] <dmick> they can do that on one shared net (the public) or you can segregate some of the high-traffic stuff (the cluster net, for OSD-OSD traffic) but either way the clients have to be able to reach all daemons and the daemons have to all reach each other
[3:34] <mjevans> Right; so the public part /is/ required to exist and everyone be on it; but multiple private parts might exist too.
[3:34] <mjevans> You might have a situation where redundancy requirements mean a mesh topology instead of a star
[3:35] <dmick> private in the sense of "only needs to include every OSD", I suppose.
[3:35] <mjevans> dmick: except that too is a flawed view.
[3:35] <dmick> but those OSDs must all also be present on the truly-globally-public net
[3:36] <mjevans> Maybe I can simplify this another way. Specify a 'public' network mask to indicate which is that. Any and all other routes that do exist can be assumed to be private
[3:36] <dmick> I think you're assuming there is a concept of 'private' because of the name 'public'; there isn't, really
[3:37] <mjevans> dmick: the alternative is to assign metrics to subnet masks manually. If it exists, this is the weight with which I would like you to try using it.
[3:37] <dmick> are we discussing how Ceph behaves, or how you think some other clustering system might be designed?
[3:38] <mjevans> I am discussing now how ceph should behave, which does not seem at a broad view to be different from how the configuration files lead me to think it might behave.
[3:39] <mjevans> If it behaved like that, then the thinking that the 'more preferable routes might exist' would not be reflective in other pieces of code such as the one that has this error.
[3:39] <mjevans> At any rate, this is more of a feature than a bug, and there is that bug
[3:40] <mjevans> Thanks for listening to me ramble about it, I'm long past when I should have stopped looking at this for today.
[3:41] <dmick> ok. if you've got an idea for a proposal, maybe you could flesh it out on the ceph-devel list. I'm not following what you mean. But yeah, this is a good bug to know about.
[3:41] <dmick> I can file the 'mon tries both nets and fails if it can't find cluster, even though it won't use it' bug
[3:42] <dmick> did you file one for the parsing error message from hell?
[3:43] <dmick> (I also filed http://tracker.ceph.com/issues/4227 which I may change now that I've looked at what the mons and mkcephfs do differently for the mons)
[3:47] <dmick> http://tracker.ceph.com/issues/4228 for monitor-fails-if-no-cluster-NIC
[3:47] <dmick> I'll file the parsing error in case you're already offline
[3:50] <dmick> http://tracker.ceph.com/issues/4229 for parsing error
[3:58] * madkiss1 (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[3:58] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Read error: Connection reset by peer)
[4:17] * KevinPerks1 (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[4:17] <cocoy> Psi-Jack: hi mate, http://tracker.ceph.com/issues/3640. what is workaround if any?
[4:17] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[4:36] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[4:37] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[4:41] * nick5 (~nick@ Quit (Remote host closed the connection)
[4:41] * nick5 (~nick@ has joined #ceph
[4:43] * xmltok (~xmltok@pool101.bizrate.com) Quit (Ping timeout: 480 seconds)
[4:47] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[4:54] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:59] * KevinPerks1 (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[5:00] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Read error: Connection reset by peer)
[5:28] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[5:28] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[5:29] * yehuda_hm (~yehuda@2602:306:330b:a40:55ce:5c6c:369a:5843) Quit (Ping timeout: 480 seconds)
[5:33] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[5:37] * yehuda_hm (~yehuda@2602:306:330b:a40:157b:3371:b971:2937) has joined #ceph
[5:41] * xmltok (~xmltok@pool101.bizrate.com) Quit (Ping timeout: 480 seconds)
[5:41] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:42] <mjevans> dmick: I was, I'd intended to file it when I got back in to work tomorrow. I also thought about the use cases more while offline; in most cases the present setup probably does work.
[5:56] * jlogan1 (~Thunderbi@2600:c00:3010:1:a9a1:e01e:fafb:f712) has joined #ceph
[5:58] <dmick> mjevans: cool
[6:08] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[6:12] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[6:14] <phantomcircuit> 800 MB blktrace files for a bonnie++ run
[6:14] <phantomcircuit> that seems... excessive
[6:22] <dmick> you want info, you got it
[6:23] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:26] * KevinPerks1 (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[6:28] * klnlnll (~DW-10297@dhcp92.cmh.ee.net) has joined #ceph
[6:30] <phantomcircuit> dmick, neat
[6:30] <phantomcircuit> flashcache ssd/hdd
[6:30] <phantomcircuit> really shows how effective it is
[6:31] <phantomcircuit> random writes all over the place on the cache device get turned into nice sequential writes
[6:32] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[6:32] <dmick> yet another in a long series of things I'd love to have time to experiment with
[6:33] * Teduardo (~DW-10297@dhcp92.cmh.ee.net) Quit (Ping timeout: 480 seconds)
[7:03] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[7:03] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[7:11] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[7:11] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[7:13] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[7:15] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[7:17] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[7:21] * xmltok (~xmltok@pool101.bizrate.com) Quit (Ping timeout: 480 seconds)
[7:32] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[8:05] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:11] * jlogan1 (~Thunderbi@2600:c00:3010:1:a9a1:e01e:fafb:f712) Quit (Ping timeout: 480 seconds)
[8:11] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:16] * low (~low@ has joined #ceph
[8:18] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) has joined #ceph
[8:31] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[8:32] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[8:45] * lightspeed (~lightspee@ Quit (Ping timeout: 480 seconds)
[8:46] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[9:00] * eschnou (~eschnou@ has joined #ceph
[9:05] * gerard_dethier (~Thunderbi@ has joined #ceph
[9:12] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Download IceChat at www.icechat.net)
[9:27] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:37] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[9:40] * gaveen (~gaveen@ has joined #ceph
[9:46] * l0nk (~alex@ has joined #ceph
[9:46] * low (~low@ Quit (Quit: Leaving)
[9:48] * low (~low@ has joined #ceph
[9:50] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:51] * LeaChim (~LeaChim@b01bd511.bb.sky.com) has joined #ceph
[9:55] * trond (~trond@trh.betradar.com) Quit (Quit: leaving)
[9:56] * The_Bishop (~bishop@e179015198.adsl.alicedsl.de) has joined #ceph
[10:01] * esammy (~esamuels@host-2-102-68-175.as13285.net) has joined #ceph
[10:02] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) has joined #ceph
[10:04] * ScOut3R (~ScOut3R@ has joined #ceph
[10:06] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:08] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[10:11] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[10:13] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[10:17] <madkiss1> wido: are you about by chance? :-)
[10:17] * madkiss1 is now known as madkiss
[10:17] <wido> madkiss: yes
[10:17] <wido> in conversation at office, one moment
[10:17] <madkiss> wido: roger
[10:22] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) has joined #ceph
[10:25] * BManojlovic (~steki@ has joined #ceph
[10:26] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[10:50] * benr (~benr@puma-mxisp.mxtelecom.com) has left #ceph
[11:00] <wido> madkiss: I'm here
[11:01] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[11:03] <madkiss> wido: I am testing libvirt/rbd right now and I would like to run different VMs from within libvirt with different rbd caching parameters; AIUI, I could either pass the RBD caching options to qemu directly from within the "name" field or I could create separate clients for this in Ceph with different global rbd cache options and have the VMs started by these different users. Is that right, or am I missing something?
[11:07] * Lennie`away is now known as leen
[11:07] * cocoy (~Adium@ Quit (Ping timeout: 480 seconds)
[11:07] <wido> madkiss: You have to switch on the rbd_cache option in the name field
[11:07] <wido> you could also set the 'cluster' config option so each client would read a different .conf file in /etc
[11:07] <madkiss> ah. The different-users-scenario won't work then I guess?
[11:10] <leen> hi folks, is there a known bug where osd
[11:11] <leen> where osd's get wrongly marked down ?
[11:12] <leen> because it does not seem like I'm CPU bound, network bound or disk I/O bound, in that case they shouldn't be marked down
[11:13] <fghaas> osds would get marked down (not out) when the osd dies, or when it fails to respond to other osds
[11:13] <fghaas> then after it's down for the mon osd down out interval, it would also be marked out
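(Editor's sketch) The down-versus-out distinction fghaas describes can be modelled roughly like this. It is a simplified Python sketch, not Ceph code: the function `osd_state` is hypothetical, the 300-second interval is an assumed example value for the "mon osd down out interval" setting, and the real monitors also apply heartbeat grace periods and failure reports that are ignored here.

```python
def osd_state(seconds_unresponsive, down_out_interval):
    """Classify an OSD by how long it has been unresponsive (simplified).

    'down' means the OSD is not responding to its peers; 'out' means it has
    stayed down for at least the down-out interval, so its data is re-placed
    onto other OSDs.
    """
    if seconds_unresponsive <= 0:
        return "up+in"
    if seconds_unresponsive < down_out_interval:
        return "down+in"
    return "down+out"

DOWN_OUT_INTERVAL = 300  # assumed example value, in seconds

for elapsed in (0, 60, 600):
    print(elapsed, osd_state(elapsed, DOWN_OUT_INTERVAL))
```

An OSD that briefly stops responding therefore stays "in" and no data is moved; only a longer outage triggers re-placement.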
[11:15] <leen> yes, that is what I would expect. But I'm surprised as to why they would fail to respond
[11:15] <leen> if it's not CPU, memory, I/O or network
[11:16] <leen> that is why I mentioned: are there any known open or recent bugs ?
[11:18] <leen> maybe I should mention what I did, I added a large number of osd's, but I had forgotten to set the weight. And then set the weight on many of them at once
[11:18] <leen> that was probably a bad idea :-)
[11:19] <leen> anyway, I set the weight to 0 again instead to let everything get mostly to normal again.
[11:19] <leen> when nothing is degraded, I'll slowly add them
[11:19] <wido> madkiss: No, it's not a per user setting
[11:19] <wido> so you would have to set it per VM in the disk definition in libvirt
[11:20] * cocoy (~Adium@ has joined #ceph
[11:20] <madkiss> wido: thanks a lot
[11:21] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[11:25] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit ()
[11:41] * loicd1 (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[11:41] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Read error: Connection reset by peer)
[12:29] <absynth> madkiss: you don't want rbd_cache, i think.
[12:29] <absynth> breaks under load
[12:34] <absynth> errm
[12:34] <absynth> joao: there?
[12:35] <absynth> http://www.zendesk.com/blog/weve-been-hacked
[12:35] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:39] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[12:46] * leseb (~leseb@mx00.stone-it.com) Quit (Ping timeout: 480 seconds)
[12:51] <joao> holy crap
[12:51] <joao> absynth, I'm here but am going to walk the dog for the next 20 minutes or so
[12:52] <joao> is there anything I can help you with?
[12:52] <joao> or was the zendesk everything?
[12:52] <joao> *zendesk thing
[12:52] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[12:59] * loicd1 (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[12:59] <absynth> nah, thats everything
[13:00] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[13:09] <joao> so the good news is that only three zendesk customers were affected; the bad news is that three zendesk customers were affected
[13:15] <leen> joao: the other bad news is, this is what they _think_ happened, you can't be 100% sure
[13:17] <leen> anyway, have a good day
[13:17] * leen is now known as Lennie`away
[13:18] * leseb_ (~leseb@mx00.stone-it.com) has joined #ceph
[13:18] * leseb (~leseb@mx00.stone-it.com) Quit (Read error: Connection reset by peer)
[13:28] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:33] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[13:34] * jks (~jks@3e6b5724.rev.stofanet.dk) Quit (Ping timeout: 480 seconds)
[13:37] * diegows (~diegows@ has joined #ceph
[13:51] * jks (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[14:00] * leseb_ (~leseb@mx00.stone-it.com) Quit (Read error: Connection reset by peer)
[14:01] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[14:06] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[14:07] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[14:08] <jackhill> Hi, I need to setup a small amount of storage (~70 TiB raw) that will be consumed by a dozen hosts. I'm trying to decide between having those dozen hosts run a ceph cluster for themselves, or putting all the storage on a dedicated storage host.
[14:09] <jackhill> Is this type of small cluster a good fit with ceph?
[14:09] <fghaas> jackhill: sure, however how exactly are your consumers going to use ceph, primarily?
[14:09] <jackhill> cephfs
[14:10] <fghaas> then you'll have to have a dedicated storage cluster
[14:10] <fghaas> mounting cephfs on devices that are themselves ceph osds == not good
[14:10] <nhm> jackhill: we aren't recommending cephfs for production use yet
[14:10] * ScOut3R (~ScOut3R@ has joined #ceph
[14:11] <fghaas> nhm: bah, as of 3.9 it's no longer CONFIG_EXPERIMENTAL ;)
[14:11] <jackhill> okay, thanks.
[14:12] <jackhill> fghaas: can you expand on why mounting cephfs from ceph osds is not good? Would the answer be different if I was using rbd or librados?
[14:12] <fghaas> yes it would
[14:13] <fghaas> mounting cephfs from something that is itself a ceph osd creates a significant risk of deadlock in the writeout path under memory pressure
[14:13] <fghaas> (this is no different from pretty much any other kernel network fs)
[14:14] <fghaas> you don't have any of those issues for anything that doesn't use ceph from in-kernel
[14:15] <jackhill> ah. (not that I'm going to do this, but) would ceph-fuse be different?
[14:15] <fghaas> hence, you're fine with librados, with rbd it depends whether you're using kernel rbd (bad), qemu/rbd (good), python librbd (good), or fuse-rbd (good, though highly experimental)
[14:16] <fghaas> I presume ceph-fuse _would_ be different, yes, but the same "not recommended for production" caveat applies to it as much as to kernel cephfs
[14:16] <fghaas> (and in case it wasn't obvious, the CONFIG_EXPERIMENTAL reference above was tongue in cheek, however true it may technically be)
[14:18] <fghaas> wido: I saw a tweet of yours fly by about the storage arch rewrite in cloudstack being merged; does that mean better rbd integration _is_ coming to 4.1 after all, or is this still expected for 4.2?
[14:18] <jackhill> right. I have been playing with it for the last couple days and it is really cool. Someday I will live in the future :)
[14:20] <fghaas> jackhill: what's your use case for those 70TB?
[14:21] <Psi-jack> cocoy: Oh! heh.. I figured out the problem was something to do with cweb2.. I ended up re-cloning cweb1 to cweb2 and had no issues..
[14:22] <Psi-jack> cocoy: I still don't know /what/ the problem was, unfortunately.
[14:24] <jackhill> fghaas: data for numerical analysis of nuclear physics. I was asked by the professor doing that research if we could combine the storage placed in his compute nodes into one unified filesystem, so I'm investigating possibilities.
[14:24] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:25] <fghaas> jackhill: that sounds mildly awesome. :) are you on lustre now?
[14:26] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[14:26] <jackhill> fghaas: This is a new installation. Nothing yet.
[14:26] <fghaas> I meant what you've previously been using for this purpose
[14:27] <fghaas> (this or similar purposes, I should say)
[14:27] <nhm> jackhill: ooh, neat
[14:29] <jackhill> Nothing. This is the first time we are supporting many hosts needing access to the same set of data. The other computation machines we manage are just one machine and have direct attached or iSCSI storage.
[14:30] * low (~low@ has left #ceph
[14:33] <cocoy> Psi-jack: so you still end up ceph as a shared filesystem for web1 and web2?
[14:34] <cocoy> Psi-jack: i'm planning to put web files too in a mounted ceph-fs on several web-servers.
[14:34] <Psi-jack> cocoy: Yep. There was something odd apparently going on with cweb2 at the time of that incident. Though, both servers were originally clones of each other for years, when I moved them over into RBD through my conversion process of booting up into a VM, raw copying the disks from their NFS-stored qcow2 to RBD, something must have happened, or already had happened, that corrupted something...
[14:35] <Psi-jack> cocoy: Basically when I re-cloned cweb1 over cweb2 from the cweb1 rbd disk, altered it back to cweb2 per IP, ssh host key, etc.. all was good, and still running today, smoothly.
[14:37] <wido> fghaas: No, it won't hit 4.1
[14:37] <fghaas> wido: ok, thanks
[14:37] <wido> The feature freeze for 4.1 has already passed
[14:37] <fghaas> gotcha
[14:37] <wido> it's only in the master, nog 4.1 branch
[14:37] <wido> not*
[14:37] <fghaas> sue
[14:37] <fghaas> sure
[14:37] <fghaas> don't sue :)
[14:43] <cocoy> Psi-jack: cool. The setup is we have a running ceph server and plan to mount /data/webdir for nginx root dir. web servers are not using rbd disks, only plain mount.ceph.
[14:44] <cocoy> Psi-jack: seems in your cweb1 you're mounting ceph-fs dir also?
[14:45] * gregorg (~Greg@ has joined #ceph
[14:45] * gregorg (~Greg@ Quit ()
[14:46] <Psi-jack> cocoy: Heh, yeah, my setup is having all VM's running off RBD disks for their primary disks, and for shared storage, cephfs was used for things like /var/www. /home, etc.
[14:46] <Psi-jack> cocoy: cweb1 and cweb2 on my setup does mount, as cephfs, /var/www from a portion of the cephfs (not / basically, but a pre-made directory within cephfs)
[14:53] <cocoy> Psi-jack: that's nice. our only concern is putting ceph-fs in production. seems the docs say it's not yet ready. what can you say about that? :) or maybe I have to test more.
[14:54] <cocoy> for rbd we have a go signal to use it for prod though.
[14:57] <Psi-jack> cocoy: Well, I'm using RBD-OS disks for each cweb1, and cweb2, then within each of those, CephFS mounts for /var/www, and with all that, I'm running it with LVS to load balance by direct routing to each, and running http://www.linux-help.org/ -- And I have had 0% downtime since, and I have Pingdom monitoring my servers 24/7/365. No downtime for 2.25 months.
[14:57] <Psi-jack> Well, no unexpected downtime. I had one downtime due to a power outage, but that's unrelated to anything of this nature. :)
[14:58] <Psi-jack> But with that in mind, the power outage caused all my servers to eventually go offline, since the APC's could only keep them afloat for 45 minutes, the ceph cluster itself when it all came back online was able to self recover without any issue. :)
[15:06] <cocoy> Psi-jack: that's very nice. btw we're using Ubuntu 12.04 w/ the kernel 3.20.xx I think. So far in my tests mounting ceph-fs is stable with my nginx machines. the only problem i notice is sometimes there is lag on the ceph server, especially on heavy read/writes, e.g. running bonnie++
[15:06] * KevinPerks1 (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[15:06] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[15:06] <cocoy> though haven't experienced any crash.
[15:06] <Psi-jack> cocoy: What's your CephFS cluster of server's using?
[15:06] <Psi-jack> Disk/hardware wise.
[15:07] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[15:08] <Psi-jack> I, so far, experience practically no lag, and maintain easily 80~100 MB/s transfer rates over 1000mbit Ethernet across 3 ceph servers, 3 OSD's per server, each OSD spindle disk backed by SSD for ceph journalling and xfs journalling both.
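(Editor's sketch) For context on the figures Psi-jack quotes, the arithmetic can be written out as below. This is a rough Python sketch that assumes decimal megabytes and ignores Ethernet, TCP and Ceph protocol overhead; the helper names are hypothetical.

```python
def line_rate_mb_per_s(link_mbit_per_s):
    """Convert a link speed in Mbit/s to its raw ceiling in MB/s (8 bits per byte)."""
    return link_mbit_per_s / 8.0

def utilisation(observed_mb_per_s, link_mbit_per_s):
    """Fraction of the raw link rate that an observed transfer rate represents."""
    return observed_mb_per_s / line_rate_mb_per_s(link_mbit_per_s)

print(line_rate_mb_per_s(1000))  # raw ceiling of a 1000 Mbit/s link in MB/s
print(utilisation(80, 1000))     # share of the link used at 80 MB/s
print(utilisation(100, 1000))    # share of the link used at 100 MB/s
```

On these assumptions a 1000 Mbit/s link tops out at 125 MB/s, so 80 to 100 MB/s is roughly two thirds to four fifths of the raw line rate.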
[15:09] <Gugge-47527> one SSD per journal, or 3 journals on the same SSD?
[15:09] <Psi-jack> I haven't even seen ANY performance issues since I switched from qcow2 over NFSv4 to Ceph RBD+CephFS
[15:10] <Psi-jack> Gugge-47527: 6 journals on the same SSD.
[15:10] <Psi-jack> 3 ceph, 3 xfs
[15:10] <Psi-jack> I think I also have my mon and mds on straight SSD.
[15:11] <Gugge-47527> Ohh, you have both the xfs and ceph journal on the SSD :)
[15:11] <Psi-jack> Yes. :)
[15:11] <Psi-jack> On an OCZ Agility 3
[15:12] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[15:12] <Psi-jack> The base OS of each Ceph server is also on that, in like a 20~30gb partition of it.
[15:13] <Gugge-47527> Im planning to do something like that, using two SSD's in raid1 though.
[15:13] <Psi-jack> Bad idea.
[15:14] <Psi-jack> SSD + RAID == Recipe for failure.
[15:14] <Gugge-47527> Why?
[15:15] <Gugge-47527> I would hate to lose all my journals.
[15:15] <Psi-jack> Heh, because of how SSD's work. Their lifetime functionality is based upon their number of writes, after all.
[15:16] <cocoy> Psi-jack: it's running on a test hardware 4Gb, (not sure of the hard disk if it's ssd around 100gb), 8 core cpu
[15:16] <Gugge-47527> If that was the only way for an SSD to fail, sure :)
[15:16] <Psi-jack> As it is, I'm writing a script that will occasionally rotate out partitions of my SSD's to what ceph and xfs will utilize, rotated about every 3 or 6 months, so they're not constantly using the same portion of the disk, this also will allow me to format the partitions, trim them, then zero them back out for next rotation.
[15:17] <Psi-jack> Gugge-47527: Which you'd be cloning the failure to the cloned raid disk. LOL
[15:17] <Psi-jack> As I said, recipe for failure.
[15:18] <Psi-jack> Better is to use LVM and LVM Snapshots, and backup the data, but the journal isn't quite the issue here. The journals themselves are raw data, no filesystem involved, so you can't exactly just initiate trim yourself.
[15:19] <Gugge-47527> Well, i need my journals to live, even though one SSD fails.
[15:19] <Gugge-47527> And im not that concerned about trim on newer SSDs
[15:19] <Psi-jack> You should be.
[15:19] <mattch> Gugge-47527: Bear in mind also that with RAID your journal writes are now going to multiple disks, which takes away a lot of the performance benefit of SSDs :)
[15:20] <Psi-jack> mattch: Also true.
[15:20] <Gugge-47527> But still a lot better than the 7200 RPM disk
[15:20] <mattch> Gugge-47527: but not that much better than 15k enterprise SAS I believe
[15:21] <Gugge-47527> I have an SSD mirror for other stuff, i can do 50k+ iops to that
[15:21] <Gugge-47527> 15k SAS cant :)
[15:21] <Psi-jack> Okay, if you want to totally ignore all logical reasoning, please, feel free to do the most ignorant route of things. :)
[15:22] <Gugge-47527> Well, the biggest problem is that the OSD's have to be rebuilt if the journal dies
[15:22] <Psi-jack> Which can be done live.
[15:22] * Lennie`away is now known as leen
[15:22] <Gugge-47527> I guess i could split the journals between multiple SSDs
[15:22] <mattch> Gugge-47527: http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals is a good read on this area.
[15:22] <leen> fghaas: found the problems, turns out it was a networking problem
[15:23] * leen is now known as Lennie`away
[15:23] * markl_ (~mark@tpsit.com) has joined #ceph
[15:23] * markl_ (~mark@tpsit.com) Quit ()
[15:23] * Psi-jack sighs, REALLY hating away nicks.
[15:23] <Gugge-47527> mattch: sure, splitting the journals to not lose all OSD's in one box at once would be acceptable
[15:24] <mattch> Gugge-47527: I try to look at it the other way round - assume that everything will probably fail at some point, and work out what the time to recovery is for any particular failure domain - then work out your journals-per-SSD ratio from that
[15:24] <Gugge-47527> mattch: i agree
[15:25] <Gugge-47527> I just dont want a single SSD failure to cause a rebuild of all OSDs in one box.
[15:25] <Gugge-47527> splitting the journals between multiple SSDs is fine
[15:27] <mattch> Gugge-47527: Yep - I'm looking at 2 x SSDs for journalling supporting 4-6 OSD disks, so 2-3 journals on each one. Though I'm also aware that total server failure will cause 6 x OSD replication in my calculations.
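(Editor's sketch) The journals-per-SSD trade-off mattch and Gugge-47527 are discussing can be sketched as follows. This is an illustrative Python sketch only: the device names and the round-robin assignment are assumptions, not anything Ceph does on its own.

```python
def assign_journals(osds, ssds):
    """Spread OSD journals round-robin across the given journal SSDs."""
    layout = {ssd: [] for ssd in ssds}
    for i, osd in enumerate(osds):
        layout[ssds[i % len(ssds)]].append(osd)
    return layout

def osds_lost_on_failure(layout, failed_ssd):
    """OSDs whose journal lived on the failed SSD and so need rebuilding."""
    return layout[failed_ssd]

# Hypothetical box: 6 OSD disks and 2 journal SSDs.
osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
ssds = ["ssd-a", "ssd-b"]

layout = assign_journals(osds, ssds)
print(layout)
print(osds_lost_on_failure(layout, "ssd-a"))  # half the box, not all of it
```

With one SSD holding every journal, a single SSD failure takes out all six OSDs in the box; with two SSDs the same failure costs three, which is the failure-domain sizing being argued for here.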
[15:27] <Psi-jack> Yeah, that's definitely an option. :)
[15:27] <Psi-jack> One I should probably look into myself.
[15:28] <Psi-jack> When I built my ceph cluster, I had saved up enough to buy one SSD per ceph server, so that's what I went with for the time being. :)
[15:28] <fghaas> mattch: that's a very sane approach
[15:29] <Psi-jack> Knowing of course, I could improve upon it at a later date, which I still intend to do, because I need to eventually start matching up the drives to their sizes. Right now, each ceph server has 3 OSD's. 1 320GB HDD, 1 500GB HDD, and 1 1TB HDD.. Need to replace the 500 and 320's with 1TB's. :)
[15:29] <mattch> fghaas: thanks for the vote of confidence - always good to have someone with more experience in this area comment on these things... (or at least not start yelling :-p)
[15:29] <fghaas> oh I can yell if you want to :)
[15:30] <Psi-jack> hehehehe
[15:30] <fghaas> I'd totally yell at Psi-jack if his OSDs are unweighted in the crushmap, for example
[15:31] <Psi-jack> fghaas: Heh, yeah. They weren't, initially, but I did weight them after I got it all going. :)
[15:31] <Psi-jack> Cause I knew if I didn't, I could mysteriously run out of space just because the 320 filled. LOL
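(Editor's sketch) Why unweighted OSDs of different sizes run out of space early can be shown with a little arithmetic. This Python sketch assumes the common convention of a CRUSH weight of about 1.0 per terabyte and an idealised, perfectly proportional data distribution; it ignores replication and placement variance, and the function names are hypothetical.

```python
def crush_weights(capacities_gb, gb_per_weight_unit=1000.0):
    """Weight each OSD in proportion to its capacity (about 1.0 per TB, by assumption)."""
    return {name: round(gb / gb_per_weight_unit, 2) for name, gb in capacities_gb.items()}

def usable_before_first_full(capacities_gb, weights):
    """Total data (GB) that can be stored before the first OSD fills up,
    assuming each OSD receives data in proportion to its weight."""
    total_w = sum(weights.values())
    return min(capacities_gb[n] / (weights[n] / total_w) for n in capacities_gb)

# The three disk sizes Psi-jack mentions for one server.
disks = {"osd.0": 320, "osd.1": 500, "osd.2": 1000}

equal = {name: 1.0 for name in disks}
weighted = crush_weights(disks)

print(weighted)
print(usable_before_first_full(disks, equal))     # the 320 GB disk fills first
print(usable_before_first_full(disks, weighted))  # all disks fill together
```

Under these assumptions, equal weights let the server take only about 960 GB before the 320 GB disk is full, while capacity-proportional weights allow roughly the full 1820 GB.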
[15:32] <fghaas> have you run into an -ENOSPC on an OSD in bobtail yet?
[15:32] <cocoy> Psi-jack: thanks for replies. good day all. :) ill be back for more help tomorrow so don't worry. haha
[15:32] <Psi-jack> In April, I'm actually doing a speech on Ceph at our local LUG. CRUSH mapping is one of the things I'll be covering.
[15:32] <mattch> My boss's current suggestion to save money is to make the journals 15k SAS disks, rather than SSDs. I still can't work out what the reduction in performance would be, even with all the lovely performance benchmarks online...
[15:32] <fghaas> Psi-jack: which LUG is that?
[15:32] <Psi-jack> fghaas: GoLUG, in central Florida.
[15:33] <Psi-jack> They got a mild taste of it at our last meeting (we meet every 1st Wednesday of the month), because they were covering Proxmox VE stuff, and I was the only one there that had a truly fully clustered environment set up with it, and ceph. :)
[15:33] <janos> that sounds like "gulag"
[15:33] <Psi-jack> heheh
[15:34] <fghaas> Psi-jack: feel free to steal my slides from barcelona if they're useful
[15:34] <Psi-jack> janos: Well, we have one guy that's kind of a biggish name, Steve Litt, who has done a lot of the LaTeX documentation. :)
[15:34] <Psi-jack> fghaas: Where? I could use anything that would be helpful, since I have just over a month to prep for this, and I haven't given speeches in years. :)
[15:35] <Psi-jack> And all next week, I'll be in Japan. Not caring about computers at all. hehe
[15:36] <fghaas> https://github.com/fghaas/lceu2012, or you could also use the ones from https://github.com/fghaas/lca2013, that was the tutorial from linux.conf.au
[15:36] <Psi-jack> Spiffy. Thanks! When I get back I'll look over them and integrate them most likely. :)
[15:37] * ScOut3R_ (~ScOut3R@ has joined #ceph
[15:37] <fghaas> I hope you don't hate impress.js
[15:37] <Psi-jack> heh
[15:38] <fghaas> also, it's all CC-BY-SA, so state your source and do reciprocally share your slides
[15:38] <Psi-jack> Heh. Roger that! :)
[15:39] <Psi-jack> Already planned on that anyway, mr Florian Haas. :)
[15:39] <fghaas> wasn't insinuating otherwise :)
[15:41] <Psi-jack> Hmmm, some of the links apparently aren't working anymore. Connection refused.
[15:42] <scuttlemonkey> fghaas: we should build a repo of demos/talks/etc and encourage anyone chatting about ceph to contrib
[15:42] * Ludo (~Ludo@falbala.zoxx.net) has joined #ceph
[15:42] <fghaas> scuttlemonkey: yeah that would be a great idea for the wiki... oh wait :)
[15:42] <scuttlemonkey> dunno about others but my talks always get better when I see other people's methods of explaining stuff...helps round out how I talk about things to people
[15:42] <scuttlemonkey> I know ><
[15:43] <scuttlemonkey> I have been gently pushing on platform stuff
[15:43] <scuttlemonkey> even played a bit with that new 'discourse' tool from Atwood's crew
[15:44] * cocoy (~Adium@ Quit (Quit: Leaving.)
[15:44] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[15:51] <Psi-jack> fghaas: Ahhhhh.. The URL's defined in these slides try to open localhost, eh?
[15:51] <Psi-jack> Well, guess that means my presentation laptop will need a webserver. No problem. LOL.
[15:52] <fghaas> Psi-jack: for the demos, yes. firing up shellinaboxd and connecting it to a running Ceph VM is left as an exercise for the reader :)
[15:52] <Psi-jack> hehe
[15:52] <Psi-jack> Cool. :)
[15:52] <Psi-jack> That part I will be able to demonstrate directly anyway, from my VPN connection to my home infrastructure. :)
[15:53] <fghaas> yeah but "direct" demonstrations suck because you have to switch windows. thus, shellinabox ftw
[15:54] <scuttlemonkey> psi-jack: what focus is your presentation taking? I'm talking to the NYC LUG in sept and I haven't decided what makes the most sense beyond the usual 101 stuff
[15:54] <fghaas> scuttlemonkey: have you seen the tutorial from lca? there's a full video
[15:55] <scuttlemonkey> hmm
[15:55] <Psi-jack> scuttlemonkey: What Ceph is, how it works, and how it can be used. :)
[15:55] <fghaas> http://mirror.linux.org.au/linux.conf.au/2013/ogv/Ceph_object_storage_block_storage_file_system_replication_massive_scalability_and_then_some.ogv
[15:55] <Psi-jack> So yeah, basically A to Z. :)
[15:55] <fghaas> ah, nvm. question was for Psi-jack, not me. sorry scuttlemonkey :)
[15:56] <scuttlemonkey> fghaas: oh, great! I only got to see part of this one and lost the link
[15:56] <Psi-jack> hehe
[15:56] <Psi-jack> Cool! NOW I have faces for the people. :)
[15:56] <Psi-jack> Muahahahaaha. :)
[15:56] <scuttlemonkey> hehe
[15:57] <fghaas> Psi-jack: sadly I'm the boring clean-shaven guy, Tim's the pirate :)
[15:57] <scuttlemonkey> before speaking at that open source cloud day in brussels I hadn't given a talk in a _long_ time
[15:57] <Psi-jack> fghaas: Hehehehehehe
[15:57] <Psi-jack> fghaas: Eh, it's all good, I actually just RECENTLY cut my hair off, though, not clean shaven, :)
[15:58] <scuttlemonkey> so I learned 2 things 1) I'm never using keynote again and 2) I do better when I have a demo and not just slides
[15:58] <Psi-jack> hehe
[15:58] <fghaas> scuttlemonkey: then impress.js+shellinaboxd is just for you, too
[15:58] <Psi-jack> scuttlemonkey: Yeah, when I do presentations, I go all out. :)
[15:59] <scuttlemonkey> fghaas: sounds like it...saved the url locally this time :)
[15:59] <scuttlemonkey> psi-jack: yeah, it's more fun that way
[15:59] <Psi-jack> fghaas: Now, here's a question. Do you know what wireless remote you had in this presentation I presume you used to switch between the slides?
[16:00] <fghaas> Psi-jack: http://www.trust.com/products/product.aspx?ProductCategory=INPUT&ProductGroup=PRESENTERS&artnr=16661
[16:00] <fghaas> best of these I've ever had
[16:00] <Psi-jack> Very nice. :)
[16:00] <fghaas> super tiny, implements USB HID, switches between being a presenter (pgup/pgdn, f5, esc) and a mouse
[16:00] <Psi-jack> And Linux supported, too?
[16:01] <fghaas> and has a laser pointer
[16:01] <fghaas> yeah, just HID. works like a charm.
[16:01] <Psi-jack> Perfect. :)
[16:01] <scuttlemonkey> ...and doesn't appear to be sold in the US
[16:01] <Psi-jack> heh, is it sold in Japan? Cause I'll be there soon! :D
[16:02] <Psi-jack> Whaaaat? Not even Japan? Grrr.
[16:02] <fghaas> scuttlemonkey: bugger. some FCC regulation insanity?
[16:02] <scuttlemonkey> no doubt, our government is becoming sillier than most
[16:02] <Psi-jack> Yeaah...
[16:02] <scuttlemonkey> ah hah!
[16:02] <scuttlemonkey> http://www.amazon.com/Trust-Preme-Wireless-Laser-Presenter/dp/B002TIL86G
[16:02] <scuttlemonkey> Amazon to the rescue
[16:03] <Psi-jack> hehe
[16:03] <Psi-jack> Only 1 left in stock?
[16:03] <Psi-jack> MINE!
[16:03] <scuttlemonkey> haha
[16:03] <Psi-jack> heh
[16:03] * janos quickly snakes it to drive up price
[16:03] <scuttlemonkey> yeah, right now I just use my iphone
[16:03] <scuttlemonkey> rowmote serves my needs well enough
[16:04] <fghaas> one other thing you can try is anyremote, turns any java or android phone into a bluetooth or wifi remote
[16:04] <Psi-jack> scuttlemonkey: Hmmm.. That's an interesting idea as well.
[16:04] <scuttlemonkey> fghaas: I'll have to try that when my new Nexus 4 gets here :)
[16:04] <Psi-jack> fghaas: Yeaa. That'd be a good secondary option.
[16:05] <fghaas> I just found my own phone a tad too large to be comfortable presenting. I just want that clicker to be invisible in my hand
[16:05] <Psi-jack> Yeaaah, mine too, fghaas.
[16:05] <Psi-jack> My phone's not quite as big as those new Galaxy Note's, but it's definitely bigger than iPhones are.
[16:05] <Psi-jack> Pesky little iphones. :)
[16:06] * ScOut3R_ (~ScOut3R@ Quit (Remote host closed the connection)
[16:06] * ScOut3R (~ScOut3R@ has joined #ceph
[16:06] <fghaas> what I'd really want is a skin color glove that has a bluetooth transmitter, with the pgdown button on my index finger's first joint and the pgup on the second
[16:06] <Psi-jack> LOL.
[16:06] <fghaas> anyone convince thinkgeek to ship that?
[16:07] * ron-slc_ (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) has joined #ceph
[16:07] <Psi-jack> Well, as long as we're dreaming, why not go with the Google Glass approach that doesn't even require anything special besides glasses? ;)
[16:07] <fghaas> the glove would be less that $1.5k, and they wouldn't make you beg with #ifihadglove?
[16:07] <fghaas> "less than"
[16:08] <Psi-jack> hehe
[16:08] <fghaas> it seems that anytime scuttlemonkey and I are both in this channel, the talk is OT galore...
[16:09] <Psi-jack> Heh, wow..
[16:09] <Psi-jack> I go to staples.co.uk, and the available categories are just inks and toner..
[16:09] <scuttlemonkey> hehe
[16:09] <Psi-jack> For me, coming from the US.
[16:09] <nhm> fghaas: It's all part of our grass-roots marketing strategy. ;)
[16:09] <fghaas> nhm! question for you! :)
[16:10] <fghaas> (actually on topic, mind you)
[16:10] <nhm> fghaas: uh oh... :)
[16:10] <fghaas> commonly, when folks build storage servers, they would default to using the deadline scheduler, because it would be expected that their controllers are smart & fast
[16:11] <fghaas> now for a ceph osd node, we often have rather slow (and dumb) sata drives, and super fast and super smart SSDs
[16:11] <fghaas> is it more prudent to default to cfq for the filestore, and noop for the journal devices?
[16:13] <nhm> fghaas: good question! I haven't explicitly tested things that granularly, but it may not be a bad hypothesis. Have you seen the IO scheduler article yet?
[16:13] <fghaas> this one? http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/
[16:14] <nhm> fghaas: yeah. It looks like it may also depend on the filesystem being used and if you are read heavy or write heavy.
[16:15] <nhm> And of course the IO size.
[16:15] <fghaas> well, sure. but in that article you always switched the scheduler for both the filestore and the journal, no?
[16:16] <nhm> fghaas: yep, so I don't have a good answer if switching them independently might lead to even better results
[16:16] <nhm> fghaas: It may.
[16:18] <fghaas> gotcha, thanks
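fghaas's idea of setting the scheduler per device role can be sketched as follows. This is an untested hypothesis in the conversation, not a recommendation; the device names `sdb`, `sdc`, `sdd` are made up, and the sketch only builds the commands instead of writing to sysfs. It relies on the Linux convention that `/sys/block/<dev>/queue/scheduler` lists the available schedulers with the active one in square brackets.

```python
def active_scheduler(sysfs_text):
    """Parse text such as 'noop deadline [cfq]' and return the active scheduler."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def scheduler_plan(filestore_devs, journal_devs,
                   filestore_sched="cfq", journal_sched="noop"):
    """Map each block device to the scheduler proposed for its role."""
    plan = {dev: filestore_sched for dev in filestore_devs}
    plan.update({dev: journal_sched for dev in journal_devs})
    return plan


def shell_commands(plan):
    """Render the plan as shell lines (to be run as root) without executing them."""
    return [f"echo {sched} > /sys/block/{dev}/queue/scheduler"
            for dev, sched in sorted(plan.items())]


# Hypothetical layout: two SATA filestore disks, one journal SSD.
plan = scheduler_plan(filestore_devs=["sdb", "sdc"], journal_devs=["sdd"])
for line in shell_commands(plan):
    print(line)
```

Whether this split actually beats a single scheduler for both roles would still need benchmarking, as nhm says.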
[16:21] <scuttlemonkey> fghaas: although speaking of demo stuff...did you see the juju article I posted? I think that may be my go-to for demos...the gui stuff plays nicely to folks who don't really know whats happening
[16:22] <fghaas> I've not demo'd installation. I've found that both I and my audiences have always preferred playing with the stuff when it was already running
[16:22] <scuttlemonkey> sure
[16:22] <fghaas> but yeah, we need deployment demos too. I'm more of a puppethead myself, so I'd love to build a demo for that
[16:22] <scuttlemonkey> I just meant using the gui to show things like failing hosts and whatnot
[16:22] * scalability-junk should really remove his highlight on demo ;)
[16:23] * vata (~vata@2607:fad8:4:6:5473:1c84:4a88:dd00) has joined #ceph
[16:30] * PerlStalker (~PerlStalk@ has joined #ceph
[16:45] * leseb (~leseb@mx00.stone-it.com) Quit (Remote host closed the connection)
[16:47] * jlogan (~Thunderbi@2600:c00:3010:1:a9a1:e01e:fafb:f712) has joined #ceph
[16:48] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[16:52] * gerard_dethier (~Thunderbi@ Quit (Quit: gerard_dethier)
[16:53] * leseb (~leseb@mx00.stone-it.com) Quit (Read error: Connection reset by peer)
[16:53] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[17:00] * absynth (~absynth@irc.absynth.de) Quit (Ping timeout: 480 seconds)
[17:21] * josef (~seven@li70-116.members.linode.com) has left #ceph
[17:25] * jms_ is now known as jms
[17:26] * jms (~jason@milkyway.csit.parkland.edu) has left #ceph
[17:36] * hybrid5121 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Read error: Connection reset by peer)
[17:39] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[17:42] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:56] * vata (~vata@2607:fad8:4:6:5473:1c84:4a88:dd00) Quit (Ping timeout: 480 seconds)
[17:57] * leseb (~leseb@mx00.stone-it.com) Quit (Remote host closed the connection)
[18:04] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[18:10] * vata (~vata@2607:fad8:4:6:d24:265e:768f:89f7) has joined #ceph
[18:13] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[18:14] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[18:20] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:24] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[18:26] * The_Bishop_ (~bishop@e179015082.adsl.alicedsl.de) has joined #ceph
[18:27] * ScOut3R (~scout3r@54007948.dsl.pool.telekom.hu) has joined #ceph
[18:28] * absynth (~absynth@irc.absynth.de) has joined #ceph
[18:29] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[18:30] * carson (~carson@2604:ba00:2:1:fd02:5699:c1b2:2a29) has left #ceph
[18:33] * The_Bishop (~bishop@e179015198.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[18:36] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:40] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[18:45] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:52] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[19:00] * chutzpah (~chutz@ has joined #ceph
[19:01] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[19:03] * absynth (~absynth@irc.absynth.de) Quit (Ping timeout: 480 seconds)
[19:09] * drokita (~drokita@ has joined #ceph
[19:12] * absynth (~absynth@irc.absynth.de) has joined #ceph
[19:13] * Cube (~Cube@ has joined #ceph
[19:13] * loicd (~loic@lvs-gateway1.teclib.net) has joined #ceph
[19:15] * markbby (~Adium@ has joined #ceph
[19:15] * sjustlaptop (~sam@ has joined #ceph
[19:16] * ScOut3R (~scout3r@54007948.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[19:17] <markbby> hi all .. is there any documentation on the norecover and nobackfill osdmap options?
[19:19] <gregaf> *searches docs*
[19:19] <gregaf> apparently not
[19:19] <gregaf> but they do what they sound like
[19:19] <gregaf> if set, the OSDs won't perform recovery or backfill operations, respectively
[19:20] <gregaf> you are unlikely to want to set them...
[19:21] <markbby> so if I set those options, take an osd out temporarily for maintenance, bring it back online, then unset the option it will then recover?
[19:21] <markbby> is that what they are meant for?
[19:21] <markbby> .. for example if I need to reboot a host ..
[19:22] <gregaf> no, just a regular reboot should be dealt with fine by the normal down and out mechanisms
[19:22] <gregaf> they're tools for dealing with large misbehaving clusters in a very specific set of circumstances
[19:22] <markbby> ok
[19:22] <markbby> thanks
[19:22] <gregaf> yep
[19:23] * drokita (~drokita@ Quit (Quit: Leaving.)
[19:23] * drokita (~drokita@ has joined #ceph
[19:23] * nick5 (~nick@ Quit (Remote host closed the connection)
[19:24] * nick5 (~nick@ has joined #ceph
[19:25] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[19:30] * danieagle (~Daniel@ has joined #ceph
[19:31] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[19:31] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[19:34] * gaveen (~gaveen@ has joined #ceph
[19:40] * Jasson (805fd346@ircip2.mibbit.com) has joined #ceph
[19:50] <Jasson> I could use some clarification on how ceph and rbd images work. Based on advice from yesterday I have back ported kernel running on my rhel6 servers and have the rbd images now mounted. I'm looking at sharing them to the rest of our client machines over NFS.
[19:51] <Jasson> but copying data into the mounted image on one server and then checking on the other server which is mounting the same image locally I don't see the files copied into it until it is unmounted and remounted.
[19:52] <janos> when you say mounting the image - do you mean the rbd?
[19:52] <janos> or nfs
[19:52] <janos> it sounds like rbd
[19:52] <Jasson> the rbd mounted locally.
[19:52] <janos> do not mount that in two places - rbd means Rados Block Device
[19:52] <Jasson> I have not set up the NFS portion of this yet, just trying to plan for every scenario before taking it live.
[19:53] <janos> try plugging two sata cables into one drive for a visual ;)
[19:53] <Jasson> ok, so don't mount on the second server until we have a failure on the first one and have to switch over.
[19:53] <janos> if it acts like other block devices, you will end up with corruption if you mount in two places
[19:54] <Jasson> I was starting to think that might be the case, thanks.
[19:54] <janos> anytime
[19:55] <Jasson> So second question, the rbd mapping, does that survive a reboot? And if so do they keep the same /dev/rbdX designation each time?
[19:55] <janos> i haven't done in a little while - i don't think it survives reboot
[19:55] <janos> i think i wrote a start up script to do the repetition for me
[19:56] <janos> i always got the same rbd number, but i don't know if that's just due to my scripts executing the same order each time or a feature
[19:57] * loicd (~loic@lvs-gateway1.teclib.net) Quit (Ping timeout: 480 seconds)
[19:57] <Jasson> ok, so sounds like my best bet is to write a script to always do them in the same order. At least that can be ported to both servers in case one goes down. Then it's just remounting the images and reconnecting the NFS shares.
[19:58] <janos> i end up making bash scripts with notes for everything ;)
[19:58] <janos> too much for me to remember
[19:59] <Jasson> yeah, we have kind of a messy wiki to do the same thing.
[19:59] <janos> whatever maintains knowledge and process works!
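A start-up script along the lines janos and Jasson describe could look like the Python sketch below. The image names are hypothetical, and the sketch only prints the `rbd map` commands in a fixed (sorted) order rather than running them; whether a stable order is enough to guarantee the same /dev/rbdX numbers is exactly what janos says he is unsure about.

```python
import shlex


def rbd_map_commands(images):
    """images: iterable of (pool, image) pairs; returns commands in a stable order."""
    return [["rbd", "map", f"{pool}/{image}"] for pool, image in sorted(images)]


# Hypothetical images; replace with the ones actually created in the cluster.
images = [("rbd", "webdata"), ("rbd", "dbstore"), ("rbd", "home")]
for command in rbd_map_commands(images):
    print(" ".join(shlex.quote(part) for part in command))
```

The same list can be kept on both storage servers, but only one of them should map and mount a given image at a time.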
[20:03] <mjevans> dmick: I added a ceph.conf file to the bug from yesterday. I also distilled my thoughts in to a more positive direction with this: http://tracker.ceph.com/issues/4239 Automatically determine routing data based on host routing data (entirely deprecate public/cluster explicit specification)
[20:05] <mjevans> janos: Well, he could use one of those cluster filesystems over the RBD right? The ones that are designed to be mounted by multiple computers at the same time?
[20:07] <janos> mjevans: i suppose but that sounds kind of crazy
[20:07] <mjevans> http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems Yeah, it would be
[20:07] <janos> cluster over a cluster
[20:07] <mjevans> janos: no no, there ARE ones designed for expensive hardware
[20:07] <mjevans> sorry wrong sublink 'shared disk filesystems'
[20:08] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[20:08] <janos> i'd rather make an rbd-backed qemu image with a vm that handles NFS with the VM having failover
[20:08] <janos> yeah, there are many ways up the mountain ;)
[20:08] <mjevans> My understanding is that the RBD /should/ be updated simultaneously (or close enough to it) from all viewpoints
[20:08] <mjevans> Also that is almost exactly what I'm doing janos, only samba.
[20:08] <janos> yeah, i'm preparing to rework my house that way
[20:09] <mjevans> :( users are used to windows, cry and whine about not being able to install their virus laden crapware
[20:09] <dmick> mjevans: I don't immediately grok what you're saying, but I will note that the 'networks' options can have lists of networks on them, if that helps
[20:09] <dmick>
[20:09] <dmick> (and it's assumed that all public networks route to one another, and all cluster networks do as well)
[20:09] <mjevans> dmick: as usual, it helps but the documentation doesn't... also autoconfig == useful
[20:10] <dmick> http://tracker.ceph.com/issues/4049
[20:10] <dmick> the source is available ;)
[20:12] <Jasson> In my crazy scenario I was hoping to see the image update in real time on both servers so I had a quicker failover recovery option if one of the 2 storage servers went down. But if it's just an "oh crap, have to mount the rbd images then start NFS", that's not too bad.
[20:13] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[20:18] <mjevans> Jasson: you should probably try what I am... setting up libvirt to use rbd as its storage backend. Then you can have a VM image xml config and start it up if the other server goes down hard for long enough; OR teleport the instance to another host since the storage is accessible via the same path from both.
[20:19] <mjevans> That, I think, is what most ceph users likely use it for given the feature set. That or giant 'loose file' stores for mega scale web-apps.
[20:25] <Jasson> basically what we need is something to replicate data across 2 large storage servers that we can then share out smaller chunks for different tasks across the rest of our server network. So shared data mounted on each of the servers our users actually use, db storage, web data, etc.
[20:25] <Jasson> we were trying to keep the storage servers running not on VM's to increase performance.
[20:25] <Jasson> but I'm considering any suggestions as well :)
[20:32] * loicd (~loic@ has joined #ceph
[20:35] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[20:38] <Gugge-47527> how do i compile only the cephfs fuse client?
[20:39] <Gugge-47527> im gonna try if i can get it working on freebsd :)
[20:51] <mjevans> This... is somewhat interesting. The mkcephfs script doesn't seem to 'start' things correctly on a set of systemd systems. I'm fairly sure it completed, it's just that none of the monitors came up. Do I get out of this by issuing a single monitor ceph.conf file, starting that monitor, then following the 'add monitor' steps for the other monitors?
[20:52] <Gugge-47527> doesnt "service ceph -a start" start it all?
[20:53] <mjevans> Gugge-47527: systemctl start ceph-mon
[20:53] <mjevans> Gugge-47527: systemctl start ceph-osd@num
[20:53] <Gugge-47527> i guess you have to manually start all the mons on all the hosts then
[20:53] <Gugge-47527> if you dont have the init script supporting -a :)
[20:53] <mjevans> Yes that's done, they are spamming my logs with errors
[20:54] <Gugge-47527> strange, they should be talking to each other and go online :)
[20:54] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) has joined #ceph
[20:54] <mjevans> I guess I can kill everything and try again after lunch
[20:55] <mjevans> (I do mean /everything/)
[20:56] <mjevans> The monitors appear to more or less be saying that they have a fault with/reaching an OSD
[20:56] <mjevans> for every single OSD
[20:56] <Gugge-47527> and the osd's are running?
[20:57] <Gugge-47527> does ceph -s output anything?
[20:57] <mjevans> Yes on the first
[20:58] <mjevans> messages like these two lines on the second:
[20:58] <mjevans> 2013-02-22 11:57:49.398702 7fdf4f69a700 0 -- :/8857 >> [fd00::211::]:6789/0 pipe(0x1d8d310 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault
[20:58] <mjevans> 2013-02-22 11:57:52.399057 7fdf5502e700 0 -- :/8857 >> [fd00::210::]:6789/0 pipe(0x7fdf44000c00 sd=5 :0 s=1 pgs=0 cs=0 l=1).fault
[21:00] * loicd (~loic@ Quit (Quit: Leaving.)
[21:00] <mjevans> 1 mon.name(probing) e0 discarding message auth(proto 0 26 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[21:00] <mjevans> Is from another monitor only host
[21:00] <Gugge-47527> do you have firewall rules preventing the mons to talk to each other?
[21:00] <mjevans> So yeah, pretty much it failed quorum on startup because (duh) only one monitor was running when it first started
[21:01] <Gugge-47527> as soon as you start the missing monitors, they should get quorum
[21:01] <mjevans> Gugge-47527: No, but one of the monitors is /only/ on the public, not cluster, network
[21:01] <Gugge-47527> well, all the monitors should be able to connect to each other
[21:02] <mjevans> They can, over the public network
[21:02] <Gugge-47527> and they all use the public network ip?
[21:02] <mjevans> Yes
[21:02] <Gugge-47527> next step for me would be to fire up tcpdump and check if they are communicating :)
[21:04] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Remote host closed the connection)
[21:05] * yehuda_hm (~yehuda@2602:306:330b:a40:157b:3371:b971:2937) Quit (Ping timeout: 480 seconds)
[21:05] <mjevans> http://pastebin.com/eyref2Au That's what the config looks like on each system
[21:08] * mistur_ is now known as mistur
[21:16] <elder> Anyone else having trouble reaching machine teuthology?
[21:18] <dmick> me. no pingy
[21:18] <dmick> looking
[21:22] * ron-slc_ (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[21:23] * markbby (~Adium@ Quit (Quit: Leaving.)
[21:23] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[21:23] <mjevans> I think I figured out what the problem was; I'd enabled the ceph-mon service, but it too needed to have its service definition renamed to include the @ and then have its argument stipulated. Thanks Archlinux maintainer... I'll file another bug.
[21:23] * markbby (~Adium@ has joined #ceph
[21:25] <mjevans> Yup... that fixed it
[21:26] <mjevans> I'll edit my crush maps after lunch; thanks for the extra hint; it added a little more credence to a slight oddity and I was able to figure out what was needed thanks to the total of small warnings.
[21:27] * leseb_ (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:27] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[21:27] <dmick> the VM is receiving no packets on either 'real' interface, only lo
[21:28] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:29] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has joined #ceph
[21:29] * leseb_ (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[21:29] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:30] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[21:31] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[21:34] <dmick> the networking service is 'stop/waiting'; that can't be good
[21:34] * eschnou (~eschnou@65.72-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:35] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:37] <dmick> big hammer: /etc/init.d/networking restart; seems to have gotten it unstuck
[21:38] <dmick> do not understand the interplay between upstart and init.d on this score
[21:44] <elder> Nor do I, but it's working for me again, thank you.
[21:44] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:45] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:49] <sstan> do osd weights affect both read and write distribution?
[21:50] <sjust> yes
[21:50] <sjust> well, no
[21:50] <sjust> they affect object distribution
[21:50] <sjust> if your reads and writes are evenly distributed over objects, they will follow osd weight in terms of osd distribution
[21:50] <sstan> but when comes the time to read an object, it is equally read from everywhere where the object is available?
[21:50] <sjust> no, always the primary
[21:51] <sjust> what is your use case?
[21:51] <sstan> have 3 machines. #1 is slow, #2 is fast, #3 is slow. Weights are distributed accordingly and replication is 2
[21:52] <sjust> ah, it might be good to craft a custom crush map to place the primaries on #2?
[21:52] <sstan> almost every object is written to no.2 , but half of the copies will be on #1 and other half on #3, right?
[21:52] <sjust> not necessarily, how is your crush map set up?
[21:52] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:52] * Jasson (805fd346@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[21:53] <sstan> ruleset 2
[21:53] <sstan> type replicated
[21:53] <sstan> min_size 2
[21:53] <sstan> max_size 10
[21:53] <sstan> step take default
[21:53] <sstan> step choose firstn 0 type osd
[21:53] <sstan> step emit
[21:53] <sjust> is that 3 osds?
[21:53] <sstan> yes
[21:53] <sjust> or three machines with more than one osd each?
[21:53] <sjust> ok
[21:54] <sjust> in that case, #2 will simply end up storing more objects based on the weight
[21:54] <sstan> yes, it means it will be writing more, and that's good because it's fast
[21:54] <sjust> right
[21:55] <sstan> but, no.1 and no.3 will be sharing the replication
[21:55] <sjust> yeah
[21:55] <sstan> however, when I need to read an object, will all 3 computers equally contribute?
[21:55] <sstan> or it's just the primary, like you said
[21:55] <sjust> no, 2 will still dominate
[21:56] <sjust> sorry, all three will contribute based on their crush weights as with writes
[21:56] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) has left #ceph
[21:57] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has left #ceph
[21:58] <sstan> so there is no way to address the issue where OSD write speed is limited, but read speed is abundant ?
[21:59] <sjust> oh, you want the two slow ones to have a larger role in writes?
[21:59] <sjust> **in reads?
[21:59] <sstan> yes
[21:59] <sjust> hmm, never considered that, you could arrange for *all* pgs to have their primaries on 1/3 with all replicas on 2
[21:59] <sjust> but that would be a bit heavy handed
[21:59] <sjust> **on 1 & 3
[22:01] <sstan> primary for writes would be no.2 (fast write) (with replicas equally distributed between no.1 and no.3)
[22:02] <sstan> and when I need to read, they all offer the same read speed, so clients would be pulling from all 3 machines
[22:03] <sjust> no good way to do that
[22:03] <sjust> primary for writes needs to be primary for reads
[22:03] <sjust> unless the objects you will be reading are known to be read-only?
[22:03] <sstan> no, they are normal objects
[22:04] <sjust> yeah, that isn't supported and probably won't be any time soon, rados semantics make that difficult
[22:05] <sjust> it's also less friendly to the page cache on the machines
[22:06] <sstan> ah I didn't know that .. so read requests per OSD are respectively proportional to their weight 1
[22:06] <sstan> * to OSD weights in the crushmap
[22:06] <sjust> object placement is proportional to weight and reads are always served by the primary, so yeah, assuming the distribution of reads over objects is reasonable
[22:07] <janos> sorry for noob question, but how is primary determined - purely by weight?
[22:07] <sjust> the primary is just the first osd chosen for a pg by crush
[22:07] <sjust> so with a standard crush map, yeah
[22:07] <sstan> every PG has a primary ... and the frequency with which each OSD is primary among PGs is proportional to the OSD weights I guess
[22:08] <sjust> sstan: basically, for a normal crushmap
[22:08] <janos> thanks
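sjust's point (the primary is the first OSD chosen for a PG, so with a standard map the share of primaries, and therefore reads, follows the weights) can be illustrated with a toy simulation. To be clear about the assumptions: this is NOT the CRUSH algorithm, just a weighted draw without replacement, and the weights 1/3/1 for sstan's slow/fast/slow machines are made up.

```python
import random
from collections import Counter


def pick_acting_set(weights, replicas, rng):
    """Weighted draw without replacement; the first pick plays the primary."""
    remaining = dict(weights)
    chosen = []
    for _ in range(replicas):
        osds = list(remaining)
        pick = rng.choices(osds, weights=[remaining[o] for o in osds])[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


def primary_shares(weights, replicas=2, pgs=10000, seed=0):
    """Fraction of simulated PGs for which each OSD ends up as primary."""
    rng = random.Random(seed)
    counts = Counter(pick_acting_set(weights, replicas, rng)[0] for _ in range(pgs))
    return {osd: counts[osd] / pgs for osd in weights}


# Hypothetical weights: osd.2 is the fast machine.
shares = primary_shares({"osd.1": 1.0, "osd.2": 3.0, "osd.3": 1.0})
print(shares)
```

In this toy model the fast OSD is primary for about 3/5 of the PGs, so it also serves about that share of the reads if reads are spread evenly over objects.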
[22:08] <sjust> though you can do creative things like create one crush hierarchy of ssds and a second of spinning disks and ensure that all pgs have their primary in the ssd hierarchy and replicas in the spinning disks
[22:09] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:10] <sstan> in a replica size 2 situation, does every PG point to only 2 OSDs ?
[22:10] <sjust> right
[22:10] <sjust> *except for certain times during recovery
[22:10] <sjust> when it might be 1 or 3
[22:11] <sstan> 3 that's impossible ... since the objects are available at most 2 OSDs
[22:11] <sjust> it's during the time when we are copying objects over to a new replica
[22:12] <sstan> ah only temporarily
[22:12] <sjust> right
[22:13] <sstan> I made an interesting test :
[22:13] <sstan> http://pastebin.com/xiBM21DJ
[22:14] <sstan> when using rbd cp machine1 machine2, it takes more time than creating machine2, mapping machine1 and machine2 , then dd /dev/rbd1 /dev/rbd/2
[22:15] <sjust> hmm, not sure, joshd?
[22:17] <joshd> cp is sync read, then sync write for each object. it's not optimized at all
[22:18] <sstan> I tried dd with oflag=direct, it's still faster that cp
[22:18] <sstan> I don't know if I'm doing something wrong
[22:19] <sstan> *than
[22:21] * sleinen1 (~Adium@2001:620:0:26:6de8:89b8:59ec:81a6) has joined #ceph
[22:22] * vata (~vata@2607:fad8:4:6:d24:265e:768f:89f7) Quit (Quit: Leaving.)
[22:22] <sstan> Now that I'm thinking about it, one part of cp being slow is that the write/read happens in the OSD cluster (i.e. some OSDs have to do both write and read operation at the same time)
[22:22] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[22:24] <sstan> one solution might be to tell each OSD to copy the objects without using the network (just do a copy-paste in the hard drive)
[22:26] <sjust> sstan: in general, two objects won't be on the same machine
[22:27] <sjust> more importantly they probably won't be in the same pg so it would be...tricky
[22:28] <sstan> doesn't matter ... it's an RBD copy, so if the rbd is 100 machines, each machine copies whatever it is storing that's related to a particular RBD
[22:28] <sstan> * rbd separated on 100 machines
[22:30] <sjust> sstan: an rbd image is a set of objects {(machine#, object#)}, copying one is a matter of copying objects (machine1, N) -> (machine2, N) for all blocks N in the rbd image
[22:30] <sjust> you can only do a local copy when (machine1, N) and (machine2, N) happen to be on the same machine
[22:31] <sjust> possible, but surprisingly difficult to do
[22:31] <sjust> also, it gets less useful the larger your cluster
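
sjust's point that a local copy gets less useful as the cluster grows can be put in numbers. This is a rough sketch under assumptions that are not from the discussion: every object is placed independently and uniformly at random on `size` distinct OSDs, all OSDs have equal weight, and each OSD sits on its own machine. The function names are hypothetical.

```python
from math import comb


def p_same_replica_set(num_osds, size):
    """Chance that source and destination objects land on the same OSD set."""
    if num_osds < size:
        raise ValueError("need at least `size` OSDs")
    return 1 / comb(num_osds, size)


def p_any_overlap(num_osds, size):
    """Chance that the two OSD sets share at least one OSD."""
    if num_osds < size:
        raise ValueError("need at least `size` OSDs")
    # Sets are disjoint when the destination avoids all `size` source OSDs.
    disjoint = comb(num_osds - size, size) / comb(num_osds, size)
    return 1 - disjoint


if __name__ == "__main__":
    for n in (4, 10, 100, 1000):
        print(n, p_same_replica_set(n, 2), p_any_overlap(n, 2))
```

Both probabilities shrink as the number of OSDs grows, so under these assumptions fewer and fewer object pairs could be copied without touching the network.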
[22:32] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:36] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:36] <sstan> imagine an RBD with replica size = 2. Each object n of the rbd is on two OSDs i.e: [ n , {j,k} ]. In each OSD, copy object n and you get [ copy_of_n, {j,k} ]. There is a copy of the RBD , by using local copy
[22:37] <sstan> then, it might be rebalanced according to the OSD weights, but that would be low priority traffic, if even needed (because objects n are already distributed according to weights)
[22:39] <sstan> where j,k are elements of the set of OSDs and there is only one object n per OSD, where n is an object from the set composing some RBD
[22:51] * eschnou (~eschnou@65.72-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:07] <mjevans> Is there a method for making crushtool -c more verbose? It tells me there's a problem, but produces an empty string that seems to correlate to the switch between my set of root nodes and my set of rules
[23:08] <mjevans> crushmap.txt:113 error: parse error at ''
[23:12] * markbby (~Adium@ Quit (Quit: Leaving.)
[23:14] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[23:14] <ShaunR> So reading around i always see ceph articles showing 3 replicas, is there some reason for that, is having only 2 that bad?
[23:16] <gregaf> 2 isn't bad, but there's a huge difference in the number of nines of expected data durability going from 2 to 3
[23:17] <gregaf> with two-copy you expect to lose "an amount of data" at around the same frequency as with RAID-5 arrays (although the amount lost is much, much smaller)
[23:17] <gregaf> with 3-copy the probability is more likely that a natural disaster will knock out your data center before your disks combine to lose data
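
The gap between 2 and 3 copies can be sketched with a toy model. Everything numeric here is an assumption for illustration, not a measurement or anything gregaf stated: disks fail independently, the annual failure rate scales linearly down to the recovery window, and data is lost only if every remaining copy fails before re-replication finishes.

```python
def p_disk_fails_in_window(afr, window_hours):
    """Approximate chance one disk fails during the recovery window.

    `afr` is an assumed annual failure rate (e.g. 0.04 for 4%).
    """
    return afr * window_hours / (365 * 24)


def p_remaining_copies_lost(afr, window_hours, size):
    """Chance that, after one copy is lost, all other copies fail too."""
    p = p_disk_fails_in_window(afr, window_hours)
    return p ** (size - 1)


if __name__ == "__main__":
    afr, window = 0.04, 24  # hypothetical: 4% AFR, 24 h to re-replicate
    for size in (2, 3):
        print(size, p_remaining_copies_lost(afr, window, size))
```

In this model each extra copy multiplies the loss probability by the small per-window failure chance, which is why going from 2 to 3 replicas adds several nines rather than a constant factor.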
[23:18] <gregaf> mjevans: sorry, it's a lousy parser — but IIRC it probably means your enumeration of the nodes is off somehow
[23:19] <gregaf> dmick might remember better; he's spent more time digging around that than I have recently
[23:22] * sleinen1 (~Adium@2001:620:0:26:6de8:89b8:59ec:81a6) Quit (Quit: Leaving.)
[23:23] <junglebells> mjevans: I had a similar problem with syntax. Check VERY carefully. I ended up sorting out words and using sort uniq's to find my typo heh
[23:25] <ShaunR> Z1F15EXJST3000DM001
[23:26] <ShaunR> lol, disk serial anybody?
[23:26] <junglebells> OH ME ME ME
[23:28] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[23:28] <mjevans> Looks like it
[23:28] <ShaunR> thats because it is.
[23:28] <mjevans> http://pastebin.com/DqDCdHzu crushmap-out.txt:113 error: parse error at '' :: I redacted things a bit but just names so the line numbers match.
[23:29] <mjevans> I don't get why it's complaining at line 113 though
[23:30] <mjevans> It's right after the h-ssd definition; whose only real oddity is that it's 'unlucky' -13 and has a total weight of 0.400
[23:32] <junglebells> mjevans: You don't have weight specifications for your items below line 113. You need to have "weight [nn.nnnn]"
[23:32] <mjevans> junglebells: the default one does list them but they're commented out with #
[23:33] <mjevans> Oh crap
[23:33] <mjevans> I see now
[23:33] <junglebells> ;)
[23:33] <mjevans> How could I have missed that?
[23:33] <junglebells> Too much staring at it. Happens to all of us
[23:33] <junglebells> Or at least I'm man enough to admit it happened to me
[23:37] <mjevans> Yeah, it for sure happened to me... I think I saw the value at the end that my brain 'knew' was the weight and didn't grok the parsing failure on the word before.
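
Since crushtool only reported an empty parse error, a small pre-check can point at the offending line. This is a hypothetical helper, not part of crushtool: following junglebells' description, it assumes every bucket entry should read `item <name> weight <number>` and flags any `item` line that does not, ignoring `#` comments. The OSD names in the example are made up.

```python
import re

# Expected shape of a bucket entry, per the discussion above (assumption).
ITEM_OK = re.compile(r"^item\s+\S+\s+weight\s+\d+(\.\d+)?$")


def find_bad_item_lines(text):
    """Return (line number, line) pairs for item lines missing `weight`."""
    bad = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if re.match(r"item\s", line) and not ITEM_OK.match(line):
            bad.append((lineno, raw.strip()))
    return bad


if __name__ == "__main__":
    sample = (
        "host h-ssd {\n"
        "\tid -13\n"
        "\titem osd.0 weight 0.200\n"
        "\titem osd.1 0.200\n"  # keyword `weight` is missing here
        "}\n"
    )
    for lineno, line in find_bad_item_lines(sample):
        print(f"line {lineno}: {line}")
```

Running it over a decompiled map before `crushtool -c` would list the lines where the number is present but the word before it is wrong or missing, which is the mistake described here.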
[23:38] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:43] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.