#ceph IRC Log

IRC Log for 2015-04-19

Timestamps are in GMT/BST.

[0:10] * xcezzz (~Adium@pool-100-3-14-19.tampfl.fios.verizon.net) has joined #ceph
[0:10] * xcezzz1 (~Adium@pool-100-3-14-19.tampfl.fios.verizon.net) Quit (Read error: Connection reset by peer)
[0:16] * wicope (~wicope@0001fd8a.user.oftc.net) Quit (Remote host closed the connection)
[0:18] * K3NT1S_aw (~elt@5NZAABUW7.tor-irc.dnsbl.oftc.net) Quit ()
[0:18] * Aal (~PierreW@176.10.99.204) has joined #ceph
[0:23] * B_Rake (~B_Rake@45.56.23.41) has joined #ceph
[0:36] * B_Rake (~B_Rake@45.56.23.41) Quit (Read error: Connection reset by peer)
[0:37] * B_Rake (~B_Rake@2605:a601:5b9:dd01:d890:1e1c:11db:a30f) has joined #ceph
[0:40] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) has joined #ceph
[0:42] * rendar (~I@host83-182-dynamic.37-79-r.retail.telecomitalia.it) Quit ()
[0:48] * Aal (~PierreW@98EAABDXN.tor-irc.dnsbl.oftc.net) Quit ()
[0:49] * QuantumBeep (~xul@62-210-170-27.rev.poneytelecom.eu) has joined #ceph
[0:49] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) Quit (Ping timeout: 480 seconds)
[0:53] <flaf> litwol: if you have just one host for your 3 OSDs and if the size of your pool is > 1, then it's normal. Ceph wants to place each replica on a different host but it can't.
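
A minimal sketch of how that could be checked and, on a throwaway single-host test cluster, worked around (the pool name "rbd" and the values here are only examples):

ceph osd pool get rbd size        # show the replication factor of the pool
ceph osd pool set rbd size 1      # option 1: keep a single copy (test clusters only)
# option 2: keep size > 1 but let CRUSH place replicas on the same host,
# by putting this in ceph.conf before creating the cluster:
#   [global]
#   osd crush chooseleaf type = 0
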
[0:55] * sjmtest (uid32746@id-32746.uxbridge.irccloud.com) Quit (Quit: Connection closed for inactivity)
[1:07] * B_Rake (~B_Rake@2605:a601:5b9:dd01:d890:1e1c:11db:a30f) Quit (Remote host closed the connection)
[1:08] * B_Rake (~B_Rake@2605:a601:5b9:dd01:d890:1e1c:11db:a30f) has joined #ceph
[1:17] * B_Rake (~B_Rake@2605:a601:5b9:dd01:d890:1e1c:11db:a30f) Quit (Remote host closed the connection)
[1:18] * QuantumBeep (~xul@425AAANGJ.tor-irc.dnsbl.oftc.net) Quit ()
[1:18] * GuntherDW1 (~rapedex@nooduitgang.schmutzig.org) has joined #ceph
[1:19] <litwol> flaf: i have 3 osd hosts. each host with 1 osd. hosts are virtual machines on the same physical host. This is a test install and for simplicity all vms are on the same machine.
[1:20] <florz> litwol: do the machines have different host names?
[1:21] <litwol> yes.
[1:21] <florz> litwol: is that reflected in the crush hierarchy (ceph osd tree)?
[1:21] <flaf> Ah ok and what is the size of the pool too?
[1:22] * davidz (~davidz@2605:e000:1313:8003:dcab:185d:7b16:d24c) Quit (Quit: Leaving.)
[1:22] <litwol> http://dpaste.com/24JQ4MR
[1:23] * oms101 (~oms101@p20030057EA2BAF00EEF4BBFFFE0F7062.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[1:23] <litwol> i deleted all pools
[1:24] <litwol> now i get warning that i have too few PGs per OSD
[1:24] <litwol> even though my ceph.conf sets default to 333
[1:26] <litwol> here's the conf http://dpaste.com/3VH35X6
[1:26] <flaf> Try to create a pool with pg_num == 128 for instance
[1:27] <litwol> created
[1:27] <litwol> ceph osd pool create rbd 128 128
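
For reference, the usual rule of thumb from the ceph docs is roughly (OSDs x 100) / pool size, rounded up to a power of two, which is where a value like 128 comes from on a 3-OSD cluster; the ceph.conf defaults litwol mentioned would look something like this (his 333 is just a different chosen default):

# total PGs ~= (3 OSDs * 100) / 3 replicas = 100  ->  round up to 128
[global]
osd pool default pg num = 128
osd pool default pgp num = 128
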
[1:28] <flaf> And is "ceph -s" ok?
[1:29] <litwol> http://dpaste.com/0046F8Z
[1:29] <litwol> no
[1:30] <flaf> oh -> http://dpaste.com/24JQ4MR
[1:30] <flaf> The weights seem odd.
[1:30] <litwol> ?
[1:30] <litwol> What should it be?
[1:30] <flaf> weight == 0
[1:31] <litwol> What should it be?
[1:32] <florz> wrights should be proportional to storage space
[1:32] <florz> *weights
[1:32] <flaf> yes.
[1:32] * vjujjuri (~chatzilla@204.14.239.107) has joined #ceph
[1:32] * mattrich (~Adium@c-98-207-79-48.hsd1.ca.comcast.net) has joined #ceph
[1:32] * mattrich (~Adium@c-98-207-79-48.hsd1.ca.comcast.net) Quit ()
[1:33] <flaf> Generally people use the rule "1TB <=> weight == 1.0" but it's not mandatory.
[1:33] * oms101 (~oms101@p20030057EA5FCE00EEF4BBFFFE0F7062.dip0.t-ipconnect.de) has joined #ceph
[1:33] <flaf> You should change the weight of your OSDs.
[1:34] <flaf> http://ceph.com/docs/master/rados/operations/crush-map/#adjust-an-osd-s-crush-weight
[1:35] * litwol updates salt state files
[1:35] <litwol> oh
[1:35] <litwol> so if i have weight 0
[1:36] <litwol> it's like announcing 'i have no space' ?
[1:36] <flaf> Yes.
[1:36] <litwol> darn
[1:36] <flaf> weight == 0 <==> "Please, no data in this OSD".
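
A short sketch of the fix being discussed, using the documented reweight command (the 1.0 weights assume roughly 1 TB per OSD, per the convention flaf mentions above):

ceph osd crush reweight osd.0 1.0
ceph osd crush reweight osd.1 1.0
ceph osd crush reweight osd.2 1.0
ceph osd tree        # weights should no longer be 0 and the PGs should start peering
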
[1:37] * wushudoin (~jB@c-73-189-76-103.hsd1.ca.comcast.net) has joined #ceph
[1:43] <litwol> w00t
[1:43] <litwol> flaf: thank you so much!
[1:47] <flaf> ;)
[1:48] <litwol> flaf: is there a way to query ceph current configuration of osd's, buckets, etc?
[1:48] * GuntherDW1 (~rapedex@425AAANGU.tor-irc.dnsbl.oftc.net) Quit ()
[1:48] <litwol> flaf: i'd love to avoid relying on grepping "osd tree" to figure out if an osd.N is there and in the right host bucket
[1:50] <litwol> I'm using this ugly thing to see if osd.N exists and is in the right bucket: ceph osd tree | pcregrep -M "node-3.*(\n|.)*0 osd.3"
[1:50] <litwol> since creating osd puts it at the bottom of the list, with weight 0. so far this grep came out to be "accurate"
[1:50] <litwol> not sure how reliable it is though
[1:53] * Morde (~Pommesgab@37.187.129.166) has joined #ceph
[1:57] <flaf> litwol: i think it is possible to get json output with the --format option.
[1:59] <flaf> but personally i find the ceph osd tree readable. No?
[2:00] <litwol> flaf: i'd love to see something more concrete like "query bucket parent/root of osd.X"
[2:00] <litwol> right now i have to grep for spaces between weight 0 and osd name
[2:00] <litwol> that tells me if osd.X was moved anywhere.. but i cannot know where it was moved after that.
[2:01] <litwol> unless i add more complex logic into grep.. which gets ugly very quickly and too complicated.
[2:02] <flaf> but why do you use grep?
[2:02] <litwol> flaf: to find "state" of an osd
[2:02] <litwol> flaf: i am using configuration management: saltstack
[2:02] <flaf> just ceph osd tree is not enough?
[2:03] <flaf> ah ok
[2:03] <flaf> here is the problem
[2:03] <litwol> humanly readable osd tree is enough
[2:03] <flaf> and json output?
[2:04] <litwol> flaf: json output does not display information that i can identify as "hierarchical"
[2:04] <flaf> personally i use puppet
[2:04] <flaf> but i manage osd, crush etc. manually
[2:05] <florz> litwol: ceph osd crush dump gives you json with all the info in it
[2:05] <litwol> http://dpaste.com/3PN6BYH
[2:06] <litwol> crush dump http://dpaste.com/1CCED39
[2:06] <litwol> oh
[2:06] <flaf> to my mind, unless you have a very BIG cluster, it is better to manage osd instances manually
[2:06] <litwol> so "items" describe OSDs in the bucket
[2:07] <litwol> for now i'm just learning this thing
[2:07] <litwol> so i wrote some states to bootstrap ceph onto 4 nodes
[2:07] <litwol> so i can wipe all clean and restart
[2:07] <litwol> by restart i mean create the cluster again
[2:08] <litwol> manually running commands to get back up is a long process :-\
[2:17] <litwol> one of those "it's saturday and i'm bored" kind of projects >.<
[2:23] * Morde (~Pommesgab@2WVAABRYV.tor-irc.dnsbl.oftc.net) Quit ()
[2:23] * Hazmat (~Yopi@herngaard.torservers.net) has joined #ceph
[2:24] <litwol> all right. now to the fun part
[2:24] <litwol> i've read that i can have snapshots
[2:24] <litwol> i've done "mount -t ceph [mon ip]:/ "
[2:25] <litwol> i've watched youtube talks on ceph stating that i can create snapshot as easy as mkdir .snaps/...
[2:25] <litwol> but idont see the snaps folder
[2:27] <litwol> oh it's there!
[2:28] <litwol> not a "linux-type hidden"
[2:28] <litwol> which is just a dot-prefixed name
[2:28] <litwol> but actually hidden
[2:28] <litwol> interesting
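
For completeness, a full version of the mount command litwol elides above might look like this (the monitor address, mountpoint and keyring path are placeholders):

mount -t ceph 192.168.0.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
ls -a /mnt/cephfs          # .snap does not show up, even with -a
cd /mnt/cephfs/.snap       # but it can be entered directly
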
[2:29] <flaf> snapshot with cephfs? sure?
[2:29] * p66kumar (~p66kumar@c-67-188-232-183.hsd1.ca.comcast.net) Quit (Quit: p66kumar)
[2:29] <litwol> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040461.html
[2:30] <litwol> flaf: i'm learning what ceph has to offer... having said that.. cephfs snaps.. yes.. is there a problem with it ?
[2:33] <flaf> I knew snapshots with rados block device but not with cephfs. Do you have links please?
[2:35] <litwol> flaf: literally "mkdir .snap/[arbitrary folder/snap name]" in your ceph mount
[2:36] <litwol> flaf: ".snap" is a magic/hidden folder. it is not visible to a simple "ls"
[2:36] <litwol> flaf: but you can cd into it!
[2:38] <flaf> litwol: I'm not sure to understand. Do you have some links?
[2:38] <litwol> flaf: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/034645.html
[2:38] <flaf> what happens if I launch "cd /mnt && mkdir .snap/d1/foo"?
[2:39] <litwol> flaf: assuming your entire "/mnt" is a mounted ceph fs..
[2:39] <flaf> I think there nothing about that in the online documentation but maybe I'm wrong.
[2:39] <litwol> flaf: then your "d1/foo" will become a snapshot!
[2:40] <litwol> flaf: you can then safely delete /mnt/d1/foo, and it will continue to exist in /mnt/.snap/d1/foo
[2:40] <litwol> flaf: creating directories inside an invisible folder ".snap" has a magic behavior of creating snapshots. simple as that
[2:41] <litwol> "cd [ceph mount]; mkdir -p .snap/state-backup-$(date +%F)"
[2:43] <flaf> The snapshot will be a snapshot of what? The entire cephfs? Is it possible to snap just a sub-directory?
[2:43] <litwol> flaf: according to the reading, the snapshot will be a snapshot of the directory and its subdirectories
[2:44] <litwol> flaf: thus creating [ceph fs]/.snap/my-snap: will snapshot entire tree from very top root
[2:44] <litwol> flaf: creating [ceph fs]/foo/bar/baz/.snap/another-snap, this will snapshot only baz/ and subdigs
[2:44] <litwol> subdirs*
[2:44] <litwol> *** according to what i read***. i have not tested it
[2:45] <flaf> ok yes of course.
[2:45] <flaf> So the snap is in the "snapped" directory. So if I remove the directory, the snap will be removed too.
[2:46] <flaf> It's curious. I can't find information about that in the online documentation.
[2:46] <florz> which permissions are required to create such a snapshot?
[2:46] <litwol> flaf: probably because there is a big warning this functionality is unstable
[2:47] <flaf> litwol: yes indeed.
[2:47] <litwol> i had to run this command before ceph allowed me to mkdir anything inside .snap folder
[2:47] <litwol> ceph mds set allow_new_snaps true --yes-i-really-mean-it
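
Putting the pieces from this exchange together, a snapshot round-trip on a test cluster looks roughly like this (mountpoint and snapshot names are examples; the enable step is the command litwol just quoted):

ceph mds set allow_new_snaps true --yes-i-really-mean-it
cd /mnt/cephfs
mkdir .snap/state-backup-$(date +%F)    # create a snapshot of the whole tree
ls .snap/                               # list existing snapshots
rmdir .snap/state-backup-2015-04-19     # remove a snapshot again (example name)
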
[2:50] <litwol> flaf: do you use snapshots?
[2:50] <flaf> No, I didn't know about it before you mentioned it. ;)
[2:50] <litwol> flaf: you mentioned rados block devices
[2:51] <litwol> at this point i am interested to learn of any working snapshotting with ceph
[2:51] <litwol> with current setup ceph protects wonderfully against system breakdowns. nodes going up and down..
[2:51] <litwol> but it doesn't protect against users screwing up
[2:52] <flaf> Ah, not a lot, but it's stable for rados block device http://ceph.com/docs/master/rbd/rbd-snapshot/
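
In contrast to the cephfs case, the rbd snapshot workflow from that page is a stable, documented feature; roughly (pool/image/snapshot names here are made up):

rbd create rbd/test --size 1024            # 1 GB image in the default "rbd" pool
rbd snap create rbd/test@before-change
rbd snap ls rbd/test
rbd snap rollback rbd/test@before-change
rbd snap rm rbd/test@before-change
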
[2:52] <flaf> litwol: ceph is not a backup.
[2:53] * Hazmat (~Yopi@2WVAABRZW.tor-irc.dnsbl.oftc.net) Quit ()
[2:53] * WedTM1 (~Frymaster@37.187.129.166) has joined #ceph
[2:53] <litwol> well
[2:54] <litwol> combination of snapshots and distributed hardware == backup
[2:54] <florz> not really
[2:54] <litwol> snapshots protect individual data within a pool. distributed hardware protects an individual pool (and all of its snapshots) from disappearing.
[2:54] <flaf> If snapshots are in ceph, if you have a disaster... you have a problem ;)
[2:55] <litwol> although.. a coffee deprived sysadmin could equally 'ceph destroy cluster command' as easily as 'zpool destroy pool'
[2:56] <litwol> only way to protect against that that i can think of is having distinct clusters that are copies of one another, where destroying pools/fs in one cluster doesn't affect the other.
[2:56] <florz> there are tons of failures that ceph + snapshots does not protect against, but backups do
[2:57] <litwol> give me a few examples please.
[2:57] <florz> ceph does not magically decorrelate node failures
[2:58] <litwol> what does that mean?
[2:59] <flaf> (do you know how to get the current value of the "allow_new_snaps" value? "ceph mds get allow_new_snaps" doesn't work.)
[2:59] <litwol> i dont know how to query ceph for settings yet
[2:59] <litwol> :noh
[2:59] <litwol> flaf: you should be able to deduce its value by attempting to perform "mkdir .snap/test"
[3:00] <litwol> flaf: you will be "operation denied" if value of allow_new_snaps is false.
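
litwol's "deduce it by trying" suggestion as a small script (the mountpoint is a placeholder; the test snapshot is removed again with rmdir, which is how cephfs snapshots are deleted):

cd /mnt/cephfs
if mkdir .snap/_snaptest 2>/dev/null; then
    echo "allow_new_snaps appears to be true"
    rmdir .snap/_snaptest
else
    echo "snapshot creation refused, allow_new_snaps is presumably false"
fi
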
[3:00] <florz> litwol: ceph builds on the assumption that the probability of failure of one storage device is not strongly correlated with the failure of other storage devices - only then does the distribution and replication help
[3:01] <florz> litwol: but if, say, lightning strikes your data center's power lines, chances are that if that destroys one node of the cluster, it will also destroy other nodes, so the failure is correlated
[3:02] <litwol> that example is understandable
[3:02] <litwol> my scenario assumes multiple geographic datacenters
[3:02] <litwol> again.. this is hypothetical/learning-ceph scenario
[3:03] <litwol> 3 datacenters + snaps.
[3:03] <litwol> 3 copies on each file
[3:03] <litwol> you end up with the possibility of a nuke going off on 1 datacenter and still having a chance of full recovery
[3:03] <florz> well, that doesn't really work well for most workloads due to latency
[3:03] <litwol> hmm
[3:04] <litwol> i've read that ceph is smart enough to utilize local journal for quick writes, and OSD prioritization based on locality
[3:04] <florz> but yeah, you can try to minimize failure correlation, the important thing is to be aware that ceph does not magically remove it
[3:04] * litwol nods
[3:04] <litwol> nukes and lightnings aside
[3:05] <litwol> i would love to have protection against individual nodes going down by having multiple copies of files, and protection against users deleting data via snapshots.
[3:06] <florz> and no, you cannot "utilize local journal for quick writes" if you want fault tolerance
[3:06] <flaf> Ah ok, after "ceph mds set allow_new_snaps true", I have the warning "Error EPERM: Snapshots are unstable and will probably break your FS! Set to --yes-i-really-mean-it if you are sure you want to enable them". I'll wait. :)
[3:06] <litwol> florz: by local i mean "this osd node is within the same datacenter, journal there first, let ceph distribute it afterwards"
[3:06] <florz> if that local journal acknowledges the write and then fails, it could not recover after that
[3:07] <litwol> florz: is that not how it works?
[3:07] <florz> no, it can not
[3:07] <litwol> florz: i see. thx for the tip.
[3:07] <florz> it's impossible to build such a system
[3:07] <litwol> damn data reliability issues!
[3:08] <litwol> heh. i have development environments in which i run entire sql database inside tmpfs mounts
[3:08] <litwol> ultra high performance.. doesn't survive reboots.
[3:08] * root2 (~root@p5DDE49BB.dip0.t-ipconnect.de) has joined #ceph
[3:09] <florz> yeah ... of course, you can build a system that replicates asynchronously, and that will give you better write performance, but it will either not give you availability or not give you consistency
[3:10] <litwol> you're shattering my dreams >.<
[3:11] <litwol> how does one even go about creating backup system for a distributed filesystem?
[3:12] <litwol> here. this is where i got my ideas from: https://www.youtube.com/watch?v=La0Bxus6Fkg&t=43m10s
[3:12] <florz> if you want to survive a node crashing without losing data, you have to have a copy of that node's data elsewhere, and being sure that you have that copy requires that you wait for that other storage node's acknowledgement, which will take a certain time depending on the distance ...
[3:13] <litwol> i understand that
[3:14] <florz> and well ... it is a possible choice to just not make backups of data :-)
[3:14] <florz> I mean, ceph, probably is more reliable than a single desktop drive or something
[3:14] <litwol> no way! absolutely not acceptable. hehe. gosh never again without backups!
[3:14] <litwol> O_O
[3:14] <litwol> i have raidz3 pool at work
[3:15] <florz> depends on the data and what it would cost to recreate it and all that
[3:15] <litwol> https://bpaste.net/show/fdcf69bd6bae
[3:15] * root (~root@pD9E9D140.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[3:15] <litwol> single drive desktops is not a good comparison
[3:16] <litwol> i am considering individual massive pools + backups via an equivalent sized pool (zfs send/receive), versus ceph distributed fs with 3 copies
[3:16] <flaf> I often ask myself this question: how do big enterprises that use ceph (or any distributed storage) with lots of petabytes of data handle this? Do they do backups? Does Youtube back up all its videos?
[3:17] <florz> but in addition to correlated node failures, ceph also doesn't protect against non-benign component faults
[3:18] <litwol> i don't understand terminology :(
[3:18] <litwol> flaf: i watched a talk on how google handles data.. fascinating. basically lots and lots of copies. individual nodes die and it is considered a "normal thing". to compensate they spin up new nodes to sync quickly.
[3:19] <litwol> flaf: i understand the backups via copies approach.. but a distributed filesystem offers you a copy of the file's "now" state.. you have 3 copies of the file as it is /now/
[3:19] <florz> flaf: with youtube, for example, I would expect that they don't backup, but instead have many replicas of the originals, and then only relatively few authoritative copies of re-encoded versions - those are relatively cheap to recreate
[3:19] <litwol> if you overwrite that file you have the "new" file, and your distributed fs loses the "old" file
[3:20] <litwol> that is where snapshots become useful! you can always restore the old file
[3:20] <litwol> while having the benefit of distributed copies(backup)
[3:20] <florz> litwol: actually, that depends on the kind of fault
[3:21] <florz> litwol: if a sector of the hard disk gets corrupted, that affects the original as well as the snapshot, for example
[3:21] <litwol> florz: /me nods. yes .. that is also on my plate to investigate.
[3:21] <litwol> florz: i am explicitly testing ceph above zfs filesystem.
[3:21] <litwol> florz: zfs ensures data integrity
[3:22] <litwol> and offers COW/snapshots
[3:22] <litwol> zfs+ceph sounds like a dream come true to be honest.
[3:22] <florz> litwol: re non-benign faults: Well, a benign fault is when a component simply stops responding or simply refuses requests
[3:23] * WedTM1 (~Frymaster@2WVAABR01.tor-irc.dnsbl.oftc.net) Quit ()
[3:23] <florz> litwol: a non-benign fault is when a component processes requests, but incorrectly
[3:23] * Defaultti1 (~Bobby@nx-74205.tor-exit.network) has joined #ceph
[3:24] <florz> litwol: like, a disk returning different data than it should according to spec on a read request
[3:24] <litwol> florz: is there an "ideal environment" state in which ceph guarantees data integrity and overall quality of service? what's the ideal conditions for ceph?
[3:24] <flaf> litwol: which OS do you use for ceph + zfs backend?
[3:24] <litwol> flaf: currently i do /not/ have ceph anywhere except my virtual machine for learning/testing.
[3:24] <litwol> flaf: but i use zfs heavily on linux.
[3:24] <litwol> flaf: gentoo specifically.
[3:25] <flaf> Ok, so ZoL (ZFS on Linux)
[3:25] <litwol> flaf: that is why i am testing zfs+ceph specifically. zfs guarantees data correctness on read.
[3:25] <litwol> flaf: every read is checksumed
[3:25] <florz> litwol: infinitely many disks as close together as possible but isolated from one another as well as possible ... you can choose what to give up ;-)
[3:26] <florz> also, with disks that only fail in benign ways
[3:26] <litwol> flaf: fyi so far my findings are negative regarding ceph+zfs.
[3:26] <litwol> flaf: i've read that "ceph relies heavily on underlying filesystem for a lot of functionality, notably snapshots"
[3:27] <flaf> litwol: I have read in the ceph ML that ceph + zfs was possible but needs some tunings, doesn't work well out-of-the-box.
[3:27] <litwol> flaf: and i just discovered that creating snapshot in ceph (mkdir .snap/foobar) does /not/ create zfs snapshot
[3:28] <florz> litwol: zfs does not protect against all non-benign faults either - if the disk executes writes in a non-serializable manner, zfs probably won't be able to help you either
[3:28] <florz> (for example)
[3:28] <flaf> xfs seems to be the more stable backend fs currently and one day maybe it will be btrfs.
[3:32] <litwol> i wish i knew what 'benign failures' are.. i mean, specific examples/list of failures.
[3:32] <florz> 03:22 < florz> litwol: re non-benign faults: Well, a benign fault is when a component simply stops responding or simply refuses requests
[3:33] <litwol> florz: is 'component' an osd?
[3:33] <litwol> doesn't ceph rebuild failed OSDs after few minutes?
[3:33] <florz> a component is anything that's part of the system
[3:34] <florz> RAM, CPU, hard disk, kernel, ceph, ...
[3:34] <litwol> sad
[3:35] * KevinPerks (~Adium@ip-64-134-99-56.public.wayport.net) Quit (Quit: Leaving.)
[3:36] <litwol> my motivation for distributed fs is, as i described previously, : i'm tired of handling backups and moving data around. i want something simple such that moving data involves few manual steps such as (1) start new node in destination, (2) shut down source after sync finished.
[3:36] <florz> a benign fault is essentially a fault that is detected as such and prevented from propagating
[3:36] <litwol> distributed fs solves that scenario wonderfully.
[3:36] <litwol> but if i can't guarantee data integrity.. ah that's problematic on all levels.
[3:37] <litwol> i've read that ceph scrubs itself once a week
[3:37] <litwol> to repair damaged files
[3:37] <florz> "guarantee[d] data integrity" doesn't exist
[3:37] <litwol> but it does not checksum files during read... so before scrub you might be reading broken files
[3:38] <florz> also, lack of inconsistency does not necessarily mean correct
[3:38] <florz> it's all a matter of probabilities
[3:38] <litwol> amount of information i've learned from you today makes me feel like crossing a marathon finish line, then falling on the floor and letting the audience come around and kick me some more while i'm down.
[3:39] <florz> and the lower you want the probability that you might lose data, the more expensive it becomes
[3:39] <florz> hrrhrr =:-)
[3:39] <litwol> "i've leraned a lot, but damn all the possibilities of failures hurt"
[3:40] <litwol> florz: okey lets say i /do/ make absolutely separate backups.
[3:40] * haomaiwang (~haomaiwan@114.111.166.250) Quit (Read error: Connection reset by peer)
[3:40] <litwol> florz: given what we just said.. ceph does not guarantee read data integrity..
[3:40] * haomaiwang (~haomaiwan@114.111.166.249) has joined #ceph
[3:40] <litwol> florz: that means my backup has all the possibility of /backing up/ faulty data.
[3:41] <florz> yeah
[3:41] <florz> not necessarily likely, but possible
[3:41] * litwol sigh
[3:41] <litwol> that brings me back to zfs
[3:41] * haomaiwang (~haomaiwan@114.111.166.249) Quit (Read error: Connection reset by peer)
[3:41] <litwol> zfs guarantees data read correctness based on the original written data and checksum
[3:41] <florz> that makes it less likely, still not impossible
[3:42] * haomaiwang (~haomaiwan@114.111.166.249) has joined #ceph
[3:43] <litwol> okey. lets try a different angle on this
[3:43] <litwol> what can i do to my [test/learning] cluster to encourage failures?
[3:43] <litwol> cycling nodes up and down to simulate osd failures .. that i can do
[3:43] <litwol> what are some non-obvious failures i can simulate in a test?
[3:44] <florz> flip some bits in the VM's RAM?
[3:44] <litwol> don't know how to do that
[3:45] <florz> doesn't make much sense any how, I guess
[3:45] <litwol> in my zfs pool i've had two "failures" which i was able to consistently recover from
[3:46] <litwol> 1) disk failed - recovered due to triple parity. replace disk & scrub.
[3:46] <florz> the real problem are failures that you don't notice ;-)
[3:46] <litwol> 2) user deletes file [by mistake] - recovered due to hourly + daily + monthly rolling snapshots
[3:47] * haomaiwa_ (~haomaiwan@114.111.166.249) has joined #ceph
[3:47] * haomaiwang (~haomaiwan@114.111.166.249) Quit (Read error: Connection reset by peer)
[3:47] <litwol> florz: between ECC ram and zfs data checksumming i dont know how data within this "closed loop" can "fail" (fail == corrupt/lose/etc)
[3:47] <litwol> florz: of course data moving over network wire can get corrupted.. but that is way way way out of filesystem scope
[3:47] * litwol thinks
[3:48] <litwol> yeah i don't know enough to come up with a failure scenario
[3:48] <florz> in many ways, a SATA cable is just a network connection ;-)
[3:50] <florz> one way that things can fail is simply by failing more than protection mechanisms can cope with
[3:50] <florz> like, more bits flipping than ECC will catch
[3:50] <litwol> hmm
[3:50] <litwol> that sounds like a far fetched scenario... possible. but err.. how likely?
[3:52] <florz> there is a certain probability for any bit to flip in a given span of time, let's call that p, if you assume individual bit flips to be uncorrelated, n bits flipping in the same interval will happen with probability p^n
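
florz's estimate, written out (p and the independence assumption are his):

\[
P(\text{one given bit flips in the interval}) = p,
\qquad
P(n \text{ given bits all flip in the same interval}) = p^{n}
\]
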
[3:52] <litwol> you are killing me -_-
[3:53] <litwol> so much for a brain-tearing-and-fun saturday project
[3:53] <florz> so ... yeah, the more bits, the less likely, but never impossible
[3:53] <litwol> >.<
[3:53] <litwol> lol
[3:53] <florz> *g*
[3:53] * Defaultti1 (~Bobby@2WVAABR1M.tor-irc.dnsbl.oftc.net) Quit ()
[3:53] <litwol> s/tearing/teasing/
[3:53] <florz> heh :-)
[3:53] * Spessu (~JamesHarr@46.182.106.190) has joined #ceph
[3:56] <florz> the point is, yeah, ECC reduces the rate of silent corruption massively, but not to zero ... and on the other hand, adding more machines with more RAM and thus more bits to flip in any given time interval also increases the probability that somewhere in your distributed system such a fault will happen
[3:58] <florz> but simple corruption indeed is the easier problem to deal with: add checksums until the probability of undetected corruption is low enough :-)
[3:59] <florz> the hard problem is correct time behaviour
[4:00] <litwol> hmm
[4:00] <florz> a storage system doesn't just store some static data, but rather has to execute write and read requests, thus constantly changing the data that is stored
[4:00] <florz> and there are certain constraints on which writes have to be visible in which reads
[4:01] <litwol> while this problem is possible.. i dont think it is super critical in an office environment where the risk is simply losing a word document or two.. or a few bits flipped in a video file.
[4:01] <florz> well, think of bits in the file system meta data flipping :-)
[4:02] <litwol> that's beyond me. i just don't know things on that level.
[4:02] <florz> like, structures where the file system stores which sectors belong to which file in which order
[4:02] <litwol> right
[4:02] <litwol> so i lose a file..
[4:03] <florz> or a directory
[4:03] <litwol> it should not be a big deal for a COW + snapshot filesystem
[4:03] <litwol> a flipped bit is as bad as a user deleting a file and calling me with "oops i did it again!"
[4:03] <litwol> 'snapshot rollback' and you are back to the pre-flip bit state.
[4:03] <litwol> is that not so?
[4:03] <florz> nope
[4:03] <florz> well, it can be, depends on which bits flip :-)
[4:04] <litwol> oh damn you
[4:04] <litwol> lol
[4:04] <litwol> this is so frustrating. i'm going to go wash dishes.
[4:04] <florz> I mean, just imagine a COW snapshot where there was nothing written so far
[4:04] <florz> then the original and the snapshot actually share the same storage
[4:05] <florz> so if a bit flips in the original, the snapshot will be affected as well
[4:05] <litwol> florz: zfs rollback doesn't work per-file. it rolls back the entire filesystem. that means the checksum for the old 'correct file state' also rolls back.
[4:05] <litwol> oh
[4:05] <litwol> you mean the flip happens between the original store and the first snapshot?
[4:05] <litwol> yeah.. very unfortunate
[4:06] <florz> no, after the snapshot
[4:06] <florz> the snapshot is just an abstraction
[4:06] <litwol> oh
[4:06] <litwol> hmm interesting
[4:06] <florz> that's what COW means
[4:07] <florz> data gets copied only at the moment you try to write over it
[4:07] <litwol> how does one even protect from this scenario?
[4:07] * Aid2 (~quickshot@equuleus.whatbox.ca) has joined #ceph
[4:07] <Aid2> hi
[4:08] <florz> and all the snapshot does is take care of always keeping around a copy of the old data and make sure accesses through the snapshot always see the old state, while accesses through the "original" see only its own state including any changes that happen there
[4:09] <florz> but that's all based on the assumption that the storage doesn't get corrupted, of course
[4:09] <litwol> yeah that is an interesting detail
[4:09] <litwol> i didn't think of that
[4:09] <litwol> based on this, it sounds like even zfs can get corrupted
[4:10] <litwol> https://clusterhq.com/blog/file-systems-data-loss-zfs/
[4:10] <litwol> florz: http://docs.oracle.com/cd/E19253-01/819-5461/gaypb/index.html
[4:10] <litwol> first sentence
[4:11] <litwol> "With ZFS, all data and metadata is verified using a user-selectable checksum algorithm. "
[4:12] <florz> yeah - all that means is that if your "original" gets unreadable due to a flipped bit that's caught by the checksum, the snapshot will as well
[4:12] <Aid2> yep that's what I understand of ZFS too :)
[4:12] <litwol> by the way
[4:12] <litwol> lets consider data capacity/efficiency
[4:12] <litwol> lets assume ideal scenarios (ie no funky failures)
[4:13] <litwol> raidz3 with 11 disks (8 data + 3 parity)
[4:13] <litwol> versus ceph with 11 osds.
[4:13] <litwol> ceph data integrity strategy is 3 copies
[4:13] <litwol> so we get a data density of 11/3
[4:13] <litwol> 3.6
[4:13] <litwol> 3.6 disks worth of data
[4:14] <litwol> with raidz3 11 disk setup you get 8 disks data capacity
[4:14] <litwol> lets double this to introduce /true/ backup via remote secondary copy
[4:15] <litwol> 22 disks to store 8 disks worth of data (original raidz3 + clone backup, 11 disks each)
[4:16] * litwol scratches head.
[4:16] <litwol> i think i am not doing good math here
[4:16] <litwol> brb
[4:16] * vjujjuri (~chatzilla@204.14.239.107) Quit (Ping timeout: 480 seconds)
[4:17] <litwol> right.
[4:17] <florz> well, you have to be careful to compare setups with similar reliability
[4:17] <litwol> and with ceph's 22 disks and 3 copies you end up with 7.3 disks worth of data
[4:17] <florz> at least as far as that is relevant to your needs
[4:17] <litwol> actually maybe this is not a fair comparison. because i am comparing ceph's 3 copies, with zfs's 2 copies (original + backup)
[4:18] <Aid2> Erasure coding (8+3) vs RAIDZ3 (8+3) would be better
[4:18] <litwol> maybe more correct to compare 33 disks ceph: 11 disks of data, with 3 copies of zfs raidz3, 33 disks worth of 8 disks of data
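
The comparison litwol is working toward, written out (numbers from the chat, all filesystem overheads ignored):

\[
\text{replication: usable} = \frac{N}{r},
\qquad
\text{raidz3 / EC } (k{+}m)\text{: usable} = N \cdot \frac{k}{k+m}
\]
\[
33 \text{ disks, } r = 3:\ \frac{33}{3} = 11
\qquad\qquad
3 \text{ independent raidz3 } (8{+}3) \text{ copies: } \frac{33 \cdot \tfrac{8}{11}}{3} = 8
\]
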
[4:19] <litwol> i'm not sure what erasure coding is
[4:19] <Aid2> Erasure coding is a software-level version of traditional raid
[4:19] <Aid2> erasure coding allows for RAIDZ#
[4:20] <Aid2> similar to how with btrfs you can specify higher amounts of parity bits per data bits
[4:20] <florz> also, ceph has higher availability if distributed over multiple nodes
[4:20] <Aid2> Also allows for distributed rebuilds
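
What that looks like in ceph itself (the profile and pool names are made up; at this point RBD/CephFS normally sit on an erasure-coded pool only via a replicated cache tier, which is the setup Aid2 alludes to later):

ceph osd erasure-code-profile set raidz3like k=8 m=3
ceph osd pool create ecpool 128 128 erasure raidz3like
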
[4:21] <Aid2> Is ceph OSD bound or can you do it over multiple OSDs?
[4:21] <florz> if the single zfs machine dies, the data might be fine, but it still will be unavailable
[4:21] <Aid2> IE. If you lose 2 disks in a JBOD can you rebuild that data amongst other JBODS or just the one?
[4:22] <litwol> Aid2: to quickly bring you up to speed on the convo. i am looking into a filesystem that solves two things (1) hardware integrity via distribution (ceph does that), (2) user error prevention (oops i deleted a file. want it back!) via snapshots.
[4:22] <litwol> my current setup is zpool 2x(8+3)
[4:22] <Aid2> Thanks. I also went off on a tangent :)
[4:23] <litwol> i've read that cephfs snapshots are unstable
[4:23] * Spessu (~JamesHarr@2WVAABR2Y.tor-irc.dnsbl.oftc.net) Quit ()
[4:23] <Aid2> We're looking at various object stores right now and ceph is on the view
[4:23] <florz> Aid2: I don't quite understand your question
[4:23] <litwol> i dont know of a reliable way to stress test this.. ie, i dont know enough about ceph to simulate breakage.
[4:24] * Szernex (~Jones@onions.mit.edu) has joined #ceph
[4:24] <litwol> also i was testing ceph's documented claim that ceph relies heavily on underlying filesystem for a lot of heavy lifting.. such as taking snapshots. i tested ceph over zfs and found that cephfs snapshot (mkdir .snap/foo) does NOT create zfs snapshot.
[4:24] <litwol> so there's that .
[4:24] <Aid2> florz: let me try again, let's say I have 3 OSDs, each with 60 drives attached. Let's say two drives in a single OSD die. Do the two drives' objects get rebuilt on the remaining 58 or the remaining 178?
[4:25] <litwol> Aid2: from what i've read, you should have 1 osd per disk.
[4:25] <litwol> Aid2: what you are describing is 3 nodes, with 60 disks/osds each ?
[4:25] <florz> Aid2: you generally have one OSD per drive (unless those drives are somehow abstracted into some virtual device, like a RAID or LVM or something)
[4:26] <litwol> or 3 node with 20 disks each
[4:26] <litwol> ?
[4:26] <Aid2> yep, I got the terminology wrong
[4:26] <Aid2> let's say I have 3 nodes with 60 drives each. two drives in one node die; does it rebuild over the remaining 58 or the remaining 178?
[4:27] <litwol> Aid2: 178
[4:27] <florz> litwol: with the default crush config, it gets rebuilt from the remaining 120 ;-)
[4:27] <florz> erm
[4:27] <florz> Aid2: with the default crush config, it gets rebuilt from the remaining 120 ;-)
[4:27] * litwol thinks
[4:27] <litwol> ohhhh
[4:27] <litwol> right!
[4:27] <Aid2> so it doesn't rebuild on the one with 58? Interesting
[4:28] <litwol> hmm
[4:28] * bkopilov (~bkopilov@bzq-79-183-107-179.red.bezeqint.net) Quit (Read error: Connection reset by peer)
[4:29] <litwol> i could have sworn i've seen a diagram showing ceph's "data integrity" scheme
[4:29] <florz> Aid2: by default, Ceph avoids placing copies of the same data on the same machine so that it's more likely to survive if the machine dies ... so there is nothing on that machine that could be used to rebuild the failed osds
[4:29] <Aid2> cool
[4:29] <Aid2> that would be a FAST rebuild
[4:29] <Aid2> :D
[4:29] <litwol> each file/object broken into "blocks" based on your configuration, and then each osd contains random copies of blocks
[4:29] <florz> yeah, fast, but mostly useless ;-)
[4:29] <flaf> err... it gets rebuilt from the remaining 120 to the 58 OSDs.
[4:30] <Aid2> Understood, thanks for the clarification
[4:31] <Aid2> What's the biggest ceph cluster you guys have seen?
[4:31] <Aid2> node wise
[4:32] <litwol> biggest cluster i've managed runs inside 4 virtual machines on local host with 3 osds ;)
[4:32] <litwol> hehehe
[4:33] <litwol> sandbox
[4:33] <Aid2> gotta start somewhere :)
[4:33] <litwol> i really like ceph so far
[4:33] <litwol> like.. really really like
[4:33] <litwol> i just need few more bits clarified before i start using
[4:33] <litwol> i'm desperately trying to get snapshots to work
[4:33] <Aid2> There's a lot to it :)
[4:33] <litwol> i /do/ have "mkdir .snap/foo" working.
[4:34] <litwol> i do have those snapshots in cephfs, against all warnings of how unstable it is.
[4:34] <litwol> so that works.
[4:34] <Aid2> What's unstable about it?
[4:34] <litwol> now i'm trying to find out ways to snapshot ceph cluster.
[4:34] <Aid2> It just links to the old objects.. right?
[4:34] <litwol> Aid2: i've no idea. i'm just regurgitating what i've read.
[4:35] <Aid2> Gotcha.
[4:35] * KevinPerks (~Adium@ip-64-134-99-56.public.wayport.net) has joined #ceph
[4:36] <litwol> Aid2: for example. attempting to "mkdir .snap/foo" (ie, create snapshot) will trigger this error: Error EPERM: Snapshots are unstable and will probably break your FS! Set to --yes-i-really-mean-it if you are sure you want to enable them
[4:36] <litwol> Aid2: i mean, this one: ceph mds set allow_new_snaps true --yes-i-really-mean-it
[4:36] <Aid2> haha nice
[4:37] <litwol> after this command you can create snapshots on cephfs level by creating a folder such as '.snap/foo-backup'
[4:37] <Aid2> Doesn't look like it's production ready yet then
[4:38] <litwol> that's the thing
[4:38] <litwol> i wish i knew of a way to stress test it
[4:38] <Aid2> What about lots of files with lots of changes to them?
[4:38] <litwol> i'm so used to running --dev version of various products
[4:38] <litwol> it is just a matter of stress testing it to /your/ needs
[4:40] <Aid2> Yep..
[4:40] <Aid2> I have some employee's that are going to work on testing ceph this summer for me :)
[4:41] <litwol> hmm
[4:41] <litwol> i suppose it would not be difficult to simulate a test similar to badblocks
[4:42] <Aid2> I am looking at performance testing RADOS Block Dev on top of Erasure coding with the Pool caching feature.. I want to see how viable this could be to replace enterprise disks
[4:42] <florz> the hard part to get correct about distributed systems is coordination between nodes under all possible reorderings
[4:42] <litwol> write a script which creates a few files in ECC tmpfs. each file in "dir1", "dir2", "dir3"... each file has /the same name/
[4:43] <litwol> then continuously loop over copying that to ceph, then read & checksum
[4:43] <florz> that's essentially impossible to evaluate just by testing
[4:43] * rotbeard (~redbeard@2a02:908:df10:d300:76f0:6dff:fe3b:994d) Quit (Quit: Verlassend)
[4:43] <litwol> what does that mean? "reorderings"?
[4:43] * KevinPerks (~Adium@ip-64-134-99-56.public.wayport.net) Quit (Ping timeout: 480 seconds)
[4:44] <florz> essentially orders in which things can happen
[4:44] <Aid2> Well, there's functional tests first.. baby steps
[4:44] <litwol> what can you do with rados block device?
[4:44] * DV (~veillard@2001:41d0:1:d478::1) Quit (Ping timeout: 480 seconds)
[4:44] <litwol> i haven't read that far into the documentation
[4:45] <Aid2> rbd is a way to simulate a block device in linux and write to ceph object store instead of a RAID of HDD/SSD
[4:46] <Aid2> that's a simplified version
[4:46] <litwol> i'm assuming you mkfs that block device?
[4:46] <Aid2> yep
[4:46] <litwol> hmm
[4:46] <litwol> couldn't i then just.. instead of using cephfs
[4:46] <litwol> use rbd, feed that device to zfs ?
[4:46] <Aid2> It'll allow you to use legacy systems that rely on block devices to work with cheap object storage
[4:47] <litwol> create pool out of it and get magic snapshot functionality?
[4:47] <Aid2> I think rbd is pretty tied into Ceph but there's other ones out there that thinking _could_ work
[4:48] <Aid2> Anywho, I am heading out for the night. Good luck litwol
[4:48] <litwol> gn
[4:49] <florz> litwol: in principle you could, yeah, different question is the performance you'll get from it ;-)
[4:50] <litwol> unlike cephfs, rados claims to have stable snapshot support
[4:50] <litwol> that, or i didn't find documentation that states rbd snaps to be unstable
[4:50] <florz> what would be nice would be some way for zfs to force reads from different replicas
[4:51] <litwol> with rbd snaps i could avoid using zfs snapshots, and instead simply use zfs for the filesystem and its promised checksum on read.
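
A sketch of the layering litwol is describing (image name, size and pool are made up; an rbd-level snapshot of a live zfs pool should ideally be taken with the pool exported or otherwise quiesced):

rbd create rbd/zfsdisk --size 10240     # 10 GB image
rbd map rbd/zfsdisk                     # exposes it as e.g. /dev/rbd0
zpool create tank /dev/rbd0             # single-vdev pool: zfs detects corruption but cannot self-heal it
rbd snap create rbd/zfsdisk@nightly     # snapshot at the rbd layer
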
[4:51] <florz> then zfs' auto-recovery could actually help ceph with recovering from corrupted storage
[4:51] <litwol> florz: isn't forcing reads from specific nodes ceph's scope? zfs would just say 'give me data', and then ceph combines striped reads
[4:52] <litwol> florz: that's the cool part about zfs.. if it catches an error during read and checksum verification, it repairs and writes the correct data back.
[4:52] <florz> yeah, I mean for the error case, not for normal operation
[4:52] <litwol> ceph would see it as (1) file read, (2) different file (repaired old file) written
[4:52] <florz> nope
[4:52] <litwol> no?
[4:53] <florz> ceph would see it as (1) file read
[4:53] <litwol> florz: but zfs writes repaired data back.
[4:53] * Szernex (~Jones@5NZAABU7N.tor-irc.dnsbl.oftc.net) Quit ()
[4:53] * KrimZon (~Sirrush@195.169.125.226) has joined #ceph
[4:53] <florz> but zfs does not perform magic ;-)
[4:53] <litwol> a repaired write would be performed right after reading corrupted file
[4:53] <florz> when zfs has only one backend store, it cannot repair detected corruption
[4:54] <litwol> actually i need to confirm that. where would zfs store parity data on non-raid zfs pools. ie, single disk pool.
[4:54] <litwol> i c
[4:54] <florz> that's why it would be nice if it could make use of the redundancy "inside" ceph
[4:54] <litwol> very interesting
[4:55] <florz> forcing a read from a different copy, and when that matches the zfs checksum, write that back to all replicas to repair the damage
[4:56] <florz> well, you could of course have zfs store multiple copies on ceph, but then you'd have duplication at multiple levels, that would be rather expensive
[4:57] <litwol> i wonder how expensive ceph scrub is ?
[4:57] <litwol> lets say we forego on-read integrity check, and substitute it with /more frequent/ data scrubs
[4:58] <litwol> can i basically have continuous scrub going at all times?
[4:58] <litwol> even more amazing would be to configure it to something like 'consume 10% resources to scrub, give 90% to regular operation'
[4:58] <florz> well, it consumes I/O bandwidth on your storage devices ...
[4:59] <litwol> then let scrub be never stopping
[5:00] <florz> well, the problem is that due to locality effects, that doesn't scale linearly
[5:00] <florz> especially with mechanical disks, having two concurrent readers massively reduces throughput vs. a single reader
[5:05] <florz> (every seek consumes about as much time as reading a megabyte or so of data, so, if you read in 1 MB chunks, two readers will give you about half of the total throughput vs. a single reader)
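
The knobs for "how much scrubbing, when" live in ceph.conf; the values below are roughly the defaults of this era (option names are real, values approximate):

[osd]
osd max scrubs = 1                 # concurrent scrub operations per OSD
osd scrub load threshold = 0.5     # skip scheduled scrubs when the host load is above this
osd scrub min interval = 86400     # don't scrub a PG more often than daily (when load permits)
osd scrub max interval = 604800    # force a scrub at least weekly
osd deep scrub interval = 604800   # weekly deep scrub reads and checksums object data
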
[5:06] * DV (~veillard@2001:41d0:a:f29f::1) has joined #ceph
[5:06] <flaf> Question: if I have 2 datacenters connected via a fiber...
[5:06] <litwol> i thought the idea of ceph is that it will stripe reads over multiple osds?
[5:07] <litwol> so when you read, you actually read slices of single file from multiple nodes, and recombine for client.
[5:07] <litwol> is this not so ?
[5:07] <flaf> 12 OSDs in DC1 and 12 OSDs in DC2. How can I evaluate the throughput I need in the fiber?
[5:08] <litwol> florz: nowadays disks read at a maximum of 150MB/s
[5:08] <litwol> that's sustained sequential reads. at least on my WD RE 3TB disks it is so. 170 MB/s on the outer tracks, ~90 MB/s on the inner ones.
[5:09] <litwol> ^ tested using badblocks
[5:09] <florz> yeah, the exact numbers vary
[5:09] <litwol> at best and highest throughput you will get 12x150MB/s
[5:09] <florz> read throughput, yeah
[5:10] <florz> and with constant scrubbing a quarter of that, as a very rough estimate
[5:10] <litwol> yeah
[5:11] <litwol> flaf: single 10 Gbit link should be enough :)
[5:11] <litwol> at conservative estimates
[5:11] <litwol> double that if you intend to move around backups and such.
[5:12] <litwol> by the way
[5:12] <flaf> litwol: is there a formula to evaluate this, as a function of the disk throughput etc.?
[5:12] <florz> whether you actually need that depends on the workload, though :-)
[5:12] <litwol> just tested a highly simplistic cephfs snapshot
[5:12] <litwol> it worked
[5:12] <litwol> i've shut down my vms
[5:12] <litwol> booted them out-of-order
[5:13] <litwol> as soon as enough services came up to establish quorum and data parity.. it worked
[5:13] <litwol> i was able to rsync data out of .snap/ folder back into the root.
[5:14] <florz> flaf: well, just sum all the disks' throughput?
[5:14] <litwol> This is amazing
[5:15] <florz> flaf: doesn't mean you actually need that much bandwidth, but you won't ever be able to read more data than all disks at maximum load are able to deliver :-)
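
Putting numbers on that upper bound for one 12-OSD datacenter, using litwol's 150 MB/s per-disk figure from above (how much of it actually crosses the inter-DC link depends on the CRUSH rules and the workload):

\[
B_{\max} = \sum_{i=1}^{12} b_i \approx 12 \times 150\ \text{MB/s} = 1.8\ \text{GB/s} \approx 14.4\ \text{Gbit/s}
\]
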
[5:15] <litwol> flaf: i run 'badblocks' on all new disks before i put them into use. that gives me both quality assurance of the disk ("at the time") and overview of its throughput
[5:15] <litwol> flaf: and 7200 rpm disks nowadays read at about 150-170MB/s
[5:15] <litwol> you can equally simulate that with "dd" for writes.
[5:15] <litwol> and reads
[5:16] <florz> (modulo reads from cache, of course ...)
[5:16] <litwol> do that on each disk and then sum their throughputs.. because that is what ceph claims to do.. ceph scales linearly with every added osd's throughput performance.
[5:16] <litwol> caches are small
[5:16] <litwol> choose a file bigger than a gig to dd and you'll be fine
[5:18] <litwol> flaf: run this: time dd if=/dev/zero of=testing-throughput bs=1M count=$((1024*10))
[5:18] * qybl_ (~foo@kamino.krzbff.de) has joined #ceph
[5:18] <florz> with a decent-size machine that'll mostly tell you memory bandwidth ;-)
[5:18] <litwol> flaf: on each disk. you can script it of course.
[5:19] <litwol> florz: hmm? how come?
[5:19] <litwol> florz: as long as you are writing data bigger than your cache and ram..
[5:19] <florz> because the write happens to the page cache only
[5:19] <litwol> it'll hit disk directly
[5:19] <florz> yeah, that's what I mean
[5:19] <florz> 10 GB RAM isn't exactly uncommon anymore
[5:20] <litwol> that's why i separated the "10" above :). to make it very easy to crank up
[5:20] <litwol> anyhow. this is very crude benchmark.
[5:20] <litwol> use ssomething like bonnie++
[5:20] <litwol> or stress
[5:20] <litwol> whatever you want :)
[5:20] <florz> just write to the raw device with dd
[5:20] <florz> also avoids filesystem overhead
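
florz's suggestion as concrete commands (/dev/sdX is a placeholder; the write test destroys whatever is on that device, so only use a scratch disk):

dd if=/dev/zero of=/dev/sdX bs=1M count=10240 oflag=direct    # raw write test, bypasses page cache and filesystem
dd if=/dev/sdX of=/dev/null bs=1M count=10240 iflag=direct    # raw read test
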
[5:20] * qybl_ (~foo@kamino.krzbff.de) Quit ()
[5:20] <litwol> true true
[5:21] * qybl (~foo@2a03:b0c0:2:d0::17f:4001) Quit (Ping timeout: 480 seconds)
[5:22] <litwol> can't wait to land a project big enough where i can buy few 10Gbe servers to try this out on :-D
[5:22] <litwol> florz: by the way. earlier i did cringe worthy math with regards to 3 copies policy.
[5:22] <litwol> florz: what about just 2 copies?
[5:23] * Vacuum_ (~vovo@i59F7951E.versanet.de) has joined #ceph
[5:23] <florz> flaf: also, if you have anything database-like running on it, latency is probably more important than bandwidth
[5:23] * KrimZon (~Sirrush@2WVAABR4Z.tor-irc.dnsbl.oftc.net) Quit ()
[5:23] * RaidSoft (~Deiz@justus.impium.de) has joined #ceph
[5:23] <flaf> florz: litwol: ok, I see. And if the OSD use journal in SSD, how can I adapt the equation "need = Sum(bandwidth of disks)"?
[5:23] <florz> litwol: what about just 2 copies?
[5:24] <florz> flaf: the journal is not involved in reading
[5:24] <litwol> florz: i'm wondering in terms of parity and nodes dying. how many nodes can i lose before entire cluster falls apart.
[5:24] <litwol> i'd love to see a formula to calculate this stuff
[5:25] <florz> litwol: one - as long as there is at least one copy remaining, you are fine
[5:25] <litwol> one guaranteed,
[5:26] <florz> litwol: (that means an up-to-date copy, though, out-of-date copies don't help)
[5:26] <litwol> but could be 2 with reduced chance i think
[5:26] <flaf> florz: reading is the only operation that I should take into account to evaluate the bandwidth I need between DC1 and DC2?
[5:26] <litwol> because of how object slices are distributed
[5:26] <litwol> they are shuffled around.
[5:26] <florz> flaf: well, the maximum of read and write speed, but at least with mechanical disks, those tend to be in the same ballpark
[5:27] <florz> litwol: oh, I guess I kinda mis-read/mis-thought
[5:27] * qybl (~foo@kamino.krzbff.de) has joined #ceph
[5:28] <florz> litwol: depends on what kinds of nodes and what you mean by "fall apart" ;-)
[5:28] <litwol> 'fall apart' == non functional
[5:28] <litwol> all data lost
[5:28] <florz> like, "unable to recover" or "currently unavailable"
[5:29] <litwol> currently unavailable (assuming the node just shut down, disks are still fine)
[5:29] <litwol> but that can as easily be assumed to 'drives got nuked'
[5:29] <litwol> florz: 'unable to recover'
[5:30] * Vacuum (~vovo@88.130.199.58) Quit (Ping timeout: 480 seconds)
[5:31] <litwol> florz: by the way thx for the chat. this gave me a lot to think about.
[5:31] <florz> well, if you permanently lose more than a minority of the monitors, you won't be able to recover
[5:32] <florz> losing osds will never affect data that isn't (nominally) stored on them
[5:53] * RaidSoft (~Deiz@5NZAABU9P.tor-irc.dnsbl.oftc.net) Quit ()
[5:53] * brianjjo (~theghost9@46.183.220.132) has joined #ceph
[6:07] * bkopilov (~bkopilov@nat-pool-tlv-t.redhat.com) has joined #ceph
[6:12] * ft1 (~fattaneh@194.225.33.201) has joined #ceph
[6:15] <ft1> hi all, i want to run "ceph-deploy mon create mon1" but there is no admin socket in /var/run/ceph
[6:23] * brianjjo (~theghost9@5NZAABVA6.tor-irc.dnsbl.oftc.net) Quit ()
[6:23] * jakekosberg (~danielsj@marylou.nos-oignons.net) has joined #ceph
[6:29] <litwol> florz: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg13178.html "It is probably moderately
[6:29] <litwol> stable if snapshots are only done at the root or at a consistent point in
[6:29] <litwol> the hierarchy (as opposed to random directories), but there are still some
[6:29] <litwol> basic problems that need to be resolved."
[6:29] <litwol> oops. sry for line broken paste.
[6:35] <litwol> fascinating
[6:35] <litwol> rm -r .snap/foo, doesn't work
[6:35] <litwol> rmdir .snap/foo, works.
[6:37] <flaf> "the root", ok I see, but "a consistent point in the hierarchy", I don't...?
[6:48] * ft1 (~fattaneh@194.225.33.201) Quit (Quit: Leaving.)
[6:48] * ft1 (~fattaneh@194.225.33.201) has joined #ceph
[6:53] * jakekosberg (~danielsj@425AAANJ6.tor-irc.dnsbl.oftc.net) Quit ()
[6:53] * TGF (~Random@exit1.ipredator.se) has joined #ceph
[6:55] <litwol> flaf: same as root..
[6:55] <litwol> flaf: basically "i always make snaps inside folder foo/bar/baz/.snap" or inside "/var/lib/mysql/.snap" or in "/home/.snap"
[6:55] <litwol> flaf: and never in any other place.
[6:55] <litwol> ie. do not mix /.snap and /home/.snap
[6:55] <litwol> choose one or the other.
[6:56] <litwol> ^^ assuming all of the above directories in my examples are hosted within cephfs mount.
[7:23] * TGF (~Random@5NZAABVDR.tor-irc.dnsbl.oftc.net) Quit ()
[7:58] * DoDzy (~GuntherDW@ncc-1701-a.tor-exit.network) has joined #ceph
[8:12] <flaf> litwol: ah ok thx. If i snap a directory, i must not snap one of its subdirectories.
[8:27] * DoDzy (~GuntherDW@5NZAABVGN.tor-irc.dnsbl.oftc.net) Quit ()
[8:32] * Skyrider (~Pommesgab@tor-exit2-readme.puckey.org) has joined #ceph
[8:39] * jclm1 (~jclm@ip24-253-45-236.lv.lv.cox.net) has joined #ceph
[8:43] * jclm2 (~jclm@ip24-253-45-236.lv.lv.cox.net) has joined #ceph
[8:44] * jclm2 (~jclm@ip24-253-45-236.lv.lv.cox.net) Quit ()
[8:45] * jclm (~jclm@ip24-253-45-236.lv.lv.cox.net) Quit (Ping timeout: 480 seconds)
[8:49] * jclm1 (~jclm@ip24-253-45-236.lv.lv.cox.net) Quit (Ping timeout: 480 seconds)
[8:52] * oro (~oro@80-219-254-208.dclient.hispeed.ch) has joined #ceph
[9:02] * Skyrider (~Pommesgab@5NZAABVHY.tor-irc.dnsbl.oftc.net) Quit ()
[9:02] * osuka_ (~Thayli@212.7.194.71) has joined #ceph
[9:03] <litwol> flaf: bingo.
[9:03] <litwol> flaf: the /goal/ is to not have such restriction. and practically you *can* create subdirectory snaps... however you are strongly advised to avoid that.
[9:06] <litwol> http://ceph.com/category/releases/ is developing fairly quickly.
[9:14] * oro (~oro@80-219-254-208.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[9:20] * vbellur (~vijay@122.178.242.128) Quit (Ping timeout: 480 seconds)
[9:22] * B_Rake (~B_Rake@2605:a601:5b9:dd01:61a3:7590:bf68:bcf3) has joined #ceph
[9:31] * B_Rake (~B_Rake@2605:a601:5b9:dd01:61a3:7590:bf68:bcf3) Quit (Remote host closed the connection)
[9:32] * osuka_ (~Thayli@98EAABD8E.tor-irc.dnsbl.oftc.net) Quit ()
[9:32] * Skyrider (~Altitudes@UtopianNoise.tor-exit.sec.gd) has joined #ceph
[10:02] * Skyrider (~Altitudes@5NZAABVJ6.tor-irc.dnsbl.oftc.net) Quit ()
[10:02] * elt (~tritonx@nx-01.tor-exit.network) has joined #ceph
[10:03] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) has joined #ceph
[10:05] * qwebirc52632 (~oftc-webi@116.255.132.3) has joined #ceph
[10:06] * qwebirc52632 (~oftc-webi@116.255.132.3) has left #ceph
[10:20] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) Quit (Ping timeout: 480 seconds)
[10:26] * derjohn_mob (~aj@ip-95-223-126-17.hsi16.unitymediagroup.de) Quit (Ping timeout: 480 seconds)
[10:27] * SpaceDump (spacedmp@tjatt.spacedump.se) has joined #ceph
[10:31] * oro (~oro@80-219-254-208.dclient.hispeed.ch) has joined #ceph
[10:32] * elt (~tritonx@425AAANMB.tor-irc.dnsbl.oftc.net) Quit ()
[10:32] * OODavo (~hyst@chomsky.torservers.net) has joined #ceph
[11:02] * OODavo (~hyst@2WVAABSJH.tor-irc.dnsbl.oftc.net) Quit ()
[11:02] * Tenk (~Scrin@37.187.129.166) has joined #ceph
[11:29] * oblu (~o@62.109.134.112) Quit (Ping timeout: 480 seconds)
[11:30] * oblu (~o@62.109.134.112) has joined #ceph
[11:30] * leseb_ (~leseb@81-64-215-19.rev.numericable.fr) Quit (Ping timeout: 480 seconds)
[11:30] * leseb_ (~leseb@81-64-215-19.rev.numericable.fr) has joined #ceph
[11:32] * Tenk (~Scrin@2WVAABSKY.tor-irc.dnsbl.oftc.net) Quit ()
[11:32] * roaet (~pepzi@chulak.enn.lu) has joined #ceph
[11:40] * rendar (~I@host247-176-dynamic.37-79-r.retail.telecomitalia.it) has joined #ceph
[11:46] * Concubidated (~Adium@71.21.5.251) Quit (Read error: Connection reset by peer)
[11:47] * Concubidated (~Adium@71.21.5.251) has joined #ceph
[11:55] <ft1> hi all, i run "ceph -s " but it has an error, the complete error is here: http://pastebin.com/MRKPbWLF
[12:02] * roaet (~pepzi@98EAABEC6.tor-irc.dnsbl.oftc.net) Quit ()
[12:02] * rcfighter (~CydeWeys@exit1.ipredator.se) has joined #ceph
[12:04] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) Quit (Remote host closed the connection)
[12:08] * bkopilov (~bkopilov@nat-pool-tlv-t.redhat.com) Quit (Remote host closed the connection)
[12:29] * ft1 (~fattaneh@194.225.33.201) Quit (Quit: Leaving.)
[12:30] * ft1 (~fattaneh@194.225.33.201) has joined #ceph
[12:32] * rcfighter (~CydeWeys@5NZAABVQS.tor-irc.dnsbl.oftc.net) Quit ()
[12:32] * Misacorp (~Helleshin@5.61.34.63) has joined #ceph
[12:36] * ft1 (~fattaneh@194.225.33.201) has left #ceph
[12:50] * Concubidated (~Adium@71.21.5.251) Quit (Quit: Leaving.)
[13:02] * Misacorp (~Helleshin@2WVAABSOC.tor-irc.dnsbl.oftc.net) Quit ()
[13:07] * drdanick1 (~clarjon1@2WVAABSPN.tor-irc.dnsbl.oftc.net) has joined #ceph
[13:09] * ft1 (~fattaneh@194.225.33.201) has joined #ceph
[13:09] * ft1 (~fattaneh@194.225.33.201) has left #ceph
[13:36] * drdanick1 (~clarjon1@2WVAABSPN.tor-irc.dnsbl.oftc.net) Quit ()
[13:37] * kalleeen (~Sophie@tor-exit0-readme.dfri.se) has joined #ceph
[13:38] * bkopilov (~bkopilov@nat-pool-tlv-t.redhat.com) has joined #ceph
[13:41] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) has joined #ceph
[13:41] * Sysadmin88 (~IceChat77@054527d3.skybroadband.com) Quit (Quit: If you think nobody cares, try missing a few payments)
[13:43] * subscope (~subscope@92-249-244-167.pool.digikabel.hu) has joined #ceph
[13:54] * jksM (~jks@178.155.151.121) Quit (Read error: Connection reset by peer)
[13:54] * jks (~jks@178.155.151.121) has joined #ceph
[13:59] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) Quit (Remote host closed the connection)
[14:01] * fghaas (~florian@91-119-140-224.dynamic.xdsl-line.inode.at) has joined #ceph
[14:06] * kalleeen (~Sophie@5NZAABVUS.tor-irc.dnsbl.oftc.net) Quit ()
[14:07] * Hideous (~Xeon06@marylou.nos-oignons.net) has joined #ceph
[14:14] * oro (~oro@80-219-254-208.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[14:23] * kamalmarhubi (sid26581@id-26581.uxbridge.irccloud.com) has joined #ceph
[14:29] * derjohn_mob (~aj@ip-95-223-126-17.hsi16.unitymediagroup.de) has joined #ceph
[14:30] * fghaas (~florian@91-119-140-224.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[14:36] * LeaChim (~LeaChim@host86-143-18-67.range86-143.btcentralplus.com) has joined #ceph
[14:36] * Hideous (~Xeon06@5NZAABVVT.tor-irc.dnsbl.oftc.net) Quit ()
[14:37] * TehZomB (~redbeast1@puffy.keystretch.com) has joined #ceph
[14:37] * bkopilov (~bkopilov@nat-pool-tlv-t.redhat.com) Quit (Ping timeout: 480 seconds)
[14:43] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) has joined #ceph
[14:47] * joao (~joao@249.38.136.95.rev.vodafone.pt) has joined #ceph
[14:47] * ChanServ sets mode +o joao
[14:51] * KevinPerks (~Adium@ip-64-134-99-56.public.wayport.net) has joined #ceph
[14:52] * jluis (~joao@249.38.136.95.rev.vodafone.pt) Quit (Ping timeout: 480 seconds)
[14:57] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) has joined #ceph
[14:58] * vbellur (~vijay@121.244.87.124) has joined #ceph
[15:00] * KevinPerks (~Adium@ip-64-134-99-56.public.wayport.net) Quit (Quit: Leaving.)
[15:03] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) Quit (Remote host closed the connection)
[15:06] * TehZomB (~redbeast1@5NZAABVW3.tor-irc.dnsbl.oftc.net) Quit ()
[15:07] * Bj_o_rn (~loft@tor-exit0-readme.dfri.se) has joined #ceph
[15:12] * Bj_o_rn (~loft@2WVAABSUB.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[15:12] * MonkeyJamboree (~Ian2128@162.247.73.74) has joined #ceph
[15:19] * MACscr (~Adium@2601:d:c800:de3:15b1:57b6:74fd:f86a) has joined #ceph
[15:20] * subscope (~subscope@92-249-244-167.pool.digikabel.hu) Quit (Quit: Textual IRC Client: www.textualapp.com)
[15:23] * zviratko (~zviratko@241-73-239-109.cust.centrio.cz) Quit (Ping timeout: 480 seconds)
[15:23] * yghannam (~yghannam@0001f8aa.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:42] * MonkeyJamboree (~Ian2128@2WVAABSUI.tor-irc.dnsbl.oftc.net) Quit ()
[15:44] * vbellur (~vijay@121.244.87.124) Quit (Ping timeout: 480 seconds)
[15:53] * fmanana (~fdmanana@bl13-130-213.dsl.telepac.pt) Quit (Ping timeout: 480 seconds)
[15:57] * Sysadmin88 (~IceChat77@054527d3.skybroadband.com) has joined #ceph
[16:10] * fdmanana (~fdmanana@bl13-130-213.dsl.telepac.pt) has joined #ceph
[16:12] * offer (~Hidendra@tor-exit1.arbitrary.ch) has joined #ceph
[16:20] * bkopilov (~bkopilov@bzq-79-183-107-179.red.bezeqint.net) has joined #ceph
[16:42] * offer (~Hidendra@2WVAABSXH.tor-irc.dnsbl.oftc.net) Quit ()
[16:42] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) has joined #ceph
[16:42] * Peaced (~Chrissi_@exit1.ipredator.se) has joined #ceph
[16:44] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) Quit (Remote host closed the connection)
[16:45] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) has joined #ceph
[16:53] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) Quit (Ping timeout: 480 seconds)
[17:04] * zviratko (~zviratko@241-73-239-109.cust.centrio.cz) has joined #ceph
[17:12] * Peaced (~Chrissi_@5NZAABV1Y.tor-irc.dnsbl.oftc.net) Quit ()
[17:12] * jacoo (~mason@exit1.ipredator.se) has joined #ceph
[17:22] * MACscr (~Adium@2601:d:c800:de3:15b1:57b6:74fd:f86a) Quit (Quit: Leaving.)
[17:23] * oro (~oro@80-219-254-208.dclient.hispeed.ch) has joined #ceph
[17:28] * scuttlemonkey is now known as scuttle|afk
[17:42] * jacoo (~mason@98EAABEOM.tor-irc.dnsbl.oftc.net) Quit ()
[17:43] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (Quit: Leaving)
[17:46] * segutier (~segutier@c-24-6-218-139.hsd1.ca.comcast.net) has joined #ceph
[17:47] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) has joined #ceph
[17:50] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) has joined #ceph
[18:04] * bobrik (~bobrik@83.243.64.45) Quit (Quit: (null))
[18:10] * vpol (~vpol@000131a0.user.oftc.net) has joined #ceph
[18:12] * ZombieL (~osuka_@5NZAABV53.tor-irc.dnsbl.oftc.net) has joined #ceph
[18:15] * vjujjuri (~chatzilla@static-50-53-42-60.bvtn.or.frontiernet.net) has joined #ceph
[18:22] * segutier (~segutier@c-24-6-218-139.hsd1.ca.comcast.net) has left #ceph
[18:30] * nathharp (~nathharp@90.222.95.22) has joined #ceph
[18:42] * ZombieL (~osuka_@5NZAABV53.tor-irc.dnsbl.oftc.net) Quit ()
[18:42] * Harryhy (~Shesh@exit.tor.uwaterloo.ca) has joined #ceph
[18:47] * ajazdzewski__ (~ajazdzews@p200300406E606800BD72EB2243C3BF3A.dip0.t-ipconnect.de) has joined #ceph
[18:47] * Hemanth (~Hemanth@117.213.180.186) has joined #ceph
[19:05] * xcezzz (~Adium@pool-100-3-14-19.tampfl.fios.verizon.net) has left #ceph
[19:09] * bobrik (~bobrik@83.243.64.45) has joined #ceph
[19:10] * vjujjuri (~chatzilla@static-50-53-42-60.bvtn.or.frontiernet.net) Quit (Ping timeout: 480 seconds)
[19:12] * Harryhy (~Shesh@2WVAABS29.tor-irc.dnsbl.oftc.net) Quit ()
[19:13] * cholcombe (~chris@pool-108-42-125-114.snfcca.fios.verizon.net) has joined #ceph
[19:13] * cholcombe (~chris@pool-108-42-125-114.snfcca.fios.verizon.net) Quit ()
[19:16] * brutuscat (~brutuscat@105.34.133.37.dynamic.jazztel.es) Quit (Ping timeout: 480 seconds)
[19:17] * vjujjuri (~chatzilla@static-50-53-42-60.bvtn.or.frontiernet.net) has joined #ceph
[19:25] * KevinPerks (~Adium@nc-184-3-224-58.dhcp.embarqhsd.net) has joined #ceph
[19:27] * KevinPerks1 (~Adium@nc-184-3-224-58.dhcp.embarqhsd.net) has joined #ceph
[19:27] * KevinPerks (~Adium@nc-184-3-224-58.dhcp.embarqhsd.net) Quit (Read error: Connection reset by peer)
[19:33] * KevinPerks1 (~Adium@nc-184-3-224-58.dhcp.embarqhsd.net) has left #ceph
[19:36] * vbellur (~vijay@122.166.89.81) has joined #ceph
[19:42] * Kalado (~AG_Scott@98EAABETK.tor-irc.dnsbl.oftc.net) has joined #ceph
[19:47] * TMM (~hp@178-84-46-106.dynamic.upc.nl) has joined #ceph
[19:49] * KevinPerks (~Adium@nc-184-3-224-58.dhcp.embarqhsd.net) has joined #ceph
[19:49] * Hemanth (~Hemanth@117.213.180.186) Quit (Quit: Leaving)
[20:00] * KevinPerks (~Adium@nc-184-3-224-58.dhcp.embarqhsd.net) Quit (Ping timeout: 480 seconds)
[20:00] * subscope (~subscope@92-249-244-167.pool.digikabel.hu) has joined #ceph
[20:06] * rotbeard (~redbeard@aftr-95-222-27-149.unity-media.net) has joined #ceph
[20:11] * Concubidated (~Adium@71.21.5.251) has joined #ceph
[20:12] * Kalado (~AG_Scott@98EAABETK.tor-irc.dnsbl.oftc.net) Quit ()
[20:44] * derjohn_mob (~aj@ip-95-223-126-17.hsi16.unitymediagroup.de) Quit (Ping timeout: 480 seconds)
[21:06] * wschulze (~wschulze@cpe-74-73-11-233.nyc.res.rr.com) has joined #ceph
[21:12] * Zeis (~RaidSoft@176.10.99.200) has joined #ceph
[21:13] * scuttle|afk is now known as scuttlemonkey
[21:42] * Zeis (~RaidSoft@5NZAABWD5.tor-irc.dnsbl.oftc.net) Quit ()
[21:42] * bret1 (~Thononain@bakunin.gtor.org) has joined #ceph
[22:00] * vjujjuri (~chatzilla@static-50-53-42-60.bvtn.or.frontiernet.net) Quit (Ping timeout: 480 seconds)
[22:00] * kaisan (~kai@zaphod.xs4all.nl) Quit (Quit: leaving)
[22:03] * chasmo77 (~chas77@158.183-62-69.ftth.swbr.surewest.net) Quit (Quit: It's just that easy)
[22:04] * subscope (~subscope@92-249-244-167.pool.digikabel.hu) Quit (Quit: Textual IRC Client: www.textualapp.com)
[22:12] * bret1 (~Thononain@5NZAABWE0.tor-irc.dnsbl.oftc.net) Quit ()
[22:17] * DV (~veillard@2001:41d0:a:f29f::1) Quit (Ping timeout: 480 seconds)
[22:21] * rotbeard (~redbeard@aftr-95-222-27-149.unity-media.net) Quit (Quit: Leaving)
[22:28] * DV (~veillard@2001:41d0:1:d478::1) has joined #ceph
[22:36] * wushudoin (~jB@c-73-189-76-103.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[22:36] * wushudoin (~jB@c-73-189-76-103.hsd1.ca.comcast.net) has joined #ceph
[22:43] * Pommesgabel (~Shesh@46.182.106.190) has joined #ceph
[22:47] * nathharp (~nathharp@90.222.95.22) Quit (Quit: nathharp)
[22:49] * vjujjuri (~chatzilla@static-50-53-42-60.bvtn.or.frontiernet.net) has joined #ceph
[23:04] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:07] * sleinen1 (~Adium@2001:620:0:82::102) Quit (Ping timeout: 480 seconds)
[23:08] * vjujjuri_ (~chatzilla@204.14.239.17) has joined #ceph
[23:12] * Pommesgabel (~Shesh@2WVAABTCR.tor-irc.dnsbl.oftc.net) Quit ()
[23:12] * Helleshin (~Cue@spftor5e2.privacyfoundation.ch) has joined #ceph
[23:13] * rendar (~I@host247-176-dynamic.37-79-r.retail.telecomitalia.it) Quit (Ping timeout: 480 seconds)
[23:15] * vjujjuri (~chatzilla@static-50-53-42-60.bvtn.or.frontiernet.net) Quit (Ping timeout: 480 seconds)
[23:16] * rendar (~I@host247-176-dynamic.37-79-r.retail.telecomitalia.it) has joined #ceph
[23:19] * badone (~brad@CPE-121-215-241-179.static.qld.bigpond.net.au) has joined #ceph
[23:25] * vpol (~vpol@000131a0.user.oftc.net) Quit (Quit: vpol)
[23:35] * michael_ (~michael@p4FF9CECE.dip0.t-ipconnect.de) has joined #ceph
[23:42] * Helleshin (~Cue@2WVAABTDS.tor-irc.dnsbl.oftc.net) Quit ()
[23:42] * Arfed (~nih@TerokNor.tor-exit.network) has joined #ceph
[23:49] * vjujjuri_ (~chatzilla@204.14.239.17) Quit (Ping timeout: 480 seconds)
[23:54] * sherlocked (~watson@14.139.82.6) has joined #ceph
[23:55] * sherlocked (~watson@14.139.82.6) Quit ()

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.