#ceph IRC Log


IRC Log for 2011-10-11

Timestamps are in GMT/BST.

[0:10] * Dantman (~dantman@S010600259c4d54ff.vs.shawcable.net) has joined #ceph
[0:15] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[0:19] * cp (~cp@206.15.24.21) Quit (Quit: cp)
[0:33] * cp (~cp@206.15.24.21) has joined #ceph
[0:47] * cole (~cole@74.85.19.35) has joined #ceph
[0:47] <cole> wow
[0:51] <gregaf> nak: there's not currently a way to find out what OSDs a file lives on, if that's what you're asking about
[0:51] <gregaf> or rather, not a good user-facing one
[0:53] <gregaf> cole: can we help you with something? :)
[0:56] <cole> gregaf: maybe!
[1:00] <jantje_> is there any comparison with gluster? I know some guys are looking into using that, but my gut says that I need to convince them to use ceph
[1:00] <cole> is there anyone with information on how to run a replicated rbd / rados cluster without a ceph.conf?
[1:01] <cole> <---intimately familiar with gluster
[1:01] <jantje_> well, i'm not
[1:01] <jantje_> :)
[1:02] <cole> ceph understands multi-tenancy much better at this point but gluster is very easy to configure. the other advantage to ceph at the moment is its support and integration of block devices into qemu.
[1:03] <gregaf> cole: perhaps we should rename our default away from ceph.conf...
[1:04] <gregaf> but it's how you specify what disk the OSD should use, and the monitors to connect to, so you're always going to need it or something like it :)
[1:05] <jantje_> thanks cole, perhaps I'll dig into it myself, but if someone knows of some existing comparison, that would be great
[1:06] <cole> greg: Dallas or Tommy once told me that it's possible (when not using the filesystem) to point to the actual pool for monitor information and that would allow us to skip the .conf file altogether.
[1:07] <cole> jantje_: what kind of comparison? they are both clustered file systems.
[1:07] <gregaf> I think they were mistaken :/
[1:07] <gregaf> clients can just point to a monitor, but the host daemons need a bit more configuration
[1:07] <gregaf> *server daemons*
[1:08] <cole> clients can point to a monitor where? via cli?
[1:09] <gregaf> yeah
[1:09] <cole> ya..i get that.
[1:09] <gregaf> mount -t ceph 192.168.0.1:6789:/ mountpoint
[1:09] <gregaf> or ceph-fuse bla
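(For reference, the ceph-fuse form of that example would be along the lines of the command below; the monitor address is carried over from gregaf's mount example and the mountpoint is illustrative.)

    ceph-fuse -m 192.168.0.1:6789 /mnt/ceph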
[1:09] <cole> so for context
[1:09] * damian_ (~damian@ns1.v-dns.com) Quit (Read error: Operation timed out)
[1:11] <cole> the conversation I had with Dallas was basically asking if it would be possible to combine the daemons (for OSD / MON) and make ceph smart about the role it would take on when talking to the host.
[1:12] <cole> but i'm bummed to hear that we'll need to suffer with a ceph.conf file. Really liked the idea of the intelligence living in the pool as metadata
[1:13] <gregaf> it doesn't need to be much of a conf file — if you were going for configless setup you'd presumably have all your nodes be identical anyway
[1:13] <cole> ideally, a guide on how to implement ceph for nova block volumes / glance / object storage without using OSD's would be a fantastic place to start!
[1:14] <gregaf> with the same directories for OSD data storage and stuff
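(To illustrate how small such a conf can stay when the nodes are identical: a minimal ceph.conf might look roughly like the sketch below. The hostnames, paths, and sizes are invented for the example, not taken from the conversation.)

    [mon.a]
            host = node1
            mon addr = 192.168.0.1:6789
    [osd]
            osd data = /srv/osd.$id
            osd journal = /srv/osd.$id/journal
            osd journal size = 1000    ; size in MB, only needed for a file journal
    [osd.0]
            host = node2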
[1:14] <gregaf> without OSDs…?
[1:14] <cole> maybe i'm wrong again..
[1:14] <cole> but
[1:14] <cole> if not using the filesystem, is it not possible to get that functionality without OSD's?
[1:16] <cole> we want to use rbd / librados for openstack without pointing to a posix location!
[1:16] <sagewk> cole: you're probably thinking without MDS's
[1:16] <sagewk> (i.e. no posix file system)
[1:16] <cole> actually, yes!
[1:16] <cole> sorry
[1:16] <gregaf> ah, that's better :)
[1:17] <gregaf> so I think TV's docs cover this pretty well
[1:17] <gregaf> http://ceph.newdream.net/docs/latest/
[1:18] <cole> right! no mds's. so basically, trying to point (for openstack) nova block volumes / glance / and possibly swift compatible object storage at ceph.
[1:18] <cole> excellent!
[1:19] <gregaf> specifically http://ceph.newdream.net/docs/latest/start/object/
[1:19] <gregaf> if you do run across any references to the MDS…just don't include them :)
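(To make that concrete: with only monitors and OSDs running, a block volume is just an rbd image, created and handed to qemu roughly as below; nova/glance then point at those images rather than at a mounted filesystem. The pool and image names are illustrative.)

    rbd create volume1 --size 1024                   # 1 GB image in the default 'rbd' pool
    qemu ... -drive format=rbd,file=rbd:rbd/volume1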
[1:20] <cole> cool. on the list is also to bug you guys about xfs support!
[1:20] * cp (~cp@206.15.24.21) Quit (Quit: cp)
[1:21] <cole> i'll be back later. thanks for the links!
[1:21] * cole (~cole@74.85.19.35) has left #ceph
[1:35] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[1:36] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:42] * Dantman (~dantman@S010600259c4d54ff.vs.shawcable.net) Quit (Remote host closed the connection)
[1:42] * cp (~cp@206.15.24.21) has joined #ceph
[1:49] * adjohn (~adjohn@50.0.103.34) Quit (Quit: adjohn)
[1:58] <nak> gregaf: sorry got pulled away. yep, that's exactly what I was looking for, want to verify that the crushmap is functioning as expected.
[2:00] * jojy_ (~jojyvargh@108.60.121.114) has joined #ceph
[2:00] * jojy (~jojyvargh@108.60.121.114) Quit (Read error: Connection reset by peer)
[2:00] * jojy_ is now known as jojy
[2:07] * bchrisman (~Adium@108.60.121.114) Quit (Quit: Leaving.)
[2:16] * Dantman (~dantman@S010600259c4d54ff.vs.shawcable.net) has joined #ceph
[2:22] * jojy (~jojyvargh@108.60.121.114) Quit (Quit: jojy)
[3:40] * cp (~cp@206.15.24.21) Quit (Quit: cp)
[3:41] * cp (~cp@206.15.24.21) has joined #ceph
[3:41] * cp (~cp@206.15.24.21) Quit ()
[4:19] * votz (~votz@pool-108-52-121-23.phlapa.fios.verizon.net) Quit (Quit: Leaving)
[4:45] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[6:02] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:43] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[8:31] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[10:21] <psomas> What tool do you suggest to benchmark ceph/rados/rbd performance? Thanks.
[10:22] <psomas> And btw, has anyone tried bcache with ceph?
[10:55] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[12:10] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:56] <jantje_> psomas: I don't think you need that
[12:56] <jantje_> you can put the ceph journal on an ssd
[12:56] <jantje_> afaik, I'm not really a ceph expert ;)
[12:57] <psomas> atm, for testing i keep the journal on tmpfs
[12:58] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[12:58] <psomas> and i was thinking of adding bcache (on ssds) in front of the sas/sata disks used for the actual data
[13:00] <jantje_> I don't think you would get much performance benefit from doing that. tmpfs isn't *that* fast I think
[13:41] <bchrisman> psomas: could use an ssd bcache'd device directly for the journal
[15:05] * verwilst (~verwilst@dD57697EC.access.telenet.be) has joined #ceph
[16:08] * slang (~slang@chml01.drwholdings.com) Quit (Quit: Leaving.)
[16:12] <nak> anyone have any feedback on optimum directory sizes (ie number of entries), and any falloff points in performance?
[16:45] * Dantman (~dantman@S010600259c4d54ff.vs.shawcable.net) Quit (Remote host closed the connection)
[16:48] * Dantman (~dantman@S010600259c4d54ff.vs.shawcable.net) has joined #ceph
[17:00] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[17:31] * nak (~nak@ca.classicsanimated.com) Quit (Quit: Lost terminal)
[17:38] * slang (~slang@chml01.drwholdings.com) has joined #ceph
[17:43] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[17:55] * damian_ (~damian@ns1.v-dns.com) has joined #ceph
[18:23] * damian_ (~damian@ns1.v-dns.com) Quit (Quit: Leaving)
[18:24] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:39] * Dantman (~dantman@S010600259c4d54ff.vs.shawcable.net) Quit (Read error: Operation timed out)
[18:51] * aliguori (~anthony@32.97.110.59) has joined #ceph
[18:58] <gregaf> nak: you can look at the PG map if you like — files are striped across objects, and objects are mapped into PGs (Placement Groups), and the PGs are what CRUSH maps onto OSDs
[18:58] <gregaf> ceph pg dump -o -
[18:59] <gregaf> that will give you all the PGs and what OSDs they're mapped to
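(If the aim is specifically to check that a CRUSH map behaves as intended, the map itself can also be pulled out of the cluster and inspected offline; roughly, and assuming crushtool is available on the admin host:)

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt    # decompile to readable text and check the rules

(Newer crushtool versions can also simulate placements with a --test mode, though the exact flags vary by release.)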
[19:03] * jojy (~jojyvargh@108.60.121.114) has joined #ceph
[19:04] <gregaf> at some point in the future dir size won't matter, because Ceph is capable of "fragmenting" directories
[19:04] <gregaf> unfortunately that's not enabled by default right now since it still needs more QA work
[19:05] <gregaf> but even so you're limited mostly by the size of the MDS cache (limited by RAM and what you've pointed it at)
[19:06] <gregaf> directories with more entries than your "mds cache size" (default 100k, but you should be able to make it much larger) will be very painful; anything smaller than that should not be a big deal
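(That cache limit is a plain ceph.conf option; raising it would look something like the snippet below, with the value picked purely as an example.)

    [mds]
            mds cache size = 500000    ; default is 100000 entries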
[19:06] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[19:07] <gregaf> psomas: the appropriate tools for benchmarking are the tools that mimic your workload ;)
[19:07] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[19:09] <psomas> ok, so the only thing left (for us) is to actually make those bench tools :P
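(For raw rados/OSD numbers there is also the built-in bench command, along the lines below; it generates a synthetic streaming workload against a pool, so it complements rather than replaces workload-specific tools. The pool name and duration are illustrative.)

    rados -p data bench 60 write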
[19:11] <psomas> gregaf: about the rbd image layering, i was wondering if anything has been done yet (if a repo exists etc)
[19:11] <gregaf> nope, we haven't actually done any work on layering yet
[19:11] <psomas> i've seen a mail by sagewk on the ml about that from a couple of months ago, and it's included in some roadmap
[19:12] <gregaf> yeah, it's coming up, but we haven't quite reached it yet
[19:12] <sagewk> psomas: there's a fairly detailed design, but we haven't implemented it yet
[19:12] <gregaf> probably 2 or 3 sprints from now, if I had to guess
[19:13] <psomas> ok, i was asking, in case there was a repo/branch containing any work done
[19:14] <psomas> sagewk: is the design available somewhere?
[19:18] * verwilst (~verwilst@dD57697EC.access.telenet.be) Quit (Quit: Ex-Chat)
[19:21] <psomas> btw, is this going to be implemented in the userspace librbd?
[19:33] <gregaf> psomas: it'll be implemented in everything, but given our historical patterns it will probably go into the userspace librbd first :)
[19:41] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:42] <psomas> gregaf: ok, thanks a lot for all the info :)
[19:42] <gregaf> the design docs (such as they are) went out on the mailing list, check out http://marc.info/?l=ceph-devel&m=129867273303846&w=2
[19:42] <gregaf> there's another thread too but I can't seem to find it, and that link covers the main points
[19:43] <psomas> http://www.spinics.net/lists/ceph-devel/msg01525.html
[19:43] <psomas> i think it's this one? this is the one i found, i hadn't seen the one you gave just now
[19:45] <gregaf> yep, that's the one :)
[19:57] * Dantman (~dantman@199.119.234.2) has joined #ceph
[20:11] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit (Quit: cp)
[20:18] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:38] * cole (~cole@206.15.24.21) has joined #ceph
[20:39] * McFly_ (~McFly@octopus.neptune.uvic.ca) has joined #ceph
[20:39] <McFly_> anyone here running hbase on ceph?
[20:42] * adjohn (~adjohn@50.0.103.34) Quit (Quit: adjohn)
[20:44] * cp (~cp@206.15.24.21) has joined #ceph
[20:44] <gregaf> McFly_: don't think so
[20:44] <McFly_> no?
[20:45] <gregaf> I'm not aware of anybody actually running Hadoop software on it right now :(
[20:45] <gregaf> though it fits the use case pretty well!
[20:45] <McFly_> yeah, it seems to be a pretty new and untested thing
[20:45] <McFly_> lots of potential though
[20:47] <gregaf> if you want to try it out make sure you get the latest Ceph sources — the FileSystem interface for Hadoop is a couple years old and it had bitrotted a bit through the various API changes :)
[20:49] <McFly_> sounds good, guess I just have to play around with it and see what happens :P
[20:50] <gregaf> we'd appreciate hearing back on your experiences — we're a small team right now so that kind of information helps a lot when determining resource allocation
[20:53] <McFly_> I'll let you know how it goes then. I think Yahoo is looking into using ceph to replace hdfs as well
[20:54] <gregaf> sounds good
[20:56] * McFly_ (~McFly@octopus.neptune.uvic.ca) Quit (Quit: Leaving)
[21:04] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[21:57] * Dantman (~dantman@199.119.234.2) Quit (Ping timeout: 480 seconds)
[22:05] <NaioN> if the journal device is an SSD, does ceph use TRIM? or is there another way to have some wear leveling on an SSD if it's used as a journal?
[22:06] <NaioN> and does ceph use the whole journal device, or does it only use what's needed, e.g. only at the beginning of the device?
[22:07] <df__> trim isn't fundamentally required for wear leveling
[22:07] <gregaf> NaioN: depends on the journaling mode — you can either give it a raw partition, in which case I think it uses the whole thing, or you can give it a file and a size
[22:09] <gregaf> it doesn't use TRIM commands or anything, though
[22:09] <gregaf> but I'm not sure why you'd need it to, it just "circles" around the journal space it has
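(In ceph.conf terms, the two setups gregaf describes look roughly like the fragments below; device and path names are illustrative. With a raw device the whole thing is used, per gregaf above; with a file, the size, in MB, has to be given.)

    [osd]
            osd journal = /dev/sda              ; raw SSD or partition: whole device used

    ; or, file-based:
    [osd]
            osd journal = /srv/osd.$id/journal
            osd journal size = 1000             ; MB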
[22:09] <NaioN> we use the whole device e.g. /dev/sda if that's the ssd
[22:10] <NaioN> if it "circles" it isn't a problem, because then you would have "natural" wear leveling...
[22:10] <gregaf> yeah
[22:10] <NaioN> df__: true, hence my second question; maybe ceph uses another method...
[22:11] <gregaf> though even without circling a journal is pretty easy to wear level; I can't imagine even the bad old SSDs would have had a problem with it
[22:11] <NaioN> how?
[22:12] <df__> naion, wear leveling is internal to the SSD
[22:12] <psomas> since you're talking about journals, keeping the journal in memory/tmpfs is not an option for production use, right?
[22:12] <df__> the os should not care about it
[22:12] <NaioN> psomas: indeed :)
[22:13] <gregaf> well, a journal is either going to circle or else punch holes in the beginning and go forever; a naive implementation of block allocation can handle either of those cases just fine, as I understand them...
[22:13] <NaioN> df__: i agree, but i want to be sure...
[22:13] <df__> naion, even if you rewrite the same "block" over and over again, that won't get mapped to the same place on the SSD
[22:14] <NaioN> df__: ok well then it's no problem...
[22:14] <NaioN> because we were thinking about reserving some space for wear leveling
[22:15] * Dantman (~dantman@S01060023eba7eb01.vc.shawcable.net) has joined #ceph
[22:15] <df__> a good device has an amount of extra internal space it uses for rotating allocations
[22:15] <psomas> keeping the journal in memory is dangerous in case of osd crashes etc? isn't an osd able to recover after a crash if the journal is kept on tmpfs for example, and 'lost' on a crash?
[22:16] <bchrisman> there's no fsck right now
[22:16] <bchrisman> so if you lose journal.. could lose writes.
[22:16] <NaioN> df__: ok, i thought that you needed trim for that...
[22:16] <df__> or you can use an SLC device
[22:16] <df__> naion, no you don't need trim for it, it's just a method that was invented for a poor era of devices
[22:17] <psomas> bchrisman: but when the osd comes up again, shouldn't it be able to recover properly even with a lost journal?
[22:17] <gregaf> psomas: doesn't matter for daemon crashes, but it's very bad for power failures or kernel failures or whatever
[22:17] <psomas> y, i mean power/kernel failures
[22:17] <df__> trim was to allow blocks to be returned to the list of free blocks, but a good device doesn't need that
[22:17] <bchrisman> because data *can* be in the journal but not in the filestore
[22:18] <NaioN> df__: then the question is: what's a good device :)
[22:18] <gregaf> under most circumstances it can recover fine if you lose the journal, but it's not guaranteed at that point
[22:18] <psomas> ok, but when the osd comes up again, won't it 'recover' based on what is written in the filestore?
[22:18] <gregaf> because you're relying on other nodes existing and having the right data
[22:18] <psomas> right
[22:18] <df__> i'm currently using intel X25-E devices, peak at 180-200MB/sec writes
[22:19] <NaioN> df__: ok we have intel 320
[22:19] <gregaf> df__: expensive devices, those
[22:19] <NaioN> that's the consumer version :)
[22:19] <gregaf> any modern SSD should do fine though
[22:19] <df__> however, i'd probably now use the intel 311
[22:19] <NaioN> the E version is the "Enterprise" version...
[22:20] <NaioN> df__: that's something in between?
[22:20] <bchrisman> single node imploding won't lose data, but if journal is in ram only and site loses power.. could lose data.
[22:20] <NaioN> bchrisman: why is that?
[22:20] <df__> it's a 20GB device, SLC, but only has half the write performance of the X25-E; however, i'd split things up more to use them
[22:21] <gregaf> data can end up in the journal (and clients could get a commit) without going into the main data store
[22:21] <bchrisman> one scenario: write goes to osd1, gets mirrored to osd2.. both write to journal (but don't get all the way out to disk), cluster loses power, then those writes are lost.
[22:21] <NaioN> if you get a write to the first osd and that osd puts it in the journal and crashes, then you have a problem if it's not a persistent journal?
[22:22] <gregaf> NaioN: no, under that circumstance it won't send an ack or anything back to the client
[22:22] <NaioN> ok
[22:22] <bchrisman> with a single failure, you're okay because there's a replica.. but if power goes to a swath of nodes.. then you can lose data because it's in ram on multiple nodes.
[22:22] <NaioN> so it first mirrors to both journals and then commits?
[22:22] <psomas> btw, in case of a power failure for example, is there a chance that the 'underlying' filestore will get corrupted, and the osd will use the journal to fix it?
[22:22] <gregaf> you need to set up more complicated scenarios to lose data if your journal is in tmpfs
[22:23] <gregaf> but by doing that you end up making it *possible* to lose data in ways that aren't possible if your journal is on permanent storage
[22:23] <gregaf> psomas: right now the OSD won't really notice if the underlying storage gets corrupted (although there are hooks for that with scrub and such)
[22:24] <gregaf> but when it boots up it replays everything in its journal onto the last known-good state, so that covers the likely areas of corruption
[22:24] <gregaf> NaioN: journaling and committing strategy varies a bit depending on the underlying FS and such
[22:25] <psomas> kk
[22:25] <df__> gregaf, does it recover from scrub detected errors? eg: [ERR] 1.5d scrub stat mismatch, got 260/259 objects, 0/0 clones, 813084/809363 bytes, 937/933 kb.
[22:25] <gregaf> but if you're not using btrfs, the write will get sent to both OSDs, each will journal the data, then write it to the filestore, then the replicas send "commits" to the primary and the primary sends them back to the client
[22:25] <bchrisman> heh… yeah.. what greg said.. :)
[22:26] <psomas> with btrfs?
[22:26] <gregaf> df__: not by default, though it recovers from *some* of them if you run a manual scrub and tell it to fix errors — sjust could tell you more
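(The manual scrub-and-repair path gregaf refers to is driven per placement group from the ceph command line; assuming the PG id from df__'s error line, and that the running version exposes pg-level repair, it would look roughly like:)

    ceph pg scrub 1.5d
    ceph pg repair 1.5d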
[22:26] <NaioN> gregaf: ok, but you get an ack after it gets to the journal of the first osd?
[22:27] <gregaf> NaioN: no, right now you get an ack once *each* OSD has seen it — "ack" means it's in non-stable storage, "commit" means it's on stable storage
[22:27] <gregaf> the Ceph filesystem only cares about commits, but if you were writing your own application you could have it respond to acks instead if that were safe
[22:27] <NaioN> ok so an ack after it gets to the journal of every osd and a commit after it gets to the filestore of every osd
[22:28] <gregaf> psomas: with btrfs, thanks to snapshots it can write to the journal and the filestore simultaneously — then on startup it rolls back to the latest snapshot and replays the journal on that
[22:28] <gregaf> NaioN: ack doesn't mean in journal
[22:28] <NaioN> ok it could even be in mem of the osd?
[22:28] <gregaf> it means it's been applied to the in-memory state of the OSD (ie, written to filesystem, but FS not flushed or synced)
[22:29] <gregaf> of course, the ack versus commit semantics don't really matter for you unless you're writing your own apps on top of it :)
[22:29] <NaioN> ok
[22:29] <NaioN> but then i don't get it
[22:29] <NaioN> what's the use of the journal?
[22:29] <gregaf> suffice to say that the system has guaranteed consistency and if you're running it without doing silly things like journaling on tmpfs, you're not going to lose data without knowing it
[22:30] <NaioN> well we don't plan to :)
[22:30] <gregaf> journaling makes commits into a streaming write, so lower latency, and makes crash recovery *possible*
[22:30] <gregaf> if you don't have a journal and crash, you don't know whether your on-disk data is any good or not since you might have been in the middle of a transaction or something
[22:31] <NaioN> so the journal is more for consistency than for performance?
[22:31] <bchrisman> journal is a very good place to put ssds.
[22:31] <NaioN> btw we use btrfs, so it can make a write to both...
[22:31] <gregaf> well, under ideal circumstances it improves both
[22:32] <bchrisman> a lot of other clustered filesystems don't have that kind of capability.
[22:32] <gregaf> since you can get data on-disk without having to do random seeks across the disk
[22:32] <NaioN> bchrisman: yeah that's what i really like about ceph
[22:32] <bchrisman> I think performance of the journal device is underplayed… some of the perf reports on the list show excellent perf with even a small ssd journal...
[22:34] <df__> do you have a method for not journalling the data, just the transaction state? ie, a client would hold on and resubmit the request in the event of a failure
[22:34] <df__> that way we wouldn't have issues with SSD performance being the limiting factor
[22:35] <gregaf> df__: you can run it in a no-journal mode, but you can expect that to be slow — IIRC it requires doing a sync after every write op so you don't end up in a bad disk state
[22:35] <df__> yes, that would be suboptimal
[22:36] <df__> afaict, the only reason to journal the data is so that the client can forget about it more quickly and move on to the next op
[22:36] <NaioN> df__: you need a lot of disks to outperform an SSD on iops
[22:36] <df__> i'm not iop limited
[22:37] <NaioN> so if you would sync after every write i doubt the ssd would be the bottleneck
[22:37] <df__> the streaming write performance of the SSD is my bottleneck
[22:37] <NaioN> ok
[22:37] <NaioN> we have a random workload
[22:37] <NaioN> so therefore an ssd is perfect
[22:38] <df__> but the ssd only ever does streaming writes it seems, irrespective of the workload
[22:38] <gregaf> df__: presumably you've got a beefy RAID card to drive that — it doesn't have support for NVRAM or a battery-backed stick of RAM or something?
[22:39] <gregaf> I'd have to grab sage to figure out what all the options are with that, but there are some nice ones (that I think are already implemented)!
[22:39] <NaioN> df__: why would it only do streaming writes?
[22:40] <df__> the journal is just a single device just now. the storage is quite a fast array that significantly outperforms the ssd in terms of write b/w
[22:40] <df__> naion, because it is a log
[22:40] <NaioN> df__: yeah ok, but for every write i have to do a sync?
[22:41] <NaioN> so it's a lot of small writes
[22:42] <df__> granted.
[22:43] <psomas> 23:25 < gregaf> but if you're not using btrfs, the write will get sent to both OSDs, each will journal the data, then write it to the filestore, then the replicas send "commits" to the primary and the primary sends them back to the client
[22:43] <NaioN> psomas: yeah i was confused by that
[22:43] <psomas> hm, i'm confused, doesn't this mean that even with a power failure you'll still have the data even if the journal is lost?
[22:44] <gregaf> sorry, I typoed there — it sends a commit after the journaling step
[22:44] <NaioN> because i thought it commits after it hits every journal of every osd
[22:44] <sjust> psomas: the data written to the filestore may not have been synced
[22:44] <NaioN> gregaf: ok!
[22:44] <psomas> ok, it makes sense now
[22:44] <NaioN> psomas: indeed!
[22:44] <psomas> the same is true for btrfs though, right?
[22:45] <NaioN> psomas: yes
[22:45] <NaioN> but with btrfs you can write to the journal and filestore at the same time
[22:45] <gregaf> no, actually — we use btrfs snapshots extensively, so if the current on-disk state is bad in btrfs that's not a problem because it just rolls back to the latest snapshot
[22:45] <NaioN> and give a commit after one of those is ready
[22:45] <gregaf> and so it can issue simultaneous writes to the filestore and the journal
[22:46] <gregaf> and issue a commit after one of them hits disk, as NaioN says :)
[22:46] <NaioN> this would solve df__ problem
[22:46] <NaioN> because if he has large sequential writes, those could be committed first by the filestore
[22:47] <gregaf> unfortunately, not that — it would require a lot of smarts which we don't have, to handle things being in the filestore that didn't make it into the journal
[22:47] <psomas> gregaf: i meant that you can still lose data with journal in memory...
[22:47] <gregaf> psomas: ah, yes
[22:47] <NaioN> gregaf: ok
[22:47] <gregaf> don't put your journal in memory if you care about your data!
[22:48] <df__> naion, cancelling already issued writes would be a severe cold towel moment
[22:49] <psomas> gregaf: yeah, got that :) i'm just trying to understand why :)
[22:50] <df__> doing parallel journal and filesystem writes only helps you in terms of latency. if you're pushing enough data so that you saturate one of them, you have a problem.
[22:50] <gregaf> psomas: so if you only lose one OSD, and you have others which still have the data, you aren't going to lose it even if your journal's lost
[22:50] <gregaf> but if you lose all of them, then obviously the data is gone
[22:51] <psomas> but if you lose all of the replicas while the data haven't been synced, you have a problem
[22:51] <df__> given that our tiny cluster has nigh on 0.5TiB of ram, the journal device is an obvious issue
[22:52] <psomas> btw, if the journal is lost, but a replica has the data, will it take longer to 'sync' the crashed osd than if you had the journal?
[22:52] <gregaf> or you could have more subtle things, like the PG is running "degraded" because it's supposed to have 2 OSDs (primary and replica) but the replica is rebooting and so only the primary is up right now — if it reboots, boom! data loss
[22:52] <gregaf> psomas: yeah, it'll take longer, but it will sync
[22:52] <psomas> from which 'point' will the sync begin if the journal is lost?
[22:53] <psomas> with the journal, it'll get replayed from what i understand and 'sync' from there
[22:54] <gregaf> yeah, without the journal it'll go from whatever's on the filestore — in btrfs, from the last snapshot (which won't be too far back from what the journal would have had anyway), or in something else from whatever the last sync point was
[22:54] <gregaf> in general it should only be a few seconds off the journal, but depending on how much write activity there is that could be the difference between replaying off the journal and coming up, or having to transfer data over the network
[22:55] <psomas> ok, i think i got it now
[22:55] <psomas> thanks again :)
[22:55] <gregaf> yep!
[23:05] * cole (~cole@206.15.24.21) has left #ceph
[23:24] * pmjdebruijn (~pascal@overlord.pcode.nl) has left #ceph
[23:54] * aliguori (~anthony@32.97.110.59) Quit (Remote host closed the connection)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.