#ceph IRC Log


IRC Log for 2012-01-25

Timestamps are in GMT/BST.

[1:50] * amichel (~amichel@ has joined #ceph
[1:55] <amichel> I've just got a sort of ignorant question about multi-device btrfs filesystems and how one would lay out a 45 disk server for ceph OSD at that level. Anyone around that might have some knowledge to drop on me in that area?
[3:23] * amichel (~amichel@ has joined #ceph
[6:37] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[16:55] * aneesh (~aneesh@ has joined #ceph
[17:26] <ceph> can anyone point me in the ceph source to where the metadata for files stored in the file system is extracted (and then cached on the MDS cluster)
[18:59] <gregaf> ceph: I'm not quite sure what you're asking…metadata is stored in objects in RADOS, but it's not "extracted" except in that the MDS reads it out of RADOS
[18:59] <gregaf> that happens throughout the MDS system, largely in MDCache.cc
[19:00] <gregaf> Kioob`Taff1: you want to look at each on-disk copy of it or something?
[19:01] <sage> gregaf, etc: will miss the standup this morning. will be on irc later though
[19:01] <gregaf> there's not a great way still but if you've got a log you can find the inode and then search for that to turn up the objects it's stored in...
[19:01] <gregaf> roger that, sage
[19:08] <ceph> gregaf: thank you. i will look in MDCache.cc because that is part of what i am looking for. i am also interested in where/how/when the metadata contents of each directory are stored in objects
[19:09] <gregaf> each directory is an object which stores all the inodes in that directory, along with some extra directory-specific metadata…
[19:09] <gregaf> changes are written to a journal (which is just striped along objects) and flushed out to the permanent directory objects every so often
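The journal-then-flush pattern gregaf describes can be sketched roughly as follows. This is a toy model under stated assumptions, not Ceph code: the class `ToyMDS` and all of its fields and methods are hypothetical names invented for illustration, and plain dicts stand in for the per-directory objects in RADOS.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMDS:
    """Toy model only; every name here is hypothetical, not a Ceph API."""
    dir_objects: dict = field(default_factory=dict)  # stands in for per-directory objects
    journal: list = field(default_factory=list)      # stands in for the MDS journal
    cache: dict = field(default_factory=dict)        # in-memory directory contents

    def update(self, dirname, inode_name, inode):
        # Each change is appended to the journal, then applied to the cache.
        self.journal.append((dirname, inode_name, inode))
        self.cache.setdefault(dirname, {})[inode_name] = inode

    def flush(self):
        # "Every so often": write cached directories back, then trim the journal.
        for dirname, inodes in self.cache.items():
            self.dir_objects[dirname] = dict(inodes)
        self.journal.clear()

    def recover(self):
        # After a restart: load the directory objects, then replay the journal on top.
        self.cache = {d: dict(i) for d, i in self.dir_objects.items()}
        for dirname, inode_name, inode in self.journal:
            self.cache.setdefault(dirname, {})[inode_name] = inode


mds = ToyMDS()
mds.update("/a", "f1", {"size": 1})
mds.flush()                          # f1 now lives in the directory object
mds.update("/a", "f2", {"size": 2})  # f2 exists only in the journal and cache
mds.cache = {}                       # simulate losing the in-memory cache
mds.recover()
print(sorted(mds.cache["/a"]))
```

In this sketch an inode can be recovered from either the directory object or the journal, which is why a field that only one of the two paths handles correctly can go missing depending on how the state was reloaded.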
[19:11] <ceph> if i was to copy some new data into the file system, what (or who, which file) initiates the metadata object creation for each directory copied?
[19:13] <gregaf> I don't remember what the specific functions are, but you would start in Server.cc::handle_client_mkdir :)
[19:13] <gregaf> be back in a bit!
[19:13] <ceph> ok. thanks
[21:24] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Quit: Konversation terminated!)
[21:24] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[22:16] <sage> elder: you saw http://tracker.newdream.net/issues/1990 right?
[22:17] <elder> Yes, Josh pointed it out.
[22:17] <elder> I haven't investigated any further than that though.
[22:18] <elder> Also note the "doesn't exist any more" which is the problem with the rebasing that I don't like very much :)
[22:20] <sage> elder: hrm yeah
[22:39] <lxo> sage, gregaf, been doing some testing with snapshot timestamps *and* mds loss of dir layout info. it seems that the problems are related
[22:40] <sage> lxo: that sounds right
[22:40] <lxo> AFAICT when the mds kicks a dir out of the cache, when it reads things back, it loses layout info, but the snapshot timestamps are right
[22:40] <sage> lxo: i think in both cases the fix is to journal the old_inode content
[22:40] <sage> lxo: er oh, hmm!
[22:41] <lxo> however, when the mds recovers that info from the mds journal, it gets the timestamps wrong, even though it seems to get dir layout info right
[22:41] <sage> lxo: sounds like journaling of layout is correct, writeback is not; journaling of old_inode is wrong, writeback is correct.
[22:41] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[22:42] <lxo> so I'm inclined to believe that we don't need to journal old_inodes, we just have to deal properly with creation of snapshots when replaying the journal, and review how layout info is stored in/recovered from dir objects
[22:42] <lxo> yep, that's a shorter way to put it
[22:43] <lxo> now, as for fixing old_inodes, I still don't see why we have to do that (don't know enough about the mds journal yet), but it appears to me that the journal should already have all the necessary info
[22:43] <sage> lxo: i think we need to journal old_inodes either way..
[22:44] <elder> sage, your last btrfs hack commit added a build warning because you inserted a printf prior to a local variable declaration in a block.
[22:44] <lxo> as in, if the previous state of the snapshotted directory is in the journal, we can use that info when replaying (?) a snapshot-creation op; if it's not in the journal, we can recover it from the osd dir object
[22:44] <elder> Code still works though--it's legal C now, just wasn't before hence the warning.
[22:44] <sage> lxo: sort of.. the way the mds works snapshots may be updated after they are cowed because client writeback is asynchronous
[22:45] <elder> sage, I am going to recommit those patches soon I think so I will fix it for you (unless you object).
[22:45] <sage> elder: k. we can clean it up or ignore.. it was just for the next qa run so i can see where einval is coming from
[22:46] <lxo> as for recovering the info, I don't see that journaling old_inodes gains us any info; AFAICT we could avoid the misbehavior by refraining from assuming old_inodes timestamps are the same as the updated dir, and instead fetch them from disk even though the updated dir info is in the journal already
[22:47] <lxo> I can see this would change snapshot data, but metadata?!?
[22:48] <elder> sage, already fixed in my local area; I'll fix it if/when I commit these changes.
[22:48] <sage> there's more in old_inodes beyond mtime: recursive accounting stats, for instance. we don't learn that until potentially well after the snapshot is created (when clients finish writeback), so even tho it's in a different time bucket, it can change going forward, and is part of the directory state.
[22:49] <sage> ...and all such state is journaled when it is updated.
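sage's argument, that snapshotted state can still change after the snapshot is taken and so has to be journaled like any other state, can be illustrated with a small simulation. This is only a sketch under assumptions: the keys `head.rbytes` and `snap1.rbytes`, the byte counts, and the functions `replay` and `run` are all hypothetical, and the model assumes nothing was flushed to the directory object between the snapshot and the restart.

```python
def replay(base, journal):
    """Rebuild state by applying journaled (key, value) updates on top of base."""
    state = dict(base)
    for key, value in journal:
        state[key] = value
    return state


def run(journal_old_inodes):
    """Return (live state, state recovered by journal replay) for one scenario."""
    on_disk = {"head.rbytes": 100}  # last state flushed to the directory object
    journal = []
    live = dict(on_disk)

    def apply(key, value, journaled=True):
        live[key] = value
        if journaled:
            journal.append((key, value))

    # Snapshot taken: the head state is copied into a snapshot-side entry.
    apply("snap1.rbytes", live["head.rbytes"], journaled=journal_old_inodes)
    # Client writeback arrives later and changes the snapshot's recursive stats.
    apply("snap1.rbytes", 150, journaled=journal_old_inodes)
    apply("head.rbytes", 150)
    return live, replay(on_disk, journal)


for flag in (True, False):
    live, recovered = run(flag)
    print(flag, recovered == live)
```

With the snapshot-side updates journaled, replay reproduces the live state; without them, the replayed state has the current directory stats but nothing for the snapshot, so that information would have to come from somewhere else, such as the on-disk directory object lxo mentions.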
[22:49] <lxo> is there any way to force an mds to fully flush the cache and the journal, and then restart/quit? this would be a nice way to reliably test what I'm observing empirically
[22:49] <sage> lxo: i think 'ceph mds stop 0' will make it flush and shut down
[22:50] <sage> lxo: but fwiw i'm 99% sure the lack of old_inode journaling is an oversight/bug. fixing that should make that half of your problem go away
[22:50] <lxo> will that empty the journal too, or should I do that separately? (or is there some other way to ensure the contents of the journal have made it to dirs)
[22:51] <lxo> I believe that, but I think it's overkill. the fact that I don't see subdirectories' timestamps become wrong after the snapshot is taken makes me a bit doubtful that this is necessary. I feel I'm still missing something
[23:31] <sage> i accidentally clobbered the -b qa run .. scheduled a new one.
These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.