#ceph IRC Log


IRC Log for 2011-09-19

Timestamps are in GMT/BST.

[0:33] * MarkN (~nathan@ has joined #ceph
[1:11] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[1:18] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[1:39] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[2:14] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[4:02] * MarkN (~nathan@ Quit (Ping timeout: 480 seconds)
[4:03] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[4:09] * MarkN (~nathan@ has joined #ceph
[5:13] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[5:21] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[5:40] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[5:41] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[9:27] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[9:35] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[11:27] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[11:35] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[14:10] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[14:53] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:39] <lxo> hey, I got a filesystem in which some 15 files cause the mds to fail an assertion whenever I try to make any changes to them: removing, renaming, truncating, anything
[15:41] <lxo> assert(inode_map.count(in->vino()) == 0); // should be no dup inos! in MDCache::add_inode, called by cow_inode, is what fails
[15:42] <lxo> the files are all present in snapshots that I don't want to remove, and they haven't been touched since the snapshot was taken
[15:42] <lxo> any thoughts on how to recover from this (short of rolling the filesystem back to an earlier state that doesn't display the problem)
[15:43] <lxo> this is with ceph 0.34, FWIW
[16:25] * monrad-51468 (~mmk@domitian.tdx.dk) has joined #ceph
[16:29] * monrad (~mmk@domitian.tdx.dk) Quit (Ping timeout: 480 seconds)
[17:36] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[17:43] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[17:45] * lxo (~aoliva@9YYAABK50.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[17:45] * lxo (~aoliva@9KCAAA5OH.tor-irc.dnsbl.oftc.net) has joined #ceph
[17:53] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[17:54] * Serge (~serge@eduroam-49-215.uni-paderborn.de) has joined #ceph
[17:56] <Serge> hello folks, i was able to install ceph on one host last time. today i tried to install it on the other but every time i get the following message: "mkcephfs requires '-k /path/to/admin/keyring'. default location is /etc/ceph/keyring."
[17:56] <Serge> i disabled the auth parameter in the ceph.conf but the message is still here
[17:56] <Serge> some suggestions?
[18:03] <gregaf> Serge: I think right now it still needs the keyring even if it's empty; you should be able to find it on your original host
[18:05] <Serge> gregaf: what du exacty mean by "find it on your original host"? I dont have any keyring, should there be one?
[18:06] <Serge> sry: *what you...
[18:06] <gregaf> Serge: hmm, I could be mistaken, haven't done it much
[18:08] * Tv|work (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:09] <gregaf> Serge: so you have previously run mkcephfs on a single node and it worked
[18:09] <gregaf> are you trying to expand your cluster to two nodes now?
[18:11] <Tv|work> the mkcephfs -k parameter is a "write the new keyring here", not "read this keyring"
[18:11] <Serge> yes, I installed it previously on one host for test purposes only. now I'm trying to install it on another host. no cluster, just one host for everything
[18:11] <Tv|work> it'll create a new random key for client.admin (moral equivalent of root account for ceph), and write the keyring there
[18:12] <Tv|work> http://ceph.newdream.net/docs/latest/man/8/mkcephfs/
[18:17] <Serge> Tv|work: thank you, but after executing this command: "mkcephfs -c /etc/ceph/ceph.conf -k /etc/ceph/key --allhosts -v" i get this one: http://pastebin.com/yzJUkngh
[18:19] <Tv|work> oh that's a new one
[18:20] <Tv|work> Serge: can you share your ceph.conf?
[18:21] <Tv|work> i'm thinking the section headers will look funky
[18:21] <Tv|work> but i want to know how
[18:22] <Tv|work> there's an ugly shell string handling thing trying to extract the section headers, in mkcephfs
[18:22] <Tv|work> that thing is probably really easy to break
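For context on the fragility Tv|work is describing: a hypothetical sketch (not the actual mkcephfs code) of shell-based section-header extraction from an INI-style ceph.conf. The config contents below are made up for illustration.

```shell
# Hypothetical sketch of the kind of shell section-header extraction
# Tv|work mentions (NOT the real mkcephfs code). A sample INI-style
# ceph.conf is created just for the demo.
cat > /tmp/ceph.conf.example <<'EOF'
[global]
    auth supported = none
[mon.0]
    host = node1
[osd.0]
    host = node1
EOF

# Pull out the [section] names. A pattern like this silently breaks on
# stray whitespace or characters around the brackets, which is exactly
# why such parsing is "really easy to break".
sed -n 's/^\[\(.*\)\]$/\1/p' /tmp/ceph.conf.example
```

Running it prints the three section names, one per line: global, mon.0, osd.0.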
[18:22] <Serge> http://pastebin.com/m5vt3QyY
[18:23] <Serge> it is the same config file which i used to install ceph on other host
[18:23] <Tv|work> Serge: huh, that looks perfectly normal
[18:23] <Tv|work> actually.. your mkcephfs output doesn't match what i expected, either
[18:24] <Tv|work> i see just one place that uses shell arithmetic, and there's an echo just before it
[18:24] <Tv|work> your output doesn't show the echo
[18:24] <Tv|work> hrmmm
[18:24] <Tv|work> Serge: can you re-run with sh -x /path/to/mkcephfs ....
[18:25] <Tv|work> Serge: it'll output a *lot* more, but it'll be more useful for locating the problem
[18:25] <Serge> ok, going to do that
[18:32] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[18:34] <Serge> Tv|work: sry, I think I didnt understand you... I cannot find mkcephfs on my filesystem
[18:34] <Tv|work> Serge: earlier you can mkceph ...
[18:34] <Tv|work> now run sh -x /path/to/mkcephfs ...
[18:35] <Tv|work> *ran
[18:35] <Tv|work> that'll make the shell output more information about what it's doing
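What `sh -x` tracing looks like, shown on a tiny throwaway script (a stand-in for mkcephfs): each command is echoed to stderr with a `+` prefix before it runs.

```shell
# Tiny demo of `sh -x` tracing; /tmp/demo.sh is a stand-in, not mkcephfs.
cat > /tmp/demo.sh <<'EOF'
name="ceph"
echo "hello $name"
EOF

# -x prints every command (with variables expanded) to stderr,
# prefixed with '+', before executing it.
sh -x /tmp/demo.sh >/tmp/demo.out 2>/tmp/demo.trace
cat /tmp/demo.trace
```

The trace file contains lines like `+ name=ceph` and `+ echo hello ceph`, which is what makes it useful for locating exactly where a script like mkcephfs fails.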
[18:39] <Serge> i tried this one: "sh -x /usr/bin/cephfs -c /etc/ceph/ceph.conf -k /etc/ceph/key --allhosts -v" and this one: "sh -x mkcephfs -c /etc/ceph/ceph.conf -k /etc/ceph/key --allhosts -v" and nothing works
[18:40] <Tv|work> first one has a typo, missing "mk"
[18:40] <Tv|work> second one fails because it doesn't do PATH lookups, hence me asking you to specify /path/to/mkcephfs
[18:41] <Serge> i cannot find mkcephfs on my system, i can execute it but it is not in /usr/bin/
[18:42] <Tv|work> Serge: try /usr/sbin
[18:42] <Serge> if Im trying the first one with "mk" Im getting sh: Can't open /usr/bin/mkcephfs
[18:43] <Tv|work> Serge: sbin
[18:47] <Serge> Tv|work: Thanks! I pasted the output to http://pastebin.com/ydsSnsDB
[18:50] <Tv|work> Serge: what version of ceph is this?
[18:50] <Serge> ceph
[18:51] <Serge> ops, sry
[18:51] <Tv|work> i think you're about 5 months out of date
[18:52] <Tv|work> about 0.27 or so
[18:52] <Serge> oh.. i installed it by apt-get install... going to check the sources.list
[18:52] <Tv|work> err not necessarily 0.27, i'm reading this output wrong
[18:52] <Tv|work> but old
[18:57] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:04] <Serge> Tv|work: thank you very much. i added the source links to my sources.list and it works fine
[19:05] <Tv|work> nice
[19:05] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:06] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:08] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit (Quit: cp)
[19:30] * Serge (~serge@eduroam-49-215.uni-paderborn.de) Quit (Remote host closed the connection)
[19:42] * cp (~cp@ has joined #ceph
[19:46] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:59] * bchrisman (~Adium@ has joined #ceph
[20:02] <gregaf> bchrisman: pushed some file locking fixes to master (or you can cherry-pick off the head of wip-flock if you prefer)
[20:03] <bchrisman> gregaf: thanks.. we'll get some testing on that soon
[20:04] <lxo> gregaf, speaking of file locking... I got a question: it appears that most of my problems in ceph have had to do with mdses in standby-replay taking over while a lagging mds isn't quite dead yet
[20:04] <gregaf> bchrisman: haven't had a chance to check over jojy's code yet, but tell him it's in my queue
[20:05] <lxo> is there anything that stops the lagging mds from messing with the mds journal or osd data once another mds took over?
[20:05] <gregaf> lxo: yes, actually
[20:05] <gregaf> we did some work several months ago to strengthen it, but a new MDS doesn't take over until the old mds has been blacklisted, so OSDs refuse to talk to it
[20:05] <gregaf> and the old MDS shuts down once it finds out it's supposed to be dead
[20:06] <lxo> oh well... I'll have to come up with another explanation for why the number of failures I get has lowered so much once I stopped having standby-replay mdses
[20:06] <gregaf> lxo: what kind of failures?
[20:06] <lxo> if the old mds won't mess with data, maybe it is something with recovery that doesn't quite get things right
[20:07] <gregaf> standby-replay stresses the replay code, so it wouldn't be surprising that the standbys die, but it shouldn't affect the active MDS at all
[20:07] <lxo> mainly corruption in the ceph filesystem. one example is the one I mentioned earlier (I think you weren't on), in which I have some 15 files out of 4M objects that, if modified in any way, will cause the mds to crash
[20:08] <gregaf> hrm
[20:08] <gregaf> lxo: can you put info in the tracker?
[20:09] <gregaf> on that bug, I mean
[20:09] <lxo> I was thinking of standby-replays exposing more errors because for some reason mdses often go lagging for me, and when there's a standby, it often takes over, whereas if there isn't, the lagging mds will often carry on on its own
[20:09] <gregaf> it looks like it's trying to add duplicate inos to the cache, but I'll need backtraces and stuff to do anything with it
[20:09] <gregaf> lxo: well, it might be exposing race problems that we don't hit very often… :/
[20:10] <lxo> I'm not quite sure what to report. I'm pretty sure what leads to the problem is something that's already corrupted on the osd, but I have little idea of what it could be
[20:10] <gregaf> lxo: probably, but the backtrace will at least give us more information to hunt down the problem
[20:10] <gregaf> or if you can reproduce with logging on that would probably help too
[20:11] <lxo> I can reproduce the crash reliably given the current filesystem, for sure, but I don't know how to get to that evidently inconsistent state. it doesn't occur very often
[20:12] <gregaf> if you can reproduce it with debugging we can probably work out what happened to the on-disk state, which might let us work out how it happened
[20:12] <lxo> as soon as I'm done with the current session, I'll tune logging up and trigger the problem
[20:13] <lxo> you mean, trigger it within a gdb session? I could do that, sure
[20:14] <gregaf> no, just turn on the MDS logging and compress it and post it somewhere
[20:14] <lxo> I guess it might make sense to rebuild the mds without optimization to help debugging
[20:14] <lxo> 'k
[20:16] <lxo> one interesting bit: it seems like for all of the files that trigger the problem (I have tested it on a handful of the 3 handfuls), when I first ls the snapshotted directory that contains a copy of the file, shell glob expansion sees the file twice
[20:17] <lxo> subsequent runs of ls list the file only once, until I restart the mds or remount the filesystem
[20:17] <lxo> I noticed this with cfuse only, haven't tested with the kernel module
[20:17] <gregaf> that's odd but kind of makes sense, the assertion is testing that when you add an inode to the cache it doesn't already exist
[20:17] <gregaf> but it does
[20:18] <gregaf> so something very strange is happening
[20:18] <gregaf> got to run, be back later
[20:18] <lxo> 'k, thanks
[20:18] <Tv|work> gregaf: sounds like we should run ceph-qa-suite with standby mdses, see if that's enough to trigger this
[20:19] <lxo> Tv|work, do you often get lagging mdses in your tests?
[20:20] <Tv|work> lxo: well the tests push them fairly hard.. but we don't do anything to explicitly make them lag, as far as i know
[20:20] <Tv|work> at some point, we should totally intentionally slow down individual daemons
[20:21] <lxo> I just wondered if it happened to you as well. btrfs has this oddity that sometimes syncs will take a loooong time to complete
[20:24] * The_Bishop_ (~bishop@port-92-206-29-232.dynamic.qsc.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[21:05] * The_Bishop (~bishop@sama32.de) has joined #ceph
[21:24] <wido> 8736 pgs: 8736 active+clean; 19443 MB data, 62116 MB used, 74329 GB / 74520 GB avail
[21:24] <wido> a "df -h" still shows 191GB used
[21:24] <wido> I know that there is a delay, but 191GB is quite a lot, I didn't upload more than this 19GB of data to this cluster
[21:25] <wido> [2a00:f10:113:1:230:48ff:fed3:b086]:/ 73T 191G 73T 1% /mnt/ceph
[21:34] <gregaf> wido: oh, you mean the recursive stats show 191GB but you don't actually have that much
[21:34] <gregaf> any sparse files?
[21:34] <gregaf> those aren't handled nicely by the recursive stats, unfortunately :(
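A quick illustration of the sparse-file mismatch gregaf describes: the apparent size (st_size, which size accounting adds up) can be far larger than the blocks actually allocated on disk. Assumes GNU coreutils (`truncate`, `stat`); the file name is arbitrary.

```shell
# Create a 100 MB sparse file: large apparent size, ~0 bytes allocated.
truncate -s 100M /tmp/sparse.img

apparent=$(stat -c '%s' /tmp/sparse.img)               # bytes, from st_size
allocated=$(( $(stat -c '%b' /tmp/sparse.img) * 512 )) # 512-byte blocks on disk

# apparent will be 104857600; allocated will be near zero.
echo "apparent=$apparent allocated=$allocated"
```

Recursive stats that sum apparent sizes (as wido's "191GB used" suggests) count the full 100 MB here, even though almost nothing is stored.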
[21:37] <bchrisman> :)
[21:38] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[21:47] * kirkland (~kirkland@ has joined #ceph
[21:47] <kirkland> howdy! I'm having some trouble enabling rbd support into qemu
[21:47] <kirkland> ceph version 0.34
[21:48] <kirkland> qemu git head
[21:49] <kirkland> hmm, okay, i'm a bit further along now
[21:51] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[21:53] <jim> hi. my project is looking at couchdb, and one product in particular bigcouch which has a document replication/query router very similar to my understanding of how ceph handles object redundancy.
[21:54] <jim> so much so that it leads me to question whether a kernel module could pick up the slack for an erlang blob store
[21:54] <yehudasa> kirkland: any specific problem? generally you need to compile qemu with --enable-rbd
[21:56] <yehudasa> jim: is that a question?
[21:57] <Tv|work> what's http://ceph.newdream.net/debian-rgw/ and how is it different from the other autobuilt debs?
[21:57] <Tv|work> i mean, i see it has apache2 etc, but it *also* has ceph
[21:57] <yehudasa> Tv: might be just an old repository that I used to push into?
[21:58] <Tv|work> hrmmph
[21:58] <Tv|work> i dislike old things laying around
[21:58] <jim> yehudasa: i dont think so but it is of significant interest to me as to why one wouldn't simply overhaul couchdb with ceph
[21:58] <Tv|work> jim: couchdb is something *very* different from cpeh
[21:59] <Tv|work> jim: couchdb is meant to handle disconnected operation; ceph very much isn't
[21:59] <jim> Tv|work: the bigcouch replication abstraction is not
[21:59] <Tv|work> jim: "store N copies" is simple enough to look similar in many places
[21:59] <gregaf> I assume it maps sort of onto RADOS, but I'm not very familiar with how CouchDB does its replication
[22:00] <jim> Tv|work: so why would you trust an erlang hack when a simple C kernel module would do?
[22:00] <Tv|work> each couchdb instance is basically a btree database
[22:00] <jim> whoa you mean like btrfs ?
[22:00] <jim> gee
[22:00] <Tv|work> yup, quite a lot like that
[22:01] <Tv|work> with extra features on top for keeping track of peer replication and view updates
[22:01] <jim> sort of like mounting a btrfs volume from any set of metadata would do ?
[22:02] <Tv|work> totally different things, though
[22:02] <Tv|work> would you like to pay the syscall overhead for every key fetch?
[22:02] <jim> isn't that what userspace does?
[22:03] <Tv|work> databases are almost *all* about amortizing syscall overhead over multiple operations
[22:04] <jim> you mean btrfs would bog down trying to lock a vfs node to perform lookup ? i wonder why they never thought of fast access
[22:04] <Tv|work> it'd be slow before it got anywhere near btrfs code
[22:05] <Tv|work> compared to "oh i have this block already in ram, here you go"
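Tv|work's point about amortizing syscall overhead can be made concrete with `dd`: reading the same 64 KiB with `bs=1` issues one read() per byte, while `bs=64k` needs a single read(). dd's "records in" count is a proxy for the number of read() syscalls; the file name and sizes here are arbitrary.

```shell
# Make a 64 KiB test file.
dd if=/dev/zero of=/tmp/blob bs=1k count=64 2>/dev/null

# "records in" ~ number of read() syscalls issued.
small=$(dd if=/tmp/blob of=/dev/null bs=1   2>&1 | sed -n 's/^\([0-9]*\)+0 records in$/\1/p')
large=$(dd if=/tmp/blob of=/dev/null bs=64k 2>&1 | sed -n 's/^\([0-9]*\)+0 records in$/\1/p')

# 65536 reads vs. 1 read for the same data.
echo "bs=1: $small reads   bs=64k: $large reads"
```

This is the overhead a userspace database amortizes by batching; a per-key syscall into a kernel store would pay it on every lookup.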
[22:06] * The_Bishop (~bishop@sama32.de) Quit (Ping timeout: 480 seconds)
[22:06] <jim> so you are saying a master-master replication system is somehow faster at doing what ceph does because it's somehow more refined rpc/ipc than what the kernel can do to avoid wasting IO on access patterns?
[22:06] <Tv|work> no, not at all
[22:06] <Tv|work> i'm saying putting a database into the kernel isn't a smart move
[22:07] <jim> if couchdb was relational, i would agree the nodes don't have any similar granularity to vfs. im not sure how a key value store is inferior to userspace batch operations
[22:08] <Tv|work> we have a communication problem, and it doesn't seem to be resolving
[22:08] <Tv|work> never mind what i said
[22:08] <jim> if the btree underneath is simply multi-volume btrfs, the query language is javascript for transient indexes, no userspace savings opportunities are apparent
[22:08] <Tv|work> only if you put a javascript interpreter in the kernel
[22:09] <Tv|work> if you're down that path, well, umm, good luck
[22:10] <jim> i dont see how javascript in the kernel would make a bit of difference. so what if you have a syscall to access a partial extent of something sitting in buffercache with the rest of the parent extent
[22:11] <jim> i dont see how ceph would increase page fault delta overall, im just missing the magical erlang usersauce that makes it somehow faster to maintain a btree internally on master-master replication
[22:11] <gregaf> jim: couchdb is just done very differently than RADOS is
[22:12] <gregaf> and the amount of in-kernel code for replication is zero, all the replication is done in the userspace OSDs
[22:14] <jim> ahh got it. perhaps i am optimistically blurring the btrfs multivolume coordination with ceph's
[22:14] <gregaf> yeah, those two things aren't related at all
[22:15] <jim> gregaf: couchdb specifically doesnt do what bigcouch does for couchdb
[22:15] * adjohn (~adjohn@ has joined #ceph
[22:16] * The_Bishop (~bishop@p4FCDF1C6.dip.t-dialin.net) has joined #ceph
[22:16] <jim> couchdb is a closed loop doc store. bigcouch wires up auto sharding
[22:27] <kirkland> yehudasa: i needed to install *both* librados-dev and librbd-dev
[22:28] <kirkland> yehudasa: previously, i was just installing librados-dev
[22:28] <kirkland> yehudasa: now I'm building ;-) thanks!
[22:28] <yehudasa> kirkland: yeah, we moved a great chunk of code into librbd
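Summarizing the build steps from this exchange as a non-executable recipe (a sketch, not verified against any particular qemu checkout): both -dev packages are needed once the rbd code was split out of librados into librbd, and qemu needs the rbd flag at configure time.

```shell
# Build-recipe sketch from the conversation above (assumptions: Debian-style
# packaging, qemu git head of this era). Both dev packages are required
# since rbd moved from librados into its own librbd.
apt-get install librados-dev librbd-dev

# In the qemu source tree:
./configure --enable-rbd
make
```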
[22:29] <kirkland> yehudasa: it might be nice to mention those on http://ceph.newdream.net/wiki/QEMU-RBD
[22:29] <kirkland> under "Building"
[22:29] <yehudasa> kirkland: yep, thanks
[22:33] * slang1 (~Adium@ has joined #ceph
[22:33] <slang1> http://dl.dropbox.com/u/18702194/osd.17.log
[22:34] <slang1> pg is 1.0p17
[22:34] * sagelap (~sage@m8c0536d0.tmodns.net) has joined #ceph
[22:34] <slang1> http://dl.dropbox.com/u/18702194/osd.17.log
[22:34] <slang1> pg is 1.0p17
[22:35] * sagelap (~sage@m8c0536d0.tmodns.net) has left #ceph
[22:37] * cp (~cp@ Quit (Quit: cp)
[22:39] <slang1> http://dl.dropbox.com/u/18702194/osd.17.log.gz
[22:39] <slang1> might be a bit fast
[22:39] <slang1> er
[22:40] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[22:54] <wido> gregaf: i was syncing my homedir
[22:55] <wido> now you say so, could be that I have some sparse KVM vms in some dir
[22:55] <wido> hidden somewhere
[22:55] <wido> i'll check for that
[22:56] <wido> could be qcow2 or raw images
[23:01] * cp (~cp@ has joined #ceph
[23:03] * danielf (~df@ has joined #ceph
[23:09] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[23:09] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Read error: Connection reset by peer)
[23:20] * slang1 (~Adium@ Quit (Quit: Leaving.)
[23:23] * danielf (~df@ Quit (Quit: danielf)
[23:23] * danielf (~df@ has joined #ceph
[23:32] * slang1 (~Adium@ has joined #ceph
[23:32] * Anticimex (anticimex@netforce.csbnet.se) Quit (Ping timeout: 480 seconds)
[23:32] * lxo (~aoliva@9KCAAA5OH.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[23:34] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[23:36] * slang1 (~Adium@ Quit ()
[23:37] * lxo (~aoliva@19NAADS2A.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:45] * slang1 (~Adium@ has joined #ceph
[23:48] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[23:55] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.