#ceph IRC Log


IRC Log for 2010-11-06

Timestamps are in GMT/BST.

[0:15] <stingray> 2010-11-06 02:13:46.968856 7f20184b0700 log [ERR] : unmatched fragstat size on single dirfrag 100000cccba, inode has f(v6 m2010-11-05 06:15:53.589696 2=2+0), dirfrag has f(v7)
[0:15] <stingray> mds/CInode.cc: In function 'void CInode::finish_scatter_gather_update(int)':
[0:15] <stingray> mds/CInode.cc:1582: FAILED assert(!!"unmatched fragstat size" == g_conf.mds_verify_scatter)
[0:15] <sagewk> the very code i'm working on now :)
[0:17] <stingray> sagewk: btw, are you working on this fulltime?
[0:17] <sagewk> yes :)
[0:18] <stingray> heh
[0:18] <stingray> great
[0:57] * Ifur (~osm@big.aksis.uib.no) has joined #ceph
[0:59] <Ifur> would you advise against trying ceph in a production environment?
[1:00] <Ifur> at my new job, I am desperate to toss out AFS
[1:00] <sagewk> Ifur: yes
[1:02] <Ifur> the alternatives aren't much better... :S
[1:02] <sagewk> for a bit longer at least!
[1:04] <Ifur> I'm fine with beta testing, data loss is not a big issue either (can be worked around -- short term). as long as it can be brought up again fast.... have a cluster that needs to read configuration files simultaneously.... and with AFS it is taking a long time...
[1:05] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[1:05] <gregaf> Ifur: well our stance is that it's not production-ready yet
[1:06] <Ifur> and this is in large part also due to btrfs?
[1:06] <gregaf> but if you want to try it out that sounds like a fairly safe workload to put it through, especially if you have alternatives you can switch back over to
[1:06] <gregaf> we haven't encountered many issues with btrfs lately, actually
[1:07] <cmccabe> also, it can be run with btrfs or ext
[1:07] <gregaf> yep, it can run with either
[1:07] <gregaf> mostly it's just that there are some known bugs and it hasn't gotten the requisite amount of testing to be done yet — every time we get a new user they put it through different workloads and find new bugs
[1:07] <gregaf> usually they're pretty trivial, but they add up
[1:08] <Ifur> then my second question: what is it like on CPU usage for the storage nodes?
[1:09] <gregaf> umm, not really sure — it depends on the workload I suppose!
[1:09] <gregaf> it's not free, but I don't think I've ever seen it over 1 core on a reasonably-modern system
[1:10] <Ifur> third question then: if i put the storage nodes in virtual machines, and stop the storage node to prevent CPU usage, would ceph handle this... as in not crashing when they are put online again?
[1:11] <gregaf> assuming no bugs, nothing will crash
[1:11] <Ifur> (would be a nice way to prevent filesystem usage during run-time, since this is an online system not writing results to disk.)
[1:12] <gregaf> but in general stopping and starting its systems isn't recommended since the nodes do monitor each other to ensure data is replicated properly
[1:12] <Ifur> can you for example mount it read-only one place, and read write in another place on the same client node?
[1:12] <Ifur> lots of questions suddenly...
[1:12] <gregaf> not sure I understand what you mean
[1:13] <Ifur> to have control, in a physical sense, in preventing clients writing data and only reading data.
[1:13] <cmccabe> there was a proposal to allow read-only bind mounts a while ago
[1:13] <cmccabe> I don't know if they actually did it (in the core VFS)
[1:14] <Ifur> that would be useful to me at least, since writing data would usually come from dedicated machines for development purposes.
[1:15] <gregaf> well mount has a read-only option...
[1:15] <gregaf> if you're really paranoid you could also use an authenticated cluster
[1:16] <gregaf> and then give your read-only clients keys that don't have write permissions for the OSDs
[1:16] <Ifur> ceph supports kerberos?
[1:16] <cmccabe> I tried read-only bind mounts on 2.6.36rc8, not working
[1:17] <cmccabe> oh, maybe your question is just whether ceph can be mounted read-only? you definitely can do that.
[1:17] <gregaf> not kerberos
[1:17] <gregaf> it's not scalable enoug
[1:17] <gregaf> *enough
[1:18] <gregaf> but it has a Kerberos-like authentication mechanism built in (cephx)
[1:18] <yehudasa_hm> gregaf: not sure whether kerberos not scalable enough is the right issue
[1:18] <gregaf> well I thought it was about key distribution and management getting too expensive?
[1:19] <gregaf> I'm not actually that familiar with Kerberos outside of our authentication being modeled after it, but better :)
[1:19] <gregaf> (better for us, obviously; not for everything)
[1:19] <yehudasa_hm> there were a few issues
[1:19] <yehudasa_hm> first, just integrating it into the ceph server would have complicated deployment
[1:20] <gregaf> I'm afraid it's time for me to head out — back on in a bit
[1:20] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:21] <yehudasa_hm> I'll continue without greg.. second, we wanted to have a single authentication with each entity (osd, mds, mon) on one hand, but otoh, we didn't want to share a secret between them
[1:21] <cmccabe> kerberos tries to be sort of universal
[1:21] <yehudasa_hm> (which we do with the monitors, but that's a different issue)
[1:21] <Ifur> ceph is a very promising project, not at the stage of being overly complicated because of edge-case-use yet :-)
[1:21] <cmccabe> so you get a TGT (ticket-granting-ticket) and then every application on your system is then trusted
[1:22] <yehudasa_hm> yeah, well.. we do the same
[1:22] <Ifur> kerberos is quite good, use it here... but ticket lifetime, etc etc is quite annoying.
[1:22] <cmccabe> yehudasa_hm: out of curiosity, do all the OSDs have the same secret?
[1:22] <Ifur> also, there are strange bugs and behaviour with kerberos and pam, at least on ubuntu...
[1:22] <yehudasa_hm> there are two types of secrets
[1:23] <yehudasa_hm> for the osds and the mons
[1:23] <yehudasa_hm> and the mds
[1:23] <yehudasa_hm> one is the secret for each entity
[1:23] <yehudasa_hm> and then there's some shared secret that is being generated every period
[1:23] <yehudasa_hm> so once they are authenticated they're notified about this shared secret
[1:23] <yehudasa_hm> and this is the secret that is being used for the clients tickets
[1:24] <cmccabe> ic
[1:24] <cmccabe> so the MONs all trust one another
[1:25] <yehudasa_hm> the mons are an exception
[1:25] <cmccabe> but when clients authenticate they get a temporary secret that allows them to do stuff for a while
[1:25] <yehudasa_hm> yeah, the clients get a temporary ticket, and they have to renew it once in a while
[1:26] <yehudasa_hm> and the osds and mds generate a new shared key every once in a while too
[1:26] <Ifur> kerberos is one of the things that makes AFS horrible, that it requires it.
[1:26] <yehudasa_hm> but they keep something like 3 keys that they accept.. the previous, the current and the next one
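[Editor's note: a toy Python sketch of the rotating shared-secret scheme yehudasa_hm describes above -- this is an illustration of the idea, not Ceph's actual cephx code. Key names are made up.]

```python
# Servers keep a small window of shared-secret generations -- previous,
# current, and next -- and accept a ticket signed with any key still in
# the window, so clients survive a rotation without instantly failing.
from collections import deque

class RotatingKeyring:
    def __init__(self, first_key):
        # window holds at most 3 generations: previous, current, next
        self.window = deque([first_key], maxlen=3)

    def rotate(self, new_key):
        """Add the next-generation key; the oldest one ages out."""
        self.window.append(new_key)

    def accepts(self, key):
        """A ticket is valid if its key is anywhere in the window."""
        return key in self.window

ring = RotatingKeyring("gen1")
ring.rotate("gen2")
ring.rotate("gen3")
print(ring.accepts("gen1"), ring.accepts("gen3"))  # both still accepted
ring.rotate("gen4")
print(ring.accepts("gen1"))  # gen1 has aged out of the window
```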
[1:27] <Ifur> not to rant, but AFS is conceptually flawed.
[1:28] <cmccabe> Ifur: I still think AFS is better than NFS.
[1:28] <cmccabe> Ifur: ceph is a different thing than either one of those, though.
[1:28] <Ifur> i think that depends heavily on what's underneath NFS and how it's used.
[1:28] <cmccabe> well, the NFS protocol itself is horrible
[1:29] <Ifur> AFS is great if you dont use it (much).
[1:29] <cmccabe> it completely breaks posix semantics in a lot of ways... weird and surprising behavior with lockfiles, O_CREAT, rmdir, multiple readers/writers, etc.
[1:29] <cmccabe> the whole .nfs_temp_XXXX hack
[1:29] <Ifur> it doesn't scale for read, it doesn't scale for write, it doesn't have parallel read or write, I mean... it's a slow archive, probably great for tape.
[1:30] <cmccabe> well, AFS tries to cache a lot of stuff at the clients, in my understanding.
[1:30] <Ifur> which is great if your underlying storage is slow to begin with, hence tape.
[1:30] <cmccabe> the original version of AFS didn't even write any part of the file back to the server until you closed the file!
[1:31] <Ifur> being an IBM invention, it was probably designed around the idea of peddling their expensive tape storage solutions.
[1:31] <cmccabe> AFS was invented at Carnegie Mellon
[1:31] <cmccabe> Andrew File System came from Andrew Carnegie, the name of the college's founder
[1:32] <cmccabe> later there was a spinoff company to commercialize it called TransArc. Somehow that ended up getting acquired by IBM many years later
[1:32] <cmccabe> the version you're using is probably OpenAFS, a re-implementation in the Linux kernel
[1:33] <Ifur> so a bit more involved than I thought, but I would be shocked if IBM didn't intend to capitalize on AFS through tape storage solutions.
[1:33] <cmccabe> I'm not aware of any reason why AFS would be slower than NFS, besides a bad implementation
[1:33] <Ifur> few storage nodes, many clients. :)
[1:33] <cmccabe> AFS has open-to-close consistency so once you open the file, basically there's no guarantee that you'll see anyone else's changes made after that point
[1:34] <cmccabe> NFS provides a similar (lack of) guarantee
[1:34] <Ifur> ah, didn't know that. So NFS would also not be great for this use.
[1:34] <cmccabe> well, NFS supports flock()
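[Editor's note: a minimal, self-contained Python illustration of the advisory flock() locking cmccabe mentions -- this shows the local POSIX behavior only; whether flock() works over NFS depends on the client/server implementation.]

```python
import fcntl
import tempfile

# Take an exclusive advisory lock on a scratch file, then release it.
# Advisory means only cooperating processes that also call flock() are
# excluded; nothing stops a process that ignores the lock.
locked = False
with tempfile.NamedTemporaryFile() as f:
    fcntl.flock(f.fileno(), fcntl.LOCK_EX)   # blocks until we hold the lock
    locked = True
    # ... critical section: no other cooperating process holds the lock ...
    fcntl.flock(f.fileno(), fcntl.LOCK_UN)   # release
print("lock acquired and released:", locked)
```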
[1:36] <cmccabe> setting up security under NFS is supposed to be really hard for sysadmins
[1:37] <cmccabe> I think AFS forces you to use Kerberos which is also quite difficult to set up
[1:37] <cmccabe> There is a new version of NFS coming out called pNFS (parallel NFS) that really will be faster. But I don't know who supports it yet.
[1:39] <Ifur> currently been testing fhgfs, easy to set up, works quite well so far... but lacks functionality like failover, redundancy etc... no option to make data available/distributed across many nodes.
[1:39] <Ifur> I'd heard about cluster NFS, but guess this is the old nfsv3? pNFS is for NFSv4?
[1:40] <cmccabe> pNFS is also known as NFS4.1
[1:40] * greglap (~Adium@ has joined #ceph
[1:40] <cmccabe> pNFS is not a true clustered filesystem, but it does have separate metadata and data servers
[1:41] <Ifur> guess its not production ready either? =)
[1:43] <cmccabe> I'm not really sure where it's at
[1:45] <cmccabe> NFS has always been geared more to environments where the nodes don't share a lot of data with each other
[1:47] <cmccabe> Ceph, Lustre, PVFS are more focused on having multiple readers and writers
[1:48] <cmccabe> also, unlike Ceph, NFS metadata servers still store explicit maps of where everything is
[1:49] <cmccabe> ah. Found an article about it. Actually with pNFS it's metadata server (singular.) "There can be only one."
[1:51] <Ifur> I'm fortunately only at a couple of hundred nodes, wouldn't hit a performance ceiling there...
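[Editor's note: a toy Python contrast of the two placement styles discussed above -- an explicit per-object map (the pNFS/NFS style) versus computed placement (the Ceph/CRUSH style). This is a simple hash mod, not the real CRUSH algorithm; all names are made up.]

```python
import hashlib

osds = ["osd.0", "osd.1", "osd.2", "osd.3"]

# Style 1: explicit map -- the metadata server must store one entry per
# object, so the map grows with the data.
explicit_map = {"obj_a": "osd.2", "obj_b": "osd.0"}

# Style 2: computed placement -- constant metadata; any client derives
# the same location from the object name and the OSD list alone.
def place(obj_name):
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return osds[h % len(osds)]

# Deterministic: no round-trip to a metadata server is needed.
print(place("obj_a") == place("obj_a"), place("obj_a") in osds)
```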
[2:01] <cmccabe> bye
[2:01] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) Quit (Quit: Leaving.)
[2:08] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:10] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[2:11] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[2:14] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[2:22] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[2:35] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[2:48] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[3:08] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[4:57] * MarkN (~nathan@ Quit (Ping timeout: 480 seconds)
[5:42] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) has joined #ceph
[5:51] * MarkN (~nathan@ has joined #ceph
[6:38] * p-static (~ravi@ has joined #ceph
[6:38] <p-static> hey, configuration question
[6:38] <p-static> what's the right way to configure a host that has multiple hard drives in it?
[6:38] <p-static> for some reason all the examples on the wiki assume one disk per host
[6:38] <p-static> at least as far as I can see
[6:39] <sage> multi device btrfs, or raid
[6:39] <yehudasa_hm> or multiple osds on a single host?
[6:39] <p-static> multiple osds in a single host I guess
[6:40] <p-static> the disks are different sizes so I doubt btrfs or md raid would do what I want :)
[6:40] <yehudasa_hm> sage, correct me if I'm wrong, but btrfs is supposed to handle that, right?
[6:45] <p-static> so putting all the devs in a btrfs raid is the preferred way to do it?
[6:46] <p-static> seems kind of suboptimal
[6:57] <p-static> sage, yehudasa_hm?
[6:58] <yehudasa_hm> p-static: why would it be suboptimal?
[6:59] <p-static> if a dev fails out of a btrfs raid0, then I lose all the data on that server, when I should be able to only lose the data from that one disk
[7:00] <p-static> and if I use a raid1 instead, then I don't get full flexibility managing my replication
[7:00] <yehudasa_hm> there's a tradeoff for everything
[7:01] <p-static> you mentioned multiple osds on a single host earlier, is that a possibility?
[7:01] <yehudasa_hm> yeah, sure
[7:02] <p-static> how would I configure that - just add multiple osd sections in the config, that point to the same host?
[7:02] <yehudasa_hm> yes
[7:02] <p-static> mmkay, cool, I'll try that
[7:02] <p-static> thanks
[7:03] <yehudasa_hm> np
[7:04] <p-static> do I have to do anything special at startup - launch cosd twice, or anything like that?
[7:05] <yehudasa_hm> if you use the /etc/init.d/ceph script it should do it for you
[7:05] <p-static> woot, thanks
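[Editor's note: a sketch of the kind of ceph.conf this exchange implies -- two OSD sections pointing at the same host, one per disk. Section and option names follow the style of that era; the hostname and device paths are placeholders, not taken from the log.]

```ini
[osd]
        osd data = /data/osd$id

[osd.0]
        host = node1
        btrfs devs = /dev/sdb

[osd.1]
        ; second OSD on the same host, backed by the second (different-sized) disk
        host = node1
        btrfs devs = /dev/sdc
```

With a layout like this, the /etc/init.d/ceph script mentioned above would start one cosd per [osd.N] section on that host.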
[7:28] * MarkN (~nathan@ Quit (Ping timeout: 480 seconds)
[7:44] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[7:44] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[7:45] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[8:04] * p-static (~ravi@ Quit (Quit: leaving)
[8:14] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has joined #ceph
[8:22] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[8:31] * MarkN (~nathan@ has joined #ceph
[8:36] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[8:57] * MarkN (~nathan@ Quit (Ping timeout: 480 seconds)
[9:57] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) has joined #ceph
[10:22] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[10:41] * yehudasa_hm (~yehuda@ppp-69-228-129-75.dsl.irvnca.pacbell.net) Quit (Ping timeout: 480 seconds)
[10:56] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[11:42] <stingray> and it turned into a pumpkin again.
[12:16] * johnl (~johnl@ has joined #ceph
[12:17] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[12:36] * allsystemsarego (~allsystem@ has joined #ceph
[12:56] * johnl (~johnl@ Quit (Quit: bye)
[13:32] <stingray> 2010-11-06 15:32:00.445148 log 2010-11-06 15:27:23.672793 mds0 5 : [ERR] rfiles underflow -2 on [inode 100000ccc67 [...2,head] /test/backup-files/contester/.git/objects/ auth v2097 pv2113 f(v244 m2010-11-06 15:27:21.624581 1=0+1) n(v741 rc2010-11-06 15:27:21.624581 b-944 -1=-2+1) (inest mix w=1 dirty) (ifile excl dirty) (iversion lock) caps={5305=pAsLsXsFsx/p@173},l=5305 |
[13:32] <stingray> dirtyscattered lock dirfrag caps dirty 0x7f5a2810b208]
[13:33] <stingray> hahaha drwxr-x--- 1 501 wheel 18446744073709550672 Nov 6 15:27 contester
[13:46] <jantje_> nice size :-)
[14:29] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[14:32] <wido> stingray: that's a pretty large file :)
[14:41] <stingray> it's even funnier - it's a directory with no files in it
[14:52] <wido> oh, indeed
[14:56] * sentinel_e86 (~sentinel_@ Quit (Quit: sh** happened)
[14:57] * sentinel_e86 (~sentinel_@ has joined #ceph
[15:27] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[15:40] * julienhuang (~julienhua@77130.cuirdandy.com) has joined #ceph
[18:04] * sentinel_e86 (~sentinel_@ Quit (Quit: sh** happened)
[18:05] * sentinel_e86 (~sentinel_@ has joined #ceph
[18:13] * Yoric (~David@ has joined #ceph
[18:49] * Yoric (~David@ Quit (Quit: Yoric)
[19:34] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[19:44] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[19:53] * julienhuang (~julienhua@77130.cuirdandy.com) Quit (Quit: julienhuang)
[19:59] <sage> wido: weird, my .debs have that dependency.. it seems to be autodetected by the package build process. double checking the latest unstable.
[19:59] <wido> sage: ok :)
[20:00] <wido> btw, what I just found, usr/share/ceph_tool isn't included either in the deb
[20:00] <wido> should be added to ceph-client-tools.install
[20:01] <sage> oh right
[20:02] <sage> hmm i wonder if that should really be ceph_tool or just ceph
[20:02] <wido> Hmm, true. But most systems running "ceph" will be a server, no GTK present
[20:04] <sage> yeah
[20:05] <sage> yeah it may make sense to try to separate it out from the ceph tool then. will need to make that code more modular.
[20:05] <sage> and have a separate ceph-gui package.
[20:06] <wido> that would be nice, since gtkmm depends on a lot of libraries, which most users won't need/use
[20:12] <stingray> sage
[20:12] <stingray> so I tested your frag checks
[20:12] <stingray> you committed yesterday
[20:12] <stingray> the results are above
[20:12] <stingray> more interesting question is how to fix it
[20:13] <sage> yeah
[20:13] <sage> hold on, i'll push the latest
[20:13] <stingray> I mean, it is a directory that looks empty but huge and cannot be deleted.
[20:13] <stingray> ok
[20:13] <stingray> i'll deploy it there then :)
[20:14] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[20:23] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[20:40] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) has joined #ceph
[22:16] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[22:19] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[22:45] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[22:47] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[23:01] <stingray> sage?
[23:02] <stingray> I only see a582345c57ef349bfa0a7735a8cbcf81f5a99567 here
[23:08] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:16] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[23:28] * stingray is now known as vuvuzela
[23:28] <vuvuzela> VUVUUUUUUUUUUUUUUU~~~~~~~~~~~~~~~~~~~
[23:34] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.