#ceph IRC Log


IRC Log for 2010-08-10

Timestamps are in GMT/BST.

[0:50] <darkfade1> 1.0?
[0:50] <darkfade1> !list me
[0:50] <darkfade1> (that worked on aol)
[1:38] * Anticimex (anticimex@netforce.csbnet.se) has joined #ceph
[2:15] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[2:16] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[5:00] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[5:00] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) has joined #ceph
[7:02] * f4m8_ is now known as f4m8
[7:39] * mtg (~mtg@vollkornmail.dbk-nb.de) has joined #ceph
[8:11] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[9:43] * clochette (~clochette@ANantes-552-1-20-225.w86-203.abo.wanadoo.fr) has joined #ceph
[9:45] * clochette (~clochette@ANantes-552-1-20-225.w86-203.abo.wanadoo.fr) Quit ()
[10:11] * allsystemsarego (~allsystem@ has joined #ceph
[11:49] <todinini> gregaf: I don't think I am logging too much on the mds, the config is debug mds = 1
[12:14] * allsystemsarego_ (~allsystem@ has joined #ceph
[12:19] * allsystemsarego_ (~allsystem@ Quit (Quit: Leaving)
[12:48] * unenana (~unenana@ANantes-552-1-20-225.w86-203.abo.wanadoo.fr) has joined #ceph
[12:50] * unenana (~unenana@ANantes-552-1-20-225.w86-203.abo.wanadoo.fr) Quit ()
[13:49] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[15:11] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[15:50] * f4m8 is now known as f4m8_
[16:22] * allsystemsarego (~allsystem@ Quit (Ping timeout: 480 seconds)
[16:27] * allsystemsarego (~allsystem@ has joined #ceph
[16:42] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) has joined #ceph
[16:43] * Osso_ (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) has joined #ceph
[16:43] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) Quit (Read error: Connection reset by peer)
[16:43] * Osso_ is now known as Osso
[16:48] * mtg (~mtg@vollkornmail.dbk-nb.de) Quit (Quit: Leaving)
[16:55] * ghaskins_mobile (~ghaskins_@pool-68-163-130-94.bos.east.verizon.net) has joined #ceph
[17:05] <todinini> I got a trace on a ceph client in vmtruncate_work http://pastebin.com/rCD29nF7 kernel is vanilla 2.6.25
[17:12] <jantje> I hope you meant 2.6.35
[17:14] <todinini> jantje: yep 2.6.35
[17:21] * ghaskins_mobile (~ghaskins_@pool-68-163-130-94.bos.east.verizon.net) Quit (Quit: This computer has gone to sleep)
[17:48] <wido> sagewk: you there yet?
[17:59] * gregphone (~gregphone@ has joined #ceph
[18:00] <gregphone> todinini: yeah, the mds might just be song that much CPU
[18:00] <gregphone> *using
[18:00] <gregphone> It has to do a lot
[18:01] <gregphone> wido: I think sage is out today at linuxcon or something
[18:02] <wido> ah, ok
[18:02] <wido> i've got a weird problem, not sure what it is
[18:02] <wido> btw, tcmalloc++!
[18:03] <gregphone> Haha, yeah
[18:03] <wido> http://www.pastebin.org/466049
[18:03] <gregphone> Should have pushed it a while ago but I wanted to get a few other things done first
[18:03] <gregphone> Ah well
[18:03] <wido> the find has been hanging for a few hours on that point, same goes for a rsync, fails hanging on the same file / directory
[18:03] <wido> any idea to debug this? why it hangs there
[18:06] <gregphone> Did the mds crash?
[18:07] <wido> no, it is still up
[18:07] <wido> other parts of the filesystem are working
[18:08] <gregphone> Not sure then
[18:08] <gregphone> I have been getting mds hangs lately but they're on a private branch which I thought I just broke
[18:08] <gregphone> Well, metadata hangs, not mds hangs
[18:09] <gregphone> Are you running one mds or a cluster?
[18:09] <wido> two MDS
[18:09] <wido> that is still unstable, i know
[18:10] <gregphone> Okay, I'll see if I can reproduce it on unstable later today
[18:11] <wido> it might be hard, i've synced kernel.org, ubuntu and debian
[18:12] <gregphone> You could try just restarting both of them, maybe
[18:12] <wido> but kernel.org still seems to be a good test, try syncing it and re-syncing it again, that will fail on some point
[18:12] <wido> yes, i will
[18:13] <gregphone> Hopefully a bug won't take that long, I've usually been hitting it input qa suite
[18:14] <gregphone> *in our qa suite
[18:14] <gregphone> But it was on a cleanup branch so I just thought I'd deleted something too early
[18:14] <gregphone> It'd be kind of a relief for me if it was a bug in unstable ;)
[18:18] <todinini> gregphone: if the mds is busy shouldn't the client handle that a bit more gracefully?
[18:20] <gregphone> What do you mean?
[18:21] * sagelap (~sage@ has joined #ceph
[18:45] <wido> gregphone: btw, my MDS is still eating 94% memory, might be because of the many files i have on the fs
[18:48] * gregphone (~gregphone@ Quit (Ping timeout: 480 seconds)
[18:52] <wido> sagelap: speaking or attending linuxcon?
[18:54] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[18:56] <yehudasa> I think it's the filesystem and storage summit
[18:56] * gregphone (~gregphone@ has joined #ceph
[18:57] * gregphone (~gregphone@ has left #ceph
[18:57] <yehudasa> wido: there are a few issues with multiple mds, mainly the import/export caps is kinda broken which can lead to hanging clients
[18:57] <wido> so for now, you recommend a single MDS?
[18:58] <yehudasa> yeah
[18:58] <yehudasa> if you're looking for more stability
[18:58] <wido> ok, next week i'll install the 32G machine, to see how the MDS handles that
[18:58] <wido> with a lot, a lot of files
[19:04] <wido> yehudasa: about issue #174, that seems to be something i have to check further. Uploading the 1.1G file went fine, 4.4G didn't, only 400MB got uploaded, might be a client issue, not sure
[19:06] <yehudasa> hmm, ok
[19:06] <gregaf> wido: if your MDS is using up too much memory you can try setting the mds_cache_size config option to a lower value
[19:06] <gregaf> it defaults to 100000 and IIRC it's the max number of inodes to cache in-memory
[19:07] <yehudasa> wido: sounds like a 32bit size issue
[19:07] <wido> yehudasa: hmm, you could be right, the client indeed is 32bits
[19:08] <wido> gregaf: any idea how much memory one inode takes? sage mentioned something about 2k?
[19:08] <yehudasa> yeah, well.. that shouldn't matter, but somewhere we hold the size as 32 bit so it gets truncated
[19:08] <gregaf> that's all I know, yeah
[19:08] <gregaf> but that's only ~200MB by default so I dunno where the rest of the gig or so is coming from
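The "~200MB by default" figure can be checked with a back-of-the-envelope calculation. This is a minimal sketch, assuming the roughly 2 KB per cached inode that wido attributes to sage (a quoted figure, not a measurement) and the default mds_cache_size of 100000 mentioned in the channel. The function name is hypothetical.

```python
# Rough estimate of MDS cache memory from the numbers quoted in the channel.
# Assumption: ~2 KB per cached inode (quoted in the log, not measured).
BYTES_PER_INODE = 2 * 1024


def estimated_cache_bytes(mds_cache_size: int,
                          bytes_per_inode: int = BYTES_PER_INODE) -> int:
    """Approximate memory needed to hold `mds_cache_size` inodes in cache."""
    return mds_cache_size * bytes_per_inode


# Default mds_cache_size as stated by gregaf.
default_estimate = estimated_cache_bytes(100_000)
print(f"{default_estimate / 2**20:.0f} MiB")  # prints "195 MiB"
```

Under these assumptions the cache alone accounts for only about 5% of the 3.8GB wido reports, which is why the remainder is unexplained in the discussion.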
[19:11] <wido> it's eating about 3.8G of memory
[19:12] <gregaf> 3.8GB?
[19:12] <gregaf> that's, uh, a lot
[19:12] <wido> yes, 3.8GB, 91.4% of the memory
[19:14] <gregaf> how long does it take to get to that usage?
[19:15] <wido> i'll check
[19:20] <wido> gregaf: when the MDS starts it eats 7.9% of the 4GB, but then when i start the find the memory consumption grows rapidly to about 90%
[19:22] <gregaf> and then the find hangs
[19:22] <gregaf> I suspect that memory use would drop if the find finished; I'm not sure how rigorously the maximum cache size is enforced
[19:25] <wido> ok, might be an issue. But i'll leave it for now, i'll test it next week again with much more memory, see what that does
[20:06] <wido> yehudasa: tried the 4.4GB upload again, size reported by the GW in a bucket listing is fine, the ETag also matches, but when downloading the file, a size of 391212054b is reported
[20:11] <wido> ok, it is an overflow somewhere, 4686179350 is the total filesize, where 391212054 is the remainder of the filesize minus 4GB
[20:11] <wido> ((1024 * 1024 * 1024) * 4) + 391212054 = 4686179350
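The arithmetic above matches what happens when a size is stored in an unsigned 32-bit field: only the low 32 bits survive. The following is a minimal illustration of that effect; where exactly the truncation occurs in the gateway or client is not known from the log, and the function name is hypothetical.

```python
# Illustration of a file size being truncated to an unsigned 32-bit field.
# Assumption: the truncation is a plain "keep the low 32 bits" wraparound.
UINT32_MASK = 0xFFFFFFFF  # 2**32 - 1


def truncate_to_u32(size: int) -> int:
    """Keep only the low 32 bits, as a 32-bit unsigned field would."""
    return size & UINT32_MASK


total_size = 4686179350  # real file size reported in the log
reported = truncate_to_u32(total_size)
print(reported)  # 391212054, the size seen on download
```

A file under 4 GiB, such as the 1.1G upload, passes through unchanged, which is consistent with only the 4.4G file being affected.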
[20:14] <wido> i'm afk for todat
[20:14] <wido> today*
[20:20] <yehudasa> wido: I opened a new bug for that size overflow
[23:12] <sage> yehudasa: bad news from james bottomley and hch... we should refactor the common bits into lib/ceph or similar before pushing rbd upstream
[23:13] <sage> i'm going to go ahead and push the rest of the queue for 2.6.36 today
[23:13] <sage> :(
[23:15] <yehudasa> oh
[23:15] <yehudasa> hch?
[23:15] <sage> christoph
[23:15] <yehudasa> how hard would it be to refactor it?
[23:16] <sage> not _too_ bad i don't think, but some work definitely. worth it in the end.
[23:16] <yehudasa> yeah
[23:16] <yehudasa> we'll need to have the rados.h under include/linux
[23:17] <yehudasa> do a few EXPORT_SYMBOL
[23:17] <yehudasa> but the code is quite separate
[23:17] <sage> i started playing with it a bit, and a lot of types and stuff need to move around between rados.h and ceph_fs.h. and the interfaces need to be simplified some probably.
[23:17] <sage> yeah
[23:18] <yehudasa> the question is whether rbd will rely on ceph or on some cephlib?
[23:18] <sage> also messenger.h, mon_client.h, osd_client.h, i suspect.
[23:18] <sage> ideally, libceph
[23:18] <yehudasa> yeah
[23:19] <yehudasa> where would libceph reside?
[23:19] <sage> the biggest thing besides all the shuffling is to rework the current ceph_client and mount process
[23:19] <sage> lib/ceph i guess? that's easy to change, it's the separation that's difficult.
[23:23] <yehudasa> hmm.. I'm not totally convinced of the merits of having a separate lib/ceph.. it might be nicer to have separate layers each responsible for its own thing, but it's kinda artificial
[23:23] <yehudasa> of course, if it's needed we'll do it
[23:24] <todinini> to how many osd will ceph scale? and how big was the biggest cluster tested?
[23:24] <sage> well, what would it take to just have a separate rbd module (that requires ceph.ko)? and is that a step along the way?
[23:25] <yehudasa> that's much easier, as there's a minimal set of functions needed to be exported
[23:25] <sage> todinini: the design goal is tens of thousands. largest tested was 256 osds, but that was a while ago.
[23:27] <todinini> sage: ceph does have a fully meshed communication, right? does that scale to thousands of nodes?
[23:28] <gregaf> it depends on the size of the cluster, but in general Ceph is not fully meshed
[23:29] <gregaf> the metadata servers are fully meshed, as are the monitors, but the OSDs have a maximum number of peers bounded by the number of PGs placed on them
[23:30] <todinini> gregaf: ok, i thought so, because in the netstat on the osd, there was a tcp connection to every other node open
[23:30] <yehudasa> sage: http://pastebin.org/467181 -- list of functions that would need exporting/including
[23:30] <gregaf> yeah, in a small cluster that's going to be the case because there are a lot of PGs by default
[23:31] <todinini> gregaf: what is a PG?
[23:31] <gregaf> we haven't run tests or anything on the ideal number of PGs/OSD lately but it'll probably be in the tens of PGs/OSD
[23:31] <gregaf> placement group
[23:32] <gregaf> you have pools which are a logical distinction and then placement groups within each pool which determine where data actually ends up on the cluster
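gregaf's point that an OSD's peer count is bounded by the PGs placed on it, and that a small cluster still looks fully meshed, can be seen in a toy model. This is only a sketch: it places each PG on random OSDs, which is a stand-in assumption and not how Ceph actually maps PGs (that is done by CRUSH). The function name, replica count, and cluster sizes are all hypothetical.

```python
import random


def simulate_peers(num_osds: int, num_pgs: int, replicas: int = 2, seed: int = 0):
    """Toy model: place each PG on `replicas` distinct random OSDs.

    Returns (peers, pgs_on): peers[o] is the set of OSDs sharing at least
    one PG with OSD o, and pgs_on[o] is the number of PGs placed on o.
    Random placement is an assumption for illustration, NOT CRUSH.
    """
    rng = random.Random(seed)
    peers = {o: set() for o in range(num_osds)}
    pgs_on = {o: 0 for o in range(num_osds)}
    for _ in range(num_pgs):
        members = rng.sample(range(num_osds), replicas)
        for o in members:
            pgs_on[o] += 1
            peers[o].update(m for m in members if m != o)
    return peers, pgs_on


# Small cluster with many PGs per OSD: every OSD ends up talking to all others.
small_peers, _ = simulate_peers(num_osds=4, num_pgs=400)
# Large cluster with tens of PGs per OSD: peers stay far below num_osds - 1.
large_peers, large_pgs = simulate_peers(num_osds=1000, num_pgs=10_000)

print(max(len(p) for p in small_peers.values()))
print(max(len(p) for p in large_peers.values()))
```

In this model each OSD can have at most (replicas - 1) peers per PG it holds, so the peer count is capped by the PG count rather than by the cluster size.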
[23:34] <todinini> can I display the pgs on a osd?
[23:36] <gregaf> yes, but I don't remember the command
[23:36] <gregaf> sage will
[23:36] <todinini> ok, the documentation is a bit sparse
[23:41] <sage> yehudasa: oh, not that big a list at all.
[23:41] <yehudasa> that's for the rbd over ceph.ko
[23:41] <sage> todinini: ceph pg dump -o - will dump all pgs, how big each one is (bytes, objects), and which osds it currently maps to (in brackets)
[23:42] <sage> yeah, i think that's a good compromise first step.
[23:42] <yehudasa> right
[23:43] <jantje> hmm
[23:43] <jantje> sage: you're the right guy to ask, I'd like to understand more about ceph by reading the source, but I really have no clue where to start
[23:44] <sage> james basically asked whether rbd is useful without ceph, i said yes, and he said something like it would be nice to be able to build it separately. and christoph, when i mentioned rbd, said something like "oh and btw you should really factor things into some sort of library".
[23:46] <yehudasa> hmm.. the problem would be that for a library to make sense, we'd need to put most of the fs/ceph stuff in it
[23:47] <sage> yep.
[23:47] <yehudasa> so the only things outside the library would be fs/ceph/super.c and drivers/block/rbd.c
[23:47] <yehudasa> and the include files
[23:47] <sage> i tried to start separating it out on my laptop.. basically all that doesn't go in are addr, file, inode, dir, super, mdsmap, caps, snap, and mds_client. super.c would need refactoring. and the headers would need lots of reorganization.
[23:48] <yehudasa> well.. it would make sense to send mds_client into the library
[23:48] <yehudasa> even if it will only be used by the fs
[23:48] <sage> the main issue i think is the api requirements between the fs and libceph are a lot more than rbd.. lots of headers in include/linux/
[23:49] <yehudasa> yeah
[23:49] <sage> hmm, that would simplify things a bit, i guess.
[23:50] <gregaf> jantje: where to start in the code really depends on what you want to do
[23:50] <gregaf> understanding it from source is going to be hard, we haven't done a great job documenting it
[23:50] <gregaf> if you just want to know more I'd recommend reading all the available papers and stuff
[23:50] <sage> i wonder if i should push some of the rbd prep patches upstream now (osdc changes, monc pool op support, etc.)
[23:51] <yehudasa> well.. the only thing that you don't really need to push is rbd.c
[23:51] <sage> jantje: yeah, i would start with the osdi '06 paper to get a high level overview before diving into the source code.
[23:51] <sage> bios support in messenger?
[23:51] <yehudasa> hmm..
[23:51] <sage> class call, rollback support?
[23:52] <yehudasa> I'll have a look at the commits
[23:57] <yehudasa> the lookup_pool, refactor osdc requests, enabling of clients that don't need mds, refactor mount related functions, generalize mon requests, and the fix payload_len
[23:57] <yehudasa> can go in
[23:57] <yehudasa> however, I'm afraid that we'd omit some implicit bug fix
[23:58] <sage> fix payload len fixes which patch?
[23:59] <yehudasa> probably the 'refactor osdc requests creation function'

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.