#ceph IRC Log


IRC Log for 2011-04-21

Timestamps are in GMT/BST.

[0:01] <Ifur> ubuntu, mostly... unfortunately... have to suffer decisions made a decade ago.
[0:01] <Ifur> but that reminds me, what is the best distro to run ceph on, what do you develop on in general?
[0:01] <Ifur> (plan on migrating to debian, and then maybe to centos6 by the end of the year)
[0:03] <darkfader> i think the standard is either debian or ubuntu in here
[0:03] <darkfader> my last test boxes were ubuntu
[0:03] <darkfader> anything that gets me a current kernel with little pain
[0:03] <Ifur> I can mention that I have basically two clusters, one development and one production, the dev-cluster is for random stuff and batch-farm. prod-cluster is for running a specific application that needs small io performance at start of a run.
[0:04] <Ifur> use vanilla kernel in that case, or build the latest ubuntu kernel available?
[0:05] <Ifur> anything you'd recommend removing from the kernel that increases stability? :P
[0:05] <darkfader> i deleted the lab to rebuild it. but i dont recall making a new kernel for it because i didn't do heavy testing
[0:05] <darkfader> 00:05 < Ifur> anything you'd recommend removing from the kernel that increases stability? :P
[0:05] <darkfader> ask that tomorrow to Tv/sage/greg
[0:05] <darkfader> err any of the regulars
[0:06] <iggy> btrfs from what I read earlier today...
[0:07] <Ifur> so ceph sources include its own btrfs module?
[0:07] <iggy> no, it was a joke
[0:07] <Ifur> hehehehe
[0:07] <Ifur> doh
[0:07] <Ifur> got a early kernel panic earlier from the bigphysarea patch i *have* to use...
[0:08] <Ifur> hopefully just a question of disabling the kernel feature of scanning low memory.
[0:11] <Ifur> but yeah, is ceph stable, or are there certain things it tends not to be stable on?
[0:13] <cmccabe> lfur: I think multi-mds is still kind of unstable
[0:13] <Ifur> so more than 1?
[0:13] <cmccabe> lfur: at the moment.
[0:13] <Ifur> ah, ok, good to know!
[0:14] <cmccabe> lfur: I mean, you don't need me to tell you this, it's all there in the printf when you start the daemons :)
[0:14] <cmccabe> lfur: last time we talked about it, the target was to have 1.0 be the stable release
[0:15] <Ifur> hehe, there are no stable distributed filesystems, those who claim they are lie! :P see it more as a sign of honesty, really...
[0:15] <cmccabe> lfur: I'm curious what level of stability lustre and those guys have attained
[0:15] <cmccabe> lfur: I've been told that it's 300k lines of code... and all kernel code too
[0:16] <Ifur> I know of fhgfs, because they ditched lustre because of instability... my understanding/impression is that you need to build your cluster around it.
[0:17] <Ifur> filesystem went down on a weekly basis with lustre, and lost data every other week.
[0:17] <Ifur> they used ubuntu I think.
[0:17] <Ifur> fhgfs, seems to be quite stable, but doesnt buffer metadata properly, so have to tune kernel page size and queue etc...
[0:18] <cmccabe> the impression I get is that the HPC guys do not use lustre for archiving data
[0:18] <Ifur> had a bad experience with that, someone dumped 6 million files onto it, and it just keeled over, didnt have SSD on the metadata so couldn't handle it.
[0:18] <cmccabe> they use lustre while they're calculating something; then they take the data out of lustre and store it on some more conventional system
[0:18] <Ifur> cmccabe: yup!
[0:18] <cmccabe> but that was just one guy I talked to, so maybe other people use it differently?
[0:19] <Ifur> think thats the case, but you can imagine a batch farm with thousands of nodes, doing jobs that runs over a week. having a constant queue of 10k+ jobs, then filesystem goes down AND you lose data.
[0:19] <cmccabe> lfur: haven't heard much about fhgfs
[0:19] <Ifur> first its down for a day, trying to recover, then things have to be rerun...
[0:20] <Ifur> cmccabe: not many that use, bad license etc...
[0:20] <Ifur> a german research lab behind it, fraunhofer.
[0:20] <Ifur> the loewe cluster uses it, about #22 on top500 atm.
[0:20] <Ifur> 500 raid controllers last i heard.
[0:20] <Ifur> but zero redundancy on metadata
[0:21] <Ifur> and no failover, there, so need constant metadata backup etc.
[0:21] <cmccabe> lfur: seems to be impossible to actually find information about it online
[0:21] <cmccabe> lfur: is it proprietary?
[0:21] <Ifur> http://www.fhgfs.com/cms/
[0:22] <Ifur> ehm, well, no GPL
[0:22] <cmccabe> lfur: thanks for the link, but I was hoping for some kind of overview
[0:22] <Ifur> and the eula basically states, "we can screw you and demand license fees any time we see fit"
[0:22] <cmccabe> lfur: I guess it's parallel rather than clustered, which means no multi-mds for them
[0:22] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:22] <cmccabe> lfur: so then my next question is why would I use this instead of pNFS?
[0:23] <Ifur> cmccabe: yup, but i know people have tried adding hardware failover on mds, with no luck... the design simply doesnt allow it.
[0:23] <cmccabe> lfur: well, there is that DRBD thing
[0:23] <Ifur> cmccabe: native RDMA
[0:23] <Ifur> cmccabe: they embed the hostname of the MDS in the metadata
[0:23] <cmccabe> lfur: I guess you can fall back on RAID if you're so inclined
[0:24] <Ifur> it supports multiple MDS's, but you can in no way add fault tolerance to them
[0:24] <cmccabe> lfur: oh, I see. RDMA.
[0:24] <Ifur> performance wise on par with lustre, and depending on your setup, more stable.
[0:24] <cmccabe> lfur: I assume it does clever things with data striping too
[0:24] <Ifur> these people always want to squeeze out the performance they can.
[0:25] <Ifur> cmccabe: it does striping, but wouldnt call it clever?
[0:25] <cmccabe> lfur: configuring lustre was described to me as a job that requires a PhD... partly because of the striping config :)
[0:26] <Ifur> the filesystem itself is quite stable, even if you lose OSD or MDS, the filesystem will be up, just that data disappears -- so uhm, wouldn't call that a HA feature.
[0:26] <Ifur> cmccabe: there are quite good tutorials out there, but yeah, I can imagine if your hardware/architecture isnt designed around the fact that you will use lustre.
[0:26] <cmccabe> lfur: I wonder why the HPC guys never discovered FC or iSCSI
[0:26] <Ifur> cost too much, money is better spent on compute power
[0:27] <cmccabe> lfur: it sounds very much like fhgfs == SAN implemented via RDMA
[0:27] <Ifur> a lot of HPC solutions are run by physicists, clock cycles are something they understand.
[0:27] <Ifur> cmccabe: mhm, yes, like most network filesystems are :P
[0:28] <cmccabe> lfur: I think eventually the declining R&D budgets here in the US and the rising business analytics budgets will lead to a culture clash :)
[0:28] <Ifur> same reason they use infiniband as they use these types of filesystems, latency matters *a lot*
[0:29] <cmccabe> lfur: I was in networking when ethernet started conquering the world
[0:30] <Ifur> the more money put in business analytics and the less in R&D, the less the economic growth. :P
[0:30] <cmccabe> lfur: I was still in college back when x86 was killing alpha and the other better architectures
[0:30] <Ifur> still in networking?
[0:30] <cmccabe> lfur: but same principle
[0:30] <cmccabe> lfur: nah, storage now... ceph in particular
[0:31] <Ifur> looked much into IB?
[0:31] <cmccabe> lfur: I'm on the ceph team here :)
[0:31] <Ifur> ah, nice :P
[0:31] <cmccabe> lfur: I don't know much about IB
[0:31] <Ifur> if you have ambition of penetrating HPC market, then IB is something that needs to be tuned for.
[0:32] <cmccabe> lfur: we talked about it briefly a while ago
[0:32] <Ifur> main reason to use it, is that you have 1,3 us latency from a core in node to a core in another, and 100-200 ns latency per HOP.
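A rough way to read those latency figures, as a minimal sketch. It assumes the per-hop figure simply adds on top of the 1.3 us core-to-core figure, which the log does not state outright; the function name and the hop counts are made up for illustration.

```python
# Rough InfiniBand latency estimate from the figures quoted in the log.
# Assumption: per-hop switch latency adds linearly on top of the base latency.

BASE_US = 1.3               # core-to-core latency quoted in the log
HOP_NS_RANGE = (100, 200)   # per-hop latency range quoted in the log


def ib_latency_us(hops):
    """Return (best, worst) estimated latency in microseconds for `hops` hops."""
    low_ns, high_ns = HOP_NS_RANGE
    best = BASE_US + hops * low_ns / 1000.0
    worst = BASE_US + hops * high_ns / 1000.0
    return best, worst


if __name__ == "__main__":
    for hops in (1, 3, 5):
        best, worst = ib_latency_us(hops)
        print(f"{hops} hop(s): {best:.1f} to {worst:.1f} us")
```

Even with several hops the estimate stays in the low single-digit microseconds, which is the point being made about IB versus GbE.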
[0:33] <cmccabe> lfur: I can't speak for sage, but I get the impression it would be something we would implement if a particular user was keen
[0:33] <Ifur> we had to upgrade from ethernet last year, was just not possible to run on GbE, mainly because of overhead and latency.
[0:33] <cmccabe> lfur: IB is interesting because of the low latency
[0:34] <cmccabe> lfur: ethernet still has a problem with that... depending on the drivers and the hardware you're running.
[0:34] <Ifur> in theory, I guess a distributed/parallel filesystem should be able to outperform spindle disks.
[0:34] <cmccabe> lfur: and of course TCP itself introduces some latency
[0:34] <cmccabe> lfur: if the networking hardware is up to the task, and the tuning is right
[0:35] <Ifur> performance here went up by a factor of 5 moving from GbE to 40Gbs IB, and for some reason, ethernet was using 80MB/s, but with IB it actually went down to 30MB/s
[0:35] <Ifur> not sure what the case is now, but this is unfortunately a part of the system i dont understand well.
[0:35] <cmccabe> 80 MB/s sounds about right for a real-world gigE throughput
[0:35] <cmccabe> assuming TCP and a standard stack
[0:36] <Ifur> so could be the ceiling?
[0:36] <cmccabe> technically the ceiling is supposed to be in the triple digits for ethernet
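The "triple digits" ceiling can be worked out from standard frame sizes. This is a sketch assuming a 1500-byte MTU, IPv4 and TCP without options, and no other losses; the function name is illustrative.

```python
# Theoretical TCP payload ceiling on gigabit Ethernet.
# Assumptions: IPv4 + TCP headers without options, full-size frames only.

LINK_BITS_PER_S = 1_000_000_000
WIRE_OVERHEAD = 8 + 14 + 4 + 12   # preamble/SFD + Ethernet header + FCS + inter-frame gap
IP_TCP_HEADERS = 20 + 20          # IPv4 header + TCP header


def tcp_ceiling_mb_per_s(mtu=1500):
    """Best-case TCP payload rate in MB/s (10^6 bytes) for a given MTU."""
    payload = mtu - IP_TCP_HEADERS      # application bytes per frame
    wire_bytes = mtu + WIRE_OVERHEAD    # bytes the frame occupies on the wire
    return LINK_BITS_PER_S / 8 * payload / wire_bytes / 1e6


if __name__ == "__main__":
    print(f"MTU 1500: {tcp_ceiling_mb_per_s():.1f} MB/s")
    print(f"MTU 9000: {tcp_ceiling_mb_per_s(9000):.1f} MB/s")
```

Against that ceiling, 80 MB/s on an untuned stack leaves roughly a third of the link unused.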
[0:36] <Ifur> yeah, nothing fancy there. but switches were poorly configured, and standard server grade GbE
[0:36] <cmccabe> and you can do better if you optimize TCP or use UDP
[0:36] <Ifur> time and resources :P
[0:36] <cmccabe> yeah
[0:36] <Ifur> native IB and RDMA is effort better spent, tho...
[0:37] <cmccabe> I think ethernet is going to get cheap a lot faster than IB or RDMA
[0:37] <Ifur> I actually doubt that, think it depends mostly on whether the industry screws up or not.
[0:38] <Ifur> but IB is basically the same architecture as SAS (an expander is a switch, etc...)
[0:38] * gregorg_taf (~Greg@ has joined #ceph
[0:38] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[0:38] <cmccabe> it may depend on whether 40gigE or 100gigE get used in businesses
[0:38] <cmccabe> that would push volumes
[0:38] <cmccabe> I think telcos are embracing them though
[0:39] <cmccabe> anyway... have to finish up something here
[0:39] <iggy> we're looking at 40Ge for some interconnect and backbone stuff
[0:39] <cmccabe> interesting stuff
[0:39] <Ifur> they have to... but the thing is, all you have to do to get 10GbE if you have QSFP 40Gbps IB, is to buy a 50$ quad to serial adapter, and plug it into a 10GbE switch instead of a IB switch.
[0:40] <Ifur> IB is also completely hardware accelerated, and atm supports virtual protocol interconnect (VPI) so you can implement any protocol you want on top of IB hardware
[0:41] <iggy> we're releasing a product later this year that's 10ge, so we definitely need faster than that for interconnects/backbone
[0:41] <Ifur> I know they are also working on being able to allow the IB card RDMA straight from a GPU without going through the CPU.
[0:41] <iggy> that would be interesting
[0:42] * yehudasa_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[0:42] <Ifur> real advantage is IB does streaming, its point to point serial, so no need to divide into packets.
[0:42] <Ifur> switches are basically dumb, with a fast powerpc on it, to deal with congestion.
[0:44] <Ifur> but outside of HPC, IB lacks all the nice things that ethernet has, dont have the same security.
[0:44] * pombreda (~Administr@29.96-136-217.adsl-dyn.isp.belgacom.be) Quit (Quit: Leaving.)
[0:48] <cmccabe> so who do I talk to about the hadoop bindings
[1:17] <Ifur> what steps do i need to take to figure out why im no longer able to mount ceph, even after rebooting everything involved except one of the client nodes? neither cfuse nor kernel client mounts, take indefinite time...
[1:18] <Ifur> erh, input output error the kernel client complains
[1:19] <Ifur> guess restart all daemons and play more tomorrow :)
[1:21] <Ifur> also, a last question: is Gentoo a good choice on the server side, and does it tend to make a difference with differing kernel versions on either client or server side, or between the servers themselves?
[1:21] <cmccabe> lfur: I think messenger debug would be a good place to start
[1:21] <cmccabe> lfur: also see if any daemon has crashed
[1:22] <cmccabe> lfur: I don't know much about gentoo. We use debian mostly
[1:22] <cmccabe> lfur: you will probably want to compile your own kernel if you're using the kernel client. You can do that in any distro though
[1:22] <cmccabe> lfur: kernel version for the machines running the servers isn't as important, unless you hit one with a btrfs bug
[1:23] <cmccabe> lfur: a lot of people are still running fairly old kernels, and that's not a good idea since btrfs had a bunch of bugs in the older versions
[1:23] <Ifur> define old?
[1:24] <cmccabe> lfur: well, I'm running 2.6.32-5 on my server test machine
[1:24] <cmccabe> lfur: so I would consider that old, but not old enough that btrfs crashes all the time :)
[1:24] <Ifur> ok, thats good i guess. :)
[1:24] <cmccabe> lfur: on the other hand, ext3 does crash all the time if I try to use it with ceph on this kernel
[1:24] <cmccabe> lfur: bug in xattrs
[1:24] <Ifur> also, I would like to point out that I think ceph is even more promising after talking to you guys :P
[1:25] <cmccabe> lfur: thanks
[1:25] <cmccabe> lfur: we're going to be using it internally as the backend for our S3 object store
[1:25] <cmccabe> lfur: that's what the rgw work is all about
[1:25] <Ifur> and I'd be more than willing to let you advertise that I (or really, that the project I'm involved with) uses it if all goes well
[1:26] <Ifur> Ideally, I want to use it mainly (the down the line features/stability i guess) for fault tolerance and reliability.
[1:27] <Ifur> https://wiki.kip.uni-heidelberg.de/ti/HLT/index.php/Main_Page
[1:27] <cmccabe> lfur: thanks lfur.
[1:28] <Ifur> HLT is more experimental than ceph, by a large margin :P
[1:28] <cmccabe> lfur: ceph is still in unstable mode but we're at the point where we're trying to stabilize
[1:28] <Ifur> (lost 50 nodes yesterday due to memory leak / hung kernel)
[1:29] <Ifur> everything down here is still what you'd consider 'proof of concept' so not a huge issue, need to think more long term.
[1:29] <cmccabe> lfur: "trigger" is an interesting name
[1:30] <cmccabe> lfur: on a semi-related note, I heard that there was an attempt to create a database suited to scientific needs recently
[1:30] <Ifur> basically what it does... ALICE detects a collision, triggers (filters if you will) for noise, then trigger higher up in the system, correlate and sync triggers between detectors. then this system does a rough reconstruction to check if the collision was interesting.
[1:30] <Ifur> probably
[1:31] <Ifur> I know CMS uses mongodb to store data and so on.
[1:31] <cmccabe> lfur: the idea was that SQL was too tied to business needs rather than sciency needs like finding correlations and such
[1:31] <Ifur> but yeah, all experiments implement things differently.
[1:32] <cmccabe> lfur: I don't know if the project ever went anywhere... I seem to have forgotten the name :(
[1:32] <Ifur> nosql is fine :P
[1:32] <cmccabe> lfur: yeah, sometimes it's hard to come up with a framework that fits everyone
[1:32] <Ifur> actually, i find the idea of implement a fuse driver for mongodb gridfs to be good idea.
[1:33] <Ifur> but yeah, ALICE doesnt have the "biggest" data issues, CMS and ATLAS are the big players with tons of resources.
[1:33] <cmccabe> lfur: ALICE sounds interesting. That's a big system!
[1:33] <Ifur> heavy ion is a small area
[1:33] <cmccabe> lfur: at least by my standards :)
[1:33] <Ifur> well, yeah, its cool. but when your down here, its cool for a while, then you realise that there is too much to do and not enough people to do it.
[1:33] <cmccabe> lfur: would a FUSE driver for mongodb give POSIX semantics?
[1:34] <Ifur> dont know, probably could. but the project seems to have died.
[1:35] <cmccabe> lfur: seems like mongodb has some flavor of eventual consistency
[1:35] <Ifur> comparable to ceph but in the nosql sphere :P
[1:35] <Ifur> very promising!
[1:35] <cmccabe> lfur: well, Ceph has POSIX semantics
[1:35] <cmccabe> lfur: it's kind of a big deal because it affects the way you architect the system
[1:36] <Ifur> well, was thinking more of the database aspect of mongodb, the fuse client is more of a neat/cool/interesting idea.
[1:36] <Ifur> i mean, if you need to store millions upon millions of files...
[1:36] <Ifur> CMS has like 1000 million collisions per second, each even might not be a whole lot of KB's
[1:36] <cmccabe> lfur: the problem with giving something a filesystem interface is that then people start to expect it to behave like a filesystem
[1:36] <Ifur> *event
[1:37] <cmccabe> lfur: and they get confused by when they, for example, write foo to a file, and then read it back right after, and get bar
[1:37] <Ifur> so database makes perfect sense, but it needs throughput
[1:37] <cmccabe> lfur: but with eventual consistency, there are no guarantees about that sort of thing
[1:37] <Ifur> *nod*
[1:37] <cmccabe> lfur: all that's guaranteed is that the data will get there someday
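The "write foo, read back bar" surprise can be reproduced with a toy model. This is only a sketch of the general idea of eventual consistency, not how MongoDB, S3 or Ceph work internally; the class and method names are invented for illustration.

```python
# Toy model of eventual consistency: writes land on a primary copy and are
# replicated later, while reads are served from the (possibly stale) replica.

class EventuallyConsistentStore:
    def __init__(self):
        self._primary = {}
        self._replica = {}
        self._pending = []   # writes accepted but not yet replicated

    def write(self, key, value):
        """Accept the write immediately; replication happens later."""
        self._primary[key] = value
        self._pending.append((key, value))

    def read(self, key):
        """Serve reads from the replica, which may lag behind the primary."""
        return self._replica.get(key)

    def sync(self):
        """Apply all pending writes to the replica, in order."""
        for key, value in self._pending:
            self._replica[key] = value
        self._pending.clear()


if __name__ == "__main__":
    store = EventuallyConsistentStore()
    store.write("file", "bar")
    store.sync()
    store.write("file", "foo")
    print(store.read("file"))   # stale value until the next sync
    store.sync()
    print(store.read("file"))   # now reflects the latest write
```

A read-after-write guarantee would amount to making `read` see every write that `write` has already acknowledged.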
[1:38] <cmccabe> lfur: it seems like a lot of the people who have experimented with eventual consistency have gotten burned and backed off
[1:38] <Ifur> but a fuse driver for such a thing IMO, would be either read or write, never both... you read to compute, then you write the results, but you still want to keep original data.
[1:38] <Ifur> so can be tightly coupled to application.
[1:38] <cmccabe> lfur: amazon is now offering read-after-write consistency in every availability zone except US-east
[1:38] <cmccabe> lfur: there was also some design discussion with the
[1:38] <cmccabe> lfur: googleFS people where they said if they were doing it again they would ditch eventual consistency
[1:38] <cmccabe> lfur: for what it's worth
[1:39] <Ifur> makes sense...
[1:39] <cmccabe> lfur: I mean you don't want full POSIX semantics in googlefs
[1:39] <cmccabe> lfur: but maybe you want something like read-after-write
[1:39] <Ifur> I get the impression that you dont want POSIX, hehe
[1:39] <cmccabe> lfur: programmers want POSIX semantics because they're used to them
[1:40] <cmccabe> lfur: it's comforting, but performance-limiting
[1:40] <cmccabe> lfur: the POSIX filesystem API is an old and actually pretty archaic API
[1:40] <cmccabe> lfur: it's probably never going to change at this point, but there are some wild-eyed dreamers out there who hope so
[1:40] <Ifur> sure, but when you offer something better. the programmers that choose the better option will hopefully out-compete the others and through evolution, posix goes away, but for now..
[1:40] <cmccabe> lfur: like the featherstitch guys
[1:41] <Ifur> hehe
[1:41] <cmccabe> http://featherstitch.cs.ucla.edu/
[1:41] <Ifur> well, you also have the guys who are waiting for ternary computers to be the standard
[1:41] <cmccabe> lfur: the truth is that POSIX does well enough for what it's actually used for in practice
[1:42] <cmccabe> lfur: but databases, for example, just completely ignore it and blast data right into the raw partition
[1:42] <Ifur> http://en.wikipedia.org/wiki/Ternary_computer The inventor of LISP stated that ternary computers was inevitable
[1:42] <cmccabe> lfur: because the POSIX FS API doesn't give you enough tools to really make a good database backend
[1:43] <Ifur> yeah, impossible to predict these things...
[1:43] <cmccabe> lfur: anyway. It's best to just ignore POSIX and define an API that fits what you want to do
[1:43] <cmccabe> lfur: that's how S3 was created
[1:44] <cmccabe> lfur: and really all the NoSQL stuff
[1:44] <Ifur> Concur that its unlikely that a new standard will happen by itself without some serious backing by all the right people institutions and frameworks.
[1:45] <cmccabe> lfur: I guess there is an elegance to the "everything-is-a-file" concept that is lost when you move from POSIX storage to NoSQL
[1:45] <Ifur> actually, part of my problem is that they didnt do that here...
[1:46] <Ifur> unmanageable amounts of files, but mostly because there wasnt time to do it properly or perhaps didnt think about the scalability seriously enough.
[1:46] <cmccabe> lfur: that might justify implementing some kind of FUSE layer, at least in theory
[1:46] <cmccabe> lfur: well, there are filesystems out there that do handle lots of small files well.
[1:46] <Ifur> indeed, thats exactly why i like the idea
[1:46] <Ifur> gives you an escape route until you have migrated to a new solution.
[1:46] <cmccabe> lfur: reiserFS had that honor for a long while; not sure if it's still the champion or if btrfs has the title now
[1:46] <Ifur> or the ability to do it step wise.
[1:47] <Ifur> yeah, but the problem is that it needs to be distributed... these files have to be served to lots of nodes...
[1:47] <Ifur> so local storage doesnt cut it.
[1:47] <cmccabe> lfur: I'm always kind of torn on this kind of issue. There is a certain elegance in keeping one distinct thing in one file
[1:48] <cmccabe> lfur: in practice, though, you may need to consider other concerns like how the data is actually used in practice
[1:48] <Ifur> but at some point it makes more sense to have one large file, instead of many small
[1:48] <Ifur> can just as well unpack it in memory
[1:48] <cmccabe> lfur: it all depends on the usage patterns, really.
[1:49] <cmccabe> lfur: I have some friends in the GPS business who implemented file formats that were incredibly advanced
[1:49] <Ifur> yeah, but you hit something somewhere always, whether it be context switching semaphores or whatever...
[1:49] <cmccabe> lfur: they could be read while still compressed... even searched for street names and points of interest while still compressed
[1:49] <cmccabe> lfur: and updated incrementally in chunks
[1:49] <Ifur> whoa
[1:50] <cmccabe> lfur: and incredibly good compression... this was 200x, when a GPS device might have only a gigabyte or two of flash.
[1:50] <cmccabe> lfur: so having more maps included was a competitive advantage, and they threw their best people at it
[1:50] <Ifur> but yeah, the point i was trying to make with small files vs keeping several in one archive/file is, that as time goes by, you will always gain more cache on CPU, more memory and so, and you are limited by clock cycles...
[1:51] <cmccabe> lfur: I/O has been a huge bottleneck for a while
[1:51] <Ifur> in theory, max clock frequency is 10Ghz, in reality you need more power to cool, than to actually run the CPU above 4-6Ghz.
[1:51] <cmccabe> lfur: I guess now we all have these huge multi-core processors that are very-starved
[1:52] <cmccabe> lfur: I/O-starved
[1:52] <Ifur> so map reduce, and larger blocks and parallelization is your friend.
[1:52] <cmccabe> lfur: I am curious when HPC will discover MapReduce.
[1:52] <cmccabe> lfur: I mean, obviously, it's not suitable for every workload
[1:52] <cmccabe> lfur: but still, it seems a lot easier to develop for than things like OpenMP
[1:53] <Ifur> takes time, the code base that runs these applications are usually a decade old, so the code tend to be that far behind.
[1:53] <cmccabe> lfur: yeah
[1:53] <Ifur> at least here, physicists know C very well, and python is popular, but fortran i think was the standard until the mid 90's
[1:54] <Ifur> at least.
[1:54] <Ifur> the framework for doing physics reconstruction is absolutely massive
[1:54] <cmccabe> lfur: I am curious whether anyone's still pushing fortran
[1:55] <cmccabe> lfur: there were a bunch of threads about Fortran 2008 recently, something I didn't even know existed until then
[1:55] <Ifur> so massive in fact, that a PhD student here doing optimization basically said:, memory leaks in offline is perfectly fine.
[1:55] <cmccabe> lfur: C++ or C?
[1:55] <Ifur> oh, there are plenty of physicists in their 40-50's swearing by fortran, trust me :P
[1:55] <Ifur> C
[1:55] <Ifur> well, offline is basically anything, fortran, C, C++ java, you name it.
[1:56] <Ifur> its in excess of 2GB library.
[1:56] <cmccabe> heh
[1:56] <Ifur> software stack needed to run reconstruction in HLT is between 3 and 7 GB
[1:56] <Ifur> add the fact its mostly written by students, and you get the idea.
[1:56] <cmccabe> yeah
[1:57] <cmccabe> I worked at the CMU robotics institute for a while; we had some of the same issues
[1:57] <cmccabe> in research, it's hard to know what code will last for 15 years, and what will last for 15 days
[1:57] <Ifur> think its pretty much the standard everywhere in science, its just that the larger the projects the more obvious it gets.
[1:58] <cmccabe> so things tend to not get very polished
[1:58] <Ifur> yeah, heard a story about a physics student coding for this project, took an integer to a float, to do bit flipping, and then back to integer
[1:59] <Ifur> congrats, you made your code slower by a factor of 100000
[1:59] <cmccabe> well, at least be grateful you're not using C++
[1:59] <cmccabe> there was never a worse language for novices than C++
[2:00] <Ifur> i cant even code :D other than basic scripting, reason why im here is basically ability to learn, self motivated and using linux for well over a decade
[2:00] <Ifur> was actually studying philosophy before i came here
[2:00] <Ifur> dont quite belong in that respect
[2:01] <Ifur> got first computer in 98, linux 3 months after.
[2:01] <cmccabe> well, sounds like you've learned a lot
[2:01] <Ifur> wish i got the job earlier
[2:01] <cmccabe> not everyone needs to be a dev... there are other aspects of using computers :)
[2:01] <Ifur> but luckily, as far as the coding short comings go, im surrounded by phd's. they just need me to get the computers working. :P
[2:02] <Ifur> yeah, there are many rules of thumbs that cant quite be explained or tought.
[2:02] <Ifur> taught.
[2:02] <cmccabe> well, they can, it just takes time.
[2:02] <cmccabe> brb... have to finish something up
[2:03] <Ifur> yeah, but does take experience getting hardware and software to play well to give you performance... I used to go on and on in ~1999 -2002 on how i got 40% more frames per second on q3 under linux compared to windows on the same hardware
[2:04] <Ifur> yeah, gonna go take a walk and smore... getting late :P
[2:04] <cmccabe> lfur: later
[3:09] <Ifur> good night, thanks for the talks!
[3:09] * Ifur (~osm@pb-d-128-141-48-242.cern.ch) Quit (Quit: leaving)
[3:13] * cmccabe (~cmccabe@ has left #ceph
[3:17] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:31] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:41] * djlee (~dlee064@des152.esc.auckland.ac.nz) has joined #ceph
[3:47] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Remote host closed the connection)
[3:56] * greglap (~Adium@ has joined #ceph
[4:01] <djlee> greglap: are you available now?
[4:04] <greglap> djlee: sure
[4:05] <djlee> so i've created 0.1m files of 2kb to the ceph mount (=200mb or 196MiB), reported by du command
[4:05] <djlee> but when i actually go to each osd nodes (4 of them), and checked for df -h
[4:06] <djlee> they were like 1600mb (and so 4 nodes = 6581MiB)
[4:06] <greglap> is anything else on those mounts?
[4:06] <djlee> so why space stored only 196mb became 6581mb?
[4:06] <greglap> log dirs? operating system?
[4:06] <djlee> nope
[4:07] <djlee> ive removed entire logs, using the recent post by colin
[4:07] <greglap> well, they'll use up some space to store the metadata, and the MDS log for creating 100k inodes might get pretty big (not sure how big)
[4:08] <greglap> and the 200MB is doubled for replication
[4:08] <djlee> argh wait, -.- i forgot the 2gb journal chunk
[4:08] <greglap> plus if the OSD journals are on there...
[4:08] <greglap> ;)
[4:08] <djlee> for each osd
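The accounting being discussed can be written down as a small calculation. This is a sketch only: it covers file data, replication and journals, and leaves out the metadata pool and MDS log that greglap mentions, so it is not expected to reproduce the df numbers above. The journal count and size in the example are placeholders, not values confirmed by the log.

```python
# Back-of-the-envelope raw usage: replicated file data plus OSD journals.
# Metadata and MDS log overhead are deliberately not modelled here.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def raw_usage_bytes(n_files, file_bytes, replicas=2, n_journals=0, journal_bytes=0):
    """Bytes consumed across all OSDs by file data and journal files."""
    data = n_files * file_bytes
    return data * replicas + n_journals * journal_bytes


if __name__ == "__main__":
    logical = raw_usage_bytes(100_000, 2 * KIB, replicas=1)
    replicated = raw_usage_bytes(100_000, 2 * KIB, replicas=2)
    print(f"logical data:    {logical / MIB:.1f} MiB")
    print(f"with 2 replicas: {replicated / MIB:.1f} MiB")
    # Hypothetical journal layout, for illustration only.
    with_journals = raw_usage_bytes(100_000, 2 * KIB, 2, n_journals=4, journal_bytes=2 * GIB)
    print(f"plus journals:   {with_journals / MIB:.1f} MiB")
```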
[4:10] <djlee> is it normal that du -h won't report the journal size in /data/osd..?
[4:11] <greglap> it probably depends on how the journal's set up
[4:11] <greglap> if it's just grabbing part of the raw block device it won't show up; if it's a file it should
[4:11] <greglap> or do you mean du -h on the Ceph filesystem?
[4:11] <greglap> in which case yes, of course that's normal
[4:12] <djlee> du -h on ceph is fine, i get what i expect, e.g., 2kbx100000=196mb
[4:13] <djlee> each osd/current reports at least 100mb
[4:13] <djlee> so 12 osd = 1200mb (excluding journal)
[4:13] <greglap> yeah, that's probably because of the MDS journal
[4:15] <djlee> greg another big issue
[4:15] <greglap> if you look at the PGs you'll notice that they all start with a digit
[4:15] <greglap> that digit is the number of the pool they're in
[4:16] <greglap> I forget which number data is but you could look at the amount of data in each one and you should find that one of the pools has the 196MB of data and the other big one is the metadata pool
[4:18] <djlee> you mean ceph osd dump -o - command?
[4:19] <djlee> ceph osd pool get data pg_num just shows the 768
[4:20] <greglap> oh, heh, you could look at the pg dump too
[4:20] <greglap> but I meant the actual OSD data partitions
[4:20] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[4:21] <djlee> oh the filenames under data/osd#/current ?
[4:21] <greglap> yeah, they're still directories
[4:21] <greglap> each directory is one placement group
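One way to do the per-pool comparison greglap suggests is to walk the OSD's current directory and group the placement group directories by their leading pool number. This is a sketch under assumptions: it takes directory names of the form "<pool>.<pg>_head" (inferred from the "head directories" mentioned above) and sums apparent file sizes, which can differ from what du or df report. The path passed on the command line is whatever your OSD data directory is.

```python
# Sum file sizes per pool under an OSD's "current" directory.
# Assumption: each placement group is a directory named "<pool>.<pg>_head".

import os
import sys
from collections import defaultdict


def bytes_per_pool(current_dir):
    """Return {pool_number_as_str: total_bytes} for PG directories found."""
    totals = defaultdict(int)
    for entry in os.listdir(current_dir):
        pg_path = os.path.join(current_dir, entry)
        if not os.path.isdir(pg_path) or "." not in entry:
            continue                      # not a PG directory
        pool = entry.split(".", 1)[0]
        if not pool.isdigit():
            continue                      # name does not start with a pool number
        for root, _dirs, files in os.walk(pg_path):
            for name in files:
                totals[pool] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


if __name__ == "__main__" and len(sys.argv) > 1:
    for pool, size in sorted(bytes_per_pool(sys.argv[1]).items()):
        print(f"pool {pool}: {size / (1024 * 1024):.1f} MiB")
```

Run against each OSD, the pool holding about 196MB should be the data pool and the other large one the metadata pool.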
[4:22] <djlee> generally i should be expecting 200000 objects, and i can check this..? sorry
[4:23] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Remote host closed the connection)
[4:23] <djlee> yeah i see some objects in some head directories
[4:25] <greglap> the numbers of the objects are based on the inodes
[4:27] <djlee> when i did 2kbx10000 (20mb) file write and random-read back, i was getting e.g., 60mb/s
[4:28] <djlee> but now with one extra zero, 2kbx100000 (200mb), random-read back has slowed down to 10mb/s
[4:28] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[4:29] <djlee> these are on lowspec machines, i suspect that more files make certain operation really heavy..?
[4:30] <greglap> you're doing the reads while doing writes?
[4:30] <djlee> no i do a full seq-write first to it
[4:30] <djlee> and after that, i do the random read
[4:30] <greglap> are you writing more total data in the second run?
[4:31] <greglap> ie, is it just cache effects?
[4:32] <djlee> first-run (20mb) and second-run(200mb), is entirely separate, i did full format disk and ceph reinitiate..
[4:32] <greglap> no, I mean how many of each file did you write?
[4:32] <greglap> was it just one of each?
[4:32] <greglap> 10 of each? 100x20MB and 10x200MB?
[4:33] <djlee> i write 10000x2kb, (20mb)
[4:33] <djlee> directio off, so all goes to journal first, and ffsb does its own sync after the write (takes additional 10~20s)
[4:34] <greglap> oh, so that's 100,000 files the first time and 1,000,000 files the second time, not a change in file size
[4:34] <greglap> sorry, I was misreading you
[4:34] <djlee> yeah sry for confusing heh
[4:34] <greglap> that's probably due to cache effects — more of your total data was in cache after the first write
[4:34] <djlee> yeah exactly
[4:35] <greglap> (since you're doing small files, the metadata is taking a significant portion of the cache space)
[4:35] <djlee> i couldnt try 1million file,
[4:35] <djlee> the ceph crashed as i reported in the bugtrack before
[4:36] <djlee> so there's something when i shove too many write files at once..
[4:36] <greglap> which bug is that?
[4:37] <djlee> http://tracker.newdream.net/issues/970
[4:38] <greglap> oh right
[4:38] <greglap> that looks like you're running out of memory on your client machine
[4:39] <djlee> yeah, it's got 4gb ram,
[4:39] <greglap> did you see if it ran out of RAM?
[4:40] <greglap> if it did then that's a kernel client bug that needs to be addressed (it's probably holding on to caps too long — or perhaps the test isn't closing files)
[4:41] <djlee> i see
[4:43] <djlee> i only got iostat evidence, next time i'll run the ps_mem to check the available memory
[4:44] <greglap> the error there was a kernel malloc failure so I'm pretty sure that's what's going on :)
[4:46] <djlee> you see Ted's ffsb examples are all in huge chunk file, like 1mb to 5mb
[4:46] <djlee> i was doing 2kb, and so i suppose i'll get the crappy performance
[4:47] <greglap> yeah
[4:47] <djlee> the only way to increase 2kb result is, for the osd it just need to have bigger ram to 'hopefully' cache-hit the random-file
[4:47] <djlee> i.e., bigger ram with many files = ok
[4:47] <greglap> pretty much — random reads are slow even when they're local ;)
[4:48] <djlee> small ram, well then i gotta have smaller set of files (to match the cache-hit)
[4:48] <djlee> yeah
[4:48] <greglap> there was a guy in here earlier today who was very excited to be getting 6MB/s random reads over GigE
[4:48] <greglap> because with his production filesystem over InfiniBand they were only getting 10MB/s
[4:49] <djlee> right, so this goes back to the discussion the other day, where the high-spec machine with 18gb has a better hit-ratio
[4:51] <djlee> that guy's 10mb/s, hmm, it must be doing a sync for every block?
[4:51] <djlee> e.g., directio
[4:51] <djlee> i think directio is just wrong for network environment
[4:51] <greglap> no, it's just that once you exit cache you need to wait for a disk access on every read
[4:52] <greglap> if you parallelize things you can get more aggregate bandwidth but you're still just waiting .1s per read for the disk to spin to the right places
[4:53] <djlee> thats right, but how often the 'miss' happens depends on the actual ram and the scope (total size/files), correct me?
[4:54] <djlee> e.g., with 4gb ram, i test 4gb worth of files, then all will be 'hit', but say 8gb worth of files, i get 50%, and so on
[4:54] <djlee> and the average 'mb/s' will be like super-fast mb/s (cache hit) + super-slow mb/s (cache miss)
[4:55] <greglap> yeah, except that the Ceph metadata also goes into the OSD caches
[4:55] <djlee> right
[4:56] <greglap> the exact amounts of that can vary but in your case with millions of newly-created 2KB files the metadata is crowding out a significant portion of the data too
[4:57] <djlee> i see, but what does that big crowding out mean?
[4:57] <djlee> i mean, as compared to normal blockbased fs or etc
[4:57] <djlee> more files means more metadata, i got this part :p
[4:58] <greglap> so in a normal filesystem data is cached via the page cache
[4:58] <greglap> but the page cache doesn't include any metadata; that's kept track of and kept in or out of memory quite separately
[4:59] <greglap> since Ceph is a distributed filesystem which uses local filesystems to store its data, though, then Ceph metadata is cached by the local filesystem just like Ceph's data is
[4:59] <greglap> ie, in ext4 the ext4 journal isn't using up memory that could otherwise be used for caching actual data stored in ext4
[5:01] <greglap> gotta go though, my train's reaching the sation
[5:01] <greglap> *station
[5:01] <djlee> oh ok!
[5:01] <djlee> thanks heaps!
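The cache arithmetic discussed between [4:53] and [4:56] can be sketched roughly as follows. This is only a back-of-the-envelope model, assuming uniformly random reads over the whole file set; the function names, the `metadata_bytes` parameter, and the 1000 MB/s and 10 MB/s figures are made up for illustration and do not come from the log. It also shows that the combined rate is a time-weighted average, so the slow misses dominate rather than averaging evenly with the fast hits.

```python
GB = 1024 ** 3


def hit_ratio(cache_bytes, working_set_bytes, metadata_bytes=0):
    """Fraction of uniformly random reads served from cache.

    metadata_bytes is a hypothetical knob for the cache space that
    metadata takes away from file data, as greglap describes.
    """
    if working_set_bytes <= 0:
        raise ValueError("working set must be positive")
    usable = max(0, cache_bytes - metadata_bytes)
    return min(1.0, usable / working_set_bytes)


def effective_mb_per_s(hit, fast_mb_s, slow_mb_s):
    """Time-weighted average throughput for a given hit ratio."""
    # Moving 1 MB takes hit/fast + miss/slow seconds on average,
    # so the combined throughput is the inverse of that time.
    return 1.0 / (hit / fast_mb_s + (1.0 - hit) / slow_mb_s)


if __name__ == "__main__":
    h = hit_ratio(4 * GB, 8 * GB)          # djlee's 4gb RAM, 8gb of files
    print(h, effective_mb_per_s(h, fast_mb_s=1000.0, slow_mb_s=10.0))
```

With a 50% hit ratio the result stays close to twice the miss rate, which is consistent with greglap's point that leaving the cache is what hurts.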
[5:05] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[6:50] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[7:04] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[7:52] * MK_FG (~MK_FG@ Quit (Ping timeout: 480 seconds)
[7:53] * MK_FG (~MK_FG@ has joined #ceph
[8:13] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:37] <wonko_be> is there a way to tell ceph to replicate over multiple hosts, but not over two OSD's running on the same host?
[8:53] * gregorg_taf (~Greg@ Quit (Quit: Quitte)
[8:53] * gregorg (~Greg@ has joined #ceph
[8:54] <greglap> http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
[8:54] <wonko_be> perfect, exactly what I was looking for it seems
[8:56] <greglap> glad to help, but it's bed time for me :)
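The constraint wonko_be asks about at [8:37] (replicas on different hosts, never two OSDs of the same host) can be illustrated with a toy placement function. This is not CRUSH and not Ceph's API; the `place` function, the hashing scheme, and the host and OSD names are all assumptions made up for this sketch, and the real mechanism is the CRUSH map described on the linked wiki page.

```python
import hashlib


def place(object_name, osds_by_host, replicas):
    """Pick `replicas` OSDs for an object, at most one per host."""
    if replicas > len(osds_by_host):
        raise ValueError("not enough hosts for the requested replica count")

    def score(*parts):
        # Deterministic pseudo-random score derived from the given names.
        digest = hashlib.sha1("/".join(parts).encode()).hexdigest()
        return int(digest, 16)

    # First choose distinct hosts, so the host is the failure domain.
    hosts = sorted(osds_by_host, key=lambda h: score(object_name, h))[:replicas]
    # Then choose exactly one OSD inside each chosen host.
    return [
        max(osds_by_host[h], key=lambda o: score(object_name, h, o))
        for h in hosts
    ]


if __name__ == "__main__":
    cluster = {  # hypothetical layout: two OSDs per host
        "hostA": ["osd.0", "osd.1"],
        "hostB": ["osd.2", "osd.3"],
        "hostC": ["osd.4", "osd.5"],
    }
    print(place("myobject", cluster, replicas=2))
```

Choosing hosts first and OSDs second is what guarantees that no two replicas share a host.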
[9:09] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[9:18] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:47] * allsystemsarego (~allsystem@ has joined #ceph
[10:10] * Yoric (~David@87-231-38-145.rev.numericable.fr) has joined #ceph
[10:34] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[10:43] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[10:44] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[10:51] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[10:54] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[13:06] * chraible (~chraible@blackhole.science-computing.de) has joined #ceph
[13:07] <chraible> hi @all, when I run the following command (after compiling and configuring ceph) " mkcephfs -c /etc/ceph/ceph.conf --allhosts --mkbtrfs -v -k admin.keyring " I get the following error message
[13:07] <chraible> http://pastebin.com/vSmxaUSx
[13:08] <chraible> my ceph.conf is http://pastebin.com/fB7gpxuk
[14:27] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[14:31] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:40] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:05] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:26] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:52] * greglap (~Adium@ has joined #ceph
[18:03] <greglap> chraible: try updating your ceph.conf names from mon0 to mon.0, see if that fixes it
[18:03] <greglap> looks like there's a bit of parsing trouble when mkcephfs is generating the monmaptool commands, I'll get somebody to look at it later
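The rename greglap suggests at [18:03] could be applied with a small script like the one below. It is a sketch only: chraible's actual ceph.conf is in a pastebin that is not reproduced here, so the assumption that the names appear as section headers of the form `[mon0]`, the helper name `dotted_mon_names`, and the sample file contents are all hypothetical.

```python
import re

# Matches a whole header line such as "[mon0]" and captures the number.
_MON_HEADER = re.compile(r"^\[mon(\d+)\][ \t]*$", re.MULTILINE)


def dotted_mon_names(conf_text):
    """Rewrite section headers like [mon0] to [mon.0]; leave the rest alone."""
    return _MON_HEADER.sub(r"[mon.\1]", conf_text)


if __name__ == "__main__":
    sample = (  # hypothetical config fragment
        "[mon]\n"
        "\tmon data = /data/mon$id\n"
        "[mon0]\n"
        "\thost = node1\n"
    )
    print(dotted_mon_names(sample))
```

The generic `[mon]` section has no number and is left untouched, as are headers that already use the dotted form.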
[18:06] * bchrisman (~Adium@ has joined #ceph
[18:26] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:29] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:37] * nolan (~nolan@phong.sigbus.net) Quit (Ping timeout: 480 seconds)
[18:40] <wido> hi wonko_be ;)
[18:40] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[18:41] <bchrisman> fwiw: I haven't tracked this down yet, but this is what's coming out of my samba vfs layer right now: http://pastebin.com/FWpaUwQh
[18:42] * nolan (~nolan@phong.sigbus.net) has joined #ceph
[18:47] <Tv> bchrisman: funky.. most people are not in the office quite yet, so either email the list, file tickets, or try again in half an hour or so
[18:47] <Tv> testing gitbuilder-i386 bootup, short repeated outages on it for a while
[18:47] <bchrisman> ahh yeah.. cool beans…
[18:52] <Tv> huh, that worked.. retrying with gitbuilder (amd64)
[18:56] * aliguori (~anthony@ has joined #ceph
[19:01] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:03] * aliguori (~anthony@ Quit (Read error: Operation timed out)
[19:15] * aliguori (~anthony@ has joined #ceph
[19:17] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:18] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:20] * yehudasa_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) Quit (Read error: Operation timed out)
[19:23] <Tv> + sprintf(hash, "zzz");
[19:23] <Tv> // chosen by a fair dice roll
[19:34] <bchrisman> ahh the gang is here...
[19:35] <bchrisman> fwiw: I haven't tracked this down yet, but this is what's coming out of my samba vfs layer right now: http://pastebin.com/FWpaUwQh
[19:35] * Yoric (~David@87-231-38-145.rev.numericable.fr) Quit (Quit: Yoric)
[19:37] <gregaf> bchrisman: hmm, that message means that either messages are getting resent by the messaging layer when they shouldn't be, or they somehow got moved out of order
[19:37] <gregaf> or else there's a problem with the message numbering code
[19:38] <gregaf> perhaps because of that fault with nothing to send, though I don't think that should influence things
[19:39] * Juul (~Juul@c-76-21-88-119.hsd1.ca.comcast.net) has joined #ceph
[19:45] <bchrisman> I'm going to retrofit for the new libceph… and run that test program.. make sure libceph is working in my environment.
[19:45] <cmccabe> bchrisman: can you put the retrofit off till later today? I want to apply the changes talked about on the ML :)
[19:45] <bchrisman> ahhh okay.. :)
[19:46] <cmccabe> bchrisman: changing DIR* -> something else, and combining connect and mount
[19:46] <cmccabe> bchrisman: shouldn't take long to apply those
[19:46] <bchrisman> moving DIR * to some ceph wrapper for it?
[19:47] * Yulya_ (~Yulya@ip-95-220-147-191.bb.netbynet.ru) has joined #ceph
[19:47] <bchrisman> cmccabe: will you be verifying that testceph works? :)
[19:47] <cmccabe> bchrisman: yep :)
[19:47] <cmccabe> bchrisman: at some point I need to get hadoop working too
[19:48] <bchrisman> cmccabe: by summer at latest? :) eheh
[19:48] <cmccabe> bchrisman: I was spending some time on JNI yesterday, and it wasn't pretty
[19:48] <bchrisman> cmccabe: sounds like a pain in the butt.
[19:48] <cmccabe> bchrisman: the "enterprise grade" way to store C pointers in a Java class is to use a long
[19:49] <bchrisman> cmccabe: we talked a little bit while visiting the LA office about getting you to come to our San Mateo office occasionally once we've got our new floor/office...
[19:49] <bchrisman> heh
[19:49] <cmccabe> bchrisman: all those years and that budget, and they couldn't even achieve the same level of functionality that ctyes has in python
[19:49] <cmccabe> *ctypes
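For comparison with cmccabe's remark about keeping C pointers in a Java long, the same idea looks like this on the Python side with ctypes: an address is just an integer that can be stored and later turned back into a typed view. This is a minimal illustration using only the standard library; the buffer contents are arbitrary and nothing here is taken from libceph or its bindings.

```python
import ctypes

# A C char[5] holding "ceph" plus the terminating NUL byte.
buf = ctypes.create_string_buffer(b"ceph")

# The raw address as a plain Python int, much like a long field in Java.
handle = ctypes.addressof(buf)

# Later: rebuild a typed view over the same memory from the integer.
# `buf` must still be alive, since the integer does not own the memory.
view = (ctypes.c_char * len(buf)).from_address(handle)

if __name__ == "__main__":
    print(hex(handle), view.value)
```

As with the JNI approach, the integer carries no ownership or type information, so keeping the original object alive is the caller's job.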
[19:49] <cmccabe> bchrisman: yeah, I'd like to visit sometime
[19:50] <cmccabe> bchrisman: bit of a drive, but sometimes I work from home
[19:50] * Juul (~Juul@c-76-21-88-119.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[19:51] <bchrisman> cmccabe: yeah.. pretty sure it'd help a lot… it'll be another month or so before we get the space.
[19:54] * Yulya (~Yulya@ip-95-220-130-133.bb.netbynet.ru) Quit (Ping timeout: 480 seconds)
[20:01] <wonko_be> wido: hey
[20:23] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:05] * atgeek (~atg@please.dont.hacktheinter.net) Quit (Remote host closed the connection)
[21:05] * atg (~atg@please.dont.hacktheinter.net) has joined #ceph
[21:11] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Remote host closed the connection)
[21:20] <wido> wonko_be: hi ;)
[21:21] <wido> sjust: Commenting out the assert did work, it passed that point, but other asserts are bringing the OSDs down now
[21:29] <cmccabe> bchrisman: pushed the new libceph api changes
[21:29] <cmccabe> bchrisman: I think it's probably safe to start converting now
[21:30] <cmccabe> bchrisman: we may tinker with the get/set layout functions, but I think the big changes are done
[21:30] <sjust> wido: yeah, do_peer gets called in some inappropriate times
[21:30] <sjust> wido: we are fixing it now
[21:30] <sjust> wido: although, what asserts were they?
[21:31] <wido> sjust: OSD::handle_pg_query
[21:32] <wido> I've got about 6TB of data on the cluster, it would be cool to get this cluster up and running again
[21:34] <sjust> backtrace?
[21:34] <wido> sure, one moment
[21:35] <wido> sjust: http://pastebin.com/eSWk6urF
[21:37] <sjust> wido: that one seems to be from sage earlier this week, if you go back to b05f1ba4c46976bd96a1df8a582dffd89860eb34, you should be able to get past that (sorry about the mess, we are trying to clean up the peering/recovery code :) )
[21:38] <wido> sjust: np! I'll give that a try
[21:48] <bchrisman> cmccabe: cool so I should be able to run testceph against my cluster then?
[21:48] <bchrisman> was going to use that to make sure everything's in order first.
[22:49] <cmccabe> bchrisman: back
[22:50] <cmccabe> bchrisman: doing some testing now
[23:02] <bchrisman> cool.. will be back later this evening..
[23:02] <bchrisman> cmccabe: thanks
[23:03] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[23:08] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:28] * eternaleye_ (~eternaley@ has joined #ceph
[23:28] * eternaleye (~eternaley@ Quit (Read error: Connection reset by peer)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.