#ceph IRC Log


IRC Log for 2011-03-18

Timestamps are in GMT/BST.

[0:06] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:11] * DJLee (82d8d198@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[0:16] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) Quit (Quit: Leaving.)
[0:22] * MarkN (~nathan@ has left #ceph
[0:51] <Tv> lots of NDN services seem to be down right now :(
[0:52] <sjust> hmm
[0:52] <sjust> ?
[0:52] <Tv> http://tracker.newdream.net/
[0:52] <Tv> etc
[0:52] <Tv> http://ceph.newdream.net/
[0:52] <sjust> hmm
[0:52] <sjust> thats probably bad
[0:57] <Tv> back up
[0:57] <sjust> ah
[1:06] <Tv> finally a test failure i can blame on myself
[1:06] <Tv> ;)
[1:06] <Tv> (now if i only knew why..)
[1:07] * lxo (~aoliva@ Quit (Quit: later)
[1:07] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[1:12] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[1:15] * lxo (~aoliva@ has joined #ceph
[1:27] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[1:37] * rajeshr (~Adium@ Quit (Quit: Leaving.)
[2:15] * cmccabe (~cmccabe@ has left #ceph
[2:53] * rajeshr (~Adium@99-7-122-114.lightspeed.brbnca.sbcglobal.net) has joined #ceph
[3:06] * joshd (~jdurgin@adsl-75-28-69-238.dsl.irvnca.sbcglobal.net) Quit (Read error: Connection reset by peer)
[3:17] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[4:27] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[5:05] * MK_FG (~MK_FG@219.91-157-90.telenet.ru) has joined #ceph
[7:03] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[7:06] * joshd (~jdurgin@adsl-75-28-69-238.dsl.irvnca.sbcglobal.net) has joined #ceph
[7:19] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) has joined #ceph
[7:38] * pombreda1 (~Administr@ has joined #ceph
[7:44] * pombreda (~Administr@186.71-136-217.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[7:47] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) Quit (Quit: Leaving)
[8:10] * joshd (~jdurgin@adsl-75-28-69-238.dsl.irvnca.sbcglobal.net) Quit (Quit: Leaving.)
[9:10] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:58] * allsystemsarego (~allsystem@ has joined #ceph
[10:11] * Yoric (~David@ has joined #ceph
[11:04] * MKFG (~MK_FG@ has joined #ceph
[11:05] * MK_FG (~MK_FG@219.91-157-90.telenet.ru) Quit (Ping timeout: 480 seconds)
[11:05] * MKFG is now known as MK_FG
[11:30] * pombreda1 (~Administr@ Quit (Quit: Leaving.)
[11:38] * morse (~morse@supercomputing.univpm.it) Quit (Quit: Bye, see you soon)
[11:39] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[12:04] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[14:42] * DLange (~DLange@dlange.user.oftc.net) Quit (Quit: a reboot a day keeps bugs away)
[14:45] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[15:23] * lxo (~aoliva@ has joined #ceph
[15:32] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[16:23] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[16:23] * Yoric (~David@ has joined #ceph
[16:33] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[16:44] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[16:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[16:52] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:55] * greglap (~Adium@ has joined #ceph
[17:30] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[17:45] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[17:54] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[18:09] * cmccabe (~cmccabe@ has joined #ceph
[18:11] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:13] <cmccabe> do you guys know if we should package crdbnamer?
[18:14] <joshd> is it not included in the debian package?
[18:14] <cmccabe> centos
[18:14] <joshd> yes, it should be
[18:14] <joshd> it's used by the udev rule
[18:15] <cmccabe> librados-config is another odd program
[18:15] <cmccabe> I guess it should be packaged?
[18:15] <joshd> not sure about that one
[18:15] <cmccabe> it's not packaged by debian apparently
[18:16] <cmccabe> but if it has any use at all, it would seem to be in distros?
[18:16] <cmccabe> I mean we know what version we're building
[18:17] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:17] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit ()
[18:18] * sager (~4221ce08@webuser.thegrebs.com) has joined #ceph
[18:18] <sager> let's do the standup at 11
[18:29] * sager (~4221ce08@webuser.thegrebs.com) Quit (Quit: TheGrebs.com CGI:IRC (EOF))
[18:36] * rajeshr (~Adium@99-7-122-114.lightspeed.brbnca.sbcglobal.net) Quit (Quit: Leaving.)
[18:49] * Yoric (~David@ Quit (Quit: Yoric)
[18:59] * rajeshr (~Adium@ has joined #ceph
[19:14] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[19:16] * lxo (~aoliva@ has joined #ceph
[20:33] <wido> cmccabe: librados-config will be used when building against librados
[20:33] <wido> configure programs could easily find out which version librados is and what capabilities
[20:33] <cmccabe> wido: I think we concluded that it was used to implement pkg-config
[20:34] <wido> Oh, ok
[20:34] <wido> my Atom cluster is almost ready! http://zooi.widodh.nl/ceph/osd/
[20:34] <cmccabe> wido: which is basically only used by configure as far as I know
[20:34] <cmccabe> wido: and other make systems
[20:35] <wido> a -config binary might be outdated, I know that for example PHP uses it for curl and mysql
[20:35] <wido> they both have a curl-config and mysql-config binary
[20:35] <Tv> pkg-config is very useful when e.g. headers or .so's are installed in a non-standard location (e.g. foo's headers are in /usr/include/foo/bar.h, used as #include <bar.h>, foo-config --cflags has -I/usr/include/foo)
[20:36] <Tv> cmccabe: btw librados-config is broken
[20:36] <Tv> [0 tv@dreamer ~/src/ceph.git]$ ./src/librados-config --cflags
[20:36] <Tv> 2011-03-18 12:36:27.306648 7fb730b90720 common_init: unable to open config file.
[20:36] <Tv> do not make it read a config file
[20:37] <Tv> i keep saying this, things should not go reading things out of /etc by themselves
[20:37] <Tv> and i heartily recommend all ceph devs remove ceph.conf from /etc and ~
[20:38] <Tv> because you'll just hide bugs by having them
[20:55] <cmccabe> tv: librados-config has no argument called --cflags
[20:55] <cmccabe> tv: one could be added, but if you open librados-config.cc you'll see that it has none
[20:56] <Tv> cmccabe: i'm not 100% clear what the pkg-config expected api is
[20:56] <cmccabe> tv: I agree that it should not read a config file. It should just use the librados api
[20:57] <Tv> librados-config doesn't need or want the actual librados api
[20:57] <Tv> it just needs to know how to build stuff against it, and what version it is
[20:57] <cmccabe> tv: I think maybe we're talking at cross-purposes
[20:57] <cmccabe> tv: the version is defined in include/rados/librados.h
[20:58] <cmccabe> tv: so by reading that, you are "using the librados API"
[20:58] <cmccabe> tv: there's no reason to initialize anything though.
[20:58] <Tv> cmccabe: it has no need to link against the librados .so, or init the api, or anything like that
[20:59] <cmccabe> tv: I am confused by why you think librados-config shouldn't link against librados
[20:59] <Tv> it doesn't need to be
[20:59] <Tv> if it has the version as a #define, it shouldn't call any functions from librados
[20:59] <cmccabe> tv: is there anyone in the world who will want librados-config but not librados?
[20:59] <Tv> do you understand what pkg-config is?
[21:00] <cmccabe> tv: pkg-config tells you which libraries are installed on your system, more or less
[21:00] <Tv> nope
[21:00] <Tv> it tells you how to compile against a library
[21:01] <cmccabe> (12:34:36 PM) cmccabe: wido: [pkg-config] is basically only used by configure as far as I know
[21:03] <cmccabe> so basically, my first instinct would be to remove the call to common_init, and add a call to librados_version
[21:05] <cmccabe> I guess rados_version doesn't do anything except copy some constants, so if you really want, you can just use those constants instead, and remove the dependency on librados itself
[21:06] <cmccabe> I suppose it will be a very, very, very small efficiency gain over linking against librados
[21:06] <cmccabe> I just wondered if you had another reason besides this small efficiency gain for suggesting this strategy
[21:10] <gregaf> yehudasa says he added librados-config for wido and it's not designed to be compliant with dpkg-config or anything
[21:11] <cmccabe> I wonder if we could produce a version that worked for both wido and dpkg-config / pkg-config?
[21:11] <Tv> gregaf: then what is it meant for?
[21:11] <cmccabe> that would seem like a good outcome
[21:11] <cmccabe> ok, going to lunch, see you in a bit
[21:11] <Tv> cmccabe: make it more reliable
[21:11] <gregaf> I think wido just wanted something that was easy to call and parse from php -- ask him!
[21:14] <Tv> so apparently you guys chose the exact one way to break any autoconf-using users of librados, fun..
[21:14] <gregaf> yehuda did it, it wasn't "you guys"…glad you consider yourself one of us, though!
[21:15] <Tv> it was 2 days after i started working here ;)
[21:15] <Tv> anyway, on my todo list already
[21:53] <yehuda_wk> Tv: a version of librados-config that doesn't link with librados will give you the version of the 'librados-config' and not the version of the actual librados installed on the system
[21:54] <Tv> yehuda_wk: yes, the assumption is they get updated at the same time
[21:54] <yehuda_wk> Tv: why make such an assumption?
[21:55] <cmccabe> back
[21:55] <Tv> yehuda_wk: because just about every *-config is a libtool/whatever shell script
[21:55] <Tv> yehuda_wk: and we should do the same
[21:56] <Tv> --version)
[21:56] <Tv> echo 2.7.7
[21:56] <Tv> exit 0
[21:56] <Tv> ;;
[21:56] <cmccabe> cmccabe@metropolis:~/src/ceph2/src$ file /usr/bin/ncursesw5-config
[21:56] <cmccabe> /usr/bin/ncursesw5-config: POSIX shell script text executable
[21:56] <cmccabe> cmccabe@metropolis:~/src/ceph2/src$ file /usr/bin/ncurses5-config
[21:56] <cmccabe> /usr/bin/ncurses5-config: POSIX shell script text executable
[21:56] <cmccabe> cmccabe@metropolis:~/src/ceph2/src$ file /usr/bin/freetype-config
[21:56] <cmccabe> /usr/bin/freetype-config: POSIX shell script text executable
[21:56] <cmccabe> cmccabe@metropolis:~/src/ceph2/src$ file /usr/bin/libpng12-config
[21:56] <cmccabe> /usr/bin/libpng12-config: POSIX shell script text executable
[21:57] <cmccabe> I think these things are somehow getting automatically generated by libtool
[21:57] <cmccabe> it would be nice if ours could be too
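The static *-config helper described above can be sketched roughly as follows. This is only an illustration in Python rather than the shell script Tv quotes; the version string, include directory and library name are hypothetical placeholders, not values taken from librados. The point it tries to show is that such a tool answers from constants baked in at build time, without reading a config file or calling into the library.

```python
import sys

# Hypothetical values; a real build would substitute these at install time.
VERSION = "0.0.0"
INCLUDE_DIR = "/usr/include/rados"
LIB_NAME = "rados"

# Every answer is a constant: no config file is read, no library is loaded.
ANSWERS = {
    "--version": VERSION,
    "--cflags": f"-I{INCLUDE_DIR}",
    "--libs": f"-l{LIB_NAME}",
}


def answer(flag):
    """Return the text for one supported flag, or None if it is unknown."""
    return ANSWERS.get(flag)


def main(argv):
    """Print the answer for argv[1]; return a shell-style exit status."""
    if len(argv) != 2 or answer(argv[1]) is None:
        print("usage: example-config --version|--cflags|--libs", file=sys.stderr)
        return 1
    print(answer(argv[1]))
    return 0
```

A configure script would call something like `main(["example-config", "--cflags"])` and paste the printed flags into its compiler command line. As yehuda_wk points out, the trade-off is that the reported version is that of the helper, which is only right if helper and library are updated together.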
[22:01] <darkfaded> cmccabe: do you have a spare second? i think it was you who told me why fuse can actually be a good thing
[22:01] <cmccabe> darkfaded: ok
[22:01] <darkfaded> i'll be doing a training on linux storage next week and wanna explain it right
[22:01] <cmccabe> darkfaded: I don't think it was me, but I've heard the arguments
[22:02] <cmccabe> darkfaded: FUSE avoids the server-and-client-on-single-machine deadlock
[22:02] <darkfaded> the idea was that there's fewer context switches and less copy-back-and-forth of the data (which, i'll admit, is coming from userspace anyway)
[22:02] <darkfaded> ah ok thats an easy one
[22:03] <cmccabe> darkfaded: as far as I know, FUSE will never have fewer context switches than an in-kernel FS
[22:03] <darkfaded> oh
[22:03] <cmccabe> darkfaded: with FUSE you have application -> kernel VFS -> FUSE handler
[22:03] <cmccabe> darkfaded: both of those arrows represent context switches
[22:04] <darkfaded> right.
[22:04] <Tv> app->kernel is not really a full context switch, syscalls are fast
[22:04] <cmccabe> darkfaded: I think glusterFS developed a shim library that could hijack write(), read(), etc and redirect them directly to the gluster binary, cutting out the kernel VFS from the equation
[22:04] <Tv> the cs hit is when you need to activate the fuse userspace process
[22:04] <cmccabe> darkfaded: but nothing like that exists for Ceph or fuse in general
[22:04] <Tv> and then copy the data out to userspace RAM for that, etc
[22:05] <darkfaded> cmccabe: they had a LD_PRELOAD library yes, i dont know if its still supported
[22:05] <darkfaded> everything is stale at times in gluster
[22:05] <darkfaded> including the mounts
[22:05] <Tv> LD_PRELOAD is horrible ;)
[22:05] <cmccabe> tv: I guess making a syscall is not technically a "context switch", but it is not free
[22:05] <Tv> well more like there's many kinds of contexts to switch
[22:05] <Tv> some are more expensive
[22:05] <darkfaded> Tv: let me re-read for a second
[22:06] <cmccabe> tv: more precisely, it doesn't involve switching tasks, which could be a TLB flush
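The distinction drawn above between a syscall and a switch to another task can be made concrete with a toy model. This is a deliberately simplified sketch, not a measurement: the step lists are an assumption about a single uncached read() and ignore caching, batching and everything else a real kernel does.

```python
# "mode" = user/kernel transition within the same task (a syscall or its return)
# "task" = the scheduler switches to a different process
IN_KERNEL_READ = [
    ("app calls read() and enters the kernel VFS", "mode"),
    ("kernel filesystem returns data to the app", "mode"),
]

FUSE_READ = [
    ("app calls read() and enters the kernel VFS", "mode"),
    ("kernel queues the request and wakes the FUSE daemon", "task"),
    ("daemon's pending read of the request returns to userspace", "mode"),
    ("daemon writes its reply back into the kernel", "mode"),
    ("kernel wakes the app again", "task"),
    ("read() returns data to the app", "mode"),
]


def count(path, kind):
    """Count the steps of one kind along a request path."""
    return sum(1 for _, step_kind in path if step_kind == kind)


for name, path in (("in-kernel", IN_KERNEL_READ), ("FUSE", FUSE_READ)):
    print(name, "mode switches:", count(path, "mode"),
          "task switches:", count(path, "task"))
```

In this model the in-kernel path has no task switches at all, while the FUSE path has two, which is the cost Tv and cmccabe single out as the expensive part.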
[22:07] <darkfaded> for me it is very confusing... since i'm no coder or something... for me a fs driver belongs in the kernel... must be fast and no need to care about stability. because stability is one max prio vendor case away ;)
[22:08] <darkfaded> but i wanna give them better advice than that
[22:08] <cmccabe> darkfaded: the simplified explanation is FUSE == slow
[22:08] <darkfaded> cmccabe: that's my very thinking
[22:08] <Tv> slow is relative
[22:08] <darkfaded> i need to dig out who said otherwise and why
[22:09] <darkfaded> Tv: i can understand the impact (negative) of a context switch or tlb flush
[22:09] <Tv> e.g. sshfs as FUSE is typically plenty good enough, grabbing the data over wifi is going to be slower for me anyway
[22:10] <cmccabe> fuse is good enough for a lot of applications really
[22:10] <darkfaded> Tv: yeah... the question is could you imagine reasons why fuse would be _better_ for any scenario where throughput is >300MB/s or even really really high
[22:10] <darkfaded> i have a lot of trouble making myself believe it can be a good thing
[22:11] <cmccabe> darkfaded: well, a buggy fuse filesystem can't cause a kernel panic
[22:11] <darkfaded> cmccabe: yeah but it will lock and hang for ages ;)
[22:12] <darkfaded> but of course, yes, that is an advantage of being outside of the kernel
[22:12] <darkfaded> in short, i'll go with fuse == slow
[22:12] <Tv> darkfaded: FUSE is easier to develop for, that's the one reason for its existence
[22:13] <bchrisman> for a large scale filesystem looking for high performance, yeah...
[22:13] <bchrisman> yeah.. easy to build a fuse fs in python, perl, whatever....
[22:13] <darkfaded> bchrisman: it's a training just on san / multipath io tuning on linux. so the (small) audience doesn't really care about anything than performance
[22:13] <bchrisman> ahh yeah.. makes sense then
[22:14] <darkfaded> stability, yes, but i'd assume people that can't code well aren't allowed to touch that kind of code
[22:14] <darkfaded> or it wont make it onto any real server until it's ironed out well, which is ok too
[22:14] <cmccabe> darkfaded: if you've read any vendor drivers, you might be disappointed
[22:14] <darkfaded> cmccabe: hehe
[22:15] <darkfaded> vendor drivers have the big advantage of being widely deployed, so even if they have bugs or idiot workarounds, chances are high the issue is already dealt with (imo)
[22:16] <darkfaded> but *lol* there's exceptions
[22:16] <cmccabe> darkfaded: depends on the vendor and the hardware
[22:16] <darkfaded> switching from qlogic to vanilla qla drivers was such a relieve
[22:16] <darkfaded> relief?
[22:16] <darkfaded> that thing.
[22:17] <darkfaded> well thanks for the advice
[22:17] <darkfaded> a lot
[22:17] <cmccabe> some proprietary drivers are fairly ok... I think nvidia spends a fair amount of engineering effort on their proprietary drivers
[22:17] <darkfaded> If I get through the standard chapters quickly i'll have them set up ceph in the end
[22:17] <cmccabe> but I've read through some that were really questionable
[22:18] <cmccabe> so who's using multipath I/O on linux these days?
[22:18] <darkfaded> cmccabe: compared to my good old hpux all linux code is really questionable. but it doesn't smell of old age ;)
[22:19] <darkfaded> cmccabe: almost anybody who doesn't have proprietary drivers (which usually suck a lot)
[22:19] <cmccabe> darkfaded: somehow I got the impression that SAN was bigger on Windows than Linux. I don't know if that's true or not though.
[22:19] <darkfaded> the most fun setup i once had was multipath with fc and iscsi (via fc bridge)
[22:20] <darkfaded> cmccabe: no dont think so
[22:20] <cmccabe> darkfaded: I guess SAN is still the best way to do things like realtime video
[22:20] <darkfaded> M$ finally added real multipathing in 2008R2 they say, idk if its true
[22:21] <wido> cmccabe: about the -config binary, I saw that a lot of libraries were using it, while writing librados
[22:21] <darkfaded> cmccabe: that's like lowend san setups, but i think video editing and such was the first Many_MB/s sans
[22:21] <wido> So I thought that kind of binary would be useful
[22:21] <cmccabe> wido: yeah, it seems useful. We were just talking about how to make it more so :)
[22:21] <darkfaded> but any kind of larger datacenter (company not hosting) will be using a san or a few
[22:22] <cmccabe> darkfaded: well, Google and facebook don't...
[22:22] <darkfaded> cmccabe: yeah
[22:22] <darkfaded> err well ok
[22:22] <darkfaded> any company that actually does real stuff
[22:22] <Tv> any company with a traditional IT infrastructure
[22:22] <wido> Since I'm using Ceph I started hating NFS and iSCSI more and more...
[22:23] <darkfaded> Tv: any company that is liable if stuff doesnt work?
[22:23] <Tv> darkfaded: bleh, redundant array of inexpensive nodes is more reliable than a SAN
[22:23] <wido> darkfaded: http://zooi.widodh.nl/ceph/osd/ < You were interested in my Atom servers, right?
[22:23] <Tv> the difference is, you need to architect for it, you can't take old school IT infra and make it run on that
[22:24] <cmccabe> my impression is that SAN is kind of a way of doing storage provisioning to a traditional IT setup
[22:24] <cmccabe> *adding
[22:24] <darkfaded> Tv: yes i'll agree making applications "scale out a bit" is complex, i'll give you that
[22:24] <darkfaded> thing is, i.e. finance corps have done that already in the mid-90s
[22:24] <cmccabe> SAN is more a way to scale up than a way to scale out
[22:25] <darkfaded> and they still do need scaling up
[22:25] <cmccabe> or at least it seems that way to me
[22:25] <darkfaded> cmccabe: it's the thing that can scale up *and* out
[22:25] <wido> Isn't a SAN just a "large" disk over a network, could be Ethernet or FC
[22:25] <darkfaded> wido: if you look at smallscale, yes
[22:25] <cmccabe> heh. Sorry I wasn't trying to start a flamewar; I just genuinely have never used SAN so I'm curious
[22:25] <darkfaded> cmccabe: i dont think we're flaming yet
[22:26] <wido> darkfaded: What kind of scale are you talking about?
[22:26] <cmccabe> at any rate, san allows you to have like a wall of disks and share those disks among multiple machines. But only block level sharing, so you'll need many different FS instances.
[22:27] <darkfaded> umm, what i thought is: you'll have 30-40 big storage arrays and 500 servers, some of these servers will be really huge single systems which run applications that can't scale out, and on the other side you can still have many applications that are spread over servers
[22:27] <darkfaded> its just a idk... block-layer abstraction thing
[22:28] <darkfaded> and stability wise... if you see more than 1 failure in 3-4 years then the admins are idiots
[22:28] <cmccabe> so it's well suited to a system where you have single servers doing stuff rather than a cluster per se
[22:28] <darkfaded> cmccabe: hmm. no, i meant it's suited both ways
[22:29] <darkfaded> if you have 500 boxes working on different parts of a dataset then a san would be stupid
[22:29] <darkfaded> but if you have 100, and a few fat servers, it might already make sense. and if you have many boxes doing different stuff, it might make even more sense
[22:30] <wido> But if you look to a SAN, they seem to be getting old, very fast
[22:30] <darkfaded> wido: i heard that since iscsi came around
[22:30] <darkfaded> which is a joke
[22:30] <cmccabe> I guess there are filesystems like red hat's GFS that operate with a SAN backend
[22:30] <darkfaded> same for fcoa
[22:30] <darkfaded> eee
[22:30] <cmccabe> so SANs are not incompatible with clustering
[22:30] <wido> we got a EqualLogic today, a 30TB SAN, but I think it's a stupid black box
[22:31] <darkfaded> cmccabe: CXFS from sgi is the oldest
[22:31] <wido> darkfaded: I'm not a big fan of iSCSI, but what would you use, FC?
[22:31] <darkfaded> wido: well thats ok for storing some data, but not for any heavy lifting
[22:31] <wido> But iSCSI is a cheap way to realize a budget SAN, there isn't a large budget in every situation
[22:31] <darkfaded> wido: yes, would be normal? you get very high and lossless bandwidth
[22:32] <wido> Ethernet is cheap
[22:32] <darkfaded> wido: agreed. i just dunno if it's still cheap if you need the same availability
[22:32] <wido> You are talking about 1 failure every 3 years? (few lines back)
[22:32] <darkfaded> if you got a lot of cool juniper switches around it'll be fine
[22:32] <gregaf> cmccabe: I think a lot of the older network filesystems rely on a SAN or similar setup
[22:33] <cmccabe> gregaf: yeah
[22:33] <gregaf> even Lustre expects multiple nodes to have access to the same block devices
[22:33] <gregaf> or at least that's how their redundancy works
[22:33] <darkfaded> wido: yeah, for example. including maintenance. It seems that ethernet switches that can do loadbalancing and failover are horribly expensive
[22:33] <darkfaded> and 10gig ethernet tends to be a lot slower than 8gig fc
[22:34] <darkfaded> gregaf: lustre is really odd
[22:34] <gregaf> and that's assuming the failover works...
[22:34] <darkfaded> they demand your backend is 100% available
[22:34] <wido> darkfaded: I think you have a different level of performance demands
[22:34] <cmccabe> gregaf: I'm surprised that lustre uses block-level redundancy for OSDs
[22:34] <cmccabe> gregaf: disappointing
[22:34] <gregaf> cmccabe: have you read much about it?
[22:34] <cmccabe> gregaf: slightly
[22:35] <gregaf> it can be hard to find the good docs, but there's a separation between the nodes which handle disk writes (these are object storage targets) and where the disks are located…
[22:35] <wido> A good Ethernet setup with multipath iSCSI works fine, if you keep both ethernet networks separate, don't use nasty things like STP
[22:35] <cmccabe> gregaf: that sounds like it's almost assuming a san!
[22:35] <darkfaded> wido: yes, separate networks is very helpful ;) like different fabrics in fc. putting it all on one cable or trusting stp is helpless
[22:36] <gregaf> oh sorry, I got that backwards — the nodes which handle writes are object storage servers, and then the disks are hosted on object storage targets
[22:36] <gregaf> I can give you the Lustre manual if you want, that's the only thing I could find that really described the architecture in a meaningful way
[22:36] <cmccabe> gregaf: yeah, I should read that.
[22:36] <cmccabe> gregaf: the old joke is that it requires a PhD to administer
[22:37] <gregaf> oh wait, maybe I didn't get it backwards — their terminology is confusing as hell
[22:37] <darkfaded> hehe
[22:37] <cmccabe> I wonder if anyone's rocking lustre + DRBD
[22:39] <gregaf> I'm not really familiar with DRBD but I suspect that its implementation might slow Lustre down too much
[22:39] <darkfaded> wido: let me put it that way: if i'm on a budget and free in design and i know load will be stable, i'll go iscsi. if i'm on a budget and they let me shop where i want, i'll grab used FC. and if i'm not on a budget and responsible for it, it'll definitely be fc. but on the other hand there's a lot of use cases where 10gig ethernet would also be interesting. and some others i'd rather have something that really scales out limitless like ceph
[22:40] <darkfaded> my xen boxes all do the same thing and just talk to each other, and i dont want some central SPOF and I can't buy a fat emc^2. so ... how would I get many IOPS other than an awesome filesystem
[22:40] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[22:41] <wido> Well, the reason I like Ceph even more, you get the SPOF out of the way
[22:41] <darkfaded> yup
[22:41] <wido> old system always have some kind of master
[22:41] <wido> or you have a Master<>Slave setup, which is wasting your money too, having a box waiting until another box fails
[22:43] <darkfaded> normally you'd split your servers in half, one on the one box, one on the other. those both replicate to each other and each node is seeing the same disks from both
[22:43] <darkfaded> so nothing would idle there
[22:43] <bchrisman> I think the lustre build out completed just before I left my last job was boasting 20GB/s write throughput… on a ridiculous amount of hardware for sure… and probably not all to one file.. :)
[22:43] <darkfaded> the advantage is setups like that dont need a brain in the server admins
[22:43] <darkfaded> bchrisman: wowwwwww
[22:43] <wido> darkfaded: That's just a way to say, well, the box isn't idle, it's doing some work
[22:43] <darkfaded> nice ;)
[22:43] <wido> but both boxes still need to have the power to do all the work on their own
[22:43] <wido> when the other fails
[22:43] <darkfaded> wido: it's doing exactly 50% of the work until one dies
[22:44] <darkfaded> and it's powerful enough to not degrade when that happens
[22:44] <bchrisman> lustre can work.. just needs chicken sacrifices and the easy-style HPC workload
[22:44] <darkfaded> bchrisman: all streaming IO? :)
[22:44] <wido> darkfaded: What I mean, it's still wasting your money :)
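The capacity argument in this exchange comes down to simple arithmetic. The sketch below assumes identical nodes and that the load of a failed node spreads evenly over the survivors; both are simplifying assumptions, not something stated in the discussion.

```python
def max_safe_utilization(nodes, failures_tolerated=1):
    """Highest per-node utilization that still leaves room to absorb failures.

    Assumes identical nodes and an even redistribution of load
    over the surviving nodes.
    """
    if nodes < 1 or not 0 <= failures_tolerated < nodes:
        raise ValueError("need at least one surviving node")
    return (nodes - failures_tolerated) / nodes


# An active/active pair: each box may only run at 50% in normal operation.
print(max_safe_utilization(2))

# A ten-node cluster that tolerates one failure can run each node at 90%.
print(max_safe_utilization(10))
```

Under these assumptions the pair is never idle, as darkfaded says, yet half of its total capacity is still held in reserve, which is wido's point; the reserved fraction shrinks as the number of nodes grows.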
[22:44] <bchrisman> running an 'ls' on a directory with < 1000 files could take 30 minutes though.. unless you turned off the redhat default colorization.. :)
[22:45] <wido> But I think I'm a big fan of Ceph ;-)
[22:45] <bchrisman> cuz that would go out to each file object.. :)
[22:45] <bchrisman> yeah… ceph design far better.
[22:45] <bchrisman> mds design in particular...
[22:45] <darkfaded> why/how? you'll always need two copies of the stuff, and ... idk. no issue...? and I'm here because i'm a ceph fan too
[22:46] <darkfaded> i've seen idiots put large esx clusters on midrange san boxes and saw how much it can suck to use a san in one scenario
[22:46] <bchrisman> dynamically load balancing metadata among mds's should make operations much more scalable.
[22:46] <darkfaded> but i think there's many scenarios :)
[22:46] <darkfaded> bchrisman: 30 minutes????
[22:46] <wido> darkfaded: It's always a matter of budget
[22:46] <bchrisman> though don't quote my wording on it..
[22:46] <darkfaded> because of the stats?
[22:47] <bchrisman> darkfaded: yeah.. I kid you not… I was at a conference and explaining it to someone.. and some random person from the same lab laughed and said he'd run into the same issues.
[22:47] <wido> But still, I don't like EMC, NetApp or such because it's a black box
[22:47] <bchrisman> yeah… lustre doesn't keep enough metadata in its metadata servers
[22:47] <wido> I need them if something fails, they are the only ones who know how it really works
[22:47] <wido> I got to run! ttyl
[22:48] <darkfaded> wido: yeah and budget is, too, decided by what your systems do. If my hosting fails I'll have to refund, look like an idiot. in the other job, if something fails, the company will drop out of the stock market
[22:48] <bchrisman> a lot of the other cluster filesystems that would be competitive with ceph were generally written as library-filesystems for clustered apps instead of as a general purpose clustered/distributed filesystem.
[22:48] <darkfaded> thats just different worlds...
[22:48] <darkfaded> plus, why do you think the admins don't know what happens inside their netapp/emc/...
[22:48] <darkfaded> Laters :)
[22:49] <darkfaded> bchrisman: yeah the guy who is doing "clusterfs" for redhat (tries to add better replication and stuff to gluster) listed this as a requirement: "Must be a real FS"
[22:50] <darkfaded> and honestly I havent seen a single one that has as elegant and inherently scalable / robust a design as ceph does
[22:51] <darkfaded> if my gluster FS are mounted without user_xattr, i'll have data loss or at least it'll be inaccessible
[22:51] <darkfaded> but it will start up and say all is well
[22:52] <darkfaded> because it doesnt even have a mechanism to notice if something fails
[22:52] <darkfaded> then there's amplidata who will probably never get to the selling stage with their product
[22:52] <darkfaded> and vastsky which mostly consists of a pdf presentation
[22:53] <darkfaded> grr.grrr.
[22:54] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:55] <bchrisman> yeah… it's a tough problem to solve well.
[22:55] <bchrisman> and it all goes to crap once you want a kernel driver for it .. :)
[22:56] <darkfaded> sage is a genius in that :)
[22:56] <bchrisman> :)
[22:56] <cmccabe> the storagesearch guy dismisses amplidata, but mostly because he thinks that hard disks are doomed...
[22:57] <cmccabe> vastsky claims that it provides "eucalyptus EBS support"
[22:57] <darkfaded> cmccabe: they hope ssd caching will be enough to avoid that. but their main problem is that they dont get to selling it :/
[22:58] <cmccabe> EBS got a black eye with the whole reddit debacle
[22:58] <darkfaded> i missed that
[22:59] <cmccabe> darkfaded: http://blog.reddit.com/2011/03/why-reddit-was-down-for-6-of-last-24.html
[23:00] <darkfaded> need some chocolate with that brb
[23:01] <bchrisman> cmccabe: that's a big honkin outage..
[23:02] <cmccabe> amazon likes to claim that they've never lost data from S3. But they have had unavailability from time to time
[23:02] <cmccabe> I'm not sure if they make a similar claim for ebs though
[23:07] <Tv> sjust: so in the control file, you should be able to pass ceph_bin_url='http://ceph.newdream.net/gitbuilder/tarball/ref/origin_yourbranchhere.tgz'
[23:07] <Tv> sjust: in the template.format call
[23:07] <Tv> sjust: untested but should work ;)
[23:07] <Tv> sjust: also, you need to wait for gitbuilder to build it first
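What Tv describes, passing ceph_bin_url into the template.format call, might look roughly like this. The template text and the helper function are hypothetical, since the real control file is not shown in this log; only the URL pattern is taken from the message above, and substituting the branch name for "yourbranchhere" is an assumption.

```python
# Hypothetical control-file template; the real one is not part of this log.
TEMPLATE = "ceph_bin_url = {ceph_bin_url!r}\n"


def tarball_url(branch):
    """Build the gitbuilder tarball URL for a branch (pattern from the log)."""
    return (
        "http://ceph.newdream.net/gitbuilder/tarball/ref/"
        f"origin_{branch}.tgz"
    )


# Fill the placeholder by keyword, as a template.format call would.
control = TEMPLATE.format(ceph_bin_url=tarball_url("yourbranchhere"))
print(control, end="")
```

As Tv notes, the tarball only exists once gitbuilder has built the branch, so the URL will not resolve before then.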
[23:08] <sjust> ok
[23:08] <darkfaded> hrmpf i'll bbl... phone
[23:09] <darkfaded> Ec2 in one availability zone is just one namespace (and spof?)
[23:09] <darkfaded> ??
[23:09] <cmccabe> ec2 isn't supposed to be a SPOF
[23:10] <cmccabe> I mean not in the same way that having one machine do everything is
[23:10] <darkfaded> tehe
[23:10] <gregaf> EBS is different from EC2 though
[23:10] <darkfaded> people with just one san box also build their own fancy spof
[23:10] <bchrisman> apparently they were relying too heavily on EBS
[23:10] <gregaf> errr, than S3 I mean
[23:10] <darkfaded> gregaf: sorry a typo
[23:10] <darkfaded> EC2!=EBS!=S3
[23:10] <darkfaded> we all mistype but mean the same thing :)
[23:11] <cmccabe> ok, new python bindings should be in...
[23:11] <darkfaded> they didn't think where they replicate to
[23:11] <gregaf> anyway my understanding is that EBS shouldn't be able to fail like that — that's the whole point of it
[23:11] <darkfaded> but imho amazon is just a very large and comfortable vendor lock in
[23:11] <gregaf> though I'm sure you're supposed to replicate your data anyway for uptime purposes
[23:12] <gregaf> (as reddit says — and they did have most of it replicated but they lost all their disks at once)
[23:12] <darkfaded> gregaf: thats what i meant by same namespace. in essence they replicated twice but on the same "thing"
[23:12] <darkfaded> and 'We could make some speculation about the disks possibly losing writes when Postgres flushed commits to disk' sounds like a lack of O_DIRECT
[23:12] <darkfaded> (fuse, anybody?)
[23:13] <cmccabe> darkfaded: amazon doesn't tell you to replicate to multiple regions
[23:13] <cmccabe> darkfaded: replication is supposed to be their job
[23:13] <cmccabe> darkfaded: it's just that EBS is by far the weakest part of their offering... its performance sucks too, to the point where people are manually RAIDing multiple EBS volumes
[23:17] <Tv> the reddit stupidity was that cassandra is perfectly able to handle its own distribution and replication
[23:17] <Tv> they could have run it on local disks, maybe set up one server with ebs to be "wan replication" like cassandra supports
[23:17] <Tv> (= avoid losing data if all ec2 nodes crash at once)
[23:18] <Tv> apart from that, they pretty much followed best current practices
[23:18] <Tv> ebs and s3 are unreliable, if you care about your uptime you just don't rely on then, it's as simple as that
[23:18] <lxo> speaking of failures... I found my 0.25.1-created filesystem (large rsync into it still underway, lots of btrfs and ceph kernel errors) had a number of zero-sized files dated with the epoch time in previously-copied directories
[23:18] <Tv> *them
[23:19] <cmccabe> tv: there were also postgres servers involved
[23:19] <Tv> cmccabe: yeah but as far as i can tell those were handled according to BCP for ec2 customers
[23:19] <darkfaded> Tv: if that's best current practices then i definitely stand by my point that this is computing for web-shops but nothing remotely comparable to how any real *data* is handled
[23:19] <lxo> I suppose some rsync session crashed, but I can't imagine a POSIX-compliant scenario in which multiple files would be zeroed without any of them having got any contents
[23:19] <Tv> can't fault reddit there (apart from wanting to use ec2 ;)
[23:19] <darkfaded> lxo: ouch.
[23:20] <Tv> darkfaded: remember my earlier point that you can't cram traditional IT infra into redundant array of inexpensive nodes? postgres is traditional.
[23:20] <lxo> could this be the result of one of the many reboots? and, more importantly, any idea of how ceph could have got into this scenario, and how to help debug it?
[23:21] <darkfaded> Tv: yes but this is on a level that would get any admin fired on the very day it happens
[23:21] <lxo> (FWIW, the zero-sized files seem to be one of the things that make rsync very slow; starting appending to them takes several seconds)
[23:21] <Tv> darkfaded: i've had several customers lose gigabytes of mail from their Exchange servers, and nobody got fired...
[23:21] <cmccabe> darkfaded: I don't understand where reddit screwed up in your opinion, aside from trusting a company named amazon
[23:21] <Tv> darkfaded: (fwiw i wouldn't touch those servers with a big stick)
[23:22] <cmccabe> darkfaded: traditionally you don't get fired for trusting amazon/ibm/some other big company
[23:22] <Tv> lxo: there's a failure mode for the extN filesystems that truncate files on crash
[23:22] <darkfaded> hehe
[23:22] <darkfaded> Tv: thats just EMAILs
[23:22] <lxo> Tv, no ext here any more (on these machines, anyway), all btrfs
[23:22] <darkfaded> data is stuff that is worth money?
[23:23] <cmccabe> rsync does pre-create directories before doing a transfer (at least with -avi). I don't know if it pre-creates zero-length files or what
[23:23] <Tv> cmccabe: nope
[23:23] <darkfaded> one time one of my colleagues just did a replica the wrong way, because someone distracted him. we got sued by the customer for around $7m
[23:23] <lxo> still, the metadata is handled separately by ceph, and the timestamps make it clear that *that* was lost too, no? (I'm guessing, my first walk-through in the ceph code wasn't very useful)
[23:24] <Tv> lxo: sounds like a ceph bug, yes
[23:24] <lxo> cmccabe, it doesn't pre-create files, even with --append or --append-verify, I checked that
[23:24] <Tv> lxo: any idea on how to reproduce it?
[23:24] <cmccabe> ok
[23:25] <lxo> Tv, not really. rsyncing my home dir (with years of chat logs and stuff), experiencing occasional btrfs freezes and oopses (and occasional ceph.ko's failure to allocate memory), rebooting servers and moving data to try to speed things up, changing replication factors
[23:26] <lxo> then finally trying to figure out why the syncing of these small chat logs is taking so long (hadn't those been synced before?!?), and finding lots of zero-sized files
[23:26] <lxo> what makes even less sense is that *some* of the files in the middle were not zero-sized, and rsync is creating them in order
[23:26] <Tv> lxo: yeah.. all i can say is i'm setting up automated testing that'll create those kinds of problems automatically
[23:26] <lxo> awesome!
[23:27] <Tv> lxo: greg or sage might be able to look at the log files and divine some meaning
[23:27] <lxo> my best theory is that some file contents may have got lost during some of the disk reshuffling I did, but I *think* I didn't reshuffle since the last time I created the filesystem from scratch
[23:28] <lxo> some disks were regarded as lost for extended periods of time, though
[23:28] <lxo> (after btrfs froze the disk and the osd in it, but the cluster kept going)
[23:30] <lxo> anyway, I should mention that ceph 0.25.1 is orders of magnitude more robust than 0.24.3. I haven't lost the entire filesystem any single time! (which would unfortunately happen quite often with 0.24.3, when the monitors wouldn't elect a leader any more, or some pgs were unrecoverably lost, or such stuff)
[23:30] <lxo> so, great work, thanks!
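(Editor's sketch, not part of the original log.) To get a list of affected files for a bug report, the symptom lxo describes (zero-sized files dated with the epoch time) could be scanned for along these lines. The function name and the one-day slack around the epoch are assumptions chosen for illustration; nothing here is Ceph-specific.

```python
import os
import stat


def find_empty_epoch_files(root, slack_seconds=86400):
    """Yield regular files under root that are empty and have an mtime near the Unix epoch."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if (stat.S_ISREG(st.st_mode)
                    and st.st_size == 0
                    and abs(st.st_mtime) <= slack_seconds):
                yield path


if __name__ == "__main__":
    import sys
    for p in find_empty_epoch_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(p)
```

Comparing the output against the rsync source tree would show which of these files were supposed to have contents.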
[23:39] <gregaf> lxo: sorry, could you describe what happened from the beginning?
[23:52] <sjust> anyone know what ceph_ver.h is supposed to be?
[23:53] <gregaf> I think it's an autogenerated header file containing the ceph version numbers
[23:53] <gregaf> not certain, though
[23:53] <cmccabe> sjust: ceph_ver.h is autogenerated by the build system and it contains the ceph version
[23:53] <sjust> ok
[23:54] <cmccabe> what brings it to your mind
[23:56] <sjust> just seeing a build error, updating master fixed it
[23:57] <gregaf> generally an issue with that is resolved by make clean; make
[23:57] <gregaf> or make distclean; make
[23:58] <cmccabe> there was an issue with config.o recently... automake doesn't like moving files
[23:58] <cmccabe> I don't think ceph_ver.h should have any issues though

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.