#ceph IRC Log


IRC Log for 2011-11-08

Timestamps are in GMT/BST.

[0:15] * fronlius (~fronlius@f054105049.adsl.alicedsl.de) Quit (Quit: fronlius)
[1:11] * cp (~cp@ has joined #ceph
[1:17] <cp> Quick question: I'm trying to build qemu to work with rbd but I'm getting a strange error message
[1:17] <cp> ./configure --enable-rbd
[1:17] <cp> ERROR
[1:17] <cp> ERROR: User requested feature rados block device
[1:17] <cp> ERROR: configure was not able to find it
[1:17] <cp> ERROR
[1:18] <cp> I have ceph installed so I'm not sure what's going on
[1:20] <cp> Ah.. apt-get install librbd-dev
[1:20] <cp> sorry for the chatter
[1:28] <ajm> 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x5c4fae]
[1:28] <ajm> 10: (void decode<unsigned int, PG::Interval>(std::map<unsigned int, PG::Interval, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, PG::Interval> > >&, ceph::buffer::list::iterator&)+0x31) [0x639951]
[1:28] <ajm> not sure if this is known/new
[1:28] <ajm> full osd log: http://adam.gs/osd.7.log.bz2
[1:32] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[2:14] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:22] * ajm (adam@adam.gs) has left #ceph
[2:23] * ajm (adam@adam.gs) has joined #ceph
[2:23] <ajm> NaioN: fwiw, btrfs seems happier in 3.1.0
[2:32] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:43] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[3:01] * cp (~cp@ Quit (Quit: cp)
[3:10] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[3:11] * gregorg (~Greg@ has joined #ceph
[3:11] * jantje (~jan@paranoid.nl) Quit (Read error: Connection reset by peer)
[3:11] * jantje (~jan@paranoid.nl) has joined #ceph
[3:37] * adjohn (~adjohn@ Quit (Quit: adjohn)
[4:05] * aliguori_ (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:04] * mrjack (mrjack@office.smart-weblications.net) Quit (Ping timeout: 480 seconds)
[5:07] * mrjack_ (mrjack@office.smart-weblications.net) Quit (Ping timeout: 480 seconds)
[5:41] * adjohn (~adjohn@70-36-139-211.dsl.dynamic.sonic.net) has joined #ceph
[6:22] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[8:35] <NaioN> ajm: i'm running 3.1 but have the same problem
[8:36] <NaioN> so I think it also has to do with the workload (20 rsyncs)
[8:36] <NaioN> but I'm going to try ext4 today...
[8:52] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[8:53] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit ()
[9:24] * adjohn (~adjohn@70-36-139-211.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[9:41] * adjohn (~adjohn@70-36-139-211.dsl.dynamic.sonic.net) has joined #ceph
[9:45] * adjohn (~adjohn@70-36-139-211.dsl.dynamic.sonic.net) Quit ()
[10:04] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[10:04] * gregorg (~Greg@ has joined #ceph
[10:27] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:10] <todin> NaioN: ext4 is much better, but be careful, there is a bug in ext4 which was fixed just a few days ago
[11:23] <psomas> todin: do you have a link for the bug?
[11:27] * gregorg_taf (~Greg@ has joined #ceph
[11:28] * gregorg (~Greg@ Quit (Read error: No route to host)
[11:50] <todin> psomas: http://patchwork.ozlabs.org/patch/121441/
[11:51] <psomas> thanks
[11:52] <todin> psomas: as far as I could see it, it is in mainline by now
[12:09] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[12:52] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[13:48] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[13:49] <NaioN> todin: thx!
[13:50] <NaioN> but I'm hitting the same thing.... after a while the IO stalls
[13:50] <NaioN> and of course the OSDs commit suicide
[13:51] <NaioN> but I don't know why the IO stalls
[13:51] <NaioN> I now tested with btrfs and ext4 on mdraid
[14:00] <ajm> I assume you're using 0.37?
[14:10] <todin> NaioN: do you use btrfs or ext4?
[14:41] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[14:43] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[14:43] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:57] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:00] * pserik (~Serge@eduroam-60-133.uni-paderborn.de) has joined #ceph
[16:01] <pserik> hey all
[16:02] <pserik> i tried to make and install ceph from sources, but when I try to start it, it says: .: 35: Can't open /usr/lib/ceph/ceph_common.sh
[16:05] <pserik> and I have no directory named ceph in my /usr/lib
[16:08] <pserik> I built it from 0.37 sources by executing: autogen, configure, make and make install commands
[16:10] <NaioN> todin: ext4
[16:11] <NaioN> but i think the problem is mdraid
[16:11] <NaioN> I get the same issue with btrfs and ext4
[16:11] <NaioN> after a while the IO stalls
[16:11] <NaioN> and I see a high io wait without io to the disks/md device
[16:12] <NaioN> I've tried with btrfs/ext4 per disk and I had no problem with stalled IO
[16:12] <todin> NaioN: I only know this problem with btrfs, I use ext4 without a problem, but I do not have md
[16:13] <todin> NaioN: ok,
[16:13] <todin> NaioN: but why do you use md? you have the replication level in ceph
[16:18] <pserik> somebody, any suggestions?
[16:27] <NaioN> todin: because with md between I don't have to reboot the osd with a failed disk
[16:27] <NaioN> with ext4 if the disk dies you get a stalled mount
[16:27] <NaioN> and of course you could reboot the osd, because of the replication, but i try to avoid it
[16:28] <NaioN> with btrfs you could use the capability of btrfs (multi device) but that's with RAID1
[16:29] <NaioN> so then you get RAID1 on the node and replication
[16:29] <NaioN> so your effective space is 1/4
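The 1/4 figure above follows from multiplying the two levels of redundancy. A minimal sketch of that arithmetic, assuming two local copies (RAID1 or btrfs mirroring) and a Ceph replication level of 2; the replication level is not stated in the log, and `usable_fraction` is a hypothetical helper name:

```python
from fractions import Fraction


def usable_fraction(local_copies: int, ceph_replicas: int) -> Fraction:
    """Fraction of raw disk space left for unique data.

    local_copies:  copies kept inside each node (2 for RAID1 / btrfs mirroring)
    ceph_replicas: copies Ceph keeps across OSDs (assumed to be 2 here)
    """
    if local_copies < 1 or ceph_replicas < 1:
        raise ValueError("copy counts must be at least 1")
    return Fraction(1, local_copies * ceph_replicas)


print(usable_fraction(2, 2))  # RAID1 on the node plus 2x replication -> 1/4
print(usable_fraction(1, 2))  # one filesystem per disk, 2x replication -> 1/2
```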
[16:54] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[16:56] * adjohn (~adjohn@70-36-139-211.dsl.dynamic.sonic.net) has joined #ceph
[16:59] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[17:12] <pserik> ah, I got it…
[17:20] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[17:26] <gregaf1> pserik: haven't seen that one before; are you sure that make install worked? (did you forget to run it as sudo?)
[17:27] <pserik> no no… everything was ok, but the libs were installed to /usr/local/(lib | bin). so i changed the path variables in ceph
[17:30] <pserik> gregaf1: after that i was able to start ceph
[17:35] <gregaf1> hmm, odd
[17:36] <gregaf1> pserik: can you make a bug about what happened and what your OS and such is?
[17:38] <pserik> ok, can do it later. have to register first
[17:39] <gregaf1> thanks!
[17:50] <pserik> gregaf1: how can I target the 0.37 version? because on the "New issue" page I can only target 0.39
[17:51] <gregaf1> pserik: that's the version we want it fixed in, not the version it affects
[17:51] <gregaf1> you can just leave it blank
[17:51] <pserik> gregaf1: ok
[18:06] <NaioN> todin: but I'm now doing a run with BTRFS on all disks (RAID1)
[18:06] <NaioN> it looks promising
[18:10] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:15] * pserik (~Serge@eduroam-60-133.uni-paderborn.de) has left #ceph
[18:39] <Tv> gregaf1: e37ab41605989b310ed21993bcf55766ff520f55 is problematic.. -1 == -EPERM
[18:41] <Tv> b8733476d295e0d522710c24522e24c0edb5c17b makes src/test_trans.cc be ignored, that's confusing
[18:41] <Tv> gregaf1: still you ;)
[18:42] <Tv> best probably to rename test_trans.cc then
[18:42] <gregaf1> heh, anybody who names a c file test_ in the src dir deserves what they get
[18:42] <Tv> if that's really how we wanna roll
[18:44] <gregaf1> and you can't get EPERM there; it's fine
[18:44] <Tv> gregaf1: it's definitely not "fine" ;)
[18:44] <Tv> it may be acceptable
[18:45] <gregaf1> if you want to fix the client to align with your vision of reality then you can do so, but it's an internal interface that isn't required to pass around strict POSIX errnos; it's fine
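The "-1 == -EPERM" remark can be checked directly: on Linux, EPERM is errno 1, so a bare -1 return looks like -EPERM to any caller that treats negative values as errnos. A minimal sketch, assuming the commit in question returns a plain -1 (the log does not show the code); `describe` is a hypothetical helper:

```python
import errno
import os

# EPERM has the numeric value 1 on Linux and other POSIX systems.
print(errno.EPERM)
print(errno.errorcode[errno.EPERM])
print(os.strerror(errno.EPERM))


def describe(ret: int) -> str:
    """Interpret a negative-errno style return value (hypothetical helper)."""
    if ret >= 0:
        return "ok"
    return errno.errorcode.get(-ret, "unknown errno %d" % -ret)


# A generic "-1 means failure" return is reported as EPERM by such a caller.
print(describe(-1))
```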
[18:46] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:54] * jrosser (jrosser@dog.thdo.woaf.net) Quit (Quit: Changing server)
[19:22] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:22] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[19:27] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[19:33] <slang> has the work on the ceph chef integration stalled/died?
[19:33] <Tv> slang: nope, just.. lots of things to fix
[19:34] <slang> Tv: will they have support for doing things like adding an osd to an existing deployment?
[19:34] <Tv> slang: that part has been done for a while
[19:34] <slang> (the chef recipes for ceph)
[19:34] <slang> oh cool
[19:35] <Tv> slang: it currently doesn't yet understand multiple hard drives and such
[19:35] <Tv> slang: work is progressing on coordinating multiple monitors in a better way, etc
[19:35] <slang> Tv: is the ceph-cookbooks repo up-to-date? it hasn't had any commits for a while
[19:36] <Tv> slang: the wip-simple branch does what it does; i've been burning time testing it under crowbar lately
[19:36] <Tv> slang: it's still not in master because i think it'll get rebased a little to clean up
[19:36] <Tv> slang: current limitations: one osd per box, osd data is a subdir not a separate fs, single mon only
[19:37] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:39] * mfoemmel (~mfoemmel@chml01.drwholdings.com) has joined #ceph
[19:41] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[19:43] * `gregorg` (~Greg@ has joined #ceph
[19:45] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:46] <slang> Tv: ok thanks!
[19:46] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[19:46] * morse_ (~morse@supercomputing.univpm.it) Quit (Read error: Connection reset by peer)
[20:02] <sagewk> tv: any reason not to put the conf in ctx somewhere? ctx.conf perhaps?
[20:03] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Quit: Ex-Chat)
[20:07] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[20:13] * fronlius (~fronlius@f054114193.adsl.alicedsl.de) has joined #ceph
[20:16] <Tv> sagewk: the top-level config? i wanted to avoid wide-spread use of it because it leads to a "global variable" syndrome where it's harder to see what makes a task behave a certain way
[20:16] <Tv> sagewk: and e.g. can't run workunit task twice in the same test etc, as they'll now use the same config
[20:16] <sagewk> the ceph.conf config actually
[20:16] <Tv> ohh that
[20:17] <sagewk> i need to pick out individual monitor addrs
[20:17] <sagewk> e.g.
[20:17] <sagewk> def get_mon_status(self, mon):
[20:17] <sagewk>     addr = self.ctx.conf['mon.%s' % mon]['mon addr']
[20:17] <sagewk>     return self.raw_cluster_cmd(args=['-m', addr, 'mon_status'])
[20:19] <Tv> sagewk: yeah.. basically, i'm trying to avoid problems we (as in the twisted developers) saw with a variable named "ctx" in twisted's web stack some 5 years ago
[20:19] <Tv> sagewk: there ctx.foo was a wild ground of no guarantees and got really hard to reason about
[20:22] <Tv> sagewk: i'm not sure what the best approach is, and that always makes me cautious
[20:22] <sagewk> :)
[20:24] <Tv> sagewk: i think having the ceph task set ctx.ceph = argparse.Namespace(); ctx.ceph.conf = ... is probably safe enough
[20:24] <sagewk> k
[20:24] <Tv> i was thinking of institutionalizing something like that earlier, but at the time refactoring teuth didn't seem all that important
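Tv's suggestion of keeping the parsed ceph.conf under a task-owned namespace rather than directly on ctx could look roughly like the sketch below. It is an illustration under assumptions, not the actual teuthology code: `FakeManager` is a hypothetical stand-in, `raw_cluster_cmd` only echoes the command it would run, and the conf is modelled as a plain dict of dicts with a made-up monitor address.

```python
import argparse


class FakeManager:
    """Hypothetical stand-in for the manager class sagewk is editing."""

    def __init__(self, ctx):
        self.ctx = ctx

    def raw_cluster_cmd(self, args):
        # Placeholder: the real method would run the ceph CLI.
        # Here we only build the command line it would use.
        return ' '.join(['ceph'] + list(args))

    def get_mon_status(self, mon):
        # Read the monitor address from the namespaced conf, not from ctx.conf.
        addr = self.ctx.ceph.conf['mon.%s' % mon]['mon addr']
        return self.raw_cluster_cmd(args=['-m', addr, 'mon_status'])


# What the ceph task would set up once, so other tasks know who owns the data.
ctx = argparse.Namespace()
ctx.ceph = argparse.Namespace()
ctx.ceph.conf = {'mon.a': {'mon addr': '10.0.0.1:6789'}}  # made-up address

print(FakeManager(ctx).get_mon_status('a'))
```

Namespacing by task keeps it visible which task put the data on ctx, which is the "global variable" concern raised above.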
[20:25] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[21:19] * aliguori (~anthony@ has joined #ceph
[22:23] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:28] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:43] * bchrisman (~Adium@ has joined #ceph
[22:50] <todin> is there any planning for tiered storage on the osd side? to put "hot" objects on ssd and "cold" ones on hdd?
[22:52] <Tv> todin: are we talking rados or cephfs?
[22:53] <Tv> oh inside a single osd, that would be doable
[22:53] <Tv> without changing anything outside the osd
[22:54] <Tv> todin: but it's not on the medium term roadmap; btrfs or some other multi-disk fs might end up providing it for free
[22:56] <todin> Tv: that's right, but I don't see that btrfs will get stable any time soon
[22:57] <Tv> todin: http://ceph.newdream.net/2011/09/roadmap-update/ is a lot of work that'll probably all happen before we worry about SSDs too much
[22:59] <todin> Tv: there is always a lot to do. is it right that you guys are changing direction from an HPC filesystem more to a backend storage system for the cloud?
[23:00] <Tv> todin: Ceph is a software stack. The lowest levels of the stack will need to be stabilized first. RADOS is the lowest level, and that and RBD (which is a thin wrapper on top) provide a lot of value for the cloud/virtualization audience.
[23:01] <Tv> todin: So it's not as much a change in direction as, that just will be ready first.
[23:02] <Tv> Ceph the unified storage system will solve all kinds of storage problems.
[23:04] <todin> but different storage problems will have different workload patterns, are HPC and cloud hosting similar enough for one solution?
[23:05] <Tv> todin: HPC is not a single workload in the first place, really.
[23:06] <todin> HPC is not my area, hosting is my area of expertise
[23:07] <Tv> todin: I think managing multiple different workloads has been the core challenge of file systems and storage for a long time; there's nothing too special in that itself.
[23:11] <todin> that's right, but my point is, if I look at commercial solutions for cloud storage they all come with tiered storage, therefore it is hard for me to tell the CXO level that ceph will bring us more advantage
[23:12] <Tv> todin: The real question is always alternate costs; will your SSD be faster than a cluster that's as much $$$ larger, with more RAM, spindles and network bandwidth.
[23:12] <todin> and the density of virtual machines is rising, and therefore also the need for more iops, we are almost always iops limited
[23:13] <Tv> todin: Read between the lines: we're not a hardware vendor, and don't get a kickback for selling SSDs ;)
[23:14] <Tv> todin: Outside of the seek performance, SSDs can be even slower than a good disk. The world is not as simple as your hardware vendor would like it to seem.
[23:14] * verwilst (~verwilst@d51A5B679.access.telenet.be) has joined #ceph
[23:14] <Tv> As for the seek performance, how many spindles is your storage cluster going to have? Are you really seek limited at that point..
[23:14] <Tv> We stripe every file and rbd image over the whole cluster.
[23:17] <todin> we will have 320 spindles for around 1600 virtual machines, this will be one building block. for all the virtual machines we are planning more of those building blocks
[23:19] <todin> at this size the network equipment is still affordable
[23:21] <Tv> I'm personally happily running >5 vms per spindle almost everywhere, but it's more me-providing-vms-to-me.
[23:22] <Tv> I don't have an official professional recommendation ;)
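The numbers todin gives work out to 5 VMs per spindle, in the same range Tv mentions. A back-of-the-envelope sketch of the per-VM IOPS budget; the 100 IOPS per spindle figure is an assumption for illustration only (it is not from the log), and the helper names are hypothetical:

```python
def vms_per_spindle(vms: int, spindles: int) -> float:
    """Average number of virtual machines sharing one disk."""
    return vms / spindles


def iops_per_vm(spindles: int, vms: int, iops_per_spindle: float) -> float:
    """Average raw IOPS budget per VM.

    Ignores replication overhead, journals and caching, so treat the
    result as a rough upper bound, not a prediction.
    """
    return spindles * iops_per_spindle / vms


print(vms_per_spindle(1600, 320))     # 5.0
print(iops_per_vm(320, 1600, 100.0))  # 20.0 under the assumed 100 IOPS/spindle
```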
[23:24] <NaioN> todin: you could have a look at dcache
[23:25] <NaioN> but I don't know how it behaves in combination with ceph
[23:25] <todin> that wasn't my intention to get a professional recommendation, I just wanted to let you guys know what we do with your great stuff, so that you maybe get a different angle of view ;-)
[23:26] <Tv> todin: rbd-for-vms is a big use case we are putting a lot of focus on
[23:26] <Tv> todin: SSDs are mentioned often, but apart from btrfs or use-as-a-journal we don't really have anything special in the architecture for them
[23:27] <todin> and some good news, I couldn't crash my development system for days
[23:28] <ajm> Tv: any #s on how much of a speedup you get from journal-on-ssd ?
[23:28] <Tv> ajm: not yet
[23:29] <elder> I'm not sure that SSD journal will do what you think. Journaling is basically all write, and that's the worst case for SSD's.
[23:29] <Tv> elder: I'm well aware, but people do report a speedup. We're not providing any numbers yet.
[23:30] <todin> ajm: as far as I can tell the osd journal disk is not the limit, those writes are sequential
[23:30] <todin> the data disk is the problem
[23:30] <elder> It also depends on what the journal traffic actually looks like... I'm used to it being a continuous stream of contiguous blocks, which is best case for a spinning disk. Maybe your journals aren't like that.
[23:30] <Tv> I still think the best possible future is btrfs putting metadata on SSD.
[23:31] <Tv> elder: Journal is contiguous, but it has lots of syncs; perhaps that is what helps the SSDs be faster.
[23:32] <elder> Maybe. It's one of those things you won't know until you try it (as you know)
[23:32] <Tv> Yes, and I'm explicitly saying we haven't tried it properly, but here's what we've heard from others. No guarantees.
[23:32] <todin> Tv: but didn't ebofs try to solve the problem of the many seeks on the osd data disk?
[23:33] <elder> (Carry on. Sorry for inserting myself mid-conversation.)
[23:34] <Tv> todin: I don't think avoiding read-time seeks was the major motivation for it, more the ability to minimize the metadata etc.. that might avoid a seek here and there, but honestly I expect the fs structure to be in RAM anyway.
[23:34] <todin> hmm, I just glanced quickly over the paper, maybe I misunderstood it
[23:35] <Tv> todin: the big thing EBOFS could do was avoiding directories, unix access modes, etc
[23:35] <todin> Tv: do you have any special case where you want performance data?
[23:36] <Tv> todin: What I want is for our QA processes to include regular, repeatable benchmarks, and I want consistent performance to be a release criteria...
[23:38] <todin> that's what I do with every release ;-)
[23:38] <Tv> todin: Want a job?-)
[23:40] * verwilst (~verwilst@d51A5B679.access.telenet.be) Quit (Ping timeout: 480 seconds)
[23:40] <todin> it's quite far away, so you want to build a continuous integration system?
[23:41] <todin> which will do some stability and performance tests?
[23:42] <Tv> the real question is what to measure, with what setup, under what kind of load, and what decisions to make based on that
[23:42] <Tv> most of the fs benchmarks etc out there are naively single-computer
[23:46] <todin> that's a prob for the cephfs, but why don't you write a simple load generator, and start it on many different hosts?
[23:47] <todin> for the cloud test, I have VMs which behave like our customers, which will then run on the rbd store
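A "simple load generator" of the kind todin describes might start from something like the sketch below. It only covers the per-host worker: it writes fixed-size blocks with an fsync after each one and reports the synced write rate. Starting it on many hosts and collecting the results is left out, and nothing here is Ceph-specific; `run_write_load` and its parameters are hypothetical.

```python
import os
import tempfile
import time


def run_write_load(path: str, block_size: int = 4096, blocks: int = 64) -> float:
    """Write `blocks` blocks of random data sequentially, syncing each one.

    Returns synced writes per second. Hypothetical helper, not part of Ceph.
    """
    payload = os.urandom(block_size)
    start = time.monotonic()
    with open(path, 'wb') as f:
        for _ in range(blocks):
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the block to stable storage
    elapsed = time.monotonic() - start
    return blocks / elapsed if elapsed > 0 else float('inf')


# Example run against a temporary directory; point `path` at a file on a
# mounted cephfs or rbd-backed filesystem to load the cluster instead.
with tempfile.TemporaryDirectory() as d:
    rate = run_write_load(os.path.join(d, 'load.bin'))
    print('%.1f synced writes/s' % rate)
```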

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.