#ceph IRC Log


IRC Log for 2010-08-02

Timestamps are in GMT/BST.

[2:24] * bbigras (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[2:25] * bbigras__ (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Read error: Connection reset by peer)
[2:25] * bbigras is now known as Guest587
[2:32] <darkfader> no testing crushmaps at 2am
[2:32] <darkfader> i'll postpone that :>
[3:04] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[3:04] <greglap> wido: Gotta love that log bot, it let me see your question today instead of tomorrow :D
[3:06] <greglap> snapshot rollback is an actual on-disk copy whereas creating and deleting pool snapshots are done just by an OSDMap change, which is applied when OSDs get the new map
[3:07] <greglap> this OSDMap change is fairly cheap since for creation OSDs only need to do on-disk work for a snapshot creation when a new write comes in (so you expect latency anyway) and can put off the on-disk work of a deletion until they're not busy
[3:08] <greglap> rolling back a snapshot would require every OSD with pool data to do on-disk copies when they get the command and that's an expensive operation to do across the cluster
[3:20] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[4:48] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[6:08] * Jiaju (~jjzhang@ has joined #ceph
[6:52] * f4m8_ is now known as f4m8
[7:59] * eternaleye_ (~quassel@184-76-53-210.war.clearwire-wmx.net) Quit (Ping timeout: 480 seconds)
[8:39] * allsystemsarego (~allsystem@ has joined #ceph
[9:10] * eternaleye (~quassel@173-129-154-222.pools.spcsdns.net) has joined #ceph
[9:36] * eternaleye (~quassel@173-129-154-222.pools.spcsdns.net) Quit (Ping timeout: 480 seconds)
[10:30] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[11:50] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[11:58] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[12:19] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[12:29] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[13:50] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:32] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) has joined #ceph
[14:54] * jantje (~jan@shell.sin.khk.be) has joined #ceph
[14:54] <jantje> Hi there
[15:21] <wido> hi
[15:41] <wido> yay, 2.6.35 is out :-)
[15:55] * f4m8 is now known as f4m8_
[16:01] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[16:37] <jantje> I want to give ceph a try
[16:38] <jantje> I have 4 servers, each with 2 SATA disks and 4x1Gb NICs on board
[16:38] <jantje> I need redundancy (server may fail = 2 lost disks) and high speed -writes-
[16:38] <kblin> what kind of writes?
[16:38] <jantje> And I'm looking for some advice
[16:39] <jantje> It's going to be used as shared storage for build servers, reading .c files and writing large .o files
[16:39] <jantje> I have no idea under what kind that qualifies
[16:39] <kblin> many changes in a single directory?
[16:40] <kblin> or is that well-spread over many directories
[16:40] <jantje> no, mostly small
[16:40] <jantje> oh, changes, euhm
[16:41] <jantje> just a sec, i think we group our object files per module in a directory
[16:45] <jantje> lets say 50 files MAX
[16:46] <jantje> so not that many I would think
[16:46] <kblin> ok, because no distributed file system that I'm aware of is really fast at writing thousands of files to one directory simultaneously; a smaller number is probably ok
[16:47] <jantje> great
[16:48] <jantje> i'm currently reading some paper (quite old, but gives a good view on the architecture) , and i've read that IBM article as well
[16:49] <jantje> there's also that 213 page paper :-)
[16:55] <monrad-65532> i would like to test out ceph, just need a few more servers than i can get together right now :)
[16:56] <monrad-65532> most as backend for mail and websites (php and plain html)
[17:04] <jantje> traffic control looks really cool, data is redistributed based on MDS statistics
[17:05] <jantje> (is there some cost/speed metric? for example, try to write files to a SSD disk, and then move it to a slower SATA disk?)
[17:08] * fred_ (~fred@80-219-183-100.dclient.hispeed.ch) has joined #ceph
[17:08] <fred_> yehudasa, around?
[17:20] <jantje> kblin: suggestions for a crush map are welcome :-)
[17:26] * fred_ (~fred@80-219-183-100.dclient.hispeed.ch) Quit (Quit: Leaving)
[17:51] * Anticime1 is now known as Anticimex
[17:51] <Anticimex> monrad-65532: hi
[18:09] <gregaf> kblin: it hasn't been well-tested so there are probably implementation issues, but Ceph is designed so that writing many files to the same directory shouldn't impact performance
[18:11] <gregaf> jantje: there isn't any way right now to write data to an SSD and then move it, but all the servers make use of journals in various ways, so if you can put the journal on an SSD you'll get most of the benefit from that
[18:13] <wido> jantje: you could place your metadata pool on SSD with a CRUSH rule
[18:14] <wido> and place your journal on a SSD, i'm using a X25-M SSD for that, works very nice
[18:15] <wido> gregaf: are you "into" librados?
[18:15] <gregaf> I've worked with it a bit
[18:15] <gregaf> what are you looking for?
[18:16] <wido> well, just some general questions regarding the app i'm writing
[18:16] <gregaf> okay
[18:16] <wido> we've developed a CalDAV server for which i'm writing a backend, i'm going for RADOS
[18:16] <wido> doing so, gave me some questions about the lib
[18:16] <wido> Is there a maximum length of a RADOS pool name?
[18:17] <gregaf> it's just a free-form string
[18:17] <gregaf> there might be a max somewhere in the code but if there is it's a bug
[18:18] <wido> ok, there is no define in the header, so i guessed so
[18:18] <wido> i'm limiting to 256 chars right now, just in case
[18:19] <gregaf> definitely not a problem
[18:20] <wido> and there is no method to list all the objects in a pool, you have to use list_objects_more to get them in a batch of 1024
[18:20] <wido> to prevent memory eating?
[18:21] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[18:22] <wido> oh, about the IRC log, all the times are in GMT :)
[18:23] <gregaf> well, to list all the objects in a pool you need to go around to every OSD in the cluster, which can get expensive
[18:23] <gregaf> even worse, it's possible for the objects in the pool to change while you're getting them!
[18:24] <wido> indeed, both valid :)
[18:24] <gregaf> so there's a limit on the number you can fetch to keep it sane, and to make sure that if the objects change while you're listing them you will find out and get an up-to-date list
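The batched listing gregaf describes can be sketched in Python as a generator that keeps asking for the next batch until the listing is exhausted. This is only an illustration of the pattern, not the librados API: fetch_batch, its cursor argument, and make_fake_pool are hypothetical names, and the batch size of 1024 is taken from wido's remark above.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical callback: takes a cursor (None to start) and a batch limit,
# returns (object names, next cursor), with next cursor None when finished.
FetchBatch = Callable[[Optional[int], int], Tuple[List[str], Optional[int]]]


def iter_objects(fetch_batch: FetchBatch, batch_size: int = 1024) -> Iterator[str]:
    """Yield every object name, holding at most one batch in memory."""
    cursor = None
    while True:
        names, cursor = fetch_batch(cursor, batch_size)
        yield from names
        if cursor is None:
            return


def make_fake_pool(names: List[str]) -> FetchBatch:
    """In-memory stand-in for a pool, used only to exercise iter_objects."""
    def fetch(cursor: Optional[int], limit: int) -> Tuple[List[str], Optional[int]]:
        start = cursor or 0
        chunk = names[start:start + limit]
        nxt = start + limit if start + limit < len(names) else None
        return chunk, nxt
    return fetch
```

A caller would loop over iter_objects(...) and process names as they arrive, so memory use stays bounded by the batch size rather than the pool size.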
[18:25] <wido> and right now librados tries to open /etc/ceph/ceph.conf, this can only be changed with the -c option which it reads from argv
[18:26] <wido> this is not really wanted when building against the library, since i would like to pass this option to the initialize() method, or even run without a config, only specify the monitor IP, secret and the client name
[18:27] <gregaf> hmm, I think you should be able to do that via the initialize method by just constructing the appropriate argv
[18:28] <wido> yes, i think so, but that is not really clean imho
[18:28] <gregaf> "-m", "", etc
[18:29] <wido> rados.set_monitor_ip(""); would be nice
[18:29] <gregaf> haha, I suppose not but trying to maintain a separate interface that can handle all the options people might want to pass in isn't really worth it
[18:29] <wido> btw, most people are focussing on Ceph, while RADOS is so nice
[18:29] <wido> it's useful for so many things
[18:30] <wido> just stuff some data in it and never look at it again
[18:30] <gregaf> the simple char array isn't glamorous but it's very flexible since it provides an interface for setting every config setting Ceph recognizes, without the thousand setter methods that would otherwise be required
[18:31] <wido> true, but for now you can only set the config (-c) and the monitor ip (-m)
[18:31] <wido> i would like to run without a config at all: specify monitor ip, secret and name
[18:31] <wido> which args should i pass for that, besides -m
[18:32] <gregaf> lemme look
[18:33] <gregaf> -m for monitor
[18:34] <gregaf> "-K filename" for a file containing the cephx key
[18:35] <gregaf> or "-k filename" for a file containing the cephx keyring
[18:36] <gregaf> "-n name" for the name
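The wrapper wido mentions could look roughly like the Python sketch below, which only assembles an argv-style list from the flags gregaf names (-m, -K or -k, -n). build_rados_argv and all example values are hypothetical. Treating -K and -k as mutually exclusive is an assumption based on gregaf's "or", and the log does not say how initialize() consumes the list (for example whether it expects a program name first).

```python
from typing import List, Optional


def build_rados_argv(monitor: str, name: str,
                     key_file: Optional[str] = None,
                     keyring_file: Optional[str] = None) -> List[str]:
    """Assemble config-free arguments: monitor, client name, and one secret file."""
    # Assumption: exactly one of the key file (-K) or keyring file (-k) is given.
    if (key_file is None) == (keyring_file is None):
        raise ValueError("pass exactly one of key_file or keyring_file")
    argv = ["-m", monitor, "-n", name]
    if key_file is not None:
        argv += ["-K", key_file]
    else:
        argv += ["-k", keyring_file]
    return argv
```

Keeping the flag knowledge in one small function means the rest of the application never handles raw option strings.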
[18:37] <wido> ah, great
[18:37] <wido> that will do, i'll wrap some functions around it
[18:38] <gregaf> :)
[18:40] * Osso_ (osso@AMontsouris-755-1-10-232.w90-46.abo.wanadoo.fr) has joined #ceph
[18:44] <jantje> ** ERROR: error creating empty object store in /data/osd0: Inappropriate ioctl for device
[18:44] <jantje> /dev/sdb 1048576 72 1048504 1% /data/osd0
[18:44] <jantje> (btrfs)
[18:44] <jantje> the disk is a qemu virtual disk
[18:44] <jantje> maybe that's the problem?
[18:45] <wido> did you specify btrfs devs in your config?
[18:45] <gregaf> you need to specify a size if it's a file instead of a block device
[18:45] <wido> true :)
[18:45] <wido> that's for the journal
[18:45] <gregaf> ah, I just had the error associated with that answer
[18:45] <jantje> it's a device
[18:46] * fred_ (~fred@68-29.1-85.cust.bluewin.ch) has joined #ceph
[18:46] <fred_> hi
[18:46] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[18:46] * Osso_ is now known as Osso
[18:47] <jantje> and /data/osd0/journal exists
[18:47] <wido> that is a file, so did you specify a journal size?
[18:48] <jantje> # cat /sys/block/sdb/device/model
[18:48] <jantje> QEMU HARDDISK
[18:48] <jantje> Hm
[18:48] <jantje> ok, no then
[18:49] <jantje> where do i specify this size
[18:49] <jantje> just the config?
[18:50] <gregaf> yeah
[18:50] <jantje> osd journal size indeed
[18:50] <fred_> I've got strange qemu freezes using qemu+rbd (vm does not answer ping, qemu management console is not responsive) may that be related to librados blocking ?
[18:50] <gregaf> "osd journal size =", I think
[18:50] <gregaf> pretty sure it's in MB
[18:51] <sagewk> wido; on #328, that crash is the workaround greg added last week not quite working. it means you've reproduced the original problem, though.. the workload was just an rsync?
[18:52] <jantje> anyone uses PXE images to boot nodes?
[18:52] <wido> yes sagewk only an rsync
[18:53] <wido> is it the same? to me it seemed new
[18:53] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[18:53] <sagewk> the problem is we don't have the mds logs for when the log entries were generated, just the brokenness during replay.
[18:54] <jantje> 10.08.02_12:53:33.071646 7fa755ed5710 -- :/2915 >> pipe(0x1810310 sd=-1 pgs=0 cs=0 l=0).fault first fault
[18:54] <jantje> (on osd stat)
[18:54] <sagewk> it crashed this time because one of the mislinked items was a directory, and it wasn't able to forcibly remove it from the cache (due to children)
[18:54] <gregaf> jantje: that's a messaging problem
[18:55] <gregaf> are all the appropriate processes running?
[18:55] <jantje> oh crap
[18:55] <jantje> i didnt start
[18:55] <jantje> :)
[18:55] <gregaf> :)
[18:56] <sagewk> maybe you can redo the rsync, and then restart the mds (and watch for the 'FIXME had dentry link to wrong inode' messages?)
[18:56] <fred_> jantje, for the next time, it means the monitor at cannot be reached
[18:56] <sagewk> i'll push a quick fix to make the replay complete
[19:00] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[19:01] <sagewk> wido: e3721638 should let your mds recover (assuming the mislinkage isn't too severe)
[19:04] <sagewk> wido created a new issue #329 for this
[19:10] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[19:16] <fred_> is yehuda the man for qemu+rados troubles ?
[19:16] <sagewk> fred_: yeah. he should be in soon
[19:17] <fred_> good, thank you
[19:18] <wido> sagewk: building right now
[19:18] <wido> btw, i'm seeing: "ceph: corrupt inc osdmap epoch 4202 off"
[19:19] <sagewk> is there more to that line? should include some numbers and pointers after "off"
[19:19] <wido> oh yes, a lot of lines
[19:20] <wido> http://www.pastebin.org/441756
[19:21] <sagewk> woah, ok
[19:21] <wido> that is only a snippet, there is a lot more in the log
[19:21] <wido> you could take a look at logger.ceph.widodh.nl in the kern.log
[19:21] <sagewk> should be enough
[19:26] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[19:29] <wido> about the Debian packages, they depend on libssl0.9.8k, which is present on Debian, but Ubuntu 10.04 has libssl0.9.8m
[19:30] <wido> so you can't install these packages without forcing. They work fine though on Ubuntu
[19:31] <sagewk> is that because you built on ubuntu? debian/control doesn't specify a version of libssl
[19:32] <wido> no, on my laptop i wanted to create a small Ceph install, so i fetched the packages from ceph.newdream.net/debian
[19:34] <wido> libssl0.9.8 (>= 0.9.8m-1) that is what the depends says in "ceph_0.21-1_i386.deb"
[19:35] <sagewk> oh.. yeah those packages were built on sid
[19:35] <sagewk> for squeeze, you want the ~bpo60, for lenny ~bpo50 packages.
[19:35] <wido> ah, ok, didn't try those
[19:35] <wido> i just took the stable, thought that was the best package to get
[19:39] <fred_> sagewk, maybe that new feature of linux 2.6.35 would make a good ceph junior job if you need some...: fuse: splice() support
[19:41] <sagewk> fred_: the cfuse manages its own cache, so i'm not sure splice() will work for us in that case
[19:45] <fred_> sagewk, does cfuse cache data ?
[19:45] <sagewk> uea
[19:45] <sagewk> yeah
[19:47] <fred_> on read+write ?
[19:48] <fred_> anyway I think the purpose is cache coherency, so yeah no general use for splice
[19:52] * fzylogic (~fzylogic@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[20:11] <sagewk> wido: pushed a fix for the corrupt osdmap problem to ceph-client.git unstable
[20:12] <yehudasa> fred_: I'm around now, anything specific?
[20:14] <fred_> yehudasa, yeah, thanks
[20:14] <fred_> yehudasa, I've got strange qemu freezes using qemu+rbd (vm does not answer ping, qemu management console is not responsive) may that be related to librados blocking ?
[20:15] <yehudasa> might be related to the qemu-rbd blocking
[20:15] <yehudasa> what version are you using?
[20:15] <fred_> yehudasa, I have the qemu log file (debug ms = 20), are you interested?
[20:16] <yehudasa> yeah, that'd be nice
[20:16] <fred_> yehudasa, stable-0.12.5 + to which I backported your rbd branch
[20:17] <fred_> yehudasa, 50M log, compressed to 2M approx, where and how should I send it ?
[20:17] <yehudasa> you can send it to yehuda@hq.newdream.net
[20:19] <fred_> sent
[20:19] <yehudasa> great
[20:19] <yehudasa> also, can you do a 'git log' on the qemu-kvm tree, and tell me what's the most recent commit?
[20:21] <fred_> yehudasa, what do you need? I did a great soup of merging between ubuntu's ceph's and upstream's commits...
[20:21] <yehudasa> I need to know which commits went in from the rbd branch
[20:21] <fred_> yehudasa, ok, I'll prepare that
[20:22] <yehudasa> specifically which were the latest
[20:30] <fred_> yehudasa, first list http://pastebin.org/441907
[20:30] <fred_> yehudasa, if you need, I can prepare the corresponding commit SHAs so that you could do a for i in $commits ; do git cherry-pick $i ; done
[20:32] <yehudasa> fred_: no need for it
[20:32] <fred_> ok
[20:33] <yehudasa> fred_: you haven't applied the latest rbd
[20:33] <yehudasa> there's some commit that I dropped that went in
[20:34] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[20:35] <fred_> yehudasa, let me check... btw latest rbd commit is two weeks old, right ?
[20:36] <yehudasa> yeah, 12 days ago
[20:37] * jantje throws out qemu for testing
[20:37] <jantje> it just crashes
[20:37] <jantje> even my host server
[20:37] <jantje> It's probably not ceph related, but still qemu is crap.
[20:38] <yehudasa> fred_: but I dropped an earlier commit, the queuing delay one
[20:38] <fred_> checking
[20:40] <jantje> I have a box with 4x1Gbps links, is it 'OK' to serve 4 OSD's (SATA disks) on every link? Or could that saturate the SATA controller/chipset/whatever
[20:40] <wido> jantje: why not use bonding (xor)?
[20:40] <gregaf> you mean an OSD on every link with its own disk?
[20:40] <gregaf> that'll work fine
[20:41] <jantje> gregaf: yes
[20:41] <jantje> wido: that's a possibility too, but I've never tried it
[20:41] <gregaf> or you could stick all the drives together under btrfs and bond the NICs like wido says, dunno which'd be more performant
[20:42] <gregaf> for proper reliability with more than one OSD on a single box you'll need to modify your CRUSH map
[20:42] <jantje> yea, talking about crush, I need some decent explanation on that one, preferably with examples or so
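The reliability point gregaf raises (replicas of one object should not land on two OSDs in the same box) can be illustrated with a toy Python placement function. This is not the CRUSH algorithm and not a crushmap; place_replicas and the host/OSD names are hypothetical, and the hash-based ranking is just a simple deterministic stand-in. It assumes every host lists at least one OSD.

```python
import hashlib
from typing import Dict, List


def place_replicas(obj: str, osds_by_host: Dict[str, List[str]],
                   replicas: int = 2) -> List[str]:
    """Pick one OSD on each of `replicas` distinct hosts, deterministically."""
    if replicas > len(osds_by_host):
        raise ValueError("not enough hosts for the requested replica count")

    def score(label: str) -> int:
        # Stable pseudo-random rank for (object, host or OSD) pairs.
        digest = hashlib.sha256(f"{obj}/{label}".encode()).hexdigest()
        return int(digest, 16)

    # Choose hosts first, so no two replicas can share a failure domain,
    # then choose a single OSD inside each selected host.
    hosts = sorted(osds_by_host, key=score)[:replicas]
    return [min(osds_by_host[host], key=score) for host in hosts]
```

With four hosts of two OSDs each, as in jantje's setup, every object ends up on two OSDs in different servers, so losing one server (and both of its disks) still leaves a copy.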
[20:42] <wido> sagewk: the commit was in unstable, right?
[20:43] <sagewk> wido: ceph-client.git unstable, yeah
[20:43] <wido> jantje: i've got an example crushmap: http://zooi.widodh.nl/ceph/crushmap.txt
[20:43] <wido> i'm using the master branch due to IPv6
[20:43] <sagewk> 73a7e693
[20:43] <sagewk> unstable has been rebased on top of 2.6.35, in preparation for the merge window
[20:43] <wido> oh, ok, i'll switch to unstable then
[20:44] <wido> btw, i have been creating rados snapshots today, but not on the data nor metadata pool
[20:44] <jantje> wido: interesting, thanks.
[20:44] <fred_> yehudasa, ok got it. this is something I did not see in the last force-update I guess
[20:45] <fred_> yehudasa, you think that can be the cause of the problems I'm seeing ?
[20:45] <sagewk> the pool snapshots (on any pool) are what's crashing the client.. the incremental map update parsing was wrong.
[20:46] <jantje> I have 1U boxes that can have 2 disks, but I lose a disk for booting the OS. Anyone tried PXE boot ?
[20:47] <yehudasa> fred_: yes
[20:48] <jantje> wido: http://zooi.widodh.nl/ceph/20100701_015.jpg , we have the same cases (like the one on top)
[20:48] <wido> jantje: why should an OS make you lose a disk? You can make a small partition for the OS
[20:48] <fred_> yehudasa, ok thank you, I'll rebuild and test, and report success if you wish
[20:49] <yehudasa> great
[20:49] <jantje> wido: I think I prefer an entire blockdevice as OSD, it just feels unsafe to me :-)
[20:50] <jantje> wido: maybe I can put in an SSD as boot disk and put my journaling on that as well
[20:50] <wido> ah, ok. PXE should work, but then you would have no swap
[20:50] <wido> and yes, a SSD would be nice for that and journaling
[20:50] <wido> i'm really seeing some nice results with that
[20:51] <jantje> is 4GB ram enough?
[20:51] <jantje> At this moment I'm just putting up a test environment
[20:52] <wido> 4GB for two OSD's, should be sufficient
[20:55] <jantje> K, great
[20:55] <jantje> it's crappy that write speeds of small SSD's are crappy
[20:56] <jantje> 32 or 40GB disks are relatively cheap, but offer low writing speeds
[20:56] <gregaf> gotta be a pretty old or small SSD for the write speeds to be bad, what are you looking at?
[20:57] <gregaf> ah, I forgot vendors had started coming out with those, I'm used to the 64 and 80GB smallest models
[21:00] <jantje> Looks like Corsair Force F60 60GB can do sequential writes up to 275MB/s
[21:02] <jantje> guys, my compliments on Ceph. It's really cool. I hope it gets stable very quickly.
[21:06] <wido> gregaf: models like the X25-V
[21:06] <wido> those really suck, while the X25-M is much better and the X25-E is even cooler
[21:09] <wido> one more thing about the rados pool snapshots, those are not useful when using Ceph are they?
[21:09] <wido> you would have to snapshot data and metadata at exactly the same moment
[21:09] <wido> and rolling back with a large fs could take ages and would require rolling back every object "manually"
[21:22] <wido> sagewk: any idea what "mount error 5 = Input/output error" means?
[21:22] <wido> i've been seeing this a lot lately, restarting my mons and mds'es fixes it
[21:22] <wido> while a ceph -s shows that all is fine
[21:22] <wido> oh, got to go, ttyl
[21:23] <jantje> wido: doei! :)
[21:25] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[21:27] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[21:44] <fred_> thanks, bye
[21:44] * fred_ (~fred@68-29.1-85.cust.bluewin.ch) Quit (Quit: Leaving)
[21:53] <gregaf> wido: what benefit were you thinking you might get from rados snapshots on Ceph, instead of just using the snaps mechanism?
[21:53] <wido> gregaf: a rollback with the Ceph snaps would require running a cp / rsync
[21:54] <wido> i thought that a RADOS rollback would be faster
[21:54] <gregaf> ah
[21:54] <gregaf> no, they're going to do basically the same operations
[21:55] <wido> ah, so rolling back a snapshot of a really large tree, could take hours, even days
[21:55] <wido> when for some reason you do a rm -rf and don't find out
[21:57] <wido> i'm afk again
[21:57] <gregaf> well, I guess the Ceph snapshots would be slower since they have to go through the filesystem layer and the OSDs can do rollbacks locally, but they're still going to need to copy all the filesystem data
[22:00] <iggy> default
[22:00] <iggy> sorry, wrong window :(
[22:16] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.