#ceph IRC Log


IRC Log for 2011-02-28

Timestamps are in GMT/BST.

[0:23] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[1:16] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[1:16] <monrad-51468> talk in 15 min?
[1:16] <monrad-51468> should i pop the popcorn :)
[1:19] <DeHackEd> where?
[1:20] <monrad-51468> http://www.socallinuxexpo.org/scale9x/presentations/ceph-petabyte-scale-storage-large-and-small-scale-deployments
[1:20] <monrad-51468> i think
[1:20] <monrad-51468> if i got the timing right
[1:22] <DeHackEd> umm.. not exactly realtime if it takes 90 seconds to load, is it?
[1:22] <monrad-51468> well i can live with that :)
[1:24] <darkfader> aw i wish i could stay up and watch it
[1:24] <monrad-51468> i should really sleep, but now i am going to watch :)
[1:24] <DeHackEd> I can record it
[1:25] <DeHackEd> whoops, n/m
[1:26] <darkfader> hehe
[1:37] <monrad-51468> it would be nice if the slides were online
[1:37] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[1:37] * jnrg (~Ju@cpe-76-169-8-253.socal.res.rr.com) has joined #ceph
[1:38] <jnrg> hello !
[1:38] <jnrg> Not really a question regarding ceph, I'm still looking and reading the doc, but I've noticed that guy : Idugemucywe
[1:38] <jnrg> http://ceph.newdream.net/wiki/Special:Contributions/Idugemucywe
[1:39] <jnrg> very active at uploading pdf... none of those seems related to ceph in particular
[1:40] * jnrg (~Ju@cpe-76-169-8-253.socal.res.rr.com) has left #ceph
[1:40] <darkfader> the talk is so silent
[1:41] <darkfader> ouch yeah someone will have to ban that pdf spammer
[2:36] <darkfader> sagewk: nice talk :)
[2:42] <monrad-51468> i look forward to the slides
[4:39] * Jiaju (~jjzhang@ has joined #ceph
[8:22] * greglap1 (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[8:22] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[8:45] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[8:49] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:09] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[10:02] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[10:09] * allsystemsarego (~allsystem@ has joined #ceph
[11:01] * Yoric (~David@ has joined #ceph
[12:23] * MarcRichter (~mr@natpool.web-factory.de) has joined #ceph
[12:24] <MarcRichter> Hi everyone. I'd like to get in touch with ceph. But on the Homepage I can't find a link to some kind of a main documentation.
[12:25] <MarcRichter> Can somebody please give me a hint to which document one should read?
[12:26] <Plnt> MarcRichter: there is wiki with some info for starting with ceph.. http://ceph.newdream.net/wiki/
[12:29] <MarcRichter> Plnt: I know :) I'd just like to know if that's the most current one or if the most recent changes are documented elsewhere :)
[12:57] <wido> MarcRichter: for now the Wiki is :)
[13:04] <MarcRichter> Plnt, wido : Thank you very much for the Info :)
[13:06] <wido> MarcRichter: If you have any questions, shoot!
[13:07] <MarcRichter> I will, but for now I know enough to start my tests of ceph.
[13:43] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[13:53] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[15:17] * Yoric_ (~David@ has joined #ceph
[15:21] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[15:21] * Yoric_ is now known as Yoric
[16:03] * greglap1 (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[16:24] * Yoric_ (~David@ has joined #ceph
[16:25] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[16:25] * Yoric_ is now known as Yoric
[16:30] * greglap (~Adium@ has joined #ceph
[16:47] * Yoric_ (~David@ has joined #ceph
[16:52] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[16:52] * Yoric_ is now known as Yoric
[17:16] * greglap (~Adium@ Quit (Quit: Leaving.)
[17:30] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:44] * pombreda (~Administr@ has joined #ceph
[17:45] * Yoric_ (~David@ has joined #ceph
[17:47] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[17:47] * Yoric_ is now known as Yoric
[17:49] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:51] * cmccabe (~cmccabe@ has joined #ceph
[18:02] * MarcRichter (~mr@natpool.web-factory.de) Quit (Quit: Verlassend)
[18:05] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[18:12] * Yoric (~David@ Quit (Quit: Yoric)
[18:22] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:33] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:35] <ana_> Hello there
[18:35] <gregaf> hi
[18:35] <ana_> I am trying to set up a simple configuration
[18:36] <ana_> with 3 monitors, 4 osds
[18:36] <ana_> I am using the vstart script to start up everything
[18:36] <ana_> and, I may be missing something
[18:37] <ana_> because when it starts the OSD (I am running each osd in a different node)
[18:38] <ana_> cosd fails because the id given as argument, it is not correct
[18:38] <ana_> which indeed it isn't. it always launched ./cosd -i 0
[18:38] <ana_> for every osd
[18:39] <ana_> and in the config file, each osd is named with a different number (osd0, osd1, osd2... and so on)
[18:39] <gregaf> ana_: vstart doesn't handle distributed clusters; it's just for testing and starts everything on a single node
[18:39] <ana_> ohhh
[18:40] <ana_> okayy
[18:40] <gregaf> for a proper distributed cluster you'll want to go through the wiki instructions using mkcephfs, init-ceph, etc :)
[18:40] <gregaf> http://ceph.newdream.net/wiki/Main_Page#Setting_it_up
[18:40] <ana_> so I should start it by myself !
[18:40] <ana_> I am lazy bones :)
[18:40] <ana_> okay, thanks a lot :D
[18:41] <gregaf> np!
[18:41] <Tv> think of it this way: imagine some of the machines rebooting.. surely you want them to start their own daemons..
[18:41] <Tv> testing and development have very different needs
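[The mkcephfs/init-ceph workflow gregaf points at above looked roughly like the sketch below in that era's wiki. Hostnames, paths, and the mon address are illustrative placeholders, and the flags are a best-effort reconstruction; the wiki page linked above is the authoritative procedure.]

```shell
# /etc/ceph/ceph.conf (sketch): one section per daemon, each pinned to a host
# [mon.0]
#     host = node1
#     mon addr =     ; example address, not from the log
# [mds.alpha]
#     host = node1
# [osd.0]
#     host = node2
#     osd data = /data/osd0
# [osd.1]
#     host = node3
#     osd data = /data/osd1

# Create the filesystem across every host named in the conf (via ssh),
# then start all the daemons; each node's init script starts its own
# daemons again after a reboot, which is the point Tv makes above:
mkcephfs -a -c /etc/ceph/ceph.conf
/etc/init.d/ceph -a start
```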
[18:43] <Tv> quote of the day: "Xen is 2nd most popular virtualization technology after VMware followed by Linux Kernel-Based Virtual Machines (KVM) was fourth with 21.3%"
[18:43] <Tv> 2nd is followed by 4th?
[18:44] <prometheanfire> xen is 2nd
[18:44] <prometheanfire> vmware 1st
[18:45] <prometheanfire> kvm wait
[18:45] <Tv> ... followed by KVM that was fourth ;)
[18:45] <prometheanfire> fuck
[18:51] * pombreda (~Administr@ Quit (Quit: Leaving.)
[19:01] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:18] * neurodrone (~neurodron@dhcp207-051.wireless.buffalo.edu) has joined #ceph
[19:22] <Tv> filed an issue about librados init: http://tracker.newdream.net/issues/838
[19:22] <Tv> summary: i think it needs to be "productized" more, not assume it's used by one of the ceph daemons
[19:22] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[19:23] <cmccabe> tv: we talked about the logging thing earlier; the default should be either log to stderr or no logging
[19:24] <Tv> oh yeah i should clarify the bug a bit more
[19:24] <cmccabe> tv: reading the conf twice is dumb, but the fix is pretty much inseparable from the killing g_conf fix, which is also on the schedule
[19:25] <cmccabe> tv: again, it won't read the configuration file by default once g_conf is gone
[19:25] <cmccabe> tv: the new API has a separate call to read the configuration file
[19:26] <Tv> cmccabe: use the bug report
[19:26] <cmccabe> tv: unfortunately common_init is still mostly the same under the covers for the time being
[19:30] <Tv> 10:30?
[19:34] <wido> Hi
[19:35] <wido> is the presentation of sage somewhere online? (From yesterday)
[19:35] <wido> The live stream is not working here
[19:41] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[19:56] <gregaf> wido: I couldn't get the stream working either, although his slides at least will be posted sometime today
[20:03] <Tv> "We hope (as of March 2005) to have a fix soon."
[20:03] <Tv> well that's a big relief.. :-/
[20:03] <gregaf> Tv: what's that from?
[20:03] <Tv> http://goog-perftools.sourceforge.net/doc/heap_checker.html
[20:03] <Tv> oh also, if the stuff is still kept in RSS even under memory pressure, it's not a leak
[20:04] <Tv> that's what i really wanted to identify
[20:04] <gregaf> well it's not a heap leak — massif and tcmalloc both report appropriate heap sizes
[20:05] <gregaf> it was clean some time last year, although I guess it's been a while since we ran it through a leak checker
[20:05] <Tv> gregaf: ohhh pmap's -x option gives you RSS per mapping
[20:05] <gregaf> also those docs are old — google-perftools is hosted on Google code now :)
[20:05] <Tv> oh crap, bad google
[20:05] <gregaf> oh cool
[20:05] <gregaf> do you know how it pulls that out?
[20:06] <Tv> gregaf: /proc/19407/smaps i think
[20:07] <gregaf> oh, I didn't know about that one, nifty
[20:07] <Tv> gregaf: did you see the mmap stuff on http://google-perftools.googlecode.com/svn/trunk/doc/heapprofile.html ?
[20:08] <gregaf> I haven't tried running with those options, no
[20:08] <gregaf> oh, but they should catch the libraries even if we're not mmaping ourselves
[20:08] <gregaf> right, I'll check that if my current run with pmap data isn't illuminating enough
[20:11] <Tv> gregaf: also, http://google-perftools.googlecode.com/svn/trunk/doc/tcmalloc.html TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD looks interesting
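[The pmap/smaps/tcmalloc tips Tv lists above can be tried directly; a sketch, where the `cosd` pid lookup is illustrative:]

```shell
# Per-mapping RSS, largest mappings last (RSS is the 3rd column of pmap -x):
pmap -x "$(pidof cosd)" | sort -k3 -n | tail

# pmap reads the kernel's per-mapping accounting; summing the Rss fields
# of /proc/<pid>/smaps gives the same total by hand:
awk '/^Rss:/ { sum += $2 } END { print sum " kB" }' "/proc/$(pidof cosd)/smaps"

# tcmalloc can also report large allocations as they happen
# (threshold in bytes; 1 GB here):
TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=1073741824 ./cosd -i 0
```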
[20:16] <Ormod> btw started playing yesterday with the idea of getting ceph to compile on os x
[20:16] <gregaf> Ormod: heh, I've fiddled around with that a bit
[20:17] <Ormod> well I'm currently at 19 files changed, 121 insertions(+), 41 deletions(-)
[20:17] <Ormod> and it still doesn't work. :) current problem is pthread_t being a pointer on os x
[20:17] <gregaf> I've never gotten very far into it myself
[20:17] <Ormod> but that shouldn't be really hard to fix.
[20:17] <Ormod> I just don't have really time for this
[20:17] <gregaf> since in the brief time I looked at it I couldn't figure out a good way to get all the basic linux types :/
[20:18] <gregaf> I was working through them on a case-by-case basis when I found that OS X didn't like posix spinlocks
[20:18] <Ormod> well I created a mac-compat.h with the stuff, though then ran into problems with types.h undeffing the _le*variants
[20:18] <Tv> Ormod: maybe you could try explicitly compiling cfuse only?
[20:18] <gregaf> and from what I read they don't exist at all?
[20:18] <Tv> or is MacFUSE compatible at all?
[20:19] <Ormod> Tv: should be
[20:19] <gregaf> it was once upon a time, we have some #ifdef DARWIN flags sprinkled throughout the codebase for that reason
[20:19] <Tv> i mean, i don't expect people to run ceph server-side stuff on OS X, not really..
[20:19] <Tv> but the client, very much so
[20:19] <Ormod> Tv: yeah I completely agree
[20:19] <Ormod> Tv: is there a makefile target for fuse?
[20:19] <gregaf> it would also be nice for me personally if I could develop in OS X and at least be able to compile-test things before I push them onto my dev box
[20:20] <gregaf> but it's not something I want to spend any work time on
[20:20] <Tv> Ormod: i'd expect just "make -C src cfuse" to do it
[20:20] <gregaf> and as an amusing diversion I've run out of steam before I ever got anywhere
[20:20] <cmccabe> macfuse is apparently a completely different codebase than fuse on linux
[20:20] <Tv> that's kinda how i remember it
[20:20] <Tv> talk about objective-c api and all..
[20:21] <Tv> then again, "The user-space library (libfuse), which provides the developer-visible FUSE API, has numerous Mac OS X specific extensions and features."
[20:21] <Ormod> cmccabe: really. Ah damn.
[20:21] <Tv> *extensions*
[20:21] <gregaf> I'm pretty sure it's linux-FUSE-compatible
[20:21] <gregaf> that was the whole point of bringing it to OS X as far as I know
[20:22] <Tv> yeah, it's just that on linux, you have the kernel-side stuff in the mainstream kernel; for OS X, MacFUSE needs to implement kernel-side things too
[20:22] <cmccabe> I don't think it's compatible really at all
[20:22] <cmccabe> I mean the general concepts are similar
[20:22] <Tv> e.g. sshfs should work on MacFUSE
[20:24] <gregaf> Tv: it does in some incarnation! there are a lot of Mac apps that wrapped up MacFUSE and sshfs and provided remote file access as a new feature
[20:25] <wido> did somebody place back a backup of the Wiki?
[20:25] <wido> I just saw #833, i swear I've fixed the whole RBD wiki about 2 weeks ago
[20:32] <neurodrone> Hi, I was going through the Ceph paper and had a quick question on how the information exchange between clients and MDSs actually take place.
[20:33] <cmccabe> neurodrone: what's the question
[20:33] <sagewk> wido: not that i know of.. did you check the page history?
[20:33] <sagewk> i was surprised too :/
[20:34] <wido> sagewk: Yes, I've checked it. I see some changes I made, but for example, the page still said load "1.2", but I changed that to 1.3
[20:34] <neurodrone> In the "Dynamic Subtree Partition" section it's said that - "Normally clients learn the locations of unpopular (unreplicated) metadata and are able to contact the appropriate MDS directly." My question is how do the clients get to know about the entire namespace tree in first place?
[20:34] <wido> Stumbled on the old docs about a week ago
[20:34] <wido> so I started fixing them
[20:34] <neurodrone> cmccabe: Is it like the entire tree is replicated over all the MDSs synchronously or optimistically?
[20:35] <cmccabe> neurodrone: I don't understand what you mean by the entire namespace tree
[20:35] <Tv> neurodrone: my understanding: you start from root, iterate down, get told where to go if that subtree/frag is assigned elsewhere; cache things you've encountered
[20:35] <cmccabe> neurodrone: If you mean the entire file hierarchy, I don't think clients know about that
[20:35] <neurodrone> I mean the complete metadata namespace. Isn't that stored in form of a tree (e.g. the directory hierarchy) ?
[20:36] <cmccabe> neurodrone: yes, but no one node has the entire hierarchy
[20:36] <neurodrone> Tv: Okay. And how does it know which MDS has the root node?
[20:36] <Tv> neurodrone: well, every mds is guaranteed to cache it
[20:36] <sagewk> neurodrone: / is always on mds0
[20:36] <sagewk> but yeah, every mds has a copy.
[20:37] <gregaf> neurodrone: every MDS is the authoritative daemon for some (not necessarily adjacent) subsections of the tree, but it also keeps non-auth caches of how to connect all its trees to the root inode
[20:37] <neurodrone> oh only MDS0. That kinda makes sense.
[20:38] <neurodrone> gregaf: I see.
[20:38] <gregaf> so each MDS is solely responsible for its own subtrees, but on the subtree boundaries it keeps the other MDSes informed, so an MDS can forward messages appropriately
[20:39] <gregaf> and when the client gets inode data from the MDS that data includes which mds is "auth" for that inode
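[The lookup pattern Tv and gregaf describe above — walk from the root, get redirected when a subtree is delegated to another MDS, cache what you learn — can be sketched like this. This is a toy model, not Ceph code; the subtree table and ranks are invented.]

```python
# Toy subtree-authority table: path prefix -> MDS rank. In real Ceph this
# knowledge lives in each MDS's subtree map and is learned via redirects.
SUBTREE_AUTH = {
    "/": 0,            # / is always on mds0, per sagewk above
    "/home": 1,
    "/home/busy": 2,
}

def auth_for(path):
    """Longest-prefix match: the deepest delegated subtree owns the path."""
    best = "/"
    for prefix in SUBTREE_AUTH:
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if len(prefix) > len(best):
                best = prefix
    return SUBTREE_AUTH[best]

def lookup(path, cache):
    """Walk components from /, resolving the auth MDS for each step and
    caching the answer, like a client caching dentries/inodes."""
    hops = []
    current = ""
    for part in [p for p in path.split("/") if p]:
        current = current + "/" + part
        if current not in cache:
            cache[current] = auth_for(current)  # "ask" and get redirected
        hops.append((current, cache[current]))
    return hops
```

With this table, `lookup("/home/busy/file", {})` hops between ranks 1 and 2 along the way, which is exactly the alternating-MDS traversal bchrisman asks about below.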
[20:39] <neurodrone> So, if a particular MDS changes its set of metadata (accepts to store a popular set of metadata) is this change known to all MDSs ? or is it passed onto MDS0 ? or this change is not propagated at all ?
[20:39] <monrad-51468> sagewk: nice talk, you were the reason i got late to work today :)
[20:40] <neurodrone> gregaf: Oh okay. That explains some things.
[20:40] <gregaf> neurodrone: the MDSes keep the other MDSes which share subtree boundaries informed
[20:40] <sagewk> monrad-51468: thanks
[20:41] <gregaf> but changing whether an MDS is auth for something or not is a cooperative process
[20:41] <neurodrone> Oh I see.
[20:41] <neurodrone> Okay, that does make sense of the things I have gone till now.
[20:41] <gregaf> somebody else needs to take over the auth if you're dropping it!
[20:42] <gregaf> and IIRC the information about migrations is stored in the MDSMap, so clients are kept aware through that if nothing else
[20:42] <neurodrone> gregaf: Pretty interesting project. I have just started going through it couple of days back. So, I guess it will take time to get familiar with most of the concepts.
[20:42] <monrad-51468> sagewk: will the slides be up somewhere?
[20:42] <neurodrone> gregaf: Oh and this MDSMap is the daemon which was mentioned earlier?
[20:43] <neurodrone> or is the data structure which the daemon uses to pass the appropriate MDS to the client to connect to?
[20:43] <gregaf> the MDSMap is a group of widely-shared data about the state of the MDS cluster
[20:43] <gregaf> it's maintained by the monitor daemons
[20:44] <gregaf> and is roughly analogous to the OSDMap
[20:44] <neurodrone> Ah, I get it.
[20:45] <sagewk> monrad-51468: yeah, hopefully posted today!
[20:46] <neurodrone> So, I assume it works like this: 1. Client connects to a particular MDS which is running a distributed MDSdaemon service. 2. It requests the inodes of the file it wants to work on.
[20:46] <neurodrone> 3. The MDSdaemon has access to the MDSMap and it checks it to see which MDS is responsible for that file. If it's a read operation all the relevant MDSs are provided, or else just the location of the authoritative MDS is given to the user.
[20:47] <neurodrone> Is this somewhat correct?
[20:47] <gregaf> err, not really
[20:47] <neurodrone> Oh and the popularity of those MDSs is provided along as well.
[20:47] <gregaf> the client connects to one of the monitor daemons and gets a bunch of cluster state, which includes the MDSMap
[20:47] <neurodrone> gregaf: Oh. What am I missing?
[20:48] <neurodrone> oh it connects to the monitor daemon itself and gets all the data.
[20:48] <neurodrone> That sounds a good strategy.
[20:48] <neurodrone> like a*
[20:48] <gregaf> the client then connects to the MDS which is auth for its mountpoint
[20:48] <neurodrone> and then using that information it decides which MDS to connect to I guess.
[20:48] <neurodrone> oh
[20:49] <gregaf> and then keeps local caches of the inodes & dentrys for the directories it's working in
[20:49] <wido> for debugging PG recovery, which debug level to use? debug osd and debug filestore cause a bit too much traffic towards my remote syslog machine
[20:49] <bchrisman> so it'd be theoretically possible in looking up '/cephfs/dir1/dir2/dir3/file' to have the client hit mds0 for 'cephfs', mds1 for 'dir1', mds0 for 'dir2', mds1 for 'dir3'? (Just trying to make sure I understand the subtree/fragment distribution mechanism)
[20:49] <neurodrone> okay, and does it need to connect to the auth MDS even for a read operation? like if it wants to know mtime or size or something?
[20:49] <wido> would debug osd = 20 be enough?
[20:49] <gregaf> neurodrone: yep!
[20:49] <gregaf> there's a very complicated set of capabilities to keep things in sync
[20:50] <gregaf> have to run to lunch though, later!
[20:50] <neurodrone> gregaf: the "embedded inodes" strategy. Nice.
[20:50] <neurodrone> Oh okay.
[20:50] <neurodrone> gregaf: Thanks a lot for the information!
[20:50] <neurodrone> gregaf: Appreciate it. :)
[20:50] <gregaf> this is all discussed reasonably well in Sage's thesis, which is available on the website
[20:50] <neurodrone> gregaf: Okay. I will check it up as well. :)
[20:54] * neurodrone (~neurodron@dhcp207-051.wireless.buffalo.edu) Quit (Quit: neurodrone)
[21:04] <bchrisman> gregaf: any one directory is wholly owned (auth'ed) by one MDS?
[21:06] <Ormod> Sent the hackety hack os x patch to the list in case someone wants to take it up
[21:36] <sagewk> bchrisman: normally, yes. the mds can also fragment a busy/large directory, but that's rarely necessary
[21:37] <sagewk> bchrisman: yes (about the path lookup)
[22:19] <Tv> autotest server going down for maintenance
[22:42] * verwilst (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[22:43] <Tv> holy crap
[22:43] <Tv> gregaf: i got an arm qemu for you
[22:43] <gregaf> oh, nice!
[22:43] <Tv> as soon as i realized it crashes if i don't explicitly specify how much ram to give it ;)
[22:43] <gregaf> lol
[22:45] <Tv> gregaf: grab flak:~tv/debian-6.0.0-armel-CD-1.iso
[22:45] <Tv> qemu-img create -f qcow2 hda.img 10G
[22:45] <Tv> wget http://ftp.us.debian.org/debian/dists/squeeze/main/installer-armel/current/images/versatile/netboot/vmlinuz-2.6.32-5-versatile http://ftp.us.debian.org/debian/dists/squeeze/main/installer-armel/current/images/versatile/netboot/initrd.gz
[22:45] <Tv> qemu-system-arm -M versatilepb -hda hda.img -cdrom debian-6.0.0-armel-CD-1.iso -kernel vmlinuz-2.6.32-5-versatile -initrd initrd.gz -append "root=/dev/ram" -m 256
[22:45] <Tv> Works For Me (tm)
[22:45] <gregaf> heh
[22:46] <gregaf> I haven't done anything with qemu, do we have a standard space to run it?
[22:46] <Tv> not really.. it kinda assumed fast local x server, for the sdl graphics
[22:46] <Tv> setting up serial consoles for it is a pain, you'll avoid a lot of trouble by having a local x
[22:46] <gregaf> heh
[22:47] <gregaf> so is it easy and I just need to apt-get some packages, or am I going to need to hack around in config files?
[22:47] <bchrisman> sagewk: cool
[22:48] <bchrisman> sagewk: lotta legacy apps out there which do horrible things with large numbers of files in a single directory...
[22:48] <gregaf> Tv?
[22:49] <sagewk> yep. the fragmentation stuff is almost ready, i just need to find some time to test my recovery/rejoin fixes
[22:50] <Tv> gregaf: checked, doing ... -append "root=/dev/ram console=ttyAMA0" -m 256 -nographic leads to some sort of a serial console
[22:50] <Tv> gregaf: on my ubuntu10.10, sudo apt-get install qemu-kvm-extras was enough
[22:50] <gregaf> okay, cool!
[22:51] <gregaf> thanks a bunch :)
[22:54] <Tv> gregaf: oh and.. good luck compiling ceph inside that thing.. give it more ram and a day or so ;)
[22:54] <gregaf> haha
[22:57] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:11] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[23:28] <cmccabe> tv: I guess -curses is also an option for qemu
[23:29] <cmccabe> tv: although the serial console is a better solution in the long term
[23:35] <darkfader> Tv: i spend today and yesterday and some more i can't remember on pxe booting PV xen domUs. i'm really gonna need another weekend by middle of the week. old xen is broken everywhere and you can't get a current one, kvm is a hyped desktop toy, and free vmware esxi turns off 99% of the useful api calls
[23:36] <darkfader> if you find anything that really works... tell me :(
[23:37] * lxo (~aoliva@ has joined #ceph
[23:42] <lxo> hey, guys. just started playing with ceph last night, and I think I just ran into one of the problems mentioned in the latest release notes (I didn't realize I was running older versions)
[23:43] <lxo> I'm very excited with it, but now I'm wondering about how to recover from that problem. I assume the data is lost, which is fine, but do I have to do anything to recover the disk space I suppose it still takes up?
[23:43] <sjust> Ixo: sorry, which problem?
[23:44] <lxo> oh, doh! the one about data loss when an osd is restarted during recovery
[23:44] <sjust> Ixo: ah
[23:46] <lxo> I was testing recovery on a two-machine “cluster” after a disk loss (I actually decided to rearrange a striped btrfs filesystem into multiple smaller ones), and then restarted ceph on the machine that still held all the data
[23:47] <sjust> Ixo: I think that that bug actually does cause the on-disk stuff to be deleted (if it's the one related to the erroneously marked backlog)
[23:49] <lxo> now, I get the impression that the disk usage is still higher than I expected after the data shrunk to about half the original size, but I may very well be mistaken. I'm now undecided between starting over or trying to re-create the one osd after recovery into the new osds complete
[23:50] <sjust> Ixo: it might actually be able to recover the data from the replica
[23:50] <lxo> one thing that puzzles me is that accesses to the root of the fs have been hanging since then. I wonder if this is normal during recovery (I hope not) or if ceph is waiting for e.g. the old osd (now obliterated) to come back up with the metadata of the root of the tree, or one of the subdirs thereof
[23:51] <sjust> Ixo: yeah, some of the metadata/data might be on pgs which are not yet active
[23:51] <sjust> Ixo: could you give me ceph -s output?
[23:51] <lxo> I suppose this is the line you're after 11.02.28 19:51:29.611561 pg v20322: 528 pgs: 110 active, 418 active+clean; 116 GB data, 1102 GB used, 2549 GB / 3661 GB avail; 100062/205040 degraded (48.801%)
[23:52] <lxo> before the restart, it held some 384k objects, with 221 GB
[23:53] <sjust> Ixo: could we get the mds line? (or the whole thing :) )
[23:53] <lxo> I thought I had to restart the other server after changing its config file to reflect the addition of new osds. maybe I didn't have to, and this would have saved me some headaches
[23:53] <lxo> 11.02.28 19:51:29.617238 mds e134: 1/1/1 up, 1 up:standby(laggy or crashed), 1 up:replay(laggy or crashed)
[23:53] <lxo> 11.02.28 19:51:29.647748 osd e279: 6 osds: 6 up, 6 in
[23:53] <lxo> 11.02.28 19:51:29.716455 log 11.02.28 18:52:37.165093 mon0 40 : [INF] mds0 up:replay
[23:53] <lxo> 11.02.28 19:51:29.717024 mon e1: 1 mons at
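[The degraded figure in the pg line lxo pasted above is just the object-replica ratio; a quick check against that exact line (the regex targets this log output, not a stable interface):]

```python
import re

# The pg status line from the ceph -s output pasted above:
line = ("pg v20322: 528 pgs: 110 active, 418 active+clean; "
        "116 GB data, 1102 GB used, 2549 GB / 3661 GB avail; "
        "100062/205040 degraded (48.801%)")

m = re.search(r"(\d+)/(\d+) degraded \(([\d.]+)%\)", line)
degraded, total, reported = int(m.group(1)), int(m.group(2)), float(m.group(3))

# The percentage is degraded object replicas over total object replicas:
computed = round(100.0 * degraded / total, 3)
print(computed)  # 48.801, matching what ceph -s reported
```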
[23:56] <sjust> Ixo: looks like the mds is down
[23:56] <sjust> Ixo: looking at that it appears that you have indeed lost a large number of objects
[23:57] <lxo> I had some mds crashes/disappearances, but not on this one machine. the other machine is running an even older version (pending upgrade to BLAG 140k-ish)
[23:58] <lxo> thanks for the confirmation and for the attention! this is so exciting! ceph seems to have come out of my dreams ;-)
[23:59] <lxo> I'll try to figure out whether data was indeed removed, or start over, or something ;-) thanks again!

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.