#ceph IRC Log


IRC Log for 2011-08-02

Timestamps are in GMT/BST.

[0:08] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[0:39] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[1:22] * gregaf (~Adium@ip-64-111-111-107.dreamhost.com) Quit (Quit: Leaving.)
[1:26] * gregaf (~Adium@ip-64-111-111-107.dreamhost.com) has joined #ceph
[1:44] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:02] * huangjun (~root@ has joined #ceph
[2:11] * joshd (~joshd@ip-64-111-111-107.dreamhost.com) Quit (Quit: Leaving.)
[2:33] * cmccabe (~cmccabe@ has left #ceph
[2:41] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) Quit (Remote host closed the connection)
[2:43] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:55] * morse_ (~morse@supercomputing.univpm.it) has joined #ceph
[2:58] * morse (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[3:25] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[3:43] <huangjun> i use ext4, but cannot mkfs even when i use "mount -o remount,user_xattr /"
[3:43] <huangjun> it reports "Extended attributes don't appear to work,Got error error 95: Operation not supported. If you are using ext3 or ext4, be sure to mount the underlying file system with the 'user_xattr' option."
[3:44] <huangjun> so what should i do to resolve this?
[3:46] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[4:11] * lx0 (~aoliva@83TAACN97.tor-irc.dnsbl.oftc.net) Quit (Read error: Connection reset by peer)
[4:12] * lx0 (~aoliva@1GLAACZUE.tor-irc.dnsbl.oftc.net) has joined #ceph
[4:46] * gregpad (~rooms@ has joined #ceph
[4:48] <gregpad> huangjun: not sure you can remount your root drive?
[4:49] <gregpad> You might try adding it to your /etc/fstab instead
[4:50] * jmlowe (~Adium@mobile-166-137-141-223.mycingular.net) has joined #ceph
[4:50] * jmlowe (~Adium@mobile-166-137-141-223.mycingular.net) has left #ceph
[5:00] * gregpad (~rooms@ Quit (Quit: Rooms - iPhone IRC Client - http://www.roomsapp.mobi)
[5:01] <huangjun> uhmm
[5:01] <huangjun> thanks, it works
[5:01] <huangjun> it works now
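The error huangjun hit (error 95, "Operation not supported") can be reproduced outside of mkfs by trying to set a user.* attribute on the filesystem in question. Below is a minimal sketch in Python, assuming Linux and the standard os.setxattr call; the function name and the probe attribute name are hypothetical, and this is not how Ceph itself performs the check.

```python
import errno
import os
import tempfile

# errno values meaning "this filesystem does not support the operation";
# on Linux both names map to 95, the number in the error message above.
UNSUPPORTED = {errno.ENOTSUP, errno.EOPNOTSUPP}


def user_xattr_supported(directory):
    """Return True if a user.* xattr can be set on a file in `directory`."""
    if not hasattr(os, "setxattr"):  # os.setxattr is only available on Linux
        return False
    with tempfile.NamedTemporaryFile(dir=directory) as probe:
        try:
            # "user.xattr_probe" is a made-up attribute name for this test.
            os.setxattr(probe.name, "user.xattr_probe", b"1")
        except OSError as e:
            if e.errno in UNSUPPORTED:
                return False
            raise  # anything else (e.g. permissions) is a different problem
        return True
```

Run against the directory that will hold the OSD data, a False result suggests the filesystem is still mounted without user_xattr, which would match gregpad's advice to set the option in /etc/fstab.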
[5:03] <huangjun> but there is another problem
[5:04] <huangjun> we added 10 OSDs to the cluster, and now we see crashed PGs
[5:04] <huangjun> some OSDs were marked out
[5:08] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:08] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has left #ceph
[5:19] * jiaju (~jjzhang@ has joined #ceph
[5:19] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[5:20] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit ()
[6:19] * lx0 (~aoliva@1GLAACZUE.tor-irc.dnsbl.oftc.net) Quit (Quit: later)
[6:36] * lxo (~aoliva@09GAAFUYQ.tor-irc.dnsbl.oftc.net) has joined #ceph
[7:11] * lxo (~aoliva@09GAAFUYQ.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[7:12] * lxo (~aoliva@9KCAAA790.tor-irc.dnsbl.oftc.net) has joined #ceph
[7:45] * lxo (~aoliva@9KCAAA790.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[7:46] * lxo (~aoliva@09GAAFUZ2.tor-irc.dnsbl.oftc.net) has joined #ceph
[8:26] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[9:19] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) has joined #ceph
[10:51] * jiaju (~jjzhang@ Quit (Remote host closed the connection)
[10:58] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:18] * Meths_ (rift@ has joined #ceph
[11:23] * Meths (rift@ Quit (Read error: Operation timed out)
[11:28] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[12:16] * jiaju (~jjzhang@ has joined #ceph
[12:49] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:51] * lxo (~aoliva@09GAAFUZ2.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[12:51] * lxo (~aoliva@09GAAFU4P.tor-irc.dnsbl.oftc.net) has joined #ceph
[13:05] * lxo (~aoliva@09GAAFU4P.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[13:06] * lxo (~aoliva@1GLAAC0A0.tor-irc.dnsbl.oftc.net) has joined #ceph
[13:55] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) has joined #ceph
[14:33] * kilburn (~kilburnsc@ has joined #ceph
[14:33] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[14:33] <kilburn> hi all
[14:34] <kilburn> I just wanted to tell you guys that the RBD link on the website's front page is wrong
[14:34] <kilburn> (it points to the rados wiki pages instead of the RBD wiki page)
[14:58] <huangjun> it's ok, i just tried
[15:07] <kilburn> from the source code:
[15:07] <kilburn> The <a href="http://ceph.newdream.net/wiki/RADOS_Gateway">RBD</a> driver provides a shared network block device via a Linux kernel block device driver (2.6.37+) or a <a href="http://ceph.newdream.net/wiki/Kvm-rbd">Qemu/KVM storage driver</a> based on librados. In contrasts to alternatives like iSCSI or AoE, RBD images are striped and replicated across the Ceph object storage cluster, providing reliable, scalable, and thinly provisioned access to block sto
[15:08] <kilburn> see the first link? it is wrong, isn't it?
[15:45] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[16:03] <lxo> looking at the code, I get the impression that all scrubbing does is to check consistency of PG meta-information, not file contents/readability/etc, right?
[16:04] <lxo> I ask because I had an incident with one of the osds, and recovering the filesystem in a hacky way left many of the btrfs data checksums inconsistent, and thus btrfs read garbage or errored out
[16:06] <lxo> is there anything in ceph that will fix this on its own (say, testing file checksums among replicas and propagating matching entries to other replicas/primary, or voting for the correct version, which would work with 3 replicas)
[16:06] <lxo> I ended up dropping the osd entirely and letting it rebuild from scratch, but that sounded wasteful
[16:08] <lxo> considering I have everything 3-plicated on 3 osds, I considered rsyncing a snapshot of one osd's current onto another, like the process of initializing monitors, but I ended up not doing that. would it have worked? (copying xattrs too)
[16:18] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[16:33] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[17:10] * jiaju (~jjzhang@ Quit (Quit: ??????)
[17:51] * greglap (~Adium@ has joined #ceph
[17:56] * morse_ (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:03] * huangjun (~root@ Quit (Quit: leaving)
[18:05] <greglap> lxo: right now scrubbing just looks at metadata
[18:05] <greglap> deeper scrubs looking at checksums or whatever are something we plan to do at some point but there are other things to deal with first
[18:06] <greglap> rsyncing another OSD's current directory wouldn't have worked though, since they all store different PGs; you just would have ended up making the OSD go through and delete all the PGs it wasn't supposed to have and then resync the ones it needed anyway
[18:14] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:27] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:30] <yehudasa> kilburn: yeah, thanks
[18:35] <greglap> hmmm, has something changed on the sepia machines or in teuthology lately?
[18:35] <Tv> not by me
[18:35] <greglap> some machines that I accidentally had locked overnight have syslog entries in /tmp/cephtest that the ubuntu user doesn't have perms to delete
[18:36] <greglap> I can sudo but that's likely to be a problem for teuthology, unless the way I busted it caused them somehow
[18:36] <Tv> "syslog entries"?
[18:37] <greglap> ubuntu@sepia57:~$ ls -lha /tmp/cephtest/
[18:37] <greglap> total 12K
[18:37] <greglap> drwxr-xr-x 3 syslog syslog 4.0K 2011-08-02 06:25 .
[18:37] <greglap> drwxrwxrwt 3 root root 4.0K 2011-08-02 09:25 ..
[18:37] <greglap> drwxr-xr-x 3 syslog syslog 4.0K 2011-08-02 06:25 archive
[18:37] <greglap> ubuntu@sepia57:~$ ls -lha /tmp/cephtest/archive/syslog/misc.log
[18:37] <greglap> -rw-r----- 1 syslog syslog 0 2011-08-02 06:25 /tmp/cephtest/archive/syslog/misc.log
[18:37] <greglap> ubuntu@sepia57:~$ rm -rf /tmp/cephtest/
[18:37] <greglap> rm: cannot remove `/tmp/cephtest/archive/syslog/misc.log': Permission denied
[18:37] <Tv> oh yeah
[18:37] <Tv> syslog really wants to own those files
[18:37] <Tv> the relevant task has sudo rm -rf in its cleanup
[18:37] <greglap> oh, okay
[18:38] <greglap> as long as cleanup handles it properly
[18:38] <Tv> and confirmed that nuke has sudo too
[18:39] <yehudasa> kilburn: fixed
[18:40] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:54] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:59] * cmccabe (~cmccabe@ has joined #ceph
[19:00] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[20:06] <sagewk> tv, gregaf: http://fpaste.org/HICI/
[20:07] <Tv> lots of 8MB allocations
[20:08] <sagewk> i wonder if that's what posix_memalign() is doing
[20:08] <Tv> so that's 8MB anon mmaps with 4k guard pages in the middle
[20:09] <gregaf> hmm, I wonder if maybe the threads aren't getting cleaned up
[20:09] <yehudasa> is it heap?
[20:09] <yehudasa> 8MB is usually thread size
[20:10] <Tv> thread heaps would be believable
[20:10] <gregaf> it spawns a lot of MDLog::replay_threads, maybe those aren't getting dealt with properly
[20:10] <gregaf> and since they're dead they would be really easy to swap out, although I'm surprised it happens before there's any memory pressure
[20:10] <yehudasa> Tv: that would be thread stack
[20:10] <Tv> oh yeah duh
[20:11] <Tv> there's a 6MB area flagged "[heap]", dunno if c++ stdlib really puts most of heap elsewhere
[20:11] <sagewk> they look like this in smaps
[20:11] <sagewk> http://fpaste.org/bJZz/
[20:11] <gregaf> bad paste
[20:11] <gregaf> Error 500: Sorry, you broke our server. You might have reached the 512KiB limit! http://fpaste.org/.
[20:11] <sagewk> http://fpaste.org/6oD5/
[20:12] <sagewk> only 13 threads in task/ tho, so presumably they're joining
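The pattern described above (many 8MB anonymous mappings separated by 4k guard pages) can be tallied from the mapping lines of /proc/<pid>/maps or smaps. Below is a minimal sketch in Python, assuming the usual "start-end perms offset dev inode [path]" line layout; the function name and the sample addresses are made up for illustration, and the pastes from the conversation are not reproduced here.

```python
import re
from collections import Counter

# Matches a mapping line: "start-end perms offset dev inode [path]".
# smaps detail lines such as "Size: 8192 kB" do not match and are skipped.
MAP_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+[0-9a-f]+\s+\S+\s+(\d+)\s*(.*)$"
)


def anon_mapping_sizes(maps_text):
    """Count anonymous mappings (inode 0, no path) by size in bytes."""
    sizes = Counter()
    for line in maps_text.splitlines():
        m = MAP_LINE.match(line)
        if not m:
            continue
        start, end, _perms, inode, path = m.groups()
        if inode == "0" and not path.strip():
            sizes[int(end, 16) - int(start, 16)] += 1
    return sizes


# Hypothetical sample: two 8 MiB regions with a 4 KiB guard page between
# them, plus a file-backed mapping and [heap], which are both ignored.
SAMPLE = """\
7f0000000000-7f0000800000 rw-p 00000000 00:00 0
7f0000800000-7f0000801000 ---p 00000000 00:00 0
7f0000801000-7f0001001000 rw-p 00000000 00:00 0
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/cmds
01a00000-02000000 rw-p 00000000 00:00 0 [heap]
"""

counts = anon_mapping_sizes(SAMPLE)
```

If the number of 8 MiB entries is far larger than the number of entries under /proc/<pid>/task, that is consistent with stacks of exited but never joined threads still being mapped, which is the theory being discussed here.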
[20:13] <yehudasa> sagewk: your latest paste is probably not what you wanted to paste?
[20:13] <sagewk> :)
[20:13] <sagewk> http://fpaste.org/YrXb/
[20:14] <gregaf> yeah, I don't see it ever calling join on the replay_thread
[20:14] <gregaf> that's got to be it
[20:16] <sagewk> yeah
[20:16] <cmccabe> if it's still alive, you can check tasks
[20:16] <gregaf> I'll just make a context that calls join and goes into waitfor_replay
[20:16] <sagewk> did, only 13
[20:16] <sagewk> gregaf: testing that now
[20:16] <gregaf> oh, okay
[20:17] <cmccabe> it is pretty important to call join on all threads you start, unless they're detached
[20:17] <gregaf> oh, I missed that you only saw 13 of them
[20:17] <cmccabe> this could also explain the thread creation failure we saw earlier
[20:17] <cmccabe> I'm not sure whether zombies show up there?
[20:17] <sagewk> sadly no, that was on the osd
[20:17] <cmccabe> oh
[20:18] <gregaf> but it might be that it's only cleaning up part of the machinery, since the thread is done but the caller needs to clean up at least part of the memory (for handling return values)
[20:20] <cmccabe> in other words, the kernel thread may not exist, but the userspace data and the memory mappings very well might.
[20:20] <cmccabe> in fact that's almost certainly the case since calling exit() from the thread will kill it from the kernel's point of view, but not from pthreads' point of view.
[20:21] <Tv> you mean pthread_exit?
[20:22] <Tv> yeah the return value (a void*) hangs around, is returned by pthread_join
[20:22] <Tv> the join needs to do type-specific freeing
[20:23] <sagewk> tv: pthreads probably does it in user memory tho, not using the kernels process table
[20:24] <Tv> sagewk: err not sure how to respond to that
[20:24] <sagewk> no need
[20:24] <Tv> current thinking in this room: the thread has exited, the pthread_exit(foo) memory is hanging around, waiting for a pthread_join(...) call and freeing of the foo
[20:25] <gregaf> in this case there isn't actually any user data foo
[20:25] <gregaf> but the infrastructure still needs to deal with it
[20:25] <yehudasa> a quick fix would be to create the thread detached
[20:25] <gregaf> detached?
[20:25] <yehudasa> but we probably want to actually see the return value
[20:26] <gregaf> is that for this scenario where you don't care about return data?
[20:26] <Tv> or whether the thread stays alive, etc
[20:26] <yehudasa> yeah
[20:26] <Tv> or when it completes
[20:26] <gregaf> sagewk: so how's it look, did adding a join context take care of it?
[20:26] <sagewk> it's trickier, because the contexts are run in the child thread and you can't join from there.
[20:26] <sagewk> restructuring the thread spawning/joining stuff entirely
[20:27] <cmccabe> tv: about your earlier question
[20:27] <cmccabe> tv: exit is a system call that actually only terminates the current thread
[20:28] <cmccabe> tv: confusingly, exit(3) in libc actually calls exit_group to terminate all threads
[20:28] <yehudasa> cmccabe: not really
[20:28] <cmccabe> http://linux.die.net/man/2/exit_group
[20:28] <yehudasa> oh, ok
[20:28] <Tv> exit is a libc call, but yeah
[20:28] <cmccabe> well, it's both a libc call and a syscall
[20:32] <sagewk> yeah i think pthread_detach is actually what we want
[20:32] <sagewk> otherwise it's just extra work to call the join
[20:32] <gregaf> yeah
[20:35] <Tv> one thing that would be nice for ceph at some point, is visibility into the threads.. e.g. cassandra's SEDA-style architecture revolves around thread pools, and you get decent stats on how many threads are busy doing what, etc: http://www.datastax.com/docs/0.8/operations/monitoring
[20:36] <Tv> with their arch, each thread pool has queue of incoming work, you can monitor those queues and figure out e.g. IO vs CPU balance issues just by seeing where the bottleneck is
[20:36] <Tv> (i'm not a fan of SEDA, but i liked the monitoring)
[20:36] <sagewk> exposing the workqueue threadpool framework via the admin socket could get us something like that
[20:37] <Tv> so what i'm saying, detached threads are kinda the opposite of that ;)
[20:38] <sagewk> yep. replay isn't anything like a workqueue, tho.. it's a linear process that we don't want to starve processing on the message dispatch thread
[20:38] <sagewk> unless we dream up visibility into threads themselves, but i'm not sure what we could say about them at that low a level that /proc/pid/task can't
[20:39] <Tv> if there's stages of execution, they can report that
[20:39] <Tv> but yeah this gets too vague quick
[20:39] <Tv> just saying, boo on random unique special threads doing special things, yay on manageable, monitorable, throttleable pools of workers
[20:40] <sagewk> yeah
[20:41] * aliguori (~anthony@ has joined #ceph
[20:43] <sagewk> yay, detach did the trick.
[20:43] <gregaf> great success!
[20:43] <gregaf> are there other areas we need to check up on that kind of thing?
[20:46] <sagewk> not that i can think of. this is the only regularly recreated thread.. everything else is either a workqueue or a long-running thread, and those tend to get joined carefully.
[20:46] <gregaf> yeah, I was mostly just thinking about startup threads that might get leaked which we haven't noticed
[20:47] <sagewk> you can git grep Thread and audit...
[21:23] * kemo (~kemo@c-68-54-224-104.hsd1.tn.comcast.net) has joined #ceph
[21:28] <kemo> Hmm...anyone able to say the actual hardware needs of a Ceph monitor? Can I get away with a series of Atom servers, or do I still need to have some oomph under the hood?
[21:29] <kemo> I imagine that with workload and larger data pools, the monitors are put under a bit of extra use, but wouldn't most of the actual work be done by the cmds?
[21:31] <Tv> kemo: ceph monitors should be fairly lightweight
[21:32] <Tv> i don't recall hearing bad feedback about using atoms for cmon
[21:33] <Tv> (we did get some bad feedback on atoms for osd; currently, osd recovery and such non-healthy states consume ~1GHz of a decent core)
[21:33] <Tv> most of the performance tuning work is still ahead of us, at this moment
[21:40] <gregaf> kemo, Tv: yeah, ceph monitors don't use any CPU at all
[21:40] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[21:46] <kemo> Nice, thanks for the feedback fellas...and tell me if this sounds crazy/not possible/unstable/etc...Ceph + Xen + RAID 1 as the general arch of my cluster? (I'm wanting to set up my VPS cluster and am looking into storage solutions, and still want a hardware layer of redundancy)
[21:47] <kemo> Is the RAID 1 with Ceph overkill? Ceph + Xen != bueno?
[21:48] <kemo> Also, anyone remember what that set of mIRC scripts was called...had a lot more slaps...themes...etc...been a while since I've joined a channel...and my Google Fu is failing me today =[
[21:49] <Tv> kemo: any vm mechanism slows down IO pretty significantly
[21:49] <Tv> kemo: whether that's too much, depends on your usage
[21:49] * Meths_ is now known as Meths
[21:49] <gregaf> you might want to look at RBD instead of using full Ceph for VM image storage
[21:50] <Tv> kemo: most big storage setups definitely avoid raid1, because that wastes a lot of space
[21:50] <Tv> kemo: oh wait are you storing vms on ceph, or running ceph daemons in vms?
[21:51] <kemo> I'm wanting to store my VM images on Ceph
[21:52] <darkfader> gregaf: is there a xen blkback for rbd now?
[21:52] <darkfader> i'd have to spend the night testing it right now if there is
[21:53] <Tv> darkfader: more likely just the kernel block device export thing
[21:53] <darkfader> ahh
[21:53] <darkfader> sorry
[21:53] <Tv> those two need separate names
[21:53] <Tv> rbd-the-mechanism-and-protocol vs rbd-block-devices-provided-by-kernel
[21:53] <darkfader> RBD and /dev/rbd?
[21:53] <darkfader> would do
[21:55] <gregaf> yeah, no, not anything for Xen right now
[21:55] <gregaf> but whatever you could hack up would have a lot less code/be more stable than POSIX Ceph
[21:56] <darkfader> Tv: vm overhead for IO is really a strange thing... i have one system that reads ~255MB/s from raid10 in dom0 and ~250 in domU, and another one with SSD that is around ~250 in dom0 and ~180 in domU. same kernels and roughly same cpu power
[21:56] <darkfader> the only solution was not to care :)
[21:57] <Tv> darkfader: i've been out of the xen circles long enough that i don't know what's up with domU's, but yeah it was pretty darn sad, and if you got the versions of everything just right you got a decent fraction of the real hardware's IO throughput; KVM has been better for me
[21:58] <darkfader> Tv: orionvm has done some crazy tuning, they get over 400MB/s write in domU
[21:59] <darkfader> Tv: i fully understand if people just use kvm and be happy
[22:00] <darkfader> xen has too many regression issues and important (speed...) patches that are just lost in the next version or the one after it
[22:00] <Tv> more like, kvm works by default as well as xen ever worked for me after fiddling
[22:00] <Tv> upstream integration ftw
[22:02] <darkfader> i'd guess kvm will be a lot worse for me with my setups as they're too far from "normal libvirt desktop"
[22:03] <darkfader> and yes envy for the kernel integration of kvm, now even montavista carrier grade linux comes with kvm
[22:04] <darkfader> Tv: i'm just waiting for someone to tell me that kvm schedules as fast as xen...
[22:04] <darkfader> for 40-80 VMs
[22:05] <Tv> that's pushing it quite high
[22:05] <darkfader> setup(xen) or config(kvm) messiness is solvable
[22:06] <darkfader> Tv: nobrainer on xen; in theory it should be just the same on kvm, but i think there's just need for some more polishing and it will be there
[22:08] <darkfader> kemo: http://ceph.newdream.net/wiki/Xen-rbd this is what gregaf meant (i guess)
[22:08] <darkfader> it's not a real blkdev backend but it will work :)
[22:08] <cmccabe> darkfader: so when you have multiple kvm VMs open at the same time, they're all sharing /dev/kvm?
[22:09] <cmccabe> darkfader: I'm just trying to figure out your question about scheduling. It makes sense for Xen because Xen is a hypervisor. But for kvm, your scheduling should be determined by the host kernel right?
[22:09] <darkfader> cmccabe: can't even answer that - i use kvm on my laptop sometimes but don't know how they schedule
[22:10] <darkfader> cmccabe: i don't use HVM domUs on Xen either because it's a little slower...
[22:11] <darkfader> so short answer - yes the linux kernel scheduler schedules the vm's in kvm
[22:11] <cmccabe> one of these days I have to read about how that all works
[22:13] <darkfader> hehe. let me know
[22:13] <darkfader> i'd figure it's between the kvm kernel module and the scheduler only
[22:20] <kemo> darkfader: Thanks! Looks like a workable solution
[22:45] <gregaf> sagewk: sjust: right now the OSDCaps give the "owner" auid full perms if it doesn't have any set
[22:45] <gregaf> the check for this currently skips if the uid is the default(-1), and I think that behavior is just a mistaken attempt to provide higher security
[22:45] <gregaf> anybody foresee badness if I remove it?
[22:46] <sjust> nothing here
[22:47] <sagewk> you mean if owner == -1 and you are -1, it doesn't give full perms, but should?
[22:48] <gregaf> yeah
[22:49] <gregaf> I think the thinking was that anonymous users (that's basically what -1 is) shouldn't get automatic perms to anything
[22:49] <gregaf> but only pools that are created by anonymous users are going to have that set as the owner
[22:49] <sagewk> if -1 for you means nobody/don't know, and -1 for pool means no owner, then that seems right? anon users should have full access to pools without an owner set
[22:50] <sagewk> i think pools you create with 'rados mkpool foo' have -1 set as the owner?
[22:50] <gregaf> if you're authenticated then it should have your auid set, I think
[22:50] <gregaf> otherwise it creates that way, yeah
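The owner check gregaf and sagewk are discussing can be modelled in a few lines. This is only a sketch of the logic as described in the conversation, written in Python rather than taken from the Ceph source; the function and parameter names are hypothetical, and -1 stands for the default (anonymous) auid.

```python
DEFAULT_AUID = -1  # the "anonymous / not set" auid mentioned above


def owner_gets_full_perms(client_auid, pool_auid, has_explicit_caps,
                          skip_default_auid):
    """Model of the implicit 'pool owner gets full perms' rule.

    skip_default_auid=True models the behavior described as current:
    the owner shortcut is skipped when the client's auid is the default.
    skip_default_auid=False models the proposed change, with that
    special case removed.
    """
    if has_explicit_caps:
        return False  # explicit caps apply; no implicit owner grant
    if skip_default_auid and client_auid == DEFAULT_AUID:
        return False  # anonymous clients never get the owner shortcut
    return client_auid == pool_auid


# Anonymous client, pool with no owner set (e.g. created anonymously):
current = owner_gets_full_perms(-1, -1, False, skip_default_auid=True)
proposed = owner_gets_full_perms(-1, -1, False, skip_default_auid=False)
```

Under this model the only case that changes is an anonymous client accessing a pool whose owner is also -1; an anonymous client still gets no implicit access to a pool owned by a real auid.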

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.