#ceph IRC Log


IRC Log for 2012-02-06

Timestamps are in GMT/BST.

[20:20] <fghaas> at the risk of asking a stupid question: given a specific object listed in "rados -p <pool> ls", how would I find out which PG and OSD it belongs to?
[20:41] <yehudasa> fghaas: not stupid at all
[20:41] <yehudasa> fghaas: I usually do 'rados -p <pool> stat <obj> --debug-ms=1 --log-to-stderr'
[20:42] <yehudasa> fghaas: and then I see where the stat command went to
[20:44] <fghaas> oh wow, so there's no real way to query the object store for that information, other than sniffing the logs?
[20:46] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:53] <yehudasa> fghaas: not that I'm aware of, but I might be wrong
[20:58] <fghaas> hm. color me surprised :)
[21:00] <sjust> fghaas: using osdmaptool you can actually get the pg in pool 0 to which an object would map
[21:00] <sjust> ./ceph osd getmap -o <mapfile>
[21:00] <sjust> osdmaptool --test_map_object <object_name> <mapfile>
[21:00] <sjust> unfortunately, the tool does not currently allow you to specify the pool
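For context on the first half of what osdmaptool computes: Ceph hashes the object name and folds that hash into the pool's pg_num buckets with a "stable mod"; the second half, from PG to OSDs, is done by CRUSH and is not shown here. The Python sketch below only illustrates that folding step under stated assumptions: it uses zlib.crc32 as a stand-in for Ceph's own rjenkins hash, so its PG ids will NOT match real osdmaptool output, and pg_mask, stable_mod and object_to_pg are hypothetical names.

```python
import zlib


def pg_mask(pg_num: int) -> int:
    """Smallest 2**k - 1 such that 2**k >= pg_num."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x: int, b: int, bmask: int) -> int:
    """Fold hash x into b buckets; growing b by one splits only one bucket."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def object_to_pg(pool_id: int, name: str, pg_num: int) -> str:
    # STAND-IN hash: real Ceph uses rjenkins, so these ids differ from osdmaptool's
    seed = zlib.crc32(name.encode("utf-8"))
    ps = stable_mod(seed, pg_num, pg_mask(pg_num))
    return f"{pool_id}.{ps:x}"
```

The point of the stable mod is that when pg_num is not a power of two, raising it moves only the objects of one existing PG instead of reshuffling every object.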
[21:02] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: No route to host)
[21:06] * fghaas loves undocumented options :)
[21:10] <fghaas> sjust: ok, so for output like "object 'rb.0.1.0000000000ff' -> 0.cd97 -> [0,2]" that means the object is in PG 0.cd97, for which replicas exist in OSDs 0 and 2?
[21:11] * lollercaust (~paper@85.Red-83-41-151.dynamicIP.rima-tde.net) has joined #ceph
[21:12] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:12] * lollercaust (~paper@85.Red-83-41-151.dynamicIP.rima-tde.net) Quit (Max SendQ exceeded)
[21:15] * lollercaust (~paper@85.Red-83-41-151.dynamicIP.rima-tde.net) has joined #ceph
[21:29] * eightyeight (~88@pthree.org) has joined #ceph
[21:29] <eightyeight> i need help troubleshooting mounting
[21:29] <eightyeight> /dev/md1 is a linux software raid 5, formatted as btrfs with default options, and mounted to /mnt with rw
[21:30] <eightyeight> of course, /etc/ceph/ceph.conf is updated with /dev/md1 as the 'btrfs devs'
[21:30] <eightyeight> yet, on the client, 'mount.ceph /mnt/ceph' fails with:
[21:30] <eightyeight> mount error 22 = Invalid argument
[21:30] <eightyeight> there is nothing in dmesg(1) on the storage server
[21:31] <eightyeight> /var/log/ceph/mon.foo.log doesn't give me any worthwhile information
[21:32] <eightyeight> i'm running 0.41 from git, following the instructions on building from source at http://ceph.newdream.net/wiki/Debian
[21:32] <eightyeight> this is, however, an ubuntu server
[21:32] <eightyeight> both the client and the storage servers are running kernel version 2.6.38-11-generic
[21:33] <eightyeight> any help would be greatly appreciated. i need to have a working proof-of-concept by the end of the week, or sooner
[21:36] <nhm> eightyeight: this might be helpful: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/2403
[21:39] <sjust> fghaas: yeah, though the [0,2] part will change if the pg is remapped
[21:40] <fghaas> sjust: yeah, that's understood, but for the current state of the OSDs my reasoning is correct?
[21:41] <sjust> fghaas: yep
[21:41] <fghaas> gotcha
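To script the check fghaas describes, the --test_map_object output line can be split into object, PG and OSD list. A minimal Python sketch, assuming the output has exactly the shape quoted above; parse_test_map_object is a hypothetical helper. As sjust notes, the OSD list reflects only the current map and changes if the PG is remapped.

```python
import re

# Assumed shape, as quoted in the log:
#   object 'rb.0.1.0000000000ff' -> 0.cd97 -> [0,2]
_MAP_LINE = re.compile(
    r"^object '(?P<obj>[^']+)' -> (?P<pg>\d+\.[0-9a-f]+) -> \[(?P<osds>[0-9,]*)\]$"
)


def parse_test_map_object(line: str) -> dict:
    """Split one osdmaptool --test_map_object line into its parts."""
    m = _MAP_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unexpected osdmaptool output: {line!r}")
    pool, seed = m.group("pg").split(".")
    return {
        "object": m.group("obj"),
        "pg": m.group("pg"),
        "pool": int(pool),        # pool id, before the dot
        "seed": int(seed, 16),    # hex part after the dot
        # OSDs currently holding copies of this PG
        "osds": [int(o) for o in m.group("osds").split(",") if o],
    }
```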
[21:41] <eightyeight> yeah. i've already seen this post. is the "secret=" found in /etc/ceph/keyring on the storage server (in my case)?
[21:41] <eightyeight> and, should it be copied over to the client?
[21:42] <eightyeight> also, i don't have the "cauthtool" installed. what package provides this?
[21:42] <fghaas> eightyeight: ceph-authtool
[21:42] <fghaas> binary names changed a while back
[21:43] <eightyeight> is that provided by the ceph.git on github, or separately?
[21:43] <eightyeight> ah
[21:43] <eightyeight> got it
[21:44] <eightyeight> # mount.ceph /mnt -v -o name=admin,secret=AQDyhS1P6AwfBRAAzFrSOw6wJcraJsccUrJeHA==
[21:45] <eightyeight> parsing options: name=admin,secret=AQDyhS1P6AwfBRAAzFrSOw6wJcraJsccUrJeHA==
[21:45] <eightyeight> ... and just sitting there
[21:45] <eightyeight> timing out?
[21:46] <joshd> eightyeight: check dmesg on the client
[21:46] <eightyeight> mount error 5 = Input/output error
[21:47] <eightyeight> [357220.786986] libceph: client0 fsid c263607f-5cb3-45d6-93e4-58da0b59fd9f
[21:47] <eightyeight> [357220.789230] libceph: mon0 session established
[21:47] <eightyeight> that's it
[21:48] <joshd> eightyeight: what about the mon0 logs?
[21:48] <eightyeight> tcpdump(8) shows some syn/ack packets...
[21:48] * eightyeight checks
[21:48] <eightyeight> on the client, or the storage server?
[21:48] <joshd> on the monitor server
[21:48] <eightyeight> ok
[21:48] <joshd> usually in /var/log/ceph/mon.0.log
[21:49] <eightyeight> 2012-02-06 13:47:25.307834 7f65a5842700 -- >> pipe(0x1089000 sd=10 pgs=0 cs=0 l=0).accept peer addr is really (socket is
[21:49] <eightyeight> is the latest
[21:49] <eightyeight> ( being the monitor server, being the client)
[21:50] <joshd> can you pastebin the output of 'ceph -s'?
[21:50] <eightyeight> on the monitor server?
[21:50] <joshd> anywhere that can connect to the monitors
[21:51] <eightyeight> ok
[21:52] <eightyeight> joshd: http://ae7.st/p/64
[21:53] <fghaas> no OSDs?
[21:54] <eightyeight> the 'storage server' = 'kvmsan1' is an osd. it's the only one atm
[21:54] <eightyeight> more will come soon, once i can get this working
[21:54] <joshd> eightyeight: that output shows no osds running though - that's the problem
[21:54] <eightyeight> i can pastebin the /etc/ceph/ceph.conf if needed
[21:54] <joshd> 2012-02-06 13:50:54.369356 osd e1: 0 osds: 0 up, 0 in  
[21:55] <eightyeight> oh. i see
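The tell-tale line here is the osdmap summary from 'ceph -s'. A small Python sketch that pulls those counts out so a setup script can stop early when no OSD is up; it assumes the 'osd eN: X osds: Y up, Z in' wording shown above, and osd_summary and check_osds_up are hypothetical helpers.

```python
import re

# Assumed wording, as in the 'ceph -s' output quoted above:
#   osd e1: 0 osds: 0 up, 0 in
_OSD_SUMMARY = re.compile(
    r"osd e(?P<epoch>\d+): (?P<total>\d+) osds: (?P<up>\d+) up, (?P<osds_in>\d+) in"
)


def osd_summary(status_text: str) -> dict:
    """Return the osdmap epoch and OSD counts from 'ceph -s' output."""
    m = _OSD_SUMMARY.search(status_text)
    if m is None:
        raise ValueError("no osdmap summary line found")
    return {name: int(value) for name, value in m.groupdict().items()}


def check_osds_up(status_text: str) -> None:
    """Raise if no OSD is up: with none, clients have nowhere to store data."""
    counts = osd_summary(status_text)
    if counts["up"] == 0:
        raise RuntimeError(f"no OSDs up (osdmap epoch {counts['epoch']})")
```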
[21:55] * fronlius_ (~fronlius@f054100059.adsl.alicedsl.de) has joined #ceph
[21:56] <eightyeight> root@kvmsan1:~# /usr/bin/ceph-osd -i 0 -c /etc//ceph/ceph.conf
[21:56] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[21:56] <eightyeight> doesn't appear to actually launch a pid, according to ps(1)
[21:56] * fronlius (~fronlius@f054100059.adsl.alicedsl.de) Quit (Read error: Connection reset by peer)
[21:56] * fronlius_ is now known as fronlius
[21:56] <eightyeight> with the init script:
[21:57] <eightyeight> # /etc/init.d/ceph -a start
[21:57] <eightyeight> same issue
[21:57] <joshd> eightyeight: can you pastebin the osd log?
[21:57] <eightyeight> ok
[21:57] * fronlius (~fronlius@f054100059.adsl.alicedsl.de) Quit (Read error: Connection reset by peer)
[21:58] <eightyeight> http://ae7.st/p/8e
[21:58] * fronlius (~fronlius@f054100059.adsl.alicedsl.de) has joined #ceph
[22:00] <fghaas> ouch :)
[22:01] <joshd> yeah, that's not a good assert to hit
[22:01] <eightyeight> bug?
[22:02] <joshd> looks like it - not sure why we haven't seen this elsewhere
[22:02] <eightyeight> crap
[22:02] <eightyeight> possible to get it fixed today? :D :D
[22:02] <joshd> if you just want to get your system running, you can re-run mkcephfs
[22:04] <eightyeight> how would that change what i'm facing?
[22:05] <eightyeight> # mkcephfs -c /etc/ceph/ceph.conf --allhosts -v
[22:05] <eightyeight> # /etc/init.d/ceph -a start
[22:06] <joshd> actually, it might be due to the way you're specifying the osd journal/data
[22:06] <joshd> what's your ceph.conf?
[22:06] <eightyeight> # ps -ef | grep ceph
[22:06] <eightyeight> i'll pastebin
[22:09] <eightyeight> http://ae7.st/p/51
[22:11] <joshd> eightyeight: could you add 'debug osd = 25', 'debug filestore = 20', and 'debug journal = 20' to the osd section, restart the osd daemon, and paste the osd log?
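The same change can be scripted. A minimal Python sketch using the standard configparser module, assuming a ceph.conf simple enough for configparser to round-trip (it drops comments and rewrites indentation, so keep a backup); enable_osd_debug is a hypothetical helper, and the debug levels are the ones joshd asks for.

```python
import configparser
import io


def enable_osd_debug(conf_text: str) -> str:
    """Return ceph.conf text with verbose osd/filestore/journal logging in [osd]."""
    # interpolation=None keeps '$name'-style values literal
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(conf_text)
    if not cp.has_section("osd"):
        cp.add_section("osd")
    cp.set("osd", "debug osd", "25")
    cp.set("osd", "debug filestore", "20")
    cp.set("osd", "debug journal", "20")
    out = io.StringIO()
    cp.write(out)
    return out.getvalue()
```

The osd daemon still has to be restarted for the new levels to take effect.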
[22:14] <eightyeight> yeah. sec.
[22:16] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) has joined #ceph
[22:17] <eightyeight> http://ae7.st/p/4f
[22:17] * BManojlovic (~steki@ has joined #ceph
[22:26] <eightyeight> joshd: see anything of interest?
[22:26] <joshd> looking
[22:27] <eightyeight> ok
[22:39] * fronlius_ (~fronlius@f054097033.adsl.alicedsl.de) has joined #ceph
[22:39] <joshd> eightyeight: the bug is in the error path, when the osd can't authenticate with the monitor
[22:40] <joshd> eightyeight: you'll need to set a keyring for the osd in your ceph.conf
[22:40] * fronlius (~fronlius@f054100059.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[22:40] * fronlius_ is now known as fronlius
[22:40] <joshd> eightyeight: just like you have for the mds section
[22:40] <eightyeight> ok
[22:41] <eightyeight> rebuild with mkcephfs(8) and restart services?
[22:41] <eightyeight> or just restart?
[22:42] <joshd> just restart
[22:42] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Operation timed out)
[22:43] <eightyeight> ok. it didn't create a /data/keyring.ods.kvmsan1, as would be expected
[22:43] <eightyeight> or do i need to create that manually myself?
[22:44] <verwilst> is ceph production ready?
[22:44] <joshd> eightyeight: I guess it's created at mkcephfs time, but you can create it yourself by running 'ceph auth list' and putting the osd part in the expected file
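joshd's suggestion can be sketched in Python: parse the 'ceph auth list' output and render only the osd entry in keyring form. This assumes the layout 'ceph auth list' printed at the time (an entity name on its own line, followed by indented 'key:' and 'caps:' lines); parse_auth_list and keyring_for are hypothetical helpers, and the file must be written to whatever path the 'keyring =' line in ceph.conf names.

```python
def parse_auth_list(text: str) -> dict:
    """Map entity name -> {'key': ..., 'caps': {service: perms}}."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not raw[0].isspace():
            if line.endswith(":"):
                current = None  # a header such as 'installed auth entries:'
                continue
            current = line
            entries[current] = {"key": None, "caps": {}}
        elif current is not None:
            field, _, value = line.partition(": ")
            if field == "key":
                entries[current]["key"] = value
            elif field == "caps":
                # '[mds] allow' -> service 'mds', perms 'allow'
                service, _, perms = value.partition("] ")
                entries[current]["caps"][service.lstrip("[")] = perms
    return entries


def keyring_for(entries: dict, entity: str) -> str:
    """Render one entity in keyring file format: '[name]' plus 'key = ...'."""
    return f"[{entity}]\n\tkey = {entries[entity]['key']}\n"
```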
[22:45] <Tv|work> verwilst: rados, radosgw and rbd are, the distributed filesystem is not; http://ceph.newdream.net/docs/latest/#status
[22:45] <Tv|work> verwilst: but do take care to learn the system well before relying on being able to recover from errors
[22:46] <Tv|work> verwilst: commercial support should be widely available 1-3 months from now, if you have a really interesting use case you might get to be one of the early customers
[22:47] <verwilst> i don't think our usecase is that crazy :)
[22:47] <verwilst> we have some webservers that have common data
[22:47] <verwilst> so i would like to have a cluster fs so they can all be active and write to the same data :P
[22:48] <verwilst> not sure if that's a valid use case :)
[22:48] <Tv|work> verwilst: the distributed filesystem is not production ready
[22:48] <Tv|work> verwilst: prototyping and feedback is welcome
[22:49] <verwilst> you happen to know what's more suited for a setup like mine?
[22:49] <verwilst> since it's production shizzle :)
[22:49] <Tv|work> verwilst: rsync? git checkout?
[22:49] <verwilst> not the code
[22:49] <verwilst> but the data
[22:49] <Tv|work> verwilst: your use case sounds more like a distribution problem than having an active filesystem requirement
[22:49] <verwilst> the code can be deployed with capistrano or sth :)
[22:50] <Tv|work> verwilst: or do you actually create/update the data files?
[22:50] <verwilst> yeah
[22:50] <verwilst> users can upload stuff for example
[22:51] <verwilst> normally we try to use sth like mogilefs for it
[22:51] <Tv|work> verwilst: well you could work on top of an object store; radosgw has an S3-like API
[22:51] <verwilst> but because it's a lot of larger files for this project, we would like to use x-sendfile, which doesn't work with mogilefs
[22:55] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:58] <eightyeight> joshd: do i need to create that keyring file under /data, or should the daemon create it?
[22:58] <eightyeight> i guess i can touch(1) it, and see if it gets updated
[22:59] <joshd> eightyeight: mkcephfs creates it, or you can create it. It needs to be where you say it is in the 'keyring = blah' part of your ceph.conf
[22:59] <eightyeight> ah. mkcephfs(8). heh
[23:02] <eightyeight> from dmesg(1):
[23:02] <eightyeight> [361861.560622] libceph: client0 fsid 7487c444-2731-4120-9b31-ff0407ec639b
[23:02] <eightyeight> [361861.561170] libceph: auth method 'x' error -1
[23:02] <eightyeight> "operation not permitted" when mounting
[23:05] <eightyeight> oh. wait
[23:06] <eightyeight> yeah. same error, after fixing my mistake
[23:08] <joshd> eightyeight: did you use the new key since you re-ran mkcephfs? if so, could you pastebin the monitor logs?
[23:08] <eightyeight> the key has the same value as 'client.admin' for 'osd.0': AQBtTTBPqEIMLBAAOJ4EUyUZSMtzkQaPoKf45w==
[23:09] <eightyeight> # mount.ceph /mnt -v -o name=osd,secret=AQBtTTBPqEIMLBAAOJ4EUyUZSMtzkQaPoKf45w==
[23:10] * eightyeight pastebins the logs
[23:10] <eightyeight> with verbose debugging?
[23:10] <joshd> wait, you mean 'ceph auth list' shows the same key for both osd.0 and client.admin?
[23:10] <eightyeight> yes
[23:11] <eightyeight> no
[23:11] <eightyeight> osd.0: AQBtTTBPqEIMLBAAOJ4EUyUZSMtzkQaPoKf45w==, client.admin: AQBtTTBPUBgFOBAAgn1MRmo2+clfuH15DfoX4Q==
[23:11] <eightyeight> looked the same at first glance
[23:13] <joshd> you won't be able to mount the filesystem as an osd, since they don't have capabilities for the mds (note how client.admin has caps: [mds] allow)
[23:14] <joshd> you'll have to mount as client.admin
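Put another way, the name handed to mount.ceph has to be a client entity whose caps include the MDS. A hypothetical Python helper that turns a cephx entity, its key and its caps (for example as shown by 'ceph auth list') into the -o string for mount.ceph, refusing entities like osd.0; it assumes, as in the name=admin mount earlier in the log, that name= takes the id without the 'client.' prefix.

```python
def mount_options(entity: str, key: str, caps: dict) -> str:
    """Build the '-o' option string for mount.ceph from a cephx entity."""
    kind, _, ident = entity.partition(".")
    if kind != "client" or not ident:
        raise ValueError(f"{entity}: only client.* entities should mount the filesystem")
    if "mds" not in caps:
        raise ValueError(f"{entity}: no [mds] caps, so the mount will be refused")
    # mount.ceph takes the id alone: name=admin means client.admin
    return f"name={ident},secret={key}"
```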
[23:14] <eightyeight> i assume you mean this: http://ae7.st/p/56
[23:14] <eightyeight> ok
[23:15] <joshd> yeah
[23:15] <eightyeight> woah
[23:15] <eightyeight> i have a mount
[23:15] <eightyeight> w00t
[23:16] <nhm> eightyeight: congrats. :)
[23:16] <eightyeight> nhm: thx
[23:16] <eightyeight> joshd: thx for your help
[23:16] <eightyeight> iozone is running, and i'm not liking the results of what i'm seeing, however
[23:17] <eightyeight> slow writes
[23:17] <joshd> eightyeight: you're welcome
[23:17] <eightyeight> i guess i would see faster writes with more osd servers on board, yeah?
[23:18] <joshd> eightyeight: one thing that will help there is putting the osd journal on a separate disk, so it's not contending with the normal data storage of the osd
[23:18] <eightyeight> joshd: is there a document for that?
[23:18] <joshd> eightyeight: more osds would help too
[23:18] <nhm> eightyeight: what kind of speed are you seeing with what settings?
[23:18] <eightyeight> just a sec. phone.
[23:19] <joshd> eightyeight: just set 'osd journal = /dev/whatever' in ceph.conf
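A rough way to catch this kind of contention before starting the daemon: a Python sketch that reads one section of ceph.conf and reports whether 'osd journal' points inside 'osd data' or at one of the 'btrfs devs'. It is only a path-level heuristic (it cannot tell whether two md arrays share physical disks), it raises if no journal is set, and journal_shares_data is a hypothetical helper; pass section="osd.0" if the settings live in a per-daemon section.

```python
import configparser


def journal_shares_data(conf_text: str, section: str = "osd") -> bool:
    """Heuristic: True if the journal path overlaps the OSD's data location."""
    cp = configparser.ConfigParser(interpolation=None)  # keep '$name' literal
    cp.read_string(conf_text)
    journal = cp.get(section, "osd journal", fallback=None)
    if journal is None:
        raise ValueError(f"no 'osd journal' set in [{section}]")
    data = cp.get(section, "osd data", fallback="")
    devs = cp.get(section, "btrfs devs", fallback="").split()
    inside_data_dir = bool(data) and journal.startswith(data.rstrip("/") + "/")
    return inside_data_dir or journal in devs
```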
[23:22] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[23:28] <nhm> eightyeight: I'm going afk for a while, but would be very interested in your performance results. Please feel free to email me at nhm@clusterfaq.org, or post results in channel. Thanks
[23:29] <eightyeight> nhm: will do
[23:29] <eightyeight> joshd: working on that now
[23:30] <eightyeight> so, recreating my drives with a manual linux software raid 10 (4 individual mirrors, then 1 stripe across the mirrors)
[23:31] <eightyeight> and the journal device as a linux raid 1 with 2 drives
[23:31] <eightyeight> thoughts?
[23:31] <iggy> A. linux supports "raid10" which may or may not be better than manually doing mirrors and stripes
[23:32] <iggy> B. You still have the journal contending with the actual osd FS
[23:33] <iggy> for best performance, you probably want the journal on its own drive (ssd, 15k sas, etc...)
[23:33] <iggy> there's a page on the wiki about designing a cluster that has some of this info in it
[23:37] <eightyeight> in my experience, building a raid 10 manually, rather than letting the kernel do it, has given me _substantial_ performance boosts
[23:37] <eightyeight> further, raid 10 support is still "experimental"
[23:37] <eightyeight> so, /dev/md10 is my osd, and /dev/md5 is my osd-journal
[23:38] <eightyeight> (/dev/md5 being the raid 1)
[23:41] <eightyeight> hmm. taking the journal out (i don't see it mounted) does show a slight performance improvement
[23:42] <eightyeight> then again, these disks are under heavy stress syncing the arrays
[23:42] * eightyeight should wait for them to finish
[23:44] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) Quit (Quit: Ex-Chat)
[23:54] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)

