#ceph IRC Log


IRC Log for 2010-10-01

Timestamps are in GMT/BST.

[0:15] <yehudasa> sage: sagewk: have you done anything to the playground?
[1:02] * deksai (~deksai@dsl093-003-018.det1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[2:57] * greglap (~Adium@ has joined #ceph
[3:39] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has left #ceph
[3:42] * bbigras (quasselcor@bas11-montreal02-1128536392.dsl.bell.ca) has joined #ceph
[3:43] * bbigras is now known as Guest1390
[3:45] * Guest1097 (quasselcor@bas11-montreal02-1128535274.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[3:52] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[4:03] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[5:40] * bbigras (quasselcor@bas11-montreal02-1128536392.dsl.bell.ca) has joined #ceph
[5:40] * bbigras is now known as Guest1399
[5:44] * Guest1390 (quasselcor@bas11-montreal02-1128536392.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[5:45] <cclien> [6~/win 5
[5:48] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[6:03] * yehudasa_hm (~yehuda@adsl-69-225-137-176.dsl.irvnca.pacbell.net) has joined #ceph
[6:33] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[6:37] * Spudz76 (~spudz76@dc.gigenet.com) has joined #ceph
[6:38] <Spudz76> anyone around to answer a quick question?
[6:51] * f4m8_ is now known as f4m8
[7:30] <greglap> what can we do for you, Spudz76?
[7:30] <Spudz76> I got it figured out now
[7:30] <Spudz76> you have to name osds in order
[7:31] <greglap> haha, my favorite kind of question :)
[7:31] <Spudz76> that was all
[7:31] <Spudz76> it doesn't mention that anywhere
[7:31] <greglap> you mean 1,2,3,...?
[7:31] <Spudz76> like I had them named per host like osd101 etc
[7:31] <greglap> ah
[7:31] <Spudz76> and no osd0 - osd100
[7:31] <Spudz76> so stuff hung and didn't say why
[7:31] <Spudz76> also osd0101 doesn't work
[7:31] <Spudz76> strips the padding 0
[7:32] <Spudz76> but now I got osd0 - osd41 and its rockin
[7:32] <greglap> yeah, it's a straight conversion in a number of the data structures
[7:32] <greglap> although maybe we could strip that out if there's sufficient demand/reason for it
[7:32] <Spudz76> I figured a sscanf/sprintf sort of dealie
[7:32] <Spudz76> but whats with the forced order
[7:33] <Spudz76> if its not in the config file then it doesn't exist, right?
[7:33] <Spudz76> heh
[7:33] <greglap> I'm not sure why that is, actually
[7:34] <greglap> but I'll look into it and at least document it better tomorrow, thanks for letting us know
[7:34] <Spudz76> sure
[7:34] <Spudz76> it could have been a bunk crushmap actually
[7:34] <Spudz76> I changed back to completely in-order osds and then fixed the crushmap
[7:36] <Spudz76> I will re-test the out-of-order osd numbers with a non-bunk crushmap in a bit
[7:37] <Spudz76> but I opened a ticket for the padding-strip thing
[7:39] <greglap> ah right, saw that ticket come in
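The zero-stripping Spudz76 describes is what any name-to-integer round trip produces. A minimal Python sketch, assuming a hypothetical parser (the helper names are illustrative and are not Ceph's actual code). The gap check reflects the "name osds in order" observation, which the log itself leaves unconfirmed since a bad crushmap may have been the real cause:

```python
import re


def osd_id_from_name(name: str) -> int:
    """Extract the numeric id from a section name such as 'osd0101' (hypothetical helper)."""
    match = re.fullmatch(r"osd\.?(\d+)", name)
    if match is None:
        raise ValueError(f"not an osd name: {name!r}")
    # int() discards leading zeros, so 'osd0101' and 'osd101' collapse to the same id.
    return int(match.group(1))


def osd_name_from_id(osd_id: int) -> str:
    """Format an id back into a name; the zero padding cannot be recovered."""
    return f"osd{osd_id}"


def missing_ids(names):
    """List the ids absent from the range 0..max(id), i.e. the gaps in the numbering."""
    ids = sorted(osd_id_from_name(n) for n in names)
    if not ids:
        return []
    present = set(ids)
    return [i for i in range(ids[-1] + 1) if i not in present]
```

Running `missing_ids` over the section names of a config before starting the cluster would flag a layout like osd101 with no osd0 through osd100.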
[8:40] * allsystemsarego (~allsystem@ has joined #ceph
[9:23] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[11:18] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[12:23] * Yoric (~David@ has joined #ceph
[14:08] * Spudz76 (~spudz76@dc.gigenet.com) has left #ceph
[14:48] * Guest1399 (quasselcor@bas11-montreal02-1128536392.dsl.bell.ca) Quit (Remote host closed the connection)
[14:50] * bbigras (quasselcor@bas11-montreal02-1128536392.dsl.bell.ca) has joined #ceph
[14:51] * bbigras is now known as Guest1437
[15:40] * Yoric_ (~David@ has joined #ceph
[15:40] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:40] * Yoric_ is now known as Yoric
[15:49] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:54] * Yoric (~David@ has joined #ceph
[16:01] * f4m8 is now known as f4m8_
[17:06] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) Quit (Quit: bye)
[17:12] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[17:12] * gregorg_taf (~Greg@ has joined #ceph
[17:40] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[17:48] * yuravk (~yura@ext.vps.lviv.ua) has left #ceph
[18:51] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[18:55] * Yoric (~David@ Quit (Quit: Yoric)
[19:03] * idletask (~fg@AOrleans-553-1-29-88.w92-152.abo.wanadoo.fr) has joined #ceph
[19:03] <idletask> Hello
[19:07] <gregaf> hello
[19:07] <idletask> I have two questions...
[19:08] <idletask> If you specify more than 1 dev in the "btrfs devs" parameter of an osd section and use mkfs.ceph, will the btrfs filesystem in question use RAID 0 or RAID 1 for data storage?
[19:09] <idletask> That's the first
[19:09] <gregaf> nope
[19:09] <gregaf> if you want raid under btrfs you'll need to set that up yourself
[19:09] <idletask> OK, so it creates two separate filesystems?
[19:10] * MarkN (~nathan@ Quit (Ping timeout: 480 seconds)
[19:11] <idletask> That's a third question actually, since my second is: what happens if one osd has a shorter space available than another?
[19:11] <gregaf> oh, wait, no, it uses the default settings
[19:11] <gregaf> just passes those parameters to the btrfs mkfs stuff, so whatever that defaults to
[19:12] <idletask> Errr
[19:12] <gregaf> I don't know if it's RAID1 or RAID0, you'd have to look at the btrfs docs, but it does make one fs out of both disks
[19:12] <idletask> There isn't a /etc/mkbtrfs.conf or the like
[19:12] <idletask> OK, so it's man mkbtrfs then
[19:12] <idletask> Fair enough
[19:12] <gregaf> mkfs.btrfs
[19:12] <gregaf> yeah
[19:13] <idletask> "If multiple devices are specified, btrfs is created spanning across the specified devices"
[19:13] <idletask> There, I have my answer
[19:13] <gregaf> by default, OSDs with different disk space are treated the same, and if one fills up then the cluster marks itself as full
[19:14] <idletask> The whole cluster, then?
[19:14] <gregaf> you can change that by manually specifying different weights in the CRUSH map
[19:14] <gregaf> in which case they'll get data in proportions according to their weights
[19:14] <idletask> Hmm, OK, CRUSH I have just read the name but don't know nil about it, so that will be for later
[19:15] <gregaf> it doesn't do this by default because 1) it would be complicated, and 2) it will decrease overall performance since you're increasing the load on some OSDs (presumably without increasing network or disk bandwidth to match)
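The arithmetic behind gregaf's point about weights can be sketched in Python. This is a simplified model, not CRUSH itself: it assumes each OSD receives exactly its weight's proportion of the data, whereas real placement is pseudo-random and only approximates those proportions. The function names and the capacity figures are made up for illustration:

```python
def expected_shares(weights):
    """Fraction of the data each OSD is expected to hold, proportional to its weight."""
    total = sum(weights.values())
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative with a positive sum")
    return {osd: w / total for osd, w in weights.items()}


def data_until_first_full(capacities, weights):
    """Total data the cluster can take before its first OSD fills up.

    An OSD with capacity c and share w/total is full once the cluster
    holds c * total / w, so the smallest such value is the limit.
    """
    total = sum(weights.values())
    return min(capacities[osd] * total / w for osd, w in weights.items() if w > 0)
```

With capacities of 100 and 200 units and equal weights, the model reports the cluster full at 200 units; weighting the OSDs 1:2 raises that to 300, at the cost of the larger OSD serving twice the load.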
[19:17] <idletask> Well, I think it's time that I initiated my newly born ceph cluster first - still need to configure the auth and I should be good to start the monitor and mkfs.ceph
[19:19] * MarkN (~nathan@ has joined #ceph
[20:01] <idletask> OK, authentication turns out to be way harder than I thought
[20:02] <idletask> In fact, I can't fathom any of it :(
[20:03] <idletask> I'll do without auth first
[20:04] <gregaf> idletask: if you want authentication we can walk you through the hard parts
[20:07] <idletask> gregaf: that will turn out to be a lot of question - and a text for a wiki page
[20:07] <idletask> questions, I meant
[20:08] <idletask> I like to understand what I am doing, and be able to explain it to others afterwards :p
[20:10] <idletask> First of all, then - there are different authentication schemes supported, cephx is one of them
[20:10] <idletask> What are the others?
[20:11] <yehudasa> idletask: the other is "none"
[20:11] <idletask> Ah :p
[20:12] <idletask> If not specified, none is the default, then?
[20:12] <yehudasa> yeah, right now that is
[20:15] <idletask> Next question, then...
[20:15] <idletask> The keyring
[20:15] <idletask> You can specify a keyring in the global section, in the mon and mon* sections, in the mds and mds* sections
[20:16] <idletask> That's a lot
[20:16] <idletask> So, what is the role of a keyring exactly?
[20:18] <yehudasa> the keyring contains the keys for the different entities
[20:18] <idletask> A key being...?
[20:19] <yehudasa> a key being a shared secret and the related caps an entity might have
[20:20] <idletask> I suppose an "entity" here means a metadata server, a monitor or an object store, right?
[20:20] <yehudasa> e.g., the admin user (client.admin) needs to have some key that is known to the monitors in order to be able to authenticate
[20:23] <yehudasa> oh, entity is client/osd/mds/mon
[20:23] <yehudasa> note that each entity instance in the system has a different key
[20:24] <yehudasa> (except for the monitors that share a common key)
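What yehudasa describes can be pictured as a small data structure. A rough Python sketch, assuming an in-memory dictionary rather than Ceph's real keyring file format; the class name, the 16-byte key length, and instance names such as "mon.a" or "osd.0" are assumptions made for illustration:

```python
import secrets
from dataclasses import dataclass, field

ENTITY_TYPES = ("client", "osd", "mds", "mon")


@dataclass
class KeyringEntry:
    key: bytes                                # shared secret for this entity
    caps: dict = field(default_factory=dict)  # e.g. {"osd": "allow r"} (assumed format)


def build_keyring(entity_names):
    """Give every entity its own key, except monitors, which share a common one."""
    mon_key = secrets.token_bytes(16)
    keyring = {}
    for name in entity_names:
        kind = name.split(".", 1)[0]
        if kind not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type in {name!r}")
        key = mon_key if kind == "mon" else secrets.token_bytes(16)
        keyring[name] = KeyringEntry(key=key)
    return keyring
```

This also shows why a keyring can be specified in several config sections: each daemon only needs the entries for the entities it has to prove itself as or recognize.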
[20:24] <idletask> Let me write that out
[20:31] * josef (~seven@nat-pool-rdu.redhat.com) has joined #ceph
[20:37] <idletask> Next question: if you use authentication, this is used to manage the cluster itself - but does it also manage what machines are allowed to mount ceph filesystems? If yes, do those machines need a key as well?
[20:42] <yehudasa> yes, in order to be able to mount, you need to give a key to the client machines
[20:44] <idletask> So, are mounting machines yet another entity?
[20:44] <yehudasa> the machines are clients
[20:44] <yehudasa> there is no distinction between a mounting machine, or a single user
[20:46] <idletask> OK, I thought the "client" entity was a cluster administrator, my bad
[20:46] <yehudasa> the client.admin is the cluster administrator, but only because it has unrestricted caps
[20:47] <idletask> That's a reserved name, then?
[20:47] <yehudasa> I don't remember having any specialized code for that specific user
[20:47] <yehudasa> but I might be wrong here, need to check that
[20:48] <idletask> Right now it's just a strncmp(), then?
[20:48] <yehudasa> oh, actually it is a reserved name
[20:49] <yehudasa> what do you mean strncmp?
[20:49] <josef> sage: your chat support people are awesome btw :)
[20:50] <idletask> Well, string compare "client.admin" to other strings wrt caps
[20:50] <idletask> (which are my next big subject)
[20:51] <yehudasa> we don't just compare the name of the client to what we have in our system and that's it, if that's what you're asking
[20:52] <yehudasa> there's some kind of a key exchange between the client and the monitors that handles the authentication, and once the authentication is done, the client gets a signed ticket that includes its caps inside
[20:52] <idletask> OK, that makes it clear - and another thing to write out
[20:53] <yehudasa> it works the same as kerberos in that sense
[20:54] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:05] <wido> sagewk: you there?
[21:06] <wido> I see all my OSD's are up, did you patch them last night? This morning I skipped my daily build, just in case you patched something
[21:07] <wido> I manually merged "osd: fix recovery_primary loop on local clone" from testing into unstable, and built packages with that, but haven't installed them. Is that ok?
[21:13] <sagewk> wido: didn't touch them, i don't think. the patch was only to make recovery complete, so you should be fine with unstable for now.
[21:13] <sagewk> wido: yep. i'll be merging that into unstable shortly.
[21:14] <sagewk> wido: do you have a crash log from one of the osds that died? i suspect the recovery change i made didn't fully clean up the in-memory state
[21:18] <josef> sagewk: you just got a new customer ;), and ceph has been updated to the latest version in fedora
[21:18] <josef> thanks to some random guy who wants to use ceph
[21:19] <sagewk> cool :)
[21:19] <idletask> josef: I'm in the process of trying to set up a cluster, and am fighting my way through the authentication mechanism
[21:19] <idletask> Gentoo isn't up to date with its ebuilds
[21:20] <josef> idletask: well fedora rawhide has the newest stuff
[21:20] <josef> again no thanks to me, i've not learned the new git stuff
[21:20] <idletask> josef: but keep up the good work on btrfs ;)
[21:20] <josef> idletask: yup thats where most of my time is spent
[21:21] <idletask> I use it and it's awesome for my needs
[21:21] <josef> tho today i'm trying to get my domain moved around
[21:21] <idletask> Which is why I now want to put /home on cepth
[21:21] <idletask> ceph
[21:23] <idletask> About auth again, and caps
[21:24] <wido> sagewk: I didn't gather the core dumps yet, but they are on node02, node05, node07 and node08
[21:24] <wido> in / you will find a bunch of coredumps, they all have various backtraces, seem to be different bugs
[21:24] <idletask> yehudasa: from our discussion above, this is what I understand:
[21:24] <wido> dumps from sept 29 and 30
[21:24] <idletask> Entity C is a client, entity M is a monitor
[21:24] <idletask> C authenticates to M, then M returns a ticket to C with C's capabilities in it
[21:24] <idletask> Is this correct?
[21:25] <yehudasa> idletask: basically yes, just that C specifically requests a ticket to user with certain services
[21:25] <yehudasa> e.g., C asks for mds and osd tickets
[21:27] <yehudasa> should be 'tickets to use'
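The flow idletask and yehudasa settle on (C proves it knows its key, M hands back per-service tickets carrying C's caps) can be sketched in Python. This is a toy model and not the cephx wire protocol: it substitutes an HMAC challenge-response and HMAC-signed JSON for the real key exchange and ticket encoding, and every name in it (`Monitor`, `prove`, `issue_ticket`, `verify_ticket`, the cap strings) is hypothetical:

```python
import hashlib
import hmac
import json


def prove(secret: bytes, challenge: bytes) -> str:
    """Client-side proof of knowing the shared secret, without sending it."""
    return hmac.new(secret, challenge, hashlib.sha256).hexdigest()


def issue_ticket(service_secret: bytes, entity: str, service: str, caps: str) -> dict:
    """Sign a ticket with the target service's secret so that service can check it."""
    body = json.dumps({"entity": entity, "service": service, "caps": caps}, sort_keys=True)
    sig = hmac.new(service_secret, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}


def verify_ticket(service_secret: bytes, ticket: dict):
    """Return the ticket contents if the signature is valid, otherwise None."""
    expected = hmac.new(service_secret, ticket["body"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, ticket["sig"]):
        return None
    return json.loads(ticket["body"])


class Monitor:
    def __init__(self, entity_db, service_secrets):
        self.entity_db = entity_db              # name -> (shared secret, {service: caps})
        self.service_secrets = service_secrets  # service -> secret used to sign tickets

    def request_tickets(self, entity, challenge, proof, services):
        """Authenticate the entity, then issue one ticket per requested service."""
        if entity not in self.entity_db:
            return {}
        secret, caps = self.entity_db[entity]
        if not hmac.compare_digest(prove(secret, challenge), proof):
            return {}
        return {
            s: issue_ticket(self.service_secrets[s], entity, s, caps[s])
            for s in services
            if s in caps and s in self.service_secrets
        }
```

In this model the name "client.admin" gets no special handling: an entity is only as powerful as the caps stored for it in the monitor's database.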
[21:29] <idletask> So, how is "client.admin" special in this case? Can it tell monitors "I want {mds,osd}x down" or something?
[21:29] <idletask> I'm lost
[21:33] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[21:34] <yehudasa> idletask: yeah, from what I see now, the special powers that the admin users have are only derived from its caps
[21:34] <yehudasa> so there's no real magic in the client.admin user anymore, other than that we use it as the default admin user
[21:34] <idletask> yehudasa: so, it is the monitor recognizing client.admin as special, then?
[21:35] <yehudasa> the monitor recognizes any user that it has in its internal db
[21:35] <idletask> There's a chicken and egg problem that I just don't understand :(
[21:36] <yehudasa> usually, when we create the filesystem, we first generate a client.admin user
[21:36] <yehudasa> then we create the monitor, and let it know about that user
[21:37] <idletask> Is it mkcephfs which generates that user?
[21:37] <yehudasa> yes
[21:38] <idletask> And if you don't use it, you must use cauthtool and create "client.admin" in the appropriate keyring, correct?
[21:38] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) Quit (Quit: Leaving.)
[21:39] <yehudasa> right
[21:39] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[21:39] <idletask> OK, the chicken and egg problem is solved
[21:39] <idletask> Now, I just need to write that down, and put it in practice
[21:40] <idletask> Thanks!
[21:40] <yehudasa> np!
[21:55] <sagewk> idletask: are you trying to avoid mkcephfs or something?
[21:55] <sagewk> or just understand how it works?
[21:58] <yehudasa> np!
[21:59] <yehudasa> wrong window..
[22:02] <gregaf> machines that mount are clients and get client keys
[22:02] <gregaf> oh, whoops, scrolled way back by mistake, n/m that last message
[22:40] <wido> sagewk: do you need me to gather those core dumps? There are about 10 dumps I think, from 4 machines
[22:43] <idletask> sagewk: understand how it all works
[22:43] <sagewk> wido: hmm is there a stack trace in the log?
[22:44] <wido> Yes, there is
[22:44] <sagewk> oh i see nm, i'll look at the cores now
[22:44] <wido> it's node02, node04, node07 and node08
[22:44] <wido> those 4 crashed over and over 2 days ago
[22:55] <sagewk> hmm looks like cosd was updated after that. do you remember if there were stack traces in the logs?
[23:03] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[23:05] <sagewk> which i guess means: yes, if you could gather logs, that would be great. the cores are only helpful if you have the matching unstripped binary, which is tricky because the debs have stripped binaries and debug symbols in /usr/lib/debug/usr/bin/cosd. and i'm not sure how to merge those two together or make gdb read it all in
[23:06] <wido> sagewk: I haven't touched the logs, so they should be in /var/log. The timestamps of the logs should match with the core timestamp
[23:08] <sagewk> ok i see it thanks
[23:09] <sagewk> hmm, this looks like it may be fallout from my half-baked fixes.
[23:11] <wido> sagewk: I found the traces too in the logs, for example on osd7 at 09:53 in osd.7.log.1.gz
[23:11] <sagewk> yeah that's the one i'm looking at now actually :)
[23:11] <wido> ok, but there were a lot of crashes on 4 nodes, i'm really not sure what happened
[23:12] <wido> for example, when I tried to do a "ls" in the rbd pool, osd7 went down
[23:13] <sagewk> yep, this was fallout from the old broken code. fixed up now.
[23:14] <sagewk> you may still see some residual crashing on these pgs that were sloppily fixed up, at least until there is a new write to the pool. if it's that same assert in clean_up_local() then that's it. i suspect it's all sorted out now though since they stopped crashing :)
[23:15] <wido> yes, I think so too
[23:15] <wido> but now the data and metadata pool are clean, I still can't mount the FS now, keeps giving me mount error 5
[23:16] <wido> The error says pretty much nothing, and I can't find anything in the MDS log. Where should I look?
[23:16] <sagewk> mds is down it looks like
[23:16] <wido> both are up?
[23:17] <sagewk> restarting for good measure
[23:17] <wido> Restarted them about 15 minutes ago, didn't fix anything
[23:17] <sagewk> they're in up:standby
[23:17] <sagewk> hmm
[23:18] <sagewk> oh, your mdsmap is in a weird state.
[23:20] <wido> Looks like my whole cluster is weird sometimes :)
[23:20] <idletask> Back...
[23:20] <idletask> Now, capabilities
[23:20] <idletask> What are they?
[23:21] <wido> idletask: you mean in cephx?
[23:21] <idletask> From the wiki, I only saw [allow] <some unix-like permissions there>
[23:21] <idletask> Yes
[23:21] <wido> r = read, w = write and x = execute
[23:22] <wido> with that you can control whether a client for example should have access to the metadata server (mds)
[23:22] <wido> If you are using a client which is RADOS only (No filesystem) you can skip the MDS access for example
[23:23] <wido> and a read only client (correct me if i'm wrong) will only need Read access to the MDS
[23:23] <wido> same goes for OSD access and mon access.
[23:24] <idletask> Erm, hold on
[23:24] <idletask> What does a basic client which just wants to mount a ceph filesystem need as capabilities, then?
[23:24] <wido> mount, mds r, mon r and osd r
[23:24] <sagewk> idletask: man cauthtool
[23:26] <idletask> sagewk: this gets back to my original question, then - capabilities, which are they?
[23:28] <gregaf> in this context, capabilities are permissions to perform certain actions on a Ceph cluster
[23:29] <gregaf> all actions a client can ask a server to do are broken down as one of read, write, execute
[23:29] <gregaf> and then the client has some combination of permissions for each server type in the cluster
[23:30] <gregaf> client in this case meaning a mounting computer — individual users don't get their own keychains or capabilities
[23:32] <idletask> gregaf: what I really meant was: what are the individual capabilities?
[23:32] <gregaf> some combination of read/write/execute on each of the mon/mds/osd
[23:33] <gregaf> they do about what you'd expect, but I don't think we have a comprehensive listing of them anywhere
[23:34] <idletask> Well, I need to read more about what to expect, probably
[23:36] <gregaf> probably just reading the cauthtool man page will give you what you need
[23:37] <gregaf> capabilities are a little more capable than they need to be given what's actually possible with them
[23:37] <gregaf> err, no pun intended
[23:37] <idletask> I didn't mean that
[23:37] <idletask> I'm just beginning, after all
[23:39] <idletask> gregaf: so, why stick with the rwx model?
[23:40] <gregaf> we weren't sure exactly what we'd need and it is useful for handling the OSD pools
[23:41] <gregaf> on which you have precisely read, write, and execute permissions
[23:43] <idletask> Back to individual capabilities then - "objects" are mount, mds, mon, osd, and "permissions" on them are r, w, x, correct?
[23:45] <wido> idletask: there is no mount object, only osd, mds and mon
[23:47] <idletask> Og
[23:47] <idletask> s,g$,h,
[23:47] <wido> ?
[23:47] <idletask> So, mount needs "r" on all three, then?
[23:50] <wido> yes, but on the osd you can restrict the client to the "data" pool
[23:50] <idletask> Err, let's keep it simple to start with :p
[23:50] <wido> The mds will do the r/w to the metadata pool, so the client doesn't need access to the metadata pool. Right gregaf?
[23:51] <gregaf> right
[23:51] <gregaf> of course, if you actually want to write anything you'll also need write access to the data pool
[23:51] <idletask> Erm
[23:51] <wido> yes, indeed
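The rwx model discussed above boils down to a per-service permission check. A minimal Python sketch, assuming cap strings of the form "allow rw" (modeled on what idletask saw on the wiki, not on a confirmed grammar) and ignoring the per-pool restriction wido mentions; the function names are illustrative:

```python
VALID_PERMS = frozenset("rwx")


def parse_cap(spec: str) -> frozenset:
    """Turn a string like 'allow rw' into the set {'r', 'w'} (assumed format)."""
    parts = spec.split()
    if len(parts) != 2 or parts[0] != "allow":
        raise ValueError(f"unsupported cap spec: {spec!r}")
    perms = frozenset(parts[1])
    if not perms <= VALID_PERMS:
        raise ValueError(f"unknown permission in {spec!r}")
    return perms


def is_allowed(caps: dict, service: str, needed: str) -> bool:
    """Check that the client holds every needed permission on mon, mds or osd."""
    granted = caps.get(service, frozenset())
    return set(needed) <= granted


# A read-only mounting client needs r on all three services; writing files
# additionally needs w on the osd (specifically on the data pool, not modeled here).
read_only_client = {s: parse_cap("allow r") for s in ("mon", "mds", "osd")}
writing_client = dict(read_only_client, osd=parse_cap("allow rw"))
# A RADOS-only client can skip the mds entry entirely.
rados_only_client = {"mon": parse_cap("allow r"), "osd": parse_cap("allow rw")}
```

Under this sketch `is_allowed(read_only_client, "osd", "w")` is false while the same check on `writing_client` passes, which matches the exchange between wido and gregaf.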
[23:52] * idletask is lost already
[23:52] <wido> idletask: I think you are a bit overwhelmed by Ceph
[23:52] <wido> just start using it, get to know the bricks
[23:52] <wido> and then things will fall into place
[23:52] <idletask> Yes, probably so
[23:52] <gregaf> authentication isn't really necessary to do work with
[23:53] <gregaf> it's just in case you're worried about your network being exposed and people trying to mount your ceph fs on computers that you don't control
[23:54] <wido> gregaf: but cephx doesn't protect data either, it's only auth, no encryption
[23:54] <wido> that question is still open in the FAQ on the Wiki :)
[23:55] <idletask> Yes, but OTOH, I stick to the principle of "what isn't necessary is forbidden", so I like to understand security mechanisms and how to work with them
[23:55] <gregaf> uh, yeah, I suppose people could listen in if your network was really poorly constructed
[23:56] <idletask> Well, the same applies to a network anyway, doesn't it?
[23:57] <idletask> I use firewalls on each and every machine I ever build, and that's for a reason
[23:57] <gregaf> cephx and similar things have been thrown in over the last year, but in some ways they're just band-aids — Ceph was designed in the expectation that it would be used in a trusted environment, so it's not really any more secure than a local filesystem would be
[23:58] <idletask> Well, for a "thrown in over the last year" stuff, it seems to do a lot already :p
[23:58] <gregaf> Yehuda did a good job with it :)
[23:58] <wido> he did :)
[23:59] <wido> but while I get the early design, having a RO client is sometimes useful
[23:59] <wido> boxes can get hacked :)
[23:59] <idletask> Read-only what? Mount?
[23:59] <gregaf> yep!

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.