#ceph IRC Log


IRC Log for 2012-05-02

Timestamps are in GMT/BST.

[0:04] <Tv_> sagewk: how do you feel about this idea: i might want to pre-populate ceph_fsid sometimes, and if i do that i might as well write down the right magic (just to make sure) and whoami too.. that way, i could pre-assign a disk to a specific cluster, and record id allocation result asap
[0:05] <Tv_> sagewk: because without ceph_fsid, how am i to know what cluster to add a blank disk in?
[0:06] <sagewk> tv_: that's ok.. it'll overwrite the file with the value you feed ceph-osd, but no matter.
[0:06] <Tv_> sagewk: yeah i know the implementation is ok, i'm more talking about the big picture
[0:07] <Tv_> the whoami is just to minimize the window of losing allocated osd id
[0:07] <sagewk> i think the next thing we'll want to do is generate the osd uuid ourselves too, and feed it to ceph-osd --mkfs. that way we can make 'ceph osd create' idempotent
[0:07] <Tv_> especially to avoid a loop ticking them away at >1 per second
[0:07] <sagewk> yeah, may as well.
[0:07] <Tv_> uuids are trivial for me to generate, if --mkfs can take them
[0:09] <sagewk> let me double check
[0:09] <sagewk> not yet. that'll be the next iteration, once we put the osd uuids in the osdmap
[0:10] <sagewk> it only takes the cluster uuid. via --fsid <uuid>
[0:11] <Tv_> sagewk: i'm not 100% sure what the arguments for osd uuids in osdmap were, etc
[0:11] <Tv_> sagewk: as long as i need to manage the osd id, i don't see that as too relevant
[0:12] <sagewk> it would mean 'ceph osd create <uuid>' would only assign an id once, and then keep giving it back to you.
[0:12] <Tv_> (if the osd assigned an id to itself automatically, things would be different)
[0:12] <sagewk> but that can come later.
[0:12] <Tv_> oh
[0:12] <Tv_> hmm
[0:12] <Tv_> yeah that avoids that one loop-consume-resources issue
[0:12] <sagewk> yeah
[0:12] <sagewk> well, it's easy to add... let's just do it.
[0:13] <sagewk> since this is all 0.47 anyway.
[0:13] <Tv_> frankly, i wish osd ids were an internal implementation detail
[0:13] <Tv_> and the osdmap contained a list of uuids, and index in that list was your id, or something
[0:13] <Tv_> as that one person on the mailing list wished
[0:13] <sagewk> that's effectively what it is.
[0:13] <Tv_> it's still very visible
[0:14] <sagewk> should be able to make ceph-osd take a uuid and not the id on mkfs and startup
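The idempotency being designed here can be modeled as a uuid-keyed allocation. This is a toy stand-in for the monitor-side logic, not the actual `ceph osd create` implementation:

```python
import itertools

class OsdIdAllocator:
    """Toy model of an idempotent 'ceph osd create <uuid>': the first
    call for a given uuid allocates a fresh id; repeat calls return the
    same id. A crashed or looping provisioner can then retry safely
    instead of ticking away ids at >1 per second."""

    def __init__(self):
        self._by_uuid = {}
        self._next = itertools.count()

    def create(self, osd_uuid):
        if osd_uuid not in self._by_uuid:
            self._by_uuid[osd_uuid] = next(self._next)
        return self._by_uuid[osd_uuid]
```

The uuid acts as the nonce Tv_ mentions later: retrying with the same uuid is a no-op, which is what makes the Chef workflow safe.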
[0:14] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:14] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Quit: LarsFronius)
[0:20] <Tv_> sagewk: hmm that actually changes what i'm writing now almost completely
[0:20] <Tv_> which makes me want to explore that, instead
[0:21] <sagewk> hrm.. i'd prefer to defer it if we can avoid getting distracted.
[0:21] <Tv_> ok as long as you realize that's a big chunk of this logic
[0:21] <sagewk> just making ceph osd create not eat ids is easy, but i'm worried about catching all the corner cases with a larger change
[0:21] <Tv_> yeah
[0:21] <Tv_> i understand, and changing osdmap is scary (to me)
[0:22] <sagewk> well, i'm changing osdmap here, but it's an easy/safe addition.
[0:22] <sagewk> and the rest won't touch osdmap after this
[0:22] <sagewk> mostly ceph_osd.cc probably
[0:23] <sagewk> this means we depend on uuid-runtime
[0:23] <Tv_> sagewk: if we had leisure of time i'd suggest more radical changes ;)
[0:24] <Tv_> like, making ceph-osd just a lean & mean core, and not worry about stuff like this
[0:33] * aliguori (~anthony@ has joined #ceph
[0:42] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[0:43] <jmlowe> got a quick question, does this error mean anything to anybody "currently delayedold request osd_op"
[0:45] * eternaleye_ (~eternaley@tchaikovsky.exherbo.org) has joined #ceph
[0:46] * eternaleye (~eternaley@tchaikovsky.exherbo.org) Quit (Read error: Connection reset by peer)
[0:46] * detaos|cloud (~cd9b41e9@webuser.thegrebs.com) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * rosco (~r.nap@ Quit (resistance.oftc.net larich.oftc.net)
[0:46] * darkfader (~floh@ Quit (resistance.oftc.net larich.oftc.net)
[0:46] * vikasap (~vikasap@ Quit (resistance.oftc.net larich.oftc.net)
[0:46] * mfoemmel (~mfoemmel@chml01.drwholdings.com) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * mkampe (~markk@aon.hq.newdream.net) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * sboyette (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * sagewk (~sage@aon.hq.newdream.net) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * SpamapS (~clint@xencbyrum2.srihosting.com) Quit (resistance.oftc.net larich.oftc.net)
[0:46] * exel (~pi@ Quit (resistance.oftc.net larich.oftc.net)
[0:46] * detaos (~quassel@c-50-131-106-101.hsd1.ca.comcast.net) Quit (resistance.oftc.net larich.oftc.net)
[0:47] <gregaf> jmlowe: means there's an OSD with a request which is currently blocked waiting on the PG to become active
[0:49] <gregaf> hmm, actually it might be a couple other things it looked like, but essentially it means the Op is on a waitlist because the OSD is missing something and is trying to get it
[0:50] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[0:50] <jmlowe> once burping is done and I have two hands i'll investigate, lost half of 2 node cluster
[0:51] * rosco (~r.nap@ has joined #ceph
[0:52] * SpamapS (~clint@xencbyrum2.srihosting.com) has joined #ceph
[0:54] * detaos (~quassel@c-50-131-106-101.hsd1.ca.comcast.net) has joined #ceph
[0:55] * exel (~pi@ has joined #ceph
[0:55] * vikasap (~vikasap@ has joined #ceph
[0:56] * dmick (~dmick@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[0:59] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[1:00] <sagewk> tv_: do you think it's safe to drop the 'ceph osd create N' syntax? it was non-idempotent (and thus mostly useless) anyway
[1:01] <sagewk> none of our stuff used it
[1:01] <Tv_> sagewk: once the uuid stuff is in place, yes
[1:01] <sagewk> it'll happen in one go
[1:01] <sagewk> annoying to support both
[1:01] <Tv_> which means there's an awkward period when the chef stuff is broken, but that's fine
[1:01] <sagewk> i hope to push this branch in the next 30 min that will use uuids
[1:01] <sagewk> for this part
[1:02] <Tv_> oh i thought you wanted to defer
[1:02] <Tv_> oh i see it'll still allocate ids etc
[1:02] <sagewk> yeah just the osd create part
[1:02] <Tv_> just use that as nonce for idempotency
[1:02] <Tv_> yeah i like it
[1:03] <jmlowe> hmm, node rebooted itself, panic and watchdog perhaps, nothing for you guys to see, although I'm wondering why the extra mons, mdses and replicas didn't take over?
[1:04] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[1:05] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[1:05] * Ryan_Lane (~Adium@ has joined #ceph
[1:06] <gregaf> jmlowe: the message you pasted is a notification of slow service, but it doesn't itself trigger any failovers
[1:06] <gregaf> if the node is still responding properly to heartbeats and such
[1:09] <jmlowe> right, what I have is 3x mon, 2x mds, 2x6 osd, I should have the crush map putting a copy on each of the two machines, so when one rebooted without restarting ceph why didn't the other one take over?
[1:11] <Tv_> jmlowe: 3x mon on just two machines perhaps?
[1:11] <jmlowe> 3x mon on 3 machines, one just has a mon on it and nothing else, byzantine kings and all
[1:12] <Tv_> ah
[1:12] <jmlowe> generals rather
[1:12] <Tv_> jmlowe: was the other mds a standby, or also active?
[1:12] <jmlowe> probably standby
[1:13] <Tv_> if it was active, then the dfs would largely stop working when one was lost
[1:13] <Tv_> number of active mdses can't go down, currently
[1:13] <jmlowe> mds e77: 1/1/1 up {0=alpha=up:active}, 1 up:standby
[1:13] <Tv_> ok so standby
[1:13] <Tv_> i'm out of easy answers ;)
[1:13] <jmlowe> I'm doing rbd and not mounting ceph
[1:13] <Tv_> jmlowe: reading the logs may lead to enlightenment
[1:14] <Tv_> jmlowe: then you don't even need to run mdses
[1:14] <gregaf> what's the rest of the output of ceph -s?
[1:14] <sagewk> tv_: btw this won't break what you have now, 'osd create' still works, just not 'osd create 123'
[1:14] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[1:14] <Tv_> sagewk: oh i want the uuid stuff ;)
[1:15] <jmlowe> 2012-05-01 19:15:00.326169 pg v621877: 2376 pgs: 2376 active+clean; 2115 GB data, 4247 GB used, 6494 GB / 11177 GB avail
[1:15] <jmlowe> 2012-05-01 19:15:00.336978 mds e77: 1/1/1 up {0=alpha=up:active}, 1 up:standby
[1:15] <jmlowe> 2012-05-01 19:15:00.337123 osd e1542: 12 osds: 12 up, 12 in
[1:15] <jmlowe> 2012-05-01 19:15:00.337299 log 2012-05-01 19:00:20.927741 osd.11 150 : [INF] 2.36 scrub ok
[1:15] <jmlowe> 2012-05-01 19:15:00.337420 mon e1: 3 mons at {alpha=,beta=,gw48=}
[1:15] <gregaf> Tv_: sagewk: these look like okay doc changes? https://github.com/ceph/ceph/commit/1ecfc1a8c022ce3426993b213bfbe4ddef31e50e
[1:15] <gregaf> jmlowe: oh, you turned the OSDs back on?
[1:15] <jmlowe> yeah, needed to get things going again
[1:16] <gregaf> okay, no real-time diagnosis for you then ;)
[1:16] <Tv_> gregaf: you might want to avoid saying "local IP", there's a couple of different things that might mean
[1:17] <Tv_> gregaf: just because ipv6 talks a lot more about link-local etc addresses
[1:17] <jmlowe> yeah, terminal buffer is too small to catch the full ceph -s when it was broken
[1:17] <Tv_> gregaf: i think it's missing ``mon host``, btw
[1:18] <gregaf> ?
[1:18] <gregaf> that's definitely not a required field, although it might be optional in the config file
[1:18] <sagewk> gregaf: i think it should be a positive instead of a negative statement...
[1:18] <Tv_> gregaf: just saying there's one more way to find monitors, these days
[1:19] <sagewk> "if you provide an ip via -m ... on the local host it will bind to it" instead of "you can't ..."
[1:19] <Tv_> gregaf: the note could use a paragraph break before it and some clarification
[1:19] <gregaf> Tv_: so by "local IP" I mean "something that turns up from getifaddrs"
[1:19] <gregaf> is there a better way of saying that?
[1:19] <jmlowe> gregaf: got this coming back up
[1:19] <jmlowe> 2012-05-01 17:29:54.915547 pg v619771: 2376 pgs: 2 inactive, 1146 active+clean, 93 active+clean+replay, 144 peering, 2 active+recovering+degraded+remapped+backfill, 27 remapped, 93 active+remapped, 425 down+peering, 182 active+degraded, 27 down+replay+peering, 123 remapped+peering, 33 down+remapped+peering, 36 active+recovering, 11 active+recovering+remapped+backfill, 32 active+recovering+degraded+backfill; 2051 GB data, 3387 GB used, 46
[1:19] <jmlowe> 2012-05-01 17:29:54.922243 mds e74: 1/1/1 up {0=alpha=up:replay}, 1 up:standby
[1:19] <jmlowe> 2012-05-01 17:29:54.922298 osd e1506: 12 osds: 10 up, 12 in
[1:19] <sagewk> it's a "you can do this", not a "don't do this"
[1:20] <Tv_> gregaf: why --auth-supported on the command line?
[1:20] <Tv_> wouldn't that be in ceph.conf
[1:20] <gregaf> Tv_: because if you are using auth without a ceph.conf you need to specify it
[1:20] <gregaf> and all the rest of that document is doing no-config-file setup
[1:20] <Tv_> gregaf: oh without ceph.conf.. where's that implied?
[1:20] <Tv_> oh
[1:21] <gregaf> well it's specifying... everything else
[1:21] <Tv_> i don't think that was really the biggest point there
[1:21] <gregaf> or at least that's how I read it
[1:21] <gregaf> it's something I ran into ;)
[1:21] <gregaf> and I'm definitely not a tech writer, ugh
[1:22] <gregaf> can pull it out if you prefer
[1:22] <Tv_> gregaf: it seems like clutter in this context
[1:22] <Tv_> i do appreciate copy-pastable examples
[1:22] <Tv_> but i think those can assume some setup work
[1:23] <gregaf> sagewk: I really don't think automatically binding to local addrs is helpful in this context -- if you have a pre-filled-out list of things then you're going to have pre-distributed a filled-in config file, etc
[1:23] <Tv_> gregaf: i've already said this before, i'd rather make cephx mandatory than clutter everything with explanations doing it both ways
[1:24] <Tv_> sagewk, gregaf: i think the problem is -m should NOT behave exactly like the "mon addr" list in ceph.conf
[1:24] <sagewk> gregaf: yeah let's leave it off as it'll be default Real Soon Now
[1:24] <Tv_> -m should be "talk to these guys"
[1:24] <gregaf> Tv_: we'd need to write the docs first -- if we want to do that I don't mind it, but it would have to be in a single sprint and if we did it without the docs we'd regret it for *months*
[1:24] <Tv_> not "you're one of these guys"
[1:24] <Tv_> gregaf: all the stuff in doc/ i wrote assumes cephx
[1:24] <gregaf> Tv_: I was making this argument as well
[1:24] <Tv_> gregaf: but it tells you to put it in ceph.conf
[1:25] <gregaf> (the -m one)
[1:26] <sagewk> -m is "these are the monitors"
[1:26] <Tv_> sagewk: "... you should talk to"
[1:29] <sagewk> the central questions with monitor bootstrap are (1) who are the monitors, and (2) am i one of them. who i should talk to depends on the answer to 2, which depends on 1.
[1:29] <Tv_> you would really do well to think of "mon hosts" and "-m" as seeds: http://www.datastax.com/docs/1.0/configuration/node_configuration#seeds
[1:29] <sagewk> if we require the caller to tell you that, it makes chef's life harder
[1:29] <Tv_> sagewk: you can't even assume -m to be complete, so being included in it is meaningless
[1:29] <sagewk> the whole point here was to be able to specify 'mon host = foo,bar,baz' and that was enough.
[1:30] <sagewk> for bootstrap you have to assume that <whatever the input is> is complete for it to work.
[1:30] <Tv_> sagewk: so for the purposes of bootstrap, if i'm foo, i'll just need to fall back to bar to get the actual monmap
[1:30] <sagewk> in any other situation, they're seeds.
[1:30] <Tv_> oh right you're talking about initial bringup
[1:30] <Tv_> greg is editing docs for adding nodes
[1:31] <sagewk> and it will do that, if bar already formed a quorum and started. but for mkfs/bootstrap, they will form a new quorum once they hit a majority, etc...
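Sagewk's two bootstrap questions reduce to a membership test against the seed list. A toy sketch of that decision, assuming (as the bootstrap code must) that the seed list is complete; matching on the full addr:port string sidesteps the ip-only "dev setup trap" discussed just below this exchange:

```python
def mon_bootstrap_role(my_addrs, seed_addrs):
    """(1) who are the monitors: the seed list ('mon host' / -m).
    (2) am i one of them: does any of my addresses appear in it.
    Returns a role and the peers to contact for quorum/monmap."""
    i_am_mon = any(a in seed_addrs for a in my_addrs)
    peers = [a for a in seed_addrs if a not in my_addrs]
    return ("monitor" if i_am_mon else "client", peers)
```

In any situation other than mkfs/bootstrap the list is only a set of seeds, so inclusion in it means nothing; that is the distinction Tv_ is drawing.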
[1:31] <sagewk> that's what this code is for :)
[1:31] <Tv_> hrmm yeah
[1:31] <Tv_> but the local ip autobind magic thing is a trap
[1:32] <Tv_> it'll only trigger in dev setups (>1 mon on one host)
[1:32] <sagewk> yeah
[1:33] <sagewk> could make an additional flag for the mkfs case, and only do it then.. :/
[1:33] <Tv_> uhh doesn't the new mon ("me") already know what port it'll be on?
[1:33] <Tv_> so having me:6890 and me:6891 in the list shouldn't be confusing...
[1:34] <Tv_> or was the mon port also a dynamic magic thing?
[1:34] <gregaf> Tv_: if there isn't a specified port it'll try and take 6789
[1:34] <Tv_> gregaf: but did you try to run two mons on the same host without specifying ports?
[1:34] <Tv_> i guess this comes down to, greg ran it in dev setup without a ceph.conf
[1:34] <gregaf> Tv_: oh, don't think so
[1:35] <gregaf> I'm pretty sure Carl did this too
[1:35] <Tv_> oh
[1:35] <gregaf> it's the only way I can find to make sense of one of his reports
[1:35] <sagewk> tv_: you can use wip-osd-uuid for testing
[1:36] <Tv_> sagewk: sweet
[1:36] <sagewk> tv_: i need to resolve the osdmap compat stuff a bit better, though, so don't merge it or anything
[1:36] <gregaf> Anyway, people shouldn't do it I guess, but it's really obnoxious to track down if you don't know what you're doing, so it needs a warning
[1:36] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[1:36] <Tv_> gregaf: the way ceph is, the docs will be nothing but warnings
[1:36] <gregaf> heh
[1:36] <Tv_> gregaf: the correct answer just isn't "add a warning to docs"
[1:41] <sagewk> tv_, gregaf: i think just making it match the port as well as the ip will make the trap go away
[1:42] <Tv_> sagewk: but when it goes through the list asking "am i this guy", it doesn't know
[1:42] <Tv_> sagewk: we talked about it, this only triggers on dev setups (>1 mon per host), gregaf might add a doc note to not run >1 mon per host
[1:43] <sagewk> yeah
[1:43] <sagewk> that works for me :)
[1:43] <Tv_> and any clarification on that then belongs in doc/dev/
[1:46] <elder> Duhhhhh. I finally figured out why I wasn't getting debug messages.
[1:46] <elder> I was trying to enable them before the module was actually loaded.
[1:46] <elder> Running things via teuthology, and was trying to set everything up before I started my run, to no avail.
[1:47] <sagewk> elder: ha, i've done that
[1:47] <elder> Now it's working fine. But I think it's so damned slow that teuthology gives up on me.
[1:47] <elder> Now that I know what's wrong I can fine-tune things a bit.
[1:49] <gregaf> jmlowe: so it looks like you only detected 2 of the OSDs as down (out of 6 down, right?), and some of the PGs hadn't finished transferring but they were in progress
[1:49] <gregaf> if you had logs you could look into the failure reports and see why they didn't all get marked down properly
[1:49] <Tv_> elder: yeah you need to throttle down what messages to get.. printk ain't a speedy channel
[1:50] <elder> I know.
[1:50] <elder> But it's the only way I can also get kdb access.
[1:50] <elder> printk() is even worse, it preempts the world.
[1:50] <elder> Or rather, locks it out.
[1:50] <elder> I've had time-of-day clocks get very slow on a noisy boot with a slow console line.
[1:51] <jmlowe> gregaf: looks like we had a split brain situation here, right before it went down I've got stuff in syslog about not being able to talk to the osd's on the other node
[1:52] <jmlowe> makes some sense, we lost a line card in one of our f10's and had to do some reconfiguring to run on half the trunked 10GigE lines
[1:57] <gregaf> huh, can't think of anything offhand but I've got to run, take it up with somebody else if you need to :)
[1:57] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[1:57] * sboyette (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[1:57] * mkampe (~markk@aon.hq.newdream.net) has joined #ceph
[1:57] * mfoemmel (~mfoemmel@chml01.drwholdings.com) has joined #ceph
[1:57] * darkfader (~floh@ has joined #ceph
[1:57] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) has joined #ceph
[2:06] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:13] * Ryan_Lane1 (~Adium@ has joined #ceph
[2:13] * Ryan_Lane (~Adium@ Quit (Read error: Connection reset by peer)
[2:15] * Tv_ (~tv@aon.hq.newdream.net) Quit (Quit: Tv_)
[2:32] * yoshi (~yoshi@p3167-ipngn3601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:39] * lofejndif (~lsqavnbok@82VAADJGA.tor-irc.dnsbl.oftc.net) has joined #ceph
[3:00] <Qten1> hi guys, does ceph have an option for offsite async replication?
[3:01] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[3:03] * lofejndif (~lsqavnbok@82VAADJGA.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[3:03] * loicd (~loic@magenta.dachary.org) has joined #ceph
[3:09] * cattelan is now known as cattelan_away
[3:10] <elder> nhm, you seem to know a bit about performant SSD's. For my laptop (which I think only supports 3.0 Gbit SATA) do you have any strong feelings about a particular brand, or line, or anything to help narrow my search?
[3:12] <jefferai> elder: check anandtech for a lot of detailed reviews
[3:12] <jefferai> but look at Samsung 830s
[3:12] <ajm> i'm a huge fan of intel (320m) but it might be overkill for a laptop
[3:13] <jefferai> yeah, intel is nice but tends to be pricey
[3:13] <jefferai> samsung has a nice price/performance ration
[3:13] <jefferai> ratio
[3:13] <jefferai> and a long history of reliability
[3:13] <elder> Looks like Samsung 830 supports 6 Gbps--overkill that I don't want to pay for.
[3:13] <elder> Samsung is a pretty big flash manufacturer.
[3:13] <jefferai> define "pay for"
[3:14] <jefferai> it highly depends what you compare it to
[3:14] <jefferai> but regardless, manufacturer-specified throughput values are relatively meaningless
[3:14] <nhm> elder: Intel are some of the best.
[3:14] <elder> Well, yes--if any of the premium price is for the 6 Gbit I don't care to pay for it. But I'll pay for reliability.
[3:14] <jefferai> again, manufacturer-specified throughput values are relatively meaningless
[3:15] <elder> I expect latency will be the biggest win for me--regardless of manufacturer.
[3:15] <jefferai> you need to look at benchmarks of behavior with different kinds of workloads, TRIM performance, IOops, etc
[3:15] <jefferai> nhm: so you're around :-)
[3:15] <jefferai> I was pointed your way
[3:15] <jefferai> regarding some benchmarks
[3:15] <nhm> jefferai: just got done putting the kids to bed. :)
[3:15] <nhm> jefferai: uhoh. :)
[3:15] <jefferai> heh
[3:16] <jefferai> nhm: so I'm putting together hardware on which I want to run ceph, and I have a couple questions, and you seem like the right person to ask
[3:16] <nhm> elder: price/performance I like the sandisk extreme I got, but I haven't gotten a chance to even install it yet.
[3:17] <nhm> jefferai: I can certainly try. I'm still in the middle of benchmarking stuff out myself.
[3:17] <jefferai> nhm: so I have three questions
[3:18] <jefferai> 1) CPU; 2) Memory; 3) Scaling
[3:18] <jefferai> basically, they are like this:
[3:18] <jefferai> 1) Do I want less, faster cores or more, slower cores
[3:19] <jefferai> e.g. does Ceph take advantage of mulitcore processors, or for its calculations is it better to get as much speed as possible
[3:19] <jefferai> 2) How much memory? I was told the more the better...is 128GB overkill?
[3:19] <jefferai> 3) I'm looking at having a *lot* of RADOS stores
[3:19] <nhm> jefferai: some of those questions depend on how many OSDs you plan on having per server...
[3:20] <nhm> jefferai: how many drives per node?
[3:20] <jefferai> well
[3:20] <jefferai> 20 SSD and 45 HDD
[3:20] <jefferai> but
[3:20] <jefferai> I probably want each SSD to be its own object store
[3:20] <jefferai> and possibly some of the HDDs to be the same
[3:20] <jefferai> and I'm not sure how that affects things
[3:21] <nhm> jefferai: hrm, ok. How many controllers and what kind of networking?
[3:21] <jefferai> um
[3:21] <jefferai> well, networking, either 10GbE or multiple bonded 10GbE links
[3:21] <jefferai> how many controllers...dunno
[3:22] <nhm> or perhaps I should ask, what is the expected level of throughput to each node?
[3:22] <jefferai> I'm probably going to have four hardware nodes to start
[3:22] * aliguori (~anthony@ Quit (Read error: Operation timed out)
[3:22] <jefferai> so each capable of up to 20 SSD and 45 HDD
[3:22] <jefferai> and I want three copies of everything
[3:22] <jefferai> I forget if that is a replication factor of 2 or 3
[3:22] <nhm> jefferai: I ask because 45HDD and 20SSDs per node is a lot of IO for one box.
[3:23] <jefferai> That's true; I don't expect all of it to be active all the time
[3:23] <jefferai> Much of the HDDs will go to relatively static storage
[3:23] <jefferai> the SSDs will be active, but currently their workload is handled by a RAID-10 of disks over 2GbE
[3:24] <jefferai> (bonded GbE iSCSI)
[3:24] <jefferai> not handled well, but handled
[3:24] <jefferai> RAID-10 of HDDs, I mean
[3:24] <nhm> jefferai: some of the things you want to keep in mind is that on every node you'll be writing to the journal and the data disks at the same time.
[3:24] <jefferai> I'm not entirely sure what that means for me
[3:25] <nhm> jefferai: well, just that you'll need to make sure your controller(s) can keep up at twice the speed of whatever networking you have in place.
[3:26] <jefferai> ah
[3:26] <jefferai> So if they can't, I don't need that much networking
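The 2x rule nhm describes is simple arithmetic; a back-of-envelope sketch (taking 10GbE as 10 Gbit/s = 1.25 GB/s and ignoring inter-OSD replication traffic):

```python
def required_controller_bw_gbs(network_gbps):
    """Every byte ingested over the network is written twice on the
    node (once to the journal, once to the data disk), so the disk
    controller needs roughly 2x the network ingest rate."""
    ingest_gbs = network_gbps / 8.0  # Gbit/s -> GB/s
    return 2 * ingest_gbs
```

So a single 10GbE link already implies roughly 2.5 GB/s of sustained controller bandwidth on each node.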
[3:26] * eternaleye_ is now known as eternaleye
[3:27] <nhm> and that your journals and drives are balanced correctly. Some people choose to put journals on the same drives as the OSDs (best to have a separate partition), but you can also put journals on a separate drive (Right now I'm testing 1 SSD supplying 2 OSDs with journals).
[3:27] <elder> That's what you want if you want to perform.
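The 1-SSD-per-2-OSDs journal layout nhm is testing could look roughly like this in ceph.conf; the device paths are made up for illustration:

```ini
; two data HDDs sharing one journal SSD, one partition each
[osd.0]
    osd data = /var/lib/ceph/osd/ceph-0    ; data dir on an HDD (e.g. /dev/sdb)
    osd journal = /dev/sdc1                ; partition 1 of the shared journal SSD
[osd.1]
    osd data = /var/lib/ceph/osd/ceph-1    ; data dir on a second HDD (e.g. /dev/sdd)
    osd journal = /dev/sdc2                ; partition 2 of the same SSD
```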
[3:28] <jefferai> hm
[3:28] <jefferai> so one of the limitations I'm under is that I need to be able to know at any given point exactly which drives contain what data
[3:29] <jefferai> which is why I'm thinking of one OSD per SSD
[3:29] <nhm> jefferai: You mean which drives contain which 4MB chunks?
[3:30] <jefferai> no, contain which data
[3:30] <jefferai> the data we're storing will belong to different downstream customers
[3:31] <jefferai> we have to be able to prove that, if a customer requests their data to be deleted, that all of the relevant data was deleted
[3:31] <jefferai> the easiest way by far is to pull the drives and give those drives to them
[3:31] <jefferai> so I was thinking that each SSD could have its own store
[3:33] <nhm> jefferai: ceph distributes 4MB chunks of data to different OSDs in a pseudo-random fashion.
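nhm's "pseudo-random 4MB chunks" can be illustrated with a stripped-down stand-in for the real pipeline (object -> hash -> placement group -> CRUSH -> OSDs). This toy version uses plain md5 instead of CRUSH, but it shows why object ownership isn't pinned to a single drive by default:

```python
import hashlib
import struct

def place_object(obj_name, num_pgs, replicas, osd_ids):
    """Toy placement: object name hashes to a pg, and the pg picks a
    deterministic pseudo-random set of replica OSDs. Stand-in for
    Ceph's real hash + CRUSH mapping."""
    h = int.from_bytes(hashlib.md5(obj_name.encode()).digest()[:4], "little")
    pg = h % num_pgs
    # rank all osds by a pg-dependent hash and take the first N as replicas
    ranked = sorted(
        osd_ids,
        key=lambda o: hashlib.md5(struct.pack("<II", pg, o)).digest(),
    )
    return pg, ranked[:replicas]
```

The mapping is deterministic (same object always lands in the same place) but spread across all OSDs, which is exactly what makes "all of this customer's data on one drive" a non-default request.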
[3:33] <jefferai> I think it's quite possibe I'm not using the right terminology
[3:34] <nhm> jefferai: perhaps, but I think what you would like is to be able to have entire files be distributed to specific OSDs based on their ownership? Is that correct?
[3:34] <jefferai> is there not a way to say "this disk over here, that's an independent data store, don't mix it with any others"
[3:34] <jefferai> well
[3:34] <jefferai> not entirely
[3:34] <jefferai> more like
[3:35] <jefferai> I want to expose RBDs where all of the underlying objects are on one drive, and no other RBDs have objects on that drive
[3:36] <jefferai> maybe what I really want is to run lots of different clusters
[3:36] <jefferai> each one relatively small
[3:36] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[3:36] <nhm> jefferai: Hrm... Probably best to talk to some of the devs in here when they come back. You could always set it up as seperate clusters, but that might not be the best way.
[3:37] <jefferai> okay
[3:37] <nhm> jefferai: getting back to your performance questions though: For CPU, we've recommended 1 core per OSD in the past.
[3:37] <jefferai> nhm: alternately, a custom crushmap may do what I need
[3:37] <jefferai> although -- maybe not if that's just to control which nodes get what
[3:37] <jefferai> not what disks get what
[3:37] <nhm> jefferai: It might be possible, I don't want to say as I'm not entirely sure. :)
[3:37] <jefferai> heh, okay
[3:37] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has left #ceph
[3:38] <jefferai> I guess I'll try the devs and see what they say in terms of whether this can be done
[3:38] <jefferai> and then I can run their recommendations by you and see what your guesses are for hardware needs
[3:38] <nhm> jefferai: For hardware they may point you back at me. ;)
[3:39] <nhm> jefferai: I'm not entirely certain that 1 core per OSD is actually needed depending on the kind of core.
[3:40] <nhm> jefferai: For memory, more is certainly always better. Anything you have spare will be used for FS cache.
[3:41] <nhm> jefferai: 65 drives in 1 box is bigger than anything I think we've tested in house (especially with 20 SSDs!).
[3:41] <jefferai> eh
[3:41] <jefferai> heh
[3:41] <jefferai> I don't plan on ever saturating the box
[3:41] <nhm> jefferai: That would be something of a unique testbed. :D
[3:41] <jefferai> IO wise that is
[3:41] <jefferai> except, possibly, during testing
[3:41] <jefferai> I don't anticipate a workload outside of that that will do so
[3:42] <jefferai> sagewk indicated to me that this isn't all that unique/large an install
[3:42] <jefferai> as for cores -- I could get 2x16core Opterons
[3:42] <jefferai> or I could get 2x4core much faster opterons
[3:42] <jefferai> (or 2x6core much faster Xeons)
[3:42] <jefferai> but that's why I'm asking about cores vs. speed, it's a tradeoff
[3:43] <nhm> jefferai: I've managed lustre storage that is similar (actually a bit more drives) per node than what you are talking about. It's doable, but can require some finesse to get going well.
[3:43] <jefferai> huh
[3:44] * jefferai wonders what happens when you go past /dev/sdz
[3:44] <nhm> jefferai: I think it starts doubling up letters
[3:44] <jefferai> that would be my guess
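nhm's guess is right: past /dev/sdz the kernel doubles up letters, spreadsheet-column style (sda..sdz, sdaa, sdab, ...). A quick sketch of the naming scheme:

```python
def sd_name(index):
    """Map a 0-based disk index to its sd device name using
    bijective base-26: 0 -> sda, 25 -> sdz, 26 -> sdaa."""
    suffix = ""
    n = index + 1
    while n:
        n, r = divmod(n - 1, 26)
        suffix = chr(ord("a") + r) + suffix
    return "sd" + suffix
```

So jefferai's 65-drive box would top out around sd_name(64), i.e. /dev/sdbm.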
[3:47] <nhm> jefferai: for cpus, you have some tradeoffs on intel vs AMD. On the AMD side, with interlagos you get more cores, but the hypertransport links aren't fully connected and running at half throughput so your NUMA performance is going to be slower per core than on Intel.
[3:48] <jefferai> hrm
[3:48] <jefferai> I haven't watched procs closely in a while
[3:48] <jefferai> why are the HT links running at half throughput?
[3:48] <jefferai> and, the bigger question: does it matter? :-)
[3:49] <jefferai> e.g. if the network latency is by far the largest part of the equation...
[3:49] <nhm> jefferai: Interlagos is kind of (but not exactly) like 2 8 core chips glued together into one socket.
[3:49] <jefferai> ah
[3:49] <nhm> jefferai: the NUMA setup connecting all of those chips together is kind of complicated but it looks more like a hypercube than fully connected.
[3:50] <nhm> jefferai: as for it mattering, probably not, but maybe?
[3:50] <jefferai> :-)
[3:51] <nhm> jefferai: If you are doing a single 10GbE link, you might not have enough IO to suffer from it, unless things are horribly configured and you've got all kinds of IO going all over the place to non-optimal numa nodes.
[3:51] <jefferai> I probably wouldn't know enough to avoid that
[3:51] <jefferai> or know enough not to avoid it
[3:52] <jefferai> I don't deal with that kind of stuff very often
[3:52] <jefferai> but I think I have a way forward
[3:52] <jefferai> which is -- talk to the devs and find out the feasibility of having a way to know what data is where
[3:52] <jefferai> whether it means several smal clusters or a crushmap or or or
[3:52] <nhm> I guess if it were me I'd go for an intel setup just to minimize some of those problems.
[3:53] <jefferai> ok, good to know
[3:53] <jefferai> and once they have some recommendations, then come back to you and see what you think :-)
[3:53] <jefferai> nhm: this would be on the storage nodes, though, not so much the compute nodes accessing them for VMs?
[3:54] <nhm> jefferai: Yeah, for the compute nodes use whatever you like. Even on the storage nodes, this may not matter given the performance you are targeting. A single 8 core cpu may be enough.
[3:55] * cattelan_away is now known as cattelan
[3:55] <nhm> jefferai: The simpler the design, the less headaches you will have.
[3:55] <jefferai> Yep
[3:55] <jefferai> Like I said, I am not currently anticipating anywhere close to saturating the I/O on these things
[3:55] <jefferai> mostly the HDDs are for relatively static storage
[3:58] <jefferai> So I imagine running ceph-mon and ceph-osd on the storage boxes
[3:58] <jefferai> I don't anticipate using cephfs for now
[3:58] <jefferai> just rbd and possibly the swift bits
[4:01] <nhm> jefferai: I guess given what you've been saying, I think you should look at maybe some reasonably fast 6 or 8 core intel boxes.
[4:01] <jefferai> OK
[4:01] <nhm> jefferai: single CPU to minimize numa headaches since you don't want to deal with that.
[4:02] <nhm> jefferai: This is assuming you are just targeting 10GbE per node.
[4:03] <jefferai> I think I'd have the capability to target 20GbE (via bonding or multipath)
[4:03] <jefferai> I don't anticipate saturating that, or at least not often
[4:04] <jefferai> mainly I'd do that for intra-cluster replication
[4:04] <jefferai> not for outside clients
[4:06] <nhm> jefferai: if the other guys think the cpu needs to be beefier, you may want to consider dual intel cpus that share a single IOH.
[4:06] <nhm> jefferai: sandy bridge will be a bit different as the PCIE controller is integrated, but at least for last gen chips you'll want to avoid dual IOH boards.
[4:07] <jefferai> yeah, if I go intel I'd get the newer sandy bridge xeons
[4:07] <jefferai> not going to replace the hardware for a while, might as well
[4:07] <jefferai> thanks for all the help, don't want to eat more of your time now
[4:07] <jefferai> it's much appreciated
[4:08] <jefferai> I'll talk to sagewk and others some more
[4:08] <jefferai> see what they say about data placement
[4:08] <nhm> jefferai: np, glad to help. Sorry we don't have all the answers yet. We want to get some of this kind of stuff published at some point. :)
[4:08] <jefferai> yeah
[4:08] <nhm> jefferai: too much to do, too little time!
[4:08] <jefferai> well
[4:09] <jefferai> as I said -- it's possible that I could get some funding to help out here, with benchmarking and testing
[4:09] <jefferai> if there is some kind of recognized output out of it
[4:09] <jefferai> a conference talk, a paper
[4:09] <jefferai> it's unlikely I'd get funding without the expectation of publication
[4:09] <jefferai> sadly
[4:09] <nhm> jefferai: What kind of organization do you work for?
[4:09] <jefferai> a lab at a university
[4:10] <nhm> jefferai: Cool. I used to work for the University of Minnesota up until 2 months ago. :)
[4:10] <jefferai> ah, I seem to remember something about that
[4:10] <jefferai> where do you work now?
[4:10] <elder> :)
[4:11] <jefferai> heh
[4:11] <nhm> jefferai: Ceph Co. :)
[4:11] <jefferai> ah, there's a Ceph Co. now?
[4:11] <jefferai> I thought that was called "Dreamhost"
[4:11] <elder> Sort of.
[4:11] <elder> We're a piece of Dreamhost.
[4:11] <jefferai> I see
[4:11] <nhm> jefferai: technically New Dream Network
[4:11] <elder> Don't even go there.
[4:11] <jefferai> heh
[4:11] <elder> It's all confusing to me, still.
[4:12] <jefferai> Well, congrats
[4:12] <jefferai> elder: so you're one of the devs too?
[4:12] <elder> Yes.
[4:12] <jefferai> Cool
[4:12] <elder> I'm pretty focused on the Linux kernel stuff, but I don't expect it to be that way forever.
[4:12] <elder> But that's my area of expertise.
[4:12] <elder> So Linux client.
[4:12] <jefferai> Any opinions on all of the (crap? excellent?) advice that nhm was giving above?
[4:13] <elder> I would defer to Mark (nhm).
[4:13] <jefferai> Ah
[4:13] <elder> I wasn't paying close attention but he's a pretty bright guy.
[4:13] <jefferai> nhm: so which devs would you recommend I ask?
[4:13] <jefferai> elder: sure, he just recommended asking other devs some of the questions
[4:13] <elder> Well, I'm a different sort of dev, I guess.
[4:14] <elder> I think he's talking more about the server side, not the client.
[4:14] <elder> Let me scan what you guys were talking about...
[4:14] <jefferai> elder: the basic question is:
[4:14] <elder> Holy crap, you guys are chatterboxes.
[4:14] <jefferai> well
[4:14] <jefferai> heh
[4:14] <jefferai> I'll recap
[4:14] <nhm> elder: ;)
[4:15] <jefferai> essentially, I need a way to know, and ideally control, what data goes where
[4:15] <jefferai> by which I mean
[4:15] <jefferai> if I have five drives each on three nodes
[4:16] <jefferai> I'd like to be able to say "I want data for this RBD device to be on drive 1 on each node, data for this other RBD device to be on drive 2 on each node", etc
[4:16] <elder> OK.
[4:16] <jefferai> because I need, for auditing reasons, to be able to identify at any given time what data is on what disks
[4:16] <nhm> jefferai: Tough to say. I'd just mention it in channel during normal pacific time business hours.
[4:16] <nhm> jefferai: people will probably chime in.
[4:16] <jefferai> and the easiest way is simply to control ahead of time which data is on which disks
[4:17] <jefferai> so my thought was that I could have different pools on each disk and prevent them from mixing
[4:17] <jefferai> but, I don't know if that's possible
[4:17] <elder> I believe what you are talking about is entirely possible, and is done via defining the crush rules properly.
[4:17] <jefferai> okay
[4:17] <jefferai> I was thinking maybe that would be the case
[4:17] <elder> That is an area I don't know a lot about yet, but am eager to dig into it at some point. But in any case, generally:
[4:18] <nhm> elder: yeah, I don't know enough about the placement rules to know if that's possible so I didn't really want to say anything.
[4:19] <elder> I think you define the way your hardware is organized, and doing so is normally sort of organized based on ensuring things that store the same or related data are kept separate.
[4:19] <elder> So you'd define which storage sits in which position in a cabinet, and which row that cabinet sits in, and so on.
[4:19] <nhm> elder: btw, I have some movie output from seekwatcher I just generated for xfs osd writes. For some reason it seems like there are a ton of seeks during these writes (might be what's going on with the performance issues I'm seeing on the burnupi nodes).
[4:19] <jefferai> elder: Okay; I'll look into crushmap bits more closely...I looked at e.g. http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH but didn't seem to see what I needed
[4:20] <elder> Then the crush rules will ensure things are laid out such that they don't allow for one failure to take out a complete hunk of data.
[4:20] <jefferai> right, I was hoping the data replication would take care of that
[4:20] <elder> What you're talking about is sort of oriented differently, but I suspect someone (like sage, who typically signs on in the evening) might be able to advise on how to go about it.
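The layout-and-rules idea elder is describing can be sketched as a fragment of a decompiled CRUSH map. This is only an illustration under stated assumptions: the bucket name `drive1group`, the osd numbers, and the rule name are all hypothetical, following the 2012-era `crushtool -d` text format.

```
# Hypothetical fragment of a decompiled CRUSH map (crushtool -d compiled.map).
# A bucket groups the "drive 1" OSDs from each host; a rule pins a pool's
# placement to that bucket so its data only lands on those disks.
root drive1group {
        id -10
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000   # drive 1 on node1
        item osd.5 weight 1.000   # drive 1 on node2
        item osd.10 weight 1.000  # drive 1 on node3
}
rule drive1_rule {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take drive1group
        step choose firstn 0 type osd
        step emit
}
```

A pool configured to use ruleset 1 would then replicate only across those three drives, which is the kind of pinning jefferai is after.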
[4:21] <jefferai> okay
[4:21] <jefferai> I'll keep an eye out for him, talked to him a bit earlier today
[4:21] <elder> Sage ultimately knows the most about everything ceph.
[4:21] <jefferai> elder: and then my other question that we chattered on about was CPU/memory sizing
[4:21] <elder> I'd like to refer you to someone else.
[4:21] <jefferai> but I'll come back around to that once I get the skinny on how I should organize things in ceph
[4:22] <elder> CPU memory sizing I'm sure that Mark gave you good advice.
[4:22] <nhm> elder: yeah, I hate telling people to bug Sage but he just knows too much ;)
[4:22] <jefferai> elder: sure -- some of it was "it depends" based on exactly how things shake out with my first question
[4:22] <elder> nhm, is there a small enough version of the movie that I can look at it on the off chance I have insight?
[4:23] * Ryan_Lane1 (~Adium@ Quit (Quit: Leaving.)
[4:23] <elder> I think that ceph is aggressively syncing, which is not going to do well for XFS.
[4:23] <elder> The XFS journal really ought to be on a separate drive too, if that's not already the case.
[4:24] <nhm> jefferai: One thing to think about too is more smaller nodes. You lose some density, but you gain some nice things: simpler debugging, better fault tolerance (less data redistributed during failures), more network IO per OSD, etc.
[4:24] <elder> Otherwise journal writes will be fighting with data and other metadata writes.
[4:24] <elder> And that means lots of seeks.
[4:24] <nhm> elder: yes, I still haven't even looked at them yet, but will send them to you as soon as I do.
[4:24] <jefferai> elder: for instance, from the crushmap wiki page: "For other situations, it is problematic. For example, if each host has two disks, we may want to run two independent cosd daemon for each disk. That way, a single disk failure only takes out a single disk's worth of data."
[4:25] <nhm> elder: yeah, journals are on SSDs
[4:25] <jefferai> so if it's possible to run many cosd daemons, that may be what I have to do for the data layout I want
[4:25] <jefferai> but then that may affect what I need in terms of memory/cpu
[4:25] <elder> jefferai, maybe.
[4:25] <jefferai> so, need to get that figured out first
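The one-daemon-per-disk setup from the wiki passage comes down to listing each daemon in ceph.conf. A minimal sketch, assuming made-up host names, mount points, and journal paths:

```
; Hypothetical ceph.conf fragment: one ceph-osd daemon per data disk.
; Host name, data directories, and journal paths are illustrative only.
[osd.0]
        host = node1
        osd data = /data/osd.0        ; disk 1
        osd journal = /ssd/journal.0  ; journal partition on SSD
[osd.1]
        host = node1
        osd data = /data/osd.1        ; disk 2
        osd journal = /ssd/journal.1
```

Each `[osd.N]` section costs its own memory and CPU, which is why the daemon count feeds back into the sizing question.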
[4:25] <jefferai> will keep an eye out for sage
[4:25] <elder> I really can't offer much help.
[4:25] <jefferai> nhm: yeah -- I could think about that. One problem is that I have limited data center space
[4:25] <elder> I feel like it's do-able, and it may take thinking about how to use crush in a different way from typical.
[4:26] <nhm> jefferai: with that big of a box you'll almost certainly be network limited, so even if you have lots of osd daemons they'll probably be sitting around a lot.
[4:26] <elder> I.e., it seems like it may be a case where you need to bend it to do your will, because the defaults are oriented toward slightly different goals.
[4:26] <jefferai> elder: sure
[4:27] <nhm> jefferai: I think the conventional advice on the wiki is assuming you want all of your drives running at full-speed all the time.
[4:27] <jefferai> nhm: network limited in what way -- do you mean due to client access, or due to replication traffic?
[4:29] <jefferai> I really don't anticipate client access being problematic; most client access will be to dump a bunch of data in once and not really deal with it again for a long time, if ever...or, very bursty traffic
[4:29] <jefferai> the majority of what I'll be doing with these boxes is already being handled with iSCSI over a couple gige
[4:29] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[4:30] <jefferai> so from my perspective what I'm worried about is whether the bandwidth is enough for the replication and administrative traffic
[4:30] <jefferai> that ceph uses/needs
[4:31] <nhm> jefferai: I'm just saying that you may saturate your network throughput before CPU utilization becomes a problem, even with tons of OSDs.
[4:31] <jefferai> ah, ok
[4:31] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:31] <jefferai> and even with tons of osd daemons?
[4:31] * jefferai isn't sure of whether by "tons of OSDs" you mean daemons or simply the storage pools themselves
[4:32] * loicd (~loic@magenta.dachary.org) has joined #ceph
[4:32] <nhm> jefferai: I was thinking daemons. I guess the question is how much does the CPU utilization scale with OSD throughput.
[4:32] <jefferai> yeah
[4:33] <jefferai> ok -- certainly have a lot to think about
[4:34] <jefferai> and will try to contact sage for sage advice
[4:34] <nhm> jefferai: might be some good info on the mailing list. I'll try to take a bit of time to look through it.
[4:34] <nhm> jefferai: will you be around tomorrow?
[4:35] <jefferai> sure will
[4:36] <nhm> cool. Time for me to head to bed. Have a good evening
[4:36] <jefferai> me too
[4:36] <jefferai> thanks!
[4:37] <elder> Raining nhm?
[4:37] <elder> Just starting here.
[4:41] <joao> strangely enough, I only noticed how late it is when I saw nhm saying he's going to bed...
[4:42] <elder> Quite late for you joao
[4:42] <joao> yeah, looks like I'm naturally shifting to your timezone
[4:42] <elder> That'll help prepare you for Saturday.
[4:42] <joao> although I still keep waking up early, which sucks
[4:43] <joao> indeed it will :p
[4:47] <joao> oh well... calling it a night
[4:47] <joao> bye
[5:32] * joao (~JL@ Quit (Ping timeout: 480 seconds)
[5:32] * aa (~aa@r186-52-164-50.dialup.adsl.anteldata.net.uy) has joined #ceph
[5:32] * renzhi (~renzhi@ has joined #ceph
[5:35] <elder> x
[5:36] <dmick> y
[5:36] <elder> ^Z
[5:36] <elder> - or -
[5:36] <elder> because we like you!
[5:36] <dmick> ~~>>~>>~>>.~~~
[5:37] <elder> Damn! Disconnected again!
[5:37] <dmick> who even remembers AT commands...
[5:37] * Josh_ (~chatzilla@ has joined #ceph
[5:37] <elder> atdt
[5:37] <elder> That's all I remember.
[5:38] <elder> I remember using an acoustic coupler on a teletype. No AT commands needed there.
[6:08] * Josh_ (~chatzilla@ Quit (Quit: ChatZilla [Firefox 13.0/20120425123149])
[7:08] * cattelan is now known as cattelan_away
[7:57] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[8:01] * Theuni (~Theuni@ has joined #ceph
[8:47] * aa (~aa@r186-52-164-50.dialup.adsl.anteldata.net.uy) Quit (Ping timeout: 480 seconds)
[9:06] * loicd (~loic@magenta.dachary.org) has left #ceph
[9:14] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[9:25] * Ryan_Lane (~Adium@c-98-210-205-93.hsd1.ca.comcast.net) has joined #ceph
[9:31] * eightyeight (~atoponce@pthree.org) Quit (Ping timeout: 480 seconds)
[9:39] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[9:46] * andreask (~andreas@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[9:53] * Ryan_Lane (~Adium@c-98-210-205-93.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:53] * Ryan_Lane (~Adium@c-98-210-205-93.hsd1.ca.comcast.net) has joined #ceph
[9:58] * Ryan_Lane (~Adium@c-98-210-205-93.hsd1.ca.comcast.net) Quit ()
[10:06] * s[X]_ (~sX]@60-241-151-10.tpgi.com.au) has joined #ceph
[10:32] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:34] * s[X]_ (~sX]@60-241-151-10.tpgi.com.au) Quit (Remote host closed the connection)
[10:40] * Qten (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[10:42] * Qten1 (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Quit: Now if you will excuse me, I have a giant ball of oil to throw out my window)
[10:55] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[10:56] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:58] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[11:07] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[11:08] * BManojlovic (~steki@ has joined #ceph
[11:09] * Theuni (~Theuni@ Quit (Ping timeout: 480 seconds)
[11:14] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[11:14] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[11:18] * Theuni (~Theuni@ has joined #ceph
[11:44] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[11:44] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[11:58] * Theuni (~Theuni@ Quit (Ping timeout: 480 seconds)
[12:01] * joao (~JL@89-181-154-158.net.novis.pt) has joined #ceph
[12:02] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[12:25] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[12:28] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[12:46] * andreask (~andreas@chello062178057005.20.11.vie.surfer.at) Quit (Quit: Leaving.)
[12:54] * Theuni (~Theuni@ has joined #ceph
[13:19] * yoshi (~yoshi@p3167-ipngn3601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[13:28] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[13:29] <nhm> good morning #ceph
[13:31] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[13:41] <BManojlovic> good day to you too :)
[13:50] * renzhi (~renzhi@ Quit (Ping timeout: 480 seconds)
[13:53] * eightyeight (~atoponce@pthree.org) has joined #ceph
[13:55] * andreask (~andreas@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[14:14] * aliguori (~anthony@ has joined #ceph
[14:24] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[14:26] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[14:59] <elder> nhm, I'm going to be off reading code. I'll come back occasionally to see if anyone's looking for me.
[14:59] <elder> Or you can ping me with a text.
[15:04] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[15:04] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[15:07] * morse (~morse@supercomputing.univpm.it) Quit (Quit: Bye, see you soon)
[15:12] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:22] * aliguori (~anthony@ Quit (Remote host closed the connection)
[15:33] * aa (~aa@r190-135-28-205.dialup.adsl.anteldata.net.uy) has joined #ceph
[15:38] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Quit: LarsFronius)
[15:43] * aa (~aa@r190-135-28-205.dialup.adsl.anteldata.net.uy) Quit (Ping timeout: 480 seconds)
[15:45] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[15:49] * aa (~aa@r190-135-28-205.dialup.adsl.anteldata.net.uy) has joined #ceph
[16:19] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[16:37] * aa (~aa@r190-135-28-205.dialup.adsl.anteldata.net.uy) Quit (Ping timeout: 480 seconds)
[16:50] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[16:54] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[17:04] * andreask (~andreas@chello062178057005.20.11.vie.surfer.at) Quit (Quit: Leaving.)
[17:08] * Theuni (~Theuni@ Quit (Ping timeout: 480 seconds)
[17:12] * cattelan_away is now known as cattelan
[17:16] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:23] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[17:23] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[17:29] <jefferai> nhm: morning nhm :-)
[17:33] <nhm> jefferai: good morning! I had a couple of thoughts last night as I fell asleep: 1) If you buy a 2 socket romley board and only put 1 6 or 8 core cpu in it, you can avoid NUMA for now but upgrade later if necessary. 2) What I told you regarding half-throughput hyper transport links for interlagos is only true on quad processor setups, not dual processor setups. I keep forgetting to qualify that.
[17:34] <nhm> jefferai: so if you buy a dual processor interlagos setup you will get full hyper transport throughput to each of the 4 "cpus" (2 per socket) except on the diagonal which is half throughput.
[17:35] <nhm> jefferai: here's a picture that explains it better (ignore the genmini nic part): http://www.prace-ri.eu/IMG/png/crayxecomputenode.png
[17:37] <nhm> this is what it looks like for quad processor setups (each link is 8-bit except inter-socket links which are 16bit+8bit): http://www.hpc2n.umu.se/sites/default/files/G34r1-4P-topo-maxio_0.png
[17:37] <nhm> sorry, intra-socket
[17:38] <jefferai> nhm: OK, I'll keep that in mind
[17:38] <jefferai> still need to get input on how the data can be partitioned
[17:38] <jefferai> I imagine that if running lots of osd daemons more cores might be useful than faster cores
[17:38] <jefferai> but not sure
[17:39] <nhm> jefferai: I'm still thinking that it probably isn't going to matter that much for the throughput you are targeting and that you are best off just avoiding NUMA headaches as much as possible, but we'll see what others have to say.
[17:40] * jefferai notes that NUMA is only a headache if he realizes that it's causing problems
[17:40] <jefferai> heh
[17:40] <jefferai> I probably wouldn't know
[17:40] <jefferai> which doesn't mean it's not affecting me
[17:40] <jefferai> but I'm not in the HPC market
[17:40] <jefferai> just not experienced with it
[17:41] <nhm> jefferai: What kind of lab do you work in?
[17:42] <jefferai> I mostly do networking stuff
[17:42] <jefferai> these boxes are because I also run some IT infrastructure that nobody else wants to run
[17:42] <jefferai> So I need to run VMs with e.g. bug trackers, a wiki, various other services on it
[17:42] <jefferai> eventually maybe host a bunch of user VMs
[17:44] <jefferai> nhm: I'm curious, since you know a lot about this: if you were going to be hosting a bunch of VMs with various types of servers on them, would you go more cores with interlagos, or fewer cores but romley?
[17:44] <nhm> jefferai: Yeah, we ended up supporting a lot of random stuff when I was in academia too. Various systems where the grad student who built and maintained it for a year left, and now no one knows what to do.
[17:44] <jefferai> I was planning on the former, but I wonder
[17:44] <jefferai> yeah, there's not much of that...I don't really end up taking over other peoples' stuff, but rather standing up services for others to use
[17:45] <jefferai> but this infra isn't going to be used for e.g. molecular simulations
[17:45] <jefferai> no mpi or anything
[17:45] <nhm> jefferai: I think it really depends on what your VMs are going to be doing. We built some cloud stuff with Intel chips, but they were doing heavy computation and we weren't planning on having that many VMs per node and had a mix of serial and parallel applications.
[17:45] <jefferai> just hosting a bunch of servers, and a bunch of storage
[17:45] <jefferai> yeah
[17:46] <nhm> jefferai: if you want lots of VMs that each have a light workload interlagos might be the better way to go.
[17:46] <jefferai> that was my thought
[17:46] <jefferai> workloads are pretty light
[17:46] <jefferai> except in rare bursty circumstances
[17:46] <jefferai> they're actually all running on 4-year old Opterons now, and compute-wise they're just fine
[17:47] <jefferai> it's the network that's the problem
[17:54] * ElusiveParticle (b89e1566@ircip1.mibbit.com) has joined #ceph
[17:54] * ElusiveParticle (b89e1566@ircip1.mibbit.com) Quit (Killed (MoranServ (Possible spambot -- mail support@oftc.net with questions.)))
[18:00] * mgalkiewicz (~mgalkiewi@ has joined #ceph
[18:00] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[18:00] <mgalkiewicz> Hi guys. I can't start one of my monitors https://gist.github.com/2577769.
[18:01] <sagewk> nhm: for the movie with all the seeks, do you know what the config was at that time? flusher off?
[18:02] <sagewk> nhm: it would be great to capture the osd logs for the same duration as the trace, so we can correlate exactly what is going on.
[18:02] <sagewk> nhm: i.e. rotate the logs, start the test and capture the trace, then rotate the logs again
[18:03] <nhm> sagewk: yeah, I was just thinking that. I've got the logs for everything, but it's in one giant log file.
[18:03] <nhm> sagewk: that's the approach I've taken in the past though with these tests.
[18:04] <nhm> sagewk: per Alex's suggestion, I'm trying with 1 concurrent thread at different request sizes, and have noticed that with the flusher off, larger writes have the problem less often, and smaller writes have it more often.
[18:05] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[18:05] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[18:06] <sagewk> nhm: that makes sense.. xfs will be more efficient with larger writes and better able to keep up with the journal.
[18:06] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[18:06] <sagewk> at some level, this problem will always be there with this config (1 disk for journal, 1 disk for data) because a sequential journal stream will always be faster than the fs.. just a question of how much.
[18:06] <sagewk> ...but 20-40mb/sec is way slower than xfs should be.
[18:06] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit ()
[18:07] <nhm> sagewk: Seems like once the data is written to the journal, we could combine writes before they get sent to the data disk?
[18:08] <nhm> At least for FS where we aren't doing the writes in parallel.
[18:10] <sagewk> there are 2 threads (by default) writing to the fs. and in general each filestore transaction is independent (a separate 4mb object, in this case)
[18:10] * Tv_ (~tv@aon.hq.newdream.net) has joined #ceph
[18:10] <sagewk> might be interesting to adjust the thread count...
[18:10] <sagewk> try 'filestore op threads = 1' or 4. need to modify the conf and restart ceph-osd, can't change that while running.
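sagewk's suggestion amounts to a one-line ceph.conf change plus a daemon restart. A minimal sketch; placing the option in a global `[osd]` section is an assumption (it could equally go in a per-daemon section):

```
; Hypothetical ceph.conf fragment: adjust the FileStore write thread count.
; Takes effect only after the ceph-osd daemon is restarted.
[osd]
        filestore op threads = 1   ; default is 2; sagewk suggests trying 1 or 4
```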
[18:11] <sagewk> i would probably start with capturing a new trace with a matching log file, so we can figure out what those reads are
[18:11] * alexxy (~alexxy@ has joined #ceph
[18:11] <nhm> Ok. What do you make of all of the little reads during the valleys in that movie I pointed out earlier?
[18:12] <nhm> yeah
[18:14] <sagewk> my guess is that they're inode updates.. once the data extents are written. but i may be wrong, i don't remember much about the xfs on-disk layout
[18:14] <nhm> sagewk: I should do this with btrfs. btrfs was actually performing worse when I originally was testing these nodes which is why I moved to xfs.
[18:15] <nhm> I should try to get some comparisons.
[18:15] <sagewk> yeah, that would also be interesting to see
[18:15] <nhm> sagewk: should I go with non-concurrent requests to start out with?
[18:15] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:15] <sagewk> fwiw, i think ag=1 will perform marginally better. and adjusting threads might improve things.
[18:16] <sagewk> i'd just stick with the default t=16.. it all gets serialized by the osd anyway
[18:16] <sagewk> the thing that will affect xfs's workload is the filestore op threads (how many concurrent write(2) calls we do)
[18:18] <dwm_> Is the recommended practice now to use btrfs's large metadata feature? (mkfs.btrfs -l 64k -n 64k)
[18:21] <sagewk> dwm_: yeah
[18:21] <sagewk> from what i hear from josef, 16k or 32k seems to be the sweet spot. we haven't evaluated that ourselves, yet.
[18:22] <Tv_> sagewk: ooh is that knowledge gathered somewhere?
[18:22] <Tv_> http://ceph.newdream.net/docs/master/rec/filesystem/#btrfs nudge nudge
[18:22] <dwm_> Hmm, it would be useful to know if there's a minimum recommended kernel for that kind of config.
[18:24] <sagewk> it was just merged in 3.3, iirc. but as i said, this is just what josef told me in passing.. we haven't evaluated performance yet.
[18:24] <dwm_> *nods* I'm currently running 3.2-series, so I'd presumably need to upgrade my test hosts to get the necessary (reliable) support.
[18:25] <sagewk> tv_: i'll make a note of it once wip-doc gets a bit more love and is merged into master
[18:26] <sagewk> did a review last night and there are still some changes needed
[18:26] <jefferai> sagewk: when you have some time, I did a bunch of chatting with elder and nhm and got a lot of questions answered but they deferred to you for some of it
[18:26] <elder> Or someone else you find suitable, sagewk.
[18:27] <elder> He wants to have careful control over where data is placed.
[18:27] <sagewk> nhm: in any case, i suggest we go a bit further to understand the xfs situation before switching gears, lest we forget everything we learned :)
[18:27] <elder> And knowledge of everywhere it is.
[18:27] <sagewk> elder, jefferai: that second part is easy, the first part is limited.
[18:27] <jefferai> okay
[18:28] <jefferai> essentially, I need to be able to audit
[18:28] <jefferai> to say "data from customer X is here, here, and here"
[18:28] <jefferai> and ideally to simply define that data from customer X lives on e.g. drive 1 on three nodes
[18:29] <sagewk> jefferai: that's easy, although in general any customer with more than a trivial amount of data will have some on every node.
[18:29] <jefferai> unless I can carefully define where it goes :-)
[18:29] <elder> That's what I was about to ask about. Is there any concern about deleting data?
[18:29] <sagewk> jefferai: you won't be able to restrict it down that much without throwing out most of the benefits of ceph's data distribution
[18:29] <elder> Meaning, if I move its "official location" off one node and onto another, must the earlier copy be erased reliably?
[18:29] <sagewk> jefferai: you'd need to define a pool for each customer, with a custom crush map for each one placing it on a subset of nodes. it can be done, but is tedious, and won't scale to large numbers of customers.
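sagewk's per-customer scheme can be sketched with the standard CLI. The pool name, PG count, and ruleset number below are my own assumptions, and the ruleset itself would first have to be defined in a custom CRUSH map as he describes:

```shell
# Hypothetical sketch: one pool per customer, pinned to a custom CRUSH ruleset.
ceph osd pool create customer-x 128           # pool with 128 placement groups
ceph osd pool set customer-x crush_ruleset 1  # ruleset 1 restricts placement
```

Repeating this per customer is the tedium sagewk warns about; it works for ~20 tenants but not for thousands.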
[18:30] <jefferai> I see
[18:30] <sagewk> jefferai: if that's your requirement, i don't think rados is the right tool for you
[18:30] <jefferai> it isn't necessarily, just trying to cover my bases
[18:30] <jefferai> if I can find out where that data is, it might be enough
[18:30] <elder> Do you both need to know where it *is*, as well as ensure it *is not* anywhere else?
[18:30] <jefferai> elder: I think the former might be enough
[18:30] <jefferai> after all, if I can find out where it is, then I know it's not anywhere else
[18:31] <sagewk> jefferai: finding data is easy (there's an ioctl for the fs, if that's what you're using). but the answer will almost always be "every osd"
[18:31] <Tv_> jefferai: note, where it *is* != where it *has been*
[18:31] <elder> Exactly.
[18:31] <jefferai> Tv_: yep
[18:31] <Tv_> jefferai: but this really comes down to, are we talking about ~10 customers or ~10000
[18:31] <sagewk> jefferai: which makes asking the question mostly pointless; you can save time by just assuming it's everywhere.
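The "finding data is easy" part sagewk mentions can also be done from the CLI rather than the ioctl. A sketch with a hypothetical pool and object name (the output lists the placement group and the acting OSDs):

```shell
# Hypothetical sketch: ask the cluster where one object is placed.
ceph osd map customer-x some-object-name
```

As sagewk notes, for any non-trivial pool the union of those answers will cover essentially every OSD.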
[18:32] <jefferai> sagewk: it's not really pointless, it's just an answer...now I have to see if that answer can be lived with :-)
[18:32] <jefferai> if it was easy to restrict to which osds data was living, then that would make things simpler
[18:32] <jefferai> although I understand why that's not ideal
[18:33] <jefferai> I was talking a lot with nhm about hardware requirements...one thing I'm wondering is OSDs
[18:33] <Tv_> jefferai: it's not horribly hard, it just won't scale to tens of thousands of customers
[18:33] <jefferai> I assume you make a block device an OSD
[18:33] <jefferai> Tv_: I'm talking about, say, 20
[18:34] <Tv_> jefferai: you could give each a separate pool, do crushmaps for each, etc; you could pull it off
[18:34] <nhm> sagewk: on the performance side, he was wondering about CPU. I'm thinking that given he is going to be extremely network limited with 45 disks + 20 SSDs per node, he may be best off sticking with a fast single cpu system with an extra socket available if CPU usage is really a problem. That lets him avoid NUMA issues until he has to deal with them.
[18:34] <Tv_> jefferai: you'd essentially partition your osds into tenants
[18:34] <Tv_> jefferai: (at which point, maybe you'd want to run just 20 ceph clusters instead?)
[18:34] <jefferai> Tv_: yeah, I was looking in the documentation about all this, including the custom crushmap stuff, but most of it isn't complete
[18:34] <jefferai> Tv_: I thought of that too, honestly
[18:34] <jefferai> and maybe that's a way to do it
[18:34] <sagewk> nhm: sounds reasonable. network will definitely be the limiter with that sort of config, yeah.
[18:35] <jefferai> Tv_: but I'm not sure how that would scale, hardware wise
[18:35] <jefferai> sagewk: so when you say "network will be the limiter" -- exactly what is going to be chewing up all the bandwidth?
[18:35] <jefferai> I don't really know the network requirements of ceph
[18:35] <Tv_> jefferai: the stuff i'm working on right now lets you actually manage multiple clusters on the same servers; each disk is marked to belong to a specific cluster etc
[18:36] <jefferai> Tv_: so you'd run e.g. 20 ceph-mon processes on each box, with 20 ceph-osd daemons on each box, etc?
[18:36] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[18:36] <Tv_> jefferai: uhh you wouldn't want 20 ceph-mon
[18:36] <Tv_> jefferai: 20 ceph-osd if you have 20 disks on it
[18:36] <jefferai> yeah, I'll have 65 disks
[18:37] <Tv_> jefferai: on a single server?
[18:37] <jefferai> yes, it's a storage server...
[18:37] <nhm> jefferai: each node will be receiving data from the client, sending replica data to secondary OSDs, and receiving replica data from other OSDs.
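nhm's point about replication traffic is easy to put rough numbers on. The following back-of-envelope sketch is my own arithmetic (not from the discussion), assuming writes spread evenly across a balanced cluster:

```python
# Back-of-envelope sketch: average network load per storage node for
# replicated writes. Assumes client writes are spread evenly across nodes.
def per_node_traffic(client_write_mbs, replicas, nodes):
    # Every byte is stored `replicas` times, so replicas * W crosses the
    # network into the cluster in total, split across the nodes.
    ingress = client_write_mbs * replicas / nodes
    # Each node forwards replicas - 1 copies of its primary share.
    egress = client_write_mbs * (replicas - 1) / nodes
    return ingress, egress

# 3000 MB/s of aggregate client writes, 2x replication, 3 nodes:
print(per_node_traffic(3000, 2, 3))  # -> (2000.0, 1000.0)
```

With numbers like these, a 10GE link (~1200 MB/s) saturates well before a 65-disk box runs out of disk bandwidth, which is nhm's point.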
[18:37] <Tv_> jefferai: that's quite a lot for the ceph design; you might run out of ram & cpu first
[18:37] <jefferai> Tv_: that's why I'm asking these questions
[18:37] <Tv_> jefferai: we're more a scale out than scale up thing
[18:37] <Tv_> jefferai: the price you paid for that 65 disk thing would have probably gotten you a handful of independent servers, and then your failure domains would be smaller
[18:38] <jefferai> Tv_: haven't paid anything yet, that's why I'm asking about hardware
[18:38] <jefferai> but rack space is limited
[18:38] <Tv_> jefferai: 65 disks makes sense pretty much only when you absolutely must have that much space managed by a single operating system; we're explicitly avoiding that limitation
[18:39] <jefferai> Tv_: I don't have that requirement
[18:39] <Tv_> jefferai: just about all ceph clusters i know of are using direct-attached disks, what fits in the base server, no expansion boxes
[18:40] <jefferai> hm
[18:40] <Tv_> jefferai: think about it this way: what if your power supply fails; how much data is going to be offline because of that
[18:40] <jefferai> if both power supplies fail? :-)
[18:40] <nhm> jefferai: That's why I mentioned last night you may want to go with smaller boxes. It's less painful in a number of ways. You mostly just lose density.
[18:40] <jefferai> ideally zero, because of ceph replication
[18:41] <Tv_> jefferai: once again, we believe in piles of cheaper hardware vs extra-reliable everything
[18:41] <jefferai> Tv_: I'm with you there
[18:41] <Tv_> jefferai: the servers we use on our stuff don't even have dual power
[18:41] <jefferai> this hardware is not expensive
[18:41] <jefferai> :-)
[18:41] <nhm> jefferai: with big boxes, when a box fails, there's a ton of data to re-replicate.
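A quick back-of-envelope sketch of the point nhm is making here; the 3TB disk size is an assumption for illustration, matching the drives discussed later in the conversation:

```shell
# How much data goes offline (and must be re-replicated) when one chassis dies.
disk_tb=3                       # assumed disk size, TB
big_box=$(( 65 * disk_tb ))     # 65-disk chassis: 195 TB to recover
small_box=$(( 12 * disk_tb ))   # 12-bay node:      36 TB to recover
echo "big box: ${big_box} TB, small box: ${small_box} TB"
```

The smaller failure domain means both less data at risk per failure and a shorter, less disruptive recovery.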
[18:41] <Tv_> jefferai: why would they; any server can fail as it pleases
[18:41] * jefferai doesn't actually have a choice, they all come with redundant PSUs :-)
[18:42] <Tv_> jefferai: you could buy more of something cheaper...
[18:42] <jefferai> actually, I could buy less
[18:42] <jefferai> because a JBOD is very cheap
[18:42] <jefferai> especially when I'm filling it with my own disks
[18:42] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[18:42] <jefferai> so a 6U 20 SSD/45HDD solution is quite cheap
[18:43] <jefferai> I could put in 3 2U boxes, but it would cost more because each one would need memory and a processor
[18:43] <nhm> jefferai: yeah. I'd argue though that the disks are going to be the majority of that cost anyway though.
[18:43] <nhm> jefferai: and you can split the memory up and buy lower density dimms if you go with smaller servers.
[18:43] * mgalkiewicz (~mgalkiewi@ Quit (Ping timeout: 480 seconds)
[18:43] <jefferai> OK, so suppose that I do that -- buy more smaller servers
[18:44] <jefferai> each fits 24 SSDs or 12 HDDs
[18:44] <jefferai> (or 24 2.5" HDDs)
[18:44] <jefferai> so I have, max, 24 OSDs per box
[18:45] <jefferai> what kind of hardware requirements am I looking at for that kind of use?
[18:45] <nhm> jefferai: I think the biggest advantage of the big-box setups are density if you have a really big deployment, and speed if you are doing something like IB (And we don't do RDMA anyway).
[18:46] <jefferai> I was just thinking it made sense since the majority of what's on the HDDs would be low-traffic
[18:46] <jefferai> but I'm not against swapping out for 3x2U devices
[18:47] * mgalkiewicz (~mgalkiewi@ has joined #ceph
[18:47] <nhm> jefferai: Our internal test boxes are 12-bay 2U machines with 6-core AMD chips and 16GB of ram.
[18:48] <jefferai> I see
[18:49] <nhm> jefferai: I'm not sure yet if that is optimal, but so far cpu and memory don't seem to be the limitation.
[18:49] <Tv_> part of that is that we'd rather have more test servers, to trigger those sort of issues
[18:50] <jefferai> sure
[18:50] <Tv_> part of avoiding big boxes right now is this: recovery causes a spike in cpu and ram usage
[18:50] <jefferai> ah
[18:50] <Tv_> you might not want to provision a box to tolerate that spike for >>20 disks
[18:50] <jefferai> so everything slows down during recovery
[18:51] <Tv_> now, are we working on decreasing the severity of the spike? for sure
[18:51] <Tv_> but still, the ceph architecture intentionally pushes work to the osds
[18:51] <Tv_> so it's not just "plug in as many disks as you can" and a dumb server that just shuffles bytes
[18:51] <jefferai> which makes sense
[18:52] <Tv_> this ain't your grandpa's iscsi ;)
[18:52] <jefferai> Tv_: yeah, I understand
[18:53] <jefferai> So let's say that I got 9 2U boxes each with 12 3.5" slots that I stuffed 3TB disks in
[18:53] <jefferai> and 3 2U boxes each with 24 2.5" slots that I can pop SSDs in
[18:54] <jefferai> I assume it's possible with a crushmap to keep data that I want to stay on the fast SSDs on those?
[18:54] <jefferai> and will 24 OSDs/disks in one box be problematic?
[18:55] <Tv_> jefferai: if you have a separate "fast" pool and a "slow" pool, then you can control data placement
[18:55] <jefferai> OK
[18:55] <nhm> jefferai: Maybe, but you'll be so network limited on a box like that that the ssds will be mostly idle. Figure even slow SSDs should give you ~100-200MB/s.
[18:55] <Tv_> jefferai: but that means whatever uses RADOS needs to use those pools, explicitly
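A hedged sketch of the fast/slow pool split Tv_ describes. Pool names, PG counts, and ruleset numbers below are illustrative, not from this log; the `crush_ruleset` pool setting reflects the pre-Luminous command syntax of this era:

```shell
# Two pools, each mapped (via CRUSH rules) to a different device subtree:
# "fast" for the SSD hosts, "slow" for the HDD hosts.
ceph osd pool create fast 256
ceph osd pool create slow 1024

# Rules live in the CRUSH map; pull it out, edit, and re-inject:
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt    # add rules rooted at ssd/hdd subtrees
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

# Point each pool at its ruleset:
ceph osd pool set fast crush_ruleset 1
ceph osd pool set slow crush_ruleset 2
```

As Tv_ notes, this only controls placement; every client still has to choose the right pool explicitly.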
[18:56] <nhm> jefferai: so 5-10 of them will easily outpace your 10GE link.
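The arithmetic behind nhm's "5-10 of them" estimate, using the 100-200 MB/s per-SSD figure from the line above (the 150 MB/s midpoint is my assumption):

```shell
# 10GbE moves at most ~1250 MB/s raw (10 Gb/s / 8 bits per byte).
link_mbs=1250
ssd_mbs=150     # midpoint of the 100-200 MB/s per-SSD estimate
needed=$(( (link_mbs + ssd_mbs - 1) / ssd_mbs ))   # ceiling division
echo "${needed} SSDs fill one 10GbE link"
```

So a 24-SSD node has roughly 2.5x more disk bandwidth than a single 10GbE link can carry, before counting replication traffic.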
[18:56] <jefferai> nhm: I'm still having a hard time understanding "network limited" -- I don't have a good feel for bandwidth requirements
[18:56] <jefferai> nhm: but are you talking for Ceph traffic?
[18:57] <jefferai> because I don't expect those to be running flat out all the time
[18:57] <Tv_> jefferai: he's saying the SSDs will go way faster than your network at that point, so if you're not going to upgrade the networking (e.g. bond 2*10gig), *perhaps* you might as well use spinning disks
[18:57] <Tv_> this is all hypothetical at this point
[18:57] * jefferai was going to bond 2*10g
[18:58] <Tv_> you'd need benchmarks to know for sure
[18:58] <jefferai> actually, it'll be multipath
[18:58] <jefferai> and I could possibly bond on top of that
[18:58] <jefferai> sure
[18:58] <Tv_> also, ceph performance is still a moving target
[18:59] <nhm> Tv_: not necessarily spinning disks, just that having 24 SSD osds in one node won't push the node as hard as it might since the network will hold everything back.
[18:59] <joao> did any of you guys get a warning message when installing the vidyo .deb regarding bad package quality?
[18:59] <nhm> Tv_: Though I agree, we won't know until we test a node like that.
[19:00] <nhm> joao: yes, I gave up on vidyo on linux.
[19:00] <nhm> joao: it's not worth the pain.
[19:00] <jefferai> so on a 24-disk server, 8-12 AMD cores and 32GB of RAM is probably enough?
[19:01] <joao> nhm, gonna give precise a shot and fall back to osx if it goes south
[19:01] <nhm> jefferai: Let's say I'd expect to be spending more time dealing with controller issues, tweaking ceph, and tweaking underlying filesystems than dealing with cpu/memory with that kind of setup. :)
[19:01] <jefferai> hah, sounds excellent
[19:02] <jefferai> so maybe, in the 24U I have available, I can go for 9 2U 12-disk 3.5" nodes and 3 2U 24-disk 2.5" nodes
[19:04] <nhm> jefferai: just keep in mind that we're still working on making different configs perform well...
[19:05] <jefferai> Understood
[19:05] <jefferai> and once I buy hardware I'm hoping to get some funded time where I can work on testing things out with you guys
[19:05] <jefferai> if you guys would like that
[19:05] <nhm> though the 12 disk nodes are very similar to what one of our customers is deploying.
[19:06] <nhm> jefferai: Personally I'd love to see any results you collect!
[19:06] <jefferai> sure
[19:07] <nhm> jefferai: ok, we've got a meeting in a couple of mins, so time to run. Good luck!
[19:07] <jefferai> sure, thanks
[19:08] <elder> sagewk, do you know why con_close_socket() is called in con_work() after the OPENING bit is cleared from con->state?
[19:09] <sagewk> let me look
[19:10] <elder> Also, I think I might like to talk with you a bit after the meeting, or with someone else with decent messenger knowledge. Can you stay on afterward?
[19:11] <sagewk> elder: someone can call ceph_con_close() and then ceph_con_open()
[19:11] <sagewk> we need to close the old socket, and open a new one
[19:12] <sagewk> hmm, in that case it should probably jump straight to try_write(), or try_read() will deref the NULL con->sock?
[19:12] <elder> If ceph_con_close got called, the CLOSED bit would be set, so the case just prior would have been taken.
[19:12] * bchrisman (~Adium@ has joined #ceph
[19:12] <elder> Huh...
[19:12] <elder> Interesting!
[19:12] <sagewk> hmm
[19:12] <elder> That's sort of what I was thinking.
[19:12] <sagewk> i seem to remember try_read and try_write getting swapped around a while back?
[19:12] <elder> Nope.
[19:12] <sagewk> that may be how this came up
[19:12] <elder> Didn't work.
[19:13] <elder> So I never committed that.
[19:13] <elder> I believe this block of code goes back to the original commit.
[19:13] <sagewk> yeah, not sure how OPENING would get set without CLOSED...
[19:13] <elder> OK, well can we talk after the meeting?
[19:13] <elder> Briefly.
[19:13] <sagewk> sure
[19:14] <sagewk> also.. con_close_socket() sets SOCK_CLOSED, so the case below would catch that and fault.. weird
[19:14] <sagewk> try_read checks for null con->sock, so that's not it
[19:14] <sagewk> yeah
[19:15] * chutzpah (~chutz@ has joined #ceph
[19:18] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:26] * Ryan_Lane (~Adium@ has joined #ceph
[19:37] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[19:38] * Oliver1 (~oliver1@ip-176-198-97-69.unitymediagroup.de) has joined #ceph
[19:44] * Theuni (~Theuni@ has joined #ceph
[19:52] <NaioN> any idea if the rbd caching will also be included in the kernel client?
[19:57] <elder> Not soon.
[19:57] <elder> Or rather, not in the next month or two at least.
[20:03] * aliguori (~anthony@c-68-44-125-131.hsd1.nj.comcast.net) has joined #ceph
[20:10] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) has joined #ceph
[20:11] <mgalkiewicz> hi elder
[20:11] <mgalkiewicz> elder: could u take a quick look at https://gist.github.com/2577769
[20:11] * danieagle (~Daniel@ has joined #ceph
[20:20] <dmick> yikes yes pudgy is just slightly faster than my desktop for build
[20:20] <sjust> that's odd
[20:20] <sjust> -j15?
[20:20] <dmick> that was sarcasm, is what that was
[20:20] <sjust> oh
[20:20] <sjust> yes
[20:20] <sjust> ok
[20:20] <sjust> I get it now
[20:20] <dmick> -j 8 and it's probably at least 20x
[20:20] <sjust> ah
[20:21] <The_Bishop> i use ccache for frequent rebuilds
[20:21] <gregaf> I miss pudgy
[20:21] <dmick> The_Bishop: yes, we've experimented
[20:21] <gregaf> It had more disk problems than kai but its memory never broke on me, and it had 8 real cores :(
[20:21] * dmick growls at gregaf and cuddles pudgy in his paws
[20:22] <gregaf> actually, wait, does it have 8 or 16?
[20:22] <dmick> 16 I believe
[20:22] <gregaf> I can't remember any more; it's been so long
[20:22] <gregaf> *cry*
[20:23] <dmick> 4x quad opteron, yes.
[20:29] <elder> mgalkiewicz, I see that but there's not much I can do with it. Unfortunately I'm still focused on a different bug right now.
[20:29] <dmick> a little confused; if man/ is made from doc/man, why do we have files checked into man/?
[20:31] <Tv_> dmick: so we don't need the whole sphinx toolchain as build dep
[20:31] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[20:31] <mgalkiewicz> elder: ok I will add the issue to your tracker
[20:31] <Tv_> dmick: maybe drop in a man/README that warns you about this? and ensure it doesn't hurt
[20:31] <elder> Sounds good.
[20:32] <dmick> oh, so you might have stale manpages if you don't have sphinx, but at least you have something
[20:32] <Tv_> dmick: we don't actually auto-update man/ either
[20:32] <dmick> and they'll only be stale if you made changes.
[20:32] <Tv_> dmick: because it keeps messing with the darn dates, that's just spurious changes
[20:32] <Tv_> dmick: never had time to polish that part
[20:32] <dmick> er...it seemed to just now
[20:33] <dmick> ah. no, I'm wrong
[20:33] <Tv_> dmick: admin/build-doc writes to build-doc/output/man
[20:34] <dmick> yargh. ok, glad I asked
[20:35] <dmick> so should I make changes to .rst, run build-doc, and then move changed files from output/man to man/?
[20:35] <Tv_> yeah
[20:35] <dmick> (and push both the updated .rst and man/ files?)
[20:35] <Tv_> yeah
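The man-page update flow dmick and Tv_ just agreed on, as one sequence. This assumes a ceph.git checkout with the sphinx toolchain installed; the `ceph.rst`/`ceph.8` file names are examples, not specific files from this conversation:

```shell
# 1. edit the .rst source under doc/
vi doc/man/8/ceph.rst
# 2. build the docs (writes man pages to build-doc/output/man)
./admin/build-doc
# 3. copy only the pages you actually changed into the checked-in man/
cp build-doc/output/man/ceph.8 man/
# 4. commit both the source and the generated page
git add doc/man/8/ceph.rst man/ceph.8
git commit -m 'doc: update ceph(8) man page'
```

Copying only changed files avoids the spurious date-only diffs Tv_ mentions.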
[20:35] <dmick> k
[20:42] * sjust-phone (~sjust@aon.hq.newdream.net) has joined #ceph
[20:43] * sjust-phone (~sjust@aon.hq.newdream.net) Quit ()
[20:48] <nhm> phone irc!
[20:50] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[20:52] * aliguori (~anthony@c-68-44-125-131.hsd1.nj.comcast.net) Quit (Read error: Operation timed out)
[20:52] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[20:53] * aliguori (~anthony@c-68-44-125-131.hsd1.nj.comcast.net) has joined #ceph
[20:54] <dmick> Tv_: would you mind doublechecking https://github.com/dmick/ceph/commit/31fb8f97b64857f50bdb0ff155bc769a5a1f29d8
[20:54] <Tv_> dmick: multiple spaces ain't gonna do you any good
[20:55] <Tv_> or are you one of those "two spaces after a period" people?
[20:56] <Tv_> dmick: otherwise yeah, it's fine
[20:57] <dmick> Tv_: yes, yes I am
[20:57] <dmick> it still makes sense to me, unlike punctuation inside doublequotes
[20:57] <dmick> tnx
[20:59] <Tv_> i'm a firm believer in single space (let computers figure out the spacing & kerning) and logical quotation (if you quoted a question, then you need the question mark inside the quotes)
[20:59] <dmick> I hear you. but I still use a lot of fixed-pitch fonts and I believe two spaces sets off the sentence slightly better. But I appreciate the counterargument.
[21:00] <Tv_> the fact that we use monospace fonts so much is part of the problem (let's fix that!).. but honestly, for me a two-space monospace gap is too big
[21:01] <Tv_> the text i'm reading starts looking like this.
[21:01] <elder> I don't think monospace fonts need fixing.
[21:01] <dmick> well not between words...
[21:01] <elder> Well, for code-like things...
[21:01] <Tv_> dmick: yeah but generalize that to a paragraph and you get lots of gaps inside it
[21:01] <elder> When it should be pretty, it should be pretty.
[21:01] <dmick> but I wouldn't generalize it. That would indeed be silly.
[21:02] <Tv_> elder: that's just because of our historical war with indentation
[21:02] <nhm> I like 2 spaces between sentences too.
[21:02] <Tv_> i've seen a proportional font lisp editor, it was gorgeous
[21:02] <elder> And the two spaces after a period is magically taken care of by the same computer that figures out what to do with one space.
[21:03] <elder> As long as we eliminate all invisible spaces at the end of lines I'm happy.
[21:04] <elder> Oh, and spaces before tabs.
[21:04] <Tv_> elder: i've actually written Go in a form where i was unable to commit anything with non-canonical indentation or whitespacing
[21:04] <Tv_> elder: it works very well when the language was designed with that in mind
[21:04] <Tv_> it would get rewritten to canonical on git add
[21:04] <elder> You should have a pretty-printer take care of that sort of silliness.
[21:05] <Tv_> i did
[21:05] <elder> OK, good.
[21:05] <Tv_> C has too much ambiguity for that, but Go was just lovely
[21:05] <elder> That's the sort of thing that in-person code reviews I was in a decade or more ago used to zoom in on. "I don't like the look of that semicolon."
[21:06] <yehuda_hm> Tv_: you're obviously not getting paid by the size of your source files
[21:06] <elder> You can do the same thing with C, you just have to decide what convention you want to follow.
[21:06] <elder> Gotta go. Back on line in a few hours.
[21:08] * BManojlovic (~steki@ has joined #ceph
[21:11] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[21:24] * The_Bishop (~bishop@cable-86-56-102-91.cust.telecolumbus.net) Quit (Ping timeout: 480 seconds)
[21:25] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[21:31] * Theuni (~Theuni@ Quit (Quit: Leaving.)
[21:39] * The_Bishop (~bishop@cable-86-56-102-91.cust.telecolumbus.net) has joined #ceph
[21:39] * Theuni (~Theuni@ has joined #ceph
[21:39] * Theuni (~Theuni@ Quit ()
[21:41] * sjust-phone (~sjust@aon.hq.newdream.net) has joined #ceph
[21:45] * sjust-phone (~sjust@aon.hq.newdream.net) Quit (Read error: Connection reset by peer)
[21:47] <jefferai> Tv_: nhm: elder: so I've changed around my quotes; I'm now looking at 12 2U systems instead of 4 2U+4U systems
[21:47] <jefferai> so, trying to take your advice to heart :-)
[21:49] <nhm> jefferai: Good deal. I think that will be a better setup for ceph.
[21:49] <jefferai> Yeah. So 4 systems with 20 2.5" bays each (not fully populated to start) and 8 systems with 12 3.5" bays each (also not fully populated to start)
[21:50] <jefferai> each will have dual 10G links going through different switches; but I can see if I can get bonding set up
[21:50] <jefferai> I'm not sure if you can really do bonding between multiple end hosts
[21:50] <jefferai> or if it's more point-to-point
[21:51] <nhm> jefferai: I wouldn't worry too much about bonding on the 12bay systems. For the SSD systems you could see some benefit potentially.
[21:51] <jefferai> yeah -- does RBD do multipath natively?
[21:52] <nhm> jefferai: don't know honestly.
[21:53] <darkfader> most bonding types hash over ip addresses, so you'll see better usage with increased number of systems talking. 1-to-1 is worst case
[21:53] <darkfader> but linux can iirc hash by ip ports too
[21:54] <darkfader> (i did try it and it didn't work out as i wanted, but i think that was on a very old kernel)
[21:59] <jefferai> darkfader: yeah, I just didn't know if you could do multipoint bonding, since each storage system would talk to each compute node
[21:59] <jefferai> I would guess you could, since it's just a matter of setting the IP addresses on the interfaces and making the kernel aware of them being bonded
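A sketch of the setup darkfader describes: Linux bonding hashing on layer 3+4 (IP plus port), so traffic to many peers spreads across both 10GbE links. Interface names are placeholders, and 802.3ad (LACP) assumes both links terminate on switches that support it (or a stacked pair); with fully independent switches, a mode like active-backup or balance-alb would be needed instead:

```shell
# Load the bonding driver; this creates bond0 by default.
modprobe bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4
# Enslave the two 10GbE ports (eth2/eth3 are placeholders).
ifenslave bond0 eth2 eth3
ip addr add dev bond0
ip link set bond0 up
```

As darkfader notes, a single host pair still rides one link per flow; the win only shows up with many concurrent peers.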
[22:00] * Ryan_Lane1 (~Adium@ has joined #ceph
[22:00] * Ryan_Lane (~Adium@ Quit (Read error: Connection reset by peer)
[22:25] * Oliver1 (~oliver1@ip-176-198-97-69.unitymediagroup.de) Quit (Quit: Leaving.)
[22:33] * Ryan_Lane1 (~Adium@ Quit (Quit: Leaving.)
[22:33] * Ryan_Lane (~Adium@ has joined #ceph
[22:41] * sjust (~sam@aon.hq.newdream.net) Quit (Quit: Leaving.)
[22:41] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[22:54] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[23:07] <sagewk> tv_: do you care whether i change 'osd crush add ..' to do an add/update, or make a new 'osd crush update'?
[23:07] <sagewk> depends on whether we think it's useful to have an add that doesn't update.
[23:07] <Tv_> sagewk: i wish the name made sense.. as to what that concretely means, that's open ;)
[23:08] <Tv_> sagewk: osd crush set?
[23:08] <Tv_> sagewk: not sure whether add is worth keeping
[23:08] <Tv_> but i'd expect add to crap out if already exists, from the name
[23:09] <Tv_> "set" at least has that semantic right, in my mind.. make it so, whether it existed or not
[23:09] <Tv_> "update" doesn't really imply create, in my mind
[23:11] <sagewk> set works. can we safely remove add? people are probably already using it.
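The "set" semantic being settled on here is idempotent: running the same command twice is harmless, whereas a create-only "add" would fail the second time. A sketch, with the id, weight, and CRUSH location values being illustrative:

```shell
# First run: creates the CRUSH entry for osd.0.
ceph osd crush set 0 osd.0 1.0 pool=default rack=unknownrack host=node1
# Later run of the same command: updates weight/location in place
# instead of erroring out because the entry already exists.
ceph osd crush set 0 osd.0 2.0 pool=default rack=unknownrack host=node1
```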
[23:11] <Tv_> never heard of anyone but me, but that doesn't mean they're not out there
[23:11] <sagewk> hmm the other cookbooks don't use it?
[23:12] <elder> nhm, can you send a quick summary of how you're invoking or using seekwatcher?
[23:12] <sagewk> checking
[23:12] <Tv_> sagewk: haven't actually looked
[23:12] <Tv_> ( / can't remember)
[23:13] <nhm> elder: super easy, first: "sudo blktrace -o <trace file> -d <device>"
[23:13] <nhm> elder: second: "seekwatcher -t <trace file> -o <movie> --movie"
[23:13] <sagewk> doesn't look like it
[23:13] <sagewk> ok
[23:13] <sagewk> that makes it easy :)
[23:14] <nhm> elder: you may need to fix the seekwatcher code though. ;)
[23:14] <nhm> I encountered a couple of bugs both in 0.12 and in HEAD. Easy to fix though.
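nhm's two-step trace-and-visualize flow, combined into one sequence; the device and file names are placeholders:

```shell
# 1. capture a block-level I/O trace of the OSD's disk while the
#    workload runs (stop with ctrl-C when done; needs root)
sudo blktrace -o osd-trace -d /dev/sdb
# 2. render the trace into a seek-pattern movie
seekwatcher -t osd-trace -o osd-io.mpg --movie
```

Seekwatcher can also emit a static plot if `--movie` is omitted, which is quicker for a first look.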
[23:15] <sagewk> tv_: merge wip-crush-update to get the new goodness.
[23:17] <Tv_> sagewk: i really like the weeks you're not at conferences ;)
[23:18] <sagewk> :)
[23:19] <sagewk> at least when they overlap with weeks you're not fighting dell ipmi hardware
[23:19] <Tv_> sagewk: i only have problems when i have to fight it and i'm not allowed to use a sledge hammer
[23:19] <Tv_> let me loose in IRV with one and i'll be very happy
[23:20] <Tv_> plana58 is getting it first
[23:23] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[23:27] <Tv_> uh what serves http://ceph.newdream.net/debian-snapshot-amd64/ ?
[23:28] <Tv_> sagewk: i think after the dh panel security audit, i can't manage the automated apache stuff anymore
[23:28] <Tv_> so now i have no clue about what broke, when one of the proxy rules is acting up
[23:29] <sagewk> its a proxy to the old vm
[23:29] <Tv_> no http spoken there, currently
[23:29] <sagewk> i can update it, or we can send people to gitbuilder.ceph.com/... somewhere?
[23:29] <Tv_> i'll happily put in a new url if i know what it is ;)
[23:30] <Tv_> doc/ops/autobuilt.rst
[23:30] <Tv_> hmm i can look it up from the qa chef stuff
[23:30] <sagewk> http://gitbuilder.ceph.com/ has several dirs for different distributions
[23:30] <Tv_> oh no i can't, those don't use debs
[23:30] <sagewk> the deb ones do
[23:30] <sagewk> e.g. http://gitbuilder.ceph.com/ceph-deb-oneiric-x86_64-basic/
[23:31] <Tv_> oh crap the link on the website still goes to http://ceph.newdream.net/gitbuilder.cgi
[23:31] <Tv_> no wonder i was confused
[23:35] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[23:48] <Tv_> so am i on crack or does ceph-osd --mkfs remove the monmap file?
[23:48] <Tv_> i couldn't find an unlink call that looked guilty
[23:50] <dmick> I'll just point out that it could easily be both
[23:51] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[23:58] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Read error: Connection reset by peer)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.