#ceph IRC Log


IRC Log for 2012-03-30

Timestamps are in GMT/BST.

[0:42] * gregorg_taf (~Greg@ has joined #ceph
[0:42] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[0:49] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[1:04] * loicd (~loic@magenta.dachary.org) Quit (Ping timeout: 480 seconds)
[1:31] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[1:53] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[1:59] * danieagle (~Daniel@ Quit (Quit: See you later :-) and Thank You Very Much for Everything!!! ^^)
[2:36] * BManojlovic (~steki@ Quit (Quit: I'm off, and you do whatever you want...)
[3:09] * lofejndif (~lsqavnbok@659AAA52K.tor-irc.dnsbl.oftc.net) Quit (Quit: Leaving)
[3:16] * joao (~JL@ Quit (Ping timeout: 480 seconds)
[3:29] * tjikkun (~tjikkun@82-169-255-84.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[3:41] * perplexed_ (~ncampbell@ has joined #ceph
[3:41] * perplexed_ (~ncampbell@ has left #ceph
[3:42] * tjikkun (~tjikkun@82-169-255-84.ip.telfort.nl) has joined #ceph
[4:22] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:44] <sage> elder: should we mark those bugs 'can't reproduce'?
[4:46] <elder> Oh. Maybe.
[4:47] <elder> I just made a comment to that effect on that last one, but maybe you just updated it.
[4:47] <elder> I'll do that.
[4:48] <elder> I don't mean to say I won't look at it, it's just not worth spending any more time testing I don't think.
[4:48] <elder> I've been thinking some of these could be the result of hitting those missing commits from February. Now that everything is copacetic things are stable.
[4:49] <sage> yeah
[4:49] <elder> Or rather, the missing commits could have led to some of these problems.
[4:49] <sage> i'd mark them can't reproduce for now to get rid of the noise, we can always reopen if they come up again
[4:49] <elder> I'll mark them Can't Reproduce.
[4:49] <elder> Right.
[4:53] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[4:54] <elder> Cool, they just magically disappear from my list!
[4:55] <sage> the best part!
[4:55] <elder> I'm going to do that to all of them!
[4:55] <sage> you can also highlight multiple and right click to set the status
[4:56] <elder> I wondered how to do that. I figured it was something like that but I never tried a right-click. Highlight meaning click the little box?
[4:56] <sage> hmm #1940 is probably resolved now?
[4:57] <elder> I guess so. That was yours as far as I was concerned. You said "patch in master" so I figured you knew what you were talking about.
[4:57] <elder> I've been running a whole bunch of tests on your wip-atomic-open branch. No problems so far.
[4:57] <elder> I don't remember where I got this list of tests but it might be a nightly test, or something like it.
[4:58] <elder> Maybe not. Maybe it's just everything I could find. blogbench, iozone, pjd, tiobench, fsstress, kernel_untar_bui, rbd/{copy,import_export,test_librbd}.sh
[4:59] <elder> A few are commented out because of errors.
[5:05] * adjohn (~adjohn@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[5:42] * adjohn (~adjohn@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Quit: adjohn)
[5:44] * adjohn (~adjohn@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[5:59] <sage> elder: still there?
[6:10] <elder> Yes
[6:10] <elder> What's up?
[6:14] <elder> Well, actually, I'm headed to bed. Leave a message here if you like, or e-mail me, and I'll see it first thing in the morning.
[6:21] <sage> just wondering when you're arriving in sf
[6:36] * adjohn (~adjohn@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Quit: adjohn)
[6:49] * f4m8_ is now known as f4m8
[6:54] <sage> elder: i guess the xfs ilock lockdep fix isn't in testing anymore.. should we leave it in there for the time being (until it's upstream) to avoid the qa noise?
[7:01] * perplexed (~ncampbell@c-76-21-85-168.hsd1.ca.comcast.net) has joined #ceph
[7:05] * perplexed (~ncampbell@c-76-21-85-168.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[7:06] * perplexed (~ncampbell@ has joined #ceph
[7:10] * perplexed (~ncampbell@ has left #ceph
[7:52] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[8:16] * perplexed_ (~ncampbell@c-76-21-85-168.hsd1.ca.comcast.net) has joined #ceph
[8:17] * perplexed_ (~ncampbell@c-76-21-85-168.hsd1.ca.comcast.net) Quit ()
[8:28] * adjohn (~adjohn@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[8:45] * imjustmatthew (~imjustmat@pool-96-228-59-130.rcmdva.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[9:03] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[9:14] * BManojlovic (~steki@ has joined #ceph
[9:25] * loicd (~loic@ has joined #ceph
[9:44] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:58] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:39] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[10:49] * Kioob`Taff1 (~plug-oliv@local.plusdinfo.com) has joined #ceph
[11:23] * Kioob`Taff1 (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[11:26] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[11:47] * adjohn (~adjohn@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Quit: adjohn)
[12:23] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:23] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[12:24] * BManojlovic (~steki@ Quit (Quit: I'm off, and you do whatever you want...)
[12:24] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[12:34] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[12:52] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[12:52] * gregorg_taf (~Greg@ has joined #ceph
[13:04] * adjohn (~adjohn@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[13:04] * adjohn (~adjohn@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit ()
[13:21] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Ping timeout: 480 seconds)
[13:38] * joao (~JL@ has joined #ceph
[13:57] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[13:59] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit ()
[13:59] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[13:59] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit ()
[14:00] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[14:25] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[14:42] <elder> sage, I thought it was going to be in with the merge, sorry about that. I'll add it back in shortly.
[14:42] <elder> If we rebase after the -rc1 release we can remove it again.
[14:43] <elder> Or rather, just reset the heads to -rc1 once it's released.
[14:44] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[14:55] <elder> sage, looks like the client system died while running wip-atomic-open. No more information at this point.
[15:06] * oliver1 (~oliver@p4FD061F1.dip.t-dialin.net) has joined #ceph
[15:24] <oliver1> Hi... unfortunately the tracker is down... Have an interesting update for #2178 waiting to be accepted ;)
[15:25] <joao> again?
[15:26] <joao> looks like it's true; tracker down
[15:26] <oliver1> Perhaps I could have been banned out :-D
[15:27] <oliver1> ( for some known reason... kidding...)
[15:27] <joao> oh no, it's unresponsive alright
[15:27] <joao> some redmine issue with git
[15:28] <oliver1> So I have to stay in my office until I get rid of whats in the "resend-buffer"... *sigh*
[15:30] <joao> elder, do you know in which server the tracker is?
[15:42] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[15:43] * f4m8 is now known as f4m8_
[15:46] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[15:52] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[16:01] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[16:10] <elder> joao, all I know is tracker.newdream.net
[16:11] <elder> And it's working for me.
[16:11] <joao> strange then
[16:12] <joao> I'm still waiting on the server
[16:13] <oliver1> Now there... and updated...
[16:13] <joao> oh yeah, it's working, as long as I use firefox...
[16:13] <oliver1> failed, DNS now gives me
[16:32] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[17:39] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:46] <Tv|work> ah perhaps that was the server move then
[17:51] * kirby (~kfiles@pool-151-199-52-228.bos.east.verizon.net) Quit (Ping timeout: 480 seconds)
[18:13] <sagewk> joao: it's working now
[18:13] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[18:13] <joao> sagewk, yes, it's been working for a while now
[18:15] * imjustmatthew (~imjustmat@pool-96-228-59-130.rcmdva.fios.verizon.net) has joined #ceph
[18:15] <oliver1> Hey Sage, if you have any more q. or todo's with regards to #2178 let me know, about to leave the office and hit the weekend ;)
[18:15] <sagewk> looking at it now
[18:17] <oliver1> Thnx. Not having tried out the "barrier=1" setting, cause it's not easy to reproduce. But I'm confident I can :-D
[18:18] <sagewk> it looks like the #2164 bug was triggered, altho i haven't yet confirmed it corrupted the specific data block you saw.
[18:18] <sagewk> want to try to reproduce with the patch?
[18:20] <oliver1> Yeah.
[18:22] <sagewk> oliver1: will push a branch to git in a few minutes
[18:25] <oliver1> Well, I can try to do s/t from remote later, though it's not that convenient. Are the logfiles of any interest?
[18:29] <oliver1> I will automate my setup to start 4 VM's, with all fdisk/test/verify things and send out a report if any of the "Ooops" occurred. Read u l8r...
[18:29] * oliver1 (~oliver@p4FD061F1.dip.t-dialin.net) has left #ceph
[18:30] <joao> I gotta find a way to get rid of eclipse
[18:30] <joao> working with it is all puppies and rainbows until it isn't
[18:31] <nhm> opinion request: How crazy of an idea is it to do multivariate regression tests to find correlation coefficients associating benchmark configurations with specific operation times from the ceph logs?do multivariate regression tests to find correlation coefficients associating benchmark configurations with specific operation times from the ceph logs?
[18:31] <nhm> doh, sorry about the double text.
[18:32] <sagewk> oliver1: i copied the log files, will look at them more closely.
[18:32] <sagewk> oliver1: fix isn't quite ready, so don't wait for us. we'll have something by end of day today but that probably doesn't help you much. :)
[18:33] <Tv|work> nhm: matching op duration? i wonder if you could get a better id mechanism.. but that can be greatly helped by the timestamps
[18:33] <sagewk> nhm: would be interesting... once we have a solid framework with reproducible benchmarks. would need to focus on a narrow range of config options to avoid exploding the set of possible combinations
[18:34] <sagewk> nhm: oh nm, misunderstood
[18:34] <Tv|work> nhm: "what happened with this one slow request" is an interesting problem; but having the 2-stage logging might make that easier.. on a took-longer-than-a-threshold request, at the end of the request, if this hasn't been done in the last X seconds already, mark the whole verbose log to be dumped to disk
[18:35] <Tv|work> sagewk: ^
[18:35] <nhm> Tv|work/sagewk: it's an idea I've been kicking around in the back of my head. I did something kind of like that years ago using model trees in weka. Just have to make sure that you don't over-fit the data.
[18:35] <sagewk> tv|work: yeah
[18:36] <Tv|work> sagewk: the problem with that is that the "regular writer" was racing to keep up with the generation, and is very likely to have kept up to date; this sort of introduces another consumer temporarily, or something
[18:36] <Tv|work> sagewk: or makes the regular writer "rewind"
[18:36] <Tv|work> not pretty
[18:36] <Tv|work> harder to make safe
[18:37] <sagewk> tv|work: it's not bad at all with the ultra-simple locking wip-log is doing now.
[18:37] <Tv|work> but i've seen that debug strategy used even more violently; fork and crash dump in the child
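The two-stage logging idea Tv|work describes above can be sketched roughly as follows. This is a minimal single-threaded illustration, not the wip-log code: the class name `TwoStageLog`, its parameters, and the in-memory `dumped` list (standing in for the on-disk log) are all hypothetical, and the sketch deliberately ignores the writer race discussed above.

```python
import collections
import time


class TwoStageLog:
    """Keep recent verbose entries in memory; dump them only on slow requests."""

    def __init__(self, capacity=1000, slow_threshold=1.0,
                 min_dump_interval=30.0, clock=time.monotonic):
        # Bounded ring buffer of (timestamp, message) pairs.
        self.recent = collections.deque(maxlen=capacity)
        self.slow_threshold = slow_threshold        # seconds (hypothetical default)
        self.min_dump_interval = min_dump_interval  # the "X seconds" rate limit
        self.clock = clock
        self.last_dump = None
        self.dumped = []  # stands in for the verbose log written to disk

    def log(self, message):
        self.recent.append((self.clock(), message))

    def finish_request(self, started_at):
        """Call at the end of a request; returns True if a dump happened."""
        now = self.clock()
        if now - started_at < self.slow_threshold:
            return False  # request was fast enough, nothing to do
        if (self.last_dump is not None
                and now - self.last_dump < self.min_dump_interval):
            return False  # a dump already happened in the last X seconds
        self.dumped.extend(self.recent)
        self.recent.clear()
        self.last_dump = now
        return True
```

A real implementation would need the locking that sagewk mentions, since the regular writer and the slow-request dump would both consume the same buffer.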
[18:38] * bchrisman (~Adium@ has joined #ceph
[18:40] <sagewk> anyway, nhm, you mean individual request times, or averages for a specific operation type?
[18:44] <nhm> sagewk: specific operation types is what I was thinking.
[18:45] <nhm> sagewk: most of which would probably be totally uncorrelated with the benchmark parameters, but that's why we'd want a model for each operation and only generate useful ones.
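As a much simpler stand-in for the multivariate regression nhm has in mind, the per-operation idea can be illustrated with a plain Pearson correlation between one benchmark parameter and the measured time of one operation type. The function names are hypothetical, and any numbers passed in would have to come from real benchmark runs; none are taken from the Ceph logs discussed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable linear relationship
    return cov / math.sqrt(var_x * var_y)


def correlate_ops(param_values, op_times):
    """Map each op name (e.g. 'enqueue_op') to its correlation with the parameter.

    op_times is a dict: op name -> list of times, one per benchmark run,
    in the same order as param_values.
    """
    return {op: pearson(param_values, times) for op, times in op_times.items()}
```

Operations whose coefficient stays near zero would be the "totally uncorrelated" ones nhm expects to discard.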
[18:50] <sagewk> oliver1: pushed the fix to stable git branch. can you test it when you get a chance? should be v0.44.1-1-g41f84fa
[18:51] <sagewk> where operation type would be "4KB write", "4MB write", something like that?
[18:52] <jmlowe> Joshd: Mind if I ask how the librbd caching is coming along?
[18:53] * loicd (~loic@ Quit (Quit: Leaving.)
[18:53] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Quit: Ex-Chat)
[18:53] <joshd> jmlowe: unfortunately stalled a bit while some openstack stuff came up this week
[18:53] * loicd (~loic@ has joined #ceph
[18:54] <joshd> jmlowe: I should be back on it next week, but it'll take a bit more work (not sure it'll be ready for 0.45)
[18:54] <nhm> sagewk: I was thinking more like enqueue_op, dispatch, eval_repop, etc.
[18:55] <jmlowe> joshd: ok, really looking forward to giving it a test drive
[18:55] <joshd> jmlowe: I'll let you know when it's ready for testing
[18:55] <sagewk> nhm: oh i see. that would be interesting too, as long as it's subdivided by the actual op that is being enqueued, etc., because that's where most of the variance will be
[18:56] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[18:57] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[18:57] <nhm> sagewk: there's probably a really good way to build a model like that.
[18:58] <nhm> sagewk: Anyway, this is probably far-off stuff, I think it could potentially give us some nice insights.
[18:59] * loicd (~loic@ Quit (Quit: Leaving.)
[18:59] <nhm> Right now I'm more focused on just distilling the data from all of these tests I just ran into something meaningful.
[19:01] <sagewk> nhm: as a starting point it would be nice to get a simple op size/throughput graphs for rados to get some idea of where we're starting from
[19:01] <yehudasa> oh, unity is such a crap
[19:01] <sagewk> if we can build those on a regular basis, even just eyeballing them will give us a good idea if there are obvious performance improvements/regressions
[19:02] <sagewk> yehudasa: ha! yeah i always go back to gnome classic or whatever it's called now
[19:02] <nhm> sagewk: yeah. Do you like graphs or histograms better?
[19:02] <yehudasa> sagewk: yeah, finally did that, but still had to fiddle with config so that alt+tab would work
[19:03] <joao> just wondering, would the tracing I'm getting onto the FileStore be useful to nhm's idea?
[19:03] <dmick> unity is about the only thing I've seen that's even worse than metacity
[19:03] * rosco (~r.nap@ Quit (Quit: leaving)
[19:03] * rosco (~r.nap@ has joined #ceph
[19:04] <joao> dmick, don't you say bad things about unity...
[19:04] <sagewk> nhm: whatever communicates the data best.. it would be nice to see percentile ranges, but it's hard to cram that many dimensions of data onto a single plot
[19:04] * dmick puts up fists
[19:04] <joao> it's so osx-ish I barely can't live without it :p
[19:05] <nhm> sagewk: Yeah, I've got about 8 dimensions right now.
[19:06] <dmick> import wormhole_graph
[19:06] <elder> nhm, just get it started. Don't try to lay it all out ahead of time, there's just too much to learn and we won't even know what's interesting until we see some of the data.
[19:06] <nhm> elder: oh, I already ran initial tests last night.
[19:06] <sagewk> nhm: let's just not get overwhelmed too soon.. even 2 dimensions is way more than we have been looking at and will add huge value :)
[19:08] <nhm> sagewk: My plan is to start out by just generating summary info in spreadsheets where it's much easier to show that kind of dimensionality. Then 2D graphs / histograms / etc after that.
[19:11] <sagewk> nhm, elder, everyone: let's do skype for standup today
[19:12] <sagewk> conference rooms have been disassembled
[19:12] <sagewk> not that vidyo worked anyway :)
[19:12] <joao> okay
[19:12] <nhm> sagewk: out of curiosity, why did you disassemble the conference rooms? :)
[19:12] <dmick> maybe vidyo works better on iOS/Android :)
[19:12] <dmick> nhm: there was a party
[19:12] <nhm> ah
[19:12] <sagewk> openstack meetup last night
[19:13] <nhm> cool
[19:13] <dmick> probably had 50-60 people, couple presentations, beer/food, mingling
[19:13] <dmick> went really well I thought
[19:14] <sagewk> sounds like it. bummed i missed it.. clint was here, would have been an opportune time to pick his brain about precise status
[19:14] <nhm> Did Jay Pipes make it out?
[19:16] <elder> OK, I'm on...
[19:16] <elder> Or I thought I was.
[19:16] <nhm> elder: yeah, seems to be acting goofy
[19:16] <elder> Grrreeaaat.
[19:17] <joao> I second that
[19:17] <nhm> elder: frosted flakes for breakfast?
[19:17] <sagewk> let's do this in 16 minutes :)
[19:18] <elder> No, shredded wheat. You?
[19:18] <elder> OK.
[19:18] <nhm> elder: eggs with black beans and rice
[19:18] <elder> at :33 after? Presumably :30 after.
[19:18] <elder> Sounds interesting nhm.
[19:18] <nhm> elder: Standard costa rica breakfast though without the fried cheese.
[19:18] <elder> And the Grrreat didn't have an exclamation point, note.
[19:19] <joao> nhm, I'd call that lunch :p
[19:20] * danieagle (~Daniel@ has joined #ceph
[19:20] <nhm> joao: Yeah, sadly computer work isn't as demanding as working fields.
[19:20] <nhm> joao: Can't do that kind of breakfast often.
[19:22] <sagewk> ok working now
[19:22] <elder> Yup.
[19:24] <sagewk> elder: you on?
[19:24] <elder> sagewk, the xfstests are proceeding nicely so far. Some output mismatches are showing up, but I think some of that is "normal" (I'll have to re-run on a local filesystem).
[19:24] <elder> Now I am.
[19:27] * chutzpah (~chutz@ has joined #ceph
[19:34] * perplexed (~ncampbell@ has joined #ceph
[19:41] <perplexed> Placement rules... min size 1, max size 10. This is what's defining the number of object copies for the cluster? If so, why the wide range by default... and how come "ceph osd dump -o -|grep 'rep size'" indicates 2? Wouldn't I expect to see min/max both set to 2 in the rule to result in that? Apologies if this should be obvious...
[19:45] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[19:45] * jmlowe (~Adium@140-182-139-191.dhcp-bl.indiana.edu) has joined #ceph
[19:46] <joshd> perplexed: min and max size are a bit confusingly named - they actually govern when that placement rule is used - i.e. it's used when replication size is 1-10
[19:48] <joshd> perplexed: you don't have to change the crush rules to change replication level - see http://ceph.newdream.net/wiki/Adjusting_replication_level
[19:49] <perplexed> Thx for the clarification.
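joshd's point about min size and max size can be illustrated with a small sketch: the two values decide whether a placement rule is eligible for a given replication size, not how many copies are made. The `PlacementRule` class and `pick_rule` helper are hypothetical names for illustration only and are not the CRUSH implementation.

```python
from dataclasses import dataclass


@dataclass
class PlacementRule:
    name: str
    min_size: int  # smallest replication size this rule is used for
    max_size: int  # largest replication size this rule is used for


def rule_applies(rule, rep_size):
    """A rule is eligible only when the pool's rep size falls in its range."""
    return rule.min_size <= rep_size <= rule.max_size


def pick_rule(rules, rep_size):
    """Return the first eligible rule, or None if no rule covers rep_size."""
    for rule in rules:
        if rule_applies(rule, rep_size):
            return rule
    return None
```

With a rule of min size 1 and max size 10, a pool with rep size 2 matches the rule, which is why the replication level can be changed without touching the rule itself.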
[19:49] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[19:53] * jmlowe (~Adium@140-182-139-191.dhcp-bl.indiana.edu) Quit (Ping timeout: 480 seconds)
[19:58] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:30] <Tv|work> sagewk: for your consideration: https://github.com/ceph/ceph/commits/kill-btrfs-devs
[20:30] <Tv|work> actually, one more rebase..
[20:31] <Tv|work> that's better
[20:32] * BManojlovic (~steki@ has joined #ceph
[20:34] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[20:41] * danieagle (~Daniel@ Quit (Quit: See you later :-) and Thank You Very Much for Everything!!! ^^)
[20:51] * pmcarthur (~peter@nyv-exweb.iac.com) has joined #ceph
[20:58] * jmlowe (~Adium@129-79-134-204.dhcp-bl.indiana.edu) has joined #ceph
[21:02] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[21:03] <joao> Tv|work, that branch name is a bit misleading :p
[21:03] * jmlowe (~Adium@129-79-134-204.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[21:06] * LarsFronius (~LarsFroni@g231139206.adsl.alicedsl.de) has joined #ceph
[21:13] <NaioN> what was the option to have the osd's use a different interface for their data?
[21:13] <NaioN> is there documentation about all the options you can set in ceph.conf?
[21:16] <joshd> NaioN: cluster_addr
[21:16] <joshd> some of the options have comments in src/common/config_opts.h, but actual documentation for them is being created
[21:18] <perplexed> rados bench - indicates 4KB is the default write object size, but it looks as though 4MB is actually being used ("Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds." with no -b value specified - defaults). Man page error?
[21:18] <NaioN> joshd: thx! was searching for that one
[21:20] <nhm> perplexed: sounds like it.
[21:21] <perplexed> Also, is there any guidance on how best to interpret the results of rados bench? What min and average latency results during the test are measuring (units are seconds? What latency is being measured?).
[21:22] <perplexed> Checked the wiki/man page. I'll check back through the mailing list archive too though..
[21:28] <joshd> perplexed: I think rados bench is measuring the latency of each operation it does (i.e. total time from send to osd until receiving the message meaning 'data is on disk on all replicas')
[21:28] <joshd> it's in seconds too
[21:29] <perplexed> That helps. Thanks
[21:30] <Tv|work> joao: we'll call it a happy accident
[21:34] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:43] <wido> sagewk: Regarding the new HB code, everything is still happy
[21:44] <wido> Tv|work: killing btrfs devs already?
[21:57] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[21:57] * Oliver1 (~oliver1@p5483CFD5.dip.t-dialin.net) has joined #ceph
[22:05] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:13] <perplexed> Is there any way to have one osd associated with more than one filesystem/disk on a server, or is the typical approach to have one osd defined per HD? I have servers with 11 storage HD's... I'm assuming I will need to define 11 osd's per server to utilize all spindles efficiently...
[22:14] <perplexed> wasn't sure if ceph-osd was able to support multiple filesystems/disks under one instance.
[22:17] <sjust> perplexed: usually we recommend one osd per disk
[22:17] <nhm> perplexed: This doesn't exactly answer your question, but if you have that many disks you may want to have each disk behind an OSD. That way you can spread your journals out for better performance.
[22:17] <gregaf> not that we actually know if that's the right tradeoff or not...
[22:17] <gregaf> I'd really like somebody to do RAID (or btrfs striping) just so we can compare
[22:19] <nhm> gregaf: I think the journal performance is basically drowning everything else out right now...
[22:19] <gregaf> it's not really a performance thing, more about costs of recovery and RAM
[22:20] <nhm> gregaf: hrm?
[22:21] <gregaf> with a big bundle of disks behind one OSD the recovery risks and costs change dramatically
[22:21] <nhm> gregaf: oh, yeah.
[22:22] <gregaf> and each OSD process on a machine takes 100-200MB of RAM away from disk caching, so by having an OSD/disk you're reducing your cache:disk ratio
[22:22] <nhm> gregaf: The performance tradeoffs there are more how well we can scale up OSDs vs how fast we can write to individual OSDs.
[22:22] <gregaf> we don't actually know what the right tradeoffs are, we made a WAG about it by saying we preferred smaller units of failure
[22:23] <elder> dmick, Can you grab a screen shot of plana54 console for me before I attempt to reset it?
[22:24] <nhm> gregaf: My understanding too though, is that each OSD only has a single journal thread doing directIO writes to its journal, and that at least on XFS and Ext4 those journal writes must complete before a transfer can complete.
[22:25] <gregaf> yeah, true, but for most users their journal can run as fast as their network interface can, so it's not a limiting factor
[22:26] <nhm> gregaf: Why is that?
[22:26] <gregaf> because most people are on GigE and getting a journal set up that can handle 125MB/s is easy
[22:26] <nhm> gregaf: I'm figuring that by the time we have lots of users everyone will be doing 10G
[22:27] <nhm> At least anyone who cares about performance in a box with 12 drives...
[22:27] <gregaf> okay, but I bet that by the time 10G is common then getting a journal that can write at 10G will be common too :p
[22:28] <sjust> gregaf: nope
[22:28] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Quit: WeeChat 0.3.2)
[22:29] <nhm> gregaf: 500MB/s is possible out of modern OSDs, but so far I'm not convinced we can get that to our journals...
[22:30] <nhm> s/OSDs/SSDs
[22:30] <gregaf> well, journaling is a good point about scaling, but nonetheless, we really don't know what the right set of tradeoffs is and so I dislike telling people to use an OSD/disk without qualifying it
[22:30] <sjust> gregaf: yep
[22:31] <nhm> gregaf: yeah, fair enough.
[22:32] <nhm> Probably the more important thing is to say that your network, OSDs, and Journals should all be able to handle the target level of throughput.
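The throughput argument above comes down to simple arithmetic: the slowest of network, journal, and data disks sets the ceiling. A rough sketch follows. The 125 MB/s GigE figure and the 500 MB/s SSD figure come from the discussion; the function names and any aggregate disk figure passed in are assumptions for illustration, and protocol overhead is ignored.

```python
def link_mb_per_s(gbit_per_s):
    """Raw link rate in MB/s: 1 Gbit/s = 1000 Mbit/s, 8 bits per byte."""
    return gbit_per_s * 1000 / 8


def bottleneck(network_mb_s, journal_mb_s, disks_mb_s):
    """Return (stage name, MB/s) for the slowest stage in the write path."""
    stages = {
        "network": network_mb_s,
        "journal": journal_mb_s,
        "disks": disks_mb_s,
    }
    name = min(stages, key=stages.get)
    return name, stages[name]


# GigE with one 500 MB/s journal SSD and a hypothetical 1000 MB/s of disks:
# the network is the limit, as gregaf argues.
gige_case = bottleneck(link_mb_per_s(1), 500, 1000)

# The same box on 10GigE: the journal becomes the limit, which is nhm's concern.
ten_gig_case = bottleneck(link_mb_per_s(10), 500, 1000)
```

The same reasoning applies to perplexed's test cluster later in the log, where one SSD journals for all OSDs on a GigE server.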
[22:32] * lofejndif (~lsqavnbok@83TAAELYT.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:34] <nhm> I was thinking that a box with 10 7200RPM drives, 2 fast SSDs, and 10GE is roughly perfect.
[22:35] * lofejndif (~lsqavnbok@83TAAELYT.tor-irc.dnsbl.oftc.net) Quit ()
[22:36] * Oliver1 (~oliver1@p5483CFD5.dip.t-dialin.net) has left #ceph
[22:45] * lofejndif (~lsqavnbok@83TAAELY3.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:02] <perplexed> Thx for the thoughts. In my case I'm setting up a test cluster... just 4 servers, 11 2TB internal HD's per server, 1 SSD per server for the journal. These are GigE (prod servers would be 10GigE, and would likely have 16 HD's w SSD for the Journal).
[23:03] <perplexed> I'll go ahead and re-config the test cluster w 11 osd's per server and will see how that goes.
[23:10] * LarsFronius (~LarsFroni@g231139206.adsl.alicedsl.de) Quit (Quit: LarsFronius)
[23:11] <nhm> perplexed: Hopefully your primary bottleneck should be the network either way. It's possible the journal could limit you in some situations.
[23:12] <perplexed> Yup. One SSD as the journal fs for all 11 osd's... If the nw is the bottleneck I'll be happy with that :)
[23:14] <nhm> perplexed: definitely let us know how your testing goes..
[23:16] <Tv|work> for the record, that's sort of close to our "burnupi" test boxes.. ssd + 8x1TB here
[23:17] <Tv|work> so if anyone here wants to play.. ;)
[23:17] <nhm> Tv|work: As soon as they are ready I'd love some nodes. :D
[23:17] <Tv|work> nhm: i do believe they are pretty ready; they just won't be teuthology targets
[23:17] <Tv|work> nhm: Dan is your man on them
[23:18] <nhm> Tv|work: Yeah, last time I asked he still had some work to do.
[23:18] <Tv|work> i expect he'll have some left even after we start using them ;)
[23:18] <Tv|work> oh the 10gig is still down i do believe
[23:18] <Tv|work> just 1gig for now
[23:19] <nhm> Tv|work: Mark promised me the whole cluster when I was hired. ;)
[23:19] <Tv|work> hahaha
[23:20] <Tv|work> nhm: you should have been here last night, at the openstack meetup.. the speaker asked "who here has a 25-machine cluster to play with"
[23:20] <Tv|work> he meant it rhetorically
[23:20] <Tv|work> but i'm a smart ass so of course i raised my hand
[23:21] <nhm> Tv|work: I do have to say that I miss being able to grab a hundred or so nodes any time I want.
[23:22] <Tv|work> nhm: oh hey so, in the hpc world.. did you guys do exclusive access to nodes? was it just the job scheduler handling "locking" of nodes, or what?
[23:24] <nhm> Tv|work: We used torque and moab almost exclusively with a rather complicated queue setup. Basically we had a couple of different queues with various max run times and node count limitations. Depending on how close you were to your "SU" (ie currency) usage target, your priority for nodes got raised or lowered.
[23:24] <nhm> Staff had extremely high priority... :)
[23:24] <nhm> Generally speaking nodes weren't locked unless we really needed them to be.
[23:25] <Tv|work> yeah if your user interface is the queue...
[23:28] <nhm> Tv|work: PBS submission scripts are kind of the standard in the HPC world unless you are using SLURM. It's a little archaic but works well enough for most typical jobs.
[23:29] <nhm> here's a simple example: http://amdahl.physics.purdue.edu/using-cluster/node24.html
[23:31] <nhm> We had scripts setup to kill any user-spawned process not owned by the job submitter every 30s or so.
[23:36] * lofejndif (~lsqavnbok@83TAAELY3.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[23:42] <joao> nhm, I always assumed (mainly due to a couple of college courses) that Condor was "the real deal" when it came to job scheduling in the hpc world
[23:44] <dmick> //SYSPRINT DD SYSOUT=*
[23:46] <nhm> joao: Condor is nice for some things. If your app can make use of their libraries you can do checkpointing and interruption. That makes it really good for things like running low-priority background jobs on a lab of PCs when they aren't in use.
[23:47] <nhm> joao: I haven't seen it used often for large homogenous clusters though as the primary queuing system. Even purdue (who uses it extensively) only uses it to backfill for PBS on their big systems.
[23:47] <joao> oh, okay
[23:48] <joao> well, I had a professor who tried very hard to brainwash us towards Condor and the Globus Tool Kit
[23:49] <joao> I can't say he did a good job though; I find them both hideous
[23:51] <nhm> joao: Yeah, condor has a very dedicated following of users. I never really saw the big draw of it either. I think if someone wrote a good scheduler for openstack and someone got PCI passthrough + SR-IOV working with IB cards, you could basically replace both PBS and Condor in the HPC world.
[23:53] <nhm> Globus is another can of worms. If you think globus is hideous, you should look at cagrid. It's basically a bioinformatics grid that was cobbled together ontop of the globus libraries.
[23:54] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[23:55] <sagewk> sjust: wip-name-sequencers seem ok?
[23:56] <nhm> joao: Think ancient C meets java soap libraries meets crazy people obsessed with semantic vocabularies and ontologies.
[23:57] <joao> lol
[23:57] <joao> nhm, I can't imagine such thing
[23:58] <joao> globus was probably my worst nightmare during the msc
[23:58] <joao> when you get a submission file with as many lines as the program you want to submit, then you know something is terribly wrong
[23:59] <nhm> joao: were you using gram?
[23:59] <joao> not sure, really
[23:59] <joao> it was a couple of years back
[23:59] <joao> the only thing I got imprinted in my brain is globus and its xml files

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.