#ceph IRC Log


IRC Log for 2012-02-03

Timestamps are in GMT/BST.

[0:09] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[0:14] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[0:33] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Remote host closed the connection)
[0:34] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[0:34] * fronlius (~fronlius@e176052045.adsl.alicedsl.de) Quit (Quit: fronlius)
[0:36] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Connection reset by peer)
[0:37] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:35] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:52] * adjohn is now known as Guest1388
[2:52] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[2:52] * Guest1388 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Connection reset by peer)
[3:17] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[3:55] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[4:13] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:20] * rosco (~r.nap@ Quit (Read error: Operation timed out)
[4:40] * rosco (~r.nap@ has joined #ceph
[4:48] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:55] * gregaf (~Adium@aon.hq.newdream.net) Quit (Read error: Connection reset by peer)
[4:56] * joshd (~joshd@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[4:57] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[4:59] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[4:59] * Guest1271 (~Mike@awlaptop1.esc.auckland.ac.nz) Quit (Read error: Connection reset by peer)
[5:21] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[6:47] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[7:31] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[7:32] * tjikkun_ (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Ping timeout: 480 seconds)
[7:35] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:41] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:45] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[8:11] * lx0 is now known as lxo
[8:40] * joao (~joao@ has joined #ceph
[9:13] * joao (~joao@ Quit (Quit: joao)
[9:28] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:36] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[9:56] <Kioob`Taff> hi
[9:57] <Kioob`Taff> gregaf: did you find anything useful in the second logs ?
[10:01] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[10:02] * joao (~joao@ has joined #ceph
[10:11] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:10] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:29] <lxo> sjust, the osd randomization patch did wonders. the only thing it couldn't fix was that, because osd.0 had got behind in pushing its own pgs to replicas, now all active pgs have it as primary, and don't benefit from the patch any more
[11:55] <lxo> I wonder if it would be too hard to get the pg primary to tell a replica to pull an object from a random osd holding the object
[11:56] <lxo> rather than having the primary push them all from local storage
[12:03] <lxo> yeah, it doesn't look like it would be a trivial change, let alone a safe one
[12:35] <nhm> morning all
[12:51] * __jt__ (~james@jamestaylor.org) Quit (Remote host closed the connection)
[13:37] * joao is now known as Guest1426
[13:37] * Guest1426 (~joao@ Quit (Read error: Connection reset by peer)
[13:37] * joao_ (~joao@ has joined #ceph
[13:37] * joao_ is now known as joao
[13:54] * fghaas (~florian@85-127-92-127.dynamic.xdsl-line.inode.at) has joined #ceph
[14:28] <NaioN> could anyone tell me what the new pg status backfill means?
[14:29] <NaioN> I read it is for faster recovery, but what is it exactly?
[14:30] <fghaas> installing 0.41 on opensuse 12.1 (rpm built with rpmbuild -tb from released tarball) does create the libcls_rbd.so.1 -> libcls_rbd.so.1.0.0 symlink in /etc/lib64/rados-classes, but not plain libcls_rbd.so. when clients attempt to touch rbd, the osd tries to dlopen() libcls_rbd.so which fails. pretty sure this is unintentional. where in the autofoo magic should I fix this?
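
The report above comes down to a directory that holds versioned library names (libcls_rbd.so.1, libcls_rbd.so.1.0.0) but no plain libcls_rbd.so for the osd to dlopen(). A minimal Python sketch of a check for that condition follows. The function name missing_plain_so and the scratch-directory layout are hypothetical and only recreate the file names quoted in the message; this is not part of ceph or of its build system.

```python
import os
import re
import tempfile

# Matches versioned shared-object names such as "libcls_rbd.so.1" or
# "libcls_rbd.so.1.0.0" and captures the unversioned "libcls_rbd.so" part.
VERSIONED_SO = re.compile(r"^(?P<base>.+\.so)\.\d+(\.\d+)*$")


def missing_plain_so(class_dir):
    """Return unversioned .so names that have versioned entries but no plain one."""
    names = set(os.listdir(class_dir))
    wanted = set()
    for name in names:
        match = VERSIONED_SO.match(name)
        if match:
            wanted.add(match.group("base"))
    return sorted(wanted - names)


if __name__ == "__main__":
    # Recreate the reported layout in a scratch directory (not a real install).
    with tempfile.TemporaryDirectory() as class_dir:
        open(os.path.join(class_dir, "libcls_rbd.so.1.0.0"), "w").close()
        os.symlink("libcls_rbd.so.1.0.0", os.path.join(class_dir, "libcls_rbd.so.1"))
        print(missing_plain_so(class_dir))
```

Pointing the function at the real class directory would list every class library that a dlopen() of the plain name could not find.
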
[14:54] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:25] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[15:32] <fghaas> following up on my own report above; I seem to be unable to reproduce the issue on debian, so I might have been bitten by a suse-ism
[15:39] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) has joined #ceph
[15:58] * BManojlovic (~steki@ has joined #ceph
[16:02] * stass (stas@ssh.deglitch.com) Quit (Read error: Connection reset by peer)
[16:03] * fghaas (~florian@85-127-92-127.dynamic.xdsl-line.inode.at) has left #ceph
[16:52] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:52] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Connection reset by peer)
[17:36] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[17:39] * fred_ (~fred@80-219-180-134.dclient.hispeed.ch) has joined #ceph
[17:43] <fred_> yehudasa, I just noticed that your fix (posted at http://pastebin.com/Uw05TEgG) was not committed into ceph's git repository... I thought you may need a reminder. Have a nice day.
[17:46] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:47] <yehudasa> fred_: yeah, I actually remembered that I forgot to push it and was planning to do it today, thanks
[17:47] * joao (~joao@ Quit (Quit: joao)
[17:51] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Ping timeout: 480 seconds)
[18:06] * fred_ (~fred@80-219-180-134.dclient.hispeed.ch) Quit (Quit: Leaving)
[18:20] * vodka (~paper@212.Red-83-55-54.dynamicIP.rima-tde.net) has joined #ceph
[18:24] * Tv|work (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:39] <gregaf> Kioob`Taff: I managed to track it down to a single stuck inode lock, couldn't track down why exactly it was stuck
[18:40] <gregaf> created a bug for it, #2019 I believe
[18:42] <gregaf> NaioN: backfill means that the primary is sending PG data to a new or very out-of-date replica
[18:42] <gregaf> the new thing in backfill is that it's incremental rather than a single batch, so it reduces memory usage and load and increases reliability :)
[18:45] * joao (~joao@ has joined #ceph
[18:51] * bchrisman (~Adium@ has joined #ceph
[18:58] * vodka (~paper@212.Red-83-55-54.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[19:01] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:07] * vodka (~paper@212.Red-83-55-54.dynamicIP.rima-tde.net) has joined #ceph
[19:10] <Kioob> ok, thanks gregaf !
[19:10] <NaioN> gregaf: aha thanks, in case of a new one it's the same as degraded?
[19:13] * chutzpah (~chutz@ has joined #ceph
[19:23] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[19:46] <gregaf> NaioN: I'm not sure exactly how the state flags compare/compose, sjust or sagewk would have to tell you that…
[19:46] <gregaf> but I believe that it's just reporting more information than it did previously
[19:46] <sagewk> naion: backfill means we're copying/migrating a pg wholesale over to another node.
[19:46] <sjust> NaioN: backfill does not imply degraded
[19:53] * __jt__ (~james@jamestaylor.org) has joined #ceph
[20:18] * fronlius (~fronlius@f054019049.adsl.alicedsl.de) has joined #ceph
[20:21] * stass (stas@ssh.deglitch.com) has joined #ceph
[20:34] <NaioN> ok
[20:35] <NaioN> so now you can see which pgs get moved if you add/replace a pg...
[20:35] <NaioN> s/which/how many
[20:39] * fronlius (~fronlius@f054019049.adsl.alicedsl.de) Quit (Quit: fronlius)
[21:23] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[21:24] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Quit: Leaving)
[21:26] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[21:30] * fronlius (~fronlius@g225245031.adsl.alicedsl.de) has joined #ceph
[21:47] * vodka (~paper@212.Red-83-55-54.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[22:06] * BManojlovic (~steki@ has joined #ceph
[22:07] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[22:12] * amichel (~amichel@wsip-70-164-248-101.tc.ph.cox.net) has joined #ceph
[22:14] <amichel> So, I'm building a storage node for use in a ceph cluster, and I'm a little confused about OSD sizing. Is it better for a big storage box to present one giant pool of disk via a single OSD or multiple smaller stripes with an OSD each?
[22:20] <gregaf> amichel: we tend to recommend multiple smaller stripes, but really we have no idea; it needs further experience in operation :)
[22:20] <nhm> amichel: Testing will be done at some point. :D
[22:21] <amichel> Ha, fair enough. What are the influential factors? I'd be glad to deliver some data for you while I'm doing my build out if it would be helpful
[22:21] <nhm> amichel: how many nodes?
[22:21] <amichel> Building my first node right now, actually
[22:21] <Tv|work> amichel: 1 osd -> you lose lots of data on any failure (that makes it through e.g. RAID); they share fate *and* performance
[22:22] <amichel> Second to follow after I crystalize the software configuration
[22:22] <Tv|work> amichel: lots of osds -> RAM & CPU usage spike during recovery
[22:22] <Tv|work> amichel: one OSD per disk is assumed best up to ~12 disks per machine, beyond that.. we don't really have a recommendation
[22:22] <amichel> The storage node is a modified Backblaze pod so I've got a lot of disks
[22:22] <Tv|work> amichel: assuming decent server cpu & ram sizing
[22:22] <amichel> :D
[22:23] <Tv|work> amichel: Backblaze design -> you'll never be able to read the disks anyway
[22:23] <amichel> How do you mean?
[22:23] <nhm> amichel: fantastic, that's one of the setups I wanted to look into.
[22:23] <Tv|work> amichel: too much storage compared to IO
[22:23] <Tv|work> amichel: good for archival, pretty bad for actual use
[22:23] <amichel> I do not follow what you mean by that. I mean, I can't peg the disks or anything, but I can hit 1GB/s reading.
[22:23] <gregaf> he's talking about the "storage wall"
[22:24] <nhm> Tv: That kind of design can be decent enough with IB, but it takes a lot of tuning.
[22:24] <Tv|work> amichel: sure, but that's 1GB/s read off of 67TB of storage
[22:24] <gregaf> how long it takes to put all the data in or take all the data out
[22:24] <amichel> It's actually 100TB or so online
[22:24] <amichel> :D
[22:24] <nhm> amichel: 3TB drives then?
[22:24] <amichel> Yeah
[22:25] <Tv|work> so about 28 hours to read it all under ideal circumstances
[22:25] <gregaf> it's actually only about a day, which is less bad than I'd have thought
[22:25] <Tv|work> good for archival, not good for use
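
The "storage wall" arithmetic in this exchange can be reproduced with a few lines of Python. This is a rough sketch under the same ideal assumption the speakers make: a sustained 1 GB/s with no other load. It uses decimal units (1 TB = 1000 GB), and the function name hours_to_read is made up for illustration.

```python
def hours_to_read(capacity_tb, throughput_gb_per_s):
    """Hours needed to stream the full capacity once at a sustained rate.

    Uses decimal units (1 TB = 1000 GB), as disk vendors do.
    """
    seconds = capacity_tb * 1000 / throughput_gb_per_s
    return seconds / 3600


print(round(hours_to_read(100, 1.0), 1))  # ~100 TB online at 1 GB/s -> 27.8
print(round(hours_to_read(67, 1.0), 1))   # the 67 TB figure quoted first -> 18.6
```

The 100 TB case matches the "about 28 hours" estimate, and any recovery or re-replication of a whole node is bounded by the same figure.
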
[22:25] <amichel> I have to run but I'll be back in channel to understand this. I'm perhaps not doing what I think I was doing :D
[22:25] <nhm> amichel: thought about going 40GBE?
[22:25] <amichel> Thought about multi-10G
[22:26] <amichel> But I'll run short on disk performance before I hit 40Gb
[22:26] <amichel> The port multipliers really limit the throughput
[22:26] <amichel> Ok, I'll be back
[22:26] <nhm> amichel: yeah, I've been curious how a boded 10GBE setup would work.
[22:26] <amichel> Thanks for the discussion guys
[22:26] <nhm> s/boded/bonded
[22:27] * amichel (~amichel@wsip-70-164-248-101.tc.ph.cox.net) Quit (Quit: Bad news, everyone!)
[22:27] <Tv|work> worry about failure isolation
[22:27] <Tv|work> one disk taking out the whole HBA would be a disaster, with that setup
[22:28] <Tv|work> if you're ok with that, well then you're ok with it ;)
[22:28] <Tv|work> and your only criteria is "don't run too many OSDs to run out of RAM&CPU"
[22:28] <Tv|work> so you need to bundle the disks somehow
[22:28] <nhm> Tv: we've got 60 drives behind each of our lustre OSSes right now, but are using QDR IB.
[22:28] <Tv|work> (i guess i'm saying this more for nhm's benefit)
[22:29] <Tv|work> nhm: yes paying more money will make you avoid the problems of cheap hardware
[22:29] <Tv|work> nhm: that's why we use fairly fancy RAID controllers as stupid HBAs
[22:29] <iggy> nfw
[22:30] <nhm> Tv|work: I was actually curious what controllers you guys were using. H700s?
[22:30] <Tv|work> but you have to realize, backblaze is built on the business model of "customers dribble data in, only a fraction of them ever want their data out"
[22:30] <Tv|work> nhm: talk to Carl Perry on that
[22:30] <Tv|work> nhm: he's specific to the level of "and then you re-flash this firmware" etc
[22:31] <nhm> Tv|work: sounds like amichel has something a bit more performance oriented than a standard backblaze design if he's using 10GBE and contemplating a bonded setup...
[22:31] <Tv|work> nhm: beefing the networking won't help if the bottleneck is the SATA, though
[22:32] <iggy> or mem... or cpu... or...
[22:32] <nhm> Tv: Yeah, Dell's firmware for their controllers is way behind the stock LSI firmware.
[22:32] <darkfader> right now the bottleneck is pcie2.0 i think
[22:32] <Tv|work> iggy: usually mem or cpu use of osd is fairly low, but it spikes during recovery.. so not as much a performance issue as reliability issue
[22:32] <nhm> Tv: Yeah, I'm assuming he's got multiple SATA/SAS cards in there.
[22:32] <Tv|work> iggy: unless you run Atom ;)
[22:32] <darkfader> i think one FDR hba can basically flood it
[22:33] <nhm> darkfader: can get about 6GB/s on 16x PCIE 2.0
[22:33] <darkfader> nhm: yeah but are there any raid or ib cards in 16x?
[22:33] <iggy> but if he's having to bundle the disks together with some sort of software raid (iirc the backblaze setups uses softraid), that's certainly going to be some cpu
[22:33] <darkfader> i only see 8x everywhere
[22:33] <nhm> darkfader: Yeah, 8x is the only thing I ever see, so half that. But if you are going high performance you probably aren't using a single controller anyway...
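
For reference, the theoretical PCIe 2.0 ceiling behind these numbers can be worked out from the per-lane signalling rate (5 GT/s) and the 8b/10b line coding. The Python sketch below gives only that raw one-direction limit; the roughly 6 GB/s quoted above for 16x is a practical figure, presumably lower because of protocol overhead. The function name pcie2_peak_gb_per_s is made up for illustration.

```python
def pcie2_peak_gb_per_s(lanes):
    """Theoretical one-direction PCIe 2.0 bandwidth in GB/s for a lane count."""
    transfers_per_s = 5e9   # 5 GT/s per lane
    encoding = 8 / 10       # 8b/10b coding: 8 data bits per 10 bits on the wire
    bytes_per_s = lanes * transfers_per_s * encoding / 8
    return bytes_per_s / 1e9


for lanes in (8, 16):
    print(lanes, "lanes:", pcie2_peak_gb_per_s(lanes), "GB/s")
```

This works out to about 4 GB/s for an 8x slot and 8 GB/s for a 16x slot, so an 8x card has half the headroom of a 16x one, in line with the "half that" remark.
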
[22:34] <darkfader> yeah thats true
[22:38] <nhm> the fastest single-node performance I've seen was the setup Bull presented at the last lustre conference with their cpu/memory affinity patches: https://encrypted.google.com/url?sa=t&rct=j&q=bull+2011+lustre+&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fwww.olcf.ornl.gov%2Fwp-content%2Fevents%2Flug2011%2F4-14-2011%2F900-930_Diego_Moreno_LUG_Bull_2011.pdf&ei=C1MsT7SCDYrBtgfdoPTnDw&usg=AFQjCNEKuA8o2GU_mfe2bFlRmSR12hQG7Q
[22:38] <nhm> ugh, stupid google.
[22:39] <darkfader> its ok
[22:39] <darkfader> i can click them in irssi for some reason that is above me
[22:39] <darkfader> it works
[22:40] <darkfader> hehe that is fun to read so far
[22:42] <nhm> Yeah. For ceph it'll probably be better to stick with smaller single socket servers and scale out servers.
[22:42] <nhm> but who knows...
[22:42] <Tv|work> nhm: i understand you will, in a few months ;)
[22:43] <darkfader> thanks very much for the link
[22:43] <nhm> Tv: :D
[22:44] <darkfader> reminds me i still need to find two cheap old origins with numalink
[22:47] <nhm> Tv|work: It'll be interesting to do a cost/performance/reliability breakdown for smaller nodes vs larger nodes. The problem with the 12drive 2U servers is that they just aren't that dense.
[22:49] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:50] <darkfader> nhm: i have one 2u/12 drive box and it is really not making me happy
[22:51] <darkfader> a twin^2 so you have 4 compute nodes of 3 disks each, thats as not dense as it gets
[22:51] <darkfader> i guess in a year i'll put in slower nodes and use it for "extra" copy osds only
[22:52] <nhm> darkfader: Are you a supermicro shop?
[22:52] <darkfader> nhm: oh shop would be so much overstating it
[22:52] <darkfader> i have 3 intel 1u and 1 ibm 2u and that supermicro
[22:52] <darkfader> the intels all have nice comfy 2.5" disks
[22:53] <darkfader> the supermicro is less pita than ibm and more than intel
[22:54] * fronlius (~fronlius@g225245031.adsl.alicedsl.de) Quit (Quit: fronlius)
[22:55] <nhm> darkfader: we've got some Dell C6100s which are basically just rebadged Twin^2 boxes.
[22:55] <darkfader> i hope they replaced the ipmi/bmc software
[22:55] <darkfader> but i didnt know they are the same tech inside
[22:56] <darkfader> i'm missing some basic features, i.e. seeing the slot id in dmidecode
[22:56] <nhm> darkfader: yeah, apparently they basically just outsourced the devel to supermicro.
[22:56] <darkfader> the backplane is verrryyyyy passive
[22:57] <darkfader> do the dell ones take SAS disks?
[22:57] <darkfader> or also sata only?
[22:57] <nhm> darkfader: The dells will do SAS
[22:57] <darkfader> ah :)
[23:15] * izdubar (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[23:36] * izdubar (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[23:58] * fronlius (~fronlius@g224061030.adsl.alicedsl.de) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.