#ceph IRC Log


IRC Log for 2012-09-06

Timestamps are in GMT/BST.

[0:00] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:12] * robkain (~Torfir@ has joined #ceph
[0:15] * sagelap (~sage@139.sub-166-250-66.myvzw.com) Quit (Ping timeout: 480 seconds)
[0:26] * robkain (~Torfir@ Quit (Quit: robkain)
[0:39] * tnt (~tnt@19.110-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:52] * sagelap (~sage@157.sub-166-250-69.myvzw.com) has joined #ceph
[0:57] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:06] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[1:11] * yoshi (~yoshi@p28146-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:13] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) has joined #ceph
[1:18] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[1:21] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[1:22] * amatter (~amatter@ Quit (Read error: Operation timed out)
[1:22] * Tv_ (~tv@2607:f298:a:607:51e0:e578:bd15:6681) Quit (Ping timeout: 480 seconds)
[1:23] * pentabular (~sean@adsl-71-141-232-252.dsl.snfc21.pacbell.net) Quit (Read error: Operation timed out)
[1:29] * EmilienM (~EmilienM@ADijon-654-1-107-27.w90-39.abo.wanadoo.fr) Quit (Quit: kill -9 EmilienM)
[1:36] * sagelap1 (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[1:42] * sagelap (~sage@157.sub-166-250-69.myvzw.com) Quit (Ping timeout: 480 seconds)
[1:42] * amatter (amatter@c-174-52-137-136.hsd1.ut.comcast.net) has joined #ceph
[1:55] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[1:58] * pentabular (~sean@adsl-71-141-232-252.dsl.snfc21.pacbell.net) has joined #ceph
[2:07] * stan_theman (~stan_them@ Quit (Ping timeout: 480 seconds)
[2:10] <elder> Well after about four tries I'm guessing I can't do kdb breakpoints *and* dynamic debug output together.
[2:10] <elder> At least that's my current theory.
[2:11] <dmick> that sounds....distasteful elder
[2:14] <elder> I'm kind of starting to lose my mind I think.
[2:15] <dmick> well you are debugging kernel code. that's sort of the natural ground state of such work
[2:15] <dmick> but it feels so good when you find it
[2:16] <joshd> elder: sage mentioned that it was possible to use gdb with uml, that might be a lot easier
[2:16] <elder> That's what I've been using all along, but the problem doesn't arise in UML.
[2:17] <dmick> <audience>awwwww..</>
[2:17] <elder> I am seriously trying to verify it's *not* some sort of compiler problem, but I feel like a n00b thinking that way.
[2:18] <dmick> if you'd like me to be your rubber ducky, it's not like I'm getting much else accomplished today, if that would help
[2:18] <elder> You could try to see if you hit the same problem on your own nodes I suppose, but I'm pretty sure you will.
[2:19] <elder> You can see the latest state of things on ceph-client/wip-rbd-new
[2:19] * amatter_ (~amatter@ has joined #ceph
[2:20] <elder> you'll notice there are a few commits at the end that are simply adding debugs, which will narrow you in to the code involved.
[2:20] <dmick> kicking off an install, will look momentarily
[2:20] <elder> Go fetch that stuff if you want to play along, and once you're there I'll continue.
[2:21] * amatter (amatter@c-174-52-137-136.hsd1.ut.comcast.net) Quit (Ping timeout: 480 seconds)
[2:27] <elder> I can use kdb tracepoints if I don't activate the debug prints...
[2:27] <dmick> what happens with both?
[2:27] <elder> Freeze. And I power cycle and try again 10 minutes later.
[2:27] <dmick> ugh.
[2:28] <dmick> branch checked out, cscope/tags making, looking at commits next
[2:30] <dmick> ewg. if you're looking directly at the stack dump from code, you're...in need of debugging tools :)
[2:31] <dmick> I wonder how stupid it would be to try to glue gdb to IPMI-serial-console
[2:31] <elder> I would love that but I haven't tried it in ages--two or three jobs ago I think.
[2:31] <elder> Have you done it recently?
[2:37] <dmick> no
[2:38] <dmick> I did gdb-through-to-serial-on-UML
[2:38] <dmick> (rather than gdb-on-uml-binary)
[2:38] <dmick> just to say I could
[2:38] <dmick> it wasn't very happy
[2:38] <dmick> but it worked
[2:38] <dmick> my impression, not founded in anything, is that the gdb remote protocol is not necessarily very robust. seems to be "ascii, and if you drop a byte, god help you"
[2:39] <dmick> but this is only an impression
[2:39] <dmick> anyway: to your actual problem:
[2:39] <elder> Maybe. Like I said I used it in the past and it served well. But I think I had a pretty solid connection on Real Serial Lines.
[2:39] <dmick> what's the symptom? is there a bug?
[2:39] <elder> OK. Look in drivers/block/rbd.c
[2:40] * JT (~john@astound-64-85-239-164.ca.astound.net) Quit (Quit: Leaving)
[2:40] <elder> My tree is actually sitting at 1cf36c07ac63f363a0be7817a16f17ef4af8c9b3 at the moment.
[2:40] <elder> But it's all pretty close,
[2:40] <elder> just debug messages added or removed.
[2:40] <elder> The problem occurs when function _rbd_dev_v2_snap_features() returns.
[2:41] <elder> It's actually in function rbd_dev_v2_probe(), on the return from rbd_dev_v2_features()
[2:41] <dmick> ok
[2:42] <elder> If you see the two dout() lines--the one after rbd_dev_v2_object_prefix() and the one after rbd_dev_v2_features()--that demonstrates the problem.
[2:42] <elder> Before the *features() call, rbd_dev is a valid pointer value.
[2:42] <elder> After the call, even though it succeeds (ret == 0), the value of rbd_dev has changed to 0x0000...001
[2:43] <elder> At this level, *nothing* has changed the local variable value.
[2:43] <elder> But in fact I have just verified that something seems to have written 0000001 into the stack before the called function restores the value of rbx, which holds rbd_dev
[2:44] <elder> So after all day of this I have narrowed it down to that. Somewhere inside _rbd_dev_v2_snap_features() that is occurring.
[2:44] <elder> Now that I am successfully setting the breakpoints in kdb I can narrow it further.
[2:45] <dmick> ok
[2:45] <elder> Last time, before I rebooted, I verified that the stack value that was about to get restored to rbx before return contained 00000001
[2:45] <elder> About to check it again.
[2:47] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) Quit (Quit: Leaving.)
[2:49] <dmick> so it was ok inside rbd_dev_v2_features?
[2:49] <elder> Yes.
[2:49] <dmick> oh that *is* weird.
[2:50] <elder> And I now know the problem is that the stack location where register rbx was pushed on entry to the function has a different value that's being restored before return.
[2:50] <dmick> so a sometimes-amusing technique I've used is to create a landing area of stack, filled with a known value
[2:50] <dmick> and then look at the landing area and see what shows up there
[2:50] <dmick> sometimes you can see other damage that shows a clue
[2:50] <dmick> like, a big automatic char array full of 'A' or something
[2:51] <dmick> doesn't help if the damage is targeted to stack you can't affect
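A minimal sketch of the landing-area idea dmick describes, modeled in Python rather than kernel C. A bytearray stands in for the automatic char array full of 'A', and the simulated stray write is made up for illustration; `make_landing_area` and `find_damage` are hypothetical helper names, not anything from rbd.c.

```python
FILL = ord("A")  # known pattern byte, as in "a big automatic char array full of 'A'"

def make_landing_area(size):
    """Return a buffer pre-filled with the known pattern."""
    return bytearray([FILL] * size)

def find_damage(area):
    """Return (offset, value) pairs for every byte that no longer holds the pattern."""
    return [(i, b) for i, b in enumerate(area) if b != FILL]

# Simulated overrun (an assumption for this sketch): a stray 8-byte
# little-endian write of the value 1, similar in shape to the 0x...001
# that showed up in the saved rbx slot.
area = make_landing_area(64)
area[24:32] = (1).to_bytes(8, "little")

damage = find_damage(area)
print(damage)
```

The offsets and values that come back are the clue: the size and content of the damaged run hint at what kind of write landed there. As noted in the chat, this only helps if the damage falls in stack you control.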
[2:51] <elder> That's an idea. Let me see what I find this time, I've got it stopped at the entry to _rbd_dev_v2_snap_features() and am looking at what's been saved.
[2:53] <dmick> how many other things does it save after rbx before the call? (actually I guess I can build this)
[2:54] <dmick> got a .config handy I can steal?
[2:54] <elder> 1590: 55 push %rbp
[2:54] <elder> 1591: 48 89 e5 mov %rsp,%rbp
[2:54] <elder> 1594: 48 83 ec 40 sub $0x40,%rsp
[2:54] <elder> 1598: 48 89 5d e0 mov %rbx,-0x20(%rbp)
[2:54] <elder> 159c: 4c 89 65 e8 mov %r12,-0x18(%rbp)
[2:54] <elder> 15a0: 4c 89 6d f0 mov %r13,-0x10(%rbp)
[2:54] <elder> 15a4: 4c 89 75 f8 mov %r14,-0x8(%rbp)
[2:54] <elder> 15a8: e8 00 00 00 00 callq 15ad <_rbd_dev_v2_snap_features+0x1d>
[2:54] <elder> I'll e-mail it to you.
[2:54] <dmick> so, a few regs, but they may not be as important or may be ok
[2:54] <elder> Actually, I'm just using the config that autobuild uses...
[2:55] <elder> So I think that's in autobuild-ceph file "kernel-config"
[2:55] <dmick> ok
[3:08] * stan_theman (~stan_them@ has joined #ceph
[3:08] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[3:09] <iggy> have you tried in kvm? that at least shortens your "it's locked up again, hit reset" down time
[3:11] <iggy> and you should be able to use gdb with kvm
[3:15] <dmick> iggy: the problem is we don't have a direct serial connection here, it's over IPMI
[3:16] <dmick> it's possible that can be plumbed, but I suspect we just haven't tried it. I intend to.
[3:16] <iggy> I think one of us misunderstands
[3:16] <iggy> because that confused me
[3:16] <dmick> the target-under-debug is remote, accessed only with network and IPMI
[3:16] <iggy> I meant try it running inside a kvm VM
[3:17] <dmick> oh oh
[3:17] <dmick> that's me being confused
[3:17] <dmick> I saw kvm and read kdb
[3:17] <dmick> sry
[3:17] <dmick> hm. kvm with gdb. that's an interesting idea
[3:18] <iggy> I know you said it didn't reproduce under uml, but kvm should work well enough under gdb
[3:18] <dmick> yeah
[3:18] <iggy> assuming it reproduces there
[3:19] <dmick> http://gymnasmata.wordpress.com/2010/12/02/setting-up-gdb-to-work-with-qemu-kvm-via-libvirt/
[3:19] <dmick> is worth a try
[3:21] <elder> dmick, refresh my memory here. Do you suppose there's any reason to hang onto what the remote end of a connection said was its own incompatible set of features?
[3:22] <elder> I think maybe no, once we've looked at it and decided we had a pairwise set of compatible features.
[3:22] <dmick> I'm sorry, my brain simply does not hold those semantics at all
[3:22] <elder> That's OK.
[3:22] <dmick> I think I've decided discussions of feature matching are just futile and so I can't store them anymore
[3:22] <gregaf> elder: dunno if it applies to the client yet, but in our userspace stuff we often(/sometimes) make decisions about encoding messages based on the other side's feature set
[3:22] <elder> I'm not going to fix it the right way until tomorrow, when my fresh brain may be able to answer the question for me.
[3:23] <gregaf> so in the abstract, yes, it can be useful
[3:23] <elder> Well, that opens another can of worms, gregaf... What exactly do the "features" represent--and what is their scope?
[3:23] <gregaf> whatever we defined the feature bit to mean ;)
[3:23] <dmick> my guess: "incompatible features" is that set that you must agree to use (and thus implies that you implement), or you can't use this image
[3:24] <elder> The connection, the protocol, client functionality, server functionality, image features, etc.
[3:24] <dmick> i.e. I would call them "required features"
[3:24] <gregaf> elder: oh, you're referring to rbd features?
[3:24] <dmick> (at least within the scope of rbd)
[3:24] <elder> Right.
[3:24] <elder> gregaf, yes.
[3:24] <gregaf> I thought you meant messenger features
[3:24] <gregaf> still, same applies: in the abstract it can be useful
[3:24] <gregaf> right now you certainly want to know if it's an old or new-format image, right
[3:25] <dmick> the only useful feature bit right now is 1
[3:25] <dmick> and it is a requirement that you access the image differently if that bit is set
[3:25] <elder> It's the loneliest feature.
[3:25] <dmick> so, yes
[3:25] <gregaf> but in the future instead of being a binary new-format flag it could be "uses an allocation bitmap" or "does not use watch-notify on the header" etc, and these could be set on a per-image level
[3:25] <elder> Yes exactly.
[3:26] <dmick> and the point being, some of them may be things that you, as client to the image, could *choose* to implement/use or just ignore (perhaps because they're unknown, perhaps because you don't feel like it)
[3:26] <gregaf> in which case you certainly want the full feature set available to query at any point
[3:26] <dmick> but some, you must know
[3:26] <gregaf> anyway, gotta run
[3:26] <dmick> so I think "incompatible" might be discardable, because that's a "oops, gotta fail now" if it mismatches
[3:26] <dmick> whereas yes, you might want to save the full features set
[3:27] <dmick> so I think we're saying the same thing
[3:27] <dmick> now if only that's the reality that Josh believes as well :)
[3:27] <gregaf> incompatible could include features that are set on a per-image basis and that you need to know the value for
[3:27] <dmick> right. the feature masks in question are both per-image
[3:27] <gregaf> (example: uses an allocation bitmap)
[3:27] <dmick> sets, not masks.
[3:28] <gregaf> so that's not discardable then
[3:30] <dmick> well, we should ask joshd. but my impression is that incompatible won't include optional features
[3:30] <dmick> that is in fact its definition.
[3:30] <dmick> but I probably misunderstand.
[3:30] <elder> I think we can determine which features will be used at the time a connection between the client and server is established.
[3:31] <elder> Thereafter we don't really need to keep track of what the server supplied.
[3:31] <dmick> so, possibly both masks are irrelevant after connection, true
[3:31] <elder> That's what I mean, yes.
[3:31] <dmick> but of the two I'd think that incompatible is the less-useful to keep around
[3:31] <dmick> one can imagine wanting to turn on or turn off, say, caching of some sort
[3:32] <elder> Yes. Once we've determined we can operate compatibly we don't need to know what won't work.
[3:32] <dmick> one-sidedly, from the kernel client
[3:32] <elder> Maybe.
[3:32] <dmick> after connection
[3:32] <elder> It's not a big deal really, but it's a good discussion though.
[3:32] <dmick> I mean it's easy to change in the future when this gets real
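The "incompatible means required" reading from the discussion above can be sketched roughly as follows. This is an illustration under stated assumptions, not the rbd implementation: `FEATURE_BIT_ONE`, `CLIENT_SUPPORTED`, and `check_features` are hypothetical names, and the only thing taken from the chat is that bit value 1 is the one meaningful bit today.

```python
# Hypothetical name for the one bit the discussion calls meaningful today.
FEATURE_BIT_ONE = 1 << 0

# The set of features this (imaginary) client implements.
CLIENT_SUPPORTED = FEATURE_BIT_ONE

def check_features(features, incompatible, supported=CLIENT_SUPPORTED):
    """Fail if the image requires a feature the client lacks.

    Returns the subset of the image's features the client will actually use.
    """
    missing = incompatible & ~supported
    if missing:
        # A mismatch in the required set is the "oops, gotta fail now" case.
        raise ValueError(f"unsupported required features: {missing:#x}")
    # Optional features the client does not know about are simply ignored.
    return features & supported
```

Under this reading the incompatible set is only consulted once, at connection time, while the full feature set may still be worth keeping around to query later.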
[3:33] <elder> Building.
[3:34] <elder> This helped a lot, dmick. I wasn't even thinking about the possibility the other end could be responsible for this overrun.
[3:34] <elder> (Of course, I have to prove the problem goes away, but it makes good sense.)
[3:36] <dmick> well you got there on your own, but, yw :)
[3:36] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[3:37] <dmick> as I'm sure I told you, a past coworker called that "being his dog". just sit there, listen, and bark once in a while
[3:37] <elder> Chance Gardener?
[3:40] <dmick> heh
[3:41] <elder> Kind of an oblique reference.
[3:41] * adjohn (~adjohn@ Quit (Quit: adjohn)
[3:42] <dmick> yeah. and I've never actually seen the whole movie. which I should.
[3:43] <elder> I had to study the book in college. As I recall the movie was good but lost some subtlety.
[3:44] <elder> And of course, when you study a book in English Literature class you really look deeply into things that aren't even necessarily there.
[3:58] <elder> I'm so sure your brilliant insight was right, dmick that I'm just going to do a big bang test, not going to be careful at all.
[3:59] <dmick> lol. "my insight". godspeed regardless
[4:00] <joao> sagewk, skimmed through wip-mon patches, but will take a closer look tomorrow morning :)
[4:01] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[4:06] <elder> dmick, you may be pleased to learn that my test is now running just fine.
[4:06] <dmick> sweet!
[4:06] <elder> I'm going to step away from the computer now.
[4:06] <dmick> well done.
[4:07] <elder> Thank you, you've helped me put an end to this somewhat nightmarish day.
[4:07] <elder> A nice pleasant end to the nightmare I guess.
[4:07] <dmick> it's nice to nail one home
[4:07] <dmick> at the end of the day
[4:07] <elder> Or week. But day is good enough.
[4:07] <elder> Talk to you tomorrow.
[4:07] <dmick> nite
[4:44] * amatter_ (~amatter@ Quit (Ping timeout: 480 seconds)
[4:49] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:50] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[4:51] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[4:51] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:55] * houkouonchi-work (~linux@ Quit (Ping timeout: 480 seconds)
[5:00] * sagelap (~sage@ has joined #ceph
[5:03] * sagelap1 (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:04] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[5:21] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[5:33] * sagelap (~sage@110.sub-166-250-66.myvzw.com) has joined #ceph
[5:53] * pentabular (~sean@adsl-71-141-232-252.dsl.snfc21.pacbell.net) Quit (Ping timeout: 480 seconds)
[5:59] * sagelap (~sage@110.sub-166-250-66.myvzw.com) Quit (Ping timeout: 480 seconds)
[6:01] * maelfius (~mdrnstm@ Quit (Quit: Leaving.)
[6:24] * dmick (~dmick@2607:f298:a:607:d01c:d23e:5613:207e) has left #ceph
[6:50] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[7:03] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[7:04] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:15] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[7:41] * sagelap (~sage@241.sub-166-250-35.myvzw.com) has joined #ceph
[7:42] * tnt (~tnt@19.110-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:48] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[7:48] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:54] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[7:54] * sagelap (~sage@241.sub-166-250-35.myvzw.com) Quit (Ping timeout: 480 seconds)
[8:01] * tnt (~tnt@19.110-67-87.adsl-dyn.isp.belgacom.be) Quit (Read error: Operation timed out)
[8:04] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[8:05] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[8:05] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit ()
[8:09] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:09] * loicd (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) has joined #ceph
[8:10] * Cube (~Adium@ Quit (Quit: Leaving.)
[8:13] * sagelap (~sage@50-79-43-161-static.hfc.comcastbusiness.net) has joined #ceph
[8:15] * loicd (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) Quit (Quit: Leaving.)
[8:27] * Ryan_Lane1 (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[8:27] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[8:27] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[8:42] * tnt (~tnt@ptra-178-50-65-171.mobistar.be) has joined #ceph
[8:47] * tnt_ (~tnt@ptra-178-50-65-171.mobistar.be) has joined #ceph
[8:49] * Ryan_Lane1 (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:50] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[8:51] * loicd (~loic@ has joined #ceph
[8:51] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[8:52] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit ()
[8:52] * tnt (~tnt@ptra-178-50-65-171.mobistar.be) Quit (Ping timeout: 480 seconds)
[8:58] * stevesun (~steve@ has joined #ceph
[8:58] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[8:59] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[9:00] * tnt (~tnt@ptra-178-50-73-66.mobistar.be) has joined #ceph
[9:01] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[9:02] * eternaleye (~eternaley@tchaikovsky.exherbo.org) Quit (Quit: eternaleye)
[9:02] * eternaleye (~eternaley@tchaikovsky.exherbo.org) has joined #ceph
[9:03] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:04] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:04] * stevesun (~steve@ Quit ()
[9:06] * tnt_ (~tnt@ptra-178-50-65-171.mobistar.be) Quit (Ping timeout: 480 seconds)
[9:08] * tnt (~tnt@ptra-178-50-73-66.mobistar.be) Quit (Ping timeout: 480 seconds)
[9:10] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[9:10] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:11] * tnt (~tnt@ptra-178-50-86-244.mobistar.be) has joined #ceph
[9:12] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[9:12] * sagelap (~sage@50-79-43-161-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[9:12] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:14] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:16] * EmilienM (~EmilienM@ADijon-654-1-107-27.w90-39.abo.wanadoo.fr) has joined #ceph
[9:21] * tnt (~tnt@ptra-178-50-86-244.mobistar.be) Quit (Ping timeout: 480 seconds)
[9:22] * loicd (~loic@ has joined #ceph
[9:36] * Leseb (~Leseb@ has joined #ceph
[9:40] * rosco (~r.nap@ has joined #ceph
[9:58] * gregaf1 (~Adium@2607:f298:a:607:50bd:a787:da4:de25) has joined #ceph
[10:01] * mrjack_ (mrjack@office.smart-weblications.net) Quit (Ping timeout: 480 seconds)
[10:03] * gregaf (~Adium@2607:f298:a:607:9c5:b443:1e33:236) Quit (Ping timeout: 480 seconds)
[10:08] * BManojlovic (~steki@ has joined #ceph
[10:08] * MikeMcClurg (~mike@cpc18-cmbg15-2-0-cust437.5-4.cable.virginmedia.com) has joined #ceph
[10:22] * tnt (~tnt@ptra-178-50-66-137.mobistar.be) has joined #ceph
[10:23] * MikeMcClurg (~mike@cpc18-cmbg15-2-0-cust437.5-4.cable.virginmedia.com) Quit (Quit: Leaving.)
[10:30] * SteveCoo1ing (~cooling@ has left #ceph
[10:31] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[10:33] * tnt_ (~tnt@212-166-48-236.win.be) has joined #ceph
[10:35] * tnt (~tnt@ptra-178-50-66-137.mobistar.be) Quit (Ping timeout: 480 seconds)
[11:35] * lofejndif (~lsqavnbok@04ZAAFA3F.tor-irc.dnsbl.oftc.net) has joined #ceph
[11:46] * yoshi (~yoshi@p28146-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:26] * MikeMcClurg (~mike@ has joined #ceph
[12:26] * loicd (~loic@ Quit (Read error: Connection reset by peer)
[12:26] * loicd1 (~loic@ has joined #ceph
[12:33] * loicd1 (~loic@ Quit (Quit: Leaving.)
[12:34] * loicd (~loic@ has joined #ceph
[13:19] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[13:27] * dilemma (~dilemma@2607:fad0:32:a02:21b:21ff:feb7:82c2) Quit (Quit: Leaving)
[13:59] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[14:24] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[15:27] * loicd1 (~loic@ has joined #ceph
[15:28] * loicd (~loic@ Quit (Read error: Operation timed out)
[15:30] * mrjack_ (mrjack@office.smart-weblications.net) has joined #ceph
[15:43] * mgalkiewicz (~mgalkiewi@staticline-31-182-227-151.toya.net.pl) has joined #ceph
[15:47] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) Quit (Quit: Ex-Chat)
[15:47] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) has joined #ceph
[15:47] * lofejndif (~lsqavnbok@04ZAAFA3F.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[15:50] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) has joined #ceph
[16:14] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[17:03] * lofejndif (~lsqavnbok@19NAACESH.tor-irc.dnsbl.oftc.net) has joined #ceph
[17:10] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:23] * mgalkiewicz (~mgalkiewi@staticline-31-182-227-151.toya.net.pl) Quit (Quit: Ex-Chat)
[17:23] * sagelap (~sage@149.sub-166-250-34.myvzw.com) has joined #ceph
[17:33] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) Quit (Quit: Ex-Chat)
[17:43] * tnt_ (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:44] * loicd1 is now known as loicd
[17:46] * gregorg (~Greg@ has joined #ceph
[17:48] * Tv_ (~tv@2607:f298:a:607:51e0:e578:bd15:6681) has joined #ceph
[17:51] * sagelap (~sage@149.sub-166-250-34.myvzw.com) Quit (Quit: Leaving.)
[17:53] * sagelap (~sage@2600:1010:b00d:ebff:b995:d96f:c03c:8ed2) has joined #ceph
[17:54] <sagelap> joao, gregaf: (untested) first stab at ordering paxos commits in wip-mon-gv.. want to take a look and see if the strategy seems sane?
[17:54] * adjohn (~adjohn@ has joined #ceph
[18:03] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[18:04] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[18:04] * sagelap (~sage@2600:1010:b00d:ebff:b995:d96f:c03c:8ed2) Quit (Ping timeout: 480 seconds)
[18:08] * aliguori (~anthony@ has joined #ceph
[18:13] * loicd (~loic@ Quit (Quit: Leaving.)
[18:14] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:15] * rlr219 (43c87e04@ircip2.mibbit.com) has joined #ceph
[18:21] * adjohn (~adjohn@ Quit (Quit: adjohn)
[18:21] * sjusthm (~sam@66-214-139-112.dhcp.gldl.ca.charter.com) has joined #ceph
[18:31] <rlr219> Hi All. I am wondering if there is a recommended way to have a ceph-osd start automatically on reboot of the osd server? I was thinking of adding a line to crontab, but i notice that when a process is started from the primary, the ceph.conf file is copied to /tmp and a number is added to the end of the file name.
[18:31] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:35] <sjusthm> there might be a way to do it with upstart
[18:38] * loicd (~loic@magenta.dachary.org) has joined #ceph
[18:39] * sagelap (~sage@39.sub-166-250-33.myvzw.com) has joined #ceph
[18:40] <rlr219> does anyone know what the reason is for copying the config file to /tmp? if it is to ensure that the config file does not change while the osd is running, then I could just write a script to do the copy and then have it run the process. Just want to understand the reasoning..
[18:41] <sjusthm> where are you seeing that?
[18:43] <rlr219> first let me say i am using Ubuntu Precise. If I do a "ps aux |grep ceph" you see the process runs as "/usr/bin/ceph-osd -i 7 --pid-file /var/run/ceph/osd.7.pid -c /tmp/ceph.conf.3132"
[18:43] <rlr219> and this OSD was started via the service ceph -a start osd command
[18:44] <rlr219> from my primary monitor server
[18:46] <sjusthm> hmm, not sure why the file is in tmp
[18:48] <Tv_> rlr219: is that an --mkfs call by any chance? give ps more w's to get wider output
[18:48] <Tv_> sjusthm: wait you said upstart? that command line has nothing to do with the upstart part
[18:49] <sjusthm> he started it via service ceph -a
[18:49] <Tv_> ok so sysvinit
[18:49] <sjusthm> whoops
[18:49] <Tv_> rlr219: just confirming, is this during mkcephfs?
[18:50] <rlr219> that was how I created the cluster, yes.
[18:50] * sagelap (~sage@39.sub-166-250-33.myvzw.com) Quit (Ping timeout: 480 seconds)
[18:51] <Tv_> rlr219: did the mkcephfs complete already?
[18:51] <rlr219> yes
[18:51] <rlr219> the cluster runs fine
[18:52] <Tv_> # conf file
[18:52] <rlr219> however, when the server is rebooted, the osd process has to be manually started again. I just want to make the process start on reboot.
[18:52] <Tv_> if [ "$host" = "$hostname" ]; then
[18:52] <Tv_> cur_conf=$conf
[18:52] <Tv_> else
[18:52] <Tv_> if echo $pushed_to | grep -v -q " $host "; then
[18:52] <Tv_> scp -q $conf $host:/tmp/ceph.conf.$$
[18:52] <Tv_> pushed_to="$pushed_to $host "
[18:52] <Tv_> fi
[18:52] <Tv_> cur_conf="/tmp/ceph.conf.$$"
[18:52] <Tv_> fi
[18:53] <Tv_> cmd="$cmd -c $cur_conf"
[18:53] <Tv_> rlr219: ohh
[18:53] <Tv_> rlr219: can you pastebin your ceph.conf?
[18:53] <Tv_> i wonder what that bit of shell is *trying* to do
[18:53] <Tv_> it seems that when launching daemons over ssh, it really does put ceph.conf into /tmp
[18:54] <Tv_> rlr219: i think you're a simple ceph.conf adjustment away from having it work how you want it, i've seen a similar misunderstanding before
[18:54] <Tv_> but the above code needs to die in a fire ;) (sorry Sage ;)
[18:55] <rlr219> maybe that is to ensure the "master conf" file is being used.
[18:55] <rlr219> but if I have already ssh'ed it to all my servers, then i can just have the proc start using the conf in /etc/ceph
[18:56] <Tv_> rlr219: can you just share your ceph.conf? i have a good guess for the reason, and need to validate it
[18:56] <Tv_> rlr219: if you're uncomfortable with sharing the whole config, we can talk about individual fields but that's slower
[18:57] <rlr219> Give me a few to post it.
[19:02] * houkouonchi-work (~linux@ has joined #ceph
[19:07] <rlr219> ok. its up at http://pastebin.com/HV2GQ5HG it is very basic.
[19:09] * chutzpah (~chutz@ has joined #ceph
[19:10] <Tv_> rlr219: so if you "ssh osd7" and run "hostname", what does it output?
[19:11] <rlr219> just for security, I removed server names and IP addresses. in my actual config, I have FQDN in the config.
[19:11] <Tv_> rlr219: there's your problem
[19:11] <Tv_> the host= field is matched against the short hostname
[19:13] <rlr219> ok. so if I change the config to use the short name, the osd will start at reboot??
[19:13] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[19:13] <Tv_> yes
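Why an FQDN in host= fails to match can be sketched as below. This is a simplified model, assuming the init script derives the local name the way `hostname -s` would (everything before the first dot) and then does an exact string comparison like the `[ "$host" = "$hostname" ]` test pasted earlier. The host names are made up.

```python
def short_hostname(name):
    """Return the part of a hostname before the first dot."""
    return name.split(".", 1)[0]

def host_matches(conf_host, local_hostname):
    """Exact string comparison of the conf host= value with the local short hostname."""
    return conf_host == short_hostname(local_hostname)

# With the short name in ceph.conf the daemon is treated as local...
print(host_matches("osd7", "osd7.example.com"))
# ...but an FQDN in host= never equals the short name, so it is treated as remote.
print(host_matches("osd7.example.com", "osd7.example.com"))
```

In this model a daemon whose host= never matches is not considered local to the machine, which would explain both the conf copy in /tmp (the remote branch of the script) and the OSD not being started at boot.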
[19:14] <rlr219> is that an undocumented feature or did I just overlook it in the set-up docs? ;-)
[19:15] <stan_theman> it seems that mount -t ceph will complain about modprobing for ceph even if it's compiled into the kernel. still mounts fine since it's capable of doing it, but the message was concerning!
[19:15] <rlr219> is that for the monitors and mds's as well?
[19:16] <Tv_> rlr219: it's perhaps too poorly documented.. http://tracker.newdream.net/issues/3098
[19:16] <Tv_> stan_theman: sadly, there's no way to really know, so we run modprobe just in case
[19:17] <stan_theman> yeah, i figured after the fact when i saw the mount seemingly working...maybe zgrep /proc/config beforehand?
[19:17] <stan_theman> config.gz
[19:17] <Tv_> stan_theman: and even if we made ceph.ko presence somehow easier to detect, that's a chicken-and-egg problem of at what version of the userspace can we assume that code went in
[19:17] <Tv_> stan_theman: /proc/config is an optional thing afaik
[19:18] <stan_theman> yeah just remembered
[19:18] <Tv_> stan_theman: /proc/filesystems maybe
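A rough sketch of the /proc/filesystems idea, assuming the usual layout of that file (an optional "nodev" column followed by the filesystem type). `filesystem_registered` is a hypothetical helper, not part of mount.ceph, and the sample text is invented for the example.

```python
def filesystem_registered(name, proc_filesystems_text):
    """Return True if `name` appears as a filesystem type in /proc/filesystems content."""
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # Each line is either "nodev <type>" or just "<type>"; the type is the last field.
        if fields and fields[-1] == name:
            return True
    return False

# Made-up sample content; on a real system one would read /proc/filesystems instead.
SAMPLE = "nodev\tsysfs\nnodev\tproc\n\text4\nnodev\tceph\n"
print(filesystem_registered("ceph", SAMPLE))
```

This only tells you whether the filesystem is registered right now (built in, or its module already loaded). It cannot tell you whether a loadable module exists, so a modprobe attempt may still be needed when the check fails.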
[19:18] <Tv_> stan_theman: but really, it comes down to -- we're not currently actively improving the ceph dfs
[19:18] <stan_theman> dfs?
[19:18] <Tv_> distributed file system
[19:19] <stan_theman> sorry, been reading "cephfs" so long
[19:19] <Tv_> effort is on rados, rbd, radosgw, deployment right now
[19:19] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) Quit (Quit: Leaving.)
[19:19] <stan_theman> is it on hold? i knew going in that rados was the loved one in the meantime
[19:19] <Tv_> stan_theman: http://ceph.com/docs/master/faq/#is-ceph-production-quality
[19:19] * adjohn (~adjohn@ has joined #ceph
[19:19] <stan_theman> heh, i saw
[19:20] <Tv_> it's not on hold as much as after other things in the queue
[19:20] * adjohn (~adjohn@ Quit ()
[19:21] * adjohn (~adjohn@ has joined #ceph
[19:22] <stan_theman> that's reasonable, just making sure my ducks are in a row!
[19:23] <Tv_> your ducks are clustered in a distributed design that enables parallelism and avoids SPOFs
[19:23] <Tv_> (some ducks may have been CRUSHed)
[19:23] * MikeMcClurg (~mike@ Quit (Quit: Leaving.)
[19:23] <stan_theman> hahaha
[19:35] <rlr219> Thanks Tv_
[19:35] <rlr219> that fixed my issue!
[19:43] * rlr219 (43c87e04@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[19:56] * BManojlovic (~steki@ has joined #ceph
[19:57] * MooingLemur (~troy@ipv4.pinchaser.com) has joined #ceph
[20:00] * jeffp (~jplaisanc@net66-219-41-161.static-customer.corenap.com) has joined #ceph
[20:03] * maelfius (~mdrnstm@ has joined #ceph
[20:04] * Ryan_Lane (~Adium@ has joined #ceph
[20:04] * Ryan_Lane (~Adium@ Quit (Remote host closed the connection)
[20:04] * Ryan_Lane (~Adium@ has joined #ceph
[20:07] <MooingLemur> I've been perusing the ceph documentation, but I haven't found answers to a few questions. I'm probably missing some basic knowledge, so apologies. I assume the backend storage is meant to be JBOD and not need raid due to the replication. True? I don't really see how one would set up multiple disks/mount points for an osd. Does one run an osd node per spindle or something? How are disk failures recovered from? Does it just see an empty tree ...
[20:07] <MooingLemur> ... and repopulate when you rejoin that osd to the cluster?
[20:08] * lofejndif (~lsqavnbok@19NAACESH.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[20:10] <sjusthm> MooingLemur: yeah, usually you would run one osd per disk
[20:10] <sjusthm> MooingLemur: disk failures are handled by failing the osd, normal osd-osd replication takes over from there
[20:16] * dmick (~dmick@2607:f298:a:607:38ee:3e60:f15d:1996) has joined #ceph
[20:19] <MooingLemur> gotcha :)
[20:23] <elder> joshd, how does layering avoid reading stale data on a read that redirects to a parent image?
[20:24] <dmick> elder: AFAIK it's like any other read; it's just that it has to first look to find "no child object", and then "read parent object"
[20:24] <dmick> do you mean how do we handle the race against a new child write?
[20:24] <elder> The way I understand it:
[20:25] <elder> 1) see if object exists; if so, read its data to satisfy the read
[20:25] <elder> 2) otherwise, read parent image for overlapping object's data that was found missing in step (1)
[20:26] <elder> But between the existence check in (1) and satisfying the read in (2) somebody else could have instantiated a read--and even a subsequent write--to the child's object.
[20:26] <dmick> right, but reads are no problem, right?
[20:27] <dmick> racing against the write is an issue. Looking. and now joshd is back
[20:27] <gregaf1> so that's multiple people mounting the block device at once, and then an in-progress read may or may not reflect the effect of a write which was started afterwards
[20:27] <gregaf1> which is perfectly legal, right?
[20:28] <elder> That sounds familiar. I think that's right gregaf1.
[20:28] <dmick> yeah; and each object operation is atomic, so no reading a partial write
[20:28] <elder> I'll have to think about it a bit though... But yes, it's like direct I/O--you are responsible for the trouble that concurrent reads and writes cause.
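The two-step read elder outlines above can be sketched roughly as follows. This is a minimal illustration and not librbd's code: the dicts standing in for the child and parent images and the name `layered_read` are hypothetical.

```python
from typing import Dict, Optional


def layered_read(child: Dict[str, bytes],
                 parent: Dict[str, bytes],
                 name: str) -> Optional[bytes]:
    """Read one object from a cloned image, falling back to its parent."""
    # Step 1: if the child has its own copy of the object, use it.
    data = child.get(name)
    if data is not None:
        return data
    # Step 2: otherwise read the overlapping object from the parent image.
    return parent.get(name)


# Hypothetical images: the child has only overwritten obj.1 so far.
parent_image = {"obj.0": b"parent-0", "obj.1": b"parent-1"}
child_image = {"obj.1": b"child-1"}

before = layered_read(child_image, parent_image, "obj.0")  # served by parent
child_image["obj.0"] = b"child-0"                          # a later child write
after = layered_read(child_image, parent_image, "obj.0")   # served by child
```

A read that already fell through to the parent returns the parent's data, and a read issued after the child write returns the child's data. Either result is acceptable when the read and write are concurrent, which is the direct-I/O-like behaviour the discussion settles on.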
[20:31] <MooingLemur> does anyone have thoughts on using ZFS (ZFSonLinux) as the backing fs instead of btrfs?
[20:32] <dmick> MooingLemur: you should try it and tell us how it works :)
[20:32] * pentabular (~sean@adsl-70-231-131-129.dsl.snfc21.sbcglobal.net) has joined #ceph
[20:33] <dmick> there's lots of interest.
[20:33] * pentabular is now known as Guest6028
[20:34] <MooingLemur> I think I will :)
[20:34] <MooingLemur> I'm pretty happy with ZoL on a few of my personal machines
[20:36] <jmlowe> I've been curious about using ZoL myself
[20:37] <mikeryan> the FreeBSD port is rock solid, which says absolutely nothing about ZoL :P
[20:37] <MooingLemur> Just started a new job this week, and on day one, they've given me a dozen opteron 246 (very old) machines with 24 1TB drives to do a test ceph instance. I suppose the CPU performance problems would be amplified once I put load on them.
[20:37] <MooingLemur> so it might be easy to compare filesystem performance if I put ZoL on some of them
[20:38] * Guest6028 (~sean@adsl-70-231-131-129.dsl.snfc21.sbcglobal.net) Quit (Quit: Guest6028)
[20:41] * maelfius (~mdrnstm@ Quit (Ping timeout: 480 seconds)
[20:43] <gregaf1> well, I recommend that in this case you use RAID then
[20:43] <gregaf1> 24 OSDs on a 2GHz Socket 940 Opteron aren't going to work out too well
[20:44] <MooingLemur> a handful of raid0s?
[20:46] <jmlowe> careful about raid < 1, my raid controllers unhelpfully masked errors in an attempt to keep data available which didn't let osd failover happen
[20:47] <MooingLemur> well, I guess I'll be potentially using zfs for that rather than md or hardware raid
[20:47] <MooingLemur> so checksum fails should result in io errors
[20:48] <MooingLemur> zfs on some, btrfs on others
[20:59] * maelfius (~mdrnstm@ has joined #ceph
[21:16] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) has joined #ceph
[21:22] <stan_theman> I added tons of files to cephfs, took a while but was fine. when I tried to remove them, the mds crashed (mutex lock problem). should I be looking to tweak something or venturing into multiple mdses?
[21:22] * sagelap (~sage@2600:1010:b004:593:60e2:3e61:90b5:6256) has joined #ceph
[21:22] <sagelap> sjust: /a/sage-i2 has a job hung on wip-3072. seems to be 100% reproducible
[21:23] <joshd> stan_theman: multiple active mdses will be less stable, but it'd be great if you could file a bug report with the backtrace and the mds log
[21:24] <stan_theman> yeah, just wanted to make sure i was doing it right and that it wasn't a known issue or something
[21:24] <sjusthm> sagelap: looking
[21:36] * sagelap1 (~sage@207.sub-166-250-37.myvzw.com) has joined #ceph
[21:38] * sagelap (~sage@2600:1010:b004:593:60e2:3e61:90b5:6256) Quit (Ping timeout: 480 seconds)
[21:39] * lofejndif (~lsqavnbok@04ZAAFBC8.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:45] * adjohn (~adjohn@ Quit (Quit: adjohn)
[21:46] * pentabular (~sean@adsl-70-231-131-129.dsl.snfc21.sbcglobal.net) has joined #ceph
[21:46] * adjohn (~adjohn@ has joined #ceph
[21:47] * pentabular is now known as Guest6035
[21:49] * Guest6035 (~sean@adsl-70-231-131-129.dsl.snfc21.sbcglobal.net) Quit ()
[21:58] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[22:02] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Read error: Connection reset by peer)
[22:03] * adjohn (~adjohn@ Quit (Quit: adjohn)
[22:03] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[22:04] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Remote host closed the connection)
[22:04] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[22:04] <elder> I'm back.
[22:04] <elder> Damned machine froze again.
[22:06] <Tv_> MooingLemur: zfs on a slow cpu might not be an amazing idea either
[22:07] <Tv_> stan_theman: you could run more mds, as long as all but one are in standby only, then a crash will make another one go active
[22:09] * glowell1 (~Adium@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[22:12] <elder> teuthology lock server not responding?
[22:12] <joshd> connection to the database seems to be down
[22:13] * glowell (~Adium@c-98-210-226-131.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[22:16] <dmick> teuthology-the-machine is having issues getting out on the net
[22:20] <dmick> apparently wider than just us. neteng working on it
[22:21] * sagelap (~sage@ has joined #ceph
[22:23] * sagelap1 (~sage@207.sub-166-250-37.myvzw.com) Quit (Ping timeout: 480 seconds)
[22:30] * nhmlap (~nhm@ has joined #ceph
[22:31] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) Quit (Quit: Leaving)
[22:38] <nhmlap> mikeryan: how goes the rados bench testing?
[22:41] * dpemmons (~dpemmons@ Quit (Remote host closed the connection)
[22:44] <mikeryan> nhmlap: i've got a mini spreadsheet showing my results
[22:45] <mikeryan> more benchers == more throughput
[22:45] <mikeryan> more OSDs == more throughput
[22:45] <mikeryan> which points to a messaging bottleneck
[22:45] <sjusthm> maybe!
[22:45] <mikeryan> msbench is misbehaving though
[22:45] <nhmlap> mikeryan: I see more osds=more throughput, but only until we are limited by rados bench.
[22:46] <mikeryan> gimme a sec and i'll put this on gdocs
[22:46] <gregaf1> how fast can you get a single one to go mikeryan?
[22:46] <nhmlap> mikeryan: https://docs.google.com/a/inktank.com/spreadsheet/ccc?key=0AnmmfpoQ1_94dDlmTHhvM19zd19tb05zbFVqZ2xSYXc
[22:46] <gregaf1> I remember that the RAM throughput could be pretty low on that node
[22:46] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:47] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[22:47] <mikeryan> foo, google docs doesn't like gnumeric
[22:47] <nhmlap> mikeryan: csv to gdocs should work.
[22:51] <mikeryan> https://docs.google.com/a/inktank.com/spreadsheet/ccc?key=0AtUcJC0iaUQ9dFJHb2pTelM0MW1LSEt3X3N3MUFQVUE
[22:55] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[22:55] <nhmlap> mikeryan: interesting that your 2 OSD performance is so low.
[22:56] <mikeryan> hm, yeah i haven't looked closely at that
[22:56] <mikeryan> to be honest my measurement strategy was pretty sloppy
[22:56] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:56] * adjohn (~adjohn@ has joined #ceph
[22:57] <nhmlap> mikeryan: with 2 benchers and 12 OSDs, I was getting about 974MB/s, and with 4 benchers I was getting around 1.1GB/s.
[22:58] * sagelap (~sage@50-196-130-246-static.hfc.comcastbusiness.net) has joined #ceph
[22:58] <mikeryan> i don't have the RAM to run 12 OSDs ;)
[22:59] <nhmlap> mikeryan: yes, but I find it interesting that you saw so little improvement with 4 OSDs vs 1.
[22:59] <nhmlap> mikeryan: ram should be doing much higher than 1GB/s.
[22:59] <mikeryan> the network's our bottleneck there
[22:59] <mikeryan> iperf gets slightly over 8 gbit/sec
[22:59] <gregaf1> nhmlap: it's a NUMA machine and it's just not that fast
[22:59] <nhmlap> mikeryan: ah, client is on another machine?
[23:00] <nhmlap> gregaf1: dual socket or quad socket?
[23:00] <mikeryan> nope, it just has pretty bad IO performance
[23:00] <gregaf1> quad
[23:00] <nhmlap> AMD?
[23:00] <mikeryan> 64 glorious cores
[23:00] <mikeryan> yep
[23:00] <mikeryan> 4 x 8core
[23:00] <mikeryan> er
[23:00] <mikeryan> 4 x 16 core
[23:00] <nhmlap> ah, yes. I mapped out that configuration at my last job.
[23:01] <mikeryan> i get better localhost throughput on the core2duo i'm sitting on right now
[23:02] <nhmlap> https://docs.google.com/drawings/d/1V5sFSInKq3uuKRbETx1LVOURyYQF_9Z4zElPrl1YIrw/edit
[23:03] * aliguori (~anthony@ Quit (Remote host closed the connection)
[23:03] <mikeryan> http://mysteryoftheinquity.files.wordpress.com/2011/04/pentagram-star-blue-logo.jpg
[23:03] <mikeryan> i'm ready to summon the dark lord of AMD
[23:03] <nhmlap> actually, interlagos is here: https://docs.google.com/drawings/d/1ZmdFsBUe_VXzac9-RTqoq5CB9q-j1EX8j6dIKIllOys/edit
[23:05] <nhmlap> Despite its slowness, unless there is a lot of nasty crosstalk over hypertransport, it should be able to do more than 1GB/s to remote memory in most cases.
[23:05] <mikeryan> iperf's topping out at 1 gb/sec
[23:05] <mikeryan> localhost-localhost sockets are bottlenecking us
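The iperf figure and the rados bench figures are quoted in different units, so a quick conversion helps when comparing them. The sketch below assumes decimal units (1 GB = 1000 MB, 8 bits per byte) and assumes the "1 gb/sec" above means roughly one gigabyte per second; the helper names are hypothetical.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8.0


def mbyte_to_gbit_per_s(mbyte_per_s: float) -> float:
    """Convert megabytes per second to gigabits per second."""
    return mbyte_per_s * 8.0 / 1000.0


# iperf over localhost: about 8 Gbit/s.
iperf_ceiling_gbyte = gbit_to_gbyte_per_s(8.0)

# rados bench with 2 benchers and 12 OSDs: about 974 MB/s.
bench_gbit = mbyte_to_gbit_per_s(974.0)

print(f"iperf ceiling: {iperf_ceiling_gbyte:.2f} GB/s")
print(f"rados bench:   {bench_gbit:.2f} Gbit/s")
```

Under those assumptions the 974 MB/s result sits just below the 8 Gbit/s that iperf measured, which is consistent with the localhost socket path being the limit.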
[23:06] <nhmlap> I think I measured that kind of setup at about 3-4GB/s to non-adjacent memory.
[23:06] * sagelap (~sage@50-196-130-246-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:06] <nhmlap> there used to be this fantastic tool called numademo that let you measure memory performance of every numa node to every other node, but I think it wasn't maintained and they removed it.
[23:07] <mikeryan> latest version (2.0.8-rc5) was released 8/23/12
[23:07] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[23:08] <elder> Github problems now?
[23:08] <mikeryan> nhmlap: i've got numademo compiles
[23:08] <mikeryan> compiled*
[23:08] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit ()
[23:09] <nhmlap> mikeryan: fantastic, last time I looked I couldn't find it.
[23:09] <mikeryan> ftp://oss.sgi.com/www/projects/libnuma/download/
[23:09] <mikeryan> project page is here: http://oss.sgi.com/projects/libnuma/
[23:09] <nhmlap> mikeryan: maybe I was just blind or high or something.
[23:10] <nhmlap> yeah, for some reason I thought it got removed.
[23:10] <nhmlap> anyway, numademo is really nice
[23:10] <mikeryan> i have no idea what any of these tests mean
[23:10] <nhmlap> it's doing a bunch of operations across nodes to remote memory.
[23:11] <nhmlap> I forget all of what it does, but there should be some straightforward tests in there.
[23:11] <mikeryan> running ./numademo 1g forward
[23:11] <mikeryan> i'm seeing numbers from 1800 mbytes/sec to 3600 mbytes/sec
[23:11] <nhmlap> hrm, 1.8MB/s is pretty slow.
[23:11] <nhmlap> er GB/s
[23:12] <mikeryan> could be because i'm only using 1g
[23:12] <nhmlap> maybe
[23:13] <nhmlap> I'd expect you should see quite a bit higher performance to the local node memory.
[23:13] <mikeryan> yes, 3600 is on local memory
[23:14] <mikeryan> 1800 appears to be the worst case of a remote node
[23:14] <mikeryan> seems to be tiered, 1800 2700 and 3600
[23:14] <mikeryan> depending on the node
[23:16] <nhmlap> yeah, that makes sense because some nodes are close with a 16bit+8bit hypertransport link, some are adjacent but not on the same socket which are 8bit HT, and some are not directly connected so have to go over an 8-bit link plus another hop.
[23:16] <nhmlap> s/same socket/different socket
[23:16] <nhmlap> er wht, I was right the first time.
[23:16] <mikeryan> i know what ya mean
[23:16] <nhmlap> blah
[23:16] <mikeryan> roughly
[23:17] <nhmlap> if you stare at that diagram long enough it makes sense. ;)
[23:17] <mikeryan> fascinating arch
[23:17] <nhmlap> mikeryan: the numbers are lower than I would expect though. I wonder if the memory is fully populated.
[23:17] * houkouonchi-work (~linux@ Quit (Remote host closed the connection)
[23:18] <nhmlap> I thought last time I tried numademo on a machine like that I got more like 3.2->6.4GB/s.
[23:18] <nhmlap> that was on a quad socket dell at TACC.
[23:18] <mikeryan> it's eminently possible that it's operator error
[23:20] <nhmlap> mikeryan: maybe. I'm trying to see if my TACC account still works.
[23:20] <nhmlap> wow, it's still active.
[23:21] <nhmlap> lol, there's just nothing in my home directory now. :)
[23:21] <nhmlap> aha, /old_home
[23:22] <nhmlap> I did -c 16k, -c 128k, -c 1024k, -c 8192k, and -c 65536k
[23:23] <nhmlap> I was able to do 12.1GB/s to local memory, 6.8GB/s to memory on the other node on the same socket, 5.2GB/s to adjacent non-socket memory, and 3.6GB/s to memory over two hops.
[23:24] <nhmlap> So that node is sucking for some reason.
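To put the gap in numbers, the sketch below compares the best and worst numademo figures quoted in this conversation. It only uses values from the log; the two runs used different arguments (`1g forward` versus several `-c` sizes), so the ratios are a rough indication and the variable names are hypothetical.

```python
# Figures quoted above, in MB/s for this node and GB/s for the TACC node.
this_node_mb_s = {"local": 3600, "worst_remote": 1800}
tacc_node_gb_s = {"local": 12.1, "worst_remote": 3.6}


def ratio(measured_mb_s: float, reference_gb_s: float) -> float:
    """Fraction of the reference bandwidth that the measured value reaches."""
    return (measured_mb_s / 1000.0) / reference_gb_s


local_ratio = ratio(this_node_mb_s["local"], tacc_node_gb_s["local"])
worst_ratio = ratio(this_node_mb_s["worst_remote"],
                    tacc_node_gb_s["worst_remote"])

print(f"local memory: {local_ratio:.0%} of the TACC figure")
print(f"worst remote: {worst_ratio:.0%} of the TACC figure")
```

On these numbers the node reaches roughly a third of the TACC local-memory bandwidth and about half of its two-hop bandwidth.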
[23:33] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) has joined #ceph
[23:38] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[23:39] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:58] * Guest6053 (~sean@adsl-70-231-131-129.dsl.snfc21.sbcglobal.net) has joined #ceph
[23:58] * Guest6053 is now known as pentabular

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.