#ceph IRC Log


IRC Log for 2011-04-06

Timestamps are in GMT/BST.

[0:09] <Tv> sagelap: whoops sorry we were busy talking it out
[0:10] <Tv> sagelap: basically, libradospp is broken and we think the current api is probably unworkable
[0:11] <Tv> sagelap: and we were looking at ways to avoid exposing bufferlists, to simplify it enough to make it doable
[0:11] <sagelap> oh wrt the libatomicops?
[0:11] <Tv> sagelap: and i had some detail question about bufferlists but i think it got answered already
[0:11] <Tv> sagelap: yeah it leaks atomic_t implementation, which is not guaranteed fixed, hence unusable for non-ceph.git use
[0:12] <gregaf> we could hack around it but there's still issues with who allocated the memory in the bufferlist that we don't think the current API deals with at all
[0:12] <sagelap> this is exactly why, btw, it's NO_ATOMIC_OPS instead of HAVE_ATOMIC_OPS.. the assumption is that the installed library was built with libatomic_ops.
[0:12] <sagelap> those who build themselves in a nonstandard way are out of luck currently.
[0:12] <Tv> sagelap: sadly, that won't work, afaics
[0:13] <josef> sagelap: 0.26 doesn't build
[0:13] <sagelap> oh.. allocators, right.
[0:13] <sagelap> josef: aie.. what's the error?
[0:13] <sagelap> we need to set up a fedora gitbuilder
[0:13] <Tv> sagelap: so we were thinking, maybe libradospp should take something more like struct iovec from writev(3), than bufferlists
[0:13] <josef> sagelap: http://fpaste.org/AEgF/
[0:14] <Tv> sagelap: simplify the problem
[0:14] <Tv> sagelap: but the original reason i asked for you got covered already, so we can come back to this next week
[0:14] <sagelap> tv: yeah, that could work. the main thing bufferlists get you is avoiding any copies. altho i guess the messenger stuff supports reading into an existing buffer now..
[0:14] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:15] <gregaf> sagelap: we figured we'd just take the memory they give us, convert to a bufferlist of static pointers, and go from there
[0:15] <sagelap> so just char * and len are probably okay too.
[0:15] <Tv> sagelap: the other alternative we talked was completely hiding bufferlist internals from the caller
[0:15] <gregaf> that should keep us at zero-copies in most cases and the caller just needs to free the memory after we return an ack or whatever on the op
[0:15] <Tv> sagelap: but that started sounding like we'd need a pointer indirection everywhere bufferlists are used in the api, to avoid exposing the size of the object
[0:15] <sagelap> gregaf: yeah.. as long as we know the references are all gone when the write returns. i'm not certain that's the case with the current msgr internals.
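
The iovec idea above (caller keeps ownership, the library only records pointers and lengths) could look roughly like the sketch below. The names `segment`, `wrap_iovec` and `total_length` are made up for illustration and are not part of librados; only `struct iovec` from `<sys/uio.h>` is real.

```cpp
#include <sys/uio.h>   // struct iovec (POSIX)
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view of one caller-supplied chunk.
struct segment {
  const char* data;
  std::size_t len;
};

// Record the caller's chunks without copying any bytes. The caller must keep
// the memory alive until the library reports the operation as finished.
std::vector<segment> wrap_iovec(const struct iovec* iov, int iovcnt) {
  assert(iovcnt >= 0);
  std::vector<segment> out;
  for (int i = 0; i < iovcnt; ++i) {
    segment s = { static_cast<const char*>(iov[i].iov_base), iov[i].iov_len };
    out.push_back(s);
  }
  return out;
}

// Sum of all chunk lengths, i.e. the size of the logical write.
std::size_t total_length(const std::vector<segment>& segs) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < segs.size(); ++i)
    n += segs[i].len;
  return n;
}
```

Because each `segment` points straight into the caller's buffers, this stays zero-copy, which is the property the bufferlist discussion is trying to preserve.
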
[0:17] <josef> sagelap: seems like gcc is being overly pissy
[0:17] <sagelap> josef: weird that error didn't come up on other archs. rgw_cache is currently unused code i think.
[0:17] <gregaf> sagelap: what msgr refs are you worried about?
[0:18] <josef> this was x86_64
[0:18] <josef> i'll just add -fpermissive so it goes away
[0:19] <Tv> i created this as placeholder for the libradospp api work: http://tracker.newdream.net/issues/987
[0:19] <gregaf> sagelap: oh, you mean the copies it keeps in case the message needs to be resent?
[0:19] <josef> i'll just send a patch
[0:19] <sagelap> josef: ok. btw are you still carrying ceph.spec changes?
[0:19] <Tv> gregaf: but... will there be resends after an ack?
[0:19] <gregaf> by "ack or whatever" I meant "when we tell the caller that we're done for certain" ;)
[0:20] <sagelap> gregaf: yeah. we may do send attempts to multiple destinations, and the userspace msgr implementation doesn't let you revoke (i don't think? i can't remember actually).
[0:20] <gregaf> which I guess is the commit for us
[0:20] <sagelap> the static_buffer requires that you have all references gone when you destroy it...
[0:20] <gregaf> no, there's not any message revocation
[0:20] <Tv> sagelap: so you're saying.. the client will do commit ack after a single destination acks it, but will still be trying to send it to n-1 other destinations?
[0:20] <sagelap> we'd need to add that before we can safely avoid the memory copy.
[0:21] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[0:21] <josef> sagelap: i dont think so
[0:21] <josef> i'll double check
[0:21] <sagelap> tv: we send to osd2.. then find out osd2 is down and send to osd3. the Message* we sent to osd2 may still have a reference to the bufferlist.
[0:21] <sagelap> tv: hrm.. actually, Objecter marks the connection down now so those refs should be gone.
[0:22] <sagelap> tv: and the refs for resend aren't kept (Objecter does that). so actually we should be okay with static_buffer..
[0:22] <sagelap> we can add an assert in the static raw buffer destructor to make sure the refs are all gone, just to make sure we don't miss anything
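
A minimal sketch of the destructor assert suggested above, assuming a reference-counted wrapper around caller-owned memory. `static_raw` and its members are hypothetical names, and `std::atomic` stands in for whatever counter type the real buffer code uses.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical "static" raw buffer: wraps memory the caller owns and never
// frees it. The only job of the destructor is to catch leftover references.
class static_raw {
public:
  static_raw(const char* data, std::size_t len)
    : data_(data), len_(len), nref_(0) {}

  ~static_raw() {
    // If a message or ptr still holds a reference here, the caller is about
    // to reuse memory that the library may still read from.
    assert(nref_.load() == 0);
  }

  void get() { nref_.fetch_add(1); }   // take a reference
  void put() { nref_.fetch_sub(1); }   // drop a reference
  int refs() const { return nref_.load(); }

  const char* data() const { return data_; }
  std::size_t length() const { return len_; }

private:
  const char* data_;
  std::size_t len_;
  std::atomic<int> nref_;
};
```
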
[0:23] <gregaf> I guess it does make a mess out of clients who only care about acks and not commits, though :/
[0:23] <Tv> gregaf: we can provide an allocator for them to use
[0:23] <Tv> gregaf: at the cost of having more methods in the interface
[0:24] <Tv> then, for any chunks they get from that allocator, they don't have to keep track & free on commit
[0:24] <gregaf> at that point wouldn't it just make more sense to make them turn everything into a bufferlist and have the API use typedef'd bufferlist pointers?
[0:24] <Tv> gregaf: will that work?
[0:25] <Tv> we can't guarantee what size a bufferlist object is
[0:25] <gregaf> yeah, but if it's just an opaque pointer to the user
[0:25] <Tv> yeah ok that's the "make everything be a pointer indirection" thing
[0:25] <Tv> as long as we can avoid that internally, it should be ok
[0:25] <gregaf> and we provide them a "make_ceph_storage_object(char* data, bool copy_to_internals)"
[0:25] <sagelap> what if the bufferlist allocations just move into buffer.cc?
[0:26] <gregaf> and the first time the bufferlist gets copied it's back on the stack
[0:26] <josef> sagelap: looks like the only difference is yours configures gtk support and such based on what ./configure is given
[0:26] <Tv> sagelap: well right now a lot of code seems to embed bufferlists, changing those to pointers has performance & complexity cost
[0:26] <gregaf> Tv: we don't need to swap it except in the user-facing functions, though
[0:26] <Tv> yeah
[0:27] <gregaf> and that expense is not going to be big compared to anything else we've come up with?
[0:27] <sagelap> tv: the bufferlist itself doesn't have anything that is compilation environment dependent. that's the raw buffers themselves
[0:27] <Tv> gregaf: ok so the cost for that, which i tried to say earlier, is that now you need the append etc operations both on the pointer and the class itself
[0:27] <Tv> gregaf: because api users only ever have the pointer, and internal users want to avoid the pointer to be faster
[0:27] <Tv> sagelap: atomic_t nref
[0:28] <gregaf> oh, and right now our librados users have bufferlists which they can do all the ops on, right
[0:28] <sagelap> that's in the raw buffer. we just need to make all the new calls in buffer.cc and not in the header.
[0:28] <Tv> sagelap: hmmm
[0:28] <Tv> frankly, buffer.h is really hard to read
[0:29] <gregaf> just for my own edification, right now librados just doesn't concern itself with who allocated what, right?
[0:29] <sagelap> librados.hpp?
[0:29] <gregaf> so the buffer pointers happily free stuff from whichever side removes the last ref
[0:30] <Tv> sagelap: ok so.. api users only see pointers to raw, ever, but can e.g. embed bufferlists. that might be doable.
[0:30] <gregaf> yeah, the interface that uses bufferlists ;)
[0:30] <Tv> oh right that's still gonna fail
[0:31] <gregaf> Tv: yeah, I woulda called off the dogs of API-change if it was just the exposal of atomic that was a problem ;)
[0:31] <sagelap> tv: what still gonna fail?
[0:32] <Tv> sooo.. i pass in a non-aligned buffer list; librados wants to align it, thus allocates new ones, copies data, frees the old buffers; or, receives data from socket, allocs buffers, returns them to me the user --- the only way for that to work is for all the allocs/frees to happen inside librados
[0:33] <sagelap> exactly. move all calls to new/delete in buffer.h into buffer.cc.
[0:33] <Tv> yeah and no "new bufferlist" for the user, must go through a factory function
[0:33] <sagelap> most of the rest of buffer.h should do the same :)
[0:34] <sagelap> new bufferlist is okay, because they will delete it too. and bufferlist itself has no atomic_t or anything else, just the std::list<>
[0:34] <Tv> frankly i'm starting to like the C api more and more
[0:34] <sagelap> as long as sizeof(bufferlist) itself doesn't vary
[0:35] <gregaf> it doesn't embed a buffer_ptr, does it?
[0:35] <sagelap> nope. there is a std::list<ptr>, though.. but ptr also doesn't have an atomic_t and will have constant size.
[0:39] <cmccabe> one thing I'd like to see in buffer.h is to make it possible to forward-declare buffer::list
[0:39] <cmccabe> currently buffer::list is an inner class and as such, cannot be forward declared.
[0:42] * alexxy[home] (~alexxy@ has joined #ceph
[0:43] <sagelap> cmccabe: the inner classes just seemed like a good idea at the time. i'm not attached to it.
[0:45] <cmccabe> sagelap: inner classes and enums are the two things that can't be forward-declared
[0:45] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[0:45] <cmccabe> sagelap: it's frustrating when you realize there are no workarounds
[0:46] <cmccabe> sagelap: I agree with the discussion about moving the new/delete/atomic increment stuff into buffer.cc
[0:46] * alexxy[home] (~alexxy@ Quit (Remote host closed the connection)
[0:48] <cmccabe> I guess maybe we need to start putting a comment at the top of headers that are part of a public API
[0:50] * alexxy (~alexxy@ has joined #ceph
[0:51] <Tv> so, umm.. class raw can't be defined in buffer.h, because it embeds atomic_t nref ?
[0:51] <Tv> i'm not sure i see how this is going to work
[0:52] <Tv> or, i should say -- i don't see how bufferlists really are laid out, in the first place
[0:52] <Tv> there's list, ptr and raw
[0:52] <Tv> list is a list of ptr's
[0:52] <Tv> ptr contains actual pointer to raw, and offset and length
[0:52] <Tv> so.. raw will become an implementation detail?
[0:53] <Tv> and public api is about lists and ptrs
[0:53] <sagelap> right.
[0:53] <Tv> ptr(raw *r) : _raw(r), _off(0), _len(r->len) { // no lock needed; this is an unref raw.
[0:53] <Tv> r->nref.inc();
[0:53] <sagelap> raw can live in buffer.cc. the constructors are all private anyway.. there are factory functions to build them
[0:53] <Tv> bdout << "ptr " << this << " get " << _raw << bendl;
[0:53] <Tv> }
[0:54] <sagelap> well, could be private, not sure if they actually are
[0:54] <Tv> so.. if there is no "class raw" at all in buffer.h, the above has to go away
[0:54] <sagelap> ptr::ptr() just needs to go into buffer.cc
[0:54] <sagelap> and buffer.h should just have a
[0:54] <sagelap> class raw;
[0:54] <sagelap> forward decl
[0:54] <Tv> oh except if it acts like c's "struct foo;", no internals defined
[0:54] <Tv> yeah
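
The header/implementation split described above can be sketched in a single file as follows. This is an illustration under assumed names (`create_raw`, `raw_refs`, the member layout), not the actual buffer.h; the two halves are marked by comments.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// ---- would stay in the public header (hypothetical layout) ----
class raw;  // incomplete type: its size and its atomic counter stay hidden

class ptr {
public:
  explicit ptr(raw* r);            // defined out of line, next to raw
  ~ptr();
  ptr(const ptr&) = delete;        // copying omitted to keep the sketch short
  ptr& operator=(const ptr&) = delete;
  std::size_t offset() const { return _off; }
  std::size_t length() const { return _len; }
private:
  raw* _raw;
  std::size_t _off, _len;
};

raw* create_raw(std::size_t len);  // factory: allocation happens in the library
int raw_refs(const raw* r);        // helper so callers never touch raw's fields

// ---- would move into the .cc file ----
class raw {
public:
  explicit raw(std::size_t l) : data(new char[l]), len(l), nref(0) {}
  ~raw() { delete[] data; }
  char* data;
  std::size_t len;
  std::atomic<int> nref;           // implementation detail, invisible to users
};

raw* create_raw(std::size_t len) { return new raw(len); }
int raw_refs(const raw* r) { return r->nref.load(); }

ptr::ptr(raw* r) : _raw(r), _off(0), _len(r->len) {
  assert(r != nullptr);
  r->nref.fetch_add(1);
}

ptr::~ptr() {
  // Last reference frees the raw buffer inside the library.
  if (_raw->nref.fetch_sub(1) == 1)
    delete _raw;
}
```

Since `ptr` only stores a `raw*`, `sizeof(ptr)` does not depend on how the reference count is implemented, at the cost of losing inlining for the constructor and destructor.
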
[0:56] <sagelap> we'll kill all the inlining, but that will probably be a net win given how much code it translates into.
[0:57] <josef> sagelap: ok it built fine
[0:57] * alexxy[home] (~alexxy@ has joined #ceph
[0:58] <sagelap> josef: yay!
[1:00] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[1:09] <cmccabe> sagelap: unfortunately, you cannot forward declare inner classes, even within the enclosing class
[1:09] <cmccabe> sagelap: class raw will have to move out of class buffer and learn to live on its own
[1:15] <sagelap> cmccabe: yeah that's fine
[1:16] <cmccabe> sagelap: yeah
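
For reference, standard C++ does allow a nested class to be declared inside its enclosing class and defined later at namespace scope; what is not possible is naming `buffer::raw` before `buffer` itself has been defined. A small sketch with made-up members (`create`, `length_of`, `destroy` are hypothetical):

```cpp
#include <cassert>
#include <cstddef>

class buffer {
public:
  class raw;                       // declared here, defined further down

  class ptr {
  public:
    explicit ptr(raw* r) : _raw(r) {}
    raw* get_raw() const { return _raw; }
  private:
    raw* _raw;                     // a pointer to an incomplete type is fine
  };

  static raw* create(std::size_t len);
  static std::size_t length_of(const raw* r);
  static void destroy(raw* r);
};

// This part could live in a .cc file that API users never see.
class buffer::raw {
public:
  explicit raw(std::size_t l) : len(l) {}
  std::size_t len;
};

buffer::raw* buffer::create(std::size_t len) { return new raw(len); }
std::size_t buffer::length_of(const raw* r) { assert(r); return r->len; }
void buffer::destroy(raw* r) { delete r; }
```

Moving `raw` out of `buffer` is still needed if other headers should be able to forward-declare it without including buffer.h.
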
[1:24] <cmccabe> so I checked out those confutils valgrind leaks, and I don't think they're real
[1:29] <sagelap> interesting tidbit: google is ditching autotest and building something new.
[1:30] <cmccabe> sagelap: open source or closed?
[1:31] <sagelap> not sure. i'll try to drag more information out of him later tonight
[1:31] <Tv> sagelap: yup, autotest is a pain; at the same time, it's the best that's out there right now
[1:31] <Tv> i already have a bunch of fixes on top of it, to make it less painful...
[1:31] <sagelap> yeah
[1:31] <Tv> it really would require a fork+cleanup to be nice
[1:32] <cmccabe> since google has so much cluster management and stuff in place, like LTT-ng, I could see them just going with something completely custom
[1:32] <sagelap> another tidbit: lustre client is 300+ kloc
[1:33] <cmccabe> which unfortunately wouldn't be much help to outsiders of course
[1:33] * sagelap (~sage@ Quit (Remote host closed the connection)
[1:43] <gregaf> damn, 300kloc is bigger than all of Ceph
[1:43] <gregaf> I guess it needs to handle block layouts and everything too though
[1:44] <cmccabe> I get 257,408 for all of ceph
[1:44] <cmccabe> but that's counting many irrelevant things like READMEs
[1:45] <cmccabe> and of course, not the kclient.
[1:46] <cmccabe> oh actually wait, I don't count files unless they have the right extension. So no README.
[1:46] <gregaf> ah, yeah, I found a perl script that seems to have sane defaults
[1:46] <djlee1> guys in 0.26, it says cosd no longer blocks when scrubbing,
[1:46] <gregaf> last I ran it I got ~150k SLOC for the userspace
[1:46] <cmccabe> also gtest contributes 47k, and ebofs 11k, that are not used
[1:46] <djlee1> i'd thought the scrubbing happens only when idle
[1:47] <djlee1> just judging from watching ceph -w
[1:47] <gregaf> djlee1: in general it does
[1:48] <gregaf> but it can also happen during recovery, and in that circumstance the blocking (which on a healthy cluster is generally for a very short period) can stretch out for long periods of time
[1:48] <djlee1> i see
[1:49] <djlee1> greg, about yesterday's reply about 2x, 3x replc. writes being slow by that factor, I really think it's the case, never seen any faster
[1:49] <djlee1> or even slowed down by a little
[1:51] <Tv> cpp: 171560 (78.52%)
[1:51] <Tv> python: 19950 (9.13%)
[1:51] <Tv> sh: 13908 (6.37%)
[1:51] <Tv> ansic: 9436 (4.32%)
[1:51] <Tv> perl: 2322 (1.06%)
[1:51] <Tv> java: 1322 (0.61%)
[1:52] <Tv> <3 sloccount
[1:53] <cmccabe> I tried sloccount, but it segV'ed on me
[1:53] <cmccabe> so I just used wc
[1:55] <djlee1> if anybody can confirm say, 2x replc. write slowing down by about 10%, that's great, but not by about 50%..
[1:55] <cmccabe> looks like I installed sloccount wrong
[1:55] <gregaf> ah, yeah, I think sloccount is what I found but I seem to have lost it
[1:55] <cmccabe> it's in apt-get
[1:55] <djlee1> simplest example: 1 node 2 osd, or 2 node 1 osd. I think the results were similar (for 2x repl.)
[1:56] <cmccabe> djlee1: I seem to remember a thread about slow OSDs causing bad performance on multi-node setups
[1:56] * imcsk8 (~ichavero@nat.ti.uach.mx) Quit (Quit: Leaving)
[1:56] <cmccabe> djlee1: I don't know if that's what's happening in your case or what
[1:56] <cmccabe> djlee1: performance tuning is pretty tough in general and it usually boils down to finding the bottleneck. Like what kind of interconnect do you have?
[1:58] <djlee1> 2gb/s all, i confirm that a real file-write to osds go 220mb/s, so that's sweet, but i remember when i convert to 2x replication mode, speed halved. maybe i need to try again with new versions..
[1:58] <djlee1> but i'd like to know if anyone actually see the speed slowing down by 40-50% for 2x, 3x, etc.
[1:58] <djlee1> if not, what kind of configurations..? :p
[1:59] <cmccabe> djlee1: when you say 2 gb/s, is that measured speed by iperf or something?
[2:00] <djlee1> iperf, as well as actual ffsb and iozone reporting,
[2:00] <cmccabe> you might try measuring throughput to the disk on both nodes
[2:01] <gregaf> I forget, how many nodes did you have again?
[2:01] <djlee1> i think i got it alright for 1x testing, but not for 2x,
[2:01] <djlee1> using 2 node 6 osd each
[2:01] <gregaf> and you have bonded gigabit going to each node
[2:01] <djlee1> actually 2 node and 3 node each, also gave similar maxed-out 200mb/s
[2:01] <gregaf> yeah, with 2x you'd expect it to get cut in half on a two node setup
[2:02] <gregaf> err, wait, maybe I'm thinking about that wrong?
[2:03] <gregaf> yeah, I am — the client is pumping out 2Gb/s of data that gets spread across two 2Gb/s nodes
[2:03] <gregaf> and they each ought to be able to handle the full stream from the client, which is 2x replication
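
The arithmetic behind that correction can be written down as a back-of-the-envelope model. It assumes primary-copy replication (the client writes each object once to a primary, which forwards it to the other replicas), perfectly even placement, and replicas on distinct nodes; the function names are invented and disk throughput is ignored.

```cpp
#include <cassert>

// Average write traffic arriving at each node (client writes plus replica
// copies), in the same unit as client_rate.
double inbound_per_node(double client_rate, int replicas, int nodes) {
  assert(nodes > 0 && replicas > 0 && replicas <= nodes);
  return client_rate * replicas / nodes;
}

// Average replication traffic each node sends to its peers.
double outbound_replication_per_node(double client_rate, int replicas, int nodes) {
  assert(nodes > 0 && replicas > 0 && replicas <= nodes);
  return client_rate * (replicas - 1) / nodes;
}
```

With a 2 Gb/s client, 2 nodes and 2x replication, this model gives 2 Gb/s inbound and 1 Gb/s outbound per node, so each 2 Gb/s link is full but not oversubscribed. Under these assumptions the network alone would not explain a halving; the disks, which now write twice the data, are another place to look.
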
[2:04] <djlee1> yeah, client -> 2gb/s, but each node pushes back to client about 1gb/s (but that's okay, it means balance is going good!),
[2:04] <gregaf> wait, why do the nodes push back to client?
[2:04] <djlee1> but i havent exactly measured what happens (at the NICs) when 2x replicated,
[2:05] <djlee1> i mean, push back (reading back), or push-out (writing to)
[2:06] <gregaf> oh, if the client is reading off the nodes at the same time then that would explain why 2x replication drops off in performance — the nodes need to service both the client's reads and the replication traffic on 2Gb/s outbound
[2:09] <djlee1> hmm, when reading, no replication is done.
[2:09] <Tv> FYI i got a librados.hpp to compile with a modified buffer.h & buffer.cc
[2:09] <Tv> i mean, a c++ client app
[2:09] <Tv> happily calling Rados::version
[2:10] <gregaf> djlee1: but you said it was doing reads while writing?
[2:13] <djlee1> right, sorry for confusing you! i meant, client can basically write or read everything at max. 2gb/s to the nodes, and so can the nodes, it's just that the client bottlenecks at 2gb/s,
[2:17] <djlee1> by default map, at 2x, the replc objects are never stored in the same osd for sure?
[2:18] <gregaf> they're not on the same OSD, but the default map doesn't know which OSDs are on the same node so you need to set that up yourself
[2:24] <djlee1> i'll try again with 2x, but last time i did 2x, the write-performance was increasing as i add more osds, just that all mb/s were about half of 1x, heh
[2:25] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:26] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit ()
[2:28] <djlee1> but my strong guess is that, at 2x replc., hdds are doing partial-random-writes as writing replc objects as well,
[2:29] <djlee1> so maybe i need to divide the part, so that e.g., 2nd node's osd only gets the replc. copies
[2:29] <djlee1> or if it's just single node with 12 disks, i set osd 6-12 get only the replc objects
[2:29] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:32] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:32] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:37] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:38] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:45] * cmccabe (~cmccabe@ Quit (Quit: Leaving.)
[2:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:52] * greglap (~Adium@ has joined #ceph
[3:30] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Operation timed out)
[3:52] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[4:00] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[6:23] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[6:24] * lxo (~aoliva@ has joined #ceph
[7:29] * lxo (~aoliva@ Quit (Read error: Connection reset by peer)
[7:29] * lxo (~aoliva@ has joined #ceph
[7:30] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[8:14] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:18] * lxo (~aoliva@ Quit (Read error: Connection reset by peer)
[8:18] * lxo (~aoliva@ has joined #ceph
[8:30] * Psi-Jack (~psi-jack@ has joined #ceph
[8:59] * allsystemsarego (~allsystem@ has joined #ceph
[9:11] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:16] * gregorg (~Greg@ Quit (Quit: Quitte)
[9:16] * gregorg (~Greg@ has joined #ceph
[9:42] * allsystemsarego (~allsystem@ Quit (Ping timeout: 480 seconds)
[9:56] * allsystemsarego (~allsystem@ has joined #ceph
[10:02] * alexxy[home] (~alexxy@ Quit (Ping timeout: 480 seconds)
[10:07] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[10:07] * Yoric (~David@ has joined #ceph
[13:08] * Yoric_ (~David@ has joined #ceph
[13:08] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[13:08] * Yoric_ is now known as Yoric
[14:11] * samsung (~samsung@ has joined #ceph
[14:12] * Administrator_ (~samsung@ has joined #ceph
[14:12] * samsung (~samsung@ Quit ()
[14:43] * chraible (~chraible@blackhole.science-computing.de) has joined #ceph
[14:44] <chraible> hi @all
[14:44] <chraible> I got the following error when I want to configure ceph: "configure: error: no suitable crypto library found"
[14:44] <chraible> which library do I need on CentOS 5.5
[14:44] <chraible> ?
[15:24] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:38] * nms (martin@sexyba.be) has joined #ceph
[15:38] <chraible> k crypto library problem fixed...
[15:39] <chraible> hope you don't have anything important on it
[15:39] <chraible> <slyfox^w> nope
[15:39] <chraible> <slyfox^w> just created!
[15:39] <chraible> <slyfox^w> btw, compressed image size is only 18KB
[15:39] <chraible> <slyfox^w> i can share it if you want
[15:39] <chraible> <slyfox^w> (maybe it's only my setup is screwed and oopses)
[15:39] <chraible> <sensille> if you just created it, how did you corrupt it already?
[15:39] <chraible> <slyfox^w> as it's a usermode linux it can be broken on it's own
[15:39] <chraible> <slyfox^w> or 2.6.39-rc2 is screwed
[15:39] * chraible (~chraible@blackhole.science-computing.de) Quit (Quit: Verlassend)
[15:45] * alexxy (~alexxy@ has joined #ceph
[15:55] * Psi-Jack_ (~psi-jack@yggdrasil.hostdruids.com) has joined #ceph
[16:02] * Psi-Jack (~psi-jack@ Quit (Ping timeout: 480 seconds)
[16:05] * Administrator_ (~samsung@ Quit (Quit: Leaving)
[17:17] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:34] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:53] * greglap (~Adium@ has joined #ceph
[18:07] * sagelap (~sage@ has joined #ceph
[18:13] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:16] <sagelap> tv: ah, linus pulled the key stuff after -rc1.
[18:16] <Tv> nice
[18:24] * Yoric (~David@ Quit (Quit: Yoric)
[18:36] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:47] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[18:48] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[18:51] * alexxy (~alexxy@ has joined #ceph
[18:55] * alexxy[home] (~alexxy@ has joined #ceph
[18:58] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:59] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[19:13] * cmccabe (~cmccabe@ has joined #ceph
[19:24] * sagelap1 (~sage@ has joined #ceph
[19:24] * sagelap (~sage@ Quit (Read error: Connection reset by peer)
[19:30] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:32] <jjchen> Given an object name, what is the quickest way to find out its pgid and which osds the replicas for this object reside. In general, is there a way to find out for each osd, what are the pgs and objects on it, like a system wide object or pg map? Thx.
[19:37] * sagelap1 (~sage@ Quit (Ping timeout: 480 seconds)
[19:39] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[19:42] * sagelap (~sage@ has joined #ceph
[19:42] <Tv> sagelap: not much luck with the skype..
[19:42] <sagelap> well no luck with skype
[19:43] <sagelap> either flaky wireless or noisy room
[19:43] <gregaf> ah sorry, I left my phone in the office — didn't get your text
[19:43] <sagelap> any roadblocks?
[19:43] <gregaf> things are pretty much humming along
[19:44] <gregaf> one of the playground OSDs got a bad disk yesterday and the system kept working
[19:44] <gregaf> sjust might know more about what's happened since yesterday, not sure though
[19:44] <sagelap> do you know by chance if it had fully recovered osd3 by that time?
[19:44] <sjust> that pretty much sums it up, bad disk
[19:45] <sagelap> (i marked it out yesterday bc it was going crazy slow)
[19:45] <gregaf> ah, yeah, osd3 was the bad one :)
[19:45] <sagelap> you sure? osd4 is showing errors today
[19:45] <gregaf> at least we assume that's why the slow performance, a dd on the btrfs filesystem got ~950kb/s
[19:45] <sjust> uh oh, another seems to be down
[19:46] <sjust> osd4 this time
[19:47] <sjust> ugh: [ 162.158350] ata2.00: failed command: READ FPDMA QUEUED
[19:47] <sjust> [ 162.163563] ata2.00: cmd 60/00:08:28:c5:45/01:00:2e:00:00/40 tag 1 ncq 131072 in
[19:47] <sjust> [ 162.163564] res 51/40:00:28:c5:45/40:01:2e:00:00/40 Emask 0x9 (media error)
[19:47] <sjust> [ 162.178872] ata2.00: status: { DRDY ERR }
[19:47] <sjust> [ 162.182947] ata2.00: error: { UNC }
[19:47] <sjust> seems like a second hardware problem
[19:47] <sagelap> i never actually saw a hw issue on 3. now it's failing on ENOSPC. i'm hoping its a btrfs issue..
[19:49] <sagelap> sorta wished we were doing 3x
[19:49] <joshd> jjchen: you can get the mapping of pgs to osds with 'ceph pg dump -o -', I'm not sure about mapping objects to pgs
[19:50] <sjust> sagelap: actually, a dd directly from the disk is getting >40MB/s so probably is btrfs
[19:51] <sjust> on osd3, that is
[19:59] <gregaf> sjust: you tested all the disks?
[20:00] <sjust> oh, right, forgot about that
[20:00] <sjust> no
[20:00] <gregaf> you're the one who told me there are 11 of them or something :p
[20:00] <sjust> yeah, yeah, looking
[20:01] <gregaf> but if there are also ENOSPC issues it might be a btrfs balancing problem :/
[20:01] <sjust> yeah, it claims to be at 85% use, definitely could be a balancing problem
[20:05] <sagelap> what's strange is recovery is making progress but no new writes are being accepted
[20:05] <sjust> ok, the disks individually are probably fine
[20:05] <sjust> sagelap: on osd3?
[20:06] <sagelap> i take it back, just 'Object 0' that the ancient rados bench on swing0 is trying to write
[20:11] <gregaf> sagelap: any idea why a rename would involve an inode that starts out in the stray directory?
[20:35] <sagelap> yeah
[20:36] <sagelap> if you have two links to a file, and you remove the primary, it goes into stray. later the mds will try to move it back into the normal hierarchy by renaming it over the remote link
[20:36] <gregaf> ah, that makes sense
[20:36] <sagelap> hmm, is it broken? it occurs to me the check i added in 0.26 that renaming links over each other is a no-op may have broken it
[20:37] <sagelap> hmm and ripping out the linkmerge stuff
[20:37] <gregaf> there's some issue with locking but I don't think that's the broken part
[20:37] <gregaf> FAILED assert(get_xlock_by() == 0)
[20:38] <gregaf> it looks like the lead MDS on a client request got remote xlocks
[20:38] <gregaf> and then finished the request and started on another request which required similar remote xlocks
[20:38] <gregaf> without waiting for the first set of xlocks to be dropped by the remote journaling
[20:39] <gregaf> I'm trying to figure out if the problem is that we request_drop_foreign_locks and then forget the xlock without waiting for any kind of reply
[20:39] <gregaf> or if the problem is that we never handled this case
[20:39] <sagelap> is that done by early_reply?
[20:39] <gregaf> no, I checked that we weren't doing early_reply
[20:40] <gregaf> but remember we modified reply_request's helper functions and how they drop the remote locks
[20:40] <sagelap> in that case i don't think there is another reply, is there? the master sends "finished!" to the slaves and they drop their state. the slave should then journal the commit async, but it never talks to the master again, right?
[20:41] <gregaf> yeah
[20:41] <gregaf> except the master is then sending out requests for xlocks before the async journal finishes and the slave's xlocks get dropped
[20:42] <gregaf> from the slave's point of view:
[20:42] <gregaf> slave_request finish!
[20:42] <sagelap> this is no the no-longer-auth inode on the slave?
[20:42] <sagelap> s/no/on/
[20:43] <gregaf> so I think it actually maintained auth for some reason, but I'm not sure
[20:43] <gregaf> the slave gets the request finish, then it starts the commit
[20:43] <gregaf> normally the locking gets cleaned up in the commit_finish handler, which calls request_finish or whatever the function is called
[20:44] <gregaf> but in this case it's getting requests for xlocks before the commit comes back as done
[20:44] <bchrisman> (quick aside, what are you guys referring to with the 'slave' moniker.. is that a client that's not authoritative or something? osd?)
[20:44] <sagelap> hmm i see
[20:44] <sagelap> bchrisman: this is for renames across mds nodes.. one is the master and one is the slave for the given request
[20:45] <bchrisman> ahh okay.. thx makes sense
[20:47] <gregaf> so I'm trying to figure out if it was broken going into this rename, or if the master's broken, or the slave is
[20:47] <sagelap> gregaf: maybe the problem is just that the xlock request shouldn't assert that it's not already xlocked. that could be a problem more generally
[20:47] <gregaf> but it can't be xlocked by more than one request at a time, right?
[20:47] <gregaf> otherwise it's…not really an exclusive lock!
[20:48] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[20:48] <sagelap> right, but the second request should wait, not assert
[20:48] <sagelap> in this case, wait until the slave commit happens and the locks are dropped.
[20:49] <gregaf> okay, so probably the slave is broken and it's doing a blind xlock without checking to see if it should wait
[20:49] <sagelap> yeah, in the xlock request handling
[20:49] <sagelap> that's where it asserted, right?
[20:50] <gregaf> yeah
[20:50] <gregaf> 1: (Locker::local_xlock_start(LocalLock*, MDRequest*)+0x3a7) [0x5c7597]
[20:50] <gregaf> 2: (Locker::xlock_start(SimpleLock*, MDRequest*)+0x38e) [0x5c99ae]
[20:50] <gregaf> 3: (Locker::acquire_locks(MDRequest*, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&)+0x24f3) [0x5dd243]
[20:50] <gregaf> 4: (Server::dispatch_slave_request(MDRequest*)+0x45c) [0x51a3cc]
[20:50] <gregaf> 5: (Server::handle_slave_request(MMDSSlaveRequest*)+0x148) [0x526ad8]
[20:50] <sagelap> ah.
[20:51] <sagelap> it might be that the not-yet-xlocked check is happening in dispatch_slave_req instead of further down. the local xlock (versionlock) is added to the xlock set by acquire_locks
[20:52] <gregaf> well xlock_start has checks to make sure it's xlockable, otherwise it places the request on a wait list and returns false to acquire_locks
[20:52] <sagelap> actually it sounds like can_xlock_local() is broken
[20:53] <sagelap> if get_xlock() immediately asserted after that
[20:55] <gregaf> oh, can_xlock and get_xlock don't match up
[20:55] <gregaf> can_xlock returns true if get_xlock_by_client() == client
[20:56] <gregaf> but get_xlock asserts get_xlock_by_client() == 0
[20:56] <Tv> food?
[20:56] <gregaf> I'm planning on Qdoba but I missed the fast pick-up time so I'm trying to wait :)
[20:57] <Tv> too hungry for that now
[20:57] <gregaf> err, I mean get_xlock checks that get_xlock_by() == 0
[20:59] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:59] <gregaf> and I was a little mistaken there, because yes the error is actually in local_xlock_start
[21:00] <gregaf> where can_xlock_local just checks !is_wrlocked()
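
The "second request should wait, not assert" behaviour discussed above can be illustrated with a toy exclusive lock. This is not the MDS Locker code; `lock_sketch` and its members are hypothetical, and requests are reduced to integer ids.

```cpp
#include <cassert>
#include <deque>

// Toy exclusive lock: a request that finds the lock held is queued for a
// retry instead of tripping an assert.
struct lock_sketch {
  int xlock_by;                 // 0 means "not xlocked", otherwise a request id
  std::deque<int> waiters;      // requests to retry once the lock is released

  lock_sketch() : xlock_by(0) {}

  // Returns true if req now holds the lock, false if it was put on the
  // wait list (the caller would back off and retry later).
  bool xlock_start(int req) {
    assert(req != 0);
    if (xlock_by != 0) {
      waiters.push_back(req);
      return false;
    }
    xlock_by = req;
    return true;
  }

  // Release the lock, e.g. once the slave's journal commit has finished.
  // Returns the next request to retry, or 0 if nobody is waiting.
  int xlock_finish() {
    assert(xlock_by != 0);
    xlock_by = 0;
    if (waiters.empty())
      return 0;
    int next = waiters.front();
    waiters.pop_front();
    return next;
  }
};
```

The point of the sketch is that the "can I take it?" check and the "take it" step test the same condition (`xlock_by != 0`), which is the mismatch the discussion identifies between the can_xlock and get_xlock style checks.
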
[21:02] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[21:10] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:35] * sagelap (~sage@ has joined #ceph
[21:39] * sagelap (~sage@ has left #ceph
[21:39] * sagelap (~sage@ has joined #ceph
[21:59] * alexxy[home] (~alexxy@ Quit (Ping timeout: 480 seconds)
[22:36] * sagelap (~sage@ Quit (Read error: Connection reset by peer)
[22:36] * sagelap (~sage@ has joined #ceph
[23:16] * cmccabe (~cmccabe@ Quit (Ping timeout: 480 seconds)
[23:16] * cmccabe (~cmccabe@ has joined #ceph
[23:48] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.