#ceph IRC Log


IRC Log for 2011-03-09

Timestamps are in GMT/BST.

[0:11] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[0:15] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit ()
[0:27] <Tv> ../src/gtest/lib/.libs/libgtest.so: undefined reference to `pthread_key_create'
[0:27] <Tv> hate
[0:27] <Tv> so much hate
[0:31] <Tv> i think i'm seeing something like a race condition in autoconf
[0:31] <Tv> i keep fiddling with comment/uncomment chunks and every now and then it'll compile
[0:31] <cmccabe1> from the linker perspective, gtest should come before lpthread on the link line
[0:31] <Tv> time to clear ccache, just in case
[0:32] <cmccabe1> I think I used to get that problem when compiling on older versions of debian, like on flab
[0:32] <cmccabe1> I was usually able to "fix" it by adding -lpthread in a bunch of places
[0:33] <cmccabe1> I was never really able to come up with a good patch though
[0:33] <cmccabe1> libtool makes it a lot harder to understand what's going on.
[0:37] <cmccabe1> here might be one place to start:
[0:37] <cmccabe1> cmccabe@metropolis:~/src/ceph2/src/gtest$ grep pthread ./lib/libgtest.la
[0:37] <cmccabe1> inherited_linker_flags=' -pthread'
[0:37] <cmccabe1> (yes, under libtool, the .la file is not a real library, it's a text file with... stuff)
[1:16] <Tv> the lovely part is.. it still depends on something invisible on whether that fails or not
[1:18] <cmccabe1> tv: well, on the plus side, a lot of open source projects use automake, so if you learn more about it, you'll probably use it eventually
[1:19] <Tv> yeah but once a build tool throws reliable builds out the window, i'm not very enthusiastic about the learning part
[2:08] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Remote host closed the connection)
[2:13] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[2:38] * joshd1 (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:40] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:55] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:57] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit ()
[2:58] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:59] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit ()
[3:08] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[4:01] * sjusthm (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) Quit (Quit: Leaving.)
[6:04] * Jiaju (~jjzhang@ Quit (Remote host closed the connection)
[7:01] * yehuda_wk (~quassel@ip-66-33-206-8.dreamhost.com) has joined #ceph
[7:05] * yehudasa (~quassel@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[7:05] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[7:06] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[7:29] * atgeek (~atg@please.dont.hacktheinter.net) has joined #ceph
[7:29] * monrad (~mmk@domitian.tdx.dk) has joined #ceph
[7:31] * Meths_ (rift@ has joined #ceph
[7:31] * chip (~chip@brma.tinsaucer.com) Quit (resistance.oftc.net osmotic.oftc.net)
[7:31] * Meths (rift@ Quit (resistance.oftc.net osmotic.oftc.net)
[7:31] * atg (~atg@please.dont.hacktheinter.net) Quit (resistance.oftc.net osmotic.oftc.net)
[7:31] * monrad-51468 (~mmk@domitian.tdx.dk) Quit (resistance.oftc.net osmotic.oftc.net)
[7:31] * chip (~chip@brma.tinsaucer.com) has joined #ceph
[8:16] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:57] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[8:57] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[9:18] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:19] * cmccabe1 (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:29] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[9:56] * verwilst_ (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[10:04] * verwilst_ (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[10:06] * allsystemsarego (~allsystem@ has joined #ceph
[10:18] * Yoric (~David@ has joined #ceph
[10:26] * Yoric (~David@ Quit (Quit: Yoric)
[10:27] * Yoric (~David@ has joined #ceph
[10:33] * uppi (ca419f6a@ircip3.mibbit.com) has joined #ceph
[10:34] * uppi (ca419f6a@ircip3.mibbit.com) Quit ()
[11:15] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[11:19] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[11:21] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[11:24] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[12:37] * Guest3918 (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) Quit (Remote host closed the connection)
[12:39] * bbigras (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) has joined #ceph
[12:59] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[13:23] * Meths_ is now known as Meths
[14:50] * Yoric_ (~David@ has joined #ceph
[14:50] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[14:50] * Yoric_ is now known as Yoric
[15:28] * bbigras is now known as Guest34
[15:30] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[15:47] * verwilst_ (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[15:50] * Yoric (~David@ Quit (Quit: Yoric)
[15:52] * Yoric (~David@ has joined #ceph
[15:52] * Yoric_ (~David@ has joined #ceph
[15:52] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:52] * Yoric_ is now known as Yoric
[17:18] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:25] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[17:47] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:50] * greglap (~Adium@ has joined #ceph
[18:12] * Yoric_ (~David@ has joined #ceph
[18:16] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[18:16] * Yoric_ is now known as Yoric
[18:24] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:28] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:34] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[18:43] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:49] <Tv> cmccabe1: hi. can you explain the reasoning behind commit 463d624d, -Wl,--as-needed? the commit message doesn't include any reasons for the change.
[18:59] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:12] <gregaf> Tv: what information are you after with that?
[19:12] <Tv> the "why" part; was it purely to make the libs smaller, was it a bug workaround, etc
[19:13] <Tv> because --as-needed is a very special option in that its position matters
[19:13] <gregaf> ah
[19:13] <Tv> and i think it is the source of some of the pthread linking confusion that's roaming around
[19:13] <gregaf> I believe it was just to make the libs smaller, but to be certain we'll have to ask later
[19:14] <Tv> now we have stuff like -lpthread -Wl,--as-needed -lpthread slapped on as workarounds in the final commands
[19:14] <Tv> just add more -lpthreads until you accidentally get on the correct side of --as-needed
[19:14] <Tv> that didn't sound like a good idea..
[19:15] * verwilst_ (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[19:16] <cmccabe> tv: I added as-needed at sage's request. Basically we did it to weed out extraneous dependencies
[19:16] <cmccabe> tv: arguably, it might be better to clean up those dependencies manually, but we didn't explore that path
[19:17] <cmccabe> tv: I have a hunch that doing manual dependency cleanup would also probably require splitting up certain libraries into smaller parts
[19:23] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[19:26] <cmccabe> tv: I'm pretty sure that AM_LDFLAGS always comes before AM_LDADD
[19:27] <cmccabe> tv: so I don't understand your comment about "getting on the right side of --as-needed"... where does the randomness come from?
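Tv's remark that the position of --as-needed matters can be illustrated with a toy model. With GNU ld, a shared library that appears after --as-needed is recorded as a dependency only if, at that point on the command line, it satisfies a symbol that an earlier object file left undefined. The sketch below is a simplified single-pass simulation, not the real linker: the function name, the dictionaries, and the names libpthread and test.o are hypothetical stand-ins, and it ignores references coming from other shared libraries such as libgtest.so.

```python
def link(command_line, exports, undefined_refs):
    """Toy single-pass model of a link line.

    command_line   -- ordered items: object names, library names, or the
                      flags "--as-needed" / "--no-as-needed"
    exports        -- library name -> set of symbols it defines
    undefined_refs -- object name -> set of symbols it needs
    Returns (libraries recorded as needed, symbols still unresolved).
    """
    as_needed = False
    defined = set()      # symbols provided by libraries kept so far
    unresolved = set()   # symbols wanted by objects seen so far
    needed = []
    for item in command_line:
        if item == "--as-needed":
            as_needed = True
        elif item == "--no-as-needed":
            as_needed = False
        elif item in exports:
            # A library: under --as-needed it is kept only if it resolves
            # something that is unresolved right now.
            useful = exports[item] & unresolved
            if useful or not as_needed:
                if item not in needed:
                    needed.append(item)
                defined |= exports[item]
                unresolved -= exports[item]
        else:
            # An object file: its references stay unresolved unless a
            # library kept earlier already defines them.
            unresolved |= undefined_refs.get(item, set()) - defined
    return needed, unresolved


# Hypothetical inputs: one library, one object that needs a pthread symbol.
EXPORTS = {"libpthread": {"pthread_key_create"}}
REFS = {"test.o": {"pthread_key_create"}}

# Library listed before the object, after --as-needed: it gets dropped.
before = link(["--as-needed", "libpthread", "test.o"], EXPORTS, REFS)
# Library listed after the object: it is kept.
after = link(["--as-needed", "test.o", "libpthread"], EXPORTS, REFS)
# Library listed before the flag takes effect: it is kept unconditionally.
outside = link(["libpthread", "--as-needed", "test.o"], EXPORTS, REFS)
```

In this model the same three tokens produce a failed or a successful link depending only on their order, which matches the "add more -lpthreads until one lands on the correct side" workaround described above.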
[19:33] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[19:42] <lxo> hey, folks, is it normal for unlink, creat, link, mkdir or getdents (right after mkdir) to take several seconds every now and again? (like, a long pause in rsync every minute or so)
[19:43] <lxo> I figured out I can greatly speed up rsync onto ceph by pre-creating directories, because one of its processes often does mkdir followed immediately by getdents, and getdents blocks for a few seconds at that point more often than not
[19:44] <gregaf> lxo: hmm
[19:44] <lxo> oh, symlink is another syscall that often triggers such a long pause
[19:44] <gregaf> like any one of those operations individually triggers a pause, or they trigger a pause if they're in a just-created directory?
[19:44] <lxo> I found out that with one copy of metadata, the pause is some 3-4 seconds; with 2, it was 5-6; with 3, 7-10
[19:45] <lxo> any of these operations individually; getdents is the only one that AFAICT is affected by the just-created dir. sorry about the ambiguity
[19:45] <gregaf> hmm
[19:45] <lxo> this is on ceph 0.24.3, 3 nodes, rsync running on one of the nodes
[19:45] <lxo> I tried with a single node as well, not much of a difference
[19:45] <gregaf> sorry, rsync running on one of the nodes?
[19:46] <lxo> yep, I'm rsyncing from a local filesystem on one of the nodes that hosts osds (and also a mon and a mds)
[19:47] <gregaf> ah
[19:47] <gregaf> is the local filesystem hosting the OSD data store?
[19:47] <lxo> tried tar as well. it's slightly faster because it doesn't do the getdents, but also slower because it does mkdir on the same process, and it slows to a crawl at the end because of all the symlinks
[19:48] <lxo> yes and no. there are 3 disks on this one machine, each running its own osd
[19:48] <gregaf> and the local filesystem you're rsyncing from is on a different disk?
[19:49] <lxo> one of the osds, on the fastest (internal) disk, shares the btrfs with the local filesystem and the mon filesystem
[19:49] <gregaf> hmm
[19:49] <gregaf> it's mildly conceivable that it's doing a sync and that's slowing down the rsync source
[19:49] <gregaf> but the problem is probably in a different area
[19:50] <lxo> it's not the rsync source that slows down, it's the creation of files, links or dirs, operations that can sometimes be used as filesystem mutexes
[19:50] <gregaf> still, if you could try rsyncing from a node that isn't hosting any OSDs that'd narrow down the cause to something internal, rather than a weird interaction between the local FS and the server
[19:51] <lxo> I can easily disable the local osd and mon, but IIRC I did that before, and it didn't help
[19:51] <gregaf> okay
[19:51] <gregaf> well, there are a number of caching and memory layers where a flush to disk can cause a "hiccup"
[19:51] <gregaf> and I've seen them occasionally, but never one lasting several seconds — more like .5 seconds
[19:52] <Tv> the duration of the hiccup tends to depend on the amount of free RAM
[19:53] <Tv> push the machine too close to full, and it becomes near-eternal thrashing
[19:53] <gregaf> my initial guess is that it's when the MDS is flushing out journaled metadata changes to the on-disk directories
[19:53] <gregaf> I can keep an eye out for it in the future, but unless you can give us some fairly isolated debugging logs for when it happens I'm afraid I can't do much for you
[19:58] <lxo> it happens very often here, even on a single node. if I eventually have to recreate the filesystem, I'll try to isolate that, but it's been so common that I figured it was just something I was going to live with
[19:59] <lxo> with 8GB of RAM and pretty much nothing else running on it, I kind of doubt it was thrashing
[19:59] <lxo> but if it's not a known issue, I'll be glad to investigate
[19:59] <gregaf> yeah, it's not thrashing
[20:00] <gregaf> with a metadata-intensive operation like rsync, you're probably running through the MDS journal more quickly than it can be flushed out to disk
[20:00] <lxo> this is the main reason why my initialization of the shared filesystem is taking *so* long
[20:00] <gregaf> actually, you could test that pretty easily by increasing the MDS journal size
[20:00] <gregaf> and seeing if the frequency of hiccups decreases
[20:00] <lxo> what's this MDS journal?
[20:01] <lxo> I don't recall having set or read anything about it before
[20:01] <lxo> except for once in which MDSes failed to restart claiming an error reading a journal, that I assumed to be the OSD journals ;-)
[20:01] <gregaf> the MDS journals all metadata operations
[20:01] <gregaf> the journal still lives on the OSD
[20:02] <lxo> aha
[20:02] <gregaf> but it lets the MDS make metadata changes safe via streaming writes rather than random writes onto the directory objects on-disk
[20:03] <lxo> right!
[20:04] <gregaf> I believe the config option is mds_log_max_segments
[20:04] <gregaf> it defaults to 30, you can try turning it up and see if that helps
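gregaf's hypothesis (metadata operations arriving faster than the journal can be flushed) can be sketched with a small queue simulation. This is only a toy model of the idea, not how the Ceph MDS actually trims its journal: the function name, the tick-based timing, and all the numbers are assumptions made up for illustration.

```python
def count_stalled_ticks(demand, capacity, flush_rate):
    """Count ticks in which the client has ops it cannot journal yet.

    demand     -- list of metadata ops the client issues in each tick
    capacity   -- maximum number of unflushed entries the journal holds
    flush_rate -- entries flushed to the directory objects per tick
    """
    journal = 0   # entries journaled but not yet flushed
    pending = 0   # ops the client is still waiting to submit
    stalled = 0
    for ops in demand:
        journal = max(0, journal - flush_rate)   # background flush
        pending += ops
        accepted = min(pending, capacity - journal)
        journal += accepted
        pending -= accepted
        if pending > 0:
            stalled += 1   # the client blocks until room frees up
    return stalled


# Hypothetical bursty workload: 40 ops, then three quiet ticks, repeated.
bursty = [40, 0, 0, 0] * 5
small_journal = count_stalled_ticks(bursty, capacity=30, flush_rate=10)
large_journal = count_stalled_ticks(bursty, capacity=60, flush_rate=10)
```

In this model a larger journal absorbs each burst, so the client stalls less often. It says nothing about how long a single flush takes once it does happen.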
[20:08] <gregaf> cmccabe: Tv: the debian gitbuilder is failing on master with
[20:08] <gregaf> "In file included from auth/cephx/../KeyRing.h:21,
[20:08] <gregaf> from auth/cephx/CephxAuthorizeHandler.cc:2:
[20:08] <gregaf> ./auth/Auth.h:22: fatal error: common/entity_name.h: No such file or directory
[20:08] <gregaf> compilation terminated.
[20:08] <gregaf> "
[20:08] <Tv> huh, never seen that
[20:08] <gregaf> do we need to specifically add files to the debian package or something?
[20:08] <gregaf> it's a new file from Monday
[20:09] <cmccabe> gregaf: seems to have built successfully here: http://ceph.newdream.net/gitbuilder/log.cgi?log=1f120284ed80ee1258b556fbedacab209098a0d1
[20:09] <gregaf> yeah, that's why I'm asking about the packaging :)
[20:09] <cmccabe> but ah, it's a noinst_HEADERS thing
[20:13] <cmccabe> this is something that I've had to do several times, for my own and other people's new header files
[20:14] <gregaf> not sure what you mean?
[20:14] <cmccabe> gregaf: see commits b60444b5c1c8f4c, 778902b4033f712, f9694648fc74a124, dad494c563d75b551
[20:14] <Tv> cmccabe: the commit messages don't answer "why" :(
[20:15] <gregaf> what's noinst_HEADERS?
[20:15] <cmccabe> gregaf: automake is too dumb to figure out what it needs to build .cc files
[20:15] <cmccabe> gregaf: therefore, you have to tell it all about your header files
[20:15] <Tv> sounds like a "if you added files, run make distcheck before pushing" thing?
[20:16] <cmccabe> gregaf: if you fail to do this, builds will work correctly when you make an "in-source" build (i.e. a build where srcdir==builddir), but not a so-called VPATH build
[20:16] <gregaf> ah
[20:16] <cmccabe> gregaf: make distcheck will find this
[20:16] <cmccabe> gregaf: since the debian builder does a clean build from some kind of special chroot it hits this
[20:17] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:17] <Tv> debian build systems often use VPATH to be able to build debug/non-debug etc variants of the package nicely
[20:17] <cmccabe> VPATH builds are just better... spraying object files all over your source directory is so 1975
[20:18] <Tv> sadly make is a pain to get right for all the little details
[20:18] <cmccabe> I wish we used VPATH builds more, but a lot of the scripts like vstart.sh make that hard
[20:18] <cmccabe> it works great with cmake...
[20:18] <cmccabe> and in fairness, if you do everything right, automake does it fine too
[20:18] <cmccabe> it's just not the default in automake
[20:18] <Tv> it's more when you need to plug in that one shell script as part of the build process, then things fall apart
[20:19] <cmccabe> well, actually, automake can handle that too
[20:20] <cmccabe> srcdir is always available with the $srcdir prefix
[20:20] <cmccabe> builddir is always the cwd
[20:21] <cmccabe> but yeah, some of those scripts just aren't ready to have their paradigm shifted
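The noinst_HEADERS problem discussed above comes down to a header that exists in the tree but is not listed in Makefile.am, so it is left out of the tarball that make dist produces. A rough pre-push check could look like the sketch below. It is a hypothetical helper, not a replacement for make distcheck: the function names are invented, the parser only understands plain "noinst_HEADERS =" and "+=" assignments with backslash continuations, and the sample file names other than common/entity_name.h are made up.

```python
import re


def listed_headers(makefile_am_text):
    """Collect the file names assigned to noinst_HEADERS."""
    # Join continuation lines so each assignment sits on one line.
    joined = makefile_am_text.replace("\\\n", " ")
    headers = set()
    for line in joined.splitlines():
        match = re.match(r"\s*noinst_HEADERS\s*\+?=\s*(.*)", line)
        if match:
            headers.update(match.group(1).split())
    return headers


def unlisted_headers(makefile_am_text, headers_in_tree):
    """Return headers present in the tree but missing from noinst_HEADERS."""
    return sorted(set(headers_in_tree) - listed_headers(makefile_am_text))


# Hypothetical Makefile.am fragment and source tree listing.
SAMPLE_MAKEFILE_AM = (
    "noinst_HEADERS = \\\n"
    "\tauth/Auth.h \\\n"
    "\tauth/KeyRing.h\n"
)
SAMPLE_TREE = ["auth/Auth.h", "auth/KeyRing.h", "common/entity_name.h"]

missing = unlisted_headers(SAMPLE_MAKEFILE_AM, SAMPLE_TREE)
```

Here the helper flags common/entity_name.h, the header named in the gitbuilder error above. A real tree would feed it the actual Makefile.am contents and a list of headers gathered from the source directory.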
[20:50] <lxo> erhm... I'm not sure how to tell whether the setting took effect. it may have sped things up overall (not sure yet), but the hiccups seem to be much longer
[20:51] <lxo> anyhow... thanks for the tips, this gives me something to play with for some time ;-)
[20:52] <lxo> another thing I've been wondering about is the mds standby replay setting. I can't tell whether the nodes set up as such are actually getting or replaying logs; it appears that they recover from scratch
[20:53] <lxo> is this setting not yet supported on 0.24.3, or does a replay mds get reported as standby, and is mostly silent in terms of logging?
[20:54] <gregaf> lxo: hmm, there may be some subtleties to the log trimming
[20:54] <gregaf> Sage is at Cloud Connect today but it's on my queue to discuss with him tomorrow :)
[20:54] <gregaf> setting up the standby-replay stuff is a little obtuse, I'm afraid
[20:55] <gregaf> but I believe v0.25 is the first regular release it's available in
[20:58] <lxo> ah, ok, so that's why. the option is silently accepted and has no effect on 0.24.3, then, eh? I guess I'll soon switch to 0.25 ;-)
[20:58] <lxo> thanks again!
[21:01] <gregaf> I did all that work a while ago, I don't remember exactly what happens in which version — just looking at the release notes for .25 it says the hot-standby behavior (which is what standby-replay does) is new for that release
[21:01] <gregaf> welcome!
[21:05] <lxo> I really need to start looking at the code ;-) it does me (and ceph) little good if I take it as a blackbox ;-)
[21:06] <lxo> I have some fault-tolerance and distributed systems background, but my wife is the expert. she teaches distributed systems and operating systems at the uni, and I'm trying to get her to get students to work on ceph, besides hadoop which she has been using lately
[21:07] <lxo> so if you guys have (or could set up) a web page with projects that undergrads (or even grads) could undertake, individually or as a group, for semester-long classes, that would greatly increase the odds that you get this kind of voluntary contribution ;-)
[22:14] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:11] * f4m8 (~f4m8@lug-owl.de) Quit (Server closed connection)
[23:11] * f4m8 (~f4m8@lug-owl.de) has joined #ceph
[23:32] <bchrisman> curious… how does rbd keep the system buffer cache from being inconsistent on two different clients? Or does it bypass caching?
[23:34] <Tv> bchrisman: as far as i understand, if there are simultaneous writers for the same object, they get put into synchronous mode.. slower but consistent
[23:34] <Tv> as soon as one promises to not write, the other one gets to buffer writes again
[23:36] <bchrisman> ahh ok
[23:39] <bchrisman> librados applications would be responsible for managing their own caching I'd expect?
[23:45] <gregaf> Tv: bchrisman: no, librados and librbd do not handle synchronization, you need to do that yourself
[23:46] <gregaf> including any issues with caching layers
[23:47] <Tv> gregaf: so what layer provides that when it's posix files?
[23:47] <gregaf> the ceph layers handle that
[23:47] <gregaf> libceph, uclient, kclient
[23:47] <gregaf> it's all done via capabilities
[23:48] <Tv> ahh i thought it relied on rados and just gave the right objects to talk to
[23:48] <yehuda_wk> gregaf: bchrisman: tv: librbd, librados don't have any caching anyway.. the issue is with the kernel and qemu rbd implementation
[23:48] <gregaf> when there are multiple writers they get the read and write capabilities, but not the cache or buffer capabilities
[23:48] <Tv> so switching between the different modes readers-only/some-writers etc involves mds?
[23:48] <gregaf> yeah
[23:48] <Tv> ok
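The capability behaviour gregaf describes for the POSIX layers can be sketched as a small decision function. Only the multiple-writers case (read and write granted, cache and buffer withheld) and the single-writer buffering case come from the discussion above. The readers-only case, the function name, and the capability labels are assumptions for illustration, and the real MDS logic is more involved.

```python
def granted_caps(readers, writers):
    """Toy model: capabilities granted per role for one file.

    readers -- number of clients with the file open read-only
    writers -- number of clients with the file open for writing
    """
    # Caching is only safe when no other client can observe or make changes.
    shared_with_writer = writers >= 1 and readers + writers > 1
    grants = {}
    if readers:
        # Assumption: readers may cache as long as nobody is writing.
        grants["reader"] = {"read"} if shared_with_writer else {"read", "cache"}
    if writers:
        if shared_with_writer:
            # Synchronous mode: every read and write goes to the OSDs.
            grants["writer"] = {"read", "write"}
        else:
            grants["writer"] = {"read", "write", "cache", "buffer"}
    return grants


two_writers = granted_caps(readers=0, writers=2)
lone_writer = granted_caps(readers=0, writers=1)
readers_only = granted_caps(readers=3, writers=0)
```

Switching between these cases is the step that, per the exchange above, involves the MDS.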
[23:49] <Tv> but yeah sharing writable block devices over the network has historically been problematic
[23:49] <Tv> if you try to "solve" it, you just make it slower
[23:49] <Tv> ocfs2 and friends burned a lot of effort in making the locking for who gets to write where efficient
[23:50] <Tv> for the qemu etc case, the thing using the block device will probably have its own buffering, and not understand how the disk contents can change underneath it
[23:51] <Tv> shared-write block device usage is very rare, outside of special cases like ocfs2 clusters
[23:51] <bchrisman> gregaf: thanks.. that makes sense.
[23:52] <gregaf> yeah
[23:52] <gregaf> librbd actually does have very limited synchronization to make the snapshots work
[23:52] <gregaf> we set that up so that you could eg take a snapshot and then make backups of the snapshot from a different mount while the VM continues doing work
[23:53] <yehuda_wk> from a different mount <= from a different client
[23:53] <gregaf> Tv: I believe that synchronized shared-writers should work okay, since there's no caching in the librbd layer
[23:54] <Tv> gregaf: i'm saying regardless of what rbd does, they'll need to coordinate carefully, and things that can do that are very rare
[23:54] <Tv> i don't think anyone is planning on running ocfs2 on top of rbd ;)
[23:54] <gregaf> heh
[23:54] <gregaf> I did run across a blog post about running lustre on rbd :)
[23:55] <yehuda_wk> http://www.tinkergeek.com/?p=155
[23:56] <Tv> "As an exercise in excitement" ;)
[23:57] <Tv> also, i'm always amused when people complete something and yell out about violins..
[23:59] <cmccabe> hey, violas can be pretty exciting

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.