#ceph IRC Log


IRC Log for 2011-04-19

Timestamps are in GMT/BST.

[0:15] <cmccabe1> so I'm having trouble understanding what our plan is for things like g_lockdep and buffer_track_alloc
[0:16] <cmccabe1> these things are part of common code, and will be used by libraries, but they're hardwired with the assumption that there's a single configuration, not multiple
[0:17] <cmccabe1> since these things seem like they're just for debugging, maybe it would be best to take them out of g_conf completely and just have it controlled by an environment variable that is read by a global constructor.
[0:18] <cmccabe1> there is only one environment for a given process-- every UNIX programmer understands that-- so it resolves the confusion.
[0:21] <gregaf> yeah, I'm pretty sure there's not a plan yet for stuff like that
[0:22] <cmccabe1> just calling getenv seems like a pretty easy and good plan
[0:22] <sagewk> yeah, that makes sense to me.
[0:22] <cmccabe1> k
[0:22] <sagewk> you mean take them out of md_config_t? we can still specify them in the .conf?
[0:23] <sagewk> or _only_ settable via the environment?
[0:23] <cmccabe1> well, I was thinking of having them specified in the environment
[0:23] <cmccabe1> like MALLOC_CHECK_
[0:23] <cmccabe1> or the tcmalloc tweaking stuff
[0:23] <cmccabe1> I mean, they are basically things developers use, and regular users won't care about
[0:24] <sagewk> yeah, that makes sense. these are really only for developer use.
[0:24] <sagewk> yeah
[0:24] <Tv> cmccabe1: CEPH_* please
[0:24] <cmccabe1> it also makes them usable by programs that don't read ceph.conf
[0:24] <cmccabe1> and there are a few utilities that legitimately don't need to read that file
[0:25] <sagewk> yep
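
The approach agreed on above can be sketched roughly as follows: a debug switch is read once from the process environment by a global constructor instead of living in g_conf. This is only an illustration, not the code that was committed. The variable names CEPH_LOCKDEP and CEPH_BUFFER_TRACK_ALLOC (using the CEPH_* prefix Tv asked for) and the helpers env_to_bool, debug_env_t and lockdep_enabled are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: interpret an environment variable as a boolean.
// Unset, empty, "0" and "false" mean off; any other value means on.
bool env_to_bool(const char *name)
{
  assert(name);
  const char *val = std::getenv(name);
  if (!val || !*val)
    return false;
  return std::strcmp(val, "0") != 0 && std::strcmp(val, "false") != 0;
}

// Hypothetical process-wide debug switches. There is exactly one
// environment per process, so there is exactly one copy of these,
// no matter how many library configurations exist.
struct debug_env_t {
  bool lockdep;
  bool buffer_track_alloc;

  debug_env_t()
    : lockdep(env_to_bool("CEPH_LOCKDEP")),                       // assumed name
      buffer_track_alloc(env_to_bool("CEPH_BUFFER_TRACK_ALLOC"))  // assumed name
  {}
};

// The global constructor runs during static initialization, before main(),
// so the environment is sampled once and never consulted again.
static const debug_env_t g_debug_env;

bool lockdep_enabled() { return g_debug_env.lockdep; }
bool buffer_track_alloc_enabled() { return g_debug_env.buffer_track_alloc; }
```

With something like this, a developer would run e.g. `CEPH_LOCKDEP=1 ./cosd ...`, and utilities that never read ceph.conf would pick up the setting as well. The trade-off, raised later in the log, is that the value cannot be changed while the process is running.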
[0:36] * Juul (~Juul@slim.dhcp.lbl.gov) has joined #ceph
[2:10] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:27] * djlee (~dlee064@des152.esc.auckland.ac.nz) has joined #ceph
[2:38] <djlee> colin: I don't see the update that you've put as a 'tip' in cluster_configuration?
[2:39] <cmccabe1> djlee: it reads like this. "Some tips: Ceph produces a lot of logs currently. Make sure your log partition is on a fast disk."
[2:39] <cmccabe1> djlee: under http://ceph.newdream.net/wiki/Cluster_configuration
[2:40] <djlee> argh;
[2:40] <djlee> what about all the noatime stuffs?
[2:40] <cmccabe1> djlee: atime is important for reading (like in the OSD store directory) but not for writing (like logfiles)
[2:40] <cmccabe1> djlee: I guess I could add a comment saying to enable noatime on the OSD store
[2:41] <djlee> noatime -> speeds up the read, correct?
[2:41] <cmccabe1> djlee: basically
[2:41] <cmccabe1> djlee: with atime, every read also involves a write
[2:42] <djlee> right, but would it then be sort of breaking the standard e.g., posix, and sort of cheats
[2:43] <djlee> or at least if it is 'normal' or 'abnormal' when benchmarking? i don't know
[2:44] <cmccabe1> djlee: I don't know if noatime is in POSIX or not
[2:44] <cmccabe1> djlee: Ceph doesn't use atime anywhere, though, so why pay for things you don't use?
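
The point that "with atime, every read also involves a write" can be illustrated at the level of a single file. The discussion above is about the noatime mount option, which is set by the administrator; the sketch below instead uses the Linux-specific O_NOATIME open flag, which requests the same behaviour for one file descriptor. It is an assumption-laden illustration, not something Ceph is stated to do, and read_file_noatime is a hypothetical helper name.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Read a whole file while asking the kernel not to update its access time.
// O_NOATIME is Linux-specific and is only permitted for the file's owner
// (or a privileged process), so fall back to a normal open on EPERM.
bool read_file_noatime(const std::string &path, std::string *out)
{
  assert(out);
  int fd = ::open(path.c_str(), O_RDONLY | O_NOATIME);
  if (fd < 0 && errno == EPERM)
    fd = ::open(path.c_str(), O_RDONLY);
  if (fd < 0)
    return false;

  out->clear();
  char buf[4096];
  ssize_t n;
  while ((n = ::read(fd, buf, sizeof(buf))) > 0)
    out->append(buf, static_cast<size_t>(n));
  ::close(fd);
  return n == 0;  // true only if we reached end-of-file without an error
}
```

Mounting the OSD store with noatime has the same effect for every file on the partition without any change to the program, which is why the conversation treats it as a setup question rather than a code change.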
[2:46] <djlee> cmccabe1: right, but I'd thought Zenon was concerned about it
[2:46] <djlee> he said ext4 partition with atime enabled?
[2:47] <cmccabe1> djlee: he's concerned about *not* using noatime
[2:47] <cmccabe1> djlee: not about whether atime is in POSIX :)
[2:48] <djlee> i see, thought he initially had problem with low performance;
[2:48] <cmccabe1> djlee: he did, and still does. noatime helped him somewhat, that was the point
[2:48] <djlee> but ceph doesn't use noatime as you said, so it must just be the logging stuffs..?
[2:49] <cmccabe1> djlee: we don't control the options the partition was mounted with
[2:49] <cmccabe1> djlee: you can mount the partition with -o sync if you want. Everything will just go really slow
[2:52] <djlee> from my understanding till now noatime does improve, but not a default or normal condition i guess.
[2:56] <djlee> e.g., i don't believe anyone would base the benchmark purely on `noatime', because this is just not a default option;, or not as typical
[2:56] <djlee> and the same goes for -o sync, normally it is not enabled, so leave it :)
[2:57] <cmccabe1> djlee: relatime is the default on linux now
[2:57] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:57] <cmccabe1> djlee: anyway, I'd rather benchmark a properly set up system, not one that has all defaults
[2:57] <cmccabe1> djlee: it seems pretty silly to deliberately handicap yourself
[2:59] <djlee> sure, but I think it is really difficult to do that
[3:00] <cmccabe1> djlee: night!
[3:00] * cmccabe1 (~cmccabe@ has left #ceph
[3:00] <djlee> night!
[3:07] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Operation timed out)
[3:54] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[4:00] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[5:23] * Juul (~Juul@slim.dhcp.lbl.gov) Quit (Quit: Leaving)
[5:55] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) Quit (Quit: Leaving)
[5:59] * greglap (~Adium@ has joined #ceph
[6:01] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[6:02] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) has joined #ceph
[6:56] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[7:07] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[7:53] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[7:58] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:03] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[8:38] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[8:42] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:11] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[9:13] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[10:17] * Yoric (~David@87-231-38-145.rev.numericable.fr) has joined #ceph
[10:51] * allsystemsarego (~allsystem@ has joined #ceph
[14:59] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:25] * chraible (~chraible@blackhole.science-computing.de) Quit (Ping timeout: 480 seconds)
[16:23] * gregorg (~Greg@ Quit (Quit: Quitte)
[16:23] * gregorg (~Greg@ has joined #ceph
[17:12] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[17:38] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:51] * greglap (~Adium@ has joined #ceph
[17:54] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[18:17] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:22] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[18:49] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[18:57] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:58] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:03] * joshd1 (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:03] * neurodrone (~neurodron@dhcp212-010.wireless.buffalo.edu) has joined #ceph
[19:06] * Yoric (~David@87-231-38-145.rev.numericable.fr) has left #ceph
[19:13] <Tv> sagewk: about commit 05c281, an alternate fix would be to make the debian packaging for lenny pass in --without-tcmalloc
[19:15] <sagewk> tv: yeah, that'd be better
[19:16] <sagewk> i just didn't know the funky syntax to do that
[19:16] <sagewk> and then forgot to ask you the next day :)
[19:20] <Tv> sagewk: i see your release.sh is already sedding it out of debian/control.. i could use that to trigger it
[19:21] <sagewk> oh! yeah i knew i did something fugly with it before, but couldn't remember what
[19:25] <Tv> make makes it ugly :(
[19:26] <Tv> oh and it didn't handle non-i386 non-amd64 correctly before
[19:26] <Tv> hrmph
[19:47] * neurodrone_ (~neurodron@dhcp212-010.wireless.buffalo.edu) has joined #ceph
[19:47] * neurodrone (~neurodron@dhcp212-010.wireless.buffalo.edu) Quit (Read error: Connection reset by peer)
[19:47] * neurodrone_ is now known as neurodrone
[19:49] * neurodrone_ (~neurodron@dhcp212-010.wireless.buffalo.edu) has joined #ceph
[19:49] * neurodrone (~neurodron@dhcp212-010.wireless.buffalo.edu) Quit (Read error: Connection reset by peer)
[19:49] * neurodrone_ is now known as neurodrone
[19:50] * neurodrone_ (~neurodron@dhcp212-010.wireless.buffalo.edu) has joined #ceph
[19:50] * neurodrone (~neurodron@dhcp212-010.wireless.buffalo.edu) Quit (Read error: Connection reset by peer)
[19:50] * neurodrone_ is now known as neurodrone
[19:51] * neurodrone (~neurodron@dhcp212-010.wireless.buffalo.edu) Quit ()
[20:09] <gregaf> cmccabe: I'm a bit concerned about switching all these config options to environment variables
[20:09] <gregaf> it adds a lot of admin complexity, doesn't it?
[20:11] <cmccabe> gregaf: these are debug settings for programmers
[20:11] <gregaf> okay, so it adds a lot of admin complexity for programmers
[20:12] <gregaf> and admins might want to profile stuff too
[20:12] <cmccabe> gregaf: not really... just add a few "export FOO=bar " lines to your init.ceph
[20:12] <gregaf> so that they can, you know, profile how the program reacts to their workloads
[20:13] <gregaf> yes, but previously we could change those variables while the program was running and you removed that capability
[20:13] <gregaf> I used that functionality on a couple occasions
[20:13] <cmccabe> gregaf: does TCMALLOC actually honor changes to its environment variables
[20:13] <gregaf> yes
[20:13] <cmccabe> gregaf: so the confusing and bad thing is that setenv is not thread-safe
[20:14] <gregaf> so?
[20:14] <cmccabe> gregaf: since this is all a developer thing, I guess we can gloss over that and provide some kind of injectenv
[20:14] <gregaf> is there something wrong with leaving in the working solution we already had?
[20:15] <Tv> gregaf: what do you do when i instantiate two librados instances in one process?
[20:15] <cmccabe> gregaf: it doesn't work for libraries, because those settings are inherently global to the process
[20:16] <gregaf> and the libraries don't use any of those settings
[20:16] <cmccabe> gregaf: anyway, it's just silly to wrap all of tcmalloc's environment variables in config options, as if they were ceph-related
[20:16] * lxo (~aoliva@ has joined #ceph
[20:16] <gregaf> well it may be silly but it worked and exposed more functionality than what you've got there now
[20:17] <gregaf> you realize that the tcmalloc stuff is only used on the MDS and the OSD, right?
[20:17] <gregaf> it's explicitly disabled on everything else
[20:17] <cmccabe> gregaf: once we have injectenv, we'll have more functionality than before.
[20:17] <gregaf> there are no concerns about the libraries using these options
[20:18] <cmccabe> gregaf: I don't like having weird and broken stuff in libcommon
[20:18] <gregaf> and it was broken how?
[20:18] <cmccabe> (11:15:54 AM) cmccabe: gregaf: it doesn't work for libraries, because those settings are inherently global to the process
[20:18] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:19] <gregaf> the libraries ignore those options
[20:19] <gregaf> they don't apply
[20:19] <cmccabe> gregaf: and what about mutex debugging, and bufferlist debugging?
[20:19] <gregaf> the common init code sets a few bits and then the library never looks at it ever again
[20:19] <gregaf> I didn't say that all the changes were bad
[20:19] <gregaf> just these ones
[20:20] <gregaf> the tcmalloc ones
[20:20] <gregaf> that I was talking to you about
[20:20] <gregaf> they might be bad too, but I don't know and I'll let Sage deal with those
[20:20] <Tv> sagewk: take a look at branch deb-tcmalloc-fixes; warning untested!
[20:20] <gregaf> I have an interest in the tcmalloc stuff and you ripped it out to no purpose and so I'm asking you to please restore the functionality
[20:21] <cmccabe> gregaf: it's not "ripped out"; it's available through setting environment variables
[20:21] <gregaf> and if you restore it by doing something with setenv please make sure to update the documentation about profiling
[20:21] <gregaf> not if you want to adjust anything on-the-fly...
[20:21] <Tv> i'm surprised the env var change worked on the fly
[20:21] <Tv> surely the damn thing can't be looking at env vars all the time, that's slow
[20:21] <cmccabe> tv: yeah, it's bizarre.
[20:22] <cmccabe> tv: and racy. setenv/getenv are not thread-safe :(
[20:22] <gregaf> no idea, but I'm pretty sure I set it up and ran through it
[20:22] <cmccabe> gregaf: ok. I promise to get it working "on-the-fly" again today.
[20:22] <gregaf> also you can do things like turn it on and off and it certainly worked to change the env vars between those runs
[20:22] <cmccabe> gregaf: ok?
[20:23] <gregaf> please make sure to update http://ceph.newdream.net/wiki/Memory_Profiling too
[20:23] <gregaf> thanks
[20:24] <cmccabe> tv: it looks like tcmalloc uses SetNumericProperty/GetNumericProperty. I would guess that it only reads the environment once (?)
[20:25] <Tv> what env var is this?
[20:25] <Tv> i can't see anything in tcmalloc source that would read env all the time
[20:25] <Tv> DEFINE_int64(heap_profile_allocation_interval,
[20:25] <Tv> EnvToInt64("HEAP_PROFILE_ALLOCATION_INTERVAL", 1 << 30 /*1GB*/),
[20:25] <Tv> ...
[20:26] <Tv> that looks like define time one-shot
[20:26] <gregaf> it might not have read it dynamically
[20:26] <gregaf> but it should have read them on each start of the profiler
[20:26] <gregaf> and you can turn it on and off
[20:26] <cmccabe> tv: no, that's what I'm saying. There are functions to change it at runtime, but zee env vars, zey do nothing
[20:27] <gregaf> Tv: or am I misremembering?
[20:27] <Tv> gregaf: how do you turn it on/off?
[20:27] <gregaf> there were functions to start and stop it
[20:27] <gregaf> I think ProfilerStart() and ProfilerStop()
[20:27] <cmccabe> tv: HeapProfilerStart, HeapProfilerStop
[20:28] <Tv> well ceph_heap_profiler_start could setenv() before the call, if that's really what you wanted
[20:29] <Tv> but the point is, if the settings are in a "library config", *which instance do you obey*?
[20:29] <Tv> perhaps the heap profiler settings need to be (optional) part of the "turn it on" command
[20:29] <gregaf> and again, it's never invoked in the libraries...
[20:29] <Tv> gregaf: that doesn't change the code structure
[20:30] <Tv> gregaf: the config structure is no longer inherently "one per daemon"
[20:30] <Tv> that's the whole point of a lot of work Colin's been doing, as far as I understand it
[20:30] <cmccabe> tv: exactly
[20:30] <Tv> we could have a "global config" that contains just this stuff, that's separate from "ceph.conf read into ram"
[20:30] <Tv> but that's just almost the same as the env vars
[20:31] <cmccabe> tv: UNIX already has a global config called the "environment" :)
[20:31] <Tv> Tv: perhaps the heap profiler settings need to be (optional) part of the "turn it on" command
[20:31] <gregaf> there can be more than one config in a program
[20:31] <Tv> that's the best recommendation i can make right now
[20:31] <gregaf> but the programs that are going to dereference the profiler config values are only going to have one config...
[20:32] <gregaf> I mean if we need to change how stuff's set up that's fine
[20:32] <cmccabe> gregaf: that just makes things even more confusing for users trying to figure out what things do
[20:32] <gregaf> I just don't like it when people rip out functionality because they don't care about it themselves and don't provide substitute ways to handle it
[20:33] <gregaf> well that's nice but if you're trying to make config.cc user-friendly you'd better start by hiding everything that they're not supposed to set on their own
[20:33] <gregaf> which is…most of the config values
[20:34] <cmccabe> gregaf: it's about logical consistency and discoverability
[20:34] <gregaf> then write your damn setenv and it won't be in the config
[20:35] <gregaf> it's already not in the config because you ripped it out without bothering to consider whether you'd broken anything
[20:35] <gregaf> so I really don't know what you're arguing with me for except that you like to tell everybody else that their use cases are stupid
[20:35] <cmccabe> gregaf: um
[20:35] <cmccabe> gregaf: I never said your use case was stupid
[20:36] <cmccabe> gregaf: didn't kick your dog even once
[20:36] <gregaf> then why are we still arguing about it?
[20:36] <cmccabe> gregaf: hardly ever peed on your lawn
[20:36] <Tv> trying to find a clean way to support it
[20:36] <cmccabe> gregaf: I have to go implement this now. Hopefully after lunch you can tell me what you think
[20:37] <sagewk> calm down everyone, it's not a big deal. :)
[20:50] * iggy gets popcorn
[20:54] <gregaf> I think the show ended
[20:54] <gregaf> you can probably fine something in lkml or fs-devel though if you've still got popcorn ;)
[20:54] <gregaf> *find
[20:55] <cmccabe> for sure
[21:03] <iggy> there's always some drama happening on lkml
[21:17] * tjfontaine (tjfontaine@tjfontaine.chair.oftc.net) has joined #ceph
[21:20] <Tv> cmccabe: can you clarify a logging question for me? i have log file = results/log/$name.log, no log per instance anywhere in log, yet that dir also gets an extra client.4109 file, which looks like what log per instance would do, except it's supposed to default to false.. what else would cause a file like that?
[21:20] <Tv> cmccabe: (i need to avoid all extra symlink copies because they make autotest store duplicate copies of the log content, that takes up too much disk space)
[21:20] <cmccabe> tv: random guess: pid file?
[21:21] <Tv> cmccabe: its content is
[21:21] <Tv> 2011-04-19 11:58:53.250624 7ff8206f3720 ceph version 0.26-309-g54284c0.commit: 54284c0aefac863e486fd4c46d945f2f077af379. process: cfuse. pid: 1562
[21:21] <Tv> ooh its name is .4109 and it says its pid is 1562? funky
[21:21] <Tv> but still, that looks like a log not a pid file
[21:21] <cmccabe> tv: yeah
[21:21] <Tv> good to know it comes from cfuse though
[21:22] <Tv> well, so does the client.0.log i really wanted
[21:22] <Tv> cmccabe: i traced all the getpid() calls and they all seem to be inside if conf->log_per_instance
[21:23] <cmccabe> tv: I'm assuming this is created every time?
[21:24] <Tv> cmccabe: seems so
[21:24] <cmccabe> in /var/log/ceph?
[21:24] <Tv> in results/log/
[21:24] <Tv> and the only thing ever referring to that path is log file = results/log/$name.log
[21:25] <Tv> which confuses me quite a lot :-/
[21:25] <cmccabe> tv: are you sure there's not another config file cfuse might be reading
[21:25] <cmccabe> tv: certainly the default log dir doesn't include "results"
[21:25] <Tv> cmccabe: this is on autotest workers, they're not supposed to have e.g. /etc/ceph ever
[21:26] <Tv> just to avoid that kind of trouble
[21:26] <cmccabe> tv: well, there's a search path for ceph.conf
[21:26] <cmccabe> tv: one directory it will check is the current working directory
[21:26] <Tv> /etc has no ceph*, ~root has no ceph*
[21:26] <Tv> the ceph.conf in cwd is the right one
[21:27] <cmccabe> I could be misremembering, but I think cfuse plays games with cwd
[21:27] <Tv> cmccabe: if it chdirred and then read ceph.conf, it wouldn't find its way back to *this* results/log dir
[21:28] <Tv> cmccabe: cwd is in a random dir
[21:28] <cmccabe> tv: are you getting any logs from cfuse, or just that file?
[21:28] <Tv> i get client.0.log just fine
[21:28] <cmccabe> which has logs from cfuse?
[21:28] <Tv> with the same content
[21:29] <Tv> it seems it doesn't log more than that one line, in this test
[21:30] <cmccabe> for debugging, try giving an absolute path to a config file with -c
[21:35] <Tv> huh the 4109 is not the pid.. and only client.0 gets that symlink.. what's going on?
[21:35] <cmccabe> can I log in and check it?
[21:36] <Tv> root@sepia17:/usr/local/autotest/tmp/tmpnuISRV_ceph_dbench.cluster0/results/log
[21:36] <Tv> lsof says cfuse has the client.0.log file open for writing, not the symlink
[21:37] <Tv> hmmm "fusetrace"
[21:37] <Tv> dout_create_rank_symlink(client->get_nodeid().v);
[21:37] <Tv> ok that looks suspect
[21:38] <cmccabe> I don't seem to have access to this machine
[21:38] <Tv> cmccabe: im me your ssh key
[21:38] <cmccabe> root password has been changed?
[21:38] <Tv> ssh in as ubuntu
[21:40] <Tv> so what's "rank"?
[21:40] <cmccabe> that's tied in with something the mds does
[21:41] <cmccabe> oh, weird, cfuse uses that too?
[21:41] <Tv> yeah
[21:41] <Tv> that seems to be the culprit
[21:42] <Tv> is it just so greg&sage know what logfile to read?
[21:42] <cmccabe> tv: for the mds at least, the idea was you'd have these nice symlinks organized by 'mds rank'
[21:43] <cmccabe> tv: I'm not sure why cfuse is doing it...
[21:43] * tjfontaine (tjfontaine@tjfontaine.chair.oftc.net) has left #ceph
[21:43] <Tv> cmccabe: well, my options are now either removing the symlinks after the fact (& hoping autotest doesn't start rsyncs "early", as i *think* it does, so this option seems doomed), or disable them
[21:44] <cmccabe> tv: we should try to figure out who originally added it, and why
[21:44] <Tv> (because it seems i can't tell autotest to preserve symlinks, as it doesn't strictly require rsync, etc etc)
[21:45] <Tv> 79991ed4981d0f0ff24d745089987cc9f694a6a0
[21:46] <Tv> though perhaps the old - _dout_create_courtesy_output_symlink("client", client->get_nodeid().v); did that
[21:46] <Tv> yeah it seems so, tracing..
[21:46] <cmccabe> tv: heh. courtesy output
[21:46] <Tv> paging developers to courtesy white phone
[21:47] <Tv> 1e4215e94261721a6ee0cc46bb15c3bb3a33ec3c
[21:47] <Tv> "write output to default/* by default"
[21:47] <cmccabe> tv: can git-grep search the past?
[21:47] <Tv> heh
[21:48] <Tv> cmccabe: what i did was git blame src/cfuse.cc, pick the commit that introduced it, git show the commit, if it just moved the line or changed it, loop back to git blame COMM~1 src/cfuse.cc
[21:48] <Tv> 1e42 is the root
[21:49] <Tv> cmccabe: but to answer the direct question, yes, just give it a treeish as first arg
[21:49] <cmccabe> tv: hmm, is COMM like HEAD?
[21:49] <Tv> bbl lunch
[21:49] <Tv> cmccabe: oh replace with the commit
[21:49] <cmccabe> tv: k
[21:49] <Tv> should have said SHA
[21:49] <cmccabe> tv: yeah I have no idea why cfuse does that. I suspect if anyone remembers it would be sage
[21:50] <cmccabe> tv: probably if you ask, he'll just say remove it
[21:50] <cmccabe> tv: but best to make sure nobody is using it of course
[22:05] <cmccabe> gregaf: well, some bad news here.
[22:06] <cmccabe> gregaf: as best as I can tell by looking at the tcmalloc sources, tcmalloc only reads HEAP_PROFILE_ALLOCATION_INTERVAL and HEAP_PROFILE_INUSE_INTERVAL once-- in global constructors when the process starts.
[22:06] <cmccabe> gregaf: there isn't any mechanism to change them at runtime.
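
The "read once in global constructors" behaviour described above can be reproduced in a few lines. This is a stand-alone sketch, not tcmalloc code: env_to_int64 is a hypothetical stand-in for the EnvToInt64 usage quoted earlier in the log, and DEMO_ALLOCATION_INTERVAL is a made-up variable name chosen so the example does not touch any real tcmalloc setting.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for an "environment variable to integer" helper:
// parse the variable as a base-10 integer, or return dflt if it is unset.
long long env_to_int64(const char *name, long long dflt)
{
  assert(name);
  const char *val = std::getenv(name);
  if (!val || !*val)
    return dflt;
  return std::strtoll(val, nullptr, 10);
}

// Evaluated exactly once, during static initialization before main().
// Later changes to the environment cannot affect this value.
static const long long g_interval =
    env_to_int64("DEMO_ALLOCATION_INTERVAL", 1LL << 30 /* 1GB */);

// What the library would actually use at runtime.
long long cached_interval() { return g_interval; }

// What you would get if the environment were re-read on every call.
long long fresh_interval()
{
  return env_to_int64("DEMO_ALLOCATION_INTERVAL", 1LL << 30);
}
```

Calling setenv("DEMO_ALLOCATION_INTERVAL", ...) after startup changes what fresh_interval() returns but not what cached_interval() returns. That is the situation described for the two HEAP_PROFILE_* variables: setting them before starting the profiler has no effect unless they were already set when the process started.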
[22:20] <gregaf> cmccabe: hmm, that's odd as I thought I remembered it working when I set them and then started the profiler
[22:20] <gregaf> it's when the process starts and not when it activates profiling?
[22:20] <cmccabe> gregaf: yes
[22:20] <gregaf> *sigh*
[22:20] <gregaf> guess it doesn't matter then
[22:20] <cmccabe> gregaf: I pinned my hopes on SetNumericProperty/GetNumericProperty
[22:21] <cmccabe> gregaf: but those functions are nearly useless
[22:21] <gregaf> no, if it doesn't ever check them again it's not worth fussing over too much, I just thought it did
[22:21] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:21] <cmccabe> gregaf: actually SetNumericProperty does let you change tcmalloc.max_total_thread_cache_bytes
[22:21] <gregaf> Tv: did you figure out that thing with cfuse and the symlink?
[22:21] <cmccabe> gregaf: but that's it
[22:22] <gregaf> yeah, that's deep twiddling people won't be adjusting dynamically
[22:22] <Tv> gregaf: yeah the "rank" thing
[22:22] <gregaf> or at least I can't imagine people doing so, we can deal with it if it becomes interesting
[22:22] <cmccabe> gregaf: yeah, it seems unlikely.
[22:22] <gregaf> Tv: the "rank" term is old and refers to a concept we no longer have
[22:23] <Tv> gregaf: well, some sort of internal id that gets assigned at runtime, not at config time
[22:23] <Tv> gregaf: mds and cfuse are the only ones that provide those special symlinks, it seems
[22:23] <gregaf> yeah
[22:23] <Tv> gregaf: do you use them?
[22:23] <gregaf> not I
[22:23] <Tv> sagewk: do you use the cfuse/mds "rank" symlinks?
[22:24] <Tv> at the minimum, i need to make them configurable, but i might as well kill them if nobody uses them
[22:24] <cmccabe> gregaf: how would I find out what mds has rank foo without the symlinks?
[22:24] <cmccabe> gregaf: probably ./ceph mds dump something-or-other?
[22:25] <gregaf> oh, sorry, I do occasionally use the mds rank symlinks — not often, though
[22:25] <gregaf> that's different from the cfuse ones I think though
[22:25] <sagewk> tv: nope
[22:25] <cmccabe> yeah, it confused me to see that in cfuse
[22:25] <Tv> dout(10) << "map says i am " << addr << " mds" << whoami << "." << incarnation
[22:25] <Tv> << " state " << ceph_mds_state_name(state) << dendl;
[22:25] <gregaf> I think the MDS ones are provided through a different mechanism than the cfuse ones
[22:25] <sagewk> well, the mds ones can be handy sometimes, but i can live without them too.
[22:25] <Tv> the "whoami" is the mds rank
[22:25] <cmccabe> someone explained to me earlier (I forget who) what the point was for mds
[22:25] <gregaf> Tv: those symlinks come from different places, don't they?
[22:25] <Tv> at least that gives you the info
[22:26] <Tv> gregaf: same code
[22:26] <Tv> gregaf: different caller
[22:26] <gregaf> oh okay
[22:26] <sagewk> if it's a headache i'm fine with dropping that code. the mds is the only place where it's somewhat useful.
[22:26] <Tv> sagewk: i need to disable them for autotest, to save disk space
[22:26] <Tv> (and make processing simpler)
[22:27] <Tv> so your options are 1) new config option 2) kill
[22:27] <Tv> apparently cfuse is 2) kill, is mds 2) kill too?
[22:27] <Tv> gregaf: does it make a difference for your debugging?
[22:27] <sagewk> hmm
[22:27] <gregaf> mmm, keeping the mds ones is probably worth the minor hassle of a config option
[22:28] <Tv> ok so kill cfuse, make mds configurable
[22:28] <sagewk> yeah
[22:30] <Tv> so if "rank" is an obsolete term, what are they?
[22:30] <Tv> it's the messenger id, as far as i can tell from the code
[22:30] <gregaf> yeah
[22:30] <gregaf> it used to be that there could be multiple "entities" in one process using a single messenger
[22:30] <Tv> messenger name i mean
[22:30] <gregaf> and each messenger had a unique rank
[22:31] <gregaf> which dealt with delivery to the machine
[22:31] <Tv> oh here's a funky thing
[22:31] <gregaf> and then the messenger delivered it to the entity
[22:31] <Tv> nothing seems to ever remove those symlinks
[22:31] <Tv> if (oldwhoami != whoami)
[22:31] <Tv> dout_create_rank_symlink(whoami);
[22:31] <Tv> so if mds changes its name, it just gets new symlinks
[22:31] <Tv> funnnky
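
The stale-symlink problem Tv describes above (a new link is created whenever whoami changes, but the old one is never removed) could be avoided along these lines. This is a hedged sketch only: update_rank_symlink and replace_symlink are hypothetical helpers, the link naming scheme is invented for the example, and it is not how dout_create_rank_symlink is implemented.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <unistd.h>

// Hypothetical helper: make link_path a symlink to target, replacing any
// existing entry at that path. Note that unlink() would also delete a
// regular file, so link_path must be a name reserved for these links.
bool replace_symlink(const std::string &target, const std::string &link_path)
{
  if (::unlink(link_path.c_str()) != 0 && errno != ENOENT)
    return false;
  return ::symlink(target.c_str(), link_path.c_str()) == 0;
}

// Hypothetical helper: when the rank changes from old_rank to new_rank,
// drop the link for the old rank before creating the new one, so the log
// directory never accumulates links that point at the wrong daemon.
// old_rank < 0 means "no rank assigned yet".
bool update_rank_symlink(const std::string &dir, const std::string &prefix,
                         const std::string &log_name,
                         int old_rank, int new_rank)
{
  assert(new_rank >= 0);
  if (old_rank == new_rank)
    return true;
  if (old_rank >= 0) {
    const std::string old_link = dir + "/" + prefix + std::to_string(old_rank);
    if (::unlink(old_link.c_str()) != 0 && errno != ENOENT)
      return false;
  }
  return replace_symlink(log_name, dir + "/" + prefix + std::to_string(new_rank));
}
```

This still leaves the race cmccabe mentions (another process can look at the directory between the unlink and the symlink), and it does nothing about links left behind by a daemon that has died, which is part of why dropping the feature or moving it into a separate developer tool is discussed instead.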
[22:31] <gregaf> we ripped out the multiple entities thing though, I don't remember how it was used and it was a big hassle that made the code a lot more complex
[22:32] <gregaf> Tv: it's most useful in vstart; it might be broken on a distributed cluster?
[22:32] <sagewk> the mds only ever goes from nobody -> some rank. then it dies or restarts
[22:32] <Tv> that makes me think the symlinks are inherently unreliable, and i really dislike traps :-/
[22:32] <sagewk> yeah.. make the config option off by default, but on in vstart.sh
[22:32] <Tv> sagewk: oh that's definitely not how the code reads..
[22:32] <gregaf> do the symlinks get wiped out on startup?
[22:33] <cmccabe> gregaf: no
[22:33] <Tv> $SUDO rm -rf out/*
[22:33] <Tv> vstart does ;)
[22:33] <cmccabe> tv: oh
[22:33] <gregaf> yeah, I meant by the daemon ;)
[22:33] <cmccabe> tv: that won't help regular users though
[22:33] <cmccabe> tv: actually, I wonder-- could vstart.sh do this by itself (without help from dout)?
[22:33] <Tv> cmccabe: i guess nobody not-sage not-greg will ever enable that option
[22:34] <Tv> cmccabe: yeah i'd like to have a "discover the name of mds blahblah" or something
[22:34] <Tv> and just move this logic there
[22:34] <cmccabe> tv: it seems like the kind of thing that *might* be able to just move into vstart.sh
[22:34] <Tv> simple code makes tv happy
[22:34] <gregaf> Tv: which name and what's blahblah?
[22:34] <cmccabe> tv: I guess you have to figure out how to avoid a race though
[22:34] <Tv> gregaf: well either way, mds.N <-> messenger name, as long as you can query the mapping
[22:34] <sagewk> couldn't be done by vstart.
[22:34] <Tv> cmccabe: it's racy already ;)
[22:35] <Tv> sagewk: yes on the level of this doesn't exist until it's running, but could be done by a developer tool, "gimme-my-convenience-symlinks.sh"
[22:35] <gregaf> well mds.N is determined by the config, and the mds rank is separate from the messenger rank (at least conceptually, and I think actually)
[22:35] <sagewk> we can just remove it. it's also mildly annoying that autocomplete finds two names in production (and vstart) environments
[22:35] <Tv> gregaf: codewise i see "whoami" being used to set both the symlink name and the "messenger name"
[22:36] <Tv> gregaf: do you agree with sage enough so i can get my negative code contribution count going again ;)
[22:36] <gregaf> heh
[22:37] <gregaf> I'd hate to stand in the way of your negative productivity
[22:37] <Tv> i'll happily look at ways to query "what mds instances have what messenger names", as a replacement
[22:37] <gregaf> just as long as you realize that when we implement SLOC production metrics in our yearly reviews that you're going to be in deep doodoo!
[22:37] <gregaf> :P
[22:37] <Tv> then you could run a simple shell script to create the symlinks
[22:38] <Tv> gregaf: back in one job we joked about QA getting bonuses for bugs found, and us developers being bribable to add easy bugs..
[22:38] <gregaf> Tv: it's mostly just helpful because the MDS daemon acting as each MDS can change and the symlinks help with that
[22:38] <gregaf> it's not a super-big deal
[22:38] <gregaf> Tv: the author of Dilbert published a book about stupid things companies do
[22:39] <gregaf> there was a story about one company that gave out bug bounties to the developers who fixed bugs and the QA people who found the bugs
[22:39] <Tv> gregaf: sage earlier said that change happens exactly once in lifetime of an MDS process, from undefined->some value
[22:39] <gregaf> and the devs who fixed the bugs were the same as the devs who wrote the code
[22:39] <cmccabe> http://www.joeindie.com/images/dilbert-minivan.gif
[22:39] <gregaf> yeah, the symlink is helpful in the case where you have multiple daemons running and restarting — it's easy to open up mds 0 without figuring out which log is the current one
[22:39] <Tv> gregaf: the CEO of that company wanted us to remove a dilbert comic from a cubicle wall because it "caused a bad atmosphere"
[22:40] <gregaf> it's really not a big deal
[22:40] <sjust> Tv: seriously?
[22:40] <Tv> gregaf: we thought he had confused cause and effect..
[22:40] <Tv> sjust: yup
[22:40] <gregaf> yeah
[22:40] <gregaf> apparently this company though ended up paying a few tens of thousands of dollars to a single dev and a single QA person before they figured out how that incentive was broken
[22:41] <gregaf> I liked that book
[22:41] <gregaf> there was another story about a company that wanted to improve worker productivity so they switched everybody from desktops to laptops
[22:41] <gregaf> and then to prevent theft they locked all the laptops to the desks and kept the keys
[22:45] <cmccabe> http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/000000/20000/3000/800/23875/23875.strip.gif
[22:56] <Tv> nice
[23:06] * pombreda (~Administr@29.96-136-217.adsl-dyn.isp.belgacom.be) has joined #ceph
[23:14] <djlee> guys, with the simplest setting, all equal test and network links, disk, etc, but except the machine spec difference
[23:14] <djlee> and run two test, one for highspec machine, and other for lowspec machine, and the result turns out different, e.g., highspec gets about twice of lowspec
[23:14] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[23:15] <djlee> then what would you say the main reason? just limited cpu and ram?
[23:16] <djlee> specs im comparing is, e.g., xeon with 18gb, vs atom with 4gb
[23:16] <gregaf> djlee: depends on the test and which daemons are on which machines, but basically yes
[23:17] <Tv> djlee: cpu speed and amount of RAM make a huge difference for just about anything
[23:17] <djlee> hm,,,
[23:17] <Tv> xeon 18GB vs an atom is not an even match by any means
[23:18] <djlee> well you see, the thing is;; xeon with 18gb pushes about 15 to 20MB/s per osd disk, whereas atom one does about 9mb/s
[23:18] <djlee> i meant, random read files
[23:18] <Tv> for example, on a read-heavy load the Xeon is going to have ~17GB of the disk cached in RAM
[23:21] <gregaf> cmccabe: did you implement that config change callback you were talking about?
[23:21] <cmccabe> gregaf: working on it now
[23:21] <gregaf> ah, okay
[23:21] <gregaf> thought you'd pushed it already for some reason
[23:21] <cmccabe> gregaf: I've been pushing pieces of it
[23:22] <djlee> when the writes go first to journal, and after long continuous writes end, what exactly happens after?
[23:23] <djlee> does it do some kind of flush and do the write at the backend?
[23:23] <gregaf> cmccabe: cool…I want it for #1010, can you let me know when you push the framework for it?
[23:23] <cmccabe> gregaf: sure. probably later today it will be in
[23:23] <gregaf> sweet
[23:24] <djlee> for lowspec machines, immediately the write finishes, if i try reading data from it, i get terrible speed, e.g., at KB/s, so i think i need to wait much longer before i can do the read at full speed,
[23:25] <djlee> for highspec machine, i didn't see this problem
[23:25] <djlee> of course i did full sync, dropcache, unmount/remount before doing the read
[23:25] <djlee> so it must be the low cpu computation?
[23:26] <Tv> djlee: "vmstat 1" is a quick way to stop guessing
[23:27] <Tv> djlee: if one case shows cpu user+system near 100%, other one shows lots of idle time, then yes
[23:28] <djlee> well i was relying on iostat one,
[23:28] <djlee> let me check if its similar :p
[23:30] <Tv> well that gives you the same cpu percentages
[23:30] <Tv> a bit harder to read but same data
[23:30] <djlee> yeah
[23:32] <djlee> my log says, for lowspec machine,
[23:32] <djlee> avg-cpu: %user %nice %system %iowait %steal %idle
[23:32] <djlee> 6.60 0.00 12.47 54.77 0.00 26.16
[23:32] <djlee> but the disks are definitely at 100% util
[23:33] <Tv> so it's not about cpu usage
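
The reasoning above (high user+system means CPU-bound, high iowait with idle CPU means the disks are the bottleneck) can be sketched as a small check on the avg-cpu line pasted from iostat. This is only an illustration: the function name and the 90% threshold are assumptions made for the example, not anything iostat or Ceph defines.

```python
# Minimal sketch: classify one iostat "avg-cpu" row.
# Column order assumed: %user %nice %system %iowait %steal %idle
# The 90% cutoff is an arbitrary, hypothetical threshold for illustration.

def classify_cpu_line(values: str) -> str:
    user, nice, system, iowait, steal, idle = (float(v) for v in values.split())
    busy = user + nice + system  # time actually spent executing code
    if busy >= 90.0:
        return "cpu-bound"
    if iowait > busy:
        # CPU mostly waits on outstanding disk I/O rather than computing
        return "io-bound"
    return "inconclusive"

# The row djlee pasted for the lowspec machine
sample = "6.60 0.00 12.47 54.77 0.00 26.16"
print(classify_cpu_line(sample))
```

With the pasted row, user+system is about 19% while iowait is about 55%, which matches the conclusion that the slowdown is not about CPU usage.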
[23:34] <gregaf> are the disks actually the same?
[23:34] <Tv> perhaps your atom has a significantly worse controller
[23:35] <Tv> or perhaps your load is read-heavy enough that the xeon has things cached in ram, and thus needs to read less from disk
[23:35] <Tv> you can compare blocks in/out to see if the xeon does less disk operations
[23:35] <gregaf> my naive assumption would be that it's taking longer to move stuff from the journal to the OSD store
[23:35] <Tv> gregaf: seems so, but not directly because of the slower cpu..
[23:36] <djlee> tv: how to compare block in/out? thanks
[23:36] <Tv> djlee: well, for example iostat output has the per-second rates
[23:36] <gregaf> Tv: yeah, that's why I was asking about the disks
[23:36] <djlee> greg: yeah journal-to-osd-store, where to find more info about this? when write-ahead done, it immediately issues that?
[23:36] <gregaf> djlee: how much data are you writing before you start your read?
[23:36] <gregaf> and how many OSDs do you have?
[23:36] * Meths_ (rift@ has joined #ceph
[23:37] <djlee> greg: 12 disks = 24tb, i write at least 1.2TB or 2.4TB before i do the reading
[23:37] <gregaf> just one machine hosting them all?
[23:38] <djlee> 2 nodes, and then i moved to 4 nodes (lowspec),
[23:38] <djlee> i'll try 6 nodes but keep the 12disks, so 2disks each per node
[23:39] <gregaf> hmm, at 1.2TB there shouldn't be a lot of caching effects
[23:39] <djlee> yeah, plus i do the random reading, multiple threads, run for 10min+
[23:40] <gregaf> and what were your read rates?
[23:40] <Tv> djlee: size of RAM is going to have a huge impact for that test
[23:41] <gregaf> Tv: I don't think the RAM will make a big difference in reading of a 1.2TB dataset?
[23:41] <djlee> greg: about 10mb/s or so per osd (for lowspec), so i got like 175mb/s max for write(writeahead journal), but 150mb/s for random-read
[23:41] <djlee> greg: for highspec, i get 15mb/s+ or so per osd
[23:41] <gregaf> djlee: that's megabytes, not megabits, right?
[23:41] <Tv> gregaf: 18GB vs 4GB is going to have a significant effect on the cache hit percentage
[23:41] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[23:41] <Tv> gregaf: and RAM is so much faster that i expect that to make a difference
[23:42] <gregaf> Tv: err, it's ~.04% cached versus ~.18% cached
[23:43] <gregaf> and the node-local prefetching shouldn't help with the random access and objects
[23:43] <gregaf> I mean I don't have a good model of how much that helps, but I don't think it's enough over a 10 minute run?
[23:44] <Tv> gregaf: umm 18G / 1228G = 1.45%, not .18?
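
The cache-fraction arithmetic being debated here can be checked with a short sketch. It assumes, as the chat does, that the whole 18 GB or 4 GB of RAM is available as page cache (an upper bound, since the OS and daemons use some of it) and that the dataset is 1.2 TB, roughly 1228 GB. The function name is hypothetical.

```python
# Minimal sketch: upper bound on the share of a dataset that can sit in RAM.
# Assumes all RAM is usable as cache, which overstates the real figure.

def cache_fraction_pct(ram_gb: float, dataset_gb: float) -> float:
    return 100.0 * ram_gb / dataset_gb

DATASET_GB = 1.2 * 1024  # 1.2 TB written before the read test

xeon_pct = cache_fraction_pct(18, DATASET_GB)  # highspec machine
atom_pct = cache_fraction_pct(4, DATASET_GB)   # lowspec machine
print(f"xeon: {xeon_pct:.2f}%  atom: {atom_pct:.2f}%")
```

Both figures come out well under 2%, so under these assumptions only a small slice of a uniformly random read workload could be served from cache on either machine.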
[23:45] <gregaf> err
[23:45] <djlee> Tv: so i also did on two highspec, (12disks same, but split to 6 each), but the improvement over 1node wasn't visible.
[23:46] <gregaf> sorry, I can't do math I guess
[23:46] <gregaf> I was wondering if maybe the extra RAM meant it needed to sync to disk less on the writes
[23:46] <gregaf> and then the reads are blocking
[23:46] <gregaf> but with a data set so much larger than the OSD journals I don't think that can be the problem
[23:48] <gregaf> djlee: hmm, I wonder if there's something else going on here then — with that much RAM and that many disks I'd expect much better numbers
[23:48] <djlee> i should check the iostats block in/out rate, and compare high-vs-low, but er, highspec will obviously have higher rates?
[23:48] <djlee> greg: im also on normal ext4 default
[23:48] <gregaf> my dev machine has 8 GB of RAM and when I run one of each daemon off one disk I get numbers comparable to yours
[23:49] <gregaf> I think with 2 OSDs I still can push 15MB/s, even if they're both thrashing the same disk
[23:49] <gregaf> although it's been a while since I tested random reads so maybe that's the problem
[23:50] <djlee> yeah,
[23:51] <djlee> gregaf: is it normally instant for journal-to-osd-store?
[23:51] <gregaf> djlee: no
[23:52] <gregaf> when using ext4 the OSD needs to write everything to the journal, and then it writes it again to the general filesystem
[23:52] <gregaf> along with a number of calls to sync at various points to guarantee no data is lost in the event of a power failure
[23:53] <djlee> right, one journaling per fs, so mkfs.ext3/4 without journal, but enabling it on ceph, means no-journal, correct? sorry for my confusion :p
[23:54] <gregaf> oh no, the OSD journal is completely separate from any journal the FS may be using
[23:54] <djlee> riggght, so basically im having two kind of journal, one for ext4, and other for ceph;
[23:54] <gregaf> oh, yeah, that's possible
[23:55] <djlee> guys so do you guys have ffsb for try?
[23:56] <djlee> ive got the config file :p
[23:56] <gregaf> we have had it, I'm not sure if we're running it right now...Tv?
[23:58] <djlee> generally though, i do strongly suspect the ram;

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.