#ceph IRC Log


IRC Log for 2011-02-19

Timestamps are in GMT/BST.

[0:02] <Tv> gregaf: well apart from dout and cond.Signal, there's not much there
[0:02] <gregaf> indeed
[0:03] <gregaf> for a bit I thought there might be something in cond.Signal
[0:03] <cmccabe> weirdly, common/Clock has a Mutex, but no code uses it
[0:03] <gregaf> but the only thing that waits on that cond is the SafeTimer thread, which runs under the osd_lock as well
[0:04] <Tv> now, cond.Signal might very well put the thread to sleep for a while
[0:04] <Tv> but 20 seconds, on a machine that's idle, is too much
[0:05] <cmccabe> does pthread_cond_broadcast allow the thread using the lock to run immediately, or only after unlock?
[0:05] <cmccabe> *thread using lock = waiter
[0:07] <cmccabe> I wrote a test program which seemed to show that just calling broadcast doesn't cause the waiters to be run. You must release the lock, *then* they run.
[0:07] <cmccabe> but actual documentation seems to be non-existent, except for a cryptic comment about being sure to hold the lock when broadcasting in order to get "predictable scheduling behavior"
[0:09] <Tv> if you don't hold the lock, waiters can miss your signal due to race
[0:09] <Tv> that's one of the classics, not so sure about pthread specifics
[0:10] <cmccabe> tv: yeah, that's the gist of it.
[0:11] <Tv> uhh, pthread_cond_broadcast, as far as i understand it, moves the mutex ownership to the thread that got woken up
[0:11] <Tv> i don't see how the signaler should unlock anything afterwards
[0:11] <Tv> http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_cond_signal.html
[0:11] <Tv> "the thread shall own the mutex"
[0:12] <cmccabe> well, if the signaller owned the mutex before calling pthread_cond_broadcast, he'll own it afterwards too.
[0:12] <gregaf> I don't think you guys are making much sense?
[0:12] <cmccabe> write a test program if you're unsure
[0:12] <gregaf> pthread_cond_broadcast guarantees to wake up every thread waiting on that cond
[0:12] <Tv> gregaf: oh yeah usually you want _signal
[0:12] <gregaf> and it guarantees that when they are woken they are holding the lock that they waited with
[0:13] <Tv> frankly, pthreads makes no sense to me
[0:13] <Tv> it's overly complex and clunky as hell
[0:13] <cmccabe> tv: sadly, Cond::Signal() now does pthread_cond_broadcast
[0:13] <Tv> even worse
[0:14] <cmccabe> tv: don't blame me. I voted for Kodos.
[0:14] <Tv> anyway, yeah i think the above confusion stems from the whole what lock you call _wait *with* and what the condition you're sleeping on means
[0:14] <Tv> cmccabe: don't blame me i usually vote for message-passing or clone(2)&futex interfaces ;)
[0:15] <cmccabe> tv: excellent choices!
[0:16] <Tv> seriously, writing go is such a pleasure.. it's like Python for the world of compiled languages
[0:16] <cmccabe> tv: I would really like to try golang. I've been pretty excited about the language since I read about it.
[0:18] <cmccabe> check out this test program:
[0:18] <cmccabe> #include <stdio.h>
[0:18] <cmccabe> #include <pthread.h>
[0:18] <cmccabe> pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
[0:18] <cmccabe> pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
[0:18] <cmccabe> void *func(void *v) {
[0:18] <cmccabe> pthread_cond_wait(&cond, &lock);
[0:18] <cmccabe> printf("func running here!\n");
[0:18] <cmccabe> fflush(stdout);
[0:18] <cmccabe> sleep(1);
[0:18] <cmccabe> pthread_mutex_unlock(&lock);
[0:18] <cmccabe> }
[0:18] <cmccabe> int main(void) {
[0:18] <cmccabe> printf("creating thread!\n");
[0:18] <cmccabe> fflush(stdout);
[0:18] <cmccabe> pthread_t thread;
[0:18] <cmccabe> pthread_create(&thread, NULL, func, NULL);
[0:18] <cmccabe> sleep(5);
[0:18] <cmccabe> printf("main calling signal!\n");
[0:18] <cmccabe> fflush(stdout);
[0:18] <cmccabe> pthread_mutex_lock(&lock);
[0:18] <cmccabe> pthread_cond_broadcast(&cond);
[0:18] <cmccabe> printf("main running here!\n");
[0:18] <cmccabe> fflush(stdout);
[0:18] <cmccabe> sleep(20);
[0:18] <cmccabe> pthread_mutex_unlock(&lock);
[0:18] <cmccabe> printf("shutting down in 10 seconds.\n");
[0:18] <cmccabe> fflush(stdout);
[0:18] <cmccabe> sleep(10);
[0:18] <cmccabe> return 0;
[0:18] <cmccabe> }
[0:19] <cmccabe> I get:
[0:19] <cmccabe> creating thread!
[0:19] <cmccabe> main calling signal!
[0:19] <cmccabe> main running here!
[0:19] <cmccabe> shutting down in 10 seconds.
[0:19] <cmccabe> func running here!
[0:20] <cmccabe> so I dunno. I think pthread_cond_broadcast doesn't release the lock
[0:20] <gregaf> it's not supposed to release the lock!
[0:20] <DeHackEd> well, you're supposed to be holding the lock before waiting
[0:20] <Tv> yeah, it doesn't, how could it since it doesn't even have a reference to it
[0:20] <gregaf> it explicitly says it doesn't release the lock! — you don't even need to hold the lock to call it
[0:20] <cmccabe> gregaf: the cond object has a reference to the lock inside
[0:20] <gregaf> and we don't want it to release the lock!
[0:20] <Tv> and yes that call to pthread_cond_wait is buggy
[0:21] <gregaf> you can wait using different locks on one cond
[0:21] <cmccabe> yes, relying on sleep() is buggy. I understand. This is just a test.
[0:21] <Tv> cmccabe: you're supposed to hold the mutex before calling _wait
[0:21] <cmccabe> oh #$#...
[0:21] <yehudasa> cmccabe: that's not a proper cursing!
[0:21] <Tv> "They shall be called with mutex locked by the calling thread or undefined behavior results."
[0:22] <DeHackEd> _wait atomically unlocks the lock and waits. when woken up, it re-acquires the lock (waiting to do so) and resumes
[0:23] <cmccabe> so anyway, long story short, broadcast/signal doesn't do an unlock
[0:23] <Tv> but given a use case that simple, shouldn't you just try to lock the lock, and when it succeeds you're good to go?
[0:24] <cmccabe> tv: the point was to test out cond variables and scheduling
[0:24] <gregaf> why would anybody have thought it did unlock?
[0:24] <cmccabe> tv: yes, I agree, it's a crappy program for sure
[0:24] <Tv> yeah just saying most people don't need cond variables
[0:24] <gregaf> you guys are rapidly dragging my knowledge about pthread locking backwards
[0:24] <Tv> gregaf: i would, because i end up using non-pthread locking systems more ;)
[0:25] <cmccabe> gregaf: well, the man page refers cryptically to "predictable scheduling behavior" if you're holding the lock when you broadcast
[0:25] <cmccabe> gregaf: which could be interpreted as "run the waiter *now*"
[0:25] <Tv> cmccabe: more like, lack of races
[0:25] <cmccabe> gregaf: but I guess has to be interpreted as "run the waiter immediately after unlocking the lock you're currently holding"
[0:25] <gregaf> you need to read the man page more holistically, then
[0:26] <gregaf> if you're not holding the lock when you call it then the scheduler is free to do random things like run the other thread before you finish your function call
[0:26] <cmccabe> gregaf: I don't subscribe to a literal interpretation. It's more of a set of parables for how to live your life.
[0:26] <gregaf> that is unpredictable scheduling behavior
[0:27] <Tv> pthreads is a funky mapping over native behavior that is not quite what pthreads tries to expose, anyway.. it's gonna get confusing, because it describes a "standardized" behavior that the implementation doesn't quite provide
[0:27] <Tv> don't rely on the details, like what thread gets scheduled when
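The pasted test program calls pthread_cond_wait without first locking the mutex, which is the bug Tv and DeHackEd point out. A minimal corrected sketch follows. It is not the original program: the names run_demo, ready, and work_done_after_broadcast are hypothetical, the sleeps are replaced by a predicate so the result does not depend on timing, and error handling is reduced to assert to keep it short. It illustrates the conclusion reached in the discussion: broadcast does not release the lock, so the waiter can only resume after the signaller unlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Both flags are only read or written while holding lock. */
static int ready = 0;                     /* the condition being waited for */
static int work_done_after_broadcast = 0; /* set by the signaller after broadcasting */

static void *waiter(void *arg)
{
    int *observed = arg;

    /* The mutex must be held before calling pthread_cond_wait. */
    pthread_mutex_lock(&lock);
    /* Loop on the predicate: this covers spurious wakeups and the case
     * where the broadcast happened before this thread started waiting. */
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    /* The waiter owns the mutex again here, so the signaller has
     * already unlocked and its later write is visible. */
    *observed = work_done_after_broadcast;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Returns the value of work_done_after_broadcast seen by the waiter. */
int run_demo(void)
{
    pthread_t thread;
    int observed = -1;
    int rc = pthread_create(&thread, NULL, waiter, &observed);

    assert(rc == 0);

    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_broadcast(&cond);   /* wakes the waiter, keeps the lock */
    work_done_after_broadcast = 1;   /* still inside the critical section */
    pthread_mutex_unlock(&lock);     /* only now can the waiter proceed */

    pthread_join(thread, NULL);
    return observed;
}
```

Calling run_demo() from a main and printing the result should show that the waiter always sees the write made after the broadcast, since it cannot re-acquire the mutex until the signaller releases it. Build with -pthread.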
[0:31] <Tv> ohh crap proper key management for a cluster.. lovely
[0:31] <Tv> who generates what key and who all need to see it.. yay
[0:32] <yehudasa> Tv: cmccabe: just a note.. he mentioned that he was running it on a 2.6.28-rc (sic? probably 2.6.38-rc?) with cgroups
[0:33] <Tv> cgroups might limit the scheduler a lot, if configured so.. but it would take an extreme config to make a process sleep 20sec
[0:35] <cmccabe> with the pretty pictures: http://www.domaigne.com/blog/computing/condvars-signal-with-mutex-locked-or-not/
[0:36] <cmccabe> so basically, broadcast/signal on linux moves the waiters directly from the condition variable queue to the mutex queue without a context switch (when signalling with mutex locked)
[0:38] <cmccabe> I think we can safely write off that call to Signal in our scheduling woes
[0:38] <Tv> ha!
[0:40] <Tv> but seriously, at this point i'm offering 4 options: 1) locking 2) slow IO 3) naughty kernel making (unrelated?) process sleep 4) corrupt data structures make it e.g. loop pseudorandomly between the two douts
[0:40] <Tv> oh and if this is under virtualization, the guest-host communication is known to cause 3)
[0:40] <Tv> ltrace will reveal much more
[0:41] <cmccabe> yeah, unfortunately C++ has this little thing called undefined behavior, so #4 isn't as silly as it sounds
[0:41] <Tv> a kvm guest pushing heavy IO, with certain versions from a few months ago, would make individual processes sleep 120sec or more, and triggered the kernel watchdog printk's
[0:41] <cmccabe> my money is still on #2 though
[0:42] <Tv> even when the IO was from unrelated processes inside the guest
[0:43] <cmccabe> I think sync() is well-known for causing problems like that
[0:43] <cmccabe> we don't call that thing any more do we?
[0:43] <Tv> i once had a bit of code that had linked lists in kernelspace.. a race would make it corrupt the list to have a cyclic tail, then it'd spin there until another race would corrupt the list again ;)
[0:44] <Tv> and it could work a hundred hours straight with no problems, even with ~1 corruption per hour
[0:44] <Tv> that one came down to, it's really hard to verify the behavior is correct, on that level of stress
[0:45] <Tv> so even if it mishandled things, the automated tests couldn't tell
[0:46] <cmccabe> I think Google has the right approach with their tracing setup
[0:46] <cmccabe> they have tracing running on all production machines
[0:46] <Tv> the modern "a few NOPs that we binary patch when we want to add a tracepoint" approach is pretty sweet
[0:47] <cmccabe> I forget which kernel-level tracing toolkit they were using... I think LTT?
[0:47] * greglap (~Adium@ has joined #ceph
[0:48] <cmccabe> LTT-ng I think
[0:49] <cmccabe> from http://ltt.openrapids.net/papers/bligh-Reprint.pdf
[0:49] <cmccabe> "In addition, we need a system that can capture failures at the earliest possible moment; if a problem takes a week to reproduce, and 10 iterations are required to collect enough information to fix it, the debugging process quickly becomes intractable."
[0:49] <Tv> ok so who needs to know what key, for authentication? do all keys go to mons for checking, and they hand out crypto capabilities, or?
[0:50] <cmccabe> "The ability to instrument a wide spectrum of the system ahead of time, and provide meaningful data the first time the problem appears, is extremely useful. Having a system that can be deployed in a production environment is also invaluable. Some problems only appear when you run your application in a full cluster deployment; re-creating them in a sandbox is impossible."
[0:50] <cmccabe> so in other words, never turn off tracing! Otherwise, you're doomed.
[0:50] <Tv> turn off != compile out
[0:51] <Tv> they don't actively use it until they encounter problems, AFAIK
[0:51] <cmccabe> well, verbosity is always an issue.
[0:51] <Tv> considering that they're well-known for considering a 1% performance drop an unacceptable regression
[0:52] <cmccabe> "Debugging Approach: By setting our tracing tool to log trace data continuously to a circular buffer in memory, and stopping tracing when the error condition was detected, we were able to capture the events preceding the problem (from a point in time determined by the buffer size, e.g. 1GB of RAM) up until it was reported as a timeout."
[0:53] <darkfader> cmccabe: that is awesome
[0:53] <cmccabe> so in other words, it's always "actively in use" even if no problems have been encountered. But it isn't flushed to disk unless it needs to be.
[0:53] <darkfader> i'll read that tomorrow, but that approach is simply genius
[0:54] <cmccabe> yeah, google's cluster expertise is a pretty big advantage
[0:54] <cmccabe> often overlooked
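The approach quoted from the paper, logging continuously to an in-memory circular buffer and only looking at it once an error is detected, can be sketched roughly as below. This is an illustration only, not the tracing toolkit discussed above: trace_ring, trace_record, trace_dump, and the tiny capacity are all hypothetical, and a real tracer would add timestamps, per-CPU buffers, and lock-free writes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRACE_CAPACITY 4   /* hypothetical; the paper mentions e.g. 1GB of RAM */
#define TRACE_MSG_LEN 32

struct trace_ring {
    char events[TRACE_CAPACITY][TRACE_MSG_LEN];
    size_t next;           /* total number of events ever recorded */
};

/* Record an event, overwriting the oldest one once the ring is full. */
static void trace_record(struct trace_ring *r, const char *msg)
{
    char *slot;

    assert(r != NULL && msg != NULL);
    slot = r->events[r->next % TRACE_CAPACITY];
    strncpy(slot, msg, TRACE_MSG_LEN - 1);
    slot[TRACE_MSG_LEN - 1] = '\0';
    r->next++;
}

/* Called when the error condition is detected: fills out[] with the
 * surviving events, oldest first, and returns how many there are. */
static size_t trace_dump(const struct trace_ring *r,
                         const char *out[TRACE_CAPACITY])
{
    size_t count = r->next < TRACE_CAPACITY ? r->next : TRACE_CAPACITY;
    size_t start = r->next - count;
    size_t i;

    for (i = 0; i < count; i++)
        out[i] = r->events[(start + i) % TRACE_CAPACITY];
    return count;
}
```

Recording is always on and costs one bounded copy per event; nothing is flushed anywhere until trace_dump is called, which matches the "actively in use but not flushed to disk" reading above.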
[1:13] <cmccabe> do you think API functions should be rados_verb_noun or rados_noun_verb
[1:13] <cmccabe> sage made a comment that it should be consistent but I'm not sure which is really better
[1:36] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[1:39] * greglap (~Adium@ Quit (Quit: Leaving.)
[1:56] <Tv> cmccabe: if the nouns cluster the calls into meaningful groups, then noun first is good
[1:56] <Tv> much like the rados_ in the beginning groups all rados-related calls
[1:58] <cmccabe> hmm, my first instinct was to go with verb noun because that's what most of the calls already are like
[1:58] <cmccabe> I guess I don't really have a very strong opinion though
[1:59] <Tv> the way i think of it is if you have datatype foo (in C), you'd have foo_init and foo_release, foo_get/foo_put, etc
[1:59] <Tv> so now foo just has internal structure
[1:59] <Tv> but i'm not religious about that
[2:00] <cmccabe> yeah, it is often nice to have something kind of like a module name as a prefix
[2:01] <cmccabe> however, rados doesn't really have these different functions in different modules at the moment
[2:01] <Tv> but e.g. rados_conf_get is a lovely example
[2:01] <Tv> rados_conf_get, rados_conf_get_bool, rados_conf_get_uint32 -- makes perfect sense
[2:02] <cmccabe> yeah
[2:02] <Tv> rados_get_conf? rados_get_conf_bool? rados_get_bool_conf?
[2:02] <cmccabe> get_conf just sounds like you are getting a configuration, which isn't true at all
[2:02] <Tv> yup
[2:02] <Tv> and when you need to add more, you don't know where to put it
[2:02] <cmccabe> I mean, verb-object doesn't make sense when the verb is not operating on that object
[2:03] <cmccabe> in my opinion at least
[2:03] <Tv> but even foo_init, when i am initing a foo, makes sense to me.. things like grep foo_, tab completion, all my toolage suggests shared prefixes are better than shared suffixes
[2:04] <cmccabe> but again, that's mainly true for when the prefixes really do distinguish between modules
[2:05] <cmccabe> anyway, you're probably right.
[2:05] <cmccabe> I guess I'll probably redo this as noun verb at some point
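Tv's noun-first argument can be illustrated with a small sketch. Everything here is hypothetical: the demo_conf_ prefix, the entry struct, and the signatures are made up for illustration and are not the librados API. The point is only that the getters share a prefix, so grep demo_conf_ or tab completion finds the whole family, and a new typed getter has an obvious name.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical key/value configuration entry. */
struct demo_conf_entry {
    const char *key;
    const char *value;
};

/* Returns the raw string value for key, or NULL if it is not set. */
static const char *demo_conf_get(const struct demo_conf_entry *entries,
                                 size_t n, const char *key)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(entries[i].key, key) == 0)
            return entries[i].value;
    return NULL;
}

/* Returns 0 and stores 1 or 0 in *out for "true"/"false", else -1. */
static int demo_conf_get_bool(const struct demo_conf_entry *entries,
                              size_t n, const char *key, int *out)
{
    const char *s = demo_conf_get(entries, n, key);

    assert(out != NULL);
    if (s == NULL)
        return -1;
    if (strcmp(s, "true") == 0) {
        *out = 1;
        return 0;
    }
    if (strcmp(s, "false") == 0) {
        *out = 0;
        return 0;
    }
    return -1;
}

/* Returns 0 and stores the parsed value in *out, else -1. */
static int demo_conf_get_uint32(const struct demo_conf_entry *entries,
                                size_t n, const char *key, uint32_t *out)
{
    const char *s = demo_conf_get(entries, n, key);
    char *end = NULL;
    unsigned long v;

    assert(out != NULL);
    /* Reject missing values, empty strings, and leading signs or spaces. */
    if (s == NULL || !isdigit((unsigned char)s[0]))
        return -1;
    errno = 0;
    v = strtoul(s, &end, 10);
    if (errno != 0 || *end != '\0' || v > UINT32_MAX)
        return -1;
    *out = (uint32_t)v;
    return 0;
}
```

With verb-first names the same three functions would be get_conf, get_conf_bool, and get_conf_uint32 (or get_bool_conf), which is the ambiguity raised above about where a new variant should go.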
[2:16] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:16] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[2:34] * Plnt (~someone@rhea.pwn.cz) has joined #ceph
[3:26] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:32] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[4:13] * Meths (rift@ has joined #ceph
[4:20] * bbigras (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) has joined #ceph
[4:20] * bbigras is now known as Guest1881
[4:24] * Guest1645 (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[4:40] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) Quit (Quit: Leaving)
[4:50] * atg (~atg@please.dont.hacktheinter.net) Quit (Ping timeout: 480 seconds)
[4:50] * atg (~atg@please.dont.hacktheinter.net) has joined #ceph
[4:56] * johnl_ (~johnl@ has joined #ceph
[4:58] * johnl (~johnl@ Quit (Ping timeout: 480 seconds)
[5:25] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[5:34] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[5:44] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[5:46] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[6:08] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[6:09] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[6:19] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[6:19] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[6:23] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:27] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[6:28] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[6:36] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[6:36] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[6:44] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[6:45] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[6:48] * MK_FG (~MK_FG@ Quit (Ping timeout: 480 seconds)
[6:55] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[6:59] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[7:02] * MK_FG (~MK_FG@ has joined #ceph
[7:10] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[7:10] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[7:21] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[7:22] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[7:30] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[7:37] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[7:44] * MK_FG (~MK_FG@ Quit (Quit: o//)
[7:47] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[7:48] * MK_FG (~MK_FG@ has joined #ceph
[7:55] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[7:56] * MK_FG (~MK_FG@ Quit (Quit: o//)
[7:56] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[7:58] * MK_FG (~MK_FG@ has joined #ceph
[8:13] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[8:14] * orzzz (~root@ has joined #ceph
[8:14] * orzzz (~root@ Quit ()
[8:16] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[8:40] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[8:41] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[8:41] * MK_FG (~MK_FG@ Quit (Quit: o//)
[8:49] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[8:49] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[8:54] * MK_FG (~MK_FG@ has joined #ceph
[8:58] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[8:58] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[9:00] * MK_FG (~MK_FG@ Quit (Quit: o//)
[9:11] <wido> greglap: Yes, that is/was on 2.6.38. Machine panic'ed afterwards
[9:11] <wido> It's 2.6.38-rc5, the code which is in there
[9:17] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[9:17] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[9:37] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[9:38] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[9:45] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[9:49] * allsystemsarego (~allsystem@ has joined #ceph
[10:00] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[10:06] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[10:09] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[10:17] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[10:21] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) has joined #ceph
[10:37] * MK_FG (~MK_FG@ has joined #ceph
[11:30] * Meths_ (rift@ has joined #ceph
[11:36] * Meths (rift@ Quit (Read error: Operation timed out)
[14:07] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) has joined #ceph
[14:20] * Meths_ is now known as Meths
[16:39] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) Quit (Remote host closed the connection)
[16:46] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) has joined #ceph
[17:39] * yx (~yx@82VAABWSB.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[17:39] * yx (~yx@tory.uvt.nl) has joined #ceph
[17:59] * eternaleye_ (~eternaley@ has joined #ceph
[18:03] * eternaleye (~eternaley@ Quit (Ping timeout: 480 seconds)
[18:34] * eternaleye_ is now known as eternaleye
[20:11] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Remote host closed the connection)
[21:36] * valtha (~valtha@ohmu.fi) has joined #ceph
[21:43] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[22:38] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:44] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.