#ceph IRC Log


IRC Log for 2011-05-13

Timestamps are in GMT/BST.

[0:11] <Tv> well, now i know osd locking is broken ;)
[0:11] <Tv> it daemonizes after fcntl
[0:12] <Tv> though that doesn't explain what i'm seeing; i tell it not to daemonize
[0:16] * sung (~sung@doot.realfuckingnews.com) has joined #ceph
[0:19] <Tv> hmm actually i can't explain the behavior i'm seeing
[0:27] <Tv> so fcntl logs are documented to be lost on fork, yet i see cosd fork and retain the lock?
[0:27] <Tv> s/logs/locks/
[0:28] <gregaf> Tv: "lost on fork" meaning what here?
[0:28] <Tv> Record locks are not inherited by a child created via fork(2), but are preserved across an
[0:28] <Tv> execve(2).
[0:28] <Tv> from fcntl(2)
[0:28] <gregaf> Tv: yeah, but the parent process holds it, right?
[0:29] <Tv> the parent process exits when cosd daemonizes
[0:29] <gregaf> ah, k
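The fcntl(2) behaviour quoted above can be checked with a small Python 3 sketch. It assumes a POSIX system with fork(); the helper name child_can_take_lock is made up for illustration and is not part of Ceph. The parent takes a record lock, forks, and the child tries to take the same lock through the inherited descriptor. The child fails because the lock still belongs to the parent process.

```python
import fcntl
import os


def child_can_take_lock(path):
    """Return True if a forked child can lock a file its parent has locked."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    # Parent takes an exclusive fcntl-style record lock on the whole file.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)

    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the descriptor is inherited, the record lock is not.
        os.close(r)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            os.write(w, b"1")   # got the lock (not expected while parent lives)
        except OSError:
            os.write(w, b"0")   # lock is held by another process: the parent
        finally:
            os._exit(0)

    os.close(w)
    answer = os.read(r, 1)
    os.waitpid(pid, 0)
    os.close(r)
    os.close(fd)                # closing the fd also drops the parent's lock
    return answer == b"1"
```

If the parent exits right after forking, as a daemonizing process does, its lock disappears with it and the child is left holding nothing.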
[0:29] <sage> huh. so is the right thing to do to test the lock pre-fork, error out if held, and then take+hold it after fork? racy
[0:30] <Tv> sage: my favorite solution is not maintaining daemonization code ;)
[0:30] <sage> yeah yeah :)
[0:31] <Tv> but the stuff that's out there seems to all establish a communication channel between child & parent
[0:31] <Tv> signals okness
[0:31] <sage> that's what cfuse does. annoying
[0:31] <yehuda_hm> sage: getting too late here.. I didn't get that to work, might be that some of those functions are just for stacked devices..
[0:31] <Tv> e.g. python's subprocess.py does the same, it has an fd to signal certain kinds of errors like executable not found
[0:32] <sage> yehuda_hm: weird ok. might be worth asking on the scsi list or lkml
[0:36] <Tv> ohhh FileStore::mount grabs the lock again
[0:36] <Tv> that explains it
[0:36] <Tv> it's unlocked after daemonization until that point
[0:36] <Tv> sage: i hear you wanted to avoid racy, eh?-)
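One way to avoid the unlocked window is the pattern Tv mentions: take the lock only in the process that survives the fork, and report success or failure to the parent over a pipe. The following Python 3 sketch assumes a POSIX system; launch, stop, and the two pipes are hypothetical names chosen for illustration, not cosd's actual startup code. The second pipe only stands in for the daemon's lifetime so the example can be stopped cleanly.

```python
import fcntl
import os


def launch(path):
    """Fork a child that takes the lock itself and reports the outcome.

    Returns (ok, stop): ok is True if the child got the lock, and stop()
    ends the child and reaps it.
    """
    status_r, status_w = os.pipe()   # child -> parent: did startup work?
    life_r, life_w = os.pipe()       # parent -> child: stand-in for daemon lifetime
    pid = os.fork()
    if pid == 0:
        os.close(status_r)
        os.close(life_w)
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            # Taken after fork, so the lock belongs to the surviving process.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.write(status_w, b"0")
            os._exit(1)
        os.write(status_w, b"1")
        os.close(status_w)
        os.read(life_r, 1)           # returns b"" once the parent closes life_w
        os._exit(0)

    os.close(status_w)
    os.close(life_r)
    ok = os.read(status_r, 1) == b"1"
    os.close(status_r)

    def stop():
        os.close(life_w)
        os.waitpid(pid, 0)

    return ok, stop
```

A real daemon's parent would exit with a status derived from ok instead of returning it. Because the lock is never held by the parent, there is no test-then-take race between two starting daemons: whichever child locks first wins and the other reports failure.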
[0:46] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Ping timeout: 480 seconds)
[0:59] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[1:04] <sage> :)
[1:07] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[1:07] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) has joined #ceph
[1:11] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) Quit ()
[1:12] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) has joined #ceph
[1:12] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) Quit ()
[1:14] * Guest598 (~matthew@pool-96-228-59-187.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[1:19] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:21] * Guest700 (~matthew@pool-96-228-59-187.rcmdva.fios.verizon.net) has joined #ceph
[1:21] * Guest700 (~matthew@pool-96-228-59-187.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[1:25] * MK_FG (~MK_FG@ Quit (Quit: o//)
[1:36] * MK_FG (~MK_FG@ has joined #ceph
[1:38] * greglap (~Adium@ has joined #ceph
[1:53] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:27] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[2:30] * Guest600 (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) Quit (Remote host closed the connection)
[2:30] * bbigras (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) has joined #ceph
[2:31] * bbigras is now known as Guest703
[2:36] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:59] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[5:12] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[5:42] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) Quit (Quit: Leaving)
[5:53] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[5:54] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:53] * lidongyang (~lidongyan@ Quit (Read error: Connection reset by peer)
[6:58] * lidongyang (~lidongyan@ has joined #ceph
[7:07] * macana (~ml.macana@ has joined #ceph
[7:18] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: Been here. Done that.)
[10:13] * Jiaju (~jjzhang@ Quit (Remote host closed the connection)
[10:29] * allsystemsarego (~allsystem@ has joined #ceph
[10:32] * Meduka_Meguca (~Yulya@ip-95-220-177-191.bb.netbynet.ru) has joined #ceph
[10:53] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[13:15] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) has joined #ceph
[13:39] * admix (admix@admix.eu) has joined #ceph
[14:02] * darktim (~andre@pub-wlan.office.nine.ch) has joined #ceph
[14:02] * darktim (~andre@pub-wlan.office.nine.ch) Quit (Remote host closed the connection)
[14:13] * Guest703 (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) Quit (Remote host closed the connection)
[14:54] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[15:22] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[15:22] * morse (~morse@supercomputing.univpm.it) Quit (Read error: Connection reset by peer)
[15:25] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[15:33] * morse (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[15:42] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Ping timeout: 480 seconds)
[15:51] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[16:00] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[16:02] * andret (~andre@pcandre.nine.ch) has joined #ceph
[16:03] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:06] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[16:11] * morse (~morse@supercomputing.univpm.it) Quit (Read error: Connection reset by peer)
[16:11] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:29] * bbigras (~bbigras@bas11-montreal02-1128536388.dsl.bell.ca) has joined #ceph
[16:30] * bbigras is now known as Guest762
[17:10] * andret (~andre@pcandre.nine.ch) has joined #ceph
[17:15] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[17:16] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:16] * Guest762 (~bbigras@bas11-montreal02-1128536388.dsl.bell.ca) Quit (Remote host closed the connection)
[17:19] * bbigras_ (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) has joined #ceph
[17:39] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:50] * greglap (~Adium@ has joined #ceph
[18:18] * cmccabe (~cmccabe@ has joined #ceph
[18:40] <sagewk> bchrisman: merged the readdir_r fix
[18:41] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:49] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[18:53] <sagewk> i have meetings this morning so you guys will have to get along without me
[18:58] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:58] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:00] * lxo (~aoliva@ has joined #ceph
[19:01] <sagewk> yehuda_hm: so far so good (with the leak fix), still running iozone. pushed some more cleanup bits to req_coll
[19:08] <yehuda_hm> sagewk: does it also include my changes?
[19:08] <yehuda_hm> oh, doh
[19:37] <bchrisman> greglap: heh... having a bit of trouble chasing down this client issue I've got. libceph's ceph_readdir_r is a simple wrapper around Client::readdir_r... which seems like it's not working. I see the fuse client ducks underneath that and calls readdir_r_cb directly with a callback which stuffs data into fuse.
[19:41] <bchrisman> since the rest of the mechanism works fine with fuse, I presume the problem is in either Client::readdir_r, Client::readdirplus_r, or Client::_readdir_single_dirent_cb
[19:48] <gregaf> bchrisman: did you see my update on the tracker?
[19:48] <gregaf> readdir_single_dirent_cb is set up to break if you call it more than once, and the readdir_r_cb function usually calls 3 times (at least twice)
[19:48] <gregaf> not sure how it happened but it's just busted
[19:53] <bchrisman> ahh
[19:56] <bchrisman> readdirplus_r -> readdir_r_cb is invoking the callback multiple times when it's just supposed to be invoked once per readdirplus_r call.
[19:58] * sagelap (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:58] <bchrisman> must be the call inside the while loop? But it's also called for the cases treated specially as '.' and '..' in readdir_r_cb
[19:58] <bchrisman> should those special cases be returning?
[19:59] <gregaf> I don't know/remember how this code works, sorry
[19:59] <bchrisman> I'm guessing that fuse might not notice this bug, as fuse_add_direntry could probably be called multiple times.
[19:59] <gregaf> yeah, it's only libceph that uses that callback
[19:59] <gregaf> I'm not sure if it's because libceph is only supposed to return one entry at a time or what
[20:00] <bchrisman> yeah.. readdir_r passes in a single dirent and expects a single dirent out.
[20:00] <bchrisman> looks like those special cases need to have a return at the end of them?
[20:01] <bchrisman> right now it only returns if the callback returns < 0
[20:01] <gregaf> well, I don't think they can just return, all the other callbacks can handle more than one I think
[20:01] <gregaf> maybe an option that gets passed in?
[20:02] <bchrisman> hmm.. sounds like it's not quite readdir then... maybe we need a different function in here for different lookups?
[20:02] <sagelap> the readdirplus_r single_readdir callback only takes a single dirent and then returns -1 to signal no more
[20:02] <bchrisman> ahh so that's just signaling that it's done?
[20:03] <sagelap> yeah the if (c->full) return -1; means i already got one dentry, no more
[20:03] <sagelap> then readdir[plus]_r will return that one thing
[20:03] <bchrisman> so readdirplus_r & readdir_r return -1 when completing successfully?
[20:04] <sagelap> hmm don't think they're supposed to.
[20:04] <sagelap> this may be broken, nothing uses it currently :)
[20:04] <sagelap> oh i see
[20:04] <bchrisman> ahh okay.. so everything may be working just fine except the return code...
[20:04] <bchrisman> ?
[20:05] <cmccabe> bchrisman: I spotted a few strange return codes in librados before
[20:05] <sagelap> so probably instead of -1 the callback should return -ENOSPC or -ERANGE (whatever getdents does would be good) and readdir_r_cb should mask that error (but still stop)
[20:05] <cmccabe> bchrisman: like setxattr returns the number of xattrs set, whereas in Linux it's 0 on success
[20:06] <bchrisman> Client::setxattr?
[20:06] <bchrisman> ahh librados.
[20:06] <bchrisman> gotchya
[20:06] <cmccabe> bchrisman: well, rados' setxattr at least
[20:06] <cmccabe> bchrisman: I assume it's getting it from Client
[20:06] <bchrisman> yeah... optimally those should parallel the standard system call return codes.. but... :)
[20:07] <sagelap> yeah they definitely should as much as possible
[20:07] <gregaf> cmccabe: rados doesn't go through the client at all - it's built on top of Objecter, and the Client is too!
[20:08] <cmccabe> gregaf: heh, you're absolutely right
[20:08] <bchrisman> I'll fix the readdir* exposed callbacks and then see if I can get the callback and stuff to have return codes which make sense.. want to make sure I don't interfere with the cfuse code path... but the callback mechanism should make that okay
[20:08] <cmccabe> gregaf: I just got confused because there's something called RadosClient
[20:08] <cmccabe> gregaf: which pretty much every rados function goes through
[20:08] <cmccabe> gregaf: but that is completely separate from class Client
[20:10] <sagelap> bchrisman: the thing to test with cfuse is just that you get correct results doing ls on a directory with more than 4k of filename characters (i.e. the callback filling up its buffer at least once and continuing)
[20:11] <bchrisman> ahh ok
[20:12] <bchrisman> ahh looks like it almost gets there:
[20:12] <bchrisman> if (r < 0)
[20:12] <bchrisman> return r;
[20:12] <bchrisman> if (sr.full)
[20:12] <bchrisman> return 1;
[20:13] <bchrisman> looks like the intent was there, but return code will be -1 when the dirent is filled...
[20:13] <sagelap> yeah something like if (r == -ERANGE (or whatever)) return 0; if (r < 0) return r;
[20:13] <sagelap> yeah
[20:13] <bchrisman> cool
[20:14] <sagelap> er.. actually i'm thinking in readdir_r_cb, the if (r<0) check after the cb(..) call
[20:14] <sagelap> but i'm only half paying attention here, don't take my word for it :)
[20:15] <bchrisman> yeah.. no worries, I'll try for minimum mutilation and make sure readdirplus_r and readdir_r return the correct code
[20:15] <sagelap> cool
[20:15] <bchrisman> code seems to work for fuse.
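The return-code fix discussed above can be modelled in a few lines of Python 3. This is only a sketch of the idea, not the Ceph C++ code: the names SingleDirent, single_dirent_cb, readdir_cb and readdir_r are stand-ins, and -ERANGE is just one of the codes sagelap floated. The single-entry callback refuses a second entry with a distinct error, and the iterating function stops on that error but does not report it as a failure.

```python
import errno


class SingleDirent:
    """Holds at most one directory entry (stand-in for the one-dirent result)."""

    def __init__(self):
        self.entry = None
        self.full = False


def single_dirent_cb(state, entry):
    # Refuse a second entry with a distinct code instead of a bare -1.
    if state.full:
        return -errno.ERANGE
    state.entry = entry
    state.full = True
    return 0


def readdir_cb(entries, cb, state):
    """Feed entries to cb until it refuses; mask the 'full' code."""
    for entry in entries:
        r = cb(state, entry)
        if r == -errno.ERANGE:
            return 0      # callback is full: stop, but this is not an error
        if r < 0:
            return r      # any other negative value is a real error
    return 0


def readdir_r(entries):
    """Return (1, entry) if an entry was read, (0, None) at end of directory."""
    state = SingleDirent()
    r = readdir_cb(entries, single_dirent_cb, state)
    if r < 0:
        return r, None
    return (1 if state.full else 0), state.entry
```

A callback that can accept many entries, like the fuse one, never returns the "full" code until its buffer really is full, so the same loop serves both callers.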
[20:19] * df (davidf@dog.thdo.woaf.net) has joined #ceph
[20:33] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[20:34] * alexxy[home] (~alexxy@ has joined #ceph
[20:48] <sagewk> yehuda_hm: yep, it's still going strong. memory usage looks constant too after ~2.5 hours
[20:49] <sagewk> i'll do a bit more cleanup and push it to the tree. maybe fyodor can give it a go today
[20:49] <sagewk> not sure what time zone he's in
[20:54] * sagelap (~sage@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[20:55] * alexxy[home] (~alexxy@ Quit (Ping timeout: 480 seconds)
[20:58] * alexxy (~alexxy@ has joined #ceph
[21:04] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:16] <yehuda_hm> sagewk: that rbd change is a bit big for -rc7
[21:17] <sagewk> yeah.. it fixes a real bug, though, and it's the simplest solution that does so
[21:19] <sagewk> i made a superficial pass through but didn't follow all of the logic carefully...
[21:19] <sagewk> wanna take a final look and make sure things look right?
[21:32] <yehuda_hm> yeah
[21:37] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[21:55] <yehuda_hm> sagewk: pushed some more cleanup
[22:00] <sagewk> k
[22:23] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:29] * aliguori (~anthony@ has joined #ceph
[22:33] <sagewk> yehuda_hm: pushed proper patch to for-linus, look ok?
[22:35] <yehuda_hm> sagewk: yeah
[22:36] <sagewk> looks ok in your testing?
[22:39] <sagewk> does it look like the bio alignment stuff should have worked but didn't?
[22:40] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[22:43] <yehuda_hm> sagewk: what do you mean?
[22:44] <sagewk> before you were thinking the requests coming down should be aligned...
[22:44] <sagewk> was that just not the case, or did it not work as advertised?
[22:45] <yehuda_hm> my original thought for a fix was to make it send requests within block boundaries
[22:45] <yehuda_hm> I think I misunderstood the api
[22:45] <sagewk> the blk_queue_segment_boundary stuff looks unrelated
[22:45] <sagewk> yeah ok cool
[22:45] <yehuda_hm> if I understood correctly it was more about the memory allocations were aligned
[22:45] <yehuda_hm> or something like that
[22:46] <sagewk> yeah
[22:47] <yehuda_hm> I'd assume that such an api would be trivial, but apparently it wasn't
[22:51] <yehuda_hm> maybe for .40 or .41 we can make a bigger change and create a device hierarchy so that we never wait on to IOs for each request
[23:13] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[23:18] <cmccabe> tv: I have a python question that maybe you can help with
[23:18] <cmccabe> tv: what is the best type for storing binary data in python
[23:18] <Tv> shoot
[23:18] <Tv> ah
[23:18] <cmccabe> tv: I was thinking class array
[23:18] <cmccabe> tv: but don't have a good way of converting between POINTER(c_char)
[23:18] <Tv> what do you need to do with it?
[23:19] <Tv> like, just read/write whole, append bytes one by one, ...
[23:19] <Tv> the base answer is "use a string", except append is expensive and then the answer becomes "use a list of strings, ''.join(l) when you need the whole thing"
[23:19] <cmccabe> tv: I'd like my python wrapper for librgw to have a function acl_bin2xml
[23:20] <cmccabe> tv: that takes an array.array and sends it to the C library as POINTER(c_char)
[23:20] <Tv> you almost never should use any "array" in python
[23:20] <cmccabe> tv: and likewise, a function acl_xml2bin that takes a python string and returns an array.array with the bin
[23:20] <Tv> (numpy is the one exception i can think of)
[23:20] <cmccabe> tv: the problem is the binary data can contain nulls
[23:20] <cmccabe> tv: does the python string class handle that?
[23:21] <Tv> >>> print 'foo\0bar'
[23:21] <Tv> perfectly fine
[23:21] <cmccabe> that's unexpected
[23:21] <cmccabe> but useful here I guess
[23:21] <Tv> not really
[23:21] <Tv> python is not C
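Tv's two points, that Python strings carry embedded NULs without trouble and that a list of chunks joined once beats repeated appends, look like this in Python 3. The log is from the Python 2 era, where str was the byte string type; in Python 3 the equivalent type is bytes, which is what this sketch assumes.

```python
# Collect pieces in a list and join once, rather than growing a string
# with repeated concatenation.
chunks = []
for part in (b"foo", b"\0", b"bar"):
    chunks.append(part)
blob = b"".join(chunks)

# The length counts every byte, including the embedded NUL.
blob_len = len(blob)
```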
[23:22] * bbigras_ (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) Quit (Remote host closed the connection)
[23:24] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[23:24] <cmccabe> tv: the doc for ctypes.c_char_p has the enigmatic warning: "Represents the C char * datatype when it points to a zero-terminated string. For a general character pointer that may also point to binary data, POINTER(c_char) must be used. "
[23:24] <cmccabe> tv: but I can't figure out how to use this POINTER(c_char) thing
[23:24] <cmccabe> tv: I think that's the last missing piece
[23:24] <Tv> cmccabe: ctypes.c_char is not a python string
[23:24] * bbigras (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) has joined #ceph
[23:24] <cmccabe> tv: right, I understand
[23:24] <cmccabe> tv: but it still raises the question of how to convert between POINTER(c_char) and python strings
[23:24] * bbigras is now known as Guest805
[23:25] <Tv> >>> c_char_p("Hello, World")
[23:25] <Tv> c_char_p('Hello, World')
[23:25] <Tv> that'll function as char* to C
[23:25] <cmccabe> I understand that, but c_char_p doesn't help here
[23:25] <cmccabe> because the data can have nulls
[23:25] <Tv> well then you can't use C strings
[23:25] <cmccabe> "POINTER(c_char) must be used"
[23:25] <Tv> c_void_p?
[23:26] <Tv> what says that "must be used"?
[23:26] <cmccabe> the ctypes docs
[23:26] <cmccabe> I guess I could try c_void_p
[23:27] <Tv> ok i see it.. so you need a working example of how to use POINTER?
[23:27] <cmccabe> yep
[23:27] <Tv> gimme a few min
[23:28] <Tv> seems like ctypes.create_string_buffer(init_or_size[, size]) would do everything you need for you
[23:28] <Tv> you can either pass in a string, or just size of buffer to allocate
[23:30] <Tv> that gives you an array of c_char, and then in the function call you should be able to just say buf.byref()
[23:30] * alexxy (~alexxy@ has joined #ceph
[23:30] <cmccabe> that looks promising
[23:30] <cmccabe> I'm a little suspicious about what it does about NULLs
[23:31] <cmccabe> but I'll just have to experiment
[23:31] <Tv> well the .byref() can't be a null..
[23:31] <Tv> if you mean nils, as in \0 in the string, i'm testing that too
[23:31] <Tv> but it really allocates a buffer as big as you say, etc
[23:32] <cmccabe> my first instinct would be to explicitly tell it to allocate a buffer of len(string) and then copy the full string in
[23:32] <cmccabe> because if it tries to create a C string out of the python string, everything after a null would be lost
[23:33] <Tv> accessing buf.value does the 0-terminate thing, accessing buf.raw is the raw contents
[23:33] <cmccabe> interesting
[23:33] <Tv> s = ctypes.create_string_buffer('foo\0bar')
[23:33] <Tv> print repr(s.raw)
[23:34] <Tv> 'foo\x00bar\x00'
[23:34] <cmccabe> I guess that's a good solution then
[23:34] <cmccabe> thanks
[23:34] <Tv> gets an extra \0 at the end that way, though
[23:34] <cmccabe> do you think the string constant had an implicit \0 at the end?
[23:34] <cmccabe> that would be the case in C
[23:35] <Tv> well it got mangled into a c string, and then it did
[23:35] <Tv> s = ctypes.create_string_buffer(7)
[23:35] <cmccabe> well... a C string is NULL-terminated, that thing wasn't
[23:35] <Tv> s[:] = 'foo\0bar'
[23:35] <Tv> print repr(s.raw)
[23:35] <Tv> 'foo\x00bar'
[23:35] <Tv> that is more explicit and avoids it
[23:35] <cmccabe> so really create_string_buffer is a very poor name
[23:35] <Tv> well it's more buffer than string ;)
[23:36] <cmccabe> yeah
[23:37] <Tv> and you can just use the above s and it'll act like char* to C
[23:37] <cmccabe> so on the whole this is still a little bit annoying, since presumably callers will want to read a string from a file
[23:37] <cmccabe> and then pass that to my function
[23:37] <Tv> libc.printf("test %s\n", s)
[23:38] <cmccabe> so I can use create_string_buffer, but I have to remember that if the argument is a string, I'll get an extra 0 at the end
[23:38] <Tv> do the assignment via the s[:] trick
[23:38] <Tv> then you don't get \0 at end
[23:38] <cmccabe> so that coerces the input data into a slice?
[23:38] <cmccabe> a slice of characters probably
[23:38] <Tv> not.. quite
[23:39] <Tv> python doesn't really coerce like that
[23:39] <cmccabe> I mean it creates a new object out of the old
[23:39] <cmccabe> coerce is a bad way of putting it I guess
[23:39] <Tv> it just seems that the create_string_buffer init and slice set methods process input differently
[23:39] <cmccabe> strings are immutable
[23:40] <cmccabe> seems workable... will try in a bit
[23:40] <cmccabe> thanks
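The ctypes behaviour worked out above can be reproduced in Python 3 roughly as follows. This is a sketch using bytes where the log uses Python 2 strings; the variable names are illustrative. One detail differs from the conversation: byref is a module-level function, called as ctypes.byref(buf), not a method on the buffer, and an array of c_char can usually be passed to a char * parameter directly.

```python
import ctypes

data = b"foo\0bar"

# Initialising from bytes appends a terminating NUL: 8 bytes for 7 of payload.
with_nul = ctypes.create_string_buffer(data)

# Allocating by size and slice-assigning keeps exactly the payload.
exact = ctypes.create_string_buffer(len(data))
exact[:] = data

# .value stops at the first NUL, .raw returns the whole buffer.
truncated = with_nul.value
whole = with_nul.raw

# A POINTER(c_char) view of the same memory, as a C function taking
# (char *buf, size_t len) would receive it.
ptr = ctypes.cast(exact, ctypes.POINTER(ctypes.c_char))

# Reading back needs an explicit length, because the data contains a NUL.
round_trip = ctypes.string_at(ptr, len(exact))
```

Passing the length alongside the pointer is what makes binary data safe here; any API that relies on NUL termination will see only the first three bytes.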

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.