#ceph IRC Log

IRC Log for 2011-05-12

Timestamps are in GMT/BST.

[0:07] <Tv> alright so 1) autotest chdirs all over the place, in several operations 2) autotest helpers can't do things like "run this subprocess with this as cwd"
[0:07] <Tv> time to write my own helpers it seems
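The missing helper amounts to scoping the chdir to the child process rather than the parent; the classic pattern, sketched here in C (in the Python test code this is just the cwd argument to subprocess):

    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Run argv[] as a child whose working directory is 'dir';
     * the parent's cwd is never touched. */
    int run_in_dir(const char *dir, char *const argv[])
    {
        pid_t pid = fork();
        if (pid == 0) {                  /* child: chdir affects only us */
            if (chdir(dir) == 0)
                execvp(argv[0], argv);
            _exit(127);                  /* chdir or exec failed */
        }
        int status;
        if (pid < 0 || waitpid(pid, &status, 0) < 0)
            return -1;
        return status;
    }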
[0:08] <cmccabe> are you using threads or what
[0:08] <Tv> gevent, in this case
[0:08] <cmccabe> I guess either way chdir is a pain
[0:08] <Tv> yup
[0:08] <Tv> well the way autotest is written, it also chdirs in utility functions, and "tries to" chdir back
[0:09] <Tv> which is a recipe for pain too
[0:09] <Tv> i already found a utility function that forgets to chdir back ;)
[0:09] <Tv> in the meanwhile, the osd killing test works quite often, and seems to be surfacing bugs
[0:09] <Tv> as in, bugs in ceph, not bugs in test framework
[0:16] <Tv> # Give it enough time to crash if it's going to (it shouldn't).
[0:16] <Tv> time.sleep(5)
[0:16] <Tv> the more i read, the more i weep
[0:17] <cmccabe> probably they think that will catch core dumps
[0:17] <Tv> that runs an ssh to worker node
[0:18] <cmccabe> but really they should be using a usermode core dump helper
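A usermode core dump helper is a program the kernel pipes the core image to, registered via /proc/sys/kernel/core_pattern; a minimal sketch (the install path and naming scheme are made up):

    #include <stdio.h>

    /* Register with:
     *   echo '|/usr/local/bin/core-helper %e %p' > /proc/sys/kernel/core_pattern
     * The kernel then runs this helper with the core image on stdin. */
    int main(int argc, char **argv)
    {
        const char *exe = argc > 1 ? argv[1] : "unknown";
        const char *pid = argc > 2 ? argv[2] : "0";
        char path[256];
        snprintf(path, sizeof(path), "/var/cores/core.%s.%s", exe, pid);

        FILE *out = fopen(path, "w");
        if (!out)
            return 1;
        int c;
        while ((c = getchar()) != EOF)   /* copy the core from stdin */
            fputc(c, out);
        fclose(out);
        return 0;
    }

A test harness can then check the dump directory instead of sleeping and hoping the crash has already happened.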
[0:22] <sagewk> cmccabe, yehuda_hm: are the canned acl types supported yet? http://tracker.newdream.net/issues/1081
[0:23] <Tv> sagewk: s3 tests set canned acls via boto, and enforce they work right
[0:23] <sagewk> hrm
[0:23] <Tv> sagewk: so it's probably differences in wire protocol that trigger the 500
[0:23] <cmccabe> sagewk: obsync doesn't treat canned acls specially
[0:24] <cmccabe> sagewk: but it doesn't need to
[0:24] <cmccabe> sagewk: although it could as a performance optimization
[0:24] <cmccabe> sagewk: as tv mentioned, rgw supports canned ACLs, and there are at least some tests for that. I don't know how good the coverage is
[0:24] <sagewk> k they're putting more detail in the bug
[0:27] <Tv> i'm tcpdumping boto traffic to compare
[0:29] <Tv> but any 500 is still a bug ;)
[0:34] <Tv> bleh rgw is refusing to auth, something changed in the uid/access_key stuff again, let's see..
[0:35] <cmccabe> tv: I'm having some auth troubles myself
[0:35] <cmccabe> tv: still haven't gotten test-obsync.py working with rgw again
[0:35] <cmccabe> tv: though it works fine with amazon
[0:38] <cmccabe> tv: could be a few lingering buckets with the wrong ACLs on them
[0:38] <cmccabe> tv: I just got the test to pass after deleting my old buckets...
[0:39] <Tv> i'm running against local vstart.sh thingie that gets wiped clean
[0:40] <Tv> there we go, that worked
[0:40] <Tv> latest master was better
[1:01] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[1:20] <sagewk> cmccabe: can you look at 1081? i think yehuda_hm is asleep now
[1:20] <cmccabe> ok
[1:21] <bchrisman> hmm.. is there a simple macro I'm just not finding for stringifying a struct stat?
[1:21] <cmccabe> bchrisman: not to my knowledge
[1:21] <bchrisman> ahh yippie :)
[1:22] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:22] <cmccabe> tv: there is a lot of spew in 1081
[1:22] <cmccabe> tv: did you identify the specific request that failed?
[1:23] <cmccabe> tv: is it "x-amz-acl: private" that is failing?
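For reference, a canned ACL is just a header on the request; a representative (illustrative, not captured) boto PUT looks like:

    PUT /test-bucket/ HTTP/1.1
    Host: s3.amazonaws.com
    Date: Thu, 12 May 2011 00:23:00 GMT
    x-amz-acl: private
    Authorization: AWS <access_key>:<signature>

Whatever rgw trips over has to be in details like these, since boto's own requests pass the s3 tests.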
[1:23] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[1:37] * greglap (~Adium@198.228.209.88) has joined #ceph
[1:39] <cmccabe> sagewk: can we get the log files from this failure?
[1:39] <sagewk> from the rgw side on the live cluster?
[1:39] <cmccabe> yes
[1:39] <sagewk> yeah
[1:40] <cmccabe> really just the last few lines
[1:40] <sagewk> or you can probably run the same request against your cluster... it's a small bit of perl
[1:40] <sagewk> that's probably easier to test/verify
[1:40] <cmccabe> I don't know what libraries they're using
[1:41] <cmccabe> I guess theoretically I could look at the ndn source and figure it out
[1:41] <cmccabe> but that could take a long time
[1:41] <cmccabe> can't we just see logs?
[1:42] <sagewk> yeah
[1:42] <cmccabe> all the 500 errors I ever got from radosgw were caused by segfaults
[1:42] <sagewk> which log is it?
[1:42] <sagewk> wodrich is also sending you a bit of perl to reproduce
[1:42] <sagewk> ah ok
[2:30] * greglap (~Adium@198.228.209.88) Quit (Read error: Connection reset by peer)
[3:00] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:02] * MK_FG (~MK_FG@219.91-157-90.telenet.ru) has joined #ceph
[3:08] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[3:09] <bchrisman> libceph doesn't like dots: http://pastebin.com/uA9LLuu3
[3:10] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[3:15] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:27] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[4:04] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[4:27] <sagewk> bchrisman: hmm, the culprit is probably Client::_lookup
[5:16] * Guest598 (~matthew@pool-96-228-59-187.rcmdva.fios.verizon.net) has joined #ceph
[5:40] <bchrisman> sagewk: I'll take a peek at that
[5:47] <bchrisman> client logs don't give me an obvious culprit.. debug 20 gives: http://pastebin.com/GVp5URnB
[5:48] <bchrisman> for the two lookups of fs0/buildshare and /.
[5:48] <bchrisman> one has a _lookup result of 0 and the failing one has a _lookup result of -1
[5:48] <bchrisman> err -2
[5:50] <bchrisman> though this might be suspicious: 2011-05-12 03:41:54.724396 7f22cabd9710 client4563 hrm is_target=0 is_dentry=1
[5:50] <bchrisman> vs. is_target=1 for the fs0/buildshare
[5:52] <bchrisman> _lookup implements a special case for '..', but not for '.'
[5:52] <bchrisman> I'll add a special case for '.'
[5:54] <bchrisman> sets target = dir and follows 'done' label
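A sketch of that special case, modeled on the '..' handling bchrisman mentions (the variable names and the exact shape of Client::_lookup are assumptions):

    /* in Client::_lookup: resolve '.' to the directory being searched in,
     * mirroring the existing special case for '..' */
    if (strcmp(name, ".") == 0) {
        *target = dir;   /* '.' is the directory itself */
        goto done;       /* take the same exit path the '..' case uses */
    }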
[6:42] * bbigras (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) has joined #ceph
[6:42] * bbigras is now known as Guest600
[6:46] * Guest356 (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[6:49] * alexxy (~alexxy@79.173.81.171) Quit (Read error: Connection reset by peer)
[6:49] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[7:01] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[7:02] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[7:08] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[7:11] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[7:26] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[7:28] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[7:35] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[7:39] * Meths (rift@2.25.214.179) Quit (Ping timeout: 480 seconds)
[8:00] * MK_FG (~MK_FG@219.91-157-90.telenet.ru) Quit (Ping timeout: 480 seconds)
[8:13] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[8:50] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[13:30] * atg (~atg@please.dont.hacktheinter.net) Quit (Remote host closed the connection)
[13:30] * atg (~atg@please.dont.hacktheinter.net) has joined #ceph
[14:53] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[14:57] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[15:00] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:00] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[15:01] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:05] * morse (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[15:10] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[15:25] * murb (~murb@red.danu.be) Quit (Remote host closed the connection)
[15:36] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[15:36] * murb (~murb@red.danu.be) has joined #ceph
[17:23] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:26] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[17:27] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:27] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit ()
[17:39] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:43] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[17:51] <sagewk> yehuda_hm: around?
[17:56] <yehuda_hm> sagewk: yes
[17:57] <sagewk> have you looked at http://tracker.newdream.net/issues/1083 ?
[17:57] * greglap (~Adium@198.228.209.85) has joined #ceph
[17:58] <yehuda_hm> oh, didn't notice that
[17:59] <yehuda_hm> I'm not sure I get the first problem
[17:59] <yehuda_hm> can't we just list all the logs per date
[17:59] <yehuda_hm> if you remove the bucket it doesn't remove the logs
[18:00] <sagewk> hmm yeah that would work too
[18:00] <sagewk> is bucket creation/deletion logged in the bucket log object?
[18:00] <yehuda_hm> probably
[18:01] <sagewk> then it means adding a log list function
[18:01] <yehuda_hm> rados -p .log ls | grep ..
[18:03] <sagewk> are the log objects named by bucket name or by pool id?
[18:03] <yehuda_hm> by bucket name
[18:04] <sagewk> is that safe? i can't remember the restrictions on bucket names
[18:04] <yehuda_hm> it's <date>-<bucket>
[18:05] <yehuda_hm> hmm.. I'm not sure about the restrictions either.. but you can't do anything bad because of the date prefix anyway
[18:13] <sagewk> i'm mainly worried about long bucket names..
[18:13] <yehuda_hm> max bucket name is 255 characters
[18:15] <sagewk> ok cool
[18:20] <sagewk> yeah that'll work. can you confirm bucket creation/deletion is logged?
[18:42] <yehuda_hm> sagewk: yeah, it's logged
[18:44] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[18:44] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:46] <sagewk> yehuda_hm: for list objects, does rgw gather the entire result set before sending the result? or does it stream it?
[18:47] <yehuda_hm> sagewk: I think it streams it
[18:47] * greglap (~Adium@198.228.209.85) Quit (Read error: Connection reset by peer)
[18:47] <yehuda_hm> i'll verify
[18:48] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:49] <sagewk> i fixed the pgls to not miss objects, but there is still the possibility of dups. i wonder if rgw should filter those out
[18:51] <yehuda_hm> i'm not sure it's even possible
[18:51] <yehuda_hm> the client sets a marker and a limit
[18:52] <sagewk> oh it can span multiple requests too?
[18:52] <yehuda_hm> yeah
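The paging yehuda_hm describes is plain S3 list-objects: the client pages with max-keys and resumes from a marker. Successive requests look roughly like (bucket and key names invented):

    GET /mybucket/?max-keys=1000 HTTP/1.1
    GET /mybucket/?max-keys=1000&marker=last-key-of-previous-page HTTP/1.1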
[18:53] <sagewk> huh.. it always lists objects in alphabetical order apparently
[18:53] <yehuda_hm> yeah..
[18:53] <yehuda_hm> it actually reads everything
[18:54] <sagewk> rgw does you mean?
[18:54] <yehuda_hm> and rebuilds the result starting at the marker
[18:54] <yehuda_hm> yeah
[18:54] <sagewk> oh i see
[18:54] <sagewk> in that case it should be trivial to filter dups
[18:54] <sagewk> it probably already puts them in a set<> or something?
[18:55] <yehuda_hm> I think that librados reads the data in parts?
[18:55] <yehuda_hm> yep
[18:55] <yehuda_hm> so there aren't dups
[18:55] <sagewk> ok cool
[18:55] <yehuda_hm> we still work too hard to achieve that though
[18:55] <sagewk> yeah
[18:56] <yehuda_hm> we should be able to filter that at the origin
[18:59] <yehuda_hm> I mean.. maybe not the dups
[18:59] <bchrisman> Client::getxattr has been changed to remove the 'const' from the 'const char *path' argument.
[19:00] <bchrisman> seems like generally expected behavior is 'const char *path' as input argument to getxattr
[19:00] <bchrisman> is there a reason that would be changed for Client and libceph?
[19:02] <bchrisman> I can't think of a legitimate reason for the client class modifying the path argument, and samba's vfs layer likes to compile with that argument as a const... meaning warnings.
[19:05] <yehuda_hm> bchrisman: it should be const
[19:05] <bchrisman> hmm.. maybe it changes and gets changed back but rebase is choking on the first.
[19:05] <bchrisman> will check on that a little bit later.
[19:07] <yehuda_hm> I actually don't see where it was removed
[19:07] <sagewk> yehuda_hm: any ideas on http://tracker.newdream.net/issues/1086 ?
[19:08] <yehuda_hm> hmm
[19:09] <sagewk> i did some other stuff between iozone completion and that dd, though, so the cache pages may have been flushed and the data reread.
[19:09] <bchrisman> yehuda_hm: ack.. sorry.. my bad.. I was looking at a dead source tree..
[19:10] <yehuda_hm> sagewk: yeah.. is it easily reproducible?
[19:10] <sagewk> i just did it the once so far. it's pretty slow under uml on fatty (took almost an hour maybe?)
[19:11] <sagewk> i'll try it again
[19:14] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:14] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:18] <yehuda_hm> sounds like a page cache issue
[19:26] * yehuda_hm (~yehuda@bzq-79-178-112-50.red.bezeqint.net) Quit (Read error: Connection timed out)
[19:26] * yehuda_hm (~yehuda@bzq-79-178-112-50.red.bezeqint.net) has joined #ceph
[19:30] <Tv> meeting?
[19:31] <Tv> that's a yes
[19:32] <sagewk> yep!
[19:57] <sagewk> yehuda_hm: if we're listing the rgw log objects directly from the store, maybe radosgw_admin should let you parse a raw object
[19:58] <sagewk> right now they have to list objects, parse the names to get (date,bucket) pairs, then feed that to radosgw_admin log show
[19:58] * rsharpe (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:59] * dwm (~dwm@vm-shell4.doc.ic.ac.uk) has joined #ceph
[20:02] <bchrisman> what does the '_cb' mean in the various client methods like readdir_r_cb(), and where is the cb() function referenced in client: 'int r = cb(p, &de, &st, -1, next_off);'
[20:03] <yehuda_hm> callback
[20:03] <dwm> I'd expect _cb would simply be 'callback'
[20:03] <yehuda_hm> sagewk: so we can just tell it to show log for specific object
[20:06] <sagewk> yeah
[20:06] <sagewk> instead of separately specifying date and bucket
[20:06] <sagewk> which also will let us adjust the log object names however we want.. like $year-$month-$day-$hour-$bucket or something
[20:06] <sagewk> for more frequent log scraping
[20:10] <bchrisman> is there a good way to test/assert that a callback pointer being passed in is... well.. 'proper'.. readdir_r_cb is getting passed an add_dirent_cb_t *.. it dereferences/executes it, and I'm getting a crash in there.
[20:10] <cmccabe> so I wondered if anyone had any thoughts about how we should support running (lib)rgw through nginx
[20:11] <Tv> fastcgi is a standard interface...
[20:11] <cmccabe> so fastcgi is what we have now
[20:11] <cmccabe> which involves separate processes per request
[20:11] <sagewk> the concern is that it won't be fast or scalable enough
[20:11] <cmccabe> so if we decide that fastcgi is all we'll ever need, the architecture doesn't need to change.
[20:11] <yehuda_hm> I think cmccabe is aiming at the software architecture
[20:11] <Tv> all fcgi means is that nginx/whatever needs to act as a proxy
[20:12] <Tv> you can have N frontends and M backends, across multiple machines
[20:12] <cmccabe> can fcgi keep persistent rgw processes around?
[20:12] <Tv> cmccabe: that's the whole point of fcgi
[20:12] <sagewk> this is a bit further out.. we're going to see what the load and performance looks like and based on that decide whether to invest in something like an nginx or lighty module
[20:12] <yehuda_hm> right
[20:12] <cmccabe> I think yehuda was enthusiastic about modularizing rgw a little bit more
[20:13] <cmccabe> for example, he wanted to split the frontend and backend into very separate modules, possibly separate libraries
[20:13] <Tv> sagewk: here's my strong recommendation *against* custom lighty modules in this modern world
[20:13] <sagewk> tv: lighty specifically you mean?
[20:13] <Tv> sagewk: yes
[20:13] <cmccabe> tv: what about nginx?
[20:13] <Tv> as in, nginx is not quite as hard to code for, though just about as pointless
[20:13] <cmccabe> tv: are you just making a general comment about lighty vs. nginx?
[20:13] <Tv> there's no scalability bottleneck in fcgi
[20:14] <Tv> there is a bit of extra latency, but many places will choose to have a load balancer anyway
[20:14] <yehuda_hm> cmccabe: I was making a point that librgw currently doesn't need to link everything in if it just parses xml
[20:14] <Tv> cmccabe: i'm saying making general-purpose lighty modules is painful
[20:14] <sagewk> i think the concern is that we have a radosgw process for every concurrent request, whereas a native nginx (or whatever) module will not duplicate all the process state
[20:14] <cmccabe> yehuda_hm: that is true, but stripping out the non-xml parts would take time
[20:15] <cmccabe> yehuda_hm: and I want to make sure that it's what we want
[20:15] <Tv> sagewk: that's not an fcgi limitation...
[20:15] <Tv> sagewk: not sure what would force you to do that, currently
[20:15] <sagewk> apache
[20:15] <gregaf> bchrisman: not especially, I'd probably add debug to print out the pointer addresses and see what you get...
[20:15] <yehuda_hm> cmccabe: librgw at this point should only include rgw_acl.cc and librgw.cc, you don't need to split any code
[20:15] <sagewk> but the only fastcgi implementation that behaves wrt 100 continue and other oddities is apache's mod_fastcgi
[20:15] <Tv> 100 continue is pretty darn hard to get right anywhere
[20:15] <Tv> :(
[20:16] <sagewk> btw: http://www.infoq.com/news/2011/05/Google-Storage-for-Developers
[20:16] <yehuda_hm> radosgw{,_admin} will link against it too
[20:16] <gregaf> Tv: too bad it's required for S3 compatibility
[20:16] <cmccabe> we don't really have an option to not get 100 continue right
[20:16] <cmccabe> so that would need to be addressed in any potential move
[20:16] <Tv> gregaf: yeah just saying e.g. varnish can't proxy 100 continue right, last i checked
[20:17] <bchrisman> gregaf: ok.. that's what I was thinking.
[20:17] <cmccabe> sagewk: I thought gmailfs was my google data storage :)
[20:18] <cmccabe> oh, also, Google has added Golang to Google App Engine now.
[20:20] <Tv> i wonder if apache supports the fcgi-over-tcp variant
[20:20] <Tv> that takes the worker pool management out of apache's hands
[20:21] <Tv> or perhaps rgw just needs to speak http directly, and if you want lb/proxy/etc you just put a reverse proxy in
[20:21] <cmccabe> tv: adding another layer of indirection just seems ugly
[20:21] <Tv> can't find much info; frankly fcgi is an abandoned red-headed bastard child of cgi
[20:21] <cmccabe> tv: <-- replying to the fcgi-over-tcp comment
[20:21] <Tv> cmccabe: there's no addition
[20:21] <cmccabe> tv: or I guess another layer of latency
[20:21] <Tv> cmccabe: just shifting the pool management to a dedicated worker
[20:22] <cmccabe> tv: since the indirection was already there
[20:22] <Tv> the difference between tcp over localhost and unix domain sockets is tiny enough to not show up in benchmarks
[20:22] <cmccabe> tv: yeah-- for localhost
[20:22] <cmccabe> tv: not for remote host
[20:22] <cmccabe> tv: pretty key difference
[20:22] <Tv> so don't run it like that if you don't want it
[20:23] <cmccabe> tv: it's well known that linux's TCP-over-localhost is well-optimized and competitive with UNIX domain sockets.
[20:23] * MK_FG (~MK_FG@188.226.51.71) Quit (Read error: Operation timed out)
[20:24] <sagewk> the easiest path may just be to fix 100 continue for nginx's fcgi
[20:24] <cmccabe> so what I'm hearing is that we should keep librgw a little more minimal for now, and push off some of the decisions about what goes into what library into the future
[20:24] <sagewk> but this isn't necessarily a problem yet, so no need to worry just yet.
[20:24] <sagewk> cmccabe: that works for me. for now all we need is the acl bits.
[20:25] <Tv> sagewk: btw, for nginx you're gonna need that external worker pool manager
[20:25] <cmccabe> I think his main objection was that rgw didn't link to librgw
[20:25] <Tv> sagewk: if you have it, and apache actually supports fcgi over tcp, you already have the result you wanted..
[20:26] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[20:26] <Tv> ahh FastCgiExternalServer /webroot/http -host 192.168.1.10:9000
[20:26] <Tv> so that's what you need, no need for nginx quite yet
[20:27] <Tv> just need the fcgi worker management parts -- not sure what c++ lib you're using, and whether it knows how to do that
[20:27] <sagewk> tv: libfcgi
[20:29] <Tv> DLLAPI int FCGX_OpenSocket(const char *path, int backlog);
[20:29] <Tv> * path is the Unix domain socket (named pipe for WinNT), or a colon
[20:29] <Tv> * followed by a port number. e.g. "/tmp/fastcgi/mysocket", ":5000"
[20:29] <Tv>
[20:29] <Tv> it can listen on tcp by itself, apache can talk to it, no need to suffer apache pool management
[20:29] <Tv> that doesn't magically make it use threads, but that's how you can make that happen
[20:31] <sagewk> cool.
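A minimal standalone libfcgi worker of the kind Tv describes, listening on TCP so Apache's FastCgiExternalServer (or any fcgi-speaking frontend) can reach it; the port and response body are illustrative:

    #include <fcgiapp.h>

    int main(void)
    {
        FCGX_Init();
        /* ":5000" means TCP port 5000; a filesystem path would mean
         * a unix domain socket instead */
        int sock = FCGX_OpenSocket(":5000", 128);

        FCGX_Request req;
        FCGX_InitRequest(&req, sock, 0);

        /* one worker loop; run several threads or processes of this
         * to get a pool */
        while (FCGX_Accept_r(&req) >= 0) {
            FCGX_FPrintF(req.out,
                         "Status: 200 OK\r\n"
                         "Content-Type: text/plain\r\n\r\n"
                         "hello from the fcgi worker\n");
            FCGX_Finish_r(&req);
        }
        return 0;
    }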
[20:38] <sagewk> yehuda_hm: i think i found the rbd issue
[20:39] <yehuda_hm> sagewk: having a google compatible shim should be easier than adding swift support
[20:39] <sagewk> see the last update in the bug, and look at the order the replies came in
[20:39] <sagewk> i think rbd is completing the read after the first osd response instead of after both
[20:41] <yehuda_hm> sagewk: what log are you looking at
[20:41] <yehuda_hm> ?
[20:53] <yehuda_hm> sagewk: not sure about the case where rc == -ENOENT at rbd_req_cb().. bytes is set before from op->extent.length
[20:57] * neurodrone (~neurodron@dhcp215-064.wireless.buffalo.edu) has joined #ceph
[20:57] <yehuda_hm> oh, nm
[21:00] <sagewk> yehuda_hm: this fatty:/home/sage/ceph/src/out/osd.0
[21:06] <yehuda_hm> sagewk: did you refer to a specific response?
[21:07] <sagewk> client4311.1:430307 and client4311.1:430308
[21:07] <sagewk> the 430308 response came back first, and i think rbd completed the whole read then (before 430307 reply arrived)
[21:08] <sagewk> which is why iozone failed with bad data, but my dd right after that showed good data
[21:16] <yehuda_hm> oh, I see
[21:16] <yehuda_hm> probably something with the bio split doesn't work
[21:16] <sagewk> yeah
[21:16] <sagewk> specifically the completion
[21:20] <sagewk> is that enough to go on? having trouble reproducing the split bio from userspace
[21:20] <yehuda_hm> hopefully it's enough
[21:21] <yehuda_hm> probably the culprit
[21:26] * Meths (rift@2.25.214.205) has joined #ceph
[21:57] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[22:07] <yehuda_hm> sagewk: I think the problem is that we call blk_end_request() for each chunk, and it's the wrong api function to call as it seems that it implies ordering
[22:07] <yehuda_hm> there's blk_update_request() that should be called instead
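The API distinction yehuda_hm is drawing, in sketch form; rq, error and bytes stand in for rbd's real completion state, and how rbd would decide the whole request is done is left open here:

    /* completes 'bytes' from the head of rq: only correct if the
     * chunks of the request finish in order */
    blk_end_request(rq, error, bytes);

    /* updates the completion accounting without ending the request,
     * so the driver can finish rq once every chunk has reported in */
    bool still_pending = blk_update_request(rq, error, bytes);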
[22:43] <Tv> status update on the osd kill tests: autotest is still problematic, but definitely seeing ceph bugs too.. trying to get things reproducible
[22:44] * neurodrone (~neurodron@dhcp215-064.wireless.buffalo.edu) Quit (Quit: zzZZZZzz)
[23:06] <Tv> sometimes a replacement osd, started well >=60 sec after the previous instance exited, immediately exits with status 1, claiming that lock_fsid failed
[23:06] <Tv> but i have clear indication of the previous osd exiting, and i sleep 60 secs in between :(
[23:09] * neurodrone (~neurodron@dhcp215-064.wireless.buffalo.edu) has joined #ceph
[23:10] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[23:37] <sage> tv: the locking stuff happens sometimes.. the kernel doesn't free up the lock state until the process goes away, and that can take a while sometimes. like if there are in-progress syscalls, or pages swapped out, or.. something, not really sure what.
[23:37] <sage> tv: service ceph restart osd.foo frequently won't actually start due to that error tho
[23:37] <Tv> there can't be in-progress syscalls if the process exited
[23:37] <Tv> the lock is documented to go away on fd close
[23:38] <gregaf> sage: is the MDS supposed to let the client have Fc/b caps if they don't have Fr/w?
[23:38] <Tv> i have this feeling we're not understanding the whole picture, currently
[23:38] <Tv> digging into it
[23:39] <sage> hmm. well not sure what it is then. but the init script 'stop' waits for /proc/$pid to go away, and we frequently get lock failures after that. maybe the procfs state is the wrong thing to wait for there.
[23:39] <Tv> sagewk: well in my case the parent process does waitpid, so i'm really, really, sure that the process is dead at that point
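The semantics Tv is relying on, sketched under the assumption that lock_fsid takes a POSIX record lock via fcntl (the path and function name here are illustrative):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int lock_fsid_sketch(const char *path)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;
        struct flock fl;
        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;    /* exclusive */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;           /* whole file */
        /* Fails with EAGAIN/EACCES while another process holds the
         * lock: the "lock_fsid failed" case. POSIX drops the lock when
         * the holder closes any fd on the file, or when it exits. */
        if (fcntl(fd, F_SETLK, &fl) < 0) {
            close(fd);
            return -1;
        }
        return fd;  /* keep this fd open to keep holding the lock */
    }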
[23:39] <sage> gregaf: yes that happens sometimes. like when you stat a file, you can't write, but you don't have to flush all your dirty data
[23:39] <sage> ok nm then :)
[23:40] <gregaf> ah, that makes sense
[23:40] <sage> just got access to coverity's linux kernel scan. a couple dozen ceph items in there
[23:41] <sage> yehuda_hm: there?
[23:42] <yehuda_hm> yeah
[23:42] <yehuda_hm> I'm still trying to work that out
[23:43] <yehuda_hm> the problem is that just using the blk_end_request() function is apparently a no-go
[23:43] <yehuda_hm> I'm trying to avoid having to send split requests in the first place, by setting the queue parameters
[23:44] <yehuda_hm> but I'm having trouble getting it to work
[23:44] <sage> can't just wait for both acks before calling blk_end_request?
[23:45] <yehuda_hm> that'll require a bigger change
[23:46] <yehuda_hm> currently we let the rq tracking the original request length
[23:46] <yehuda_hm> also, there are other issues that might arise
[23:47] <yehuda_hm> like, what if we sent a request and couldn't allocate a second one
[23:47] <yehuda_hm> so we'd rather just not have to send two requests..
[23:47] <yehuda_hm> and I think that's how it's supposed to work
[23:48] <yehuda_hm> there's a set of blk_queue functions that set limits on the queue: blk_queue_max_segment_size(), blk_queue_segment_boundary(), etc.
[23:52] <sage> yeah that sounds better :)
[23:56] <yehuda_hm> if only it worked..
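The queue knobs yehuda_hm names, in sketch form; the intent is that no single request can span two RADOS objects (the 4 MB object size, the boundary arithmetic, and the wrapper name are assumptions):

    #include <linux/blkdev.h>

    static void rbd_constrain_queue(struct request_queue *q)
    {
        unsigned int obj_bytes = 4 * 1024 * 1024;   /* assumed object size */

        /* no single segment larger than one object */
        blk_queue_max_segment_size(q, obj_bytes);
        /* segments must not straddle an object-sized boundary */
        blk_queue_segment_boundary(q, obj_bytes - 1);
        /* cap a request at one object's worth of 512-byte sectors */
        blk_queue_max_hw_sectors(q, obj_bytes / 512);
    }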
[23:58] * neurodrone (~neurodron@dhcp215-064.wireless.buffalo.edu) Quit (Quit: zzZZZZzz)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.