#ceph IRC Log


IRC Log for 2011-07-07

Timestamps are in GMT/BST.

[0:02] <cmccabe> back
[2:16] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[3:00] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[3:28] * Tv (~Tv|work@ip-64-111-111-107.dreamhost.com) Quit (Ping timeout: 482 seconds)
[3:30] * yoshi (~yoshi@p15251-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:34] * joshd (~joshd@ip-64-111-111-107.dreamhost.com) Quit (Quit: Leaving.)
[6:08] * jbd (~jbd@ks305592.kimsufi.com) has left #ceph
[6:08] * jbd (~jbd@ks305592.kimsufi.com) has joined #ceph
[7:31] * jbd (~jbd@ks305592.kimsufi.com) has left #ceph
[7:31] * jbd (~jbd@ks305592.kimsufi.com) has joined #ceph
[8:29] * hijacker (~hijacker@ Quit (Quit: Leaving)
[8:29] * hijacker (~hijacker@ has joined #ceph
[8:51] * gregorg_taf (~Greg@ Quit (Quit: Quitte)
[8:51] * gregorg (~Greg@ has joined #ceph
[12:32] * yoshi (~yoshi@p15251-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:55] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[13:15] * hijacker (~hijacker@ Quit (Quit: Leaving)
[13:15] * hijacker (~hijacker@ has joined #ceph
[14:02] * Meths_ (rift@ has joined #ceph
[14:09] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[14:57] * alexxy[home] (~alexxy@ Quit (Ping timeout: 480 seconds)
[17:53] * greglap (~Adium@ has joined #ceph
[17:54] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:08] * Tv (~Tv|work@ip-64-111-111-107.dreamhost.com) has joined #ceph
[18:13] * joshd (~jdurgin@adsl-99-6-189-121.dsl.irvnca.sbcglobal.net) has joined #ceph
[18:33] * jbd (~jbd@ks305592.kimsufi.com) has left #ceph
[18:33] * joshd (~jdurgin@adsl-99-6-189-121.dsl.irvnca.sbcglobal.net) Quit (Quit: Leaving.)
[18:45] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:57] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:00] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:36] <bchrisman> sagewk: the issue on lio/scst is clustered persistent reservations… they've been implemented for scst, but (as far as we know) not for lio
[19:42] * joshd (~joshd@ip-64-111-111-107.dreamhost.com) has joined #ceph
[19:51] <cmccabe> you guys do know that signalling a condition variable before anyone waits on it goes to /dev/null, right?
[19:52] <cmccabe> check out librados::RadosClient::pool_create
[19:52] <cmccabe> I see a definite possibility that cond.signal will happen before cond.wait
[19:59] <cmccabe> this could perhaps be resolved by taking mylock earlier
[19:59] * Meths_ is now known as Meths
[20:02] <gregaf> cmccabe: I'm looking at RadosClient::pool_create
[20:02] <gregaf> and, umm, what?
[20:02] <cmccabe> ok
[20:02] <gregaf> oh, I see
[20:02] <cmccabe> yes
[20:02] <cmccabe> and unfortunately that pattern is repeated a lot in librados
[20:03] <cmccabe> now the question is, what is the lock ordering between mylock and RadosClient::lock
[20:03] <cmccabe> does anyone ever hold one and take the other?
[20:03] <gregaf> there's only a problem here on error code returns
[20:03] <gregaf> from before the message gets sent out
[20:03] <cmccabe> no, there is a problem in general
[20:03] <gregaf> which I think only happens on ENOMEM?
[20:04] <cmccabe> because the condition variable may get signalled before the wait happens
[20:04] <gregaf> it only waits on the cond while (!done)
[20:04] <cmccabe> no mutexes are held by the thread on line 1191
[20:04] <gregaf> and everything that signals the cond:
[20:04] <gregaf> 1) sets done first
[20:04] <gregaf> 2) happens after a message gets sent out and received
[20:04] <gregaf> or should, anyway
[20:04] <cmccabe> hmm
[20:04] <cmccabe> maybe the done variable does save us here, let me check
[20:05] <gregaf> that is sort of what it's for ;)
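The lost-wakeup hazard cmccabe raises, and the `done`-flag pattern gregaf points to, can be sketched like this (a minimal illustration with hypothetical names; not librados's actual code, where the equivalent role is played by `C_SafeCond::finish`):

```python
import threading

# Sketch of the pattern under discussion: a bare notify() delivered
# before anyone waits is lost ("goes to /dev/null"), so completion sets
# a "done" predicate under the lock to make the wakeup sticky, and the
# waiter loops on the predicate instead of trusting the signal alone.

lock = threading.Lock()
cond = threading.Condition(lock)
done = False
result = []

def completion(value):
    # Set done first, then signal, all under the lock.
    global done
    with lock:
        result.append(value)
        done = True
        cond.notify()

def wait_for_completion():
    # Even if completion() already ran, the while-not-done check
    # observes the flag and skips the (otherwise lost) wakeup.
    with lock:
        while not done:
            cond.wait()
    return result[0]

completion(42)               # signal fires before anyone is waiting
print(wait_for_completion())  # still returns, thanks to the done flag
```

The key point from the exchange above: the signal alone is not a handoff; only the predicate checked under the mutex is.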
[20:05] <cmccabe> but error code returns will still set done, won't they
[20:06] <cmccabe> it is all done through C_SafeCond::finish
[20:06] <gregaf> from a quick browse it looks like they don't, but the only error codes you're going to get before message passing are ENOMEM, which is busted anyway due to lack of exception-handling, right?
[20:07] <cmccabe> are you talking about errors from create_pool or errors from the remote monitor
[20:07] <gregaf> the local errors
[20:07] <gregaf> remote errors come back through the normal path and set done and signal the cond
[20:08] <cmccabe> the error that I'm trying to debug is not local
[20:08] <gregaf> so the cond signalling is unlikely to be your problem
[20:08] <cmccabe> well, it is local, but it results from an old osdmap later on
[20:09] <gregaf> if librados is still hanging it's probably because wait_for_new_map (or whatever the function is called) didn't actually ask for a new map
[20:09] <cmccabe> although it's hard to believe, pool_op_submit doesn't seem to return any error codes
[20:09] <gregaf> the logic there is obtuse and may well be wrong
[20:09] <cmccabe> there was a function called maybe get new map, or something
[20:09] <gregaf> yes, that one
[20:09] <gregaf> follow those around, unless I'm gravely mistaken that's the problem causing your hang
[20:10] <gregaf> didn't we talk about this yesterday?
[20:10] <cmccabe> yes, but we didn't solve it
[20:10] <gregaf> okay, but did you check that it is in fact waiting forever to get a new map?
[20:10] <gregaf> if that's the problem it's because it's not asking for a new map
[20:10] <cmccabe> my latest trace doesn't hang, just gets the error
[20:10] <gregaf> what error?
[20:10] <cmccabe> I've seen it do both
[20:11] <cmccabe> I just need to look at the osdmap requesting logic more, that's really what it boils down to
[20:12] <cmccabe> brb
[20:12] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[20:34] <sjust> would it make sense to just take a snap shot before starting to steam a large object?
[20:35] <sjust> for purposes of atomic get
[20:35] <bchrisman> depends on whether you want wrinkles in it… oh… stream.. sorry.. :)
[20:36] <sjust> ...stream a large object
[20:52] <Tv> we should upgrade to nuclear gets
[20:57] <sagewk> :)
[20:58] <sagewk> snapshot (or clone) is a possibility, but pretty heavyweight (there's a write involved in every read)
[20:58] <sagewk> then there's the whole cleanup issue
[21:00] <sagewk> i'd be more inclined to move toward versioning where we keep old versions around.. then the overhead is on write, and also gets us all the other benefits of full versioning. cleanup vs very slow readers is still an issue if we don't want full retention, though..
[21:07] <Tv> sagewk: for how long?
[21:08] <Tv> ah you said that, nm
[21:09] <Tv> i've done something like this before, i liked it: 1) requests must complete within 24h (enforce in code) 2) clean up old versions >48h
[21:09] <Tv> 1) is pretty much protection against slowloris
[21:10] <Tv> you pay the cost of needing to do the expiry, though
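Tv's two-threshold scheme can be sketched as follows (hypothetical constants and names; the point is only that the read deadline is strictly shorter than the retention window, so no in-flight reader can ever touch a reaped version):

```python
import time

# Sketch of the scheme described above:
# 1) requests must complete within 24h (enforced in code)
# 2) versions older than 48h are reaped
REQUEST_DEADLINE = 24 * 3600   # seconds a request may stay open
RETENTION = 48 * 3600          # seconds an old version is kept

def request_expired(start_ts, now):
    # Protection against slowloris-style readers that never finish.
    return now - start_ts > REQUEST_DEADLINE

def reap_old_versions(versions, now):
    """versions: list of (version_id, created_ts); keep only those
    still inside the retention window."""
    return [(v, ts) for v, ts in versions if now - ts <= RETENTION]

now = time.time()
vers = [("v1", now - 50 * 3600), ("v2", now - 10 * 3600)]
print(reap_old_versions(vers, now))  # only v2 survives
```

Because `RETENTION` is double `REQUEST_DEADLINE`, any version a still-valid request could be reading is by construction younger than the reap cutoff; the cost, as noted above, is running the expiry pass itself.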
[21:13] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[21:46] <cmccabe> gregaf: after the POOL_OP_CREATE gets sent, it looks like the reply message from the monitor has m->epoch = 0
[21:49] <gregaf> cmccabe: I'll look at it again, but I checked that already and thought it was good
[21:50] <gregaf> err, actually, shouldn't it be looking at m->version?
[21:50] <cmccabe> gregaf: isn't it the epoch that reflects changes in the osdmap?
[21:51] <gregaf> the map epoch, yes, but I think the variable recording that in the message is titled "version"
[21:51] <gregaf> based on what the various components are checking
[21:51] <cmccabe> well there is another variable in there called epoch
[21:52] <cmccabe> yeah, I'm not sure I understand what the difference in function is between MPoolOpReply::epoch and MPoolOpReply::version
[21:54] <cmccabe> gregaf: I feel like PaxosServiceMessage::version has something to do with paxos, and probably not so much with the OSDMap... ?
[21:57] <gregaf> oh, you're right, epoch is the one in use
[21:58] <gregaf> I just got confused because the epochs are (for whatever dumb reason) referred to as versions in the paxos code
[21:58] <gregaf> so it's always filled in with the pending map's epoch
[21:58] <gregaf> maybe that's not always initialized, though
[21:59] <gregaf> oh wait, they're not, some of them are filled in with the current map's epoch
[21:59] <cmccabe> ic
[21:59] <gregaf> no, I'm confusing myself again
[22:00] <gregaf> they're always filled in with the pending map's epoch
[22:00] <gregaf> except for *certain* error codes
[22:00] <gregaf> which are filled in with zero
[22:00] <gregaf> you are correct
[22:00] <cmccabe> so do we have access to the pending map's epoch in the preprocess_* functions?
[22:00] <gregaf> so I guess the error codes should be filled in with the current map's epoch
[22:01] <gregaf> except for the pool creates they're unconditionally filled in with the pending map's epoch
[22:02] <cmccabe> ok
[22:02] <gregaf> and it shouldn't ever be able to exist with a 0 epoch; it's filled in at creation
[22:02] <Tv> yehudasa: i see a bug in the s3test we talked about
[22:03] <yehudasa> Tv: there was a bug that we found and berler pushed a fix
[22:03] <yehudasa> do you see another bug?
[22:03] <Tv> hold on, that's the first time i hear about that
[22:03] <Tv> please use channels for discussion :(
[22:04] <gregaf> yeah, this all looks as it should, it only returns an epoch of 0 for ops where you're trying to treat snaps as the wrong type of [pool selfmanaged]
[22:04] <gregaf> otherwise it returns pending_map.epoch
[22:04] <Tv> yehudasa: actually, no, that makes very little sense..
[22:04] <gregaf> cmccabe: you sure it was sending back 0 to the clients? or was that just your guess based on code paths
[22:05] <cmccabe> gregaf: I printed it out
[22:05] <yehudasa> Tv: ?
[22:05] <cmccabe> gregaf: I am confused now because every code path I find ends with _pool_op(m, <error code>, pending_inc.epoch)
[22:05] <cmccabe> gregaf: which as you say, should be what we want here
[22:06] <Tv> yehudasa: i don't see how it could ever be at self.offset==self.size and have count>0, since count = min(size, self.size - self.offset)
[22:06] <Tv> yehudasa: that min will be min(whatever, 0)
[22:06] <Tv> yehudasa: and that change wouldn't fix the bug i suspect, in either case
[22:06] <gregaf> cmccabe: yeah, the only other options are prepare_pool_op calls it with 0, but only for naughty snap usage
[22:07] <gregaf> and C_PoolOp calls it with the epoch it was given, but all the creators of that use pending_inc.epoch as far as I can see
[22:07] <cmccabe> well, maybe pending_inc.epoch was 0.
[22:07] <cmccabe> is that possible?
[22:07] <yehudasa> Tv: self.offset is increased just before that test
[22:08] <yehudasa> Tv: so on the last actual read it can happen, and then count is not 0
[22:08] <yehudasa> Tv: then it'll be called again, but this time count will be 0
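The sequencing yehudasa describes can be sketched with a toy file-like reader (hypothetical class; the real code lives in the s3-tests suite): the offset advances before the end-of-stream state is observable, so the final non-empty read leaves `offset == size` while still returning `count > 0`, and only the next call computes `count == 0`.

```python
# Sketch of the read() logic being debated: count is clamped to the
# bytes remaining, then offset advances, so EOF (count == 0) is only
# reported on the call *after* the one that consumed the last byte.

class FakeFile:
    def __init__(self, size, chunk=b'x'):
        self.size = size      # total bytes this fake stream will yield
        self.offset = 0       # bytes handed out so far
        self.chunk = chunk

    def read(self, size):
        count = min(size, self.size - self.offset)
        self.offset += count
        # here offset may already equal self.size while count > 0
        return self.chunk * count

f = FakeFile(5)
print(f.read(4))  # b'xxxx'  (count=4)
print(f.read(4))  # b'x'     (count=1; offset now == size)
print(f.read(4))  # b''      (count=0 only on this extra call)
```

That extra zero-byte call is the "extra read that wasn't expected" mentioned below, which is what exposed the non-atomic GET while the PUT was still completing.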
[22:08] <Tv> yehudasa: oh yeah i see that, sorry i have local edits to flush out the bug i suspect
[22:09] <Tv> yehudasa: basically, since we already sent content-length, the sneaky part is not sneaky enough
[22:09] <gregaf> cmccabe: not as far as I can tell; every time it creates a new pending_inc it sets the epoch to odsmap.epoch+1 in the constructor
[22:09] <gregaf> and create_pending is called on every switchover, under lock
[22:09] <cmccabe> gregaf: I'll add some things to the monitor logging and see what it says
[22:09] <gregaf> and the epoch is encoded and decoded as part of the message...
[22:10] <yehudasa> Tv: yeah, effectively it generated an extra read that wasn't expected, while the PUT was completing, which triggered the non-atomic GET problem
[22:10] <Tv> yehudasa: the test, as it is now, is very racy; it can be made much less racy
[22:10] <Tv> for example, it won't trigger for me at all
[22:11] <yehudasa> Tv: yeah, actually it didn't trigger for me on my setup either, but was triggering consistently on dho
[22:11] <yehudasa> Tv: I wonder if that's due to some other changes, but I can't really think of any that would do that
[22:11] <Tv> yehudasa: latency differences probably
[22:11] <yehudasa> Tv: oh, yeah
[22:11] <Tv> i'm fairly sure i can trigger it reliably, hold on
[22:14] <Tv> err now i'm confused again
[22:14] <Tv> oh well, re-reading ;)
[22:14] <cmccabe> gregaf: ok, on my last run, m->epoch did get filled in
[22:14] <cmccabe> gregaf: but now I got a hang
[22:15] <cmccabe> gregaf: this is really fishy
[22:15] <gregaf> so my guess if you're getting a hang because librados isn't actually asking for a new OSDMap when we want it to be
[22:15] <cmccabe> gregaf: yeah...
[22:15] <gregaf> that logic is pretty obtuse and when I looked at it yesterday it did not convince me of its correctness
[22:16] <cmccabe> gregaf: ok, reran and got m->epoch = 0 again
[22:16] <cmccabe> gregaf: mon.c@2(peon).osd e3 _pool_op(POOL_OP_CREATE) got -EEXIST. pending_inc.epoch = 0
[22:17] <cmccabe> from a printout I added to OSDMonitor::preprocess_pool_op_create
[22:17] <gregaf> hrm, maybe my understanding of those interfaces is incorrect
[22:17] <cmccabe> so apparently pending_inc.epoch was 0
[22:21] <cmccabe> gregaf: following this back to PaxosService::dispatch...
[22:21] <yehudasa> Tv: from what I understand (and I may be very well wrong here) the tests are single threaded, so any races are due to different scheduling of the apache processes
[22:21] <Tv> yehudasa: let's do this on paper
[22:21] <yehudasa> apache/radosgw
[22:21] <cmccabe> gregaf: preprocess_pool_op_create ultimately comes out of PaxosService::preprocess_query
[22:21] <cmccabe> gregaf: so really the question is what is the meaning of pending_inc.epoch at that point, if any
[22:22] <gregaf> yes
[22:22] <gregaf> it's filled in by create_pending
[22:22] <gregaf> which I thought was called every time the map gets changed
[22:42] <sagewk> cmccabe: oh, i bet pending_inc was 0 because it was a slave monitor.
[22:42] <sagewk> if it's a slave, though, it should have returned the current osdmap epoch.
[22:42] <sagewk> and regardless, if we return EEXIST due to what is in pending_inc that is a bug.. we should wait for paxos to commit before returning that
[22:43] <sagewk> yeah, it should just return osdmap->get_epoch(), not the next pending epoch.
[22:45] <cmccabe> sagewk: so what are the intended semantics here
[22:45] <sagewk> return EEXIST if it exists.
[22:45] <sagewk> s/pending_inc.epoch/osdmap.get_epoch()/
[22:46] <cmccabe> sagewk: ic
[22:46] <cmccabe> sagewk: I think we might need to really examine the code since there's a lot of prepare_* methods that are accessing pending_inc right now
[22:47] <sagewk> those need to
[22:47] <sagewk> but they should avoid returning errors until commit
[22:48] <cmccabe> well, nearly every call to _pool_op is referencing pending_inc
[22:48] <sagewk> ugh.. yeah its a mess
[22:48] <cmccabe> so if that doesn't work on the slaves, then we'll need to fix
[22:48] <gregaf> many of them should be since they're doing pool creates and stuff that won't happen until the next map is committed
[22:48] <gregaf> and those ones will only happen on the master
[22:49] <cmccabe> but the next map won't be committed if there's an error
[22:49] <sagewk> any prepare_ call only runs on the master, so pending_inc is fair game there
[22:49] <sagewk> just not in preprocess_*
[22:49] <gregaf> I think it's only the ones in prepare_* that you need to worry about, and only the error code ones
[22:49] <gregaf> err, preprocess* I mean
[22:49] <cmccabe> it's kind of unfortunate that there's no way to enforce that rule
[22:50] <gregaf> we could maybe use accessors instead and wall it off from public consumption
[22:50] <sagewk> or pass it as an explicit argument to prepare_* methods
[22:51] <cmccabe> explicit argument makes sense
[22:51] <cmccabe> in general it would be nice if the paxos stuff was broken down into more components rather than inheriting from one giant base class of doom
[22:52] <gregaf> I don't think that obscuring the origin even further is going to help anything
[22:54] <sagewk> it needs to be reworked at some point anyway, so it's not worth refactoring now.
[22:54] <cmccabe> gregaf: my point is that we might want separate classes for preprocess, prepare, etc.
[22:54] <gregaf> there is no etc, it's preprocess and prepare ;)
[22:56] <sagewk> looks like preprocess_pool_op() is broken. i'll open a bug for this.
[22:57] <cmccabe> gregaf: fair enough
[22:57] <cmccabe> gregaf: but still, preprocess should not have access to the same class data that prepare does
[22:57] <cmccabe> gregaf: another thing that annoys me is that there are leader-only variables in class OSDMonitor
[22:58] <cmccabe> gregaf: they still exist if you're not the leader, which is confusing
[22:58] <cmccabe> gregaf: anyway
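The rule the three settle on above can be sketched in miniature (all names here are hypothetical simplifications, not the actual OSDMonitor code): `preprocess_*` runs on any monitor and must answer only from the committed map, while `prepare_*` runs on the leader and receives the pending increment as an explicit argument, as sagewk suggests, so read-only paths cannot touch it by accident.

```python
# Sketch of the preprocess/prepare split discussed above.

class CommittedMap:
    def __init__(self, epoch, pools):
        self.epoch = epoch
        self.pools = set(pools)

def preprocess_pool_create(osdmap, name):
    """Read-only check, valid on any monitor: report EEXIST with the
    *committed* epoch (the fix sagewk describes), never a pending one."""
    if name in osdmap.pools:
        return ('EEXIST', osdmap.epoch)
    return None  # not an error: falls through to prepare_* on the leader

def prepare_pool_create(osdmap, pending_inc, name):
    """Leader-only: mutates the explicitly-passed pending increment,
    so the pending state never leaks into preprocess paths."""
    pending_inc.setdefault('new_pools', []).append(name)
    return ('OK', pending_inc['epoch'])

osdmap = CommittedMap(epoch=3, pools=['data'])
print(preprocess_pool_create(osdmap, 'data'))      # ('EEXIST', 3)
pending = {'epoch': osdmap.epoch + 1}
print(prepare_pool_create(osdmap, pending, 'metadata'))
```

Passing `pending_inc` as a parameter, rather than as shared class state, mechanically enforces the "fair game only in prepare_*" rule that the class-member layout cannot.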
[23:02] <cmccabe> it looks like rados_open_pools_parallel does pass with this change
[23:02] <sagewk> yay
[23:03] <cmccabe> 99f5ca7140da820c0873b4eadb4741a152a01567
[23:08] <gregaf> joshd: is there documentation for how to use teuthology-lock?
[23:08] <gregaf> I can't work out how to deal with my previously-locked machines being locked as greg rather than gregf@kai
[23:09] <joshd> gregaf: -h, and use the --owner option
[23:09] <gregaf> yes, I'm doing it wrong and the CLI isn't telling me how to do it
[23:09] <joshd> gregaf: also the readme
[23:09] <gregaf> ./virtualenv/bin/teuthology-lock --owner sepia12.ceph.dreamhost.com is wrong
[23:09] <gregaf> ./virtualenv/bin/teuthology-lock --update --owner sepia12.ceph.dreamhost.com crashes
[23:10] <joshd> gregaf: you need to specify an operation, in this case --unlock
[23:10] <gregaf> ah, there we go… --update machine --field-to-change value
[23:11] <joshd> gregaf: that doesn't change the owner, it only affects the status (up or down) and description
[23:11] <gregaf> …it executed without giving me any errors :(
[23:11] <joshd> I'll fix that
[23:12] <gregaf> I can't unlock them if I don't own them, it looks like
[23:13] <sagewk> cmccabe: i'll merge the pool op stuff, halfway through a fix of the affected code
[23:14] <joshd> gregaf: that's right, you need the --owner option to set the owner you run as
[23:15] <gregaf> ungh
[23:24] <sagewk> cmccabe: pushed fix(es) to master, merge/rebase on that

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.