#ceph IRC Log


IRC Log for 2011-07-06

Timestamps are in GMT/BST.

[0:06] * u3q (~ben@jupiter.tspigot.net) Quit (Quit: changing servers)
[0:08] * u3q (~ben@jupiter.tspigot.net) has joined #ceph
[0:12] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[0:12] * gregorg_taf (~Greg@ has joined #ceph
[0:20] * u3q (~ben@jupiter.tspigot.net) Quit (Quit: changing servers)
[0:24] * u3q (~ben@jupiter.tspigot.net) has joined #ceph
[0:48] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[0:57] <Tv> sagewk: at some point, you asked about command-line access to python docstrings. here goes: "pydoc urllib2", "pydoc urllib2.urlopen", "cd YOUR_TEUTHOLOGY_SRC && ./virtualenv/bin/python -m pydoc teuthology.task.ceph.task"
[0:58] <sagewk> thanks
[0:58] <sagewk> i may stick to reading the source :)
[0:59] <Tv> or in ./virtualenv/bin/python repl or teuthology interactive mode: from teuthology.task import ceph; help(ceph.task)
[0:59] <Tv> sagewk: that's what i do ;)
[1:00] <Tv> sagewk: we can also generate pretty html from that; the result when well done looks this pretty: http://readthedocs.org/docs/virtualenv/en/latest/index.html
[1:00] <Tv> http://readthedocs.org/docs/django-testing-docs/en/latest/index.html
[1:00] <Tv> etc
[1:01] <Tv> (where "from that" has values like "that and as much prose as you care to write; more tends to be better")
[1:01] <Tv> don't expect a book autogenerated from api docs ;)
[1:01] <sagewk> :)
[1:02] <Tv> sagewk: e.g. the pypy project uses that whole doc system as a "official wiki replacement": http://readthedocs.org/docs/pypy/en/latest/
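The docstring lookup Tv describes on the command line can also be driven programmatically, which is handy inside a REPL or a script. A minimal sketch using only the standard library (the object being documented here is arbitrary; any module, class, or function works):

```python
import pydoc

# pydoc can be used from code as well as from the command line;
# render_doc() returns the same text that "pydoc NAME" prints.
text = pydoc.render_doc(pydoc.render_doc)
print(text.splitlines()[0])  # first line names the documented object
```

This is the same machinery behind `help()`, which simply pages the rendered text.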
[1:47] * Tv (~Tv|work@ip-64-111-111-107.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:14] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[3:00] * joshd (~joshd@ip-64-111-111-107.dreamhost.com) Quit (Quit: Leaving.)
[3:04] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:51] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[5:16] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[6:31] * lxo (~aoliva@9KCAAAHXS.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[6:35] * lxo (~aoliva@9YYAABUIC.tor-irc.dnsbl.oftc.net) has joined #ceph
[8:33] * jiaju (~jjzhang@ has joined #ceph
[10:56] * lxo (~aoliva@9YYAABUIC.tor-irc.dnsbl.oftc.net) Quit (Read error: Connection reset by peer)
[10:57] * lxo (~aoliva@09GAAFAOX.tor-irc.dnsbl.oftc.net) has joined #ceph
[11:14] * jiaju (~jjzhang@ Quit (Ping timeout: 480 seconds)
[12:53] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[13:01] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[13:22] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[13:41] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[13:43] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[14:36] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:50] * slang (~slang@chml01.drwholdings.com) has joined #ceph
[16:51] <slang> ceph reads from the primary to ensure consistency is maintained
[16:51] <slang> but what if a file is marked immutable?
[16:51] <slang> does it still always read from the primary in that case?
[17:52] * greglap (~Adium@ has joined #ceph
[18:40] * joshd (~joshd@ip-64-111-111-107.dreamhost.com) has joined #ceph
[18:40] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[18:49] * Tv (~Tv|work@ip-64-111-111-107.dreamhost.com) has joined #ceph
[18:59] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:55] <yehudasa> wido: are you there?
[20:05] <Tv> joshd: for teuthology commit f80a2, i straced the exec and there was only PWD in the environment.. i think the SetEnvs in apache.conf are completely ignored, you might want to verify that they get set, perhaps move them to the shell script wrapper
[20:05] <joshd> Tv: they are necessary for something, rgw doesn't work without them set there
[20:06] <Tv> joshd: ahh okay i think then they're inside the "fastcgi environment"
[20:06] <Tv> joshd: which is separate from the unix env
[20:06] <joshd> Tv: ah, that makes sense
[20:06] <Tv> so i'd say rip out LD_LIBRARY_PATH from there and add a comment that warns it's not the unix env
[20:07] <Tv> (fastcgi env can change per request, that's why it's not the unix env)
[20:07] <joshd> sounds good
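The distinction Tv is drawing can be illustrated outside of Apache/rgw. In FastCGI (as in WSGI), each request carries its own environment dictionary, built per-request by the web server; variables set with `SetEnv` land there, not in the unix process environment. A sketch of that separation (the variable name `RGW_SHOULD_LOG` is purely illustrative, not a documented rgw setting):

```python
import os

# Illustrative sketch: a FastCGI/WSGI server hands each request its own
# environ dict, distinct from the unix process environment.
def handle_request(request_environ):
    # Variables set via the web server's SetEnv arrive per-request here...
    fcgi_val = request_environ.get("RGW_SHOULD_LOG")  # hypothetical name
    # ...not in os.environ, which only holds the unix process environment.
    unix_val = os.environ.get("RGW_SHOULD_LOG")
    return fcgi_val, unix_val

print(handle_request({"RGW_SHOULD_LOG": "yes"}))
```

This is why an `strace` of the exec shows almost nothing in the unix env even though the CGI application still sees the `SetEnv` values.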
[20:08] <gregaf> all: anybody using sepia27 or sepia28?
[20:08] <gregaf> they're not locked on the wiki page but I'm going to grab them so if you forgot to lock, speak up now...
[20:13] <Tv> not me
[20:14] <Tv> my google apps email is woefully inadequate for dealing with mailing lists :( time to learn how to do it right
[20:14] <Tv> how do you send plain text by default.. :-/
[20:15] <gregaf> Tv: don't you already?
[20:16] <Tv> vger just refused a message from me
[20:17] <gregaf> I don't remember having this problem in the past, maybe I enabled a lab for it at some point...
[20:24] <Tv> apparently if you toggle it to plain text once, that becomes the new default
[20:24] <Tv> now i need to figure out "reply to list"
[20:25] <gregaf> there's a lab to set default reply-to-all
[20:25] <gregaf> that's what I use
[20:26] <cmccabe> so pool creation doesn't seem to have read-after-write consistency
[20:27] <cmccabe> I just created a pool, fired off a semaphore, and another process tried to open that pool, and failed with ENOENT
[20:27] <Tv> gregaf: Thank you Sir, you are a gentleman and a scholar.
[20:27] <gregaf> heh, welcome
[20:27] <gregaf> cmccabe: I'm not sure that's built into librados by default :(
[20:28] <gregaf> since each process will need the updated OSDMap to see the newly-created pools
[20:28] <Tv> that's worse than "reply to list", but I'll take it
[20:28] <cmccabe> tv: when you hit reply-to-all, doesn't the list get included?
[20:28] <cmccabe> tv: if the list was in the CC: or to: originally?
[20:28] <gregaf> in rgw I think there's code that fetches a new OSDMap whenever a bucket lookup fails for ENOENT
[20:29] <cmccabe> gregaf: I think I understand why this is eventually consistent
[20:29] <cmccabe> gregaf: it relies on getting a new OSDMap, but there's no guarantees as to when that will happen
[20:29] <gregaf> yeah
[20:29] <cmccabe> gregaf: well maybe there are some guarantees, but not the kind that would provide read-after-write
[20:29] <Tv> cmccabe: http://www.mutt.org/doc/manual/manual-4.html#ss4.8
[20:30] <gregaf> maybe we could have it as an option to force map updates to deal with that kind of thing; it's just not something you want to enable by default since it adds message traffic
[20:30] <cmccabe> tv: ok, so it's about not sending 2 copies to someone
[20:30] <cmccabe> tv: that isn't even a concern for me with gmail because gmail does the de-duplication for me
[20:31] <cmccabe> tv: but I certainly see how it could be a problem since not everyone uses gmail
[20:31] <gregaf> not all mail clients are that clever though; it's one of the reasons I decided I couldn't handle Apple Mail
[20:31] <cmccabe> tv: it seems like a big dilemma though because you don't even know who is on the mailing list
[20:31] <gregaf> to which I say: make your mail client suck less!
[20:31] <cmccabe> tv: so you can trim people on the assumption that they're on the list, but they may or may not be
[20:31] <Tv> hence the followup-to logic that got de facto standardized
[20:33] <cmccabe> tv: hmm, I had no idea that header existed
[20:34] <cmccabe> I wonder how many non-techno-geek email clients implement it
[20:34] <cmccabe> given that to most people email means HTML sent from outlook... sigh
[20:36] <cmccabe> gregaf: I have confirmed that rgw does handle this race in open_bucket_ctx
[20:36] <cmccabe> gregaf: thanks
[20:36] <cmccabe> looks like I'll have to do the same
[20:55] <cmccabe> so it seems that rados_pool_create(foo) can return EEXIST, but then on the next line, rados_ioctx_create(foo) can fail with ENOENT
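Until the client's OSDMap catches up to the epoch in which the pool was created, a lookup can legitimately fail with ENOENT even right after pool_create reported EEXIST. One workaround for a caller, sketched below with a hypothetical shim around `rados_ioctx_create` that returns 0 on success and a negative errno on failure (this is not a librados API, just the retry pattern):

```python
import errno
import time

def open_pool_with_retry(ioctx_create, pool, timeout=30.0, interval=0.5):
    """Retry opening a pool until the local OSDMap catches up.

    ioctx_create is assumed to be a shim returning 0 on success and
    -ENOENT while the client's OSDMap predates the pool's creation
    epoch (hypothetical wrapper, not the real librados signature)."""
    deadline = time.monotonic() + timeout
    while True:
        ret = ioctx_create(pool)
        if ret != -errno.ENOENT:
            return ret  # success, or a different (real) error
        if time.monotonic() >= deadline:
            return -errno.ETIMEDOUT
        time.sleep(interval)
```

This only masks the eventual consistency; the cleaner fix discussed below the race is to make the client fetch the newer map as part of the pool op itself.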
[21:08] <wido> yehudasa: here now
[21:15] <wido> yehudasa: I'm afk in a moment, but you can access my cluster via root@logger.ceph.widodh.nl and "ssh gateway"
[21:15] <wido> your key should still be loaded, unless you have a new key
[21:16] <yehudasa> wido: thanks, I'll try that soon enough
[21:16] <cmccabe> yehudasa: so it looks like no matter how long I wait, I never get the new OSDMap in the other process
[21:16] <cmccabe> yehudasa: I can see that you use a stat operation to force a new OSDmap in rgw
[21:16] <cmccabe> yehudasa: but since I don't have a pool open, I cannot do a stat operation in my test
[21:17] <cmccabe> yehudasa: I guess I could create a pool just to do a stat operation on it, but there must be a better way of getting the new osdmap.
[21:17] <yehudasa> cmccabe: iirc, the stat is not for getting the osdmap in the client, but to let the client notify the osd about the new epoch
[21:18] <yehudasa> you need to run it where you created the pool
[21:18] <cmccabe> yehudasa: I just have a process that creates a pool, adds some objects, and then closes the ioctx and exits
[21:19] <cmccabe> yehudasa: that should be enough to make the changes visible to others, but it's not
[21:19] <cmccabe> yehudasa: if nothing else, destroying the ioctx should be some kind of synchronization point
[21:19] <yehudasa> cmccabe: follow the epoch
[21:20] <cmccabe> I suppose I can start adding more debug statements or something
[21:20] <cmccabe> but first, I just want to get an understanding of how it's supposed to work
[21:20] <yehudasa> cmccabe: you should see that in the requests
[21:20] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[21:21] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[21:21] <cmccabe> yehudasa: so rados_pool_create should create a new epoch
[21:21] <cmccabe> yehudasa: how do other processes know that that new epoch has happened?
[21:23] <gregaf> they don't; OSDMap updates are pushed around on top of other communications
[21:23] <gregaf> if there are no other communications it won't know about changes to the cluster
[21:24] <cmccabe> gregaf: so I can wait forever to open this pool?
[21:24] <cmccabe> gregaf: usually there's at least a timeout after which changes become visible
[21:24] <gregaf> *shrug*
[21:24] <cmccabe> gregaf: in s3 there isn't any special operation you have to do, you just have to wait
[21:24] <gregaf> yes, that's correct
[21:24] <gregaf> librados is not S3
[21:25] <gregaf> it's not even particularly S3-like
[21:25] <cmccabe> gregaf: I also don't understand how I can get pool_create == EEXIST followed by ioctx_open == ENOENT
[21:25] <gregaf> well, you wouldn't
[21:25] <cmccabe> gregaf: well, I just did
[21:25] <gregaf> but ioctx_open is also called by other functions
[21:27] <cmccabe> so at the very least, pool_create should get us the newest osdmap
[21:27] <gregaf> well, we could make that a configurable option
[21:28] <cmccabe> I just want to understand what the correct behavior is
[21:28] <gregaf> but it's not in there right now and probably shouldn't be the default since it creates a lot of extra message traffic in the common cases
[21:28] <cmccabe> do you think pool_create == EEXIST followed by ioctx_open == ENOENT should be possible or not
[21:28] <cmccabe> ok. So if that doesn't fetch a new OSDMap, then the answer is yes, it is possible.
[21:28] <gregaf> I really don't know the context or what librados looks like anymore
[21:28] <gregaf> it seems odd to me though
[21:29] <cmccabe> so pool create is a separate, standalone message that doesn't trigger an osdmap update?
[21:29] <gregaf> oh, let me look at the code
[21:30] <cmccabe> it looks like it just fires off POOL_OP_CREATE and waits for a numeric error code return
[21:31] <cmccabe> so that explains that
[21:33] <cmccabe> well, at the very least, I need to finish writing this test
[21:33] <cmccabe> so I have to figure out some way of getting the new osdmap in the other process
[21:34] <gregaf> so in a single process you did a pool_create and that failed on an EEXIST?
[21:34] <cmccabe> 2 processes
[21:34] <gregaf> and in that same process you did a rados_ioctx_create and it failed on ENOENT?
[21:34] <cmccabe> yes
[21:35] <gregaf> okay, so pool_op_create is a Pool Op
[21:36] <gregaf> and pool op replies aren't supposed to come through until they've reached the appropriate epoch
[21:36] <gregaf> (and it triggers a map update if necessary)
[21:36] <gregaf> it looks like maybe the epoch-to-reach doesn't get set if there's an error on the op
[21:37] <cmccabe> gregaf: I'm looking at OSDMonitor.cc now
[21:39] <cmccabe> so it's all done through pending_inc as usual
[21:40] <cmccabe> it's not clear to me how to make the error case send out an osdmap update
[21:40] <cmccabe> obviously, we only want to send it to the caller, not actually create a new epoch for everyone.
[21:41] <gregaf> hmm, actually they do include the proper version
[21:41] <gregaf> so in Objecter::handle_pool_op_reply
[21:42] <gregaf> it just doesn't do the map update if there's an error code
[21:42] <gregaf> that's probably just wrong
[21:43] <gregaf> I'm trying to think if there are any error codes that shouldn't result in a map update
[21:44] <cmccabe> if someone gave us a map, why would we throw it away?
[21:44] <cmccabe> unless it's older than what we've got
[21:44] <gregaf> they don't actually piggyback the maps on everything; they're sent as separate messages
[21:44] <gregaf> and the monitors don't gratuitously spew them out the way OSDs do
[21:44] <gregaf> so for Pool ops the Objecter itself requests a new map if necessary
[21:45] <gregaf> you can see it on line 1107 when it calls wait_for_new_map
[21:45] <cmccabe> yes, I see that
[21:45] <gregaf> so that only happens if there wasn't an error code returned with the request
[21:45] <gregaf> if there was an error code, it doesn't wait for the new map
[21:45] <gregaf> I think that's just incorrect
[21:46] <cmccabe> I can't think of a pool op where you would not want to wait for the new map... except maybe the snap stuff
[21:46] <gregaf> it was probably done because waiting for new maps can involve message traffic and a (relatively) lot of time, and if you're getting like an EPERM error the new map isn't that interesting
[21:46] <cmccabe> even then, I think the same issues would come into play
[21:46] <gregaf> yes
[21:46] <cmccabe> can't create a pool snap with name X... can't open pool snap with name X... wtf
[21:48] <cmccabe> I don't think errors need to be optimized
[21:48] <cmccabe> and if they are, not in a way that leads to this level of wtf :)
[21:48] <cmccabe> so I'll try to fix
[21:48] <gregaf> yes, I agree
[21:59] <slang> is CEPH_OSD_FLAG_BALANCED_READS ever used (i.e. is it possible to configure ceph to use that flag)?
[22:00] <cmccabe> slang: I think that was originally added for Hadoop
[22:00] <slang> cmccabe: ah ok
[22:01] <cmccabe> slang: I'm not very familiar with how that flag worked though
[22:01] <slang> cmccabe: looks like it just chooses a replica randomly
[22:01] <gregaf> I know the flag you're talking about but I can't even find it with git grep?
[22:01] <cmccabe> me neither... with regular grep in head-of-line
[22:02] <gregaf> CEPH_OSD_OP_BALANCEREADS, maybe
[22:02] <gregaf> although it's not actually used anywhere that I can see
[22:02] <slang> it's in Objecter::recalc_op_target
[22:02] <gregaf> oh, there we go: CEPH_OSD_FLAG_BALANCE_READS
[22:03] <gregaf> no 'D'
[22:03] <gregaf> anyway, sjust would know better but I believe there's no way to expose it through the Ceph posix layer
[22:04] <gregaf> it's only really useful for direct users of RADOS who can handle consistency on their own
[22:05] <slang> gregaf: the similar CEPH_OSD_FLAG_LOCALIZE_READS is actually available through a hidden argument to cfuse: --localize-reads
[22:06] <slang> gregaf: just wondering about the BALANCE_READS one
[22:06] <gregaf> yeah, I believe localize_reads was for Hadoop, and that handles the consistency on its own :)
[22:06] <slang> gregaf: I know there's consistency issues with reading from a replica
[22:06] <gregaf> bit of a hack I suppose
[22:06] <slang> gregaf: I was wondering about files that have the immutable flag set though
[22:07] <slang> gregaf: where consistency is no longer an issue
[22:07] <gregaf> it's probably something that could be exposed, it just isn't right now
[22:07] <slang> gregaf: ok
[22:08] <gregaf> there's generally not a lot of use to randomizing read locations for Ceph files
[22:08] <slang> gregaf: right
[22:08] <gregaf> since the clients cache stuff locally and the OSDs will have it in page cache and large files are already striped
[22:08] <slang> gregaf: I was thinking it might be extended a little bit by picking the replica on the osd with the lowest load
[22:09] <gregaf> yeah, there was a bit of work a while ago trying to do read "shedding" to replicas automatically, that was abandoned for consistency reasons
[22:10] <gregaf> it wouldn't be that helpful for single objects (rather than for all objects) since the client would need to have an idea of how busy each OSD is at any given time
[22:11] <gregaf> so the randomize_reads does exist but nobody's gone to the effort of hooking it up to the filesystem in useful ways, since it's complicated and error-prone (though certainly correct under the right circumstances)
[22:11] <gregaf> it's just way down our cost-benefit curve right now :)
[22:11] <gregaf> although if a user had a use-case where it actually mattered that would bump it up
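The behavior being described — reads pinned to the primary unless a balance flag opts into a random replica — can be summarized in a few lines. This is a sketch of the selection policy, not Ceph's code; the flag value and function name are illustrative:

```python
import random

BALANCE_READS = 1  # illustrative flag value, not the on-wire constant

def pick_read_osd(acting, flags, rng=random):
    """Sketch of read-target selection for one object.

    Reads normally go to the primary (acting[0]) so that all clients see
    writes in order; with a balance-reads flag a replica may be chosen
    at random, which is only safe for immutable data or callers that
    handle consistency themselves (e.g. Hadoop-style workloads)."""
    if flags & BALANCE_READS and len(acting) > 1:
        return rng.choice(acting)
    return acting[0]
```

Slang's suggested extension would replace `rng.choice` with a pick based on per-OSD load, which is where the heartbeat load data below comes in.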
[22:12] <slang> is there any load info passed around along with the heartbeat messages between osds?
[22:13] <slang> gregaf: a quick query to all the osds for an object to get their loads doesn't seem like it would be that much overhead
[22:14] <gregaf> if any of them actually have high enough loads that reading a 4MB object will be a problem, that query would get stuck in the queue
[22:14] <gregaf> I guess we could give it higher priority but it's just a lot of code for, like i said, not a lot of benefit
[22:14] <gregaf> we've only got 7 people here ;)
[22:15] <slang> gregaf: oh, I'm not suggesting you guys do any of this
[22:15] <slang> gregaf: just trying to figure out how much work would be involved
[22:15] <gregaf> patches are always welcome if you wanted to do it yourself :)
[22:15] <slang> gregaf: and how useful it would be
[22:15] <gregaf> what you're saying makes sense, if you're looking for feedback
[22:16] <gregaf> but it's a pretty limited use case so I'm really not sure it helps much; your OSDs "should" all have pretty similar load since everything's striped and replicated etc
[22:16] <yehudasa> wido: can I run client commands from anywhere?
[22:16] <slang> gregaf: yeah, it's definitely for a specific "hadoop-style" workload
[22:18] <slang> gregaf: if you're not striping objects for example, and you know your files are immutable, and you know some nodes will have higher load than others
[22:19] <gregaf> yeah
[22:19] <gregaf> I don't remember where it is exactly but the OSDs do report some load data to each other
[22:19] <slang> ooh
[22:20] <slang> gregaf: any keys I can search for?
[22:21] <gregaf> in OSD::heartbeat there's a section on cpu load averages
[22:21] <gregaf> don't recall how it's used or passed around, but start there?
[22:22] <slang> gregaf: ok, thanks
[22:23] <gregaf> np!
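For reference, the load figure gregaf points at in OSD::heartbeat is the host's standard load average, which is available from the standard library on unix systems:

```python
import os

def one_minute_load():
    """The OSD heartbeat path samples the host's load; on a unix system
    the same 1-minute load average is available via os.getloadavg()."""
    return os.getloadavg()[0]

print(one_minute_load())
```

How that number is shared between OSDs and whether it could drive replica selection is exactly the open question in the thread above.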
[22:40] <yehudasa> wido: nevermind, actually found the culprit
[23:12] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[23:12] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[23:34] <cmccabe> I'm getting a hang on pool create
[23:34] <cmccabe> any ideas for debugging?
[23:35] <cmccabe> I suppose I could try to reproduce with messenger debugging on
[23:38] <gregaf> see which messages are failing; that only requires ms debug 1 which is probably on already
[23:38] <gregaf> then look at the daemon that's blocking
[23:39] <cmccabe> it looks like 82fdd2aa6e84f756e29e88ac96919fcbac1f3390 didn't work
[23:39] <cmccabe> because I just got the EEXIST, ENOENT thing again
[23:40] <cmccabe> I probably will have to figure out a way to turn on ms debug in this program before I can start to figure out these issues
[23:41] <gregaf> you could look at the monitor logs
[23:43] <cmccabe> brb, lunch

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.