#ceph IRC Log

IRC Log for 2011-04-22

Timestamps are in GMT/BST.

[0:38] * gregorg_taf (~Greg@78.155.152.6) has joined #ceph
[0:38] * gregorg (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[0:42] * aliguori (~anthony@32.97.110.59) Quit (Ping timeout: 480 seconds)
[0:55] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) has joined #ceph
[0:57] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) Quit ()
[1:12] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[1:18] <Tv> i find it amusing that i argue by interpreting/writing code
[1:20] <cmccabe> I'm a little frustrated that some of the replies seem to retread ground we already covered
[1:20] <cmccabe> like greg suggested "ditching collision handling" entirely, after the long discussion about how malicious users could deliberately create collisions in publicly writable RGW buckets
[1:21] <gregaf> cmccabe: we're going to use SHA1, if a user manages to find collisions in SHA1 they deserve to break us
[1:21] <cmccabe> and I guess maybe I am the only one who cares about fast prefix search
[1:21] <gregaf> also, don't be a dick
[1:21] <cmccabe> but I might also be the only one who has actually used s3
[1:22] <cmccabe> gregaf: er, I don't think I am being a dick by bringing this up
[1:23] <cmccabe> in general I think the discussion has been pretty productive
[1:23] <Tv> gregaf: so with sha1 & no collisions it'll only break when an osd crashes at the right moment
[1:23] <cmccabe> gregaf: also SHA1 is not what you should use if preventing collisions is important: http://www.schneier.com/blog/archives/2005/02/sha1_broken.html
[1:24] <Tv> gregaf: you think that's much better?
[1:24] <Tv> I'm 100% serious when I say there's a lot more races in the lfn code.
[1:24] <Tv> you can't construct atomic operations from multiple operations without being very, very, careful
[1:24] <gregaf> Tv: what crash are you concerned about?
[1:24] <Tv> gregaf: it's not any single one, it's the plurality of them
[1:25] <cmccabe> tv: to be fair to the LFN guys, this stuff would be happening under the PG lock
[1:25] <Tv> cmccabe: doesn't stop it from crashing and leaving bad things around
[1:25] <cmccabe> tv: of course it's still potentially complex in the presence of crashes
[1:25] <Tv> e.g. lfn file create = create + set_xattr, crash in between and you have a bad file in there without the xattr
[1:25] <Tv> really, if you don't see that you shouldn't argue that the lfn code is simple
[1:26] <gregaf> Tv: that's an easy recovery...
[1:26] <cmccabe> tv: yeah, there's no way to atomically create something with an xattr on it.
[1:26] <Tv> gregaf: sure, but can you find the next one? before it eats something in production?
[1:26] <cmccabe> tv: I guess we could create it in a temp dir and then move it, but the overhead would probably be unacceptable.
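
The temp-directory idea cmccabe raises here can be sketched roughly as below. This is only an illustration under stated assumptions, not FileStore code: the helper name `create_with_xattr`, the `tmp` subdirectory, and the attribute name `user.ceph.name` are all invented for the example. The file is written and tagged under a temporary name and only then renamed into place, so a crash leaves either no file at the final path or a complete one (stray temp files still need cleaning up).

```python
import os
import tempfile


def create_with_xattr(directory, name, data, attr_value, set_attr=None):
    """Create directory/name so it never appears without its attribute.

    Hypothetical helper: the 'tmp' subdirectory and the attribute name
    'user.ceph.name' are illustrative assumptions, not Ceph's layout.
    set_attr is injectable so the sketch also runs where xattrs are
    unavailable; by default it uses os.setxattr (Linux only).
    """
    if set_attr is None:
        set_attr = os.setxattr
    tmp_dir = os.path.join(directory, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)

    # 1. Create the file under a unique temporary name on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # 2. Attach the attribute while the file is still under its temp name.
        set_attr(tmp_path, "user.ceph.name", attr_value)
        # 3. rename() within one filesystem is atomic: the final name shows up
        #    with data and attribute already in place, or not at all.
        final_path = os.path.join(directory, name)
        os.rename(tmp_path, final_path)
    except BaseException:
        # A failure before the rename leaves only the temp file; remove it.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

The cost cmccabe worries about is visible in the sketch: every create pays for an extra directory entry and a rename on top of the create and the xattr write.
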
[1:26] <gregaf> and in fact I'm pretty sure you can do it transactionally in btrfs
[1:27] <cmccabe> gregaf: I'm curious to what extent btrfs transactions insulate us from these kinds of issues
[1:27] <cmccabe> gregaf: that's something I don't completely understand about the current code
[1:28] <Tv> then you need to make everyone use btrfs too, not ext4
[1:28] <cmccabe> gregaf: it seems like if we're doing the equivalent of a full snapshot periodically, maybe we don't need to care about these kinds of issues at all?
[1:28] <cmccabe> tv: also certain large organizations want to use ext3/ext4
[1:28] <gregaf> Tv's right that it's a new kind of race, because we could have the file without the xattr and then try to recreate it from our replicas and look and see the xattrs don't match
[1:28] <yehudasa> btw, the race Tv is worried about can be fixed by removing that unlink before renaming
[1:29] <Tv> cmccabe: yup, because they can buy support for it, and it doesn't invalidate their RHEL support, etc
[1:29] <Tv> yehudasa: once again; a single race
[1:29] <gregaf> but it's an easy one to guard against and it's not like there are unplumbed depths to it...
[1:29] <Tv> the real problem is that there's many of them
[1:29] <Tv> can you really say you've found them all?
[1:30] <Tv> my code maps atomic operations 1:1, except for a bunch of idempotent things up front
[1:30] <Tv> that's the whole point of it
[1:30] <gregaf> yes, and it's very nice except that it hilariously pollutes the namespace tree, which I admit just makes me nauseous so I don't want to do it ;)
[1:31] <cmccabe> can someone explain the locking strategy in filestore right now?
[1:31] <gregaf> and I am concerned about accessing objects since it means we'll need to convert all accesses to be based on looping dir objects rather than paths
[1:31] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[1:32] <gregaf> that'll have to be yehudasa, I don't recall the structure that well
[1:32] <gregaf> and it's evident that nobody else looked that deeply into it either
[1:32] <cmccabe> gregaf: in fairness, yehuda's proposed solution involves a loop too
[1:32] <Tv> gregaf: i don't see why -- you rewrite the pathname to have the backslash escaping, and to be split into the chunk with slashes, and then you just open it
[1:32] <Tv> gregaf: no listdirs etc there
[1:32] <Tv> <3 short, straight codepaths
[1:32] <gregaf> Tv: because by doing the escaping and adding slashes we can construct a path that's more than 4k
[1:32] <yehudasa> cmccabe: the loop is in the extremely rare case of a collision
[1:32] <cmccabe> tv: greg is talking about the potential mkdir/rmdir loop on creating or deleting an object
[1:33] <cmccabe> tv: for lookup, yes, there is no loop in the directory-based solution
[1:33] <Tv> cmccabe: only creation needs a retry loop
[1:33] <Tv> cmccabe: rmdirs can be done e.g. at scrub time
[1:33] <gregaf> no, I mean there's a 4k limit on the length of paths we can use...
[1:33] <Tv> cmccabe: and need no retry loop; if it fails due to having files, move on
[1:34] <Tv> yeah the 4k limit is a good point
[1:34] <cmccabe> tv: interesting idea
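
Tv's point about rmdir needing no retry loop can be illustrated with a small sketch. It assumes a hypothetical scrub-time helper, `prune_empty_dirs`, which is not part of Ceph: it walks the tree bottom-up, tries to remove each directory, and simply moves on when the directory still has entries.

```python
import errno
import os


def prune_empty_dirs(root):
    """Remove empty subdirectories below root, deepest first.

    Hypothetical scrub-time helper. A directory that still has entries
    makes rmdir() fail with ENOTEMPTY (EEXIST on some systems); that is
    skipped rather than retried. Returns the directories that were removed.
    """
    removed = []
    # topdown=False yields children before their parents, so a parent that
    # only held empty directories becomes removable in the same pass.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue  # never remove the root itself
        try:
            os.rmdir(dirpath)
            removed.append(dirpath)
        except OSError as e:
            # Still in use, or already gone: move on without retrying.
            if e.errno not in (errno.ENOTEMPTY, errno.EEXIST, errno.ENOENT):
                raise
    return removed
```

Because a failed rmdir is harmless, this can race with object creation without extra locking: a creator that finds its directory gone only has to redo the mkdir, which is the one retry loop Tv mentions.
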
[1:34] <cmccabe> gregaf: well, the limit for s3 is 2k
[1:34] <cmccabe> gregaf: and I don't think mds-created objects will explore that territory either
[1:34] <gregaf> right now objects are accessed by path everywhere else, I believe
[1:34] <Tv> though 2k of all backslashes -> 4k
[1:34] <cmccabe> gregaf: however, that is a good point. The directory-based solution has a max object length.
[1:35] <gregaf> I'm not sure how much effort is involved in converting that
[1:35] <cmccabe> tv: well, s3 objects are utf-8, which is a strict subset of unix file names
[1:36] <cmccabe> tv: I don't know what our proposed (or actual?) escaping rules are like, though.
[1:36] <cmccabe> tv: I guess at minimum we need to handle the infamous all-slashes object name :)
[1:36] <Tv> cmccabe: currently, \ -> \\, / -> \\, afaik
[1:36] <Tv> err
[1:36] <Tv> cmccabe: currently, \ -> \\, / -> \/, afaik
[1:36] <Tv> err that can't be either
[1:36] <Tv> anyway ;)
[1:36] <Tv> slashes are escaped, and escapes are escaped
[1:37] <cmccabe> leaving us exactly enough room to make 2048, heh
[1:37] <Tv> foo/bar -> foo\sbar_head
[1:37] <yehudasa> actually \ -> \s
[1:37] <yehudasa> ahmm..
[1:37] <yehudasa> I mean / -> \s
[1:37] <Tv> cmccabe: not with _head etc
[1:38] <cmccabe> tv: hmm
[1:39] <cmccabe> tv: frustrating
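
The escaping rules the channel settles on above (`\` becomes `\\`, `/` becomes `\s`) and the worst-case length problem can be sketched as follows. The function names are made up for the example, and the real FileStore code may handle further characters; only the two rules stated in the chat are implemented.

```python
def escape_name(name):
    """Escape an object name so that it contains no '/'.

    Rules taken from the discussion: backslash -> two backslashes,
    '/' -> backslash followed by 's'.
    """
    out = []
    for ch in name:
        if ch == "\\":
            out.append("\\\\")
        elif ch == "/":
            out.append("\\s")
        else:
            out.append(ch)
    return "".join(out)


def unescape_name(escaped):
    """Inverse of escape_name; raises ValueError on a malformed escape."""
    out = []
    i = 0
    while i < len(escaped):
        ch = escaped[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(escaped):
            raise ValueError("dangling backslash")
        nxt = escaped[i + 1]
        if nxt == "\\":
            out.append("\\")
        elif nxt == "s":
            out.append("/")
        else:
            raise ValueError("unknown escape: \\" + nxt)
        i += 2
    return "".join(out)


# Worst case: every character needs escaping, so the length doubles.
worst = escape_name("/" * 2048)
print(len(worst))            # 4096
print(len(worst + "_head"))  # 4101, past a 4096-byte path limit
```

This is the arithmetic behind Tv's "not with _head etc": a 2048-character name made entirely of slashes already fills 4096 bytes once escaped, before any suffix or directory separators are added.
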
[1:40] <gregaf> I find it far more likely that we have to deal with that than with somebody finding a SHA1 collision in names of size 2k and then feeding them in ;)
[1:42] <cmccabe> gregaf: did you read the schneier.com link?
[1:42] <cmccabe> gregaf: "This attack builds on previous attacks on SHA-0 and SHA-1, and is a major, major cryptanalytic result. It pretty much puts a bullet into SHA-1 as a hash function for digital signatures (although it doesn't affect applications such as HMAC where collisions aren't important)."
[1:42] <gregaf> yes, it has become easier to find collisions than ever before
[1:43] <cmccabe> gregaf: "collisions in the full SHA-1 in 2**69 hash operations, much less than the brute-force attack of 2**80 operations based on the hash length."
[1:43] <gregaf> yes, Colin, I read the thing
[1:43] <cmccabe> gregaf: we don't *have* to use SHA-1 for this.
[1:43] <Tv> i'm not worried about the collisions, i'm worried about the code not handling the corner cases
[1:43] <gregaf> you are overestimating the impact for things that don't need to be cryptographically secure
[1:44] <gregaf> and really, probably for things that do need to be cryptographically secure but I'm not a crypto expert
[1:44] <iggy> there's a difference between looking for collisions and them happening in the course of things
[1:44] <cmccabe> anyway, sounds like yehuda's proposed solution adds an integer at the end to force uniqueness
[1:44] <iggy> of course, I don't know what you guys are talking about, so ignore me
[1:44] <cmccabe> so it's irrelevant to the proposed solution
[1:44] <Tv> cmccabe: ..and fails to handle it properly ;)
[1:46] <cmccabe> tv: I'm curious what isn't handled properly in that solution
[1:46] <cmccabe> tv: I guess perhaps the fact that objects may be deleted
[1:46] <cmccabe> tv: which could cause a spurious ENOENT... hmm... yes
[1:47] <yehudasa> cmccabe, Tv: the unlink is redundant
[1:47] <Tv> cmccabe: unlink+crash leaves a gap in the sequence -> the rest of them are not found
[1:47] <yehudasa> you can just
[1:47] <Tv> the point is not that there's a single race
[1:47] <cmccabe> yehudasa: I suppose you rename the highest one to be the new lowest
[1:47] <Tv> the point is i can look at it for 10minutes and see a race
[1:47] * eternaleye_ is now known as eternaleye
[1:47] <Tv> that does not bode well
[1:47] <yehudasa> you can just rename.. right
[1:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[1:48] <cmccabe> yehudasa: so what kind of locking are we looking at here?
[1:48] <cmccabe> yehudasa: I somehow got the impression that the PG has the whole directory to itself, and all ops take place under the PG lock. Is that incorrect?
[1:48] <gregaf> actually, you know what, that will be handled by recovery anyway
[1:48] <yehudasa> no, that's right
[1:49] <cmccabe> yehudasa: so at least the race will not surface in normal operation
[1:49] <Tv> gregaf: redoing the operation will see that the file does not exist, it will not heal the gap
[1:50] <gregaf> Tv: but recovery will discover all the missing objects and retrieve them from replicas, no?
[1:51] <Tv> create+crash creates a junk file in the bucket; redoing the operation will see it doesn't have the xattr, so it will not touch it, it'll create a new file
[1:51] <gregaf> I agree that the postfix is a bit of a silly way to handle collisions, though — doesn't one typically just rehash on the current name?
[1:51] <Tv> gregaf: that doesn't help with the gap problem
[1:52] <gregaf> mmm, I suppose not
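
The "gap in the sequence" problem discussed above can be shown with a toy model. Everything here is an assumption made for illustration: a dict stands in for the directory (file name mapped to the full object name that the xattr would hold), and the probing order is only what the chat describes, not necessarily what the LFN patch does.

```python
def make_store():
    """Three long names whose hashes collide, stored under suffixes 0..2."""
    return {
        "FOO_ef5434e_0": "long-name-a",
        "FOO_ef5434e_1": "long-name-b",
        "FOO_ef5434e_2": "long-name-c",
    }


def lookup(store, hashed, full_name):
    """Probe hashed_0, hashed_1, ... and stop at the first missing index."""
    i = 0
    while True:
        fname = "%s_%d" % (hashed, i)
        if fname not in store:
            return None  # end of the chain, as far as the prober can tell
        if store[fname] == full_name:
            return fname
        i += 1


def delete_by_rename(store, hashed, index):
    """Model of the rename-only fix: move the highest entry over the victim."""
    last = index
    while "%s_%d" % (hashed, last + 1) in store:
        last += 1
    victim = "%s_%d" % (hashed, index)
    tail = "%s_%d" % (hashed, last)
    if last == index:
        del store[victim]                # victim is already the tail
    else:
        store[victim] = store.pop(tail)  # one rename(): no window with a gap


# unlink + crash before the follow-up rename: index 1 is now a hole.
broken = make_store()
del broken["FOO_ef5434e_1"]
print(lookup(broken, "FOO_ef5434e", "long-name-c"))  # None

# Rename-only delete: the chain stays dense at every step.
fixed = make_store()
delete_by_rename(fixed, "FOO_ef5434e", 1)
print(lookup(fixed, "FOO_ef5434e", "long-name-c"))   # FOO_ef5434e_1
```

In the broken case `long-name-c` is still on disk under suffix 2 but can no longer be found, which is the failure Tv describes; the rename-only variant is the single-step alternative yehudasa suggests.
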
[1:52] <gregaf> anyway, you're making a mountain out of a molehill: first of all, btrfs transactions are wonderful things
[1:52] <gregaf> second of all, we can check for hashed files with non-existent xattrs
[1:53] <Tv> gregaf: whereas my stance is, you're creating a big pile of hard to debug bugs here
[1:53] <Tv> gregaf: feel free to go ahead, but i'm just gonna mentally flag FileStore as unreliable :-/
[1:54] <gregaf> and my stance is that I get that less code makes Tv happy, but you're overestimating the difficulties here
[1:54] <gregaf> and underestimating the downsides of your proposed solution
[1:54] <Tv> gregaf: on the contrary, the fact that i've been the only one pointing out these races makes this a huge problem
[1:55] <gregaf> jesu christu
[1:55] <Tv> these will translate to really horrible bugs if left in place
[1:55] <gregaf> you aren't the only competent person here ;)
[1:55] <Tv> i sure hope i'm not
[1:55] <gregaf> I quote: "This is a work in progress, a proper locking is required and will be applied."
[1:55] <gregaf> "Yeah, we're well aware of those races."
[1:55] <Tv> "Just Add Simplicity"
[1:56] <Tv> i guess what i'm saying comes down to this: if you don't build it ground-up reliable, it'll never be reliable. IMHO.
[1:56] <cmccabe> tv: I think you are right to be concerned with race conditions.
[1:56] <cmccabe> tv: it frustrates me sometimes when people dismiss the difficulties of race conditions
[1:56] <cmccabe> tv: I would rather have almost any bug than a race.
[1:57] <Tv> gregaf: i do agree the 4k path limit is a bummer, i'm trying to see if there's a nice way around that
[1:57] <gregaf> you see what my problem with your argument is here, right?
[1:58] <gregaf> I agree that race conditions blow
[1:58] <gregaf> but I just don't think that your solution is actually going to have less or less complicated code by the time we finish accounting for the things you haven't accounted for yet
[1:58] <gregaf> and I HATE the namespace pollution ;)
[1:58] <Tv> gregaf: at least in my case tab completion works and the filenames are readable ;)
[1:59] <cmccabe> gregaf: maybe I'll get flamed for this, but...
[1:59] <cmccabe> gregaf: namespace pollution?
[1:59] <gregaf> yeah, that's why I want to mash them together
[1:59] <Tv> wait what?
[1:59] <cmccabe> gregaf: you'd seriously rather have FOO_ef5434e_001 than FOO/BAR/BAZ?
[1:59] <gregaf> the file manipulations
[1:59] <cmccabe> gregaf: I guess a windows 95 joke goes here?
[1:59] <gregaf> so that we'd have eg 0.FOO_hashvalues_head
[2:00] <gregaf> instead of just the hashes that the current implementation does
[2:00] <Tv> gregaf: you're not giving us enough context to read your mind
[2:00] <gregaf> so right now lfn is doing hashes over the mangled names
[2:00] <Tv> current implementation = FOOBAR works, FOOBARBAZ fails due to 255 chars
[2:00] <Tv> lfn = FOO_ef5434e_1
[2:00] <Tv> prefixdirs = FOO/BAR/BAZ
[2:00] <gregaf> I'm not sure how difficult it would be but I think I prefer Sage's suggestion of doing all the name mangling in place
[2:01] <gregaf> it doesn't show the full object name in the path of course but it would provide enough for the very rare case that you want to grab it out by hand
[2:01] <Tv> lfn_earlier = FOO_ef5434e_1_head
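
The two naming schemes Tv lists above can be compared with a short sketch. The function names, the 3-character prefix, and the 7-hex-digit hash length are assumptions chosen to match the shape of `FOO_ef5434e_1`; the chat does not state the real lengths. Names are treated as ASCII so that characters and bytes coincide, and the `_head` suffix is left out.

```python
import hashlib

NAME_MAX = 255  # usual per-component limit on Linux filesystems


def lfn_name(escaped, index=0, prefix_len=3, hash_len=7):
    """Short prefix + truncated SHA-1 + collision index, e.g. FOO_ef5434e_1.

    Names that already fit in one component are stored unhashed, which is
    why short and long names end up sharing one namespace.
    """
    if len(escaped) <= NAME_MAX:
        return escaped
    digest = hashlib.sha1(escaped.encode("utf-8")).hexdigest()
    return "%s_%s_%d" % (escaped[:prefix_len], digest[:hash_len], index)


def prefixdirs_path(escaped, chunk=NAME_MAX):
    """Split the escaped name into chunk-sized components joined by '/'."""
    parts = [escaped[i:i + chunk] for i in range(0, len(escaped), chunk)]
    return "/".join(parts)


name = "A" * 600
print(lfn_name(name, index=1))                            # AAA_<7 hex digits>_1
print([len(p) for p in prefixdirs_path(name).split("/")])  # [255, 255, 90]
```

The trade-off argued over in the channel shows up directly: `lfn_name` is lossy, so the full name has to live somewhere else (the xattr) and collisions need the index, while `prefixdirs_path` is reversible by dropping the slashes but yields paths longer than the original name.
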
[2:02] <Tv> gregaf: so you think FOO_ef5434e_1_head is nicer than FOO/BAR/BAZ_head ?
[2:02] <gregaf> I don't think it's nicer to read, no, but I think it's nicer to read than FO_hashvalue and I think it doesn't do horrible things to the namespace and I think it prevents us generating 4k paths
[2:03] <Tv> gregaf: in my mind, FOO/BAR/BAZ_head is less horrible things to the namespace than hashing some names and not all
[2:04] <gregaf> *shrug*
[2:04] <Tv> that's literally a namespace violation; namespaces of short and long filenames are now overlapping, and there can be confusion, which can only be resolved via external means (xattrs)
[2:04] <gregaf> the namespace is otherwise entirely based on PGs, and we'd like to extend it to pre-hashing future PGs
[2:04] <gregaf> making parts of object names part of the namespace...eww
[2:05] <Tv> the object names have been part of the namespace as long as i've seen ceph store objects :-o
[2:05] <Tv> now this is just confusing
[2:05] <cmccabe> http://valerieaurora.org/monkey.html
[2:06] <Tv> perhaps gregaf means he prefers the flat directory tree?
[2:06] <cmccabe> looks like SHA-2 hasn't been weakened yet, for what it's worth.
[2:06] <gregaf> yeah, the dir tree
[2:06] <cmccabe> gregaf: ironically, you probably have a good idea how much a flat directory tree sucks for the FS implementor
[2:07] <cmccabe> gregaf: from the MDS code
[2:07] <yehudasa> cmccabe: should I point out that git uses SHA1?
[2:07] <Tv> cmccabe: but listdir() is such fun!
[2:07] <cmccabe> actually, I believe btrfs should handle that case quite well though.
[2:07] <Tv> yehudasa: hey git uses directory prefixes! i win!
[2:07] <gregaf> I also know that PGs have limited size and we're arguing over collisions that will never happen
[2:07] <Tv> l)
[2:07] <yehudasa> Tv: well.. you can both have collisions and prefixes
[2:07] <Tv> ;)
[2:07] <cmccabe> yehudasa: there was a long discussion about git and SHA1, and it was pointed out that the chance of generating a patch that caused a collision AND was valid code was very, very small.
[2:08] <Tv> yehudasa: git is also different in that it doesn't need to store potentially hostile content
[2:08] <yehudasa> cmccabe: hence our point
[2:08] <cmccabe> yehudasa: it's not like an object name where almost anything is fair game.
[2:08] <cmccabe> yehudasa: a code patch has to be... valid code!
[2:09] <cmccabe> yehudasa: valerie actually explicitly addresses this point in her essay
[2:09] <cmccabe> yehudasa: "Other systems only allow trusted users to add data to the system; the various CBA-based version control systems like git and Monotone fall into this category. If a user can create hash collisions in the system, they can also directly check in code to accomplish the same effects, so why worry about a fancy hash collision attack? Educate your users not to store colliding data intentionally and the problem is solved."
[2:10] <cmccabe> yehudasa: however, S3 has the concept of buckets to which any user can upload data
[2:10] <cmccabe> anyway, I am probably spending too much time thinking about SHA-1.
[2:11] <cmccabe> considering the other attacks people could make against us, it is a very minor point.
[2:11] <Tv> i'd be fine with SHA1
[2:11] <Tv> if you just actually start off with the mentality of handling the edge cases
[2:11] <Tv> it's not even about just collisions
[2:11] <cmccabe> Still I would argue that if you want to get rid of the suffix and have a pure hash-based system, you should use SHA-2 and be done.
[2:11] <cmccabe> no reason to build something that doesn't use the best available technology.
[2:12] <cmccabe> this is all so much more exciting than.... JNI... :P
[2:15] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[2:16] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[2:16] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:31] <bchrisman> cmccabe: speaking of things more exciting than JNI… is libceph ready for me to continue on with??
[2:32] <cmccabe> bchrisman: yeah. testceph is working, and I added some more tests to it
[2:32] <cmccabe> bchrisman: right now, looking at CephFSInterface.cc
[2:32] <cmccabe> bchrisman: but none of that should affect you I think
[2:33] <cmccabe> bchrisman: if anything else changes, it might be those wacky getter/setters for file stripe unit and so forth
[2:34] <cmccabe> bchrisman: but that's really minor
[2:34] <cmccabe> tv: here is an idea for getting around the 4096-byte limit.
[2:35] <cmccabe> tv: use invalid UTF-8 characters instead of escaped sequences to represent /
[2:35] <cmccabe> tv: heh
[2:39] <bchrisman> cmccabe: cool.. I'll pick it up on our next build cycle.. thanks
[2:39] <cmccabe> bchrisman: np
[2:54] <cmccabe> anyone have any information about the hypertable stuff?
[2:54] <cmccabe> it seems to not exist in the makefiles even
[3:04] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:17] <cmccabe> looks like it is part of hypertable itself. wacky.
[3:18] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[4:39] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[5:05] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[5:12] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:09] * Juul (~Juul@c-76-21-88-119.hsd1.ca.comcast.net) has joined #ceph
[6:15] * Juul (~Juul@c-76-21-88-119.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[6:46] * yehudasa_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[6:52] * Juul (~Juul@c-76-21-88-119.hsd1.ca.comcast.net) has joined #ceph
[7:49] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: Been here. Done that.)
[7:51] * yehudasa_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[8:25] * yehudasa_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[8:29] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:58] * allsystemsarego (~allsystem@188.25.131.212) has joined #ceph
[9:07] * Juul (~Juul@c-76-21-88-119.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[9:07] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[11:43] <lxo> ok, I got a new mds core file generated by the assert (to > trimmed_pos); the mds had just finished recovering the cluster when it got an OSD reply that ended up calling C_PurgeRange::finish, which in turn called Journaler::C_Trim::finish, that aborted when it found that to == trimmed_pos
[11:43] <lxo> what other info would be useful to diagnose this?
[11:46] <lxo> the prior mds seems to have got laggy, and ended up being kicked
[11:48] <lxo> it took some 70 seconds between old mds realizing it was down and new mds (previously in standby replay) to complete recovery and then crash
[11:49] <Yulya_> hm
[11:53] <Yulya_> how can i change replication level on existing filesystem?
[11:53] <Yulya_> and how can i change size of chunk?
[11:53] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[11:53] <lxo> now, it looks like the cfuse client on the machine that was actively modifying the filesystem won't reconnect to the new mds, although the other machines did just fine
[11:53] * Disconnected.

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.