#ceph IRC Log


IRC Log for 2011-03-01

Timestamps are in GMT/BST.

[0:01] <Tv> sjust: oh btw bad news, i re-created the vm with the exact same size disk as it had before; rerunning..
[0:11] <lxo> well, it looks like it has indeed deleted data from the osd0 that held all the data before the reconfiguration, so stuff is now definitely lost from the storage pool
[0:12] <lxo> now, I wonder if, should the root of the pool be lost, access to the filesystem would block forever as I'm observing, without it ever “garbage collecting” children of the lost root directory
[0:14] <sagewk> currently yes. you'll want to just rerun mkcephfs
[0:18] <lxo> cool, thanks a bunch!
[0:33] <neurodrone> Hi, I had a small question on the CRUSH algorithm used for PG <-> OSD hashing in Ceph.
[0:33] <neurodrone> As is observed, the client retrieves the inode and OIDs for a particular filename from the auth MDS and initiates contact with the corresponding OSDs.
[0:33] <neurodrone> So in this process, does it compute the hash of the OID, map it to a PG, and then perform CRUSH(PGid) to get a list of primary/secondary OSDs, or does it pass the list of OSDs to the cosd daemon and wait until it informs the client about the mappings?
[0:34] <cmccabe> neurodrone: the clients talk to OSDs directly
[0:35] <neurodrone> cmccabe: And the computation of these mappings using CRUSH is performed by the OSDs themselves, right?
[0:35] * lxo (~aoliva@201.82.177.26) Quit (Read error: Connection reset by peer)
[0:36] * lxo (~aoliva@201.82.177.26) has joined #ceph
[0:36] <cmccabe> neurodrone: honestly I'm a little unclear about this part of the puzzle. The PGMap has a role here too
[0:36] <neurodrone> Okay, I will make a note of that.
[0:37] <neurodrone> Also, when you say talk to the OSDs directly, you do mean that the client connects to a daemon sitting on any OSD and then is redirected to the primary OSD by that daemon, right?
[0:37] <Tv> neurodrone: clients evaluate CRUSH independently, based on the osdmap they get from monitors
[0:38] <Tv> neurodrone: and that tells them what OSD to contact
[0:38] <neurodrone> Tv: Yea, that should reduce latency a lot.
[0:38] <Tv> neurodrone: if they got it wrong (stale data), they are given a delta to the latest osdmap
[0:39] <neurodrone> Tv: But, are they going to get stale data? I mean the OSDs are well synchronized, aren't they?
[0:39] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[0:39] <Tv> neurodrone: e.g. client C downloads the osdmap data from monitors, sleeps for 10 sec, meanwhile monitors decide on new version of osdmap and osds migrate data, then client tries to fetch an object
[0:39] <Tv> neurodrone: they = client
[0:39] <neurodrone> Tv: oh okay that case.
[0:39] <cmccabe> neurodrone, tv: that's right, monitors can't push out changes to everyone instantaneously
[0:39] <neurodrone> I see.
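
A minimal sketch of the client-side placement computation described above: hash the object id into a PG, then run CRUSH on the PG id against the osdmap the client already holds, and contact the resulting primary. The hash and CRUSH functions below are simplified stand-ins, not Ceph's actual implementation.

    # Sketch only: stand-ins for Ceph's real hashing and CRUSH code.
    def object_to_pg(oid, pg_num):
        # The client hashes the object name into one of the pool's PGs.
        # (Ceph uses a stable hash; Python's hash() is just a placeholder.)
        return hash(oid) % pg_num

    def crush(pg_id, osdmap):
        # CRUSH deterministically maps a PG id to an ordered list of OSDs
        # using only the (versioned) osdmap, so every client and OSD that
        # holds the same map computes the same answer.
        ranked = sorted(osdmap["osds"], key=lambda osd: hash((pg_id, osd)))
        return ranked[:osdmap["replica_count"]]

    def locate(oid, osdmap):
        pg_id = object_to_pg(oid, osdmap["pg_num"])
        acting = crush(pg_id, osdmap)
        return acting[0], acting  # primary, full acting set

    # Example: the client sends its op straight to the primary; no lookup
    # server sits in the middle.
    osdmap = {"epoch": 12, "pg_num": 128, "replica_count": 3, "osds": [0, 1, 2, 3, 4]}
    primary, acting = locate("10000000001.00000000", osdmap)
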
[0:40] <Tv> same thing goes for just about anything, though
[0:40] <Tv> osdmap updates spread by gossip
[0:41] <cmccabe> neurodrone, tv: looks like the OSDMap also has information about pools and information about whether OSDs are up or down
[0:41] <Tv> yeah i'm not sure if "osdmap" is all of that info, or whether just part of it
[0:41] <Tv> but it's "state of the cluster" as a whole
[0:42] <cmccabe> tv: it might also be reflected elsewhere, but I think OSDMap is how clients learn of it
[0:42] <neurodrone> And how fast is the change monitored and reflected on the OSDMap ?
[0:43] <Tv> neurodrone: timeout to declare something dead was 15 seconds if i recall, then it spreads non-instantaneously by gossip
[0:43] <neurodrone> Is it like if an OSD is down (disk failure, or a complete system crash) the other OSDs periodically check on it and hence know it's down and out and update the map?
[0:43] <Tv> neurodrone: anyone detecting a non-responsive osd starts gossiping that it's sick
[0:43] <neurodrone> Tv: Ah, Gossip seems to be very effective in these kinds of scenarios.
[0:44] <Tv> neurodrone: but monitors are the only thing receiving regular heartbeats from anything
[0:44] <neurodrone> I see.
[0:45] <neurodrone> Tv: Also, on the point you stated that if the client has stale information it's given a diff, how does the daemon know whether the client is requesting with stale data or not?
[0:46] <neurodrone> Object versions?
[0:46] <cmccabe> tv, neurodrone: actually, looking at my notes, and again at the code, it seems that PGMap messages only get sent between monitors.
[0:46] <neurodrone> cmccabe: Oh, I see.
[0:46] <cmccabe> tv, neurodrone: OSDMap, on the other hand, goes pretty much everywhere
[0:46] <cmccabe> tv, neurodrone: MDSMap goes everywhere except OSDs, as you might expect
[0:47] <Tv> neurodrone: osdmaps have version numbers; when handshaking a connection, if the peer has lower number than us, we hand it the delta
[0:47] <neurodrone> Oh, the OSDMaps have versions too. Okay, that's better.
[0:48] <neurodrone> cmccabe: So, the client is oblivious of the PGMaps and just deals with the OSDMaps and MDSMaps, right?
[0:48] <cmccabe> yes
[0:48] <neurodrone> Okay, makes sense.
[0:49] <cmccabe> epoch_t is generally used to represent the version numbers tv was talking about earlier
[0:49] <gregaf> Tv: actually the OSDs do regular heartbeats of their peers
[0:49] <cmccabe> well,
[0:49] <neurodrone> cmccabe: And all these 3 maps have distinct "epoch_t"s ?
[0:49] <cmccabe> there's two things: an epoch and a version within the epoch
[0:49] <cmccabe> whenever a big enough change happens, that starts a new epoch (in the OSDMap)
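
A rough sketch of the versioned-map exchange discussed above: maps carry an epoch, and a peer that announces an older epoch during the connection handshake is handed just the missing increments. The structures and function below are illustrative, not Ceph's actual message types.

    # Sketch, assuming we keep incremental updates keyed by the epoch they produce.
    def missing_increments(my_epoch, incrementals, peer_epoch):
        """Return the incremental map updates a stale peer needs."""
        if peer_epoch >= my_epoch:
            return []  # peer is as new as (or newer than) us; nothing to send
        return [incrementals[e] for e in range(peer_epoch + 1, my_epoch + 1)]

    # Example: we are at epoch 12 and the peer announced epoch 9, so it
    # receives the increments for epochs 10, 11 and 12.
    incrementals = {e: {"epoch": e} for e in range(1, 13)}
    delta = missing_increments(12, incrementals, 9)
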
[0:49] <gregaf> the monitors get some heartbeats now as well, but that's recent and not integral to the algorithms
[0:50] <Tv> gregaf: oh yeah sure, peers -- but not all osds to all osds
[0:50] <neurodrone> gregaf: So, there is no chance an OSD failure can be missed if we do Gossip + heartbeat.
[0:50] <Tv> gregaf: those guys have an active connection anyway
[0:50] <gregaf> neurodrone: nope!
[0:51] <cmccabe> gregaf: yeah, the osd_heartbeat_grace setting actually refers to the heartbeat between OSDs
[0:51] <gregaf> we used to have problems if all the OSDs failed simultaneously — that's why they report to the monitors now
[0:51] <neurodrone> Aha.
[0:51] <cmccabe> gregaf: yes. And that timeout is much longer than 20 seconds
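
A simplified sketch of the peer-heartbeat scheme being discussed: each OSD only pings the peers it already shares PGs with, and anything silent past the grace period gets reported to the monitors. osd_heartbeat_grace is the real option mentioned above; the value and class below are illustrative.

    import time

    HEARTBEAT_GRACE = 20.0  # placeholder; the real grace period is configurable

    class HeartbeatTracker:
        """Sketch: track the last reply from each PG peer."""
        def __init__(self):
            self.last_seen = {}  # peer osd id -> timestamp of last heartbeat reply

        def on_reply(self, peer):
            self.last_seen[peer] = time.time()

        def suspects(self):
            now = time.time()
            return [p for p, t in self.last_seen.items()
                    if now - t > HEARTBEAT_GRACE]

    # Any OSD that finds a suspect reports the failure to the monitors, which
    # mark the OSD down in a new osdmap epoch that then spreads by gossip.
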
[0:51] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:51] <gregaf> otherwise we wouldn't have it — Ceph was originally architected to minimize the number of connections each daemon needed to maintain
[0:52] <gregaf> the thinking being that if you have 100000 OSDs trying to talk to 3 monitors you might start running into scaling issues
[0:52] <gregaf> we've dropped that a bit but it may come back in the future
[0:53] <neurodrone> gregaf: How many monitors do you normally deploy on a network having like 50 OSDs and 5 MDSs ?
[0:53] <gregaf> 3 is our default number
[0:53] <neurodrone> I hope that is a reasonable figure for MDSs and OSDs. :)
[0:53] <gregaf> we haven't hit its scaling limits yet
[0:53] <neurodrone> Ah.
[0:53] <gregaf> I'm sure one would be fine but then there's no redundancy
[0:55] <neurodrone> gregaf: And if the number of clients ranges up to even 1000, would the system still function smoothly?
[0:56] <cmccabe> neurodrone: the osds report to the monitors very infrequently... I think it was like every 10 minutes or something
[0:56] <gregaf> neurodrone: it should!
[0:57] <gregaf> the clients don't place any real burden onto the system in terms of synchronization
[0:57] <neurodrone> cmccabe: But the changes to the OSDMap are introduced via the monitor daemon only I guess.
[0:57] <neurodrone> gregaf: Yep, agreed.
[0:57] <gregaf> neurodrone: that's what the monitors do — they maintain the shared cluster states
[0:58] <cmccabe> the thinking is that changes to the osdmap should be relatively rare, even with 1000 osds
[0:58] <gregaf> only the monitors are allowed to update the various Maps, and they do so in a guaranteed-safe fashion via an implementation of paxos
[0:58] <cmccabe> it's certainly not required for every operation or anything like that
[0:58] <neurodrone> So you have 3 monitor daemons which service the client requests, keep changing the OSDMap/MDSMap along with other states, and still no significant latency problems?
[0:59] <neurodrone> gregaf: You also must have to use a load balancer to direct client requests around the set of monitor daemons, right?
[1:00] <gregaf> neurodrone: the clients don't talk to the monitors very much — the entire thing is very distributed
[1:00] <gregaf> gotta run though, be back on later
[1:00] <neurodrone> gregaf: Oh yea, just the request and then they do the computations themselves.
[1:00] <neurodrone> gregaf: Sure, catch ya later. :)
[1:01] <neurodrone> cmccabe: If the changes to the maps are occasional, then I guess using 3 monitors won't be that bad.
[1:01] <neurodrone> cmccabe: Do you know if they were using monitors from their very first version or is it a part of a newer design?
[1:12] <cmccabe> neurodrone: monitors were in the design from the beginning
[1:12] <neurodrone> I see.
[1:12] <cmccabe> neurodrone: greg was just talking about a particular heartbeat-type message from osds to monitors that got added later
[1:12] <cmccabe> neurodrone: the most important heartbeat is between the OSDs
[1:12] <neurodrone> cmccabe: I think I do see the mention of heartbeats used by OSDs in their 2006 paper but I don't know if they had implemented it already.
[1:12] <neurodrone> I agree.
[1:12] <neurodrone> I am also curious about how the OSDs (i.e. the primary) achieve multi-writes without any locks.
[1:12] <cmccabe> neurodrone: different osds are responsible for different PGs
[1:12] <neurodrone> If a single file is written to by multiple writers are the changes made by one visible to the rest while the others are making changes?
[1:12] <neurodrone> cmccabe: Yep true. And the writes for a particular file are directed through only the primary assigned for that replica.
[1:12] <cmccabe> neurodrone: files are split into objects by the MDS layer
[1:12] <cmccabe> neurodrone: each object goes into a PG
[1:12] <neurodrone> I get that part. But I don't quite see how the multiple writers scenario works ensuring consistency.
[1:12] <cmccabe> each PG has one OSD as primary. That OSD will handle all writing (and most reads)
[1:12] <neurodrone> cmccabe: Yep, I get it till there.
[1:12] <neurodrone> cmccabe: But if a file is opened by multiple writers how are the changes made by one visible to others?
[1:12] * lxo (~aoliva@201.82.177.26) Quit (Read error: Connection reset by peer)
[1:12] <neurodrone> I mean while the data is in the buffer and not committed to the disk.
[1:12] <neurodrone> I don't see a locking mechanism at a file granularity. Does it exist at the Object granularity?
[1:12] <cmccabe> neurodrone: more or less
[1:13] <neurodrone> This is what it says in the paper - "First, clients are interested in making their updates visible to other clients. This should be quick: writes should be visible as soon as possible".
[1:13] <neurodrone> I wonder how they achieve simultaneous editing of a file.
[1:13] <neurodrone> Kinda like Google Buzz. :)
[1:14] <cmccabe> yeah, that is a good question and I don't have the answer
[1:14] <neurodrone> Okay, no problem. :)
[1:14] <cmccabe> it seems you would need something a little like oplocks
[1:14] <neurodrone> Yea, true.
[1:14] <sagewk> neurodrone: when a file is open by multiple writers, io is synchronous to the osds, so serialization happens there.
[1:14] <neurodrone> but then wouldn't locking mean that only one writer can be writing at a time?
[1:15] <sagewk> the "locking" is what makes them go synchronous. they can do whatever reads/writes they want, they'll just go over the wire (and be slower)
[1:16] <cmccabe> does anyone have a concise specification of POSIX semantics
[1:16] <cmccabe> I feel like it should be standardized somewhere but I've never actually seen it
[1:17] <cmccabe> things like reads must be visible after writes, etc.
[1:17] <neurodrone> sagewk: I understand. Maybe I am still uncertain about how it's able to achieve locks and visibility and simultaneous editing at the same time.
[1:17] <sagewk> http://www.unix.org/single_unix_specification/
[1:17] <sagewk> maybe not "concise", but... :)
[1:17] <Tv> sjust: autotest should be good to go again
[1:18] <sjust> Tv: cool
[1:19] <cmccabe> yeah, this sort of mixes the spec for... everything... in
[1:19] * greglap (~Adium@166.205.139.133) has joined #ceph
[1:20] <cmccabe> I was hoping there would be like a 2-page summary somewhere out there
[1:20] <cmccabe> I think I already get the main points... POSIX semantics is mainly a cache-coherency thing
[1:20] <neurodrone> Are the object-locks released after an "ACK" is received by the client or are they released immediately after the write is performed?
[1:21] <cmccabe> but there's a lot of interesting little things like using files after unlink
[1:22] <neurodrone> "write is performed" means here that the data is pushed onto the primary by Writer A and now waits for an ACK, but in a non-blocking manner.
[1:25] <sagewk> cmccabe: it's about cache coherency insofar as everything is coherent on a single host, and that's how the semantics were originally defined.
[1:26] <sagewk> neurodrone: there actually aren't any object locks currently.
[1:26] <cmccabe> sagewk: it looks like profiling generates output when signals are not masked in threads
[1:26] <sagewk> excellent. hopefully we can unmask the one signal that matters?
[1:26] <cmccabe> sagewk: can you remember why we masked it originally?
[1:26] <cmccabe> sagewk: something to do with qemu as I recall
[1:26] <sagewk> nope. look at the past commits i guess?
[1:26] <neurodrone> sagewk: erm, I see. Can you elucidate a bit upon how multiple writers perform writes on the primary in a consistent fashion?
[1:27] <sagewk> oh.. yeah
[1:27] * lxo (~aoliva@201.82.177.26) has joined #ceph
[1:27] <sagewk> they are serialized at the osd. A writes "foo", B writes "bar". whichever arrives first is written first, second is written second.
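
A bare-bones sketch of that serialization: the primary applies writes in the order they arrive and replicates them in the same order, with no object locks involved. Class and method names below are illustrative, not the cosd implementation.

    class ReplicaOSD:
        def __init__(self):
            self.store = {}

        def apply(self, oid, data):
            self.store[oid] = data

    class PrimaryOSD:
        """Sketch: per-object writes are applied strictly in arrival order."""
        def __init__(self, replicas):
            self.replicas = replicas
            self.store = {}   # object name -> bytes
            self.log = []     # ordered op log as seen by the primary

        def handle_write(self, client, oid, data):
            # Whichever client's write arrives first is applied first.
            self.store[oid] = data
            self.log.append((client, oid, data))
            for replica in self.replicas:
                replica.apply(oid, data)  # replicas apply in the same order
            return "ack"

    # Clients A and B can both write the same object; the primary's arrival
    # order is the serialization order everyone observes.
    osd = PrimaryOSD([ReplicaOSD(), ReplicaOSD()])
    osd.handle_write("A", "obj1", b"foo")
    osd.handle_write("B", "obj1", b"bar")
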
[1:27] <neurodrone> Oh. :)
[1:27] <greglap> cmccabe: I believe Ceph code was stealing signals that qemu wanted to be handling
[1:28] <cmccabe> perhaps we should only mask signals in library-generated threads for now
[1:28] <sagewk> yeah
[1:28] <neurodrone> sagewk: I am sorry for bringing in the locks thingy. This implementation sounds better.
[1:28] <sagewk> :)
[1:29] <greglap> neurodrone: they both have their ups and downs — locks would let the clients buffer their writes, which they can't do in multi-writer situations right now
[1:29] <bchrisman> Looks like locks work though, for application synchronization of file access across clients.
[1:29] <sagewk> cmccabe: a flag that's set during library init that controls whether they're masked. whatever common_init has turned into
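
A rough illustration of the flag-at-init idea being discussed, transliterated into Python: block signals only in threads created on the library path, with the flag set by whichever init variant the library uses. The flag and helper names are hypothetical, not the actual common_init/Thread.h interface.

    import signal
    import threading

    # Hypothetical flag: set True by the library-flavoured init so an embedding
    # application (e.g. qemu) keeps its signals, left False for daemons and
    # utilities so things like profiling signals still reach their threads.
    MASK_SIGNALS_IN_LIBRARY_THREADS = False

    def start_thread(target):
        def entry():
            if MASK_SIGNALS_IN_LIBRARY_THREADS:
                signal.pthread_sigmask(signal.SIG_BLOCK,
                                       {signal.SIGTERM, signal.SIGINT})
            target()
        t = threading.Thread(target=entry)
        t.start()
        return t
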
[1:29] <cmccabe> the case where locks would be better is if two different clients did non-overlapping I/O on the same file
[1:29] <greglap> I'm interested in adding byte-range locks to our current capabilities system, but that's not something I think we'll look at until version2+
[1:29] <neurodrone> greglap: True. But locks would hog up on the latency. Current implementation is way fast.
[1:29] <sagewk> bchrisman: greglap will be happy to hear that :)
[1:29] <cmccabe> I don't know how common that actually is though
[1:29] <greglap> bchrisman: advisory file locking should work!
[1:30] <greglap> but in this context we were talking about the locks that eg Lustre uses on the backing store regions
[1:30] <greglap> (at least, I think? that's what I was interpreting it as)
[1:30] <greglap> neurodrone: have you tried multi-writers yet?
[1:31] <neurodrone> greglap: hehe no. I was just imagining a case. :)
[1:31] <greglap> I think all the tests I've seen show it reaching somewhere between .5MB and 10MB/sec
[1:31] <neurodrone> Not bad at all.
[1:31] <bchrisman> greglap: still implementing some testing on that...
[1:31] <greglap> which IIRC is comparable to most filesystems and about half the speed of ones which are tailored for multiple writers
[1:32] <neurodrone> Ah, true. :)
[1:32] <greglap> it's rather slower than single-writer scenarios, which usually get close to saturating the network :)
[1:32] <cmccabe> for what it's worth, I think GFS was tailored for multiple appends to a file
[1:33] <cmccabe> GFS = google fs
[1:33] <neurodrone> cmccabe: Yep, and just that. ;)
[1:34] <neurodrone> They also keep stating frequently that they are trying to solve a rare problem and hence have to move beyond the workings of a "conventional" filesystem.
[1:34] <cmccabe> yeah, you're expected to append all the time and very rarely overwrite as I remember
[1:35] <cmccabe> I think the block size was huge too
[1:35] <neurodrone> A log-structured FS is perfect for their case.
[1:35] <neurodrone> yep, 64 MB chunks.
[1:36] <neurodrone> I guess file-counting was a major problem they were trying to solve in their early days.
[1:36] <Tv> cmccabe: comparing to GFS is kinda moot, i recall it doesn't even do overwrite ever..
[1:36] <Tv> append-only always
[1:37] <cmccabe> apparently Sean Quinlan says that GFS is "in some respects... reminiscent of a log-structure[d] filesystem"
[1:37] <cmccabe> but I don't know if it is actually a log-structured filesystem per se
[1:37] <neurodrone> Nah, they don't use one AFAIK.
[1:37] <cmccabe> I guess if it doesn't have overwrite, it might not really be a "file system" either :)
[1:38] <greglap> it's not terribly helpful anyway since they've only published a couple papers and no source code
[1:38] <neurodrone> But most of us don't know about the changes they have made in GFS2 (Colossus).
[1:38] <greglap> it's a custom FS for distributed computing, just as HDFS is
[1:38] <neurodrone> I think they have to do a huge amount of random reads and thus a LFS is a turn-off when trying to satisfy that use case.
[1:39] <cmccabe> everything I've read says that the main change in GFS2 is moving away from a single master
[1:39] <cmccabe> I actually never heard the codename "colossus" though, heh
[1:39] <neurodrone> Hadn't they implemented Multi-mastered cells in GFS itself?
[1:40] <cmccabe> http://www.ninesys.com/2010/09/12/constant-improvement/
[1:40] <neurodrone> I mean the paper didn't state it. But, when the engineers used to talk about it, they used to talk about Master-failovers in a single cell.
[1:41] <greglap> mmm, it might be failovers like HDFS namenode standbys
[1:41] <cmccabe> it really sounds like they did a version of the old "lets keep all metadata in memory" cheese
[1:41] <neurodrone> Yea, maybe like that.
[1:41] <greglap> and now they've got multi-masters like Ceph has multiple monitor servers
[1:41] <neurodrone> haha :)
[1:41] <greglap> or (probably more accurately) like Zookeeper has multiple nodes
[1:42] <neurodrone> I have been reading up on Ceph very very recently and I somehow like it very much.
[1:42] <cmccabe> there's like a ton of object stores out there that just keep all metadata in memory and write out the world to disk every N seconds
[1:42] <cmccabe> of course GFS was no doubt more advanced than that, but... same basic idea
[1:43] <sagewk> cmccabe: yeah, lets just set a flag for the signal masking, and only set that for libceph, librados, etc.
[1:43] <cmccabe> sagewk: k
[1:43] <cmccabe> sagewk: I wish there was some way to know what "kind" of thing we were at compile-time
[1:44] <cmccabe> sagewk: like we should know whether we're linking a library, daemon, or utility
[1:44] <cmccabe> sagewk: I guess libcommon is still shared among all those, so it still needs to be an if statement at some level though
[1:45] <cmccabe> sagewk: I think I have an idea for how to do that in Thread.h
[1:47] <sagewk> you can do that with the automake stuff..
[1:48] <cmccabe> yeah
[1:48] <sagewk> librados_la_CFLAGS = -DFOO
[1:48] <cmccabe> I think the best way is to use the linker
[1:48] <sagewk> not sure if that's the cleanest approach.. most other runtime behavior is controlled via the conf stuff.
[1:48] <cmccabe> and have each build product link in a special .o that has a different value for some symbol called code_type
[1:49] <cmccabe> the problem with using the preprocessor is that libcommon is only compiled once.
[1:49] <cmccabe> it's really the linker that can put in a different symbol for each build product.
[1:49] <sagewk> that's what i mean, it's not.. automake builds it separately for different targets if there are different CFLAGS specified
[1:49] <cmccabe> hmm
[1:49] <sagewk> for example, the libraries all build with -fPIC and the daemons don't
[1:50] <sagewk> that's one reason why make is so slow :)
[1:50] <cmccabe> yeah
[1:50] <sagewk> libcommon is built separately for daemons, libceph, librados, librbd, etc.
[1:50] <Tv> sagewk: can you point me to Al Viro's comments on ceph kernel client?
[1:50] <sagewk> i'll forward it. it's off list.
[1:50] <Tv> ahh
[1:51] <Tv> that explains why it was so hard to find
[1:51] <sagewk> :)
[1:51] <cmccabe> I'm interested in hearing about that issue too
[1:51] <cmccabe> the mds is something I still don't completely understand yet
[1:51] <cmccabe> mds/kclient
[1:53] <greglap> basically, the I_COMPLETE flag on the kclient isn't being maintained properly due to races in the VFS
[1:53] <greglap> that's about all I managed to get out of it, anyway
[1:54] <greglap> which is...terribly sad
[1:54] <greglap> the kclient is going to be slow in the next kernel release :(
[1:54] <Tv> oh heh i don't have a kernel git clone on this machine.. i feel naked..
[1:55] <greglap> sagewk: speaking of which, should probably put that issue in the tracker
[1:56] * greglap1 (~Adium@166.205.137.3) has joined #ceph
[2:00] <Tv> sagewk: i'm wondering why no one just did http://news.gmane.org/find-root.php?message_id=%3cAANLkTimUjC8B0%3dnn5%2byDshzp0hy6Zk1wxYHWEZbk83Tf%40mail.gmail.com%3e -- or if i'm just missing an email saying why it won't work
[2:00] <cmccabe> although we could use -DCODE_TYPE=foo to distinguish between daemon code and library code, I think that will make ccache almost useless
[2:00] <neurodrone> I had one more question. Was there any specific reason that Btrfs was chosen against EBOFS to work at the OSD level?
[2:00] <Tv> cmccabe: err, daemons use libraries too
[2:01] <greglap1> neurodrone: EBOFS kept all its metadata in memory
[2:01] <cmccabe> well, right now, most programs are compiled with the same flags, even if they are compiled multiple times by the "brilliant" automake
[2:01] <Tv> neurodrone: my guess: "somebody smart was working on it, and making good progress"
[2:01] <greglap1> and was custom-built for Ceph, which meant it needed to be maintained and kept safe for data
[2:01] <cmccabe> so ccache can just translate those multiple compiles into a single compile
[2:01] <greglap1> and btrfs mapped pretty nicely into fixing those issues
[2:01] <cmccabe> but if we start having more cflags differences between stuff, that would change.
[2:01] <neurodrone> greglap1: I see. :)
[2:01] <Tv> cmccabe: how is what you're doing not an argument to some _init function?
[2:02] <neurodrone> I was just arbitrarily checking up on Btrfs on lwn the other day. Didn't know it was custom built for Ceph. :)
[2:02] <greglap1> Tv: that link doesn't seem to work
[2:02] <Tv> greglap1: works for me, huh
[2:02] <Tv> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/1676
[2:02] <cmccabe> threads can be created at global constructor time, and certain people stubbornly believe they shouldn't have to call an init function
[2:03] <neurodrone> Oh you mean EBOFS was custom made for Ceph. Sorry, my bad.
[2:03] * greglap (~Adium@166.205.139.133) Quit (Ping timeout: 480 seconds)
[2:03] <Tv> cmccabe: you mean something starts creating threads before it's been inited? that's evil
[2:03] <greglap1> neurodrone: yep!
[2:03] <cmccabe> this is all a little weird because I remember you arguing bitterly against requiring any init in unit tests
[2:04] <neurodrone> greglap1: metadata in memory is bad? Isn't it faster? :S
[2:04] <neurodrone> keeping metadata*
[2:04] <greglap1> Tv: that's a separate issue from what we're talking about now
[2:04] <Tv> cmccabe: no, i am against a single global init function
[2:04] <greglap1> neurodrone: it had to keep its metadata in-memory
[2:04] <greglap1> if there was more metadata than could fit in memory, bad things happened
[2:04] <neurodrone> Oh, so that was mandatory. I see.
[2:04] <greglap1> all filesystems keep some of their metadata in-memory
[2:05] <cmccabe> tv: having multiple confusing init functions that programmers may or may not remember to call won't improve things
[2:05] <Tv> cmccabe: your problem is at "confusing"
[2:05] <Tv> cmccabe: and how many independent libraries do you expect to use?
[2:05] <greglap1> Tv: that issue is what made Al Viro look at it enough to discover our current problem, which is that the way we maintain I_COMPLETE isn't safe
[2:05] <cmccabe> tv: don't get me wrong, your other points about modularizing init were totally valid
[2:06] <sjust> sagewk: scrub_noblock has been updated and seems to be testing ok
[2:06] <Tv> greglap1: yes, but are we still looking for a short term fix, or proper vfs locking for real?
[2:06] <greglap1> Nick Piggin has apparently disappeared off the face of the earth (as in, not sure he's okay, I think?) and nobody else has been brave enough to make the fix he discussed yet
[2:06] <cmccabe> greglap1: yeah, that inode scalability stuff was deep magic
[2:06] <Tv> greglap1: the d_count==0 instead of parent!=NULL part seems pretty easy
[2:07] <greglap1> well, we can hack a fix for that by copying the dentry pointers, but since we can't maintain the I_COMPLETE flag anyway there's no point
[2:07] <greglap1> Tv: dunno, not familiar with it at all
[2:07] <greglap1> but most everybody seems pretty scared of changing rcu-walk itself
[2:07] <Tv> greglap1: ahh ok so you're saying don't do a partial ugly workaround
[2:07] <Tv> rcu is a bitch to understand
[2:08] <greglap1> Tv: well, somebody decided not to do a partial ugly workaround
[2:08] <Tv> but switching from one constraint to another is doable, if you can audit the constraints to be equal
[2:10] <greglap1> anyway, I'm off
[2:10] <greglap1> night all
[2:10] <Tv> greglap1: have a good recharge
[2:10] <Tv> ah i see Viro's argument
[2:10] * greglap1 (~Adium@166.205.137.3) Quit (Read error: Connection reset by peer)
[2:11] <Tv> yeah, CEPH_I_COMPLETE is bogus as is
[2:11] <Tv> i think i could restore its earlier level of bogosity easily
[2:11] <Tv> but it'll still think some dirs are complete when they're not
[2:12] <Tv> sagewk: ^
[2:12] <Tv> looking at the larger issue..
[2:14] <Tv> ah ok so the temp copy of parent pointer is going to be stale, yup..
[2:19] <Tv> *and* the _I_COMPLETE flag is cleared too late
[2:27] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:34] * chip (~chip@brma.tinsaucer.com) has joined #ceph
[2:50] <neurodrone> I was wondering what happens when network partitions take place in the OSD cluster. Imagine if OSD0 is a primary for a particular Object and the writes are directed to it and OSD1, OSD2 are replicas for it. And the network is temporarily partitioned into {OSD0, OSD1}, {OSD2}. How are the updates propagated to OSD2? Is the delta kept track of and applied to OSD2 when the partition is healed?
[2:55] <cmccabe> I have to catch a train, I'll see you guys later
[2:56] <cmccabe> neurodrone: partitions are an interesting topic. Remember though that an OSD is the primary for a PG, not just for individual objects
[2:57] <cmccabe> neurodrone: individual objects all go inside PGs
[2:57] <cmccabe> neurodrone: different OSDs can wind up with divergent histories for the same PG and there is code in there that tries to merge those histories
[2:58] <cmccabe> neurodrone: frankly it's not well tested because those scenarios don't often happen
[2:58] <cmccabe> neurodrone: but history merging is a big deal in the PG
[2:58] <cmccabe> anyway, got to go, see you later
[2:58] <neurodrone> cmccabe: Ah, I see.
[2:58] <neurodrone> cmccabe: Thank you for the info. :)
[2:58] * cmccabe (~cmccabe@208.80.64.121) has left #ceph
[2:58] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:29] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[4:07] * Juul (~Juul@131.243.46.153) has joined #ceph
[4:12] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[4:22] <greglap> neurodrone: partitions are handled by the monitors and the OSDMap
[4:23] <neurodrone> Oh okay.
[4:23] <greglap> an OSD will only accept writes if it's the primary for the PG in question
[4:23] <neurodrone> but then the history-merging which cmccabe was talking about earlier, that is a tedious process, ain't it?
[4:24] <neurodrone> yep, that is true.
[4:24] <greglap> and if it gets separated from its peers then it will eventually get marked down
[4:24] <neurodrone> But in case the primary OSD is separated from the rest, how will the rest of OSDs reconcile the updates?
[4:24] <greglap> but the timeouts are set up such that it should start refusing to accept writes before the monitors set up the other OSDs to take over as a primary
[4:25] <greglap> so the primary OSD may accept updates that the other OSDs don't know about
[4:25] <neurodrone> Oh I see.
[4:25] <neurodrone> and then they will eventually reconcile when the partition heals?
[4:25] <greglap> but once they are able to communicate again, then they can know for certain that the writes on the other OSDs are newer than the writes on the original primary
[4:25] <greglap> so that's how they merge their history
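
A very rough sketch of that merge: because only the primary designated by the current osdmap epoch may accept writes, log entries from a later epoch always supersede entries from an earlier one. This glosses over the real PG log and peering machinery; the log format below is made up for illustration.

    def merge_pg_history(log_a, log_b):
        """Sketch: each entry is (osdmap_epoch, op_seq, oid, data).

        Ordering by (epoch, sequence) is enough here because two OSDs can
        only have accepted writes for the same PG under different epochs.
        """
        merged = sorted(log_a + log_b, key=lambda e: (e[0], e[1]))
        state = {}
        for epoch, seq, oid, data in merged:
            state[oid] = data  # last writer (by epoch, then sequence) wins
        return merged, state
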
[4:25] <neurodrone> Ah, that sounds good enough.
[4:26] <neurodrone> How are versions kept track of?
[4:26] <neurodrone> Ceph uses vector clocks?
[4:26] <neurodrone> How does it know that it has a newer version than the primary?
[4:26] <greglap> which versions?
[4:27] <greglap> the way OSDs get marked down is via a change in the OSDMap :)
[4:27] <neurodrone> That is true. But once a PG changes, the PGMap version changes too?
[4:27] <greglap> it's not a full-on vector clock for each write, but one of the OSDMap's purposes is to keep track of and order PG ownership changes
[4:27] <greglap> yeah, the PGMap changes more often than the OSDMap actually
[4:28] <neurodrone> Oh okay. That seems likely.
[4:28] * MK_FG (~MK_FG@188.226.51.71) Quit (Quit: o//)
[4:29] <greglap> the PGMap is not a fully-distributed piece of state like the OSDMap; so it's used for more trivial things like keeping track of how much data is in each PG as well as stuff like whether a PG is peering or active
[4:29] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[4:29] <neurodrone> Aha.
[4:29] <neurodrone> So, PG will keep changing continuously if new files are being created and written into?
[4:29] <neurodrone> a PGMap*
[4:29] <greglap> yep
[4:30] <greglap> it's basically just a reporting service, I don't think it gets distributed to anybody at all
[4:30] <neurodrone> Oh okay. Yea, it can be contained within its own PG.
[4:30] <greglap> "contained within its own PG"?
[4:30] <neurodrone> oh sorry, I mean an OSD.
[4:31] <greglap> still not sure what you mean
[4:31] <neurodrone> each OSD will have multiple PGs, right? Which will be stored as a PGMap?
[4:31] <greglap> ah, I see
[4:31] <greglap> yes, each OSD has multiple PGs
[4:31] <greglap> but like all the other maps, the PGMap is maintained by the monitors
[4:32] <greglap> the OSDs send usage and status data to the monitors, which incorporate it into a PGMap
[4:32] <neurodrone> A single PGMap per instance of Ceph?
[4:32] <greglap> yeah
[4:32] <neurodrone> Ah. Makes sense.
[4:33] <greglap> for each PG, the PGMap contains its current state (active, degraded, peering, etc) and its current number of objects and total size
[4:33] <neurodrone> Oh I see. :)
[4:33] <greglap> accurate to within some delta which can be tuned based on a number of config options
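
A toy sketch of the reporting role described above: OSDs periodically send per-PG status and usage to the monitors, which fold the reports into one cluster-wide PGMap. Field names here are illustrative.

    class PGMap:
        """Sketch: one cluster-wide map of per-PG state, kept by the monitors."""
        def __init__(self):
            self.version = 0
            self.pg_stats = {}  # pg id -> {"state": ..., "objects": ..., "bytes": ...}

        def apply_report(self, pg_id, state, num_objects, num_bytes):
            # Reports arrive on a timer, so the map is only accurate to within
            # the configured reporting interval.
            self.pg_stats[pg_id] = {"state": state,
                                    "objects": num_objects,
                                    "bytes": num_bytes}
            self.version += 1
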
[4:33] <neurodrone> Ah, that clears some of the dust.
[4:34] <neurodrone> So when I update an object, only the version no. of that particular OID changes, right?
[4:34] <neurodrone> No other maps or statuses change except some of the object metadata, right?
[4:34] <greglap> but the PGMap doesn't influence where PGs are placed (that's all in the OSDMap) and the state of a PG can change without the PGMap knowing, as the PG's state doesn't influence placement or anything else
[4:34] <greglap> object versions are changed based on how you're using snapshots
[4:34] <neurodrone> Yep, seems like I kinda understand it now.
[4:35] <neurodrone> on snapshots? not based on applied updates?
[4:35] <greglap> I don't think they change on every write, though
[4:35] <greglap> I could be wrong, I don't recall exactly how this works
[4:35] <neurodrone> Oh okay. I see.
[4:36] <greglap> I think the PG keeps track of the last epoch an object was written to (along with some other data if the PG is degraded)
[4:36] <greglap> but otherwise the object version only changes when you tell it to — which only happens when you create a new snapshot
[4:37] <neurodrone> ah, the last epoch time.
[4:37] <neurodrone> okay, so object versioning is used mainly for failure recovery.
[4:37] <greglap> well, that and snapshots
[4:37] <neurodrone> yep true. Nice design idea.
[4:38] <greglap> there may actually be separately maintained versions for these two pieces of functionality
[4:38] <greglap> I'm less familiar with recovery and on-disk storage than some of the other guys
[4:38] <neurodrone> Aha.
[4:38] <neurodrone> haha, you did clear the air for most of my doubts though. :)
[4:39] <neurodrone> I need to re-read the paper again now to make sure I have got the fundamentals right.
[4:39] <greglap> :)
[4:39] <greglap> just don't bother re-reading the ebofs stuff ;)
[4:39] <neurodrone> haha, yea no point doing that now it seems. :)
[5:24] <neurodrone> Time for me to crash. Been a long tiring day.
[5:25] <neurodrone> Thank you for your time, greglap.
[5:25] <neurodrone> Catch ya tomorrow. :)
[5:25] <greglap> np, cya
[5:25] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has left #ceph
[5:40] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:24] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Ping timeout: 480 seconds)
[6:27] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[7:47] * Juul (~Juul@131.243.46.153) Quit (Ping timeout: 480 seconds)
[9:17] * lidongyang (~lidongyan@222.126.194.154) has joined #ceph
[9:25] * allsystemsarego (~allsystem@188.27.166.127) has joined #ceph
[9:26] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:42] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[10:50] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[12:45] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[12:56] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[13:07] * Guest2417 (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) Quit (Remote host closed the connection)
[13:09] * bbigras (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) has joined #ceph
[13:10] * bbigras is now known as Guest3097
[13:46] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[14:15] * iggy (~iggy@theiggy.com) has joined #ceph
[14:35] * neurodrone (~neurodron@dhcp213-058.wireless.buffalo.edu) has joined #ceph
[14:39] * neurodrone (~neurodron@dhcp213-058.wireless.buffalo.edu) Quit (Read error: Connection reset by peer)
[14:39] * neurodrone (~neurodron@dhcp213-058.wireless.buffalo.edu) has joined #ceph
[15:05] * neurodrone (~neurodron@dhcp213-058.wireless.buffalo.edu) Quit (Ping timeout: 480 seconds)
[17:08] * neurodrone (~neurodron@dhcp211-070.wireless.buffalo.edu) has joined #ceph
[17:09] * neurodrone (~neurodron@dhcp211-070.wireless.buffalo.edu) Quit ()
[17:15] * neurodrone (~neurodron@dhcp211-070.wireless.buffalo.edu) has joined #ceph
[17:20] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[17:47] <Ormod> sagewk: Are the slides of your talk publicly available from somewhere?
[17:49] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:52] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:17] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:18] <sagewk> ormod: i'll post them now
[18:29] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has joined #ceph
[18:33] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:09] * joshd1 (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:13] <sagewk> ormod: they're on the ceph blog under SCALE 9x
[19:13] <sagewk> cmccabe: ready for skype?
[19:13] <cmccabe> ok
[19:28] * neurodrone (~neurodron@dhcp211-070.wireless.buffalo.edu) Quit (Quit: neurodrone)
[19:33] * neurodrone (~neurodron@dhcp211-070.wireless.buffalo.edu) has joined #ceph
[19:51] <Tv> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/path-lookup.txt#l129
[19:51] <Tv> i loled at the note on the diagram
[19:51] * wido (~wido@fubar.widodh.nl) Quit (Remote host closed the connection)
[19:51] * wido (~wido@fubar.widodh.nl) has joined #ceph
[19:57] * neurodrone (~neurodron@dhcp211-070.wireless.buffalo.edu) Quit (Quit: neurodrone)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.