#ceph IRC Log


IRC Log for 2011-08-30

Timestamps are in GMT/BST.

[0:00] <sagewk> ajm: great. anything strange you were doing with those nodes?
[0:00] <sagewk> are they running reasonably recent btrfs?
[0:00] <ajm> kernel, 2.6.39
[0:00] <ajm> they potentially have some hardware issues that i'm going to hopefully resolve soon
[0:01] <ajm> a situation where perhaps that file was created, but it wasn't able to write to it for whatever reason
[0:01] <sagewk> in theory it shouldn't have mattered.. we never read from current/.. osd uses the most recent snap on startup. the question is how an incomplete file snuck in there.
[0:02] <ajm> if i had to guess, i'd still blame a crash / hardware issue for creating the problem
[0:02] <ajm> but you say you're seeing it elsewhere... so perhaps not
[0:02] <ajm> either way it'd be nice if it handled it better
[0:04] <ajm> either way, help is appreciated, if you can think of anything you want me to run on the 2nd osd before I fix it, shoot me a /msg, i'll wait till I get home to fix that in about an hour or so
[0:30] * slang (~slang@chml01.drwholdings.com) has joined #ceph
[0:53] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[0:58] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[1:02] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[1:36] <ajm> sagewk: if you're still around, http://adam.gs/osd.2.log, this is the other osd, it doesn't appear to be the same issue
[1:37] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[1:37] <sagewk> ajm: looks the same to me?
[1:37] <ajm> nevermind
[1:37] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[1:37] <ajm> i did osd debug = 20, not debug osd = 20 :)
[1:38] <sagewk> :) the backtrace looks the same.
[1:38] <ajm> yeah, i was missing the lines I expected with the bad cluster :P
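For reference, the working form of the option ajm settled on ("debug osd", not "osd debug") goes in ceph.conf roughly like this; a hypothetical minimal fragment, with the section name assumed:

```ini
; hypothetical ceph.conf fragment -- the key order is what matters:
; "debug osd = 20" is recognized, "osd debug = 20" is not.
[osd]
    debug osd = 20
```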
[1:57] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[1:58] <Tv> http://ceph.newdream.net/docs/latest/
[1:58] <Tv> (very much work in progress, but getting there...)
[1:59] <ajm> Tv: that's very nice
[2:07] <Tv> yehuda: json escaping change broke unit tests
[2:08] <yehudasa> Tv: thanks
[2:19] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:26] * lxo (~aoliva@9KCAAAP54.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[2:26] * lxo (~aoliva@83TAAC2LK.tor-irc.dnsbl.oftc.net) has joined #ceph
[2:29] <Tv> bleh then i didn't remove the thttpd mention from the commit message before pushing.. oh well.. doc overhaul just hit master
[2:30] <Tv> i'll spend more time filling in the blanks tomorrow
[2:31] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[2:38] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:49] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[3:01] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:06] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[3:08] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[3:12] * lxo (~aoliva@83TAAC2LK.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[3:29] * lxo (~aoliva@9KCAAAQDC.tor-irc.dnsbl.oftc.net) has joined #ceph
[3:40] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[3:41] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit ()
[4:03] * adjohn (~adjohn@ Quit (Quit: adjohn)
[4:14] * lxo (~aoliva@9KCAAAQDC.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[4:14] * lxo (~aoliva@1RDAAAIM5.tor-irc.dnsbl.oftc.net) has joined #ceph
[4:25] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[4:27] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[5:01] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[5:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[5:50] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[6:02] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[6:52] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[7:37] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[8:48] * Meths_ (rift@ has joined #ceph
[8:54] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[9:04] * The_Bishop (~bishop@port-92-206-21-65.dynamic.qsc.de) Quit (Ping timeout: 480 seconds)
[9:12] * The_Bishop (~bishop@port-92-206-38-158.dynamic.qsc.de) has joined #ceph
[9:13] * The_Bishop (~bishop@port-92-206-38-158.dynamic.qsc.de) Quit (Remote host closed the connection)
[9:13] * The_Bishop (~bishop@port-92-206-38-158.dynamic.qsc.de) has joined #ceph
[9:27] * The_Bishop (~bishop@port-92-206-38-158.dynamic.qsc.de) Quit (Ping timeout: 480 seconds)
[9:36] * The_Bishop (~bishop@port-92-206-251-64.dynamic.qsc.de) has joined #ceph
[9:37] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:54] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[9:59] * u3q (~ben@uranus.tspigot.net) Quit (Read error: Connection timed out)
[9:59] * u3q (~ben@uranus.tspigot.net) has joined #ceph
[12:25] * IalexI (~alex@ has joined #ceph
[12:26] <IalexI> I am pretty sure you have been asked this numerous times before.. when will ceph be ready for production use? :D
[12:29] <IalexI> if it's going to work, it will be the most exciting fs ever, in my humble opinion
[13:22] * lxo (~aoliva@1RDAAAIM5.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[13:23] * lxo (~aoliva@9KCAAAQTK.tor-irc.dnsbl.oftc.net) has joined #ceph
[14:38] * pzb (~pzb@gw-ott1.byward.net) has joined #ceph
[14:41] <pzb> is there any data on how many nodes ceph can support?
[15:17] * huangjun (~root@ has joined #ceph
[16:09] * Juul (~Juul@ has joined #ceph
[16:29] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[16:34] * Juul (~Juul@ Quit (Quit: Leaving)
[16:39] <huangjun> quit
[16:39] * huangjun (~root@ Quit (Quit: leaving)
[17:03] * IalexI (~alex@ Quit (Ping timeout: 480 seconds)
[17:05] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[17:18] * greglap (~Adium@ has joined #ceph
[17:18] <greglap> pzb: more nodes than anybody's tested it with so far :)
[17:20] <pzb> greglap: any idea what has been tested?
[17:20] <greglap> 128 MDS nodes, although it was a while ago
[17:20] <greglap> OSDs....dunno
[17:20] <greglap> 96?
[17:21] <greglap> although Sage probably went higher during his thesis work
[17:21] <greglap> the OSD cluster should basically go as large as you can build, once it's set up properly
[17:21] <u3q> i have a cluster with 72 of them in it now
[17:22] <pzb> and is there any limit of "client" nodes (i.e. nodes that are not acting as OSDs)?
[17:22] <greglap> for the object store?
[17:22] <greglap> nope
[17:22] <greglap> I mean, I guess eventually you could run out of available connections...
[17:23] <greglap> the filesystem clients probably don't scale as high, but there's no hard limit there either
[17:32] * greglap (~Adium@ Quit (Quit: Leaving.)
[17:50] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:57] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[17:58] * Juul (~Juul@ has joined #ceph
[18:18] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:31] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[18:34] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:40] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[18:44] * Juul (~Juul@ Quit (Quit: Leaving)
[18:55] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:57] * Meths_ is now known as Meths
[19:09] * aliguori (~anthony@ has joined #ceph
[19:09] * IalexI (~alex@ has joined #ceph
[19:32] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[19:34] * The_Bishop (~bishop@port-92-206-251-64.dynamic.qsc.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[19:42] <Tv> sagewk: sphinx theme gallery: http://sphinx.pocoo.org/theming.html
[19:43] <Tv> sagewk: site that hosts a bazillion sphinx docs from many projects: http://readthedocs.org/
[19:46] <Tv> it seems e.g. http://readthedocs.org/docs/django-piston/en/latest/ lists their actual ToC as "Documentation"
[19:46] <Tv> oh and then sort of again as "Contents"
[19:47] <jojy> we are looking at a lock issue and mds logs look interesting:
[19:47] <jojy> mds0.server handle_client_file_setlock: start: 0, length: 0, client: 7707, pid: 23310, type: 4
[19:48] <jojy> mds0.server state prior to lock change: ceph_lock_state_t. held_locks.size()=3, waiting_locks.size()=0, client_held_lock_counts -- {7707=1,8227=2}
[19:48] <jojy> client_waiting_lock_counts -- {}
[19:48] <jojy> held_locks -- start: 1073741824, length: 1, client: 8227, pid: 5548, type: 1
[19:48] <jojy> 2011-08-30 17:17:53.957288 start: 1073741826, length: 510, client: 7707, pid: 23310, type: 1
[19:48] <jojy> 2011-08-30 17:17:53.957294 start: 1073741826, length: 510, client: 8227, pid: 5548, type: 1
[19:48] <jojy> so there is a UNLOCK request by process 23310 at offset=0, len=0
[19:49] <jojy> the question is: should this result in the 2nd held lock being released?
[19:51] <sagewk> gregaf: can you take a look?
[19:56] <cmccabe> tv: so how do I build the docs?
[19:58] <gregaf> jojy: sorry, was away from my desk
[19:59] <jojy> np
[19:59] <gregaf> umm, I don't think actually asking for a 0-length lock is valid?
[19:59] <gregaf> so the encoding currently uses that as a shortcut for "rest of file"
[19:59] <gregaf> or actually that might be encoded by the spec, I'll have to check
[20:00] <gregaf> so yes, it is intended that it result in the second lock being unlocked
[20:00] <gregaf> it's possibly incorrect, though I hope not/don't think so
[20:01] <jojy> what I see is that when I just look at the "state after" logs
[20:02] <jojy> the result has been inconsistent
[20:02] <jojy> for this particular request, the "state after" log is:
[20:02] <jojy> 2011-08-30 17:17:53.958146 7ffbd292e700 mds0.server state after lock change: ceph_lock_state_t. held_locks.size()=3, waiting_locks.size()=0, client_held_lock_counts -- {7707=1,8227=2}
[20:02] <jojy> client_waiting_lock_counts -- {}
[20:02] <jojy> held_locks -- start: 1073741824, length: 1, client: 8227, pid: 5548, type: 1
[20:02] <jojy> 2011-08-30 17:17:53.958154 start: 1073741826, length: 510, client: 7707, pid: 23310, type: 1
[20:02] <jojy> 2011-08-30 17:17:53.958160 start: 1073741826, length: 510, client: 8227, pid: 5548, type: 1
[20:03] <gregaf> wait, sorry, by "second lock" you meant the lock held by process 23310, right?
[20:03] <jojy> yes
[20:03] <gregaf> and it's not always unlocking?
[20:03] <sagewk> cmccabe: see commit f1d89644998e227319fcafb44f0f3a17df502f5a
[20:03] <jojy> it's been inconsistent
[20:04] <cmccabe> sagewk: yeah, I'm trying that, but no success yet
[20:04] <jojy> this time as you see it didn't unlock
[20:04] <cmccabe> sagewk: it needs libpk-gtk-module.so apparently
[20:04] <gregaf> jojy: hmmm....
[20:04] <sagewk> i think i got everything i needed when i installed sia (?)
[20:05] <cmccabe> sagewk: dia?
[20:05] <gregaf> jojy: what's the setlock look like for that failed case?
[20:05] <sagewk> yeah that :)
[20:05] <IalexI> are you the active developers of this project?
[20:05] <cmccabe> lalexl: yes
[20:05] <cmccabe> lalexl: do you have a question?
[20:06] <IalexI> cmccabe, yes, I am evaluating all sort of distributed file systems
[20:06] <IalexI> cmccabe, ceph does not seem to be ready (but most promising)
[20:06] <IalexI> cmccabe, from your point of view.. why should one not use moosefs?
[20:07] <cmccabe> lalexl: one thing ceph has that moosefs does not is an in-kernel client
[20:07] <IalexI> cmccabe, I am more worried about the metadataserver of moosefs
[20:08] <gregaf> IalexI: that (the metadata server) would be why :)
[20:08] <IalexI> I just read the docs how it keeps metadata.. in my understanding data corruption/loss can occur pretty fast in case of a hardware failure
[20:08] <cmccabe> lalexl: I must confess, most of what I know about moosefs I learned from wikipedia. Although I did download the source :P
[20:08] <gregaf> I'm not super-familiar with MooseFS but it represents a single point of failure that I'm not sure it's easy to recover from :)
[20:08] <jojy> gregaf: 2011-08-30 17:17:53.782721 7ffbd292e700 mds0.server handle_client_file_setlock: start: 1073741826, length: 510, client: 7707, pid: 23310, type: 1
[20:08] <jojy> 2011-08-30 17:17:53.782727 2011-08-30 17:17:53.782735 7ffbd292e700 mds0.server state prior to lock change: ceph_lock_state_t. held_locks.size()=1, waiting_locks.size()=0, client_held_lock_counts -- {7707=1}
[20:08] <jojy> client_waiting_lock_counts -- {}
[20:08] <jojy> held_locks -- start: 1073741824, length: 1, client: 7707, pid: 23310, type: 1
[20:08] <cmccabe> lalexl: there is a big conceptual difference between a single-mds system and a clustered one
[20:08] <IalexI> cmccabe, I know.. but moosefs seems to be stupid by design :P
[20:08] <cmccabe> lalexl: if you are OK with a single metadata server, HDFS is probably what you should choose
[20:09] <cmccabe> lalexl: because HDFS is stable and well-tested, and used very widely
[20:09] <gregaf> jojy: yeah, but what's the unlock request look like? I want to make sure it's sending in the right thing
[20:09] <IalexI> cmccabe, I am going to take a look at it
[20:09] <cmccabe> lalexl: another thing to consider is pNFS, which parallelizes the data path, but not the metadata path
[20:09] <jojy> gregaf: u mean the unlock request on the client?
[20:09] <gregaf> this bit: "mds0.server handle_client_file_setlock: start: 0, length: 0, client: 7707, pid: 23310, type: 4"
[20:10] <IalexI> cmccabe, what do you exactly mean by parallelizing the data path?
[20:10] <cmccabe> lalexl: generally speaking, HDFS will scale better than pNFS, but pNFS gives you the familiar NFS semantics
[20:10] <cmccabe> lalexl: formerly, your NFS clients would talk a single host and get both data and metadata from there
[20:11] <cmccabe> lalexl: this represented a network bottleneck
[20:11] <jojy> gregaf: i am not sure i understand what u wanted
[20:11] <cmccabe> lalexl: NFS systems with a clustered backend do exist, but they had to jump through hoops to "look like" a single host
[20:12] <gregaf> jojy: you say that you're getting inconsistent results from attempts to unlock
[20:12] <cmccabe> lalexl: with pNFS, you still have a single host you talk to for metadata, but then he tells you where to go for the data.
[20:12] <cmccabe> lalexl: in general this makes it much more scalable
[20:12] <gregaf> I want to see what information the server is getting to make sure the unlock attempts are themselves consistent
[20:13] <jojy> ok the log i pasted was the case where it didn't unlock
[20:13] <IalexI> cmccabe, I understand.. it would also be great if the fs is aware of "hot spots" (data which is requested frequently)
[20:13] <gregaf> oh, there we go, I didn't see the last one
[20:14] <IalexI> cmccabe, I have also read some documents about the google fs (which is basically just a user daemon).. and its able to increase the goal of these hot spots
[20:14] <cmccabe> lalexl: there is more than one google fs
[20:14] <gregaf> jojy: so your unlock requests are very different
[20:14] <cmccabe> lalexl: the original one was a single-MDS system similar to HDFS
[20:14] <gregaf> the first one was starting at offset 0 and asking to clear all locks
[20:14] <IalexI> cmccabe, is it even know what google currently uses?
[20:14] <cmccabe> lalexl: but unlike HDFS, GFS supported "random writers" meaning multiple hosts could write to the same file simultaneously
[20:14] <gregaf> jojy: the second one is starting at offset (very large) and asking to clear everything from that point on
[20:15] <gregaf> but the lock actually started at (very large minus 2)
[20:15] <cmccabe> lalexl: google later came out with something called GFS2, aka Colossus
[20:15] <gregaf> so a small range remained locked
[20:15] <IalexI> cmccabe, is ceph going to support hot spots?
[20:15] <cmccabe> lalexl: sorry, GFS3 was colossus...
[20:15] <cmccabe> lalexl: Ceph is designed to avoid hot spots
[20:16] <cmccabe> lalexl: metadata sharding is dynamic
[20:17] <jojy> here is a case where the unlock did happen:
[20:17] <jojy> mds0.server handle_client_file_setlock: start: 0, length: 0, client: 7707, pid: 23310, type: 4
[20:17] <jojy> 2011-08-30 17:17:53.620474 2011-08-30 17:17:53.620483 7ffbd292e700 mds0.server state prior to lock change: ceph_lock_state_t. held_locks.size()=1, waiting_locks.size()=0, client_held_lock_counts -- {7707=1}
[20:17] <jojy> client_waiting_lock_counts -- {}
[20:17] <jojy> held_locks -- start: 1073741826, length: 510, client: 7707, pid: 23310, type: 1
[20:17] <gregaf> jojy: it looks like that value is a bit over 2^30 so there might be something bad going on with 32-bit signed offsets
[20:17] <IalexI> cmccabe, data which is requested many times by clients is called a hot spot.. basically, it depends on your usage if you have them or not. google's fs is able to distribute hot spots to a higher number of chunkservers, so the fs performs better
[20:18] <jojy> gregaf: looks like that
[20:18] <gregaf> jojy: do you know how these offsets are getting generated on your end?
[20:18] <cmccabe> lalexl: yes, I understand that. But Ceph can duplicate data across multiple MDSes so hot spots should not be a problem
[20:18] <gregaf> cmccabe: he's worried about data, not metadata
[20:19] <jojy> the lock offsets look suspicious and are set by sqlite
[20:19] <cmccabe> lalexl: I can't remember if we allow you to read from non-primary OSDs or not... josh do you remember?
[20:19] <jojy> gregaf: maybe some uninitialized struct issue
[20:20] <gregaf> jojy: it might be we're doing something bad in translation somewhere; if you can find out what locks are getting passed into the kernel and they're good going in then I'll check it out
[20:21] <gregaf> but until then I'm going to suspect "user" error ;)
[20:21] <IalexI> cmccabe, what is your schedule for ceph? I read that the project is alive for 7 years
[20:21] <gregaf> lalexl: so right now Ceph doesn't do anything to deal with data hotspots
[20:22] <cmccabe> gregaf: ok
[20:22] <cmccabe> lalexl: another thing to note is that Ceph has POSIX semantics, whereas a lot of distributed FSes do not.
[20:22] <gregaf> but since files are sharded into 4MB pieces and both the OSD and the clients will cache them, to actually get a hotspot that's choking you on reads is difficult
[20:23] <cmccabe> lalexl: there is a roadmap here: http://tracker.newdream.net/projects/ceph/roadmap
[20:23] <gregaf> there are some mechanisms to allow for clients to read data from replicas as well, though they're not in use right now; there are no current plans to enable dynamic extra replication of heavily-read objects
[20:23] <IalexI> gregaf, if the reads can be caught by the cache, you won't really have a performance hit
[20:23] <cmccabe> lalexl: I'm not sure if there is a more high-level roadmap out there or not
[20:24] <IalexI> cmccabe, how many active developers do you have?
[20:24] <gregaf> IalexI: yeah, the assumption here is that since a hot file will be spread out over the OSDs in 4MB chunks you're just going to devote a lot more cache to it, which is what you want anyway
[20:25] <cmccabe> lalexl: we have about half a dozen developers
[20:25] <IalexI> cmccabe, it could become the most important fs for linux, you know..
[20:26] <IalexI> all the designs I have seen are rubbish
[20:26] <cmccabe> lalexl: yeah, there's definitely a lot of stuff out there that doesn't scale that well
[20:26] <cmccabe> lalexl: the thing is, a single metadata server works really well, until it doesn't.
[20:27] <cmccabe> lalexl: google went for years on that design, using hardware much worse than what exists today
[20:27] <IalexI> cmccabe, scalability wouldn't be the main concern.. there are many small linux servers out there. a true distributed file system will be the main reason why people choose your fs
[20:27] <cmccabe> lalexl: so sometimes it's hard to get people to see the big picture and step away from single-point-of-failure designs
[20:28] <Tv> sagewk: updated docs, ToC should make much more sense now
[20:30] <IalexI> cmccabe, did you also think about chunkservers which are powered on just a couple of hours per day?
[20:30] <cmccabe> lalexl: I honestly don't think ceph would be a good match for that kind of usage scenario
[20:30] <cmccabe> lalexl: by nature it distributes data across many different OSDs, which are assumed to be powered on
[20:31] <cmccabe> lalexl: in general, what you're talking about is more of a 3rd tier system (backup) and it will have its own set of requirements
[20:32] <IalexI> cmccabe, not really.. I am thinking of networks with 1 or 2 servers (which are powered on all the time) and which use the disk space of their clients to create more redundancy
[20:32] <cmccabe> lalexl: at a previous company I worked on a system that did dynamically power on and off drives, and was more backup-focused
[20:32] <IalexI> cmccabe, ceph should already work with it.. the rebalancing code would just need to be made a bit smarter
[20:33] <cmccabe> lalexl: the thing is, though, when solid-state disks take over the world, turning off drives will be unnecessary
[20:33] <IalexI> this will probably still take a while
[20:33] <cmccabe> lalexl: it may seem absurd to talk about that now, but I am old enough to remember cathode ray tubes and "floppy" disks
[20:33] <cmccabe> lalexl: including the ones that really flopped
[20:34] <cmccabe> lalexl: have you heard of Tivoli?
[20:34] <jojy> gregaf: that's fair :)
[20:34] <IalexI> cmccabe, well, I know people who remember dot matrixes (I am not sure about the correct English word) ;)
[20:34] <IalexI> cmccabe, not really?
[20:34] <cmccabe> lalexl: tivoli is IBM's hierarchical storage manager product
[20:34] <IalexI> Tivoli backup?
[20:35] <IalexI> a right.. I think my university has a tape robot which runs this software
[20:35] <jojy> gregaf: although the inconsistency in the unlock doesn't look right
[20:35] <cmccabe> lalexl: it gives you a filesystem where some of the files are really on tapes, which are loaded on demand when you read the relevant files
[20:35] <IalexI> cmccabe, yes, and a number of harddrives to speed up access
[20:36] <IalexI> cmccabe, and it does not work
[20:36] <cmccabe> lalexl: there is a lot more to it, I'm sure... it's a huge software suite now and probably has multiple layers of caching
[20:36] <cmccabe> lalexl: it probably works if you have a support contract with big blue :)
[20:37] <IalexI> cmccabe, I am pretty sure that our university has.. it's one of the largest in Germany
[20:37] <IalexI> cmccabe, however, the tivoli is slow as hell..
[20:37] <cmccabe> lalexl: well, that's expected. It's fetching things from a tape after all!
[20:37] <IalexI> in my opinion there is no reason to use tape drivers anymore
[20:37] <cmccabe> lalexl: well, just to play devil's advocate, they are cheap, and tape backups can last a long time
[20:38] <Tv> IalexI: just to clarify, ceph as it stands now just does not do hierarchical storage; it's not really in the design
[20:38] <IalexI> harddisks have become so cheap.. you just need to organize them in a smart way
[20:38] <Tv> IalexI: ceph could be improved to support it, but there's so many other things to do too..
[20:38] <cmccabe> lalexl: hard drives will go bad over time in storage, believe it or not. Some people claim that tapes have longer life since they have no active components
[20:38] <IalexI> Tv, I agree.. you should get the basic functionality running first
[20:38] <u3q> what
[20:39] <u3q> wow i guess the heads could go bad
[20:39] <IalexI> cmccabe, I think the only way to store data is constantly copy and verify data
[20:39] <cmccabe> u3q: well, just to give one example, capacitors will eventually leak
[20:39] <u3q> ah true
[20:39] <Tv> in the olden days, the lubricant would get sticky as time went on
[20:39] <Tv> don't know if that's still a problem with current hardware
[20:40] <cmccabe> yeah, the thing is, nobody is testing or rating these things for long-term survival
[20:40] <Tv> then again, disc has the huge benefit of coming pre-packaged with its own reader; plenty of places have been unable to read their own tapes after they lost the last compatible drive
[20:40] <cmccabe> so you have to keep moving your data from drive to drive every few years to keep it alive, like a nomad
[20:40] <Tv> long-term survival in the digital age is a question of shuffling the bits often enough, that part is solvable
[20:40] <Tv> the real challenge is understanding the formats..
[20:41] <IalexI> cmccabe, the problem is that research is immediately turned into products
[20:41] <IalexI> cmccabe, I don't know another business except it where the rate is so fast..
[20:42] <IalexI> does ceph use some kind of bch code for redundancy or do you simply store copies of the data?
[20:42] <Tv> IalexI: simple copies; faster
[20:42] <cmccabe> lalexl: simple copies
[20:43] <cmccabe> lalexl: I'm curious if any object-based storage system is using BCH codes/parity codes these days
[20:44] <cmccabe> lalexl: there was this wacky experiment called HDFS-RAID, but I haven't heard much about it lately
[20:44] <IalexI> cmccabe, some day it will.. but from a software point of view, it is really complex. I agree
[20:44] <Tv> cmccabe: depends on the "object" part
[20:44] <Tv> Tahoe-LAFS does erasure coding etc in a distributed fs, it's pretty interesting... just aimed more at full crypto than speed.
[20:46] <IalexI> cmccabe, what is your strategy for split brain situations?
[20:46] <cmccabe> lalexl: the monitors keep track of the cluster topology using paxos-based consensus
[20:47] <cmccabe> lalexl: so in general adding a new MDS or marking one down needs to go through them (as well as OSDs)
[20:48] <cmccabe> lalexl: someone might correct me on this, but I think in general Ceph gives consistency and availability, but not really partition tolerance
[20:48] <Tv> yeah
[20:48] <Tv> can't do P in a POSIX filesystem
[20:48] <cmccabe> tv: yeah
[20:49] <IalexI> well, anyway.. a really interesting project from an academic point of view
[20:51] <IalexI> many people are waiting for this. you should get it running as soon as possible ;)
[20:51] <cmccabe> lalexl: you should check out the papers, if you haven't already
[20:52] <cmccabe> http://ceph.newdream.net/publications/
[20:52] <Tv> IalexI: it's running & ready for your testing, just don't put it into production yet
[20:52] <cmccabe> it's a little confusing, I must admit, because some aspects changed since the early papers
[20:52] <Tv> cmccabe: yeah that's part of why i want to get the "Architecture" part of new docs fleshed out
[20:53] <Tv> cmccabe: is this right: rados_t is a "config instance", rados_ioctx_t is more like an active connection?
[20:54] <IalexI> Tv, I wish I had the time for it.. I am ceo of a company. but ceph (or something like it) is really what my company needs. actually, I was about to give my developers the order to create something like it (not posix compatible.. just a few user space daemons which emulate file system calls)
[20:54] <Tv> it's multiple connections, but you get what i mean
[20:55] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[20:55] <cmccabe> tv: rados_t is an active connection, ioctx is more like a pool
[20:56] <cmccabe> tv: there is a config instance in the rados_t, but there are other things as well
[20:56] <Tv> hmmh
[20:57] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[20:57] <cmccabe> tv: in general rados_t represents the cluster as a whole
[20:57] <Tv> ioctx seems to be about individual pools
[20:57] <cmccabe> tv: I almost wish it had been named something to suggest that, but the name predates me
[20:57] <Tv> i guess that makes sense
[20:57] <Tv> just trying to flesh out "one of every kind" for docs
[20:57] <cmccabe> tv: ioctx is mostly about pools, but there are some other aspects like what snapshot you're using
[20:57] <Tv> i already have rbd(8) manpage, now doing the simplest possible librados intro, to show how the markup works
[20:57] <IalexI> I need to leave.. have a great day, guys
[20:58] * IalexI (~alex@ Quit (Quit: Verlassend)
[20:58] <cmccabe> tv: I kind of dislike the overuse of the word "context" in our source, but I don't think it can be changed at this point... you just have to learn what an ioctx is to make use of stuff
[21:04] * adjohn is now known as Guest7876
[21:04] * Guest7876 (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Read error: Connection reset by peer)
[21:04] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[21:09] <jojy> gregaf: so looks like sqlite uses the "pendingbyte" offset (0x40000000) for locking
[21:17] <gregaf> jojy: well the locked bit is starting there and the unlock is for two past it, which suggests to me that Ceph is handling them correctly
[21:18] <jojy> gregaf: since the unlock request is for start=0 and len=0
[21:18] <jojy> shouldn't it have unlocked the entire range?
[21:19] <gregaf> the unlock request in this instance is for offset 0x40000002, though
[21:19] <gregaf> at least the one that's showing up on the MDS
[21:19] <gregaf> I can't conceive of any way in which an unlock request for 0 would get turned into 2^30+2
[21:19] <gregaf> in our codebase
[21:20] <jojy> so here is the unlock request:
[21:20] <jojy> 2011-08-30 17:17:53.957268 7ffbd292e700 mds0.server handle_client_file_setlock: start: 0, length: 0, client: 7707, pid: 23310, type: 4
[21:20] <gregaf> yes, in some cases
[21:20] <gregaf> but the one you pasted where it didn't unlock was for offset 0x40000000
[21:21] <jojy> and here r some logs after that request:
[21:21] <jojy> 2011-08-30 17:17:53.957273 2011-08-30 17:17:53.957282 7ffbd292e700 mds0.server state prior to lock change: ceph_lock_state_t. held_locks.size()=3, waiting_locks.size()=0, client_held_lock_counts -- {7707=1,8227=2}
[21:21] <jojy> client_waiting_lock_counts -- {}
[21:21] <jojy> held_locks -- start: 1073741824, length: 1, client: 8227, pid: 5548, type: 1
[21:21] <jojy> 2011-08-30 17:17:53.957288 start: 1073741826, length: 510, client: 7707, pid: 23310, type: 1
[21:21] <jojy> 2011-08-30 17:17:53.957294 start: 1073741826, length: 510, client: 8227, pid: 5548, type: 1
[21:21] <jojy> 2011-08-30 17:17:53.957299 waiting_locks --
[21:21] <gregaf> 2011-08-30 17:17:53.782721 7ffbd292e700 mds0.server handle_client_file_setlock: start: 1073741826, length: 510, client: 7707, pid: 23310, type: 1
[21:21] <gregaf> oh wait, that's a new set lock, not an unlock
[21:21] <gregaf> ugh
[21:22] <gregaf> jojy: can you zip up the log and give it to me so I can see each request?
[21:22] <gregaf> pasting bits and pieces is too likely to result in confusion or missing data
[21:22] <jojy> will do
[21:22] <gregaf> thanks
[21:22] <Tv> gregaf: http://www.sqlite.org/src/doc/trunk/src/os_unix.c
[21:30] <jojy> gregaf: how do u want me to send it to u? email?
[21:31] <gregaf> sure, gregory.farnum@dreamhost.com
[21:39] <Tv> flushed out http://localhost:8080/api/librados/ a bit
[21:39] <Tv> see doc/api/librados.rst for the markup
[21:39] <ajm> interesting, i have * up, but the kclient cephfs refuses to mount, the mount system call just hangs
[21:41] <Tv> s/flu/fle/ i guess
[21:50] <cmccabe> ajm: there should be messages in dmesg
[21:51] <ajm> nothing abnormal
[21:51] <ajm> clientXXXX fsid xxxx-xxx
[21:51] <ajm> mon1 x.x.x.x:x session established
[21:52] <cmccabe> ajm: you might need to turn up debugging for the kernel client
[21:52] <cmccabe> ajm: I am surprised that you don't see messages from the messenger about how it can't connect, or similar
[21:52] <ajm> cmccabe: have a doc on how?
[21:53] <cmccabe> http://ceph.newdream.net/wiki/Debugging
[21:54] <cmccabe> ajm: for a start, you could try echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
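cmccabe's one-liner above is the whole mechanism; a small hedged sketch of how one might wrap it (the guard and the libceph line are assumptions: they depend on debugfs being mounted and on whether this kernel build splits the messenger into a separate libceph module):

```shell
# Sketch only: enable dynamic debug output for the ceph kernel client.
# Assumes CONFIG_DYNAMIC_DEBUG is set; the libceph line only applies on
# kernels where the messenger/mon client live in a separate module.
CTRL=/sys/kernel/debug/dynamic_debug/control
if [ -w "$CTRL" ]; then
    echo 'module ceph +p' > "$CTRL"      # fs/ceph messages
    echo 'module libceph +p' > "$CTRL"   # messenger/monc messages, if split out
else
    echo "dynamic debug unavailable; try mounting debugfs first:" >&2
    echo "  mount -t debugfs none /sys/kernel/debug" >&2
fi
```

The enabled messages then show up in dmesg alongside the "session established" lines ajm already sees.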
[21:54] <ajm> hrm, need to rebuild my kernel
[21:54] <ajm> i'm going to try the git kernel sources as well I think
[21:55] <cmccabe> seems reasonable
[21:55] <cmccabe> like I said, I am surprised you're not seeing messenger errors if it's hung on mount
[21:55] <cmccabe> greg or sage might have more ideas about why that is
[21:56] <cmccabe> you could also try the fuse client; it's a lot easier to turn on debugging for that
[21:56] <ajm> its... odd
[21:56] <gregaf> cmccabe: I don't right now, but if you're trying to ping us you need to use our irc names (sagewk and gregaf)
[21:56] <cmccabe> gregaf: ok
[21:57] <ajm> http://adam.gs/mount.ceph.hang.txt
[22:04] <ajm> running different versions of ceph in different places in the same cluster is considered a complete no-go I assume?
[22:04] <ajm> like a cmds with older cosd
[22:05] <cmccabe> ajm: in general I wouldn't do it
[22:05] <cmccabe> ajm: we have been trying to version various messages that get passed back and forth, but very little testing goes on of such configurations at the moment
[22:05] <ajm> yeah, wasn't thinking to really try it, more curious
[22:12] <sagewk> ajm: normally it's pretty safe. there is some stuff in current master that changes a bunch of protocols, so you have to upgrade it all at once (daemons will refuse to talk to each other otherwise).
[22:13] <sagewk> is it the current master branch that's failing to mount for you?
[22:15] <ajm> 0.33
[22:15] <ajm> i'm upgrading to 0.34 at the moment, kernel client is 2.6.39
[22:16] <ajm> hrm, wiki says to checkout the "unstable" branch for the latest code but I don't see one
[22:16] <sagewk> ajm: which page? that's old info, it's 'master' now
[22:16] <ajm> http://ceph.newdream.net/wiki/Getting_kernel_client_source
[22:16] <sagewk> ajm: but you should just stick with 'stable' i think.
[22:18] <ajm> hrm, lemme only do 15 things at a time and upgrade to 0.34 for now
[22:18] <wido> hi
[22:18] <cmccabe> wido: hi
[22:18] <wido> is there still an issue with building the new packages for Ubuntu Lucid for example? I don't see them on /debian
[22:18] <wido> I have a cluster running where I do apt-get upgrade once every few weeks
[22:19] <wido> I see the packages are being built for oneiric (11.10), but not for Lucid (10.04)
[22:19] <cmccabe> wido: I'm not sure how often they're updating
[22:20] <wido> Ah, I thought it was an autobuild process
[22:21] <wido> btw, check: http://packages.ubuntu.com/search?keywords=ceph&searchon=names&suite=oneiric&section=all
[22:22] <wido> I know it might be still low prio, but Ubuntu 12.04 will be coming out in April, which is a new LTS
[22:22] <wido> I guess somebody would need to put some time in getting new packages to Ubuntu
[22:22] <sagewk> wido: i dropped lucid at release build time a few weeks back, i think because CLOEXEC was missing.. but that's fixed now, so i can add it back in for the next release
[22:22] <wido> I'd volunteer to get that running, so the new LTS starts shipping with a newer version of Ceph, which is much more usable for people
[22:23] <wido> sagewk: Ah, tnx :-) Don't feel like upgrading the cluster to a newer Ubuntu shortly
[22:23] <sagewk> clint byrum is the one to ping about the ubuntu packages
[22:23] <wido> sagewk: I'll do that. By then you might have reached 0.40 I guess, but there should be a decent version in the LTS imho
[22:24] <sagewk> yeah
[22:24] <wido> I'll contact clint for that, better start early with it
[22:26] <ajm> cmccabe/sagewk: after upgrading to 0.34 (and obviously restarting * osd/mds/mon) mount works fine again :/
[22:27] <sagewk> ajm: great
[22:28] <ajm> for the kernel cephfs client, the one that you have there (that looks like 2.6.39) thats the best to use ?
[22:32] <cmccabe> ajm: to my knowledge, ceph-client is the best repo to use
[22:33] <ajm> k, i'll give that a shot
[22:35] <wido> sagewk: Which version were those new kclient patches going into again? Those readahead patches. 3.2?
[22:40] <sagewk> yeah, missed the 3.1 window
[22:46] <wido> ok, i'll give 3.2 a try then later on, as soon as the patches hit 3.2
[22:48] <sagewk> wido: in the meantime they're sitting in the ceph-client.git master branch (rebase on top of 3.0)
[22:56] <wido> ok, tnx
[23:05] * aliguori (~anthony@ Quit (Remote host closed the connection)
[23:07] * lxo (~aoliva@9KCAAAQTK.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[23:09] * Meths_ (rift@ has joined #ceph
[23:15] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[23:24] <Tv> can someone give me admin access to wiki? there's a bunch of spam there again...
[23:30] <gregaf> jojy: hmm, it looks to me like the MDS is actually busted regarding 0-length locks
[23:30] <gregaf> it works by mistake in some cases but I'm going to have to do an audit of the code
[23:42] <jojy> gregaf:thanks for looking
[23:47] <jojy> gregaf: what i dont understand is why i cant see any logs in "lock_state->remove_lock" call (in method Server::handle_client_file_setlock)
[23:48] <gregaf> jojy: it's under the "debug" rather than "debug mds" conf option
[23:49] <gregaf> I don't remember why, I think just because once it was working why would you want to see the logs
[23:49] <jojy> time to turn that on i guess
[23:50] <gregaf> I think I have what I need to fix it from looking at the code; should do it today or tomorrow
[23:50] <gregaf> just want to fix up something else first
[23:50] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[23:50] <jojy> kool
[23:51] <gregaf> oh, Sage says it's blocking you -- I'll get it done today :)
[23:51] <jojy> in ceph.conf, shud i just say debug=20 to see those logs?
[23:51] <gregaf> yeah, in the mds section
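A minimal ceph.conf fragment matching gregaf's suggestion (the value 20 is just maximum verbosity; both lines are shown because, per the discussion above, the lock-state output is under the generic "debug" option rather than "debug mds"):

```ini
[mds]
    ; generic debug level -- the file-lock log output is under this one
    debug = 20
    ; mds-specific logging, useful alongside it
    debug mds = 20
```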
[23:52] <gregaf> the locking code is doing searches for overlapping locks
[23:52] <gregaf> and it's just not set up to handle locks of 0 length, for some reason
[23:52] <gregaf> so in the case where there's only one lock, it works by mistake
[23:52] <gregaf> but otherwise it doesn't
[23:54] * verwilst (~verwilst@dD576F5B5.access.telenet.be) has joined #ceph
[23:55] <jojy> gregaf: i think there might be a multiclient factor there too: the fact that it doesnt release locks (with 0 len) especially when multiple clients have locks held on that file
[23:55] <gregaf> well like I said, it works by mistake when there's only one lock held on the file
[23:56] <gregaf> which is a lot more likely when there's only one client ;)
[23:56] <bchrisman> so then if it's multiply locked, with one of those locks being a range of 0 bytes.. then it would trigger.. regardless of which client?
[23:57] <gregaf> bchrisman: not sure I understand the scenario you're proposing
[23:57] <gregaf> if you have looked at the locking code, though,
[23:58] <gregaf> then you'll see there's a function get_overlapping_locks
[23:58] <gregaf> which is called both when adding and removing them
[23:58] <gregaf> that and its helper get_last_before don't deal properly with locks of 0 length
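The failure mode gregaf describes can be sketched outside the MDS: under POSIX fcntl semantics a lock length of 0 means "from start to end of file", so a naive interval-overlap test that computes end = start + length treats such a lock as empty and never matches it. A hedged sketch of the overlap predicate the code needs (function and parameter names are illustrative, not the actual ceph_lock_state_t / get_overlapping_locks API):

```python
# Illustrative sketch only -- not the MDS code. Models POSIX record-lock
# extents, where length == 0 means "from start to end of file".

def end_of(start, length):
    """Exclusive end offset; None represents 'to end of file' (length == 0)."""
    return None if length == 0 else start + length

def overlaps(a_start, a_len, b_start, b_len):
    """True if the two lock extents share at least one byte."""
    a_end = end_of(a_start, a_len)
    b_end = end_of(b_start, b_len)
    if a_end is None and b_end is None:
        # Both run to EOF, so they overlap from the later start onward.
        return True
    if a_end is None:
        # a covers [a_start, EOF); b must extend past a_start.
        return b_end > a_start
    if b_end is None:
        return a_end > b_start
    # Ordinary finite-interval intersection test.
    return a_start < b_end and b_start < a_end
```

Written this way, a search over held_locks finds the 0-length lock whether it is the only lock on the file or one of several, which is exactly the multi-lock case that was working only by mistake.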

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.