#ceph IRC Log


IRC Log for 2011-04-13

Timestamps are in GMT/BST.

[0:00] <cmccabe> or an openstack + nova bug... I guess you can plug in other things for Nova (?)
[0:11] <cmccabe> sage: are you doing the rados_pool_lookup change or should I?
[0:12] <sage> cmccabe: you should do it
[0:12] <sage> it's at the rgw layer, btw, not osd layer (in #985 comment)
[0:12] <cmccabe> sage: ok.
[0:13] <sage> rados_pool_lookup(..., poolname) returns the int pool id.
[0:13] <sage> (the num. prefix for pg names)
[0:13] <cmccabe> I forget where we left the collection stuff
[0:14] <cmccabe> I thought basically each pg gets its own collection now
[0:14] <sage> you mean prehashing? "future work"
[0:14] <cmccabe> I forgot where pool names fit into all that
[0:14] <sage> collection == directory. yeah.
[0:14] <sage> they don't.. the pool names exist only in the osdmap
[0:14] <cmccabe> ic
[0:15] <cmccabe> and PGs go in exactly one pool each
[0:15] <sage> yeah
[0:16] <cmccabe> so basically we want to stop rgw from creating those weird PG long names
[0:17] <sage> long object names, right. it just needs to put the bucket acl somewhere, and is creating an object named after the bucket to store it on.
[0:17] <cmccabe> so object names are a problem then
[0:17] <cmccabe> in general
[0:18] <sage> yeah
[0:18] <cmccabe> we may need to fix the lower layer to properly return an error when something like this happens
[0:18] <sage> yeah.. that's bug 963
[0:18] <cmccabe> actually, why haven't we seen this w.r.t rados objects?
[0:18] <cmccabe> rados object names can be 2048 bytes
[0:19] <sage> we apparently haven't tested it :)
[0:19] <cmccabe> I only tested that we rejected object names that were too long :)
[0:19] <sage> (and i'm not sure where 2k came from...)
[0:19] <cmccabe> 2k came from Amazon
[0:20] <sage> oh.... aie.
[0:20] <sage> well, file a bug for that too then :)
[0:20] <cmccabe> er, oops, 1024 bytes
[0:20] <cmccabe> http://tracker.newdream.net/issues/920
[0:21] <cmccabe> sage: From the "Amazon Simple Storage Service Developer Guide", API Version 2006-03-01: ("Object Key and Metadata")
[0:21] <cmccabe> sage: "The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long."
[0:23] <cmccabe> sounds like a good use for... prehashing!
[0:23] <sage> yeah maybe we can solve both problems at once. will be annoying though
[0:24] <cmccabe> well, at minimum, we should return an error code when someone tries to manipulate a too-long object name
[0:24] <sage> yep
[0:25] <cmccabe> I think the best option would be to correctly handle ENAMETOOLONG from open(2)
[0:26] <cmccabe> I don't understand why FileStore::_do_transaction always returns 0
[0:27] <sage> because by the time we're actually doing the transaction it's usually too late. but yeah, at the very least we can catch long names before we get that far
[0:27] <cmccabe> what are the issues related to fixing that
[0:27] <cmccabe> if I replace that with return r what will not work
[0:28] <sage> it won't solve anything.. the transactions are applied to the fs in a worker thread long after we've happily gone on our way. we need to catch the error beforehand.
[0:28] <cmccabe> it does seem like we would need some kind of undo_transaction function to allow do_transactions to handle the failure of an individual transaction
[0:28] <cmccabe> blech
[0:29] <sage> yep
[0:29] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[0:30] <cmccabe> so the assumption is that once we've journalled it successfully, we will write it
[0:30] <sage> right.
[0:30] <cmccabe> I mean basically that's the whole point of the OSD journal
[0:30] <sage> yep. which is why enospc, eio, et al. are such headaches for these sorts of systems.
[0:31] <sage> btrfs, transactions, and enospc is 80% of what i talked about with the whamcloud guy at lsf
[0:31] <cmccabe> but the object store all runs on the OSD right
[0:32] <sage> yeah
[0:33] <cmccabe> I kind of wish we could just use blocking syscalls + write with O_NONBLOCK
[0:35] <cmccabe> in general, I think things would be a lot better with a one-thread-per-PG architecture
[0:36] <sage> the point of the workqueues is you can control the concurrency based on how much cpu you have to burn and how much concurrent io you want outstanding with the fs
[0:37] <cmccabe> it used to be that 1000 threads was scary, but with 100-odd core systems on the horizon, it's heavily mutexed systems with 1-or-two way concurrency that are scary
[0:38] <greglap> sage: mon_mds looks fine to me :)
[0:38] <cmccabe> at least in theory, the kernel I/O scheduler should be able to handle multiple operations in a reasonable way
[0:38] <sage> greglap: cool thanks
[0:39] <cmccabe> I guess in practice, the options for doing stuff "like ionice" are limited right now
[0:39] <cmccabe> I was sad to learn that ionice itself only works with CFQ
[0:39] <sage> and you can get that behavior with osd op threads = 1000. in practice, it tends to just jack up the load on the machine
[0:40] <cmccabe> that might be an interesting idea for a FUSE filesystem... a filesystem that artificially limits you to some predetermined I/O bandwidth
[0:41] <cmccabe> sage: osd op threads = 1000 still leaves us with a system where you spend most of your time waiting for mutexes...
[0:41] <cmccabe> sage: dout mutex being the worst, but there are many others almost as bad
[0:41] <sage> dout mutex is not relevant, logging will be off
[0:42] <sage> anyway, i'll have to think about that. in the meantime, i have work to do :)
[0:42] <cmccabe> sage: anyways, I guess there must be a pre-operation hook that can do this kind of check
[0:42] <gregaf> cmccabe: what other osd mutexes are you worried about?
[0:42] <gregaf> the mds is stuck on one mutex but that can scale in other ways and it should be the only one
[0:42] <cmccabe> gregaf: OSD::osd_lock, the PG locks
[0:43] <gregaf> I don't think osd_lock has that much coverage area
[0:43] <sage> pg->lock is functionally equivalent to 1 thread per pg
[0:43] <gregaf> similarly with the PGs — remember each OSD will have 50-500 PGs that should all see roughly equal traffic
[0:43] <cmccabe> we at least have fine-grained locking in most places, which is good (for performance)
[0:45] <cmccabe> for example, one earlier design I did was state-machine based, where there was a thread per hard drive
[0:45] <cmccabe> there was no mutex for the drive, just a queue where you put commands
[0:46] <cmccabe> that system didn't have anything like a global bandwidth limit though.
[0:49] <gregaf> you do realize that our stuff like the pg lock is analogous to the queue lock in that system you just described?
[0:49] <gregaf> it's not like you sit and hold the pg lock while the write is in progress to disk...
[0:49] <cmccabe> gregaf: there was no queue lock. Putting things into the queue and taking them out were both atomic ops.
[0:49] <cmccabe> gregaf: whereas the PG lock is often held for long periods of time...
[0:50] <sjust> cmccabe: right, but progress can be made on other PG
[0:50] <sjust> *pg's
[0:50] <cmccabe> sjust: that is true. In either system, the PG can only do 1 thing at once.
[0:50] <gregaf> isn't it only held for "long periods" when doing recovery stuff?
[0:51] <gregaf> …and now i'm trying to come up with a method of atomically implementing a queue
[0:51] <sjust> gregaf: depends on the definition of 'long periods', but I would want to see profiling before I took a guess as to whether it was a problem
[0:51] <cmccabe> gregaf: not sure, I'd have to check the code
[0:52] <gregaf> I don't think you can, there must have been internal locking :)
[0:52] <cmccabe> gregaf: you can implement a queue atomically with CMPXCHG
[0:52] <sage> holding the pg lock during a slow io and having a single thread doing the io are equivalent.
[0:52] <cmccabe> sage: true.
[0:53] <Tv> assuming thread pool size > num of pgs
[0:53] <gregaf> and disregarding all the non-io stuff you need to hold pg locks for
[0:53] <cmccabe> gregaf: not sure why you would disregard that
[0:54] <gregaf> because there's a lot of non-io stuff you need to hold pg locks for
[0:54] <gregaf> that doesn't last very long
[0:54] <gregaf> and so if the PG locks were held during io then that would suck, but they aren't so it's okay
[0:54] <gregaf> handle_osd_map is the worst offender there I think (as it is with everything else, heh)
[0:55] <sage> well, they are for reads.. there is effectively a single read per pg at a time
[0:55] <gregaf> sage: we hold pg_lock during reads from disk?
[0:55] <cmccabe> anyway, even if we forget about changing the threading model, one thing that I think would improve things is having a well-defined and encapsulated OSD state machine
[0:56] <gregaf> cmccabe: CMPXCHG doesn't buy you that much in a queue, even in a simple one you have to dereference too often
[0:56] <cmccabe> handle_osd_map should just be:
[0:56] <cmccabe> {
[0:56] <cmccabe> return pg_state_machine->handle_osd_map(foo)
[0:56] <cmccabe> }
[0:56] <sage> gregaf: currently, yes
[0:56] <cmccabe> ... or similar
[0:56] <cmccabe> gregaf: think harder.
[0:56] <sage> cmccabe: we definitely are (slowly) moving in that direction.
[0:57] <gregaf> hmm, I wasn't aware of that; it's a little nastier than I realized
[0:57] <cmccabe> gregaf: sorry, not trying to be annoying, but there is a solution that often uses only a single CMPXCHG for each operation
[0:57] <gregaf> often, sure, but not if there's actually contention
[0:57] <cmccabe> I say often because the nature of CMPXCHG is that it needs to be in a loop
[0:58] <gregaf> since you need to get a stable end-of-list dereference, then a stable next dereference, handle empty, etc
[0:58] <sage> let's just say there is a lot of improvement to be had. we won't know where to start without some profiling. in the meantime, let's fix the bugs and get things stable.
[0:58] <gregaf> there's very little difference between a CMPXCHG and a spinlock at that point
[0:58] <cmccabe> gregaf: spinlocks are often implemented with CMPXCHG. In fact, I wouldn't be surprised if Linux's was on i386
[0:59] <gregaf> cmccabe: …and now you arrive at my point
[0:59] <cmccabe> not at all, since you must see that a lock/unlock pair will be 2x as many operations as a single CMPXCHG
[0:59] <cmccabe> on average
[1:00] <gregaf> and 2 CMPXCHG versus 1 CMPXCHG matters in what universe where you have IO?
[1:00] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[1:01] <cmccabe> CMPXCHG can be very expensive on NUMA due to the CPU cache implications
[1:01] <cmccabe> anyway, the solution is to stuff the head/tail offsets of the queue into a single 32-bit word
[1:02] <cmccabe> then, you construct a new word with the head/tail offsets once you've added/removed an element from the queue, and try to XCHG that into the memory location
[1:04] <cmccabe> the trick is that the queue only contains pointers, which have a fixed size of course, and the queue itself has a fixed power-of-2 size
[1:04] <cmccabe> less than 65k
[1:04] <cmccabe> ok, back to bugfixes...
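The packed head/tail scheme cmccabe describes above can be sketched roughly as follows. This is a minimal illustration, not Ceph code: the class name, method names, and capacity are hypothetical, and a compare-and-swap is assumed for the update step (the log says XCHG, but the surrounding discussion is about CMPXCHG). It covers only the index bookkeeping; safely publishing the pointer stored in a claimed slot when several producers or consumers race needs additional care that is not shown here.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical names; this is an illustration, not code from the Ceph tree.
// Capacity must be a power of two below 65536 so that the free-running
// 16-bit counters wrap consistently with the slot mask.
constexpr uint32_t kCapacity = 1024;
static_assert((kCapacity & (kCapacity - 1)) == 0 && kCapacity < 65536,
              "capacity must be a power of two below 65536");

class PackedIndexQueue {
public:
    // Claim the next slot for a push. Returns false if the queue is full.
    bool reserve_push(uint32_t* slot) {
        uint32_t old_word = state_.load(std::memory_order_acquire);
        for (;;) {
            const uint16_t head = static_cast<uint16_t>(old_word >> 16);
            const uint16_t tail = static_cast<uint16_t>(old_word & 0xffffu);
            if (static_cast<uint16_t>(tail - head) == kCapacity)
                return false;  // full
            const uint32_t new_word = pack(head, static_cast<uint16_t>(tail + 1));
            // On failure old_word is refreshed with the current value; retry.
            if (state_.compare_exchange_weak(old_word, new_word,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
                *slot = tail & (kCapacity - 1);
                return true;
            }
        }
    }

    // Claim the oldest slot for a pop. Returns false if the queue is empty.
    bool reserve_pop(uint32_t* slot) {
        uint32_t old_word = state_.load(std::memory_order_acquire);
        for (;;) {
            const uint16_t head = static_cast<uint16_t>(old_word >> 16);
            const uint16_t tail = static_cast<uint16_t>(old_word & 0xffffu);
            if (head == tail)
                return false;  // empty
            const uint32_t new_word = pack(static_cast<uint16_t>(head + 1), tail);
            if (state_.compare_exchange_weak(old_word, new_word,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
                *slot = head & (kCapacity - 1);
                return true;
            }
        }
    }

    // Number of slots currently claimed (tail minus head, modulo 65536).
    uint32_t size() const {
        const uint32_t w = state_.load(std::memory_order_acquire);
        return static_cast<uint16_t>((w & 0xffffu) - (w >> 16));
    }

private:
    // Head counter in the upper 16 bits, tail counter in the lower 16 bits.
    static uint32_t pack(uint16_t head, uint16_t tail) {
        return (static_cast<uint32_t>(head) << 16) | tail;
    }

    std::atomic<uint32_t> state_{0};
};
```

Because both counters live in one word, each operation commits with a single successful compare-and-swap, and the loop only repeats under contention, which matches the "often only a single CMPXCHG" point made in the discussion.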
[1:09] <cmccabe> so ideally I'd like this object ID length check to happen in object_t::object_t
[1:12] <sage> you mean with an exception?
[1:12] <cmccabe> yeah... maybe
[1:12] <cmccabe> I guess the alternative is to have a factory function from str which returns an error
[1:13] <sage> i would put a check in OSD::handle_osd_op if (oid.length() > max) { reply_request(-ENAMETOOLONG); return; } or whatever.
[1:13] <cmccabe> ok
[1:14] <sage> we probably need that anyway, even if the client side also enforces a limit there
[1:16] <cmccabe> I'd be more inclined to add a limit in object_t if I knew the limit would be permanent
[1:17] <cmccabe> I guess either way, we would need to check in OSD.cc though
[1:17] <sage> it would also be messy to catch object_name_too_long_exception in the messenger and handle it intelligently (i.e. replying with ENAMETOOLONG)
[1:18] <cmccabe> I'm not thrilled with exceptions because I suspect a lot of this code isn't exception-safe
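The up-front check sage suggests for OSD::handle_osd_op could look something like the sketch below. It is an assumption-laden illustration rather than the actual fix: the helper name and the constant are hypothetical, and the 1024-byte default is borrowed from the S3 key limit quoted earlier in the log, not from any confirmed Ceph limit.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical limit and helper; neither name comes from the Ceph tree.
// 1024 is assumed here only because it is the S3 key limit quoted in the log.
constexpr std::size_t kMaxObjectNameLen = 1024;

// Returns 0 if the name is acceptable, or -ENAMETOOLONG if it is too long.
// The caller would reply to the request with the error and return early,
// before the operation is journalled.
inline int check_object_name(const std::string& oid,
                             std::size_t max_len = kMaxObjectNameLen) {
    if (oid.length() > max_len)
        return -ENAMETOOLONG;
    return 0;
}
```

Returning an error code instead of throwing keeps the check usable from code that is not exception-safe, which is the concern raised in the discussion.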
[1:21] <Tv> alright autotest, i tried to be nice... *grabs a bigger hammer*
[1:30] <gregaf> sage: okay, it took me a while but I think I figured out where bchrisman's issue is coming from
[1:30] <gregaf> it looks like if we rename over an existing dentry/inode
[1:30] <gregaf> then the inode does get unlinked
[1:31] <gregaf> but predirty_journal_parents isn't called on the stray dir to deal with the move
[1:31] <gregaf> s/deal with the move/prepare for the move
[1:31] <sage> ah.
[1:31] <sage> i was probably trying to cut corners, since who really cares about the stray directory.
[1:32] <gregaf> heh
[1:32] <sage> probably best to be consistent though
[1:32] <gregaf> yeah
[1:32] <sage> i'm about to take off. i can help tomorrow morning if you need help fixing
[1:32] <gregaf> I get the feeling it will make fsck a lot simpler, for one
[1:32] <sage> yep
[1:33] <gregaf> I'm not really sure where it should go to begin with — what deals with renaming over an inode?
[1:33] <gregaf> is it only in _rename_apply?
[1:34] <gregaf> …yeah, I think that's it, it just checks if there's an inode in the target space and unlinks it if so
[1:35] <gregaf> anyway, cya tomorrow
[1:35] <cmccabe> have a good one.
[1:39] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:50] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:19] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[3:04] * cmccabe (~cmccabe@ has left #ceph
[3:10] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:21] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[3:46] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Read error: Connection reset by peer)
[4:27] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[4:51] * sjustlaptop (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) has joined #ceph
[6:28] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Remote host closed the connection)
[6:38] * hubertchang (~hubertcha@ has joined #ceph
[6:38] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[6:44] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[6:49] * Meths_ (rift@customer15835.pool1.unallocated-106-128.orangehomedsl.co.uk) has joined #ceph
[6:52] * hubertchang (~hubertcha@ Quit (Quit: Leaving)
[6:56] * Meths (rift@customer800.pool1.unallocated-106-128.orangehomedsl.co.uk) Quit (Ping timeout: 480 seconds)
[7:14] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[7:16] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[7:20] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[7:32] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[7:33] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[7:44] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[8:15] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:45] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[9:16] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:17] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:17] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit ()
[9:32] * sjustlaptop (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) Quit (Quit: Leaving.)
[9:49] * Meths_ is now known as Meths
[9:50] * allsystemsarego (~allsystem@ has joined #ceph
[10:13] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) has joined #ceph
[13:18] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) Quit (Quit: Leaving)
[13:19] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) has joined #ceph
[14:06] * Meths_ (rift@customer14404.pool1.unallocated-106-192.orangehomedsl.co.uk) has joined #ceph
[14:12] * Meths (rift@customer15835.pool1.unallocated-106-128.orangehomedsl.co.uk) Quit (Ping timeout: 480 seconds)
[14:13] * Meths_ is now known as Meths
[14:14] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[14:18] * Meths_ (rift@customer16161.pool1.unallocated-106-128.orangehomedsl.co.uk) has joined #ceph
[14:23] * Meths (rift@customer14404.pool1.unallocated-106-192.orangehomedsl.co.uk) Quit (Ping timeout: 484 seconds)
[15:07] * chraible (~chraible@blackhole.science-computing.de) has joined #ceph
[15:09] <chraible> hi I compiled & configured ceph 0.26 today and now i want to start ceph wir /etc/init.d/ceph -a start or service ceph -a start but the service is unknown and in /etc/init.d there is no such file!
[15:09] <chraible> wir = with
[15:15] <chraible> i compiled ceph with following commands... ./autogen.sh ---- CXXFLAGS="-g -O2" ./configure --prefix=/usr --sbindir=/sbin --localstatedir=/var --sysconfdir=/etc ---- make ---- make install
[16:03] * jbdenis (~jbdenis@brucciu.sis.pasteur.fr) Quit (Quit: Lost terminal)
[16:07] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (Quit: Ex-Chat)
[16:12] * yehudasa (~quassel@ip-66-33-206-8.dreamhost.com) has joined #ceph
[16:39] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:30] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:52] * greglap (~Adium@ has joined #ceph
[18:08] <greglap> chraible: what OS are you on?
[18:15] * Yoric_ (~David@87-231-38-145.rev.numericable.fr) has joined #ceph
[18:22] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[18:22] * Yoric_ is now known as Yoric
[18:33] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:34] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[18:39] * Yoric (~David@87-231-38-145.rev.numericable.fr) Quit (Quit: Yoric)
[18:39] <Tv> cmccabe: quick browsing didn't make me find this; is there something explicitly controlling the mode of log files? they're non world-readable right now
[18:39] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[18:40] <cmccabe> tv: nothing besides umask
[18:40] <Tv> ok then i guess something is setting umask behind my back..
[18:40] <cmccabe> well, actually, maybe I explicitly set user-only
[18:41] <cmccabe> tv: yeah, looks like.
[18:41] <Tv> cmccabe: 1) where 2) is it configurable
[18:41] <cmccabe> tv: in _read_ofile_config, that call to open
[18:41] <cmccabe> tv: it's not configurable at the moment
[18:42] <Tv> oh right S_IWUSR etc
[18:42] <Tv> i guess i'll chmod them after the fact, for now
[18:43] <cmccabe> tv: I guess I should check out what other daemons do
[18:43] <Tv> cmccabe: usually, you let the directory control who gets to access the logs
[18:44] <Tv> cmccabe: i often do chgrp adm /var/log/something
[18:45] <cmccabe> tv: so in that case, the directory has the restrictive permissions (or not) rather than the files
[18:46] <Tv> yeah
[18:46] <Tv> which means you don't have to configure the daemon specifically, just set the dir right once
[18:46] <Tv> less code = happier tv ;)
[18:46] <cmccabe> tv: that also leaves you the option of using umask to restrict things
[18:47] <cmccabe> tv: although in our case, there are other files we create so it would have more implications
[18:48] <Tv> cmccabe: another question: what does DOUTSB_FLAG_OFILE do?
[18:48] <cmccabe> tv: that's just an implementation detail
[18:48] <Tv> when is it really set?
[18:48] <cmccabe> tv: it is set when output to a file is configured
[18:48] <Tv> ok
[18:48] <cmccabe> tv: well, more precisely, it's set when we should write to a file
[18:48] <cmccabe> tv: if we get errors writing to the file, that flag will be cleared and we'll stop trying to beat the dead horse... or whatever
[18:50] <Tv> cmccabe: if i set log dir = foo ; log file = bar; does it get written to foo/bar or just bar?
[18:50] <cmccabe> tv: foo/bar
[18:50] <cmccabe> tv: unless bar is an absolute path, in which case it "wins"
[18:51] <Tv> yeah ok thanks
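The rule cmccabe describes (log file is placed under log dir unless log file is an absolute path) can be sketched as below. The function name is hypothetical and this is not the code Ceph uses; it only illustrates the precedence described in the conversation.

```cpp
#include <string>

// Hypothetical helper illustrating the precedence described in the log;
// not the actual Ceph implementation.
inline std::string resolve_log_path(const std::string& log_dir,
                                    const std::string& log_file) {
    // An absolute log_file "wins" and log_dir is ignored.
    if (!log_file.empty() && log_file[0] == '/')
        return log_file;
    // With no log_dir configured, use log_file as given.
    if (log_dir.empty())
        return log_file;
    // Otherwise join the two, avoiding a doubled slash.
    if (log_dir.back() == '/')
        return log_dir + log_file;
    return log_dir + "/" + log_file;
}
```

With this rule, `resolve_log_path("foo", "bar")` yields `foo/bar`, while an absolute file name is returned unchanged.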
[18:51] <cmccabe> tv: I didn't come up with it, and yes, it's annoying
[18:51] <cmccabe> tv: we've been talking about nixing log_dir for a while now that metavariables can do pretty much everything it ever did
[18:52] <Tv> in this case, i actually like it ;)
[18:52] <Tv> less copy-paste
[18:52] <cmccabe> tv: well, you don't need copy-paste either way
[18:52] <Tv> cmccabe: oh can i say [global] log file = results/log/$name.$id.log
[18:52] <cmccabe> tv: just put something like "log file = /my/path/to/foo/$name.log" into global
[18:52] <cmccabe> tv: yeah
[18:52] <Tv> err yeah $name is $type.$id right
[18:52] <lxo> yay, /me is very happy, finished setting up and loading data onto my home cluster! just in time to get away from home for a week
[18:53] <lxo> sage, I'll look into the situation that leads to the apparent need for the patch I posted the other day, and try a build with your patch for bug 1001 too
[18:54] <cmccabe> tv: $name=$type.$id
[18:54] <cmccabe> tv: yeah
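The metavariable substitution discussed above, where $name stands for $type.$id, can be illustrated with a small expander. This is a sketch under assumptions: the function name is hypothetical, only the three variables mentioned in the log are handled, and Ceph's real config parser may behave differently.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical expander for the three metavariables mentioned in the log.
// Unknown variables are left in place unchanged.
inline std::string expand_metavars(const std::string& tmpl,
                                   const std::string& type,
                                   const std::string& id) {
    std::string out;
    std::size_t i = 0;
    while (i < tmpl.size()) {
        if (tmpl[i] != '$') {
            out += tmpl[i++];
            continue;
        }
        // Collect the alphabetic variable name following the '$'.
        std::size_t j = i + 1;
        while (j < tmpl.size() &&
               std::isalpha(static_cast<unsigned char>(tmpl[j])))
            ++j;
        const std::string var = tmpl.substr(i + 1, j - i - 1);
        if (var == "name")
            out += type + "." + id;  // $name = $type.$id
        else if (var == "type")
            out += type;
        else if (var == "id")
            out += id;
        else
            out += tmpl.substr(i, j - i);  // not a known metavariable
        i = j;
    }
    return out;
}
```

Under these assumptions, the template `/my/path/to/foo/$name.log` for daemon type `mds` and id `a` expands to `/my/path/to/foo/mds.a.log`. Tv's first attempt, `results/log/$name.$id.log`, would repeat the id, which is why he notes that $name already includes it.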
[18:57] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:00] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:05] <gregaf> bchrisman: did you see my patch on #989?
[19:06] <bchrisman> gregaf: Yeah.. I've got some more testing in progress before I can verify whether the check_rstats output is gone.. hopefully can get that today
[19:06] <gregaf> cool
[19:06] <gregaf> just pushed it to master too
[19:07] <bchrisman> ahh okay.. yeah.. it'll get autobuilt in a couple hours and I should be able to catch it in test..
[19:07] <bchrisman> did you push to master more than 20 min ago? :)
[19:07] <gregaf> no, literally just now
[19:07] <bchrisman> ahh okay.. next cycle then...
[19:08] <gregaf> wanted to check with Sage that I hadn't missed any subtleties before i put it in the repo, looks all good though
[19:08] <bchrisman> trying to get vfs module loaded today.. :)
[19:10] * raso (~raso@debian-multimedia.org) Quit (Quit: WeeChat 0.3.4)
[20:05] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:04] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:20] * Anticime1 is now known as Anticimex
[21:38] <Tv> cephbooter is full again
[21:38] <Tv> dammit
[21:41] <Tv> for a storage-oriented group, we sure don't have storage figured out right yet ;)
[21:41] <Tv> need ceph to be stable
[21:43] <Tv> -rw------- 1 root root 2.1G 2011-04-13 02:29 /images/sepia/core.cosd.sepia2.18936
[21:44] <Tv> who wants that? i'll remove it soon otherwise
[21:44] <Tv> also 4 >2GB cmon cores in root/
[21:45] * Guest576 (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[21:45] <Tv> sagewk: yehudasa: gregaf: sjust: joshd: yours, yes or no?
[21:45] <gregaf> don't think it's mine, no
[21:46] <Tv> cmccabe: yours, yes or no?
[21:46] <sagewk> kill it
[21:46] <Tv> with fire
[21:46] <cmccabe> tv: I don't need that file
[21:46] <yehudasa> tv: not mine
[21:46] <cmccabe> tv: I also haven't used sepia in a while, so I doubt it's mine
[21:46] <joshd> tv: probably not mine
[21:47] <Tv> ok they're gone now already
[21:47] <Tv> 28GB disk with ~10GB taken by core files
[21:47] <Tv> oh what cephbooter has /data that has 520GB free!
[21:47] <gregaf> haha, you're only now noticing the joys of our partitions?
[21:47] <Tv> except that's xfs, ewww
[21:48] <gregaf> what's wrong with xfs?
[21:48] <gregaf> <— doesn't use linux filesystems except at work
[21:48] <Tv> gregaf: it has very few users compared to extN, so it's more likely to have funky bugs
[21:49] <Tv> and, well, IRIX. may i never experience that brain damage again.
[21:49] <gregaf> all I know about it is that some of the most active people on fs-devel do xfs development
[21:49] <gregaf> and xfstests makes me happy
[21:50] * bbigras (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) has joined #ceph
[21:50] <bchrisman> do you guys play 'collect the whole set' for underlying filesystems for your internal ceph clusters? :)
[21:51] * bbigras is now known as Guest1850
[21:51] <Tv> bchrisman: this box is nfsroot & dhcp, and probably predates most of ceph code..
[21:51] <Tv> looks like its /etc directory was created in 2001
[21:52] <Tv> that's... a long time ago
[21:53] <bchrisman> ah :)
[21:55] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[21:55] <cmccabe> tv: xfs is more complex than ext3/4, but it has more features too
[21:55] <Tv> s/more/different/
[21:56] <cmccabe> tv: the guaranteed-rate I/O thing sounds pretty interesting, although the FS is a weird layer to implement it in
[21:56] <Tv> back in early last decade, if you were building a dvr, xfs made a lot of sense
[21:56] <cmccabe> tv: what does XFS lack that ext4 has then?
[21:57] <Tv> for example, speed of metadata operations has been a stumbling block
[21:57] <gregaf> I get the feeling it's a playground for new FS features that the kernel devs want to make standard
[21:57] <cmccabe> tv: that's not exactly a "feature" :)
[21:57] <gregaf> or are thinking about making standard
[21:57] <Tv> gregaf: last decade, it was; these days i'd say that's btrfs
[21:57] <gregaf> Tv: it's definitely shifting to btrfs
[21:58] <cmccabe> tv: wikipedia at least seems to believe that XFS metadata ops got a lot faster due to some work by Dave Chinner
[21:58] <gregaf> but there's still a lot of fs-devel threads about new features in the VFS or whatever where they're like "it's an ioctl in XFS and it doesn't seem to be a good interface"
[21:59] <Tv> gregaf: yeah, especially earlier on xfs had a huge pile of features
[21:59] <Tv> but.. that doesn't always inspire confidence in a filesystem (/in me)
[22:00] <cmccabe> it's true that a lot of people used ext because of its relative simplicity
[22:00] <Tv> Ted Tso has written a few good rants on how to handle all the funky error cases, etc. Just about anything else has never reached the level of stability.
[22:00] <Tv> (else than extN, his baby, that is)
[22:00] <Tv> not just simplicity, extN has been blazing fast compared to many other file systems
[22:01] <Tv> at one point, someone benchmarked extN as faster than solaris ramfs ;)
[22:01] <cmccabe> that calls to mind the quote about lies, damn lies, and benchmarks
[22:02] <cmccabe> it's easy to get superfast performance if you're only benchmarking the page cache
[22:02] <Tv> and about 80% of statistics are made up on the spot ;)
[22:02] <Tv> but that was the whole point of that benchmark
[22:02] <Tv> short code paths etc = win
[22:03] <Tv> solaris syscall costs were high enough to make it lose
[22:04] <cmccabe> solaris has some weird stuff, like spinlocks that turn into mutexes if you hold them for too long (seriously)
[22:21] <Tv> i see sage merged in the dead-code branch.. if i don't count gtest, that makes my current code contribution be -21379 lines
[22:22] <Tv> if i only could keep it negative ;)
[22:32] <Tv> bwahah
[22:32] <Tv> gregaf: ok so i tried dbench with just one mds
[22:33] <Tv> good news is, it didn't hang
[22:33] <Tv> bad news is, it's actually slower than over cfuse
[22:35] <Tv> same 4-node cluster setup on sepia (mon&mds / osd / osd / client): dbench over kclient = 0.73MB/s, dbench over cfuse = 0.93MB/s
[22:45] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:56] <Tv> an autotest moment: read_keyval doesn't quite seem to work; its only caller is its unit tests
[22:56] <Tv> *le rage*
[22:57] <Tv> oh no there are users
[22:57] <Tv> so it has two different file formats called "keyval", then..
[22:58] <bchrisman> Tv: cfuse has been performing better for me in a few cases as well...
[23:04] <bchrisman> gregaf: check_rstats messages no longer showing up
[23:04] <gregaf> bchrisman: excellent
[23:04] <gregaf> Tv: bchrisman: the kclient got a lot slower for 2.6.39 (or maybe 2.6.38?)
[23:05] <gregaf> we had to disable local lookups entirely due to newly-discovered race conditions
[23:05] <gregaf> so workloads that are heavy on file creates and stuff might be faster on cfuse now since it needs to send fewer messages
[23:30] <gregaf> cmccabe: you rejiggered the symlinks, I take it?
[23:31] <cmccabe> gregaf: the symlinks behavior shouldn't have changed
[23:31] <gregaf> well something changed in terms of what's in our out/ dir
[23:31] <gregaf> not really a problem, just confused the hell out of me for a second
[23:31] <cmccabe> gregaf: let me see if there's any interaction between log_dir and the symlink dance
[23:31] <gregaf> I seem to have eg mds.a.log now
[23:32] <gregaf> instead of kai.1234
[23:32] <gregaf> with mds.a and mds.0 both symlinks pointing to it
[23:32] <cmccabe> blah. Looks like there is interaction.
[23:34] <cmccabe> looks like log_per_instance was always specified to work with log_dir, not with log_file
[23:45] <Tv> gregaf: ok i guess dbench is heavy on metadata operations; i'll keep fiddling with iozone etc

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.