#ceph IRC Log


IRC Log for 2010-12-15

Timestamps are in GMT/BST.

[0:21] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[0:37] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[2:08] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) Quit (Quit: Leaving.)
[3:04] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:44] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[5:35] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[6:00] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[6:35] * ijuz_ (~ijuz@p4FFF784B.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[6:44] * ijuz_ (~ijuz@p4FFF5989.dip.t-dialin.net) has joined #ceph
[6:46] * f4m8_ is now known as f4m8
[6:53] * ijuz__ (~ijuz@p4FFF652C.dip.t-dialin.net) has joined #ceph
[7:01] * ijuz_ (~ijuz@p4FFF5989.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[8:03] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:54] * allsystemsarego (~allsystem@79.115.53.84) has joined #ceph
[9:08] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[11:32] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[12:07] * Yoric (~David@213.144.210.93) has joined #ceph
[13:53] * DLange (~DLange@dlange.user.oftc.net) Quit (Quit: kernel upgrade)
[13:56] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[14:59] * alexxy[home] (~alexxy@79.173.81.171) has joined #ceph
[15:04] * alexxy (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[15:14] * alexxy (~alexxy@masq246.gtn.ru) has joined #ceph
[15:15] * alexxy[home] (~alexxy@79.173.81.171) Quit (Read error: Connection reset by peer)
[15:18] * alexxy (~alexxy@masq246.gtn.ru) Quit ()
[15:18] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[15:31] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[15:31] * Yoric (~David@213.144.210.93) has joined #ceph
[15:44] * f4m8 is now known as f4m8_
[16:11] * allsystemsarego_ (~allsystem@188.25.128.213) has joined #ceph
[16:18] * allsystemsarego (~allsystem@79.115.53.84) Quit (Ping timeout: 480 seconds)
[16:59] * alexxy (~alexxy@79.173.81.171) Quit (Remote host closed the connection)
[17:04] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[17:20] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[17:21] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[17:21] * MarkN (~nathan@59.167.240.178) Quit (Ping timeout: 480 seconds)
[17:40] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[17:44] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:52] * greglap (~Adium@166.205.139.33) has joined #ceph
[17:53] * MarkN (~nathan@mail.zomojo.com) has joined #ceph
[18:19] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:30] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:41] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[18:41] * greglap (~Adium@166.205.139.33) Quit (Quit: Leaving.)
[18:58] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[19:02] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:46] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[19:53] <wido> sagewk: How much does (or should) mds_log_unsafe affect filesystem performance?
[19:53] <sagewk> i don't think it does at all anymore, actually.. that's an old option
[19:55] <wido> Ah, Wiki is outdated :)
[19:55] <wido> Just saw greg's issue about rsync being slow, that's why I brought it up
[19:57] <sagewk> wido: have you updated your btrfs kernel at all recently?
[19:57] <sagewk> there was a problem where all snap creations were async (even via the normal sync ioctls)..
[19:58] <wido> sagewk: It's a stock Ubuntu kernel, got an update I think 3 days ago
[19:58] <sagewk> an -rc though?
[19:59] <wido> Yes, 2.6.37-rc5
[20:00] <sagewk> ok. btrfs-unstable has a bunch of fixes now, for that async thing and some other stuff. any chance you can run something that includes those patches?
[20:00] <sagewk> or we can wait for -rc6, he should be sending it to linus shortly.
[20:00] <wido> sagewk: I could build the latest btrfs module against this kernel I think?
[20:01] <sagewk> yeah probably?
[20:01] <wido> I'll try that
[20:01] <wido> btw, upgraded 'noisy' today to 4 WD Greenpower 2TB disks and an SSD for journaling
[20:01] <wido> that runs pretty smooth right now
[20:02] <sagewk> nice
[20:06] <cmccabe> wido: I think there were some complaints about greenpower drives a while ago...
[20:06] <cmccabe> http://kerneltrap.org/mailarchive/linux-kernel/2008/4/9/1386304
[20:08] <cmccabe> wido: something about how they were super-eager to park the heads all the time, which led to a lot of unnecessary head parking
[20:08] <cmccabe> wido: I don't know if they ever solved the problem, or what the conclusion was
[20:08] <gregaf> I haven't seen any actual complaints about their longevity — somebody complaining about head parks on a mailing list doesn't mean there's actually a problem
[20:08] <yehudasa> cmccabe: it's like more than two and a half years ago?
[20:09] <cmccabe> yehudasa: yeah, that's why I'm curious what the conclusion was
[20:11] <gregaf> wido: what SSD are you using for the journal?
[20:11] <wido> gregaf: A pretty old one which I had lying around. SLC drive, MTron 128GB
[20:11] <wido> does about 120MB/sec write
[20:11] <gregaf> cool
[20:12] <wido> cmccabe: I'm aware of the Greenpower drives, we had some issues with them in a RAID config, but I just wanted to see how they would work for an OSD
[20:12] <wido> because of their low price and power consumption
[20:12] <wido> If they don't work, they can always store my 'backups' ;)
[20:14] <wido> sagewk: The current master branch doesn't seem to build against 2.6.37-rc5, there is a next-rc branch (2 weeks old) and a for-linus (6 weeks old) from btrfs. Would the next-rc work?
[20:15] <wido> I mean, would that have the 'fix' I need?
[20:15] <gregaf> wido: sage is out for lunch
[20:15] <cmccabe> wido: cool. I'm interested to hear about your experiences with Greenpower. I considered buying one of them a while ago for power reasons, but didn't have enough information to make the decision
[20:18] <wido> cmccabe: I'll keep you updated. If I get some weird behaviour i'll let you know. I've got these: WD20EARS-00MVWB0 (2TB)
[20:19] <cmccabe> wido: It's also nice that you have an SSD for journaling. Perhaps "noisy" won't live up to its name any more?
[20:21] <cmccabe> wido: I worked with SSDs a bunch at a previous company, but never actually got the chance to have one on my dev/home machine (sigh)
[20:21] <ijuz__> this is probably interesting, those are numbers from a big French online retailer about the failure rates of hard disks http://www.hardware.fr/articles/810-6/taux-pannes-composants.html
[20:22] <wido> cmccabe: noisy is at our office, it's a SuperMicro office server with some huge fans, it's really noisy! But I've been testing with SSDs for a long time now in various applications, it's really great. I'd advise one for your laptop or desktop.
[20:23] <gregaf> cmccabe: you should buy yourself an SSD! They're wonderful in a laptop/desktop.
[20:23] <gregaf> I'm not so sure about our work machines with their big in-memory disk caches
[20:23] <cmccabe> gregaf: I think I'm getting one for christmas... due to some not-so-subtle hints/questions from a friend
[20:23] <wido> ijuz__: My french is pretty bad.. But I understand the numbers! :)
[20:24] <wido> The Intel X25-M is affordable
[20:24] <wido> 80GB for about $160
[20:24] <ijuz__> wido: the same is true for me, the numbers were enough for me, alternatively google can translate it for you
[20:24] <cmccabe> gregaf: yeah, the page cache does help a lot. But sometimes you just need to do a find or something and you hit the cold-cache scenario on your rotating disk...
[20:24] <gregaf> yeah, that doesn't bother me as much as when using a stupid Windows machine and I have to wait for a disk access before the start menu opens up...
[20:25] <gregaf> I be loving my desktop's x25-m as a boot drive
[20:25] <gregaf> and my new laptop has a built-in one (Macbook Air)
[20:25] <gregaf> it's a bummer that the consumer SSDs are such a bad choice for journaling, though :(
[20:25] <cmccabe> gregaf: some people believe that SSDs will displace traditional HDs in the consumer market within a few years
[20:25] <wido> Nice :-) I've got a X25-M in my laptop and desktop, wouldn't want anything else.
[20:26] <wido> cmccabe: There will be hybrids, OS on the SSD, HDD for the mass storage
[20:26] <wido> or a Hybrid disk like the Seagate Momentus XT
[20:26] <gregaf> those disks don't seem to work very well right now, though
[20:26] <cmccabe> depends on which industry pundit you read, I guess.
[20:26] <gregaf> in terms of actually accelerating access
[20:26] <cmccabe> From the point-of-view of your average (l)user, SSDs give a pretty big performance increase.
[20:26] <wido> gregaf: Write Intel an e-mail, ask for an SLC SSD of 4GB with 200MB/sec write speeds
[20:26] <wido> those would be fine for journaling
[20:27] <gregaf> lol
[20:27] <cmccabe> And moore's law (about transistors) is still operating like before, although clock speeds are maxed.
[20:27] <gregaf> write endurance is a serious problem for SSDs as they transition to smaller transistors, though
[20:27] <gregaf> the x25-m only gets 6 months or something under constant write loads
[20:27] <gregaf> which is fine for consumer stuff, but it's got like 10k write/erase cycles per cell
[20:28] <cmccabe> gregaf: unfortunately, whatever is fine for consumer stuff is what will win...
[20:28] <gregaf> the 34nm drives only get 3k, and the smaller ones will get worse
[20:28] <ijuz__> wido: the chips are too big these days, you would have to settle for at least 8 or 16GB :)
[20:28] <cmccabe> gregaf: I used to work in the storage industry. We always had to deal with consumer hardware that had been re-purposed to enterprise use.
[20:28] <wido> gregaf: Don't forget the wear leveling. If you only use a few percent that should do its job
[20:28] <cmccabe> gregaf: because the enterprise market isn't big enough to support completely separate products (although it will support... er... adaptations)
[20:28] <wido> ijuz__: Yes, of course
[20:29] <cmccabe> my biggest gripe about SSDs is that the firmware is closed-source
[20:29] <gregaf> wido: hmm, I don't think that'll actually make a difference under constant write workloads in terms of maximum life?
[20:29] <cmccabe> which makes it hard to know what it's really doing
[20:29] <cmccabe> and hence, hard to know what you should be doing to optimize its lifespan/performance
[20:30] <ijuz__> for journaling a memory device would be best of course
[20:30] <cmccabe> ijuz__: one thing that some people do is they buy from FusionIO
[20:30] <ijuz__> we built some PCIe flash device with also DRAM, but nobody wanted to work with it
[20:30] <cmccabe> ijuz__: basically, they sell you an array of flash on a PCIe bus. Then you can access it as a raw MTD
[20:31] <wido> ijuz__: Something like the gigabyte i-Ram?
[20:31] <ijuz__> wido: no, something with PCIe and not SATA
[20:31] <cmccabe> ijuz__: I'm curious why people wouldn't simply mount your "PCIe flash device" as an MTD using one of the log-structured filesystems like nilfs, yaffs, etc.
[20:32] <cmccabe> ijuz__: or would that not take advantage of the on-board DRAM?
[20:32] <ijuz__> cmccabe: well, easy, just using it as some MTD device would not make it possible to get good performance
[20:33] <ijuz__> the DRAM is separate (but there is a path to write from DRAM to flash)
[20:35] <wido> cmccabe: About the closed firmware of SSD's, yes, that is a big problem. I have loads of MTron SSD's lying around here, which are all broken
[20:35] <wido> their problem is, after some time they lock up. Simply don't respond. With some weird tool I can low-level format them to get them working again
[20:36] <cmccabe> wido: yeah, I heard about a bunch of problems kinda like that with SSDs
[20:36] <wido> not really useful in production, but good enough for testing. But MTron never helped fix this issue
[20:36] <gregaf> hmmm, weird problem with their garbage collection or something?
[20:36] <wido> I'd think so, but can't be sure
[20:36] <cmccabe> wido: there was a really interesting article on LWN a while back about how SSDs basically implement a log-structured filesystem inside them
[20:36] <cmccabe> wido: and like all filesystems... er... there can be bugs
[20:37] <ijuz__> the problem without this SSD firmware is... that you have no firmware; we (university computer architecture group) built the hardware, but no software, and there are 96 logical chips on the board that have to be scheduled for writing/reading
[20:37] <cmccabe> http://lwn.net/Articles/353411/
[20:37] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Remote host closed the connection)
[20:37] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[20:38] <wido> cmccabe: Tnx!
[20:38] <wido> But I'm always curious how the really, really, really expensive SSDs perform over time. We have some EMC storage, they have 300GB FC SSDs for about 5k each
[20:38] <cmccabe> ijuz__: the question of thick firmware vs. thin firmware is a complicated one
[20:39] <wido> cmccabe: For your desktop: http://www.fusionio.com/
[20:39] <cmccabe> ijuz__: I tend to favor the thin firmware side of the debate because I think that the kernel has more information to use to make decisions
[20:40] <cmccabe> ijuz__: for example, the kernel can do write coalescing, etc, and also wear levelling
[20:40] <wido> cmccabe: Yes, that's Linux. But then I think you would be talking about servers / enterprise. Since most consumer SSD's will end up in a Windows machine
[20:40] <cmccabe> ijuz__: it also makes it possible for the user to get meaningful information from the kernel using sysfs, and other userspace tools, which you almost never get from a firmware
[20:40] <ijuz__> cmccabe: i'm sure that very thin firmware is the best; but i just had no resources to build the software, i think it would have required about 1 man year
[20:40] <gregaf> yeah, thick firmware is definitely a win in closed-source environments
[20:41] <cmccabe> yeah, thick firmware for SSDs kind of evolved in the windows world, where there is *no* in-kernel log-structured filesystem
[20:41] <cmccabe> so the engineers had to make the best decision with the situation they had
[20:42] <cmccabe> ijuz__: yeah, it depends on what your research is like. Sometimes firmwares can be very helpful.
[20:42] <cmccabe> ijuz__: you might also consider looking into the in-kernel filesystems like nilfs, yaffs, etc. to see if it has any useful ideas
[20:42] <wido> cmccabe: But I think you're proposing a PCI-E SSD? I'm no S-ATA expert, but does S-ATA have the ability to pass through such information?
[20:42] <wido> or via SMART?
[20:43] <cmccabe> wido: SATA was designed for spinning disks, and has information appropriate to that
[20:43] <cmccabe> wido: they have been slowly adding more features that make sense for SSDs, but it's a slow process
[20:43] <ijuz__> cmccabe: i looked at all of them a while back of course, the only valid choices are imo nilfs2 and logfs
[20:44] <cmccabe> wido: see for example the new TRIM command
[20:45] <cmccabe> wido: in general, SMART information is kind of limited. There is kind of a "miscellaneous grab bag" section where you can basically return a bunch of key-value pairs, but I doubt that most SSDs put meaningful info there
[20:46] <ijuz__> cmccabe: the research was to build a fast flash device for a special application (that later vanished) so we built 4x PCIe cards with 192 GB SLC and 512 MB DDR2, the flash chips have to be controlled each on its own, there are basically 24 controllers with 4 command queues each, the limit due to the slow PCIe stuff is like 750 MB/s reading and like 550 MB/s writing
[20:47] <cmccabe> ijuz__: interesting. I'm not used to hearing PCIe described as "slow", but I guess some applications are different than others :)
[20:48] <gregaf> PCIe…slow…never thought I'd see those two words in the same sentence!
[20:48] <ijuz__> cmccabe: well, it's only 4x and i lost like 10% of the theoretical performance due to the crappy PCIe core from lattice
[20:48] <gregaf> oh, guess I'm slow though ;)
[20:48] <cmccabe> ijuz__: sounds really interesting.
[20:48] <gregaf> why 4x? Is that just what the controller could handle?
[20:48] <ijuz__> gregaf: that is a cheap fpga
[20:49] <DeHackEd> so it's a scaled down IOdrive ?
[20:49] <cmccabe> ijuz__: the real challenge for system architects is to make a 2010 system look like a 1975 one
[20:50] <cmccabe> ijuz__: because programmers just don't like changing their code
[20:50] <ijuz__> DeHackEd: when we built it, it was faster
[20:50] <cmccabe> ijuz__: and often have trouble with concepts like page cache, buffered IO vs. nonbuffered, etc.
[20:51] <ijuz__> cmccabe: well, code is expensive, no wonder
[20:51] <cmccabe> ijuz__: true that.
[21:06] <wido> cmccabe: So, you say that the current SSDs are not really up to the task for journaling, I mean the consumer or lower-end SSDs
[21:06] <wido> But is there some piece of hardware that could do it and which is affordable compared to an SSD?
[21:07] <wido> we need only a few gigs of fast writing storage protected against power failure
[21:08] <cmccabe> wido: I think the problem greg was mentioning earlier is limited write cycles for SSD
[21:09] <wido> Yes, but over time that is a problem with an OSD which will be running 24/7
[21:09] <cmccabe> wido: we at least talked about using SSDs for journalling in the past, so it's not a crazy idea though
[21:09] <cmccabe> wido: probably ask Sage what he thinks?
[21:10] <wido> Oh, I personally think it would work fine, I've been using the X25-M for about 1.5 years now in some really heavy MySQL databases
[21:10] <cmccabe> wido: he's at lunch now but I'm really curious if he's thought about it in the past
[21:10] <wido> performance is still fine, no problems at all. I simply don't use more than 80% of the SSD's capacity
[21:11] <cmccabe> wido: the big win with SSDs is random access, of course
[21:13] <wido> true, that's the best
[21:13] <wido> Although I think Intel has done a good job with their SSDs, compared to the problems you see with the other vendors which use the JMicron controllers
[21:13] <cmccabe> wido: Intel SSDs got a pretty glowing endorsement from Linus Torvalds a while back
[21:14] <cmccabe> wido: of course, that was at least a year ago, and the other vendors may have gotten better since then
[21:15] <cmccabe> wido: also, of course, Linus was using it on his desktop. He didn't try to simulate scientific computing or anything like that.
[21:16] <wido> Of course, but in my experience using them in a RAID-5 setup, they work fine.
[21:17] <wido> But there is always a trade-off somewhere
[21:24] <cmccabe> wido: sounds like the SSD journal is generally a good idea
[21:26] <wido> cmccabe: Yes, but a crazy-fast (small) device somewhere should be even better
[21:29] <wido> cmccabe: Do you think it is 'safe' to merge the syslog branch into the rc branch (I'll do it locally), just to test the syslog features
[21:30] <cmccabe> wido: it might be kind of a messy merge
[21:30] <cmccabe> wido: also, there were a few fixes that happened after it got merged into unstable
[21:31] <wido> Ok, i'll probably then switch to unstable
[21:31] <wido> path of the least resistance :)
[21:31] <cmccabe> wido: yeah, that's kind of the easiest thing at the moment
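
For reference, switching a local checkout over to the unstable branch would look roughly like this; a minimal sketch, assuming a ceph.git clone with an 'unstable' branch on the origin remote and the usual autotools build (everything beyond the branch names is an assumption):

    cd ceph
    git fetch origin
    git checkout -b unstable origin/unstable   # track the unstable branch
    ./autogen.sh && ./configure && make        # rebuild from the new code
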
[21:35] <gregaf> wido: I don't think enterprise SSDs with their SLC flash would cause problems as a journal device, though I don't know how much they cost
[21:36] <gregaf> the problem I'm concerned about is that a journaling device is going to be subject to pretty much constant writes, and from what I've read an MLC flash device like the x25-m is only going to last about 6 months under that scenario (though it'll go considerably longer if it's not writing at max speed all the time)
[21:37] <gregaf> whereas the lifetime of an SLC drive based on the same silicon and doing the same workload is going to be measured in many decades
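
A back-of-the-envelope check of that six-month figure, assuming an 80GB MLC drive, 10,000 program/erase cycles per cell, perfect wear leveling, and roughly 50MB/s of sustained journal writes (the write rate is an assumption for illustration):

    # total write budget: 80 GB * 10,000 cycles = 800 TB
    # days until that budget is exhausted at 50 MB/s, around the clock:
    echo $(( 80 * 1000 * 10000 / 50 / 86400 ))   # ~185 days, i.e. about six months

The same arithmetic with SLC's higher cycle count, and a workload that isn't writing at full speed 24/7, pushes the figure out by one to two orders of magnitude.
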
[21:38] <iggy> we got some new viking slc ssds in here... nice and speedy
[21:46] <wido> gregaf: For example, MLC has 10,000 writes before it wears out, SLC 100,000
[21:46] <wido> writing all cells 10,000 times on an 80GB SSD, that takes some time
[21:47] <wido> Indeed, 6 months under FULL load 24/7, but if you underpartition the drives, it spreads the writes over all the blocks
[21:47] <wido> If you use 50% of the SSD, you should gain 100% in lifetime (in theory)
[21:47] <gregaf> I'm not sure how that holds
[21:48] <gregaf> you probably get better practical lifetime since the drive has a way easier job wear-leveling, but the limits are still the same and you're still writing the same total data
[21:48] <wido> true, but when a cell wears out, you have more spare cells
[21:48] <wido> A 32GB SLC SSD is 40GB in total most of the time, but you can only use 32GB
[21:49] <cmccabe> in some ways, the job of the SSD firmware is similar to that of a garbage collector
[21:49] <wido> Sun also has SSDs for their ZFS ZIL, it's a 32/40GB Intel X25-E, but the firmware has been modified so that its capacity is only 18GB
[21:49] <cmccabe> it must find places to put chunks of data
[21:49] <wido> yes and spread the writes over all the cells, move some data around to get a balance
[21:50] <cmccabe> the additional constraint that it has is that writes happen in fixed sizes called erase blocks
[21:50] <cmccabe> so modifying 1 byte of an erase block means read-modify-write on, potentially, a 16kb chunk
[21:51] <wido> That is the biggest downside, that's why you see people aligning their partitions
[21:51] <cmccabe> the more space that is free, the easier it is for the garbage collector to operate without a lot of read-modify-writes
[21:51] <ijuz__> erase blocks are at least 256 kB
[21:51] <cmccabe> the relationship is nowhere near linear though
[21:51] <wido> ijuz__: For every SSD?
[21:51] <cmccabe> ijuz__: ah, my information is a few years out of date! Wonder how long until they're 512kb
[21:51] <ijuz__> wido: for "every" nand chip
[21:51] <wido> ah, ok
[21:52] <ijuz__> MLC have afaik 512 kB erase blocks
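
To put rough numbers on the read-modify-write cost mentioned above, the worst-case write amplification for a small update is simply the erase block size divided by the update size (erase block sizes taken from the discussion, the 4kB update size is an assumption):

    echo $(( 256 / 4 ))   # 64x for a 4 kB update in a 256 kB erase block
    echo $(( 512 / 4 ))   # 128x for a 4 kB update in a 512 kB (MLC) erase block
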
[21:52] <cmccabe> so anyway, that's why Valerie Aurora believes that SSD firmwares are equivalent to log-structured filesystems
[21:52] <wido> aligning your filesystem against that, with a RAID controller and LVM in the mix is a hard job
[21:52] <cmccabe> because they must accumulate a bunch of changes before applying them
[21:52] <cmccabe> because applying a single change is so very expensive
[21:52] <wido> cmccabe: Yes, great article!
[21:52] <cmccabe> SSDs always ship with more storage than they ever tell you about.
[21:53] <ijuz__> and you have to write(program) the pages in each erase blocks in order too
[21:53] <cmccabe> The reason is because if utilization was really 100%, the garbage collector would be stuck. It could not make progress
[21:53] <cmccabe> I'm oversimplifying a little bit here... there is also DRAM on these devices
[21:54] <cmccabe> the TRIM command was added to SATA because, prior to its addition, there was no way for the host (Linux/Windows/whatever) to tell the drive that a chunk of data was no longer needed
[21:54] <wido> Yep, there are even SSDs with a capacitor on board which acts as a buffer when the power fails, so they can commit their data
[21:55] <wido> Yes, but it now also seems useable for RBD, although I read some stories that using TRIM could destroy your performance
[21:55] <wido> not sure if the SSD or the kernel was to blame
[21:55] <cmccabe> wido: I vaguely remember some complaints that TRIM was implemented by some firmwares in a way that made it slow
[21:56] <cmccabe> wido: and of course, they provide no information about whether TRIM is slow to the OS
[21:56] <gregaf> yeah, I think it was largely a firmware issue
[21:57] <wido> k
[21:57] <cmccabe> wido: so the OS has to guess whether it would be a good idea to use this command or not. Probably it will be solved by keeping a table of firmware version strings? Yuck
[21:57] <ijuz__> the problem with TRIM is of course also that when the kernel sends it to the device that it will not match erase blocks, so it also has to store that information somewhere for garbage collection
[21:57] <gregaf> I'm not sure how well-used TRIM actually is in the kernel, though
[21:57] <gregaf> like i think only ext4 will actually send the command?
[21:57] <wido> btrfs too I thought?
[21:57] <wido> you have to mount ext4 with 'discard' though
[21:58] <cmccabe> gregaf: the filesystem is the one who knows which blocks are meaningful, so yes it would have to go there
[21:58] <gregaf> ah, Sage says it's implemented in xfs and btrfs too but you also need the discard mount option on those
[21:58] <wido> btrfs has a 'ssd' mount option
[21:58] <gregaf> and because the performance was so abysmal their implementations are pretty lousy
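
For reference, 'discard' is an ordinary mount option; a minimal sketch (device and mount point are placeholders):

    # ext4 (and, per the discussion above, xfs and btrfs) only issue TRIM
    # when mounted with the discard option:
    mount -o discard /dev/sdX1 /mnt/ssd
    # or persistently via /etc/fstab:
    # /dev/sdX1  /mnt/ssd  ext4  defaults,discard  0  2
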
[21:59] <wido> gregaf: about http://tracker.newdream.net/issues/644, I'm doing a rsync again on my noisy machine. What I'm noticing right now is that when I start the rsync it goes fine for about 10 minutes, starts syncing fine
[21:59] <gregaf> or at least, they don't use it often because they don't want the fs to grind to a halt
[21:59] <wido> but then the load of the machine starts to go up and up, getting "timed out on osdX, will reset osd"
[22:00] <gregaf> wido: hmmm
[22:00] <wido> this machine has 4 OSD's (repl at 3), mon and MDS on the same host, OSD's all have their own disks (those Green powers)
[22:00] <gregaf> I haven't done much investigation yet, just ran it under a few scenarios to check relative performance
[22:00] <wido> But my question, did you see that too?
[22:00] <wido> Ah, ok
[22:00] <gregaf> we got a report that rsync was fine on large files but slow on small ones
[22:00] <gregaf> which question?
[22:00] <wido> The load going up and up
[22:01] <wido> "load average: 9.97, 8.53, 8.00"
[22:01] <gregaf> you're running on a recent rc branch right now?
[22:01] <wido> yes, latest RC
[22:02] <gregaf> hmm, not sure what would cause that, but it's definitely something we'll want to look at
[22:03] <wido> In the past a rsync with small files always was a killer
[22:03] <gregaf> wido: putting all your OSD journals on a single drive's not bottlenecking writes, is it?
[22:04] <wido> gregaf: It might be; it's an SSD, but that could also be a problem
[22:04] <gregaf> I dunno what its performance characteristics are, but that might be why load is creeping up
[22:04] <wido> iostat is not showing a high utilization on the SSD
[22:04] <cmccabe> gregaf,wido: for purposes of identifying the bottleneck, you could try putting the journal on a tmpfs
[22:05] <wido> cmccabe: Yeah, I'll try that tomorrow
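
A minimal sketch of that tmpfs-journal experiment, assuming the ceph.conf options 'osd journal' and 'osd journal size' from this era; the mount point and sizes are made up for illustration:

    # back the journals with RAM to take the SSD out of the picture
    mount -t tmpfs -o size=4G tmpfs /srv/ceph-journal-test

    # then in ceph.conf:
    #   [osd]
    #       osd journal = /srv/ceph-journal-test/journal-$id
    #       osd journal size = 512        ; in MB
    # restart the osds and rerun the rsync to compare the load
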
[22:16] <sagewk> wido on 563: i bet that warning is something with the ubuntu kernel that's diverged from mainline. annoying
[22:17] <sagewk> the important stuff is what's in the master branch. i'd either wait until the next -rc is cut (with that included) and ubuntu updates their kernel, or switch to a mainline kernel
[22:19] <wido> sagewk: From what I heard, Ubuntu's kernel ppa shouldn't be patched at all
[22:19] <wido> But i'll build from the btrfs master branch, just a complete kernel
[22:19] <wido> So we can rule it out
[22:20] <sagewk> weird. maybe try your .config on a vanilla kernel and see if that warning comes up?
[22:20] <sagewk> yeah
[22:20] <sagewk> if it's a real build error they should know
[22:21] <sagewk> fwiw hch was complaining about an unfixed build error in #linuxfs yesterday.. chris may not have pushed it into his kernel.org tree yet. might be the one you're seeing.
[22:23] <wido> we'll see, I'll start the build in a minute and see what it does
[22:32] <wido> sagewk: The master branch of btrfs still seems to be 2.6.36? Am I using the wrong git repo?
[22:32] <wido> https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories
[22:33] <wido> took btrfs-unstable from that page
[22:33] <sagewk> it's based on something old. i just merge it with the latest master from linus' tree
[22:33] <sagewk> or not, shouldn't matter. the important part is just to get the latest btrfs bits
[22:34] <wido> Yes, but the for-linus branch is 6 weeks old
[22:35] <wido> master is 42 hours, but won't compile against 2.6.37
[22:36] <sagewk> i meant linus' tree, not chris's for-linus branch (that's old)
[22:36] <sagewk> you mean it doesn't compile when you merge linus's master with btrfs's master?
[22:37] <wido> Oh, you lost me here :) You mean just grab the git from the kernel and merge it with btrfs's master?
[22:38] <sagewk> i mean something like git remote add linus git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git ; git merge linus/master
[22:39] <sagewk> oh..
[22:39] <sagewk> looks like linus merged chris's stuff yesterday
[22:39] <sagewk> so just taking linus's latest kernel should work
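
A sketch of what that amounts to: clone Linus's current tree (which now carries the btrfs fixes), reuse the running kernel's config, and build; the config path and job count are assumptions:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    cd linux-2.6
    cp /boot/config-$(uname -r) .config      # start from the distro config
    make oldconfig                           # take defaults for new options
    make -j4 && sudo make modules_install install
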
[22:39] <wido> ah, great :)
[22:42] <wido> I'll leave it cloning and building for the night, going afk
[22:42] <wido> tnx again, ttyl!
[22:43] <sagewk> np ttyl
[22:58] * allsystemsarego_ (~allsystem@188.25.128.213) Quit (Quit: Leaving)
[23:24] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.