#ceph IRC Log

IRC Log for 2011-01-31

Timestamps are in GMT/BST.

[0:04] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) Quit (Read error: Connection reset by peer)
[0:07] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) has joined #ceph
[0:10] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[0:11] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[1:30] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[1:43] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[1:52] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[2:27] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[3:06] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[3:23] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[3:34] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Ping timeout: 480 seconds)
[4:35] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[6:00] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[6:01] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) Quit (Quit: Leaving)
[6:18] <Anticimex> hehe, i think i ran into http://tracker.newdream.net/issues/325 just now. setup my first ceph two hours ago
[6:18] <Anticimex> single-instance mon,mds, 13x300G OSDs with btrfs, on one host
[6:19] <Anticimex> mounted it from 150 ms rtt away (cross-atlantic) and did a 44MB write from afar (comcast-300KB/s uplink). first of all, no traffic went out until i typed 'sync', and then it started writing
[6:20] <Anticimex> is our redundancy-copy being written from the kclient as well, or replicated within the cluster?
[6:21] <Anticimex> while this writing is ongoing i did a ls -al on the ceph-host on the same root-ceph dir that i wrote to, and it seems to be blocked by the write indeed
[6:21] <Anticimex> (not an ideal ceph-client by any means, but it could give light to bugs not noticed in a hispeed lan :) )
[6:33] <Anticimex> 400 MB sent now, and the file was 44 MB. almost makes you wonder if you're not sending one copy per OSD
[6:51] * f4m8_ is now known as f4m8
[7:02] <Anticimex> 800 MB now written, 13 OSDs, 44 MB file that is being written
[8:12] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:03] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) has joined #ceph
[9:06] * allsystemsarego (~allsystem@188.27.165.195) has joined #ceph
[9:18] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[10:09] * Yoric (~David@213.144.210.93) has joined #ceph
[10:23] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[11:54] * Yoric_ (~David@213.144.210.93) has joined #ceph
[11:54] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[11:54] * Yoric_ is now known as Yoric
[11:56] * Yoric_ (~David@213.144.210.93) has joined #ceph
[11:56] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[11:56] * Yoric_ is now known as Yoric
[12:04] <Anticimex> 5.8 GB written now...
[12:04] <Anticimex> for a 44 MB file being synced. :)
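
For scale, the arithmetic behind Anticimex's earlier suspicion (file size and OSD count as reported above):

    13 OSDs x 44 MB  =  572 MB    (one full copy per OSD)
    5.8 GB / 44 MB   ~  132 copies of the 44 MB file
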
[12:05] <monrad-51468> total redundancy?
[12:06] <monrad-51468> can we get a storage system that also writes data to stone tablets
[12:23] <DeHackEd> maybe, but like a DVD-R you only get to write to it once
[14:45] * gregorg_taf (~Greg@LPuteaux-156-15-57-183.w82-127.abo.wanadoo.fr) has joined #ceph
[14:52] * gregorg (~Greg@78.155.152.6) Quit (Ping timeout: 480 seconds)
[14:55] * gregorg_taf (~Greg@LPuteaux-156-15-57-183.w82-127.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[14:56] * gregorg (~Greg@LPuteaux-156-15-57-183.w82-127.abo.wanadoo.fr) has joined #ceph
[15:22] * Meths_ (rift@91.106.175.10) has joined #ceph
[15:28] * Meths (rift@91.106.141.211) Quit (Ping timeout: 480 seconds)
[15:48] * Meths_ is now known as Meths
[16:00] * Meths_ (rift@91.106.220.1) has joined #ceph
[16:07] * Meths (rift@91.106.175.10) Quit (Ping timeout: 480 seconds)
[16:10] * Meths_ is now known as Meths
[16:19] <jantje__> hmm, sagewk , what was that option again to disable logging to /var/log/ceph/stat ? logger = 0 ? in [global] ?
[16:24] <jantje__> yehudasa: I think the ino32 option works, but I'm seeing other issues, have to dig into them another time (not sure if it's ceph related)
[16:34] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[17:13] <wido> jantje__: hi!
[17:43] <wido> On my noisy machine I have PG's which go inconsistent, I assume the auto-scrub detects those. Right now I'm repairing them by hand. This morning I repaired 4 PG's, then this afternoon another 42 were inconsistent
[17:50] <wido> Machine (with one VM running) was idle at that point
[17:50] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[17:50] * bchrisman1 (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:54] <wido> jantje__: check config.cc btw, you can find all the options in there
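
A minimal ceph.conf sketch of what jantje__ is after, assuming the logger option he guesses at earlier (16:19) is the right knob in this Ceph version; as wido says, config.cc is the authoritative list of options:

    [global]
        ; assumed option name, taken from jantje__'s own guess;
        ; meant to stop the stats logging under /var/log/ceph/stat
        logger = 0
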
[17:55] * greglap (~Adium@166.205.138.121) has joined #ceph
[18:09] <greglap> wido: you're regularly detecting inconsistent PGs?
[18:10] <greglap> Anticimex: I'm not sure I understand -- are you saying you wrote one 44MB file and managed to get 5.8GB of traffic? Or 5.8GB of data use reported on the cluster?
[18:11] <greglap> the reason you didn't get any traffic until you synced the file is probably because you didn't run out of your local buffer — I think it's 100MB by default
[18:13] <greglap> we had a report of issues with ls during writes in early December that I was unable to recreate on current branches, so although it may be slow (it needs the data to flush out so it can return accurate size counts) it shouldn't be completely blocking
[18:22] <wido> greglap: yes, I'm seeing them regularly indeed
[18:23] <wido> The machine had been idle the whole weekend, just one VM which was running
[18:23] <wido> So I saw 4 this morning, repaired them, and a few hours later 42 were inconsistent
[18:23] <greglap> hmm
[18:24] <greglap> do you have logs for that time period?
[18:24] <greglap> it would be good to see when they're going inconsistent and if we can identify the root cause
[18:25] <wido> greglap: debug osd was at 0, filestore the same
[18:25] <greglap> ah, bummer
[18:25] <wido> yes :(
[18:25] <wido> what loglevel would be sufficient to find this?
[18:25] <Anticimex> greglap: i setup that single-box ceph cluster with 13x300G disks in sweden
[18:26] <Anticimex> i mounted it with 2.6.37's kernel client from boston
[18:26] <wido> running with 20 is not doable for a longer period
[18:26] <greglap> wido: not sure, but probably pretty high
[18:26] <Anticimex> i copied a 44MB youtube-flv onto it
[18:26] <Anticimex> did a ls -al, saw it there, but no outgoing network traffic
[18:26] <Anticimex> did a sync, traffic starts going
[18:26] <greglap> wido: sjust might have a better idea, he's been doing a lot of bug-finding and -fixing lately
[18:27] <Anticimex> next morning, 5.8GB had gone out (kernel client even survived a change of its IP around 800 MB into it)
[18:27] <Anticimex> i tcpdumped etc and saw traffic going out to the various cosds from the client
[18:27] <Anticimex> i'm guessing i can reproduce it later (at work now)
[18:28] <greglap> Anticimex: when you say "did a ls -al, saw it there, but no outgoing network traffic", you mean you copied it in on your Boston computer and ran an ls on your Boston computer to check it was there?
[18:29] <Anticimex> hmm, no
[18:29] <Anticimex> copied from my boston laptop onto the ceph mount of the swedish ceph box
[18:29] <Anticimex> eg, anticimex@boston:~ mount swedishceph:/ /mnt
[18:29] <Anticimex> anticimex@boston:~ cp youtube.flv /mnt/
[18:29] <greglap> yeah, that's what I meant
[18:29] <Anticimex> anticimex@boston:~ ls -al /mnt/
[18:30] <greglap> so you didn't see any outgoing ceph traffic right away because it defaults to a local buffer of 100MB
[18:30] <Anticimex> (see the file here with full filesize, sitting in my vfs/buffers i guess)
[18:30] <greglap> if you'd left it it would have flushed out at one point
[18:30] <Anticimex> right, i understand that
[18:30] <Anticimex> i understand the sync too, no problem
[18:30] <greglap> k
[18:30] <Anticimex> the 5.8GB are weird though
[18:30] <greglap> yes
[18:30] <Anticimex> (and the blocking)
[18:31] <greglap> I would expect some traffic between the kernel client and the MDS/monitors
[18:31] <Anticimex> sync never terminated though
[18:32] <greglap> but unless you're reading/writing then the OSD connections should actually just time out...do you maybe have something in the background running checks on your fs?
[18:32] <greglap> sync never terminated?
[18:32] <Anticimex> nope
[18:32] <Anticimex> 5.8GB out and sync was still there, waiting
[18:32] <greglap> oh, that's not good
[18:32] <greglap> do you have any logging enabled?
[18:33] <Anticimex> i didnt record all the data, but by glossing over the packets flying by in tcpdump it did seem like most of the traffic was going to the cosds
[18:33] <Anticimex> hmm i think so, i just used some standard thing
[18:33] <Anticimex> tell me what to do and i can check tonight (in 6-7h)
[18:34] <Anticimex> ah, look at that
[18:34] <greglap> well if you can zip up your logs and put them somewhere we can look at them that'd be good
[18:34] <wido> greglap: Ok, i'll see if sjust has some ideas :) When he gets online
[18:34] <Anticimex> the read on the local mount on the ceph-node-itself, using kclient, finally terminated at least :)
[18:34] <Anticimex> and ceph reports the file to be there
[18:34] <greglap> heh
[18:34] <Anticimex> not sure if it is all of it though
[18:34] <Anticimex> i could copy it over here and verify
[18:35] <Anticimex> i mean, ceph says the file is the correct size, but i dont know about the actual data content
[18:35] <greglap> yeah
[18:35] <Anticimex> debug ms = 1
[18:35] <Anticimex> on mon, that's what i have i guess
[18:35] <greglap> is that on all of your nodes, or just the mon?
[18:35] <Anticimex> only under [mon]
[18:35] <greglap> oh :(
[18:36] <Anticimex> somehow i feel confident i can reproduce tonight though :)
[18:36] <Anticimex> i did a very simple and plain operation
[18:36] <greglap> yeah
[18:36] <greglap> we'll at least want debug ms=1 on all the nodes, that'll let us see what the messages are and figure out what else needs to be looked at more closely
[18:36] <Anticimex> ok
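
A sketch of the change greglap is asking for: the same debug ms = 1 that Anticimex already has under [mon], but placed under [global] so every daemon (mon, mds, osd) logs its message traffic:

    [global]
        ; message-level debugging on all nodes, not just the monitor
        debug ms = 1
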
[18:37] <Anticimex> i dont think many people intend to use kclient/ceph like this
[18:37] <Anticimex> but i guess it doesnt hurt to find bugs
[18:37] <greglap> yeah, the slow links aren't great and it may just be a degenerate case in terms of file caps, between the two separate mounts
[18:37] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:38] <greglap> but it'd be good to check and see if we can streamline it a little, I just fixed one caps issue on Friday
[18:38] <greglap> gotta go now, be back in ~20
[18:38] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:38] * greglap (~Adium@166.205.138.121) Quit (Quit: Leaving.)
[18:42] <Anticimex> heh, approx 10% of the file was written
[18:42] <Anticimex> 5.8G for 44MB is kinda rough ;)
[18:43] <Anticimex> hm, no, checking the file with less -f i see nonzero data all over, but playback of the video fails and hash sums say they are different
[18:48] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:00] <gregaf> back
[19:00] <gregaf> Anticimex: yeesh
[19:11] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[19:19] * cmccabe1 (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has joined #ceph
[19:26] <sjust> wido: what was running on the cluster? just one vm with rbd?
[19:58] <wido> sjust: yes, one VM with Qemu-RBD
[19:58] <wido> right now all the 8224 pg's are clean
[19:58] <sjust> repairing them worked?
[19:59] <wido> yes, that worked out well. None of the OSD's crashed recently bt
[19:59] <wido> btw*
[20:00] <sjust> anything odd in dmesg from the osds?
[20:00] <wido> No, nothing at all
[20:00] * Yoric (~David@88.189.211.192) has joined #ceph
[20:00] <wido> dmesg is clean (cluster is running on one machine right now)
[20:01] <wido> Btw, any idea why #563 was closed? Is there a fix out towards btrfs?
[20:02] <sjust> I guess the btrfs fses to which the osds were writing weren't full?
[20:03] <wido> No, they are at 31% atm
[20:06] <sjust> wido: it looks like we'll need more logs
[20:06] <wido> sjust: I thought so. But logging is a bit low, I'll have to fix that
[20:06] <wido> What level would you need? osd, filestore 20?
[20:06] <sjust> wido: that would be good, but 10 would at least help if 20 produces too much output
[20:07] <wido> Yes, 20 produces a lot. Do you need 10 for osd, filestore or both?
[20:07] <sjust> probably both
[20:07] <wido> ok
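
And the corresponding ceph.conf sketch for wido's scrub problem, using the levels sjust suggests (20 if the log volume is bearable, otherwise 10):

    [osd]
        debug osd = 10
        debug filestore = 10
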
[20:26] <wido> sjust: I looked around, but there is no auto repair yet? Is there? I remember a discussion about this somewhere
[20:27] <sjust> wido: there is at present no auto-repair
[20:28] <cmccabe1> wido: that's something we're going to have in a bit
[20:28] <cmccabe1> wido: I assume you mean fsck
[20:29] <wido> cmccabe1: no, repairing inconsistent pg's
[20:29] <bchrisman> what's the meaning of auto-repair in this context?
[20:29] <sjust> bchrisman: you can repair a pg marked inconsistent by scrub
[20:29] <wido> right now I used "ceph pg dump -o -|grep inconsistent|awk '{print $1}'|xargs -n 1 ceph pg repair"
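
The same one-liner wido pastes above, just spaced out: dump all PGs, keep the ones marked inconsistent, take the PG id from the first column, and hand each id to ceph pg repair:

    ceph pg dump -o - \
        | grep inconsistent \
        | awk '{print $1}' \
        | xargs -n 1 ceph pg repair
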
[20:29] <gregaf> wido: your new bug 572 has the same iostat pasted twice :)
[20:29] <wido> No gregaf, the first one is before the dd, the second is after the dd
[20:30] <wido> see the diff in kB_wrtn
[20:30] <sjust> wido: the fix for your problem should really be to work out why the pg's are going inconsistent in the first place
[20:30] <gregaf> wido: well they're identical
[20:30] <sjust> :)
[20:30] <gregaf> I assume they should be showing something different?
[20:30] <gregaf> oh, wait, no they aren't
[20:30] <gregaf> never mind!
[20:30] <wido> sjust: yes, of course. But a PG could always go bad
[20:30] <wido> or after a while, for whatever reason, data gets deleted on the OSD
[20:31] <wido> sysadmin screw up, btrfs bug, etc, etc
[20:31] <wido> gregaf: np :)
[20:31] <sjust> wido: yeah, but in the case that it is a problem with the underlying fs, repairing it might make it worse
[20:31] <wido> yes, especially with replication at 2, since there is no checksum of an object
[20:32] <wido> so you don't know which one is bad, do you?
[20:32] <cmccabe1> wido: well, btrfs does have checksums if I remember correctly
[20:32] <sjust> cmccabe1: each btrfs osd store might have an internally consistent object, but nonetheless disagree with each other
[20:32] <cmccabe1> wido: also, ext4 has journal checksumming
[20:33] <wido> Yes, they have. But if the object is corrupt on the OSD, and by corrupt I mean it's not the data we want
[20:33] <wido> so the PG goes bad, you have replication at two, you can't auto-repair, you can't be 100% sure which object is valid
[20:34] <sjust> wido: thats my thinking
[20:34] <cmccabe1> sjust, wido: I'm not sure we would get into that state
[20:34] <wido> So you at least need replication at three for a decent auto-repair?
[20:34] <cmccabe1> sjust, wido: assuming that fs-level checksumming was working correctly and there were no ceph bugs
[20:35] <gregaf> cmccabe1: it's large-scale storage, you can have random bit flips
[20:35] <sjust> cmccabe1: in that case, I don't think that scrub can find errors
[20:35] <cmccabe1> gregaf: random bit flips would cause the btrfs checksum to fail, leading to EIO
[20:35] <gregaf> wido: at the moment the auto-repair doesn't even check for stuff like that, it just copies off the primary (or to the primary from whoever's next in line)
[20:36] <wido> aha, ok :)
[20:36] <cmccabe1> gregaf: also the drive has FEC (forward error correction) which prevents a lot of stuff (although obviously not all)
[20:36] <gregaf> eventually it will be better, and maybe we could implement it so 3 copies are more robust from that
[20:36] <wido> My next question indeed was, can I tell which OSD is the primary?
[20:36] <gregaf> but my guess is that we'd implement proper checksumming first, since a lot of the hooks are already there
[20:36] <sjust> osd pg dump lists the active set
[20:36] <sjust> the first one in the set is the primary
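
A hedged sketch of what sjust describes: grep the pg dump output for the PG in question and read off the set of OSDs listed for it; the first one is the primary (the PG id below is illustrative, and column layout can differ between versions, so grep rather than hard-coding fields):

    # show the dump line for one PG, e.g. PG 0.1a (id is illustrative)
    ceph pg dump -o - | grep '^0\.1a'
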
[20:36] <wido> But with a repair I can't specify a primary / source
[20:37] <wido> which I think is valid
[20:38] <wido> But I get the point of the auto-repair, a lot of hooks there. Finding the cause right now is more important
[20:38] <sjust> yeah
[20:39] <cmccabe1> I guess I can think of one case where scrub might find errors
[20:40] <cmccabe1> if the node booted and the object store fs was corrupt, and it did a fsck
[20:40] <cmccabe1> and the fsck "fixed" some things
[20:40] <cmccabe1> however, I think any reasonable sysadmin would just wipe that object store FS and let ceph re-replicate
[20:41] <wido> cmccabe1: My node did not reboot. But I did update Ceph. In the dmesg I saw some "unlinked orphans" messages
[20:41] <wido> about three I think
[20:42] <cmccabe1> wido: that comes out of btrfs and it's harmless
[20:44] <wido> ok
[21:10] <wido> Does anybody now why #563 (btrfs warnings) was closed?
[21:10] <wido> know*
[21:11] <cmccabe1> wido: I think those guys are at lunch
[21:11] <cmccabe1> wido: I don't know why 563 was closed, but I suspect that sage thinks it should be filed with the btrfs bug tracker rather than ours?
[21:12] <wido> k
[21:40] <jantje__> wido: thanks for the hint, i'll look into it tomorrow
[22:14] <cmccabe1> well, I posted some thoughts about scrub to the mailing list
[22:15] <cmccabe1> I'm curious what you guys think about it
[22:36] <Anticimex> is btrfs the best way to go for OSDs nowadays?
[22:47] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[22:49] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[22:52] <cmccabe1> anticimex: yeah, I think btrfs is the best way to go
[22:52] <cmccabe1> anticimex: the only caveat is that btrfs is more unstable than ext3
[22:53] <cmccabe1> anticimex: obviously ext3 is pretty much production, and hasn't really changed in years, whereas btrfs still has some rough edges
[23:00] <darkfader> can we just have vxfs? it's been used in production and has all btrfs features today
[23:00] <darkfader> now flame me.
[23:03] <cmccabe1> darkfader: er, I haven't heard vxfs mentioned in years
[23:03] <cmccabe1> darkfader: I think the last time I saw it was in some OS textbook
[23:03] <darkfader> hehe
[23:03] <cmccabe1> darkfader: wasn't it proprietary or something?
[23:03] <darkfader> yes
[23:03] <darkfader> it's made by some guys who develop the stuff we copy.
[23:03] <darkfader> :)
[23:04] * Yoric (~David@88.189.211.192) Quit (Quit: Yoric)
[23:04] <cmccabe1> darkfader: it is kind of funny how long Linux went without having really enterprisey file system features
[23:04] <darkfader> i had 60 vxfs linux boxes in last job, it's quite nice
[23:04] <cmccabe1> darkfader: I mean, ext2 was developed after vxfs, wasn't it?
[23:04] <darkfader> had 2-3 ext3 based data losses in the same time
[23:05] <darkfader> cmccabe1: dunno about that, would both be ages ago for sure
[23:05] <cmccabe1> darkfader: when you say "ext3 based data losses" you mean you hit a bug in ext3?
[23:06] <darkfader> yeah stuff like mounted filesystems, that were broken and didn't fall back to ro and such
[23:06] <cmccabe1> darkfader: ah, that's too bad
[23:06] <darkfader> or just bad error recovery upon a crash
[23:06] <darkfader> it was a messy environment though
[23:06] <cmccabe1> darkfader: error recovery is kind of a complicated topic in filesystems
[23:07] <darkfader> there's a freeware edition of vxfs, look it up when you feel bored enough
[23:07] <cmccabe1> darkfader: I think just about any filesystem will roll over and die if the hardware does certain things
[23:07] <darkfader> not saying it has holy shines and the vendor is symantec now, so it really sucks
[23:07] <darkfader> cmccabe1: mmmh
[23:08] <darkfader> there's probabilities though
[23:08] <cmccabe1> darkfader: more complex filesystems have advantages, but sometimes they're even more vulnerable to hardware problems
[23:08] <darkfader> i remember softupdates in freebsd. they made error handling much better - unless something went wrong, then it turned out much worse :)
[23:08] <cmccabe1> darkfader: I think the reality is, if your data is only on a single disk, it's only a matter of time before it's gone.
[23:09] <darkfader> hehe
[23:09] <darkfader> lemme say it was an insurance corp, so you have measures and really never, ever, lose data.
[23:10] <darkfader> but 2010's ext3 was about the same stability as 1998's vxfs
[23:10] <darkfader> and worse integrity checking
[23:10] <darkfader> thats why i'd also rather have new btrfs than ext3
[23:11] <cmccabe1> darkfader: I don't see data checksumming in vxfs' list of features
[23:11] <darkfader> true that
[23:11] <cmccabe1> darkfader: you mean integrity checking as in error handling in general
[23:11] <cmccabe1> ?
[23:11] <darkfader> no i mean fs structure
[23:11] <darkfader> like if you make a 2GB hole into a filesystem
[23:12] <darkfader> with ext3 that will live through an fsck if it is considered unallocated
[23:12] <darkfader> (ok, you need offshore admins to create a test case)
[23:12] <darkfader> oh and there is integrity checking like crc or something
[23:13] <darkfader> it has some strange allocation group mechanism and those seem to have crcs
[23:13] <darkfader> but i'm not that deep into it
[23:13] <cmccabe1> darkfader: one of the biggest new features in ext4 was journal checksumming
[23:13] <darkfader> lol
[23:14] <darkfader> well thats good to have for sure
[23:15] <darkfader> i dont want to say anything bad about ext4 since i only use it on one laptop so far
[23:16] <darkfader> but the level of complexity for checksummed journal doesn't seem high
[23:16] <cmccabe1> darkfader: there was a long thread on lkml called "raid is dangerous but that's secret" where Pavel Machek pointed out that ext3's lack of journal checksumming could really cause serious problems on raid
[23:16] <darkfader> yes but thats all OLD issues in the real unix world
[23:17] <darkfader> people know about bad block relocation issues. every real disk array has ecc on the blocks, etc etc
[23:17] <darkfader> either it's catching up on massive gaps, or (thats what i'm suggesting) it's looking at the wrong end
[23:18] <cmccabe1> darkfader: well, the problem with ext3 and RAID is that if ext3's journal gets corrupt, ext3 will "think" that everything is ok and never fsck, but that is false
[23:18] <darkfader> yes
[23:18] <cmccabe1> darkfader: therefore, the probability of double failures goes up a lot
[23:18] <darkfader> thats why there was a veritas option fsck -o full,nolog
[23:18] <cmccabe1> well, there is an ext3 option to do that too.
[23:18] <cmccabe1> but that kind of defeats the point of journalling, in both cases
[23:19] <darkfader> problem is: will it help to add journal crc and stuff if your fs doesn't notice journal issues, doesn't notice corruption there, etc etc
[23:19] <darkfader> and it's far from tackling complex issues
[23:19] <cmccabe1> it seems like adding a CRC and then not checking it would be silly, I hope you're not suggesting that's what ext4 does
[23:19] <darkfader> i.e. why can't we shrink filesystems in the oss world
[23:20] <darkfader> no not suggesting that
[23:20] <cmccabe1> I don't know how well-tested the general error handling pathways are in ext3/4
[23:20] <cmccabe1> I do think they're better than btrfs' at the moment
[23:20] <darkfader> tunability rocks for such features - sync log writes or not sync, etc
[23:20] <darkfader> yeah probably better now
[23:20] <cmccabe1> btrfs has a problem where a lot of problems lead to a BUG_ON rather than an error code going to userspace
[23:21] <cmccabe1> that is something they're hoping to fix
[23:21] <darkfader> but it's 0.19 btrfs versus 10 years ironing on ext3 which still sucks after that
[23:21] <darkfader> my bets are on btrfs
[23:21] <cmccabe1> haha. You really don't like ext
[23:21] <darkfader> i don't like anything that wastes 100s of hours of my sleep and peace of mind
[23:21] <darkfader> (managers? *hehe*)
[23:22] <cmccabe1> when you say "shrink filesystems" -- you mean dedupe?
[23:22] <cmccabe1> or defrag
[23:22] <darkfader> reduce fs size
[23:22] <darkfader> or thin reclaim via storage library apis
[23:22] <darkfader> that is complex stuff
[23:23] <darkfader> crc in a journal (which still isn't safe unless data=ordered is on) is like ... pat pat well done
[23:23] <cmccabe1> I'm pretty sure ext4 has online resize, not sure about ext3
[23:23] <darkfader> ext3 definitely not
[23:23] <darkfader> nice if it's in ext4
[23:23] <darkfader> so far only my laptop has ext4 which really doesn't see much changes :)
[23:23] <cmccabe1> looks like ext3 got online resize "starting in kernel 2.6.10"
[23:24] <darkfader> yeah but only increases
[23:24] <darkfader> which tended to panic the boxes at random times, or get stuck, from rhel5.0 to rhel5.2ish
[23:24] <cmccabe1> ah, according to random comment threads, ext3 supported online upsize, ext4 supports both shrink/upsize
[23:24] <darkfader> loved that very much, too
[23:24] <darkfader> ahh :)
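
A sketch of the resize2fs behaviour being discussed (device and mount point are illustrative); as a general e2fsprogs fact, growing can be done while mounted but shrinking still requires the filesystem to be offline and checked first:

    # grow an ext3/ext4 filesystem online, filling the underlying device
    resize2fs /dev/sdb1

    # shrinking needs the filesystem unmounted and fsck'd first
    umount /mnt/data
    e2fsck -f /dev/sdb1
    resize2fs /dev/sdb1 50G
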
[23:25] <cmccabe1> I'm not sure what thin reclaim is, are you thinking of something like Tivoli?
[23:25] <cmccabe1> the stuff Tivoli does really doesn't belong in the filesystem proper IMO
[23:25] <darkfader> no, i'll try to put it better
[23:25] <darkfader> you do know ATA_TRIM in ssd's? it's kinda similar to that
[23:26] <darkfader> storage arrays can give you "thin" luns which are like sparse files, so they'll be appended if you really happen to write to them
[23:26] <darkfader> the tricky part is letting the array know when you deleted something in this filesystem
[23:26] <cmccabe1> I'm pretty sure TRIM is now supported in ext3/ext4
[23:27] <darkfader> TRIM == very easy version of issue :)
[23:27] <cmccabe1> the big win was SSDs and virtual machine images
[23:27] <darkfader> yeah, ssds really need it
[23:27] <gregaf> less so as garbage collection has improved, luckily
[23:28] <cmccabe1> the problem TRIM had (has?) is that several manufacturers implemented it slooooooowly
[23:28] <gregaf> I don't know that linux actually uses the trim command anywhere yet, because it's so slow
[23:28] <darkfader> with real disk arrays the problem comes up with detecting the scsi features that expose the reclaiming and talking back to the array from the fs
[23:28] <cmccabe1> so that using it would actually create performance problems. And of course there's no API to tell the kernel "hey, TRIM sucks on this drive"
[23:28] <darkfader> really? i read something like that on one thread a few days ago and thought it wasnt true
[23:29] <cmccabe1> whenever you hear the word "firmware"-- be prepared for the worst
[23:29] <darkfader> so can one trigger it from smartctl or something?
[23:29] <darkfader> hehehe nicely put
[23:29] <cmccabe1> in fairness though, SSD firmwares have been improving
[23:30] <gregaf> darkfader: there are one or two articles at LWN about trim issues
[23:30] <darkfader> i have bought a bunch of cheap OCZ vertex. they're doing ok but are also said to lose 70% of performance when they're out of free space / preerased blocks
[23:30] <darkfader> gregaf: i'll read up. so far i just knew the long post about the alignment stuff
[23:31] <cmccabe1> darkfader: I vaguely remember reading the using trim was a mount option; I'm having trouble finding the info online right now though
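
cmccabe1 may be thinking of ext4's discard mount option (a real flag; whether it helps depends on how well the drive firmware handles TRIM, as discussed above). Device path and mount point here are illustrative:

    # mount with online TRIM enabled
    mount -o discard /dev/sda2 /mnt/ssd

    # or persistently in /etc/fstab
    /dev/sda2  /mnt/ssd  ext4  defaults,discard  0  2
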
[23:31] <gregaf> anyway, modern controllers don't need it so much — OS X still hasn't implemented TRIM but that isn't stopping Apple from selling all-SSD machines, and it doesn't seem to hurt anything :)
[23:31] <darkfader> gregaf: i had an ssd in my macbook
[23:31] <darkfader> i felt it got slow at some point
[23:32] <gregaf> well, it depends on the SSD
[23:32] <darkfader> then i put in a "hybrid" seagate momentus and (there's the point) HELL THAT THING IS SLOWER
[23:32] <gregaf> I had a rebranded Intel one that was fine until it suddenly died and I discovered I had no warranty on it :(
[23:32] <cmccabe1> gregaf: I think the problem is that if you don't have TRIM, your SSD slowly fills up with data that in fact is useless, but the SSD firmware doesn't know that
[23:32] <gregaf> but my new Macbook Air doesn't seem to be having any problems, although maybe I just haven't run it out of free space
[23:32] <cmccabe1> gregaf: then allocations (basically malloc on the SSD) become slower and slower, since there's less free space
[23:33] <gregaf> oh yeah, hybrid drives suck
[23:33] <cmccabe1> gregaf: I don't see how the newness/oldness of the drive would affect that fundamental problem
[23:33] <darkfader> gregaf: my issue was i misconfigured the osx auditd, it wrote out an audit log entry and an error in syslog almost every second
[23:33] <gregaf> heh
[23:33] <cmccabe1> gregaf: one thing to keep in mind, though, is that consumers rarely fill their whole drives. Apple probably realizes that most people don't fill their drive
[23:34] <gregaf> cmccabe1: I dunno how they do it, but modern drives only tend to lose like 10% of performance once they get full, compared to older ones that would lose 50%+
[23:34] <cmccabe1> gregaf: also if an older machine gets slow, that's a good way to sell people on a new machine, especially if those people aren't technically savvy to know why
[23:34] <darkfader> cmccabe1: it's really like that, the usage period and the amount of bytes written and the usage ratio all are bad
[23:34] <darkfader> cmccabe1: hehe you just discovered apples plan
[23:34] <darkfader> maybe in 2-3 years they'll even make a new macbook that has no sharp edges
[23:35] <cmccabe1> darkfader: I feel like I should post one of those stock clips of steve jobs with arched eyebrows now
[23:35] <cmccabe1> darkfader: too lazy to find it online though
[23:35] <darkfader> hehe its ok :)
[23:36] <cmccabe1> gregaf: over-provisioning the SSD can help maintain the performance when all blocks are full
[23:36] <cmccabe1> gregaf: also the firmware could get better at alloc'ing. I mean, who knows.
[23:36] <cmccabe1> gregaf: I guess it's fair to say that TRIM is nonessential, but probably a really good thing in the long term, even for newer drives.
[23:37] <darkfader> yup, see point about storage arrays
[23:37] <darkfader> that feature put to more use can save millions
[23:37] <gregaf> yeah, I think it's a combination of over-provisioning and getting better at the logical->physical translation (which lets you combine data onto more blocks)
[23:37] <cmccabe1> darkfader: yeah, if I understand correctly, that's why SAN is so big now
[23:37] <gregaf> but we'll want TRIM for rbd so I hope it lives long and prospers ;)
[23:38] <darkfader> awesome.
[23:38] <darkfader> :)
[23:38] <cmccabe1> :)
[23:38] <cmccabe1> I finally bought an SSD for my home computer
[23:38] <cmccabe1> I put the root filesystem on it because I think that gives me maximum advantage
[23:38] <darkfader> next weekend i'll rebuild the labby and try the xen rbd driver
[23:38] <darkfader> cmccabe1: and it flies?
[23:38] <cmccabe1> darkfader: yeah, it's good
[23:39] <cmccabe1> darkfader: mostly it just makes running GNOME not such a big irritation any more
[23:39] <darkfader> haha
[23:39] <darkfader> maybe i should switch back to fluxbox and test the difference
[23:40] <cmccabe1> I kind of regret saying this, but SSD is not as big a win for a Linux user as it is for a Windows one
[23:40] <cmccabe1> because Linux doesn't have as much ... general bloat seeking all over the hard disk
[23:40] <darkfader> true
[23:40] <cmccabe1> so my system was pretty fast even without the SSD
[23:40] <darkfader> win7 is the first thing to have a bigger disk footprint than irix
[23:40] <gregaf> I think *nix in general is better at caching the filesystem/drive data
[23:41] <gregaf> whereas about the only way on Windows to avoid ever waiting for the start menu to hit disk is to get an SSD
[23:41] <cmccabe1> I think I'll be buying spinning disks for a long time. I just like lots of storage, and I can not lie.
[23:42] <darkfader> cmccabe1: well ssd as boot disk and nice huge spinning stuff for the storage is probably really the best thing to do
[23:42] <darkfader> keeps data off the desktop, too :)
[23:42] <cmccabe1> darkfader: yeah, that's how I have it now
[23:43] <cmccabe1> darkfader: it's kind of amazing, and sad, that my mp3 player still uses a spinning disk
[23:44] <darkfader> hehe
[23:48] * yx (~yx@04ZAACC69.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[23:48] * yx (~yx@28IAACE0O.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:49] <darkfader> brain says zzzz stomach says dinner time. either of that should happen :) i'll be back tomorrow. happy coding!
[23:49] <cmccabe1> later

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.