[0:02] <jantje> hm, wait
[0:03] <jantje> stat64("/usr/local/gcc34/host/i686-wrs-vxworks/include", 0xffbacf5c) = -1 ENOENT (No such file or directory)
[0:03] <jantje> stat64(".", {st_dev=makedev(0, 0), st_ino=1099511668988, st_mode=S_IFDIR|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=65536, st_blocks=0, st_size=2, st_atime=2011/01/21-09:37:06, st_mtime=2011/01/21-08:36:40, st_ctime=2011/01/21-08:36:40}) = 0
[0:03] <jantje> write(2, "cc1: ", 5cc1: ) = 5
[0:04] <jantje> st_size is just fine
[0:04] <yehuda_hm> here it is fine
[0:06] <bchrisman> I'm blowing out my 1GB RAM + 1GB swap during a drive-fail/osd recovery. I've reduced 'osd recovery max active' to 1… from there, reduction in memory will probably only come by reducing the size of each pg? Is there a config parameter for that?
[0:06] <sagewk> oh.. i think cc doesn't like st_ino
[0:08] <jantje> i was just on my way to see how big that one should be
[0:11] <gregaf> bchrisman: you can't cap the number of objects in a PG, you have to increase PG count
[0:11] <gregaf> although that will increase permanent memory usage as well
[0:12] <gregaf> there may be more tuneables you can set to reduce recovery mem usage, but I'm not too familiar with it…I'll ask sagewk when he gets off the phone
[0:12] <jantje> sagewk: is that a mount option too? :-)
[0:12] <jantje> s/that/there
[0:13] <bchrisman> gregaf: iirc, previous failures were tested a bit by killing off cosd… pulling the drive didn't cause my cosd to die/exit/bail… should it have?
[0:14] <darkfader> bchrisman: i think that error will be eaten up by the btrfs/ext underneath ceph
[0:14] <gregaf> bchrisman: you pulled the drive out physically and the cosd kept running?
[0:14] <bchrisman> gregaf: ahh.. so I pulled a drive… memory usage went nuts.. I finally killed off the cosd that was servicing that drive.. and that settled everything back down.
[0:15] <bchrisman> so… reporting up the stack may be an issue.
[0:15] <gregaf> hmmmm
[0:15] <bchrisman> filesystem is responsive again.
[0:15] <gregaf> there definitely are error codes that we're not checking properly in that layer
[0:15] <darkfader> bchrisman: there is no reporting up the stack on linux. if i may say
[0:15] <bchrisman> well.. great to know.. and the filesystem didn't go down.. unmount.. or anything...
[0:15] <gregaf> we were discussing it just a few days ago, I think cmccabe is doing some work with that?
[0:15] <darkfader> it's not really a stack either, it's just a pile.
[0:15] <bchrisman> darkfader: heh
[0:16] <bchrisman> yeah.. okay.. but I see what's going on.. and it's not a huge problem… so.. net positive definitely.
[0:16] <bchrisman> worst case scenario.. hardware monitor daemon or something...
[0:16] <gregaf> it's something we ought to detect
[0:16] <bchrisman> but yeah.. maybe we can get btrfs to report it somehow when a device bails.
[0:16] <bchrisman> good deal… overall.. this is great.
[0:16] <gregaf> I don't know if we will for .25 but probably for the next version, I expect
[0:17] <jantje> yehuda_hm / sagewk : i'm leaving for vacation in a few hours, i just wanted to say all your work is greatly appreciated! Thanks!
[0:17] <darkfader> gregaf: there is a "make some errors" target for devmapper, might help you with testing
[0:18] <yehuda_hm> jantje: thanks
[0:18] <yehuda_hm> jantje: there's no mount option for that
[0:18] <gregaf> basically we turn all failure modes into "it's gone!", so I think when the on-disk store was written the author(s) weren't quite careful enough to make sure we actually caught all the errors, on the assumption that catastrophic errors would cause death quickly enough on their own
[0:19] <yehuda_hm> jantje: but we may add such an option to have 32 bit inos on 64bit
[0:19] <jantje> that would be great, because i'm stuck with 32bit environments :-)
[0:20] <yehuda_hm> yeah, the problem is that you're doing it with a 64 bit kernel
[0:20] <jantje> well, i was trying that out and I was hoping to improve my build performance
[0:21] <jantje> didn't test it yet
[0:22] <cmccabe> bchrisman, gregaf: yeah, we need to do some serious testing and probably fixups with handling objectstore errors
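The point gregaf and cmccabe make about unchecked error codes in the on-disk store can be illustrated with a minimal fail-fast sketch. This is not Ceph's FileStore code; `StoreFailure` and `checked_write` are hypothetical names invented for this illustration. It assumes the policy described above, where every failure from the backing store is treated as "it's gone" and reported to the caller rather than ignored.

```python
import errno
import os


class StoreFailure(Exception):
    """Hypothetical exception: the backing store reported an I/O error."""


def checked_write(fd: int, data: bytes) -> None:
    """Write all of data to fd and fsync, surfacing any OS error.

    Every OSError (EIO, EBADF, ENOSPC, ...) is converted into
    StoreFailure so the caller cannot silently continue.
    """
    try:
        written = 0
        while written < len(data):
            # os.write may write fewer bytes than requested; loop until done.
            written += os.write(fd, data[written:])
        os.fsync(fd)
    except OSError as e:
        name = errno.errorcode.get(e.errno, str(e.errno))
        raise StoreFailure(f"store I/O failed: {name}") from e
```

A daemon following this policy would catch `StoreFailure` at the top level and exit, which is the behavior bchrisman expected from cosd after the drive pull. It does not help with the slow-failure case cmccabe describes, where the kernel keeps retrying and no error is returned at all.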
[0:23] <cmccabe> bchrisman, gregaf: one problem that I've had in the past is that Linux tends to be very slow to fail a bad drive
[0:23] <sagewk> yehuda_hm: yeah we can make a mount option for that i guess.. ino32 or something
[0:24] <cmccabe> bchrisman, gregaf: in a redundant setup, you would ideally like for the system to just start returning EIO a few seconds after something fails. But instead, Linux's default behavior is to keep trying to contact the device until it absolutely, positively, certainly is dead
[0:24] <cmccabe> bchrisman, gregaf: that can be on the order of minutes (you can see in the syslog that the driver will just keep trying various different modes and options)
[0:24] <bchrisman> cmccabe: md's doing something though.. because when I pull a drive, it immediately reports a failure..
[0:24] <gregaf> so you think it's actually hanging in-kernel and not returning error codes to the filestore?
[0:25] <yehuda_hm> sagewk: looking into it now
[0:25] <cmccabe> bchrisman, gregaf: That depends on his particular setup, so I don't know
[0:25] <bchrisman> oh wait.. that's kernel code… invalid opcode reported immediately after drive pull.
[0:25] <cmccabe> bchrisman, gregaf: but in general, we need to be prepared for slow failures
[0:25] <Tv|work> cmccabe: that's also a question of consumer vs RAID disks
[0:25] <bchrisman> it's not hanging in kernel.. or at least not uninterruptible, because I can sigkill the cosd
[0:26] <Tv|work> naturally pulled disks = raid ;)
[0:27] <Tv|work> fwiw i get immediate failures on my home usb disks, when i pull those cables
[0:27] <cmccabe> tv: usb-storage probably behaves differently in that respect than sata
[0:27] <bchrisman> yeah… the kernel recognizes this drive pull immediately, but it doesn't get propagated up through btrfs to get the cosd exiting.
[0:27] <Tv|work> cmccabe: i fear sata itself has no "i was unplugged" signaling mechanism
[0:27] <cmccabe> tv: I had this problem once when designing a system that used MD (Linux software raid)
[0:28] <bchrisman> I see the sysfs message immediately at least… or within a few seconds at most.
[0:28] <bchrisman> but yeah.. there could be slow errors too..
[0:28] <cmccabe> tv: actually, sata does have hotplug... *if* you are using AHCI rather than PATA emulation, *and* your sata controller, kernel driver, and sata hardware is new enough
[0:29] <cmccabe> tv: it's not very widely used, except by a few professionals with super-fancy RAID cards
[0:30] <cmccabe> tv: so the sata drivers for regular old PC chipsets tend to get hotplug wrong (no testing)
[0:30] <Tv|work> yeah i recall it's an extension not innate in the protocol
[0:30] <cmccabe> tv: also as you might imagine, it's hard to automate hotplug testing :)
[0:31] <cmccabe> tv: when I was at Locust Storage, we had to patch the vendor drivers on our RAID controller to get hotplug working
[0:32] <cmccabe> tv: for some reason, LSI hadn't gotten those patches into mainline yet
[0:32] <cmccabe> tv: although they were floating around on the web
[0:49] <yehuda_hm> sagewk: not sure if ino32 mount option is feasible
[0:50] <yehuda_hm> sagewk: because we need to propagate it to ceph_vino_to_ino
[0:50] <yehuda_hm> sagewk: maybe a compile option?
[0:51] <sagewk> the 64->32 conversion in vino_to_ino could be a separate helper, and the mount option could apply that just for kstat.ino in getattr?
[0:52] <sagewk> i_ino would stay 64_bit
[0:54] <yehuda_hm> the problem is that you need to know at a pretty low level whether to use the conversion or not
[0:54] <yehuda_hm> oh, you mean, just for the stat?
[0:54] <sagewk> yeah just for stat
[0:55] <yehuda_hm> that would be easy
[0:55] <sagewk> ino32 would just show 32bit inos to userspace (via stat). and readdir too, i guess
[0:55] <sagewk> look for all filldir callers (there are a few of them)
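The ino32 idea discussed above (show 32-bit inode numbers to userspace via stat and readdir while keeping the internal inode 64-bit) can be sketched as follows. This is an illustration only, not the kernel client's implementation: the XOR folding scheme and the helper names `fits_in_32_bits` and `fold_ino_to_32` are assumptions made for this example. The inode value is the `st_ino` from jantje's strace output earlier in the log.

```python
MASK32 = 0xFFFFFFFF


def fits_in_32_bits(ino: int) -> bool:
    """True if ino can be represented in an unsigned 32-bit field."""
    return 0 <= ino <= MASK32


def fold_ino_to_32(ino: int) -> int:
    """Fold a 64-bit inode number into 32 bits by XORing its two halves.

    Assumed scheme for illustration. Distinct 64-bit inodes can fold
    to the same 32-bit value, so the result is not guaranteed unique.
    """
    folded = (ino & MASK32) ^ (ino >> 32)
    # Avoid returning 0, since some tools treat inode 0 specially;
    # the substitute value 1 is an arbitrary choice in this sketch.
    return folded if folded != 0 else 1


ino = 1099511668988            # st_ino from the strace output
print(fits_in_32_bits(ino))    # False: too large for a 32-bit st_ino
print(fold_ino_to_32(ino))     # 41468
```

Applying the fold only where values are handed to userspace (the stat and readdir paths sagewk mentions) would leave internal lookups on the full 64-bit number, so collisions would affect only what 32-bit programs see.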
[5:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:56] * allsystemsarego (~allsystem@ has joined #ceph
[7:12] * yehuda_hm (~yehuda@adsl-69-228-150-44.dsl.irvnca.pacbell.net) has joined #ceph
[8:14] * yehuda_hm (~yehuda@adsl-69-228-150-44.dsl.irvnca.pacbell.net) Quit (Ping timeout: 480 seconds)
[10:57] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[13:42] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[13:56] * allsystemsarego (~allsystem@ has joined #ceph
[17:53] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
