#ceph IRC Log


IRC Log for 2011-01-20

Timestamps are in GMT/BST.

[0:00] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[2:26] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[2:28] * greglap (~Adium@ has joined #ceph
[2:29] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:41] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[2:52] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:52] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[3:42] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:49] * cmccabe1 (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:53] * DJLee (82d8d198@ircip3.mibbit.com) has joined #ceph
[5:33] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[5:40] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[6:05] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[6:30] * ijuz_ (~ijuz@p4FFF7E10.dip.t-dialin.net) has joined #ceph
[6:37] * ijuz__ (~ijuz@p4FFF5887.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[8:56] * allsystemsarego (~allsystem@ has joined #ceph
[9:12] <jantje> wido: yes, so 10 OSD's in total :-)
[9:30] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:31] <jantje> hmm, how do i get ceph running on debian stable
[9:32] <jantje> (libcrypto++ issue)
[9:48] <darkfader> good $timezone
[10:11] <jantje> :)
[10:20] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[10:28] * Yoric (~David@ has joined #ceph
[10:40] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) has joined #ceph
[10:41] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) has left #ceph
[10:41] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) has joined #ceph
[11:02] * alexxy (~alexxy@ has joined #ceph
[11:29] * raso (~raso@debian-multimedia.org) has joined #ceph
[12:24] <stingray> by debian stable you mean debian ancient?
[12:33] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[13:02] <darkfader> are you still hiring the QA guys? my girlfriend started asking questions about software testing and such yesterday
[13:02] <darkfader> i'd love to find someone who can answer a few questions for her
[13:11] * sakib (~sakib@ has joined #ceph
[13:16] * Yoric (~David@ Quit (Quit: Yoric)
[13:51] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[13:53] * sakib (~sakib@ Quit (Quit: leaving)
[13:55] <stingray> gregaf:
[13:55] <stingray> 2011-01-20 15:54:21.830172 7f16be5f3700 mds0.journaler ADVANCING to next non-zero point
[13:55] <stingray> mds/journal.cc: In function 'virtual void ESession::replay(MDS*)':
[13:55] <stingray> mds/journal.cc:698: FAILED assert(mds->sessionmap.version == cmapv)
[13:55] <stingray> ceph version 0.24.1 (commit:9bf55037e110e56dcd649d1e1216601e29f64b2b)
[13:55] <stingray> didn't really work :)
[14:49] <stingray> sagewk: I just asked josef bacik to add me to ceph package maintainers in fedora
[14:50] <stingray> sagewk: I'll keep the spec file up to date and may help do builds
[16:32] <jantje> Hmm, I have a 32bit chroot environment
[16:33] <jantje> which I successfully ran on my debian 64bit machine (64bit software, 64bit kernel), I tried the same on my debootstrapped nfsroot client (with different kernel), and I get an exec format error
[16:33] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[16:33] <jantje> any clues?
[16:41] <jantje> Solved, I left IA32 Emulation off I guess
[16:53] <darkfader> hmm last time i managed something like that i had forgotten elf support
[16:53] <darkfader> but missing the cpu support is also nice
[17:30] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[17:33] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[17:34] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[17:36] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[17:38] <jantje> sagewk: yehuda fixed the norbytes stuff between x86 and x86_64, but now I'm running a 32bit chroot on a 32bit machine
[17:39] <jantje> and I get it all the time :-)
[17:39] <jantje> sagewk: yehuda fixed the norbytes stuff between x86 and x86_64, but now I'm running a 32bit chroot on a 64bit (with 64bit software and kernel) machine
[17:39] <jantje> can you ask him if it could be related? thanks!
[17:40] <Tv|work> jantje: i'd expect people to come in in the next 30-60min..
[17:46] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:49] * greglap (~Adium@ has joined #ceph
[18:00] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[18:04] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[18:11] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:22] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[18:31] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[18:35] <sagewk> jantje: do you mean you see large directory sizes even with norbytes on a 32bit kernel? or (possibly some other thing causing) EOVERFLOW in your workload?
[18:36] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:39] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[18:48] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:56] <sagewk> i dropped a few bad commits from the unstable and testing branches; watch out for git weirdness (you may need to git reset --hard origin/branchname)
[18:56] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[19:07] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[19:08] <stingray> sagewk: you saw what I posted about fedora
[19:08] <sagewk> yeah
[19:08] <sagewk> sounds great.
[19:09] <sagewk> i also want to set up stuff here to autobuild rpms for releases
[19:09] <sagewk> both for convenience, and also so we can verify the packages build properly
[19:10] <wido> hi
[19:10] <sagewk> wido: hey
[19:11] <wido> I'm seeing some blocking OSD's, they seem to become split up. For example, on "noisy" I have 4 OSD's, when doing some I/O's inside a RBD VM one or two OSD's get marked "out"
[19:11] <wido> but they are up, but in their logs they are saying that the other OSD's are down (no heartbeat)
[19:11] <wido> trying to stop them, gets them into zombie state
[19:11] <sagewk> wido: can you verify that 727 is actually a duplicate? (e.g. that trying to add a non-existent image first will break things)?
[19:12] <sagewk> wido: hmm, seen the zombie thing a few times recently. not sure where that's coming from. any btrfs problems on console?
[19:12] <wido> sagewk: Oh, adding a non-existent doesn't break anymore with the current ceph-client
[19:12] <wido> Like I updated, the "alpha" image was corrupted somehow
[19:12] <wido> that caused some troubles I think
[19:13] <wido> sagewk: I'm seeing the btrfs messages from #715, but I just found out, these messages appear when I run "service ceph stop osd"
[19:13] <sagewk> oh i see
[19:14] <wido> and something new, no idea what it is: "space_info has 961927680 free, is not full" & "space_info total=1082130432, used=88825856, pinned=0, reserved=22922752, may_use=0, readonly=8454144"
[19:14] <sagewk> not sure. probably some of josef's debug stuff.
[19:14] <sagewk> if you see the btrfs warnings, be sure to reboot before continuing...
[19:16] <wido> Since the OSD's are in Z, I have no choice
[19:16] <gregaf> stingray: drat
[19:17] <gregaf> I think that's probably it then, there are a lot of checks like those for various things
[19:17] <gregaf> sorry :(
[19:18] <stingray> gregaf: no worries
[19:18] <stingray> I'll redo the cluster
[19:19] <stingray> but I'll try to push fresh spec for 0.24.1 to fedora first
[19:25] <sagewk> stingray: if you wait a day or so you can do 0.24.2
[19:26] <stingray> ok
[19:26] <stingray> it's no big deal, anyway
[19:26] * fzylogic (~fzylogic@ has joined #ceph
[19:27] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:27] <fzylogic> playground's locked up with osd timeouts and bad crc errors again
[19:28] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit ()
[19:49] <sagewk> fzylogic: which node?
[19:49] <sagewk> oh i see
[19:49] <fzylogic> all of them :)
[20:00] * cmccabe (~cmccabe@ has joined #ceph
[20:08] <sagewk> fzylogic: it's recovering fine, tho the errors keep coming up. reproducible at least :)
[20:09] <fzylogic> my transfer's still hung
[20:09] <sagewk> oh maybe the mds is hung. :/
[20:09] <fzylogic> my 2 scp processes on ladder0 are in uninterruptible sleep. one of those is from last night :-/
[20:13] <bchrisman> has there been much in the way of failure testing done? On a lark, I set up a 2-node cluster with four osd's per node and pulled one of the drives… all I/O seems to have stopped.
[20:13] <bchrisman> (1 drive per osd)
[20:14] <bchrisman> (of course, I'm also running client & filesystem daemons all on the same nodes)
[20:16] <bchrisman> (and am using kernel client)
[20:16] <wido> sagewk / gregaf: I just saw greg's post on the ml, you are getting some nodes up with the latest btrfs to hunt down what me and other people are seeing lately?
[20:16] <fzylogic> sagewk: transfer's going again and last night's process died off
[20:27] <gregaf> bchrisman: what's ceph -s report?
[20:28] <gregaf> architecturally stuff should keep working but in some failure scenarios the PGs can take, uh, longer than expected to repeer
[20:29] <sagewk> wido: yeah
[20:29] * Meths_ (rift@ has joined #ceph
[20:31] <bchrisman> gregaf: ceph -s hung… looks like the second node has gone down… pings but even sshd not responding… think I need to redo this test with three nodes… cuz if something's causing a node to go down, then the problem will be exacerbated by only having two nodes, I'd expect.
[20:32] <gregaf> bchrisman: hmm, if the ceph tool hung then something's wrong with the monitor
[20:32] <gregaf> did you make sure to pull an OSD disk and not the monitor disk?
[20:34] <bchrisman> gregaf: hmm… OSDs are on all disks… monitor directory is on root filesystem, which is mirrored with md.
[20:34] <bchrisman> root fs is available.
[20:35] <bchrisman> data/mon0 is available (but not on the crashed node)
[20:35] <bchrisman> I'm guessing there's a problem with my only using two nodes for this test?
[20:35] <gregaf> what do you mean "but not on the crashed node"
[20:36] * DJLee (82d8d198@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[20:36] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[20:37] <gregaf> bchrisman: oh, I didn't notice you said the second node had gone down
[20:37] <gregaf> so yeah, depending on configuration that's it
[20:37] <bchrisman> 2-node cluster, osds/mons/kernel-client on both with cosd-per-drive.. started some I/O… pulled 1 of 4 drives on node2 … node2 was up for a while after that, but the filesystem hung… some 10 minutes later… node2 is now unresponsive.
[20:38] <sagewk> wido: btw that btrfs message in the log was just removed in 2.6.38-rc1, so presumably harmless! :)
[20:38] <bchrisman> yeah.. I'm guessing that whatever the problem with the drive fail is… it's causing something that tries to get the cluster to check quorum/reform.. and with only two nodes… the loss of one becomes basically an unsupported config.
[20:39] <gregaf> bchrisman: ah, if the filesystem hung that likely hung the monitor
[20:39] <sagewk> heh, from commit: "let's not WARN and freak everybody out for no reason."
[20:39] <bchrisman> by filesystem hung, I mean the ceph fs hung.. not the root fs
[20:39] <gregaf> and you need a strict majority of live monitors in order to update maps, which means if you lose one of your monitors stuff is going to hang very quickly
[20:39] <gregaf> oh, hmm
[20:39] <bchrisman> yeah.. I'm going to retest with three nodes after a bit of work to free another node up.
[20:40] <gregaf> I suspect in this case you're running into some issues with the node from pulling the drive, rather than anything with Ceph itself
[20:40] <bchrisman> otherwise will be conflating drive-loss testing with 2-node-cluster-node-loss testing..
[20:41] <bchrisman> ok.. will report back after more testing.
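
The strict-majority rule gregaf describes for monitors can be sketched as a few lines of arithmetic. This is only an illustration of the rule as stated in the log, not Ceph's actual monitor code; the function names `has_quorum` and `max_tolerated_failures` are hypothetical.

```python
def has_quorum(total_monitors: int, live_monitors: int) -> bool:
    """Return True when the live monitors form a strict majority.

    Hypothetical helper: it models only the "more than half" rule
    mentioned in the log, nothing else about how monitors behave.
    """
    if total_monitors <= 0:
        raise ValueError("total_monitors must be positive")
    if not 0 <= live_monitors <= total_monitors:
        raise ValueError("live_monitors must be between 0 and total_monitors")
    # Strict majority: more than half of all configured monitors are alive.
    return 2 * live_monitors > total_monitors


def max_tolerated_failures(total_monitors: int) -> int:
    """Largest number of monitors that can fail while a majority survives."""
    if total_monitors <= 0:
        raise ValueError("total_monitors must be positive")
    return (total_monitors - 1) // 2


if __name__ == "__main__":
    for total in (1, 2, 3, 5):
        print(total, "monitors tolerate", max_tolerated_failures(total), "failure(s)")
```

Under this rule a two-monitor cluster tolerates no monitor loss at all, which matches the hang seen in the 2-node test, while three monitors tolerate one.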
[20:44] * sakib (~sakib@ has joined #ceph
[20:47] <Tv|work> ahahaha java is too bloaty to build jars in this vm, lovely
[20:53] <wido> sagewk: The zombie OSD's, do you think they are related to btrfs?
[20:56] * sakib (~sakib@ Quit (Quit: leaving)
[20:57] <sagewk> pretty sure the cases i saw were. if you see a btrfs BUG that can definitely result in a zombie.
[21:00] <wido> sagewk: Ok, then it won't be the warnings I'm seeing. I got the zombie OSD's when stressing them with Qemu-RBD, but I already told that :)
[21:19] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:55] * Meths_ is now known as Meths
[23:44] <bchrisman> gregaf: 3-node cluster drive-pull test: 'ls' of ceph filesystem hangs (uninterruptible), and two cosd procs on the node with the pulled drive grow out of control in mem usage. (I initially see scrub messages and degradation messages which look correct…) ceph -s (http://pastebin.com/eYQNHBEt)
[23:45] <bchrisman> gregaf: what failure testing has been performed thus far and how long ago? I'm still on a 0.24~rc version..
[23:46] <gregaf> bchrisman: varying amounts depending on the exact scenario
[23:46] <gregaf> I don't know that we've actually yanked drives before, but we run a lot of tests where we just kill -9 a cmds
[23:46] <gregaf> err, cosd I mean
[23:47] <gregaf> that kind of stuff is done on an ongoing basis, exact amounts depending on what else we have in the queue for our developers
[23:48] <bchrisman> gregaf: ok… all four cosd's are still running in this situation… two of them dramatically expand to consume all memory on the system.
[23:49] <gregaf> so are those cosds getting OOM-killed?
[23:49] <bchrisman> I guess next question is what kind of logging I should turn on… I'm guessing osd logging, since that's what's spiraling out of control… yes.. they do get OOM-killed… eventually
[23:50] <gregaf> ah
[23:50] <gregaf> yes, an issue with memory usage during recovery is definitely possible
[23:50] <gregaf> there's a lot of state that currently just needs to sit in-memory during recovery
[23:50] <bchrisman> I can try killing the cosd's with no drive-pull.
[23:50] <gregaf> how much RAM do you have on these boxes?
[23:51] <gregaf> and is swap enabled?
[23:51] <bchrisman> 1GB, no swap currently.
[23:52] <gregaf> you're running 4 OSDs on 1GB of ram without swap?
[23:52] <bchrisman> the cosd's are using very little memory until the drive is pulled.
[23:52] <gregaf> yeah
[23:52] <gregaf> no, that's definitely the problem, I didn't realize your boxes were so small, sorry
[23:52] <bchrisman> rebuild takes a lot of ram then?
[23:52] <gregaf> I think they can use a few gigs each during recovery, though I don't have recent numbers off-hand
[23:52] <bchrisman> Ahh.. okay
[23:52] <gregaf> sjust: do you know the RAM requirements on recovery right now?
[23:52] <sjust> no
[23:53] <bchrisman> Swap will work but very slow?
[23:53] <sjust> hang on, might be able to ballpark it
[23:53] <gregaf> swap will work, yeah
[23:53] <gregaf> not sure how slow it'll actually be, the memory use pattern is likely to be pretty swap friendly IIRC
[23:53] <fzylogic> if you have multiple disks, you can essentially raid-0 your swap to speed it up a bit
[23:54] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:54] <sjust> looks like around 203 MB for one cosd instance with a significant portion of the pgs recovering
[23:54] <gregaf> that'll vary depending on number of PGs and objects though, right?
[23:54] <sjust> yes
[23:55] <bchrisman> Okay.. I'll put in a couple gigs of swap… the filesystem should be responsive during rebuild?
[23:55] <sjust> bchrisman: how many pgs are there?
[23:55] <gregaf> bchrisman: in general it should be responsive during rebuild, yes
[23:56] <bchrisman> will look at pgs when I get the cluster back up… ceph pg dump is the cmd for that?
[23:56] <gregaf> there are a few small windows where accesses won't work, and on our test systems I think they're usually very short (maxing out at a few seconds, and only if you touch the "wrong" data at the wrong time) but I don't know how it'll respond to memory-constrained environments
[23:57] <bchrisman> ahh ok.. by not working.. will there be an IO error, or will it simply block?
[23:57] <gregaf> it'll block
[23:57] <sjust> bchrisman: ceph -s will give you the total number as well
[23:57] <bchrisman> ahh ok.. thanks both of you..
[23:57] <gregaf> I don't recall how familiar you are with the storage model
[23:58] <bchrisman> heh.. somewhat.. my background is a bit more of the traditional SAN-style clustered filesystems… and a bit on the LUSTRE stuff.
[23:58] <gregaf> but when one OSD goes down then any PGs that are on it become degraded and start repeering
[23:58] <bchrisman> ahh yeah.. I saw those messages coming out.. scrubbing is part of that?
[23:58] <gregaf> the state changes are generally very short but some of them require a little more work than others, and it's during these that the PG data is inaccessible
[23:59] <gregaf> so if you're trying to look at an inode that's stored in an inaccessible PG, it will hang
[23:59] <gregaf> scrubbing is part of that
[23:59] <bchrisman> ok.. yeah.. transitional blocking seems like it's the best way to handle that...
[23:59] <gregaf> during *most* of the scrubbing process the data remains accessible
[23:59] <gregaf> but we eventually need to freeze IO to do a final tally
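
The memory figures quoted in the recovery discussion above (1GB of RAM, no swap, 4 cosd processes, and sjust's ballpark of about 203 MB per recovering cosd) can be turned into a rough headroom estimate. This is back-of-the-envelope arithmetic only: the function name `recovery_headroom_mb` is hypothetical, the per-OSD figure varies with the number of PGs and objects as noted in the log, and the estimate ignores the kernel, monitors, MDS, and client running on the same node.

```python
def recovery_headroom_mb(ram_mb: int, swap_mb: int,
                         osd_count: int, per_osd_recovery_mb: int) -> int:
    """Memory left over (in MB) if every OSD on the node recovers at once.

    Hypothetical estimate: a negative result means the OSDs alone would
    need more than RAM plus swap, so OOM kills become likely.
    """
    if min(ram_mb, swap_mb, osd_count, per_osd_recovery_mb) < 0:
        raise ValueError("all inputs must be non-negative")
    return ram_mb + swap_mb - osd_count * per_osd_recovery_mb


if __name__ == "__main__":
    # Figures from the log: 1GB RAM, no swap, 4 OSDs, ~203 MB each.
    print(recovery_headroom_mb(1024, 0, 4, 203))
    # Assumed figure: "a few gigs each" taken here as 2048 MB per OSD.
    print(recovery_headroom_mb(1024, 0, 4, 2048))
```

Even with the lower ballpark, four recovering OSDs would leave only about 212 MB for everything else on a 1GB node, so the OOM kills reported here are plausible with either figure.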

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.