#ceph IRC Log


IRC Log for 2010-09-28

Timestamps are in GMT/BST.

[18:29] -reticulum.oftc.net- *** Looking up your hostname...
[18:29] -reticulum.oftc.net- *** Checking Ident
[18:29] -reticulum.oftc.net- *** No Ident response
[18:29] -reticulum.oftc.net- *** Found your hostname
[18:29] * CephLogBot (~PircBot@fubar.widodh.nl) has joined #ceph
[18:30] <wido> gregorg: is back again :)
[18:30] <wido> crashed I think
[18:59] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:05] <wido> gregorg: bot is fixed
[19:08] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[19:31] * Yoric (~David@ Quit (Quit: Yoric)
[19:35] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:50] * cmccabe (~cmccabe@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:55] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:55] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[19:59] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:03] * cmccabe (~cmccabe@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving)
[20:04] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[20:24] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:13] <wido> sage: you there?
[21:13] <wido> "pg v334177: 3280 pgs: 1 active, 3276 active+clean, 3 crashed+peering; 1439 GB data, 1711 GB used, 4614 GB / 6326 GB avail; 2/3226406 degraded (0.000%)"
[21:13] <wido> it's almost there :)
[21:20] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:29] <wido> sage: i think you just missed my message
[21:29] <wido> "pg v334177: 3280 pgs: 1 active, 3276 active+clean, 3 crashed+peering; 1439 GB data, 1711 GB used, 4614 GB / 6326 GB avail; 2/3226406 degraded (0.000%)"
[21:29] <wido> Still the same issue I assume?
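(The status line wido pastes above is dense; as an aside, it can be unpacked mechanically. This is a sketch with hand-rolled regexes, not part of any Ceph tooling — field names and parsing are my own.)

```python
import re

# The "pg stat" summary line exactly as quoted in the log above.
line = ("pg v334177: 3280 pgs: 1 active, 3276 active+clean, "
        "3 crashed+peering; 1439 GB data, 1711 GB used, "
        "4614 GB / 6326 GB avail; 2/3226406 degraded (0.000%)")

def parse_pg_stat(s):
    """Split a pg summary line into map version, total pg count,
    per-state counts, and the degraded object ratio."""
    version = re.search(r"pg (v\d+)", s).group(1)
    total = int(re.search(r"(\d+) pgs:", s).group(1))
    # Per-state counts look like "3 crashed+peering" followed by , or ;
    states = {m.group(2): int(m.group(1))
              for m in re.finditer(r"(\d+) ([a-z+]+)[,;]", s)}
    deg = re.search(r"(\d+)/(\d+) degraded", s)
    degraded = (int(deg.group(1)), int(deg.group(2)))
    return version, total, states, degraded

version, total, states, degraded = parse_pg_stat(line)
print(version, total, states, degraded)
```

Note that the per-state counts sum to the total: 1 active + 3276 active+clean + 3 crashed+peering = 3280 pgs, so the 3 stuck pgs discussed below are exactly the remainder.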
[21:29] <sagewk> yep, i got pretty close last night. still need to find those 2 objects, and see why those 3 pgs are misbehaving.
[21:30] <sagewk> will work on it later today
[21:30] <wido> ok, cool
[21:30] <wido> Right now I can't mount the filesystem either, could that be the same issue?
[21:30] <wido> and i can't ls the "rbd" pool
[21:34] <sagewk> yeah, 2 pgs in rbd pool are inactive, and 1 pg in the metadata pool (which is probably what's preventing the mds from doing whatever)
[21:35] <wido> Ok, just for my information
[21:35] <wido> a pg gets inactive during recovery?
[21:35] <wido> or just in this special situation?
[21:38] <cmccabe> hey, I'm going to do a little work on sepia today
[21:41] <sagewk> wido: yeah during recovery it goes into a peering state while it checks replicas and resyncs pg metadata. then goes active while it recovers, then active+clean when recovery is done.
[21:41] <sagewk> so for one pg recovery isn't completing, and for 3 the peering isn't completing.
[21:42] <wido> ok, so during an OSD failure, you should always expect some stalling
[21:42] <wido> while the cluster is recovering and checking everything
[21:43] <gregaf> well, performance will be degraded while it's transferring objects between the OSDs
[21:43] <gregaf> but peering is supposed to be pretty fast
[21:45] <wido> yes, but from what i understand, a pg goes into peering state where it will be unavailable for some time (a few seconds?)
[21:45] <wido> so that could stall your filesystem
[21:49] <sagewk> hopefully less than a few seconds :) all it has to do (normally) is transfer the pg logs since the last sync (for things like a restart). a couple network round trips. if it's something more serious (some osd has to rescan its objects) it temporarily comes online with a subset of the osds, and that may take a few seconds to adjust the osdmap.
[21:57] <wido> ok, that makes sense. The situation i'm in right now is "special" and shouldn't occur when this bug is fixed. I mean the "ls" stalling and the metadata pool not functioning fully
[21:57] <sagewk> yeah :)
[22:08] <wido> ok, tnx!
[22:09] <wido> i'm going afk, hope you find the issue
[22:09] <wido> i'll check it tomorrow
[22:38] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[22:48] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:07] <cmccabe> hi all
[23:07] <cmccabe> I'm installing ceph on some cluster machines in the sepia cluster
[23:07] <cmccabe> the -dev packages to build ceph aren't installed on sepia. We usually just build everything on flab
[23:08] <cmccabe> I think Sage was hoping we could use "make install" to install ceph into the /usr directory on the cluster machines
[23:08] <cmccabe> but because we're using automake, make install needs all those dev packages to run
[23:09] <cmccabe> do you think that it's reasonable to mount the cluster's root dev on flab and do make install from there ? :)
[23:09] <yehudasa> ahmm
[23:09] <yehudasa> i'm not sure it's enough
[23:10] <cmccabe> I guess if any part of make install used absolute paths, it would break
[23:10] <yehudasa> yeah, it probably does use absolute paths
[23:11] <yehudasa> which dev packages are needed in order to run?
[23:11] <yehudasa> can't we just install them on the sepia machines?
[23:11] <cmccabe> I just installed libgoogle-perftools-dev
[23:11] <cmccabe> there was some openssl thing I had to install but it complained about not being able to find the .deb when I used apt-get
[23:11] <cmccabe> so then I started an apt-get upgrade
[23:11] <gregaf> oh no, not one of those
[23:11] <cmccabe> (on sepia)
[23:12] <cmccabe> yes... the dreaded upgrade
[23:12] <yehudasa> ahmm.. should have probably just 'apt-get update'?
[23:12] <gregaf> did we run out of disk space, or is it still alive?
[23:12] <cmccabe> at 71% on rootfs
[23:12] <cmccabe> more on other partitions
[23:13] <gregaf> oh, that's good — I got murdered several months ago by running out of disk space in the middle of one (my first, in fact, lol)
[23:14] <cmccabe> I had a problem with that just yesterday!
[23:14] <cmccabe> I was able to save the day with SIGSTOP and some manual intervention
[23:14] <cmccabe> also luckily I found out when it was installing man pages... no big deal if they don't get installed.
[23:49] <cmccabe> hmm, root dev is full on flab
[23:52] <cmccabe> there are no obvious space hogs here. Just that 4G is kind of a small rootfs.
[23:52] <cmccabe> Do we really need java on flab?
[23:54] <yehudasa> I think so, it's needed for hadoop afair
[23:55] <cmccabe> /usr/share is 993 M all by itself
[23:55] <yehudasa> apt-get clean?
[23:55] <cmccabe> /usr/lib another 1.3G
[23:56] <cmccabe> /var is another partition which is doing fine
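(The hunt above for what fills a 4G rootfs — /usr/share at 993M, /usr/lib at 1.3G — is the usual per-directory `du` exercise. A rough Python equivalent, summing file sizes under each immediate subdirectory; the `/usr` path in the comment is just the example from the log.)

```python
import os

def dir_sizes(root):
    """Return {subdir: total bytes} for each immediate child directory of
    root, summing regular file sizes recursively and skipping symlinks."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _, filenames in os.walk(entry.path):
                for name in filenames:
                    p = os.path.join(dirpath, name)
                    if not os.path.islink(p):
                        total += os.path.getsize(p)
            sizes[entry.name] = total
    return sizes

# e.g. sorted(dir_sizes("/usr").items(), key=lambda kv: -kv[1]) lists the
# biggest subtrees first, roughly `du -s /usr/* | sort -rn`.
```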

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.