#ceph IRC Log

IRC Log for 2011-08-09

Timestamps are in GMT/BST.

[0:00] <johnl_> gregaf: upgraded from a build of 394537092d
[0:01] <johnl_> upgraded quite a few times over the last few months, been fine.
[0:02] <johnl_> barely any data on this. 4 osds. pg v35316: 800 pgs: 800 active+clean; 2678 MB data, 13028 MB used, 2979 GB / 3152 GB avail
[0:04] <johnl_> hrm, it looks like it settled down. all fine now.
[0:04] <johnl_> fyi, a full log line: 2011-08-08 21:50:21.496558 log 2011-08-08 21:50:12.069411 osd3 10.200.35.118:6800/994 88 : [ERR] 0.f1 scrub stat mismatch, got 85/85 objects, 0/0 clones, 3842647/18446744073709199959 bytes, 3800/18446744073709551320 kb.
[0:05] <joshd> looks like a bad signed->unsigned conversion somewhere
[0:05] <joshd> maybe we're not checking an error code, and it's negative?
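
The values in johnl_'s log line are consistent with that: a small negative 64-bit value printed as unsigned. A minimal standalone illustration (the variable names are made up; this is not the actual scrub code):

    #include <cstdint>
    #include <iostream>

    int main() {
      int64_t byte_delta = -351657;       // e.g. an unchecked negative result
      uint64_t reported = byte_delta;     // implicit signed -> unsigned conversion
      std::cout << reported << std::endl; // prints 18446744073709199959, the
                                          // "bytes" value in the scrub error above
      return 0;
    }
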
[0:06] <johnl_> restarting cluster didn't provoke it into any more errors.
[0:11] <johnl_> manually requesting a scrub of an osd provoked the errors again
[0:11] <joshd> have you used snapshots?
[0:12] <joshd> I think the scrub accounting for them is less well tested
[0:12] <johnl_> let me check
[0:14] <gregaf> joshd: less well tested, yeah, but the bug sam thought was in there turned out to have been dealt with a long time ago
[0:17] <johnl_> no rados snaps anyway
[0:18] <johnl_> as usual, happy to give ssh access if you want to investigate
[0:18] <johnl_> test cluster, no real data
[0:22] <johnl_> happy to gather whatever you need though
[0:24] <joshd> johnl_: probably the interesting thing is osd3's log
[0:25] <joshd> if it's small enough to attach to a bug, that'd be great
[0:25] <johnl_> let me see
[0:25] <joshd> I don't see anything obvious that would cause this in the scrub code
[0:29] <johnl_> want any particular debug level
[0:29] <johnl_> ?
[0:32] <gregaf> johnl_: debug osd = 20 is always nice
[0:32] <gregaf> if you're going to regenerate them anyway
[0:34] <johnl_> just gonna scrub it again. seems to do it every time.
[0:34] <johnl_> will up the osd debug level then, ta
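
For reference, the debug level gregaf suggests normally goes in the [osd] section of ceph.conf; a minimal illustrative snippet (not johnl_'s actual configuration):

    [osd]
        debug osd = 20
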
[0:35] * hutchins (~hutchins@c-75-71-83-44.hsd1.co.comcast.net) Quit (Read error: Connection reset by peer)
[0:37] <johnl_> done. got it. will open a ticket.
[0:41] * verwilst (~verwilst@d51A5B689.access.telenet.be) Quit (Quit: Ex-Chat)
[0:49] <joshd> thanks
[0:51] <johnl_> http://tracker.newdream.net/issues/1376
[0:51] <johnl_> anything else?
[0:56] <joshd> that should be enough
[1:01] <johnl_> great, ta.
[1:01] <johnl_> I'm off to bed now, nnight
[1:02] <joshd> g'night
[1:09] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[1:16] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[1:21] * greglap (~Adium@166.205.136.208) has joined #ceph
[1:26] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[1:30] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:47] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[1:52] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[1:53] * MK_FG (~MK_FG@188.226.51.71) Quit (Quit: o//)
[2:10] * greglap (~Adium@166.205.136.208) Quit (Quit: Leaving.)
[2:18] * huangjun (~root@122.225.105.244) has joined #ceph
[2:19] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[2:21] * hutchins (~hutchins@c-75-71-83-44.hsd1.co.comcast.net) has joined #ceph
[2:22] * MK_FG (~MK_FG@188.226.51.71) Quit ()
[2:26] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[2:33] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:47] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[3:05] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:07] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[3:19] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:25] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[3:43] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) Quit (Remote host closed the connection)
[4:02] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) Quit (Quit: Leaving.)
[4:03] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[5:17] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[5:19] * hutchins_ (~hutchins@c-75-71-83-44.hsd1.co.comcast.net) has joined #ceph
[5:19] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[5:25] * hutchins (~hutchins@c-75-71-83-44.hsd1.co.comcast.net) Quit (Read error: Operation timed out)
[5:34] * hutchins_ (~hutchins@c-75-71-83-44.hsd1.co.comcast.net) Quit (Quit: Leaving)
[5:39] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[5:51] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[6:35] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[9:03] * royh (~royh@mail.vgnett.no) has joined #ceph
[9:59] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) has joined #ceph
[11:31] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:13] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[12:30] <huangjun> ls
[12:53] * hijacker (~hijacker@213.91.163.5) Quit (Remote host closed the connection)
[13:17] * hijacker (~hijacker@213.91.163.5) has joined #ceph
[13:39] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[13:48] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[14:32] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:27] * huangjun_ (~root@113.106.102.8) has joined #ceph
[15:29] * huangjun (~root@122.225.105.244) Quit (Ping timeout: 480 seconds)
[16:08] * lxo (~aoliva@9YYAAAO6R.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[16:14] * huangjun_ (~root@113.106.102.8) Quit (Remote host closed the connection)
[16:17] * lxo (~aoliva@83TAACS7Z.tor-irc.dnsbl.oftc.net) has joined #ceph
[16:50] * greglap (~Adium@166.205.139.150) has joined #ceph
[17:31] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[17:36] * greglap (~Adium@166.205.139.150) Quit (Quit: Leaving.)
[17:37] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:41] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[17:55] <gregaf> sagewk: hmm, gitbuilder ran out of space and it's busting teuthology, TV says that's probably you
[17:55] <sagewk> k
[17:56] <Tv> sagewk: you talked about adjusting the when-to-purge and resizing things.. it might be a coincidence but it seemed you'd have the best clue at this point
[17:57] <sagewk> there's plenty of space.. not sure where the error is coming from :
[17:58] <gregaf> hmm, it says file truncated and current master builds fine for me so I just assumed?
[17:58] <sagewk> i think it's build related.. i pushed a new branch prior to the merge and it was green
[17:58] <Tv> ".libs/librados_la-librados.o: file not recognized: File truncated"
[17:58] <sagewk> yeah
[17:58] <Tv> i've seen "File truncated" when it was actually a zero-length file
[17:58] <Tv> left over from open(...); crash
[17:59] <Tv> i don't see a crash though
[17:59] <gregaf> it's also fine on the 32-bit one
[18:01] <Tv> i told gitbuilder to rebuild the last 2 commits
[18:03] <Tv> well it happened again
[18:03] <Tv> exact same spot
[18:03] <Tv> perhaps badness cached by ccache
[18:03] <Tv> clearing
[18:05] <Tv> rebuilding
[18:11] <Tv> and now it's ok
[18:11] <Tv> bleh
[18:11] <Tv> if a file would spontaneously get truncated, that's what would happen
[18:11] <Tv> ohh i see the vm has had a few segfaults
[18:12] <Tv> in git, which is *really* rare
[18:12] <Tv> so it might be in a broken state somehow
[18:12] <Tv> sagewk: if you're not hands on with gitbuilder right now, i'd recommend rebooting it, just to be safe
[18:14] <sagewk> tv: probably corrupted the ccache cache state when i powered it off yesterday?
[18:14] <Tv> sagewk: could be
[18:14] <sagewk> k well rebooting. thanks!
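
For reference, clearing the ccache cache as Tv describes above is typically done with ccache's built-in flag (his exact invocation isn't shown in the log):

    ccache -C    # remove all cached compilation results
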
[18:35] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[18:38] <gregaf> sagewk: I've got a stuck machine and last time you said to use powercycle out of ceph-qa-deploy -- where should that work from?
[18:38] <gregaf> it seems to have some perl dependencies that aren't available by default
[18:40] <sagewk> flak?
[18:40] <sagewk> gregaf: any of the squeeze machines i think
[18:40] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:41] <gregaf> well when I run it on kai:
[18:41] <gregaf> gregf@kai:~/src/ceph-qa-deploy$ ./powercycle sepia56
[18:41] <gregaf> Can't locate Expect.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.10.1 /usr/local/share/perl/5.10.1 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl .) at ./_pdu_helper.pl line 4.
[18:41] <gregaf> BEGIN failed--compilation aborted at ./_pdu_helper.pl line 4.
[18:42] <gregaf> same on flak
[18:44] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:44] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:52] <sagewk> bchrisman: you guys don't have anyone with btrfs experience do you?
[18:54] <sagewk> gregaf: apt-get install libexpect-perl
[19:02] <bchrisman> sagewk: sorry.. no
[19:03] <bchrisman> I've also got a failure: Aug 9 01:36:31 kernel BUG at net/ceph/messenger.c:2195!
[19:03] <bchrisman> I'll post the crash log to redmine
[19:03] <sagewk> bchrisman: ok thanks
[19:03] <bchrisman> will see if I have messenger logs too
[19:09] * cmccabe (~cmccabe@69.170.166.146) has joined #ceph
[20:04] * Dantman (~dantman@96.49.150.218) has joined #ceph
[20:08] * aliguori (~anthony@32.97.110.59) has joined #ceph
[20:27] <sagewk> bchrisman: can you attach 'objdump -rdS net/ceph/libceph.[k]o' to #1382?
[20:30] <bchrisman> sagewk: ok.. just put it up there.
[20:30] <sagewk> bchrisman: thanks
[20:36] <sagewk> bchrisman: hrm, no debug info in your kernel :(
[20:37] <bchrisman> yeah.. can rebuild with debugging.. also can turn dynamic debugging on, which is what I was previously using.
[20:37] <sagewk> bchrisman: not actually sure which debug option it is, but when it's on the objdump has the source embedded along with the assembly
[20:39] <bchrisman> that's probably something we'll need set for quite some time... will set it up in our kernel build.
[20:43] <sagewk> it's probably CONFIG_DEBUG_INFO=y or CONFIG_DEBUG_KERNEL=y
[20:45] <bchrisman> thanks..
[20:48] <bchrisman> hmm.. CONFIG_DEBUG_KERNEL=y and CONFIG_DEBUG_INFO=y perhaps: # CONFIG_DEBUG_KOBJECT is not set
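
CONFIG_DEBUG_INFO=y is the option that lets objdump -rdS interleave source with the disassembly; a quick way to check a running kernel, assuming the usual /boot/config-* location:

    grep -E 'CONFIG_DEBUG_INFO|CONFIG_DEBUG_KERNEL' /boot/config-$(uname -r)
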
[20:57] * Juul (~Juul@81.162.158.87) has joined #ceph
[21:03] <slang> hi all
[21:04] <slang> I've been running ceph with 4 active mds servers, and 4 standby-replay servers, but in that setup I see periodic hangs of the cfuse mount
[21:04] <slang> every hour or so
[21:04] <slang> with just a single active mds, and 2 standby-replays, I haven't seen any such hangs
[21:05] <slang> I don't know if this is related to the interface binding issues I was seeing previously, or something else
[21:07] <slang> http://fpaste.org/Sx1L/
[21:08] <slang> that's my config with the 4 mds setup
[21:09] <slang> any ideas?
[21:13] <Tv> slang: does it hang or just get slow & recover by itself after a while?
[21:14] <Tv> slang: getting logs from when that happens into a ticket would be a good next step
[21:16] <slang> Tv: looks like a hang to me
[21:16] <slang> but maybe I'm just impatient :-)
[21:16] <Tv> heh, then it's probably a hang
[21:16] <Tv> logs (with sufficient debug levels) will help
[21:16] <slang> Not much in the logs but ms_handle_reset/ms_handle_connect
[21:17] <slang> I'll see if I can get some reasonable logging on the mds servers
[21:18] <Tv> sorry it's lunch hour here, stick around for an hour and you'll probably get more help
[21:19] <Tv> i'm off to grab lunch while there's still some left ;)
[21:34] * aliguori_ (~anthony@32.97.110.65) has joined #ceph
[21:41] * aliguori (~anthony@32.97.110.59) Quit (Ping timeout: 480 seconds)
[21:43] * aliguori_ (~anthony@32.97.110.65) Quit (Quit: Ex-Chat)
[21:43] * aliguori (~anthony@32.97.110.65) has joined #ceph
[21:44] <gregaf> slang: multi-MDS setups are still not nearly as stable as one with a single MDS
[21:44] <gregaf> Sage fixed a number of request misdirection bugs recently, though, that would manifest themselves as client hangs
[21:44] <slang> gregaf: ok cool
[21:45] <slang> gregaf: how recently?
[21:45] <gregaf> mmm, Friday maybe?
[21:45] <gregaf> most of them were longer ago than that, but there was one on, let me see
[21:46] <gregaf> yep, Friday had one: 3d258798fad253ad15c69c45ea4460c4b0248e6c
[21:53] * Juul (~Juul@81.162.158.87) Quit (Ping timeout: 480 seconds)
[22:01] <slang> gregaf: ok thanks
[22:01] * Juul (~Juul@node-u2p.camp.ccc.de) has joined #ceph
[22:02] <Tv> i don't understand teuthology commit 83b6678e79904793bf31e82bbecad7bf16c1b2b5
[22:03] <Tv> and that fact worries me
[22:04] <Tv> oh and the comment added lies; it returns a list not a dict
[22:04] <Tv> sagewk: this is my unhappy face
[22:04] <sagewk> teuthology.git?
[22:05] <gregaf> yeah
[22:05] <gregaf> fix get_clients
[22:05] <sagewk> yeah comment is a lie
[22:06] <sagewk> the problem was that i had both cfuse and kclient tasks (client.0 and client.1), but the old get_clients would return both
[22:14] <Tv> sagewk: i still don't see how that'll work; they get the same input, you're not telling it "and now i want only cfuse clients"
[22:14] <Tv> ohh config is not the real config
[22:14] <Tv> THAT's your problem
[22:14] <Tv> bleh
[22:15] <sagewk> yeah i suspect the rest of get_clients rewrite wasn't totally necessary
[22:15] <Tv> sagewk: err would you happen to have the config yaml that triggered this?
[22:15] <sagewk> tasks:
[22:15] <sagewk> - ceph:
[22:15] <sagewk> - kclient: [client.0]
[22:15] <sagewk> - cfuse: [client.1]
[22:15] <sagewk> interactive:
[22:15] <sagewk> and
[22:16] <sagewk> roles:
[22:16] <sagewk> - [mon.a, osd.0]
[22:16] <sagewk> - [mon.b, osd.1]
[22:16] <sagewk> - [mon.c, osd.2]
[22:16] <sagewk> - [mds.a, osd.3]
[22:16] <sagewk> - [mds.a-s, osd.4]
[22:16] <sagewk> - [osd.5]
[22:16] <sagewk> - [osd.6]
[22:16] <sagewk> - [osd.7]
[22:16] <sagewk> - [client.0, client.1]
[22:16] <Tv> alright let's see..
[22:16] <sagewk> er, that's - interactive:
[22:16] <Tv> yeah i got it
[22:16] * aliguori (~anthony@32.97.110.65) Quit (Ping timeout: 480 seconds)
[22:17] <Tv> ah now i see, the same machine had overlapping roles, the old implementation went roles->remotes->roles and lost focus there
[22:18] <wido> sagewk: I'm around now :-)
[22:18] <wido> I have an idea about the OSD crazyness
[22:19] <sagewk> tv: yeah that sounds right... sorry, it was last week,
[22:19] <wido> I had to power cycle/reset all my machines due to btrfs issue, status D processes, etc, etc
[22:19] <wido> could it be that those OSD's had those files open at the time I gave them a reset?
[22:19] <wido> thus corrupting the files
[22:19] <sagewk> wido: could be that the power cycle corrupted btrfs, although that's really not supposed to ever happen.
[22:20] <wido> No, but I even had whole filesystems dying, open_ctree failed
[22:20] * aliguori (~anthony@32.97.110.59) has joined #ceph
[22:20] <wido> on about 8 OSDs
[22:21] <sagewk> hmm, i wonder if that happened around the time that bad osdmap epoch was generated?
[22:21] <wido> I can't really say
[22:24] <sagewk> i think the real question is whether we can make it happen again. now we have an easy to check telltale sign of badness (0-length pglog* in snap_*/meta) to look for
[22:25] <wido> sagewk: Having the 0-byte pginfo also in the snap seemed pretty weird, that should be more than just the power loss
[22:25] <sagewk> you can work around the current osdmap crashes by copying the object file from another osd into the most recent snap_* dir and restarting (make sure you preserve xattrs).
[22:25] <wido> the corrupted osdmap could be a power loss
[22:26] <sagewk> they're both the same thing. my guess is a power failure (or btrfs bug) allowed the bad osdmap or bad pginfo to get into snap_*, and once it's there (i.e., committed) it'll stick around forever
[22:26] <wido> sagewk: I've given the current logging mechanism a little thought; the problem you have right now is that, although you want to, running with full debugging on is impossible
[22:27] <wido> What about a "buffer", in a particular transaction you buffer a few lines which COULD be useful if something further along the road goes bad
[22:27] <sagewk> yeah. in this case the only potentially related logging is in the filestore.. let me see what levels would be helpful
[22:28] <wido> If that goes wrong, you then flush those lines towards the log
[22:28] <wido> preventing your logs from filling up with lines you don't need
[22:29] <wido> But I'll probably have to do a format again anyway, since I had 3 failed drives and 8 broken btrfs filesystems, leaving me to lose 26% of my OSDs
[22:29] <sagewk> debug filestore = 10 should be enough, if the problem is filestore's interaction with btrfs.
[22:30] <sagewk> wido: have you seen the slowdown christian was talking about on linux-btrfs?
[22:30] <sagewk> (ha, or have i asked you that already? having some deja vu here)
[22:30] <wido> sagewk: I didn't really notice it, I've been writing a lot of data, but that went with a steady 100Mbit
[22:31] <Tv> sagewk: i'm having serious trouble constructing a unit test that would show *any* difference in behavior between the two get_clients implementations :-(
[22:31] <wido> sagewk: No, you didn't. As soon as I have my new disks (somewhere this week) I'll test that
[22:33] <sagewk> tv: it might have a red herring.. i was fighting the bug in the previous commit at the same time
[22:33] <sagewk> tv: looking now the old one looks fine too
[22:33] <Tv> sagewk: that's what i'm thinking
[22:34] <Tv> sagewk: care to revert?
[22:34] <sagewk> sure
[22:35] <Tv> gregaf: commented out code committed to master branch makes Tv grumpy...
[22:36] <gregaf> ermm, did I leave something nasty in there?
[22:36] <Tv> not nasty, but still.. in manypools
[22:36] <sagewk> wido: he mentioned a (probably) good commit from btrfs-unstable (not yet upstream) that's probably a good place to start
[22:36] <gregaf> oh, yeah...
[22:36] <sagewk> i think tv was already grumpy before he saw commented code :)
[22:36] <Tv> hah
[22:36] <gregaf> my bad, I'll clean it up in my next push
[22:36] <wido> sagewk: I'll go for 3.1 with the new btrfs code and see what it does
[22:37] <gregaf> although I'm going to add more commented-out code on purpose this time :p
[22:37] <sagewk> wido: i'd do whatever is in btrfs-unstable (3.0 + stuff he merged for 3.1-rc1)
[22:38] <wido> sagewk: Uh, yeah, that's what I mean
[22:38] <sagewk> wido: :) cool just makin sure
[22:38] <wido> but about the logging, would that be feasible? Buffering some lines which might come handy when a operation fails
[22:39] <wido> right now you really have to crank up the logging really high to get some debug data, which is pretty impossible to run with all the time
[22:39] <wido> so hunting down a bug comes down to being able to reproduce it or being lucky
[22:39] <sagewk> wido: could be. it'll still burn a fair bit of cpu just logging to a buffer, but at least it won't be written out all the time. the trick would be knowing when/why to flush it..
[22:40] <gregaf> I think we looked at a Google paper/library that did this, but we didn't want to burn the dev time implementing it
[22:40] <wido> sagewk: Yes, it will use CPU and some memory
[22:40] <slang> what's the right way to go about adding a crashed mds back into the system?
[22:41] <sagewk> slang: just (re)start a cmds
[22:41] <slang> just restarting seems to make the crashed ..
[22:41] <slang> sagewk: maybe I have a stale config, but when I did that it went from standby-replay (before the crash) to active (after the restart)
[22:41] <wido> gregaf: They were talking about buffering some messages and writing them if you get into a particular state?
[22:42] <wido> I don't say it's easy, but when deployments get larger, you won't be running with full debugging the whole time
[22:42] <slang> also, is it possible to go from active to standby for a running mds?
[22:42] <gregaf> wido: yeah, I don't remember if the trigger was "anything bad happens, we call the flush_out command" or more like "on crash dump to disk", but they had a thing they could run to get pre-crash logging off of production systems
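
A minimal sketch of the buffered-logging idea being discussed (class and method names are made up; this is not Ceph's actual logging code):

    #include <deque>
    #include <iostream>
    #include <string>

    // Keep only the last N debug lines in memory; write them out only when
    // something actually goes wrong, instead of logging everything to disk.
    class DebugRingBuffer {
      std::deque<std::string> lines;
      size_t max_lines;
    public:
      explicit DebugRingBuffer(size_t max = 10000) : max_lines(max) {}

      void add(const std::string &line) {
        if (lines.size() >= max_lines)
          lines.pop_front();              // drop the oldest buffered line
        lines.push_back(line);
      }

      // Called when an operation fails: flush the buffered context to the log.
      void flush(std::ostream &out) {
        for (const auto &l : lines)
          out << l << '\n';
        lines.clear();
      }
    };
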
[22:43] <slang> the code has a set_state command in MDSMonitor, but I can't seem to get that working
[22:43] <gregaf> slang: no, you can't go active->standby
[22:43] <sagewk> slang: that's what should happen (if i'm understanding correctly). an active mds never goes to standby, though..
[22:45] <slang> so you can't assign states to mds servers? they just go to active up to the value of max_mds?
[22:45] <slang> and the rest are standby?
[22:46] <slang> It was my standby mds that crashed. when restarting it went to active
[22:46] <slang> but the active mds had not crashed
[22:46] <slang> so then I had +1 active mds
[22:47] <slang> which is fine, except that it sounds like I just want a single active mds for stability
[22:47] <wido> sagewk: You want to keep this one open? I currently have 27/40 OSDs online and that's it. I don't think it is going to recover
[22:47] <wido> but I do think my post on the ml could be interesting (the heartbeat thing)
[22:48] <wido> situation is still ongoing, they are still saying the other one is down
[22:48] <sagewk> slang: you must have had max_mds set to something > 1...
[22:48] <sagewk> slang: if there are fewer than the desired # of active mds's, they will join the cluster. otherwise they'll be standby and wait for a failure (or max_mds change)
[22:48] <gregaf> you need to go through a process to reduce the number of actives; it's not documented yet but it may be working, I forget
[22:49] <sagewk> wido: looking into it now
[22:49] <slang> sagewk: yeah I guess I did
[22:49] <sagewk> wido: not sure there's any other useful information to get out of your cluster about the other failure, though.. best we can do there is keep an eye out for those files. maybe a cron that searches for 0-length pginfo files in snap_*
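
A sketch of the kind of periodic check sagewk suggests; the OSD data path below is illustrative and would need to match the actual 'osd data' setting:

    # from cron: report any zero-length pginfo/pglog objects inside btrfs snapshots
    find /data/osd*/snap_*/meta -type f \( -name 'pginfo*' -o -name 'pglog*' \) -size 0
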
[22:51] <sagewk> wido: for the heartbeat thing, i just restructured the way those are handled to be a lot less brittle. is it possible to test that in the current cluster? (is the bad behavior something you can still trigger?)
[22:52] <wido> sagewk: When I'm up and running again I'll put up that cron
[22:52] <wido> sagewk: The heartbeat thing, it is something that came up
[22:52] <wido> Not sure what it will do when I restart osd.5
[22:53] <sagewk> wido: k. (i'm guessing it's an annoying race in the old code..)
[22:53] <wido> sagewk: That is something that came up. Leave it for now?
[22:54] <sagewk> yeah
[22:54] <wido> same goes for the 0-byte pginfo files? Put up the cron and see if it comes back again
[22:54] <slang> I fear that I've screwed my setup horribly
[22:54] <slang> mds e271: 3/3/1 up {0=bravo=up:resolve(laggy or crashed),1=bravo=up:resolve,2=alpha=up:resolve}, 1 up:standby-replay
[22:54] <sagewk> yeah
[22:54] <sagewk> slang: can you pastebin the mds.0 stack trace?
[22:55] <slang> sagewk: its running
[22:56] <sagewk> slang: sure? mon thinks mds.bravo is laggy/down...
[22:56] <slang> yeah its running
[22:56] <slang> 2011-08-09 15:55:52.597997 7fc23ab0f700 -- 192.168.101.13:6800/2327 >> 192.168.101.13:6800/31530 pipe(0x2689000 sd=41 pgs=0 cs=0 l=0).connect claims to be 192.168.101.13:6800/2327 not 192.168.101.13:6800/31530 - wrong node!
[22:56] <slang> those repeat every 20 secs or so
[22:56] <slang> 30 secs
[22:57] <slang> 15 secs
[22:57] <slang> sorry - can't subtract today
[22:57] <sagewk> :) in which daemon long? and who are pids 2327 and 31530 on node 101.13?
[22:57] <sagewk> s/long/log/
[22:57] <slang> that's in mds.bravo.log
[22:59] <slang> 2328 is the cmds bravo process
[22:59] <slang> no 31530 process running
[23:00] <slang> sagewk: I had restarted all processes though
[23:00] <slang> maybe shouldn't have done that
[23:01] <sagewk> is there a 31531? the pids in the addr are off by one
[23:03] <slang> nothing starting 315..
[23:04] <sagewk> slang: 'ceph osd dump -o- | grep 31530' ?
[23:04] <slang> nothing
[23:05] <slang> http://fpaste.org/tKWB/
[23:07] <sagewk> slang: weird. well in any case, if the mon thinks the mds is laggy its bc it's not getting beacons. what kind of logs does mds.bravo have?
[23:08] <slang> http://fpaste.org/Ck3T/
[23:10] <sagewk> oh! can you pastebin 'ceph mds dump -o -' ?
[23:10] <wido> sagewk: fyi, I restarted osd.5 and osd.19, they get back into that condition right away
[23:10] <wido> the heartbeat thing
[23:10] <sagewk> wido: with the old or new code?
[23:11] <wido> old code :-) Getting late here, I'll give the new code a try
[23:11] <slang> http://fpaste.org/dHNN/
[23:11] <wido> tomorrow I mean
[23:11] <sagewk> wido: k. if you see it tomorrow with the new code capture logs!
[23:11] <sagewk> thanks
[23:12] <slang> sagewk: monitor thinks two bravo's are running?
[23:12] <sagewk> slang: restart charlie without the 'standby for' options and it'll take over for mds0
[23:13] <sagewk> slang: it knows the old bravo is down, but there's noone to take over (mds.charlie is following rank 1)
[23:19] <slang> everything is active again
[23:19] <slang> sagewk: thanks
[23:19] <slang> http://fpaste.org/VMqV/
[23:19] <sagewk> slang; np.
[23:20] <slang> as I understand it, from here I can't go back to a single active mds
[23:20] <sagewk> slang: if you want to go back to a single mds you need to set max_mds back to 1 and then do 'ceph mds stop 1' and 'ceph mds stop 2'
[23:20] <slang> ah ok
[23:21] * lxo (~aoliva@83TAACS7Z.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[23:26] * lxo (~aoliva@19NAACY84.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:40] * slang (~slang@chml01.drwholdings.com) has left #ceph
[23:40] * aliguori (~anthony@32.97.110.59) Quit (Quit: Ex-Chat)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.