#ceph IRC Log


IRC Log for 2011-04-11

Timestamps are in GMT/BST.

[0:06] * MarkN (~nathan@ has joined #ceph
[0:07] * MarkN (~nathan@ has left #ceph
[1:15] * Meths_ (rift@customer800.pool1.unallocated-106-128.orangehomedsl.co.uk) has joined #ceph
[1:22] * Meths (rift@customer9880.pool1.unallocated-106-192.orangehomedsl.co.uk) Quit (Ping timeout: 480 seconds)
[2:08] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[2:12] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[3:26] * maswan (maswan@kennedy.acc.umu.se) Quit (Ping timeout: 480 seconds)
[3:31] * maswan (~maswan@kennedy.acc.umu.se) has joined #ceph
[3:50] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) has joined #ceph
[5:40] * maswan (~maswan@kennedy.acc.umu.se) Quit (Ping timeout: 480 seconds)
[5:53] * maswan (maswan@kennedy.acc.umu.se) has joined #ceph
[6:51] * mwodrich (~Terminus@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[7:02] * mwodrich (~Terminus@ip-66-33-206-8.dreamhost.com) has joined #ceph
[7:33] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[7:45] * maswan (maswan@kennedy.acc.umu.se) Quit (Ping timeout: 480 seconds)
[8:16] * maswan (maswan@kennedy.acc.umu.se) has joined #ceph
[8:25] * maswan (maswan@kennedy.acc.umu.se) Quit (Ping timeout: 480 seconds)
[8:35] * maswan (maswan@kennedy.acc.umu.se) has joined #ceph
[8:37] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[9:06] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[9:06] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit ()
[9:09] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:24] * djlee1 (~dlee064@des152.esc.auckland.ac.nz) has left #ceph
[9:42] * samsung (~samsung@ has joined #ceph
[10:13] * allsystemsarego (~allsystem@ has joined #ceph
[10:47] * Meths_ is now known as Meths
[10:59] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) has joined #ceph
[13:20] * Yoric_ (~David@did75-14-82-236-25-72.fbx.proxad.net) has joined #ceph
[13:27] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[13:27] * Yoric_ is now known as Yoric
[14:39] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:14] * gregorg_taf (~Greg@ has joined #ceph
[15:20] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[16:04] * samsung (~samsung@ Quit (Quit: Leaving)
[16:36] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[16:47] * greglap (~Adium@ has joined #ceph
[16:54] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[16:58] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[17:09] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[17:11] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[17:28] * Yoric_ (~David@did75-14-82-236-25-72.fbx.proxad.net) has joined #ceph
[17:34] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[17:34] * Yoric_ is now known as Yoric
[17:39] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[17:52] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:54] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[17:55] * gregaf1 (~Adium@ip-66-33-206-8.dreamhost.com) has left #ceph
[17:55] * gregaf1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:21] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:36] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[18:43] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[18:44] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) Quit (Quit: Yoric)
[18:47] * pombreda (~Administr@24-176-184-26.static.reno.nv.charter.com) has joined #ceph
[18:49] <pombreda> Howdy :)
[18:49] <pombreda> gregaf: the playground is not behaving , but you may be already aware of that ?
[18:49] <gregaf1> pombreda: I think sage is looking at it
[18:49] <pombreda> :)
[18:50] <gregaf1> an OSD started misbehaving last week and apparently turned up some issues in the recovery handling
[18:50] <pombreda> awesome :)
[18:51] <pombreda> gregaf1: this is the purpose of the playground after all
[18:52] <gregaf1> yeah, still makes us sad though :(
[18:57] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:03] <sagewk> going to do the standup after lunch when colin is back from his dr's appt
[19:05] <gregaf1> sagewk: do we have a time estimate? "after lunch" is pretty broad ;)
[19:05] <sagewk> 2
[19:05] <sagewk> unless he's back in 12:30 and everyone is here.. i have a 1-2 call
[19:05] <gregaf1> cool
[19:07] * pombreda (~Administr@24-176-184-26.static.reno.nv.charter.com) has left #ceph
[19:09] <Tv> sjust: hey for some reason sepia5/6/9 didn't get the same ssh keys as others -- i can't log in
[19:09] <Tv> sjust: easy to fix but is there something wonky with the install scripts?
[19:09] <sjust> ah, those are in the old cluster still
[19:09] <sjust> sepia1-10 I think
[19:36] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[19:46] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Connection reset by peer)
[19:50] * maswan (maswan@kennedy.acc.umu.se) Quit (Ping timeout: 480 seconds)
[19:52] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[19:53] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Ping timeout: 480 seconds)
[20:00] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[20:03] * maswan (maswan@kennedy.acc.umu.se) has joined #ceph
[20:09] <wido> is there any bug known about PG::replay_queued_ops ?
[20:09] <wido> After #996 I'm seeing all my OSD's crashing on PG::replay_queued_ops
[20:09] <wido> http://pastebin.com/PJ23yn1a
[20:10] <gregaf1> wido: I'm not aware of anything with replay_queued_ops specifically (sjust?), but we are having some replay issues on our playground right now that sage was looking at
[20:10] <gregaf1> unfortunately he's in meetings for several hours now
[20:11] <wido> oh, ok :) I'll hold off for now then
[20:13] * allsystemsarego_ (~allsystem@ has joined #ceph
[20:18] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Operation timed out)
[20:18] * allsystemsarego (~allsystem@ Quit (Read error: Connection reset by peer)
[20:22] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[20:35] <sjust> wido: I'm taking a look
[20:50] <wido> sjust: Ok, cool :) I've got 10 down OSD's due to it right now
[20:54] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[21:01] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[21:49] <gregaf1> Tv: is http://tracker.newdream.net/issues/906 still a valid issue as far as you know?
[21:50] <gregaf1> and was the error you saw on cfuse or the kclient?
[21:50] <Tv> gregaf1: haven't heard anyone change anything related, it's kclient
[21:50] <gregaf1> okay, thanks
[21:51] <sjust> wido: can I get to the logs from logger?
[21:53] * allsystemsarego_ (~allsystem@ Quit (Quit: Leaving)
[21:59] <gregaf1> Tv: hmm, I can't reproduce that bug locally
[21:59] <gregaf1> how is it being run?
[22:00] <Tv> gregaf1: autotest
[22:00] <gregaf1> yeah, but the ceph-autotest repo test for that just contains a reference to ceph_pjdfstest
[22:00] <gregaf1> which doesn't seem to be defined in the repo?
[22:00] <Tv> gregaf1: you mean with the ceph_ prefix?
[22:00] <Tv> gregaf1: that's defined in autotest itself
[22:01] <gregaf1> sorry, I meant the python file contains " self.job.run_test(
[22:01] <gregaf1> 'pjd_fstest',
[22:01] <gregaf1> dir=mnt,
[22:01] <gregaf1> )
[22:01] <gregaf1> "
[22:01] <gregaf1> that pjd_fstest is defined in autotest?
[22:01] <Tv> gregaf1: https://github.com/tv42/autotest/blob/ceph/client/tests/pjd_fstest/pjd_fstest.py
[22:01] <gregaf1> so we can't dig out the settings it uses? :(
[22:02] <Tv> just means you need to dig at the right spot
[22:02] <gregaf1> ah, excellent
[22:02] <Tv> there's just about nothing special there
[22:02] <gregaf1> didn't know there was a separate autotest repo we needed to look at
[22:03] <Tv> i'm not gonna re-implement bonnie etc results parsing
[22:03] <gregaf1> hehe
[22:03] <gregaf1> looks like maybe they're running a newer version of pjd than I have
[22:03] <gregaf1> that might explain it
[22:03] <Tv> the tarball is right there, grab it
[22:04] <gregaf1> wait, no, it's 20080816
[22:04] <gregaf1> oh, but prove is separate from the pjd stuff, blargh
[22:05] <bchrisman> those are pjd failures not related to clock synchronization?
[22:05] <Tv> gregaf1: prove is just in perl
[22:05] <Tv> bchrisman: all but one
[22:05] <Tv> i'm syncing sepia clocks & rerunning the test
[22:05] <gregaf1> there's something about lchown not changing ownership on symlinks
[22:05] <bchrisman> Tv: odd.. we're running those tests but haven't seen failures outside of clock synch..
[22:05] <gregaf1> in the bug report
[22:06] <gregaf1> I can't reproduce it on my box yet, though
[22:06] <Tv> bchrisman: there is a chance i misread the darn tests, it's so hard to track the line numbers right
[22:06] <Tv> TAP sucks :(
[22:06] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[22:06] <bchrisman> hmm… I put something in.. lemme look
[22:06] <gregaf1> I wonder if it's just wrong
[22:06] <gregaf1> maybe it's a possible conclusion but the guessing is wrong here?
[22:06] <gregaf1> debugging chown/00.t test 135-137, line 274-
[22:06] <gregaf1> • test code
[22:06] <gregaf1> expect 0 symlink ${n1} ${n0}
[22:06] <gregaf1> expect 0 lchown ${n0} 65534 65533
[22:06] <gregaf1> ctime1=`${fstest} lstat ${n0} ctime`
[22:06] <gregaf1> sleep 1
[22:06] <gregaf1> expect 0 -u 65534 -g 65532 lchown ${n0} 65534 65532
[22:07] <gregaf1> expect 65534,65532 lstat ${n0} uid,gid
[22:07] <gregaf1> ctime2=`${fstest} lstat ${n0} ctime`
[22:07] <gregaf1> test_check $ctime1 -lt $ctime2
[22:07] <gregaf1> • TODO conclusion 1: lchown does not change user/group of the symlink?
[22:07] <gregaf1> • TODO conclusion 2: lchown does not update ctime of the symlink
[22:07] <gregaf1> I admit the question mark doesn't fill me with confidence
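Note: the steps gregaf1 pasted from chown/00.t can be replayed roughly as below. This is a minimal sketch, not the pjd test itself. The function name `lchown_ctime_check` and its `delay` parameter are made up for illustration, and it assumes an unprivileged user, so the link is re-chowned to the uid/gid it already has instead of switching between 65534:65533 and 65534:65532.

```python
import os
import tempfile
import time


def lchown_ctime_check(delay=1.0):
    """Replay the symlink steps from chown/00.t without needing root.

    Assumption: the link is re-chowned to its current uid/gid, which an
    unprivileged owner is allowed to do; the real test changes the group.
    """
    with tempfile.TemporaryDirectory() as workdir:
        link = os.path.join(workdir, "n0")
        # A dangling target is fine: lstat and lchown act on the link itself.
        os.symlink("n1", link)
        before = os.lstat(link)
        time.sleep(delay)
        os.lchown(link, before.st_uid, before.st_gid)
        after = os.lstat(link)
        return {
            # Counterpart of "expect 65534,65532 lstat ${n0} uid,gid"
            "owner_kept": (after.st_uid, after.st_gid)
            == (before.st_uid, before.st_gid),
            # Counterpart of "test_check $ctime1 -lt $ctime2"
            "ctime_advanced": after.st_ctime_ns > before.st_ctime_ns,
        }


if __name__ == "__main__":
    print(lchown_ctime_check())
```

Running it inside a Ceph mount (by pointing `tempfile` at one via TMPDIR) and on a local filesystem gives a quick side-by-side comparison. Whether a same-owner lchown bumps ctime may itself vary by filesystem, so a False for `ctime_advanced` is a hint rather than proof of the bug.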
[22:07] <cmccabe> hi guys
[22:07] <gregaf1> hi
[22:07] <cmccabe> gregaf: I thought symlinks didn't have certain attributes
[22:07] <cmccabe> gregaf: I forget exactly which ones those were
[22:08] <Tv> gregaf1: tracing the test numbers was a btch
[22:08] <gregaf1> cmccabe: I'm just looking at the test output, don't ask me about it
[22:08] <gregaf1> Tv: yeah, believe me I know
[22:08] <Tv> cmccabe: lchown exists as a syscall. deal.
[22:08] <gregaf1> I had to trace them all back when we first ran the test and it only passed like 90%
[22:09] <Tv> *finally* all sepia nodes are either marked down, or have clock synced
[22:09] <Tv> that's a pain
[22:10] <cmccabe> oh, that's lchmod that doesn't exist on linux
[22:10] <gregaf1> bchrisman: did you happen to add getattr to libceph?
[22:10] <bchrisman> Tv: not sure it'd help you, but I patched our pjdfstest's misc.sh for tracking down test failures: http://pastebin.com/h1tgypgC
[22:11] <bchrisman> I also reformatted output cuz we're not using prove, so that wouldn't help… but I hacked in a call to 'caller' in the output for failure modes
[22:11] <Tv> bchrisman: thx, will plug that in if it keeps giving me trouble
[22:11] <bchrisman> gregaf1: yes I have...
[22:11] <gregaf1> did you send a patch?
[22:11] <bchrisman> gregaf1: I haven't because I haven't tested it yet
[22:11] <gregaf1> ah, k
[22:12] <bchrisman> gregaf1: is it critical path stuff right now?
[22:12] <gregaf1> no, not really
[22:12] <gregaf1> just running down the list of things for our .27 sprint
[22:12] <bchrisman> gregaf1: it's part of the samba-vfs effort I've been working on..
[22:12] <gregaf1> it's assigned to you anyway so if you don't want to submit it we'll just push it back a release or something ;)
[22:12] <bchrisman> gregaf1: those changes are pretty small… okay...
[22:12] <bchrisman> cool :)
[22:13] <gregaf1> yeah, but we're lazy and we like having external authors in our git tree
[22:13] <gregaf1> makes us feel more like a real boy
[22:13] <bchrisman> good enough… will submit when I've got it vaguely tested
[22:16] <Tv> 13:11:43 DEBUG| [stdout] /usr/local/autotest/tests/pjd_fstest/src/tests/chown/00.t .....
[22:16] <Tv> 13:11:43 DEBUG| [stdout] not ok 135
[22:16] <Tv> 13:11:43 DEBUG| [stdout] not ok 136
[22:16] <Tv> 13:11:43 DEBUG| [stdout] not ok 137
[22:16] <Tv> 13:11:43 DEBUG| [stdout] Failed 3/171 subtests
[22:16] <Tv> gregaf1: still failing, with clocks in decent sync
[22:17] <Tv> within ~2 seconds of each other
[22:18] <Tv> expect 0 mkdir ${n0} 0755
[22:18] <Tv> expect 0 chown ${n0} 65534 65533
[22:18] <Tv> ctime1=`${fstest} stat ${n0} ctime`
[22:18] <Tv> sleep 1
[22:18] <Tv> expect 0 -u 65534 -g 65532 chown ${n0} 65534 65532
[22:18] <Tv> those should be 135-137 expects that fail
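Note: tracing "not ok N" back to a line in a .t file, as discussed here, could be sketched as follows. Both helper names are hypothetical. The mapping assumes every line starting with `expect` or `test_check` emits exactly one TAP result and that the script has no loops or conditionals, which may not hold for every pjd test file.

```python
import re

# Matches TAP result lines, with or without the autotest "DEBUG| [stdout]" prefix.
TAP_LINE = re.compile(r"(?:^|\s)(not ok|ok)\s+(\d+)\b")


def failed_tests(log_text):
    """Return the numbers from all 'not ok N' lines, in order of appearance."""
    failures = []
    for line in log_text.splitlines():
        match = TAP_LINE.search(line)
        if match and match.group(1) == "not ok":
            failures.append(int(match.group(2)))
    return failures


def number_assertions(script_text):
    """Naively number each 'expect' / 'test_check' line of a .t script.

    Assumption: one TAP result per such line, no loops or conditionals.
    Returns {test_number: (line_number, stripped_line)}.
    """
    numbered = {}
    count = 0
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(("expect ", "test_check ")):
            count += 1
            numbered[count] = (lineno, stripped)
    return numbered


if __name__ == "__main__":
    log = """13:11:43 DEBUG| [stdout] not ok 135
13:11:43 DEBUG| [stdout] not ok 136
13:11:43 DEBUG| [stdout] not ok 137
13:11:43 DEBUG| [stdout] Failed 3/171 subtests"""
    print(failed_tests(log))
```

Feeding the full text of chown/00.t to `number_assertions` and looking up the failing numbers would show which expects a static count points at. If that disagrees with a manual trace, the script probably has control flow this sketch ignores.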
[22:19] <gregaf1> Tv: hmm, something is different between my UML environment and your autotests...
[22:20] <Tv> gregaf1: 3 machines each with mon+mds, 2 with an osd each, client on separate machine, kclient -- the standard autotest setup
[22:20] <Tv> gregaf1: worst case, it's timing related :(
[22:20] <gregaf1> yeah
[22:21] <gregaf1> I've just been running it with one of each daemon, I'll try it with your numbers
[22:21] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Connection reset by peer)
[22:21] <Tv> gregaf1: you should really start using autotest ;)
[22:22] <gregaf1> oh, are the logs accessible for that run?
[22:22] <Tv> see the bug
[22:22] <gregaf1> I just wanted an environment where I could do rapid iterations
[22:22] <bchrisman> are those multi-active mds?
[22:22] <gregaf1> the link doesn't seem to be valid anymore
[22:22] <Tv> gregaf1: crap, go one up & see there
[22:23] <gregaf1> "one up"?
[22:23] <Tv> autotest has an annoying tendency to write to .log, and when done re-read & parse that into .DEBUG etc
[22:23] <Tv> gregaf1: parent dir
[22:23] <Tv> gregaf1: oh just to clarify: look at the *new* stuff
[22:23] <Tv> at the bottom of the ticket
[22:23] <gregaf1> oh, heh
[22:23] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[22:23] <gregaf1> hadn't reloaded
[22:24] <gregaf1> Tv: sorry, I meant debug logs for the server daemons/kclient
[22:25] <gregaf1> err, or are those in the weird 409-tv/group0 thing?
[22:25] <gregaf1> this hierarchy is strange
[22:25] <Tv> gregaf1: they'll show up after the test is done
[22:25] <Tv> [ 9440.020009] ceph: mds0 hung
[22:25] <Tv> [ 9658.080151] libceph: mds0 connection failed
[22:25] <Tv> crap
[22:26] <Tv> gregaf1: im me your ssh public key and i'll get you logged on into these machines
[22:26] <Tv> the mds is borking
[22:27] <Tv> 2011-04-11 13:27:28.060628 mds e13: 3/3/3 up {0=up:active(laggy or crashed),1=up:active,2=up:active}
[22:27] <gregaf1> huh
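Note: a status line like the one Tv pasted can be picked apart with a few lines of Python to spot which rank is unhealthy. This is a sketch based only on the single line shown above. The function names are hypothetical and the parsing assumes the `{rank=state,...}` layout seen there, not any documented Ceph output format.

```python
import re


def parse_mds_states(status_line):
    """Map each MDS rank to its state string from an 'mds eN: ... {...}' line."""
    braces = re.search(r"\{(.*)\}", status_line)
    if braces is None:
        return {}
    states = {}
    for entry in braces.group(1).split(","):
        if not entry.strip():
            continue  # tolerate an empty "{}" map
        rank, _, state = entry.partition("=")
        states[int(rank)] = state.strip()
    return states


def laggy_ranks(states):
    """Ranks whose state carries the 'laggy or crashed' annotation."""
    return sorted(
        rank for rank, state in states.items() if "laggy or crashed" in state
    )


if __name__ == "__main__":
    line = (
        "2011-04-11 13:27:28.060628 mds e13: 3/3/3 up "
        "{0=up:active(laggy or crashed),1=up:active,2=up:active}"
    )
    print(laggy_ranks(parse_mds_states(line)))  # prints [0]
```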
[22:28] <sagewk> bchrisman: fyi jeremy allison (from samba.org) saw my talk at collab summit where I mentioned there was samba vfs glue in the works. they're interested in taking a look when it's working
[22:29] <bchrisman> sagewk: cool.. I'll hop on the samba list soon.. yeah.. hopefully I can get these other issues out of the way quickly and get back to that.
[22:31] <Tv> gregaf1: i do think this mds thing is unrelated to the pjd bug
[22:40] <cmccabe> is anyone else having trouble connecting to flab.ceph.dreamhost.com (and some other ceph.dreamhost.com machines)?
[22:40] <Tv> FYI STAFF: 10.3.14.* network is broken
[22:41] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[22:42] <bchrisman> gregaf1: so those stats messages are definitely not related directly to my copy… there are some things occurring previously (scripting stuff from tools we maintain) which must be triggering that stats issue
[22:43] <bchrisman> I'm figuring I'll need full mds logging to figure out how it's getting into that broken state?
[22:46] <gregaf1> bchrisman: almost certainly, yeah
[22:49] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Connection reset by peer)
[22:51] <Tv> 10.3.14.x is back
[22:52] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[22:52] <Tv> err, not
[22:52] <Tv> .1 is back
[22:52] <Tv> but the net behind it is not
[22:52] <gregaf1> do we know what's going on?
[22:53] <Tv> gregaf1: if it doesn't recover fairly quickly, we need to wake up ops, but last time this happened i was told that *now* there is monitoring on it..
[22:53] * Tv looks at sagewk
[22:54] <sagewk> i can get to flab?
[22:55] <Tv> hmm
[22:55] <Tv> does work
[22:55] <cmccabe> yakko <-> has been restored for me too
[22:55] <Tv> and the rest too
[22:55] <Tv> it just came up
[22:56] <Tv> "Last login: Mon Apr 11 13:55:33 2011 from freebsd.paramagnetic.com" ?
[22:56] <Tv> that does not sound like a good thing
[22:56] <sagewk> lots of people leak 10.x addresses into their reverse dns
[22:56] <gregaf1> Tv: I've seen that a few times, I suspect it's just a DNS lookup thing due to multiple IPs
[22:57] <Tv> that's a dreamhost screwup though
[22:58] <sagewk> yep. well, it's a custom dns record the user set up
[22:58] <Tv> sagewk: you're saying "users" control reverse lookup of arbitrary ip addresses on dreamhost dns recursors
[22:58] <gregaf1> anyway, the mds crash is due to a scatterstat bug being detected by check_rstats
[22:58] <Tv> sagewk: that is *not* a good idea
[22:58] * verwilst (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[22:58] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[22:59] <cmccabe> it does seem like you could do some nasty things with control of DNS
[23:00] <sagewk> != 'control of dns' i'll explain in the meeting
[23:00] <sagewk> now?
[23:00] <gregaf1> yeah
[23:00] <cmccabe> ok
[23:00] <Tv> gregaf1: it seems the pjd thing didn't repro this time: http://autotest.ceph.newdream.net/results/410-tv/group0/sepia56.ceph.dreamhost.com/debug/client.0.log
[23:01] <gregaf1> I think it's just being picky with the ctime
[23:01] <Tv> gregaf1: but the expects that are failing should be these:
[23:02] <Tv> expect 0 mkdir ${n0} 0755
[23:02] <Tv> expect 0 chown ${n0} 65534 65533
[23:02] <Tv> ctime1=`${fstest} stat ${n0} ctime`
[23:02] <Tv> sleep 1
[23:02] <Tv> expect 0 -u 65534 -g 65532 chown ${n0} 65534 65532
[23:02] <Tv> no ctime comparison there
[23:43] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[23:49] * MarkN (~nathan@ has joined #ceph
[23:51] * MarkN (~nathan@ has left #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.