#ceph IRC Log


IRC Log for 2011-08-10

Timestamps are in GMT/BST.

[0:10] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[0:24] * Juul (~Juul@node-u2p.camp.ccc.de) Quit (Ping timeout: 480 seconds)
[0:35] * Juul (~Juul@node-vjn.camp.ccc.de) has joined #ceph
[0:39] * Juul (~Juul@node-vjn.camp.ccc.de) Quit ()
[0:39] * Juul (~Juul@node-vjn.camp.ccc.de) has joined #ceph
[0:43] <Tv> sagewk: teuthology pause on error done: http://tracker.newdream.net/issues/1291
[0:44] <sagewk> tv: yay!
[0:45] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[0:46] <sagewk> tv: did you look at that dumb rbd thing yet?
[0:46] <Tv> right now
[1:01] <Tv> well it's quite simple
[1:01] <Tv> rbd mounts its stuff in mnt.client.0, and cleans that up perfectly
[1:01] <Tv> but the workunits are run against mnt.0
[1:01] <Tv> and apparently create the dir on their own??
[1:02] <sagewk> ah
[1:02] <sagewk> cfuse and kclient do mnt.$id.. rbd should probably just do the same
[1:02] <sagewk> and workunit shouldn't create the dir?
[1:02] <Tv> bad use of install -d
[1:02] <Tv> yes
[1:02] <sagewk> yeah
[1:02] <Tv> i'll fix it
[1:03] <sagewk> cool tnx
[1:03] <Tv> kill the install -d first so i see it start to fail
[1:05] <Tv> ahh it uses sudo install --owner= because the mounts start off as writable to root only
[1:05] <Tv> gonna replace that with a mkdir && chown, i guess
[1:06] <Tv> ohh even better cd .. && sudo install, lets me avoid the ambiguity of umask
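The fix Tv settles on here can be sketched roughly as follows; the paths and mode are assumptions, and sudo is commented out so the sketch runs standalone (in teuthology it is needed because the freshly mounted directory starts off writable by root only):

```shell
# Sketch of the discussed fix: run `install -d` from the parent
# directory to create the mount point with an explicit mode (and, in
# the real task, owner) in one step, sidestepping the caller's umask.
parent=/tmp/demo-parent
mkdir -p "$parent"
cd "$parent"
# real version: sudo install -d -m0755 --owner="$USER" mnt.client.0
install -d -m0755 mnt.client.0
ls -ld mnt.client.0
```

Compared with `mkdir && chown`, `install -d` is atomic about mode and ownership, which is why the conversation lands on it.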
[1:17] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[1:25] <Tv> sagewk: rbd task fixed
[1:25] <sagewk> tv: yay thanks
[1:32] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[1:33] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:34] * greglap (~Adium@ has joined #ceph
[1:55] <greglap> anybody doing anything with sepia63? it's not accessible and I've locked it as such, but if I don't hear anything in the next 30 minutes I'm going to power cycle it
[2:00] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) Quit (Remote host closed the connection)
[2:17] * cmccabe (~cmccabe@ has left #ceph
[2:27] * Dantman (~dantman@ Quit (Ping timeout: 480 seconds)
[2:28] * greglap (~Adium@ Quit (Quit: Leaving.)
[2:33] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:54] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[3:18] * huangjun (~root@ has joined #ceph
[3:18] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:27] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[3:29] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[3:31] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[3:35] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[3:42] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[3:42] * Juul (~Juul@node-vjn.camp.ccc.de) Quit (Quit: Leaving)
[3:50] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[4:22] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[4:33] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[4:43] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[4:51] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[4:52] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[4:52] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit ()
[5:39] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[5:40] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[7:57] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[8:24] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[8:50] * huangjun (~root@ Quit (Ping timeout: 480 seconds)
[9:00] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[9:21] * huangjun (~root@ has joined #ceph
[10:45] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) has joined #ceph
[11:02] * lxo (~aoliva@19NAACY84.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[11:02] * lxo (~aoliva@09GAAF054.tor-irc.dnsbl.oftc.net) has joined #ceph
[11:54] * tjikkun (~tjikkun@195-240-187-63.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[12:21] * jantje_ (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[12:21] * huangjun_ (~root@ has joined #ceph
[12:23] * huangjun (~root@ Quit (Ping timeout: 480 seconds)
[12:40] * jantje (~jan@paranoid.nl) has joined #ceph
[13:00] * jantje_ (~jan@paranoid.nl) has joined #ceph
[13:02] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[13:54] * huangjun_ (~root@ Quit (Quit: Lost terminal)
[14:18] * huangjun (~root@ has joined #ceph
[14:28] * aliguori (~anthony@ has joined #ceph
[14:44] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[14:46] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[14:47] * aliguori (~anthony@ has joined #ceph
[14:50] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[15:03] <pmjdebruijn> I noticed ceph has Debian package, however, I can't seem to locate the package sources
[15:03] <pmjdebruijn> except for 0.27...
[15:03] <pmjdebruijn> I can't find any current .orig.tar.gz/.debian.tar.gz/.diff.gz
[15:56] <wido> pmjdebruijn: You can simply grab the source from the website
[16:00] <pmjdebruijn> huh?
[16:00] <pmjdebruijn> there is no debian directory in the source
[16:01] <wido> pmjdebruijn: Oh, I mean the Ceph source, the packages are being built from that
[16:01] <pmjdebruijn> yeah but how?
[16:01] <wido> it contains the debian directory
[16:01] <pmjdebruijn> nope it doesn't
[16:03] <wido> pmjdebruijn: If you do a git clone you can do a checkout with the specific version
[16:03] <wido> and then you'll get the same source
[16:03] <pmjdebruijn> oh so it is in the tree?
[16:03] <wido> Yes, it is
[16:03] <wido> git checkout v0.31 for example
[16:04] <pmjdebruijn> ah
[16:04] <pmjdebruijn> there it is
[16:04] <pmjdebruijn> doh
[16:04] <pmjdebruijn> thanks
[16:07] <wido> np
[16:31] <pmjdebruijn> though it's still odd there is no .diff.gz or something on the ceph repo
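The workflow wido describes can be sketched like this. The real commands would be `git clone git://github.com/ceph/ceph.git && cd ceph && git checkout v0.31`, after which `debian/` is in the tree for `dpkg-buildpackage`; the demo below uses a throwaway local repo so the sketch runs without network access:

```shell
# Demo of "check out a release tag to get the debian/ directory":
# build a tiny local repo with a tag standing in for the ceph repo.
rm -rf /tmp/demo-repo && mkdir /tmp/demo-repo && cd /tmp/demo-repo
git init -q .
git config user.email demo@example.com
git config user.name demo
mkdir debian && echo 'Source: demo' > debian/control
git add . && git commit -qm 'import'
git tag v0.31
git checkout -q v0.31     # same command works against the real ceph tags
ls debian/control
```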
[16:34] * huangjun (~root@ Quit (Quit: leaving)
[16:48] * wido (~wido@rockbox.widodh.nl) Quit (Quit: Changing server)
[17:08] * Juul (~Juul@node-sju.camp.ccc.de) has joined #ceph
[17:40] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[17:53] * greglap (~Adium@ has joined #ceph
[18:00] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:06] * Juul (~Juul@node-sju.camp.ccc.de) Quit (Ping timeout: 480 seconds)
[18:33] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:34] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[18:53] * greglap (~Adium@ Quit (Quit: Leaving.)
[19:05] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:07] * aliguori (~anthony@ Quit (Remote host closed the connection)
[19:08] * wido (~wido@rockbox.widodh.nl) has joined #ceph
[19:31] * aliguori (~anthony@ has joined #ceph
[19:31] <wido> Has anyone here experienced the btrfs message "open_ctree failed"?
[19:59] * MK_FG (~MK_FG@ Quit (Quit: o//)
[20:00] * MK_FG (~MK_FG@ has joined #ceph
[20:58] * slang (~slang@chml01.drwholdings.com) has joined #ceph
[20:59] <slang> I tried to take one osd out of my cluster, and it caused a bunch of other osds to crash
[20:59] <slang> when I try to restart them, it crashes in the same spot
[20:59] <slang> ../../src/osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_hb_inst(int)', in thread '0x7f991ab8d700'
[20:59] <slang> ../../src/osd/OSDMap.h: 506: FAILED assert(is_up(osd))
[20:59] <slang> ceph version (commit:)
[20:59] <slang> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x89) [0x8da625]
[20:59] <slang> 2: (OSDMap::get_hb_inst(int)+0x7d) [0x8486f3]
[20:59] <slang> 3: (OSD::_add_heartbeat_source(int, std::map<int, unsigned int, std::less<int>, std::allocator<std::pair<int const, unsigned int> > >&, std::map<int, utime_t, std::less<int>, std::allocator<std::pair<int const, utime_t> > >&, std::map<int, Connection*, std::less<int>, std::allocator<std::pair<int const, Connection*> > >&)+0x103) [0x81e589]
[20:59] <slang> 4: (OSD::update_heartbeat_peers()+0x363) [0x81ec23]
[21:00] <slang> ack
[21:00] <slang> http://fpaste.org/siBa/
[21:01] <slang> full trace: http://fpaste.org/gNzy/
[21:02] <slang> sagewk: btw, this is with the wip-heartbeats branch
[21:04] <wido> slang: I've seen that one also today
[21:04] <wido> but I was running a mix of old and new code
[21:04] <wido> updating ALL the OSDs fixed it for me
[21:04] <sagewk> wido: open_ctree() failure is what you get with most btrfs corruptions i've seen
[21:04] <sagewk> slang: oh, i know what the problem is.
[21:05] <wido> sagewk: Yes, I know. But I'm nearly there with my recovery, so I hoped to fix it somehow
[21:05] <wido> googling gave me some hints, like super-select
[21:06] <sagewk> wido: i haven't tried any heroic efforts.. i seem to remember some suggestions from chris on the list a month or two back though.
[21:06] <sagewk> wido: in theory he'll have a preliminary fsck available in a week or two, but we'll see...
[21:07] <wido> sagewk: Ah, ok. I'll dive in the ml
[21:07] <wido> tried the latest btrfs-tools btw
[21:07] <wido> The new heartbeat code seems much better btw!
[21:08] <wido> Even during heavy recovery on my Atoms, I haven't seen flapping
[21:08] <sagewk> wido: yay!
[21:09] <sagewk> slang: pushed 07837c9bfbec6953a1a40af75cc71d5f6adc3734 which should fix your problem
[21:12] * Juul (~Juul@ has joined #ceph
[21:13] <slang> sagewk: thanks!
[21:17] <wido> sagewk: I've been thinking about a patch in the init script where it retries a unmount of the filesystem if it fails. Sometimes a cosd is still flushing data while the init script tries to unmount
[21:17] <wido> If you give a reboot to your server it could lead to a not cleanly unmounted filesystem
[21:17] <wido> If you try the umount about 5 sec later it goes fine, no more "in use" message
[21:18] <sagewk> wido: i suspect the right fix is to make sure the cosd is fully shut down before doing the umount. the pid check is probably broken
[21:18] <wido> sagewk: I expected that to be your answer :-) But you are right
[21:19] <sagewk> wido: the same thing is pbly also the cause of weirdness when doing restart (sometimes we get lock errors)
[21:19] <sagewk> iirc it's looking at /proc/$pid and waiting for that to disappear.. that probably isn't the right thing to do?
[21:21] <wido> sagewk: Yes, that seems the right way. Not sure what happens in those few seconds
[21:21] <wido> In what state the OSD goes, zombie or D
[21:26] <sagewk> dunno
[21:26] <sagewk> either way we should wait for the process to disappear entirely
[21:28] <Tv> this is why debian created start-stop-daemon (and why ubuntu created upstart)
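The shutdown ordering sagewk suggests could look roughly like this in an init script; the function and variable names are made up, and `start-stop-daemon --retry` does the same thing more robustly:

```shell
# Hedged sketch: signal the daemon, then poll until the process has
# disappeared entirely before attempting the umount.
wait_for_exit() {
    pid=$1
    tries=${2:-30}                  # seconds to wait before giving up
    while [ "$tries" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        tries=$((tries - 1))
    done
    ! kill -0 "$pid" 2>/dev/null    # nonzero if still alive
}

# demo with a short-lived background process standing in for cosd;
# a real script would follow the wait with the umount, e.g.
#   kill "$pid" && wait_for_exit "$pid" && umount /data/osd.$id
sleep 2 & bgpid=$!
wait_for_exit "$bgpid" 10 && echo "pid $bgpid gone, safe to umount"
```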
[21:38] * Juul (~Juul@ Quit (Ping timeout: 480 seconds)
[21:40] <Tv> ok, poke holes in this: i have a commit for teuthology that makes choosing kernels work even when the desired kernel is not the one with the highest version number (of all installed)
[21:40] <Tv> it does this by adding a grub config that sets a specific kernel as the default
[21:41] <Tv> so.. that's "sticky", but doing anything else would be very frustrating
[21:41] <Tv> i still need to test behavior if you remove that particular deb
[21:41] <Tv> but what this means is, to guarantee any sanity, now *all* jobs should give "kernel: branch: master" in config
[21:41] <Tv> and potentially suffer reboots
[21:41] <Tv> how does that sound
[21:42] <joshd> not much different from now, if you don't specify a kernel
[21:42] <Tv> oh yeah i tested the behavior if the chosen menu entry no longer exists -- it falls back to the usual "boot the greatest"
[21:42] <Tv> so that's all good
[21:43] <joshd> that seems fine to me
[21:44] <Tv> bleh different awk versions biting me
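The grub side of what Tv describes might look roughly like this; the menu-entry title and demo path are assumptions, and on a real 2011-era Ubuntu target the edit would go in /etc/default/grub followed by update-grub:

```shell
# Hedged sketch of pinning a specific installed kernel as the grub2
# default instead of the highest version number.
cat > /tmp/demo-grub-default <<'EOF'
# pick a specific installed kernel instead of the highest version
GRUB_DEFAULT="Ubuntu, with Linux 2.6.38-8-generic"
EOF
grep GRUB_DEFAULT /tmp/demo-grub-default
# on the real machine: sudo update-grub && sudo reboot
```

This is "sticky" in exactly the sense discussed above: the pin survives reboots, and if the named menu entry disappears grub falls back to its usual highest-version default.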
[21:49] * Juul (~Juul@slim.visitor.camp.ccc.de) has joined #ceph
[21:52] <Tv> joshd: it seems the waiting for reboot always times out for me..
[21:53] <gregaf> man, reboot -f doesn't send anything back across an SSH connection :(
[21:53] <Tv> it's a hard reboot
[21:54] <Tv> *nothing* is done
[21:54] <gregaf> yeah, it kinda makes sense
[21:54] <gregaf> it's just waiting for the local SSH to realize the pipe is broken takes *forever*
[21:54] <Tv> ahh.. probably better to terminate it yourself
[21:55] <Tv> then again, if you terminate too early, reboot doesn't get run
[21:55] <wido> gregaf: have you enabled your KeepAlive in the ssh_config?
[21:55] <wido> I have that to prevent crappy NAT routers closing my SSH connections all the time
[21:55] <Tv> wido: waiting a minute is undesirable too
[21:55] <wido> Tv: ah, k
[21:55] <gregaf> wido: not sure if I do, but I don't want to count on it for our QA framework ;)
[21:56] <Tv> ssh_config is ignored anyway; this is paramiko not openssh
[21:56] <gregaf> I suppose I can just forget the processes and leave a larger timeout while waiting for them to come back up
[21:57] <Tv> you could do some elaborate dance like wrap the remote in something that disconnects from controlling terminal (to guard against SIGHUP) and then writes to stdout to signal readiness
[21:57] <Tv> and then execs reboot -f
[21:57] <Tv> err maybe waits for something on stdin and *then* execs
[21:57] <Tv> still a chicken-and-egg with closing ssh connection too early
[21:58] <Tv> no wait it isn't
[21:58] <Tv> err
[21:59] <Tv> yeah.. the reboot -f will never know when ssh data has been flushed to network
[21:59] <Tv> crap
[21:59] <Tv> something like
[21:59] <gregaf> it's really not that big a deal to just background the process and include it as part of the timeout period
[21:59] <Tv> yeah
[21:59] <Tv> INFO:teuthology.task.kernel:Re-opening connections...
[22:00] <Tv> ERROR:teuthology.task.kernel:unknown socket error: timeout('timed out',)
[22:00] <Tv> harrumph
[22:00] <gregaf> is it actually timing out or is it getting some other problem?
[22:00] <Tv> both
[22:00] <gregaf> the reconnect task was conflating a bunch of things, I cleaned it up a little bit locally for my nuke work
[22:00] <Tv> sometimes the reconnect takes too long, this time it got a socket.error it wasn't prepared to handle
[22:03] <joshd> I put the extra debugging there because there's some socket error that doesn't have an error code, and I wasn't able to reproduce it again
[22:04] <joshd> if you ever get 'weird socket error without error code' let me know
[22:05] <Tv> this one was 'unknown socket error'
[22:05] <Tv> the fix was trivial
[22:06] <Tv> though i still say the default timeout is likely too small
[22:06] <joshd> that's just a bad error message - it's the same as the timeout path
[22:06] <Tv> at least sepia86 takes longer to reboot
[22:06] <joshd> I have no problem increasing the timeout to 5 minutes or something
[22:06] <Tv> yeah 300 is definitely enough for me
[22:07] <Tv> didn't test in between
[22:07] <Tv> i'll commit that too
[22:07] <Tv> one more test run
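gregaf's plan above (background the hard reboot, fold the delay into one reconnect timeout) could be sketched like this in plain ssh; teuthology itself does the equivalent with paramiko, the helper name here is made up, and the 300-second ceiling is the value Tv settles on:

```shell
# Hedged sketch: poll until sshd answers again or the deadline passes.
wait_for_ssh() {
    host=$1
    timeout=${2:-300}           # 300s covered the slow-booting sepia86
    deadline=$(( $(date +%s) + timeout ))
    until ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" true 2>/dev/null
    do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 5
    done
}

# usage: fire reboot -f without waiting on the doomed ssh session,
# then poll until the host is reachable again
#   ssh sepia86 'sudo reboot -f' >/dev/null 2>&1 &
#   wait_for_ssh sepia86 300 && echo "host is back up"
```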
[22:19] <sagewk> wido: still around?
[22:20] <sagewk> wido: just wanna mention #1032 in case you see that again. just waiting on more info
[22:21] * aliguori (~anthony@ Quit (Remote host closed the connection)
[22:41] <Tv> aaand teuthology can downgrade kernels
[22:42] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) Quit (Quit: ChatZilla 0.9.87 [Firefox 4.0.1/20110609040224])
[22:43] * Juul (~Juul@slim.visitor.camp.ccc.de) Quit (Quit: Leaving)
[22:48] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.