#ceph IRC Log


IRC Log for 2011-09-02

Timestamps are in GMT/BST.

[0:00] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[0:01] <sagewk> all btrfs right?
[0:01] <damoxc> yeah
[0:02] <sagewk> is there overlap?
[0:02] <damoxc> how do you mean?
[0:02] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[0:09] <sagewk> are the same pginfo files affected on multiple nodes?
[0:09] <sagewk> if they are disjoint, you can safely remove the affected pg directories. if the same pginfo is broken on multiple nodes, we have to get trickier to recover.
[0:10] <damoxc> on one of the osds all the broken pginfo files were active+clean in the cluster and on other nodes
[0:10] <sagewk> this is all assuming you care about saving the data.. if you're just testing and can throw it all away, we can focus our efforts on finding the actual bug
[0:11] <damoxc> by trash it do you mean just that osd?
[0:11] * adjohn (~adjohn@ Quit (Remote host closed the connection)
[0:11] <sagewk> the whole cluster.
[0:11] * adjohn (~adjohn@ has joined #ceph
[0:12] <sagewk> otherwise, we need to verify each broken pg is broken _only_ on that node, and if so, remove it and let the recovery re-replicate it
[0:12] <damoxc> i've been doing that so far
[0:12] * adjohn (~adjohn@ Quit (Read error: Connection reset by peer)
[0:12] * adjohn (~adjohn@ has joined #ceph
[0:13] <damoxc> there are some things I would like to get out of it, plus it's interesting messing with the disk format
[0:15] <bchrisman> anybody seen this osd issue before? http://pastebin.com/umT2CvP7 and ceph.conf (http://pastebin.com/K5kX1DWE)
[0:15] <bchrisman> summary:
[0:15] <bchrisman> 2011-09-01 22:13:24.118211 7f5adc222720 FileStore is up to date.
[0:15] <bchrisman> starting osd0 at osd_data /data/osd0 /dev/sda6
[0:15] <bchrisman> 2011-09-01 22:13:24.171158 7f5adc222720 global_init_daemonize: BUG: there are 1 child threads already started that will now die!
[0:15] <bchrisman> failed: ' /usr/bin/cosd -i 0 -c /etc/ceph/ceph.conf '
[0:17] <gregaf1> bchrisman: what version is that?
[0:19] <bchrisman> 2255a9a107c8e8187b6e6f529e248eb07f016451
[0:19] <bchrisman> that was the latest commit.. I have one commit on top of that for creating the /var/run/ceph directory in the spec file… was testing that.. but shouldn't be an issue I'd think.
[0:20] <gregaf1> yeah, humm
[0:20] <gregaf1> oh, how are you creating the dir?
[0:21] <bchrisman> mkdir -p nothing fancy..
[0:21] <bchrisman> there's something… one sec
[0:21] <bchrisman> drwxr-xr-x. 2 root root 4096 Sep 1 22:17 /var/run/ceph
[0:22] <gregaf1> what that error means is that you're trying to daemonize but have already started extra threads (so in this case, 2 are running in the process)
[0:22] <gregaf1> and it's called from cosd.cc and there's nothing in there that ought to be starting extras, or has changed...
[0:22] <bchrisman> yeah.. something mucked up in the init script?
[0:23] <bchrisman> I guess I could rerun a mkcephfs
[0:23] <bchrisman> make sure I created with the config file I think I am.
[0:24] <gregaf1> I don't see how that could have done this
[0:24] <gregaf1> although I don't see how anything could have, so who knows
[0:27] <bchrisman> okay.. yeah.. there's really no debugging this thing from the standard debug logs I'm guessing.
[0:28] <gregaf1> I dunno, it'd be a pretty short log
[0:28] <gregaf1> we could try
[0:28] <gregaf1> sagewk: you have any ideas about too many threads before daemonization?
[0:28] <bchrisman> checked.. and yeah.. no help
[0:29] <gregaf1> ordinarily I'd ask Colin about this, but he's out sick
[0:33] <bchrisman> ahh
[0:33] <bchrisman> okay.. well… will have to look at it later.. this is odd… haven't seen it before.
[0:35] <gregaf1> yeah
[0:35] <sagewk> bchrisman: can you post the osd.log somewhere?
[0:35] <sagewk> sjust: the filestore conversion doesn't start any threads, does it?
[0:39] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[0:47] <sjust> sagewk: it implicitly starts any threads the FileStore usually creates
[0:56] * adjohn is now known as Guest8225
[0:56] * Guest8225 (~adjohn@ Quit (Read error: Connection reset by peer)
[0:56] * adjohn (~adjohn@ has joined #ceph
[0:59] <gregaf1> sjust: did you add that call to convertfs?
[1:01] <gregaf1> I don't see where it would be starting up threads, but if it is then that would cause trouble with daemonize
[1:09] <sagewk> store->mount creates a thread...
[1:10] <sagewk> but it should umount.
[1:11] <gregaf1> does that actually destroy the thread? deterministically?
[1:11] <sagewk> it does a join
[1:11] <gregaf1> Thread::get_num_threads() just looks at /proc/pid/task these days
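The check gregaf1 describes, Thread::get_num_threads() reading /proc/pid/task, can be sketched in Python (a hypothetical stand-in for the C++ helper, Linux-only; not Ceph's actual code):

```python
import os

def get_num_threads(pid="self"):
    """Count kernel-visible threads the way the helper does:
    each thread of a process appears as a directory under /proc/<pid>/task."""
    return len(os.listdir("/proc/%s/task" % pid))

# A single-threaded process sees exactly one task entry; daemonize
# complains when the count exceeds 1, as in bchrisman's BUG message.
print(get_num_threads())
```

Because the count comes from the kernel, any thread started before daemonizing is visible, even one that was supposedly joined by the time the check runs.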
[1:12] * jmlowe (~Adium@mobile-166-137-142-223.mycingular.net) has joined #ceph
[1:13] <gregaf1> oh, looks like maybe the error-handling is bad?
[1:13] <gregaf1> I see a lot of " if (r < 0)
[1:13] <gregaf1> return -r;"
[1:13] <gregaf1> which doesn't unmount
[1:13] <gregaf1> but then in cosd:
[1:13] <gregaf1> " int err = OSD::convertfs(g_conf->osd_data, g_conf->osd_journal);
[1:13] <gregaf1> if (err < 0) {"
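The bug gregaf1 is pointing at is an early error return that skips the unmount, leaving the FileStore's thread alive when daemonize runs. A minimal Python sketch of the fix (store, do_convert, and convertfs here are hypothetical illustrations, not the real convertfs code):

```python
class Store:
    def __init__(self):
        self.mounted = False
    def mount(self):
        self.mounted = True   # in FileStore, mount also starts worker threads
    def umount(self):
        self.mounted = False  # and umount joins them again

def do_convert(store):
    return 0                  # hypothetical conversion step

def convertfs(store):
    """Run the conversion, unmounting on every exit path."""
    store.mount()
    try:
        r = do_convert(store)
        if r < 0:
            return -r         # early error return no longer leaks the mount
        return 0
    finally:
        store.umount()        # always runs, so threads are joined before daemonize

s = Store()
convertfs(s)
assert not s.mounted
```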
[1:14] <jmlowe> so I've started to quantify just how slow qemu-img convert for rbd is
[1:14] <sagewk> jmlowe: mainline or with yehuda's patches?
[1:15] <jmlowe> actually I'm not sure
[1:15] <sagewk> there's an rbd-async-convert branch on git://ceph.newdream.net/git/qemu-kvm.git that has his patches
[1:16] <sagewk> the mainline version is horribly slow :(
[1:17] <gregaf1> sjust: care to comment re: filestore conversion and error handling?
[1:17] <jmlowe> I'm figuring about 12-13 hours to convert a 4G image
[1:18] <jmlowe> about 97 kB/s
[1:19] <sagewk> jmlowe: yeah, it's probably doing 512-byte synchronous writes
[1:20] <jmlowe> git://ceph.newdream.net/git/qemu-kvm.git
[1:20] <sjust> gregaf1: sorry, let me look
[1:20] <jmlowe> perhaps it's time for a pull
[1:20] <sagewk> anyway, try that branch and see how it does
[1:22] <sagewk> yeah our changes are based on stable releases so we don't have to worry about qemu bleeding edge
[1:22] <sjust> gregaf1: actually, I didn't check the store->unmount return value, that's a possibility
[1:23] <jmlowe> gw48:~/qemugit/qemu-kvm# git pull
[1:23] <jmlowe> remote: Counting objects: 7688, done.
[1:23] <jmlowe> remote: Compressing objects: 100% (1321/1321), done.
[1:23] <jmlowe> remote: Total 6814 (delta 5497), reused 6798 (delta 5484)
[1:23] <jmlowe> Receiving objects: 100% (6814/6814), 2.07 MiB | 638 KiB/s, done.
[1:23] <jmlowe> Resolving deltas: 100% (5497/5497), completed with 575 local objects.
[1:23] <jmlowe> From git://ceph.newdream.net/git/qemu-kvm
[1:23] <jmlowe> 9ed5726..20be92d master -> origin/master
[1:23] <jmlowe> + 8113be5...7ea6521 rbd-async-convert -> origin/rbd-async-convert (forced update)
[1:23] <jmlowe> Already up-to-date.
[1:33] * adjohn is now known as Guest8230
[1:33] * adjohn (~adjohn@ has joined #ceph
[1:37] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:38] * Guest8230 (~adjohn@ Quit (Ping timeout: 480 seconds)
[1:39] <gregaf1> jojy: that thread count thing bchrisman was asking about… should be fixed in current master, but was actually masking some other issue we think
[1:41] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[1:41] <ajm> anyone seen a situation with cephfs where accessing a specific file/directory just hangs? All components (mon/mds/osd) are up/online
[1:42] <gregaf1> ajm: that's a fairly common class of bugs which can be caused by just about anything
[1:42] <gregaf1> do you have any logs?
[1:42] <jmlowe> ok, that's now very usable as async
[1:42] <jmlowe> time ~/qemugit/qemu-kvm/qemu-img convert -p -f qcow2 -O raw centos5.6.img centos5.6.raw
[1:42] <jmlowe> (100.00/100%)
[1:42] <jmlowe> real 0m9.064s
[1:42] <jmlowe> user 0m1.630s
[1:42] <jmlowe> sys 0m10.510s
[1:43] <jmlowe> 451 MB/s
[1:43] <ajm> gregaf1: actually this is new: "libceph: mds0 a.b.c.d:6800 io error" in dmesg
[1:44] <jmlowe> that's about 1/2 of line rate for the network card
[1:45] <gregaf1> ajm: is the MDS still alive?
[1:45] <jmlowe> nm, that's 4x line rate, buffers are a good thing
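The throughput figure jmlowe quotes is just the 4 GiB image divided by the 9.064 s wall time from his `time` output:

```python
image_mib = 4 * 1024               # 4 GiB image expressed in MiB
elapsed = 9.064                    # real time from `time qemu-img convert`
mib_per_s = image_mib / elapsed
print(round(mib_per_s))            # ~452 MiB/s, the ~451 figure above
```

Against a 1 GbE NIC (~125 MB/s line rate) that is several times line rate, which is only possible because the reads are coming from buffers rather than the wire.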
[1:45] <ajm> gregaf1: yeah, mds e67: 1/1/1 up {0=15=up:active}, 1 up:standby
[1:46] <ajm> i'm poking the mds'es to see if there's anything unusual there
[1:47] <gregaf1> hmm, I don't know when the mds would be returning EIO, you have any ideas sagewk?
[1:47] <ajm> "accept connect_seq 5 vs existing 6 state 3"
[1:48] <sagewk> gregaf1: not offhand.. git grep EIO mds/
[1:48] <gregaf1> yeah, there's nothing there :/
[1:48] <ajm> it's -scrolling- that so fast it almost crashed my terminal in the log
[1:48] <ajm> in the mds log
[1:48] <gregaf1> sagewk: I thought maybe the kernel was generating it?
[1:48] <sagewk> oh, probably
[1:48] <ajm> -rw------- 1 root root 11G Sep 1 19:48 /var/log/ceph/mds.15.log
[1:49] <gregaf1> sagewk: yeah, looks like it's a kernel error code from ceph_tcp_recvmsg, not out of the mds
[1:49] <gregaf1> which just calls into kernel_recvmsg
[1:50] <gregaf1> ajm: you're running with debugging on, I assume?
[1:50] <gregaf1> that generates a lot of output, but hopefully can give us what we need
[1:50] <ajm> no, actually
[1:50] <ajm> that 11gb log file isn't even with debug
[1:51] <sagewk> what's on the console?
[1:51] <gregaf1> what logging options are you using?
[1:51] <ajm> i have debug ms = 1
[1:51] <ajm> thats it
[1:51] <ajm> sagewk: console of ?
[1:52] <sagewk> dmesg
[1:52] <sagewk> just that one error?
[1:53] <ajm> http://pastebin.com/X8PghkTr
[1:53] <sagewk> interesting.
[1:54] <sagewk> well, once you hit 'reconnect denied', i think you're screwed.. need to umount -f or reboot. or restart that mds again and hope for the best
[1:55] <sagewk> (if you miss the reconnect window there's no graceful handling for slowpokes)
[1:55] <ajm> whats odd is
[1:55] <ajm> its not totally broken
[1:55] <ajm> some things work
[1:55] <ajm> specific files hang
[1:55] <gregaf1> probably files the client still has cached and thinks it has caps for
[1:55] <ajm> hrm, ok
[1:55] <gregaf1> or actually, just stuff the client thinks it still has caps for, it can talk to the OSDs to get data
[1:57] <ajm> ah
[1:57] <ajm> that makes sense
[1:57] <ajm> so it can still read/write some things
[1:58] <gregaf1> if everything's working properly it will time out caps after a while and refuse to touch stuff
[1:58] <gregaf1> and the MDS won't let anybody else touch it until that time has passed, to preserve consistency
[1:58] <ajm> makes sense
[1:59] <gregaf1> I'm not sure why it would have gotten the reconnect denied, which is annoying
[1:59] <gregaf1> but I don't think that's something we can diagnose without more logging than you have on
[2:00] <gregaf1> have you tried using a different client?
[2:00] <ajm> I think the mds was just overloaded
[2:00] <ajm> whatever "accept connect_seq 5 vs existing 6 state 3" was
[2:01] <gregaf1> hrm
[2:02] <gregaf1> can you put where you're seeing that (with whatever detail you can) in the tracker?
[2:02] <ajm> the accept connect_seq stuff ?
[2:03] <gregaf1> yeah
[2:04] <ajm> ok, lemme try to pare down the log a bit :)
[2:04] <gregaf1> thanks!
[2:04] <ajm> no, thank you!
[2:14] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:31] * adjohn (~adjohn@ Quit (Remote host closed the connection)
[2:31] * adjohn (~adjohn@ has joined #ceph
[2:42] <ajm> interesting: 2011-09-01 19:50:10.545840 7febb11bd700 mds0.cache.dir(1000021959f) [dentry #1/some/file/path [2,head] auth (dversion lock) pv=0 v=3053246 inode=0x7feb489f0810 0x7feb69127938] n(v0 1=0+1)
[2:43] <ajm> poking through logs to remove the bulk of it
[3:01] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:16] * adjohn is now known as Guest8244
[3:16] * adjohn (~adjohn@ has joined #ceph
[3:17] * Guest8244 (~adjohn@ Quit (Ping timeout: 480 seconds)
[3:51] * adjohn is now known as Guest8250
[3:51] * adjohn (~adjohn@ has joined #ceph
[3:56] * Guest8250 (~adjohn@ Quit (Ping timeout: 480 seconds)
[4:28] * adjohn (~adjohn@ Quit (Quit: adjohn)
[4:55] * pzb (~pzb@gw-ott1.byward.net) Quit (Quit: Ex-Chat)
[4:59] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[5:41] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:42] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit ()
[6:12] * lxo (~aoliva@1RDAAAM6S.tor-irc.dnsbl.oftc.net) has joined #ceph
[6:34] * jmlowe (~Adium@mobile-166-137-142-223.mycingular.net) Quit (Quit: Leaving.)
[7:03] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[9:27] * eternaleye__ is now known as eternaleye
[9:37] * eternaleye is now known as eternaleye_
[9:45] * eternaleye_ is now known as eternaleye
[9:57] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[10:43] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[11:38] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:57] * alekibango (~alekibang@ip-94-113-34-154.net.upcbroadband.cz) Quit (Quit: BUS error!)
[13:38] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[14:10] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[14:53] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[15:06] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[15:58] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:21] * jmlowe (~Adium@140-182-133-51.dhcp-bl.indiana.edu) has joined #ceph
[16:27] <slang> sagewk: fyi, I'm not able to reproduce the hang now, using a version that includes both patches you pointed me at, as well as recompiling with tcmalloc
[16:57] * jmlowe (~Adium@140-182-133-51.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[17:16] * lxo (~aoliva@1RDAAAM6S.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[17:16] * jmlowe (~Adium@140-182-150-36.dhcp-bl.indiana.edu) has joined #ceph
[17:54] * jmlowe (~Adium@140-182-150-36.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[18:18] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:32] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:37] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[18:53] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:26] * jmlowe (~Adium@mobile-166-137-141-170.mycingular.net) has joined #ceph
[19:35] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:39] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[19:40] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[19:40] * lxo (~aoliva@19NAADIU0.tor-irc.dnsbl.oftc.net) has joined #ceph
[19:41] <sagewk> slang: hmm.. even with tcmalloc compiled in you mean? (like it was originally?) that's frustrating... :/
[19:46] <sagewk> aie.. is sf markus trolling? i can't tell if he's serious
[19:47] * adjohn (~adjohn@ has joined #ceph
[19:49] <gregaf1> joshd: sagewk: augh, <insert generic anger about teuthology machines being locked>
[19:49] <gregaf1> also, sf markus? wrong channel?
[19:49] <gregaf1> oh, that guy
[19:50] <gregaf1> yeah, no idea
[19:51] <sagewk> gregaf1: unlocked 3
[19:51] <sagewk> just keep some locked
[19:51] <gregaf1> argh, teuthology just ate them
[19:51] <gregaf1> I would have some locked but I was done using them, just running a few tests before I merge my flock stuff
[19:52] <gregaf1> oh, maybe it didn't
[19:56] <sagewk> gregaf1: all good?
[19:56] <gregaf1> yeah, thanks :)
[19:57] <sagewk> hmm, suddenly teuth isn't returning an error code when a workunit fails
[19:58] <gregaf1> oh, so maybe there is a new bug there
[19:59] <gregaf1> I assumed it was just because locktest was crashing instead of returning that it was borked
[19:59] <gregaf1> sagewk: err, actually, what are your symptoms?
[20:00] <gregaf1> might be I missed something when setting up "all" if you're just using that
[20:00] <slang> yes, just like originally
[20:00] <slang> I guess I can remove your patches and see if it comes back
[20:00] <gregaf1> or parallel() isn't working right
[20:03] <sagewk> i have a job that runs the 'false.sh' workunit, and virtualenv/bin/teuthology myjob && echo success echoes success, despite seeing exceptions and stack dumps and such
[20:05] <gregaf1> sagewk: I mean, is it running on a specific client or is it running on "all"
[20:05] <sagewk> all
[20:05] <gregaf1> I saw a problem yesterday where locktest.c was failing and it wasn't getting propagated back, myself, so there might be a bug with the wrappers somehow
[20:05] <gregaf1> or it could be a bug with that or with the parallel stuff that josh wrote
[20:06] <gregaf1> can you try it on a specific client to narrow things down?
[20:06] <sagewk> yeah
[20:09] <sagewk> still hides the failure
[20:09] <gregaf1> okay, so the script wrappers are wrong somehow
[20:09] <gregaf1> which is odd because I don't see any recent changes
[20:09] <gregaf1> Tv: joshd: any ideas?
[20:10] <sagewk> i'll bisect
[20:10] <joshd> I'll check the parallel thing again - single client is still using it for workunits
[20:11] <Tv> gregaf1: didn't josh say something about a bug in parallel() two days ago or so?
[20:11] <Tv> also, there was the gevent import thing
[20:12] <gregaf1> Tv: it could be that, I just saw a similar thing yesterday where a crashed process wasn't reporting an error back in teuthology, and that wasn't using parallel() at all
[20:12] <joshd> those both caused hangs
[20:12] <Tv> gregaf1: you'll have to give me a bit more than that to work with... :-/
[20:13] <gregaf1> well, I got errors back properly once I removed the wrappers (enable-coredump, coverage, daemon-helper)
[20:13] <sagewk> no worries, i'm bisecting, should just take a few
[20:14] <gregaf1> my assumption at the time was that it just didn't work right for crashes, but I'd dealt with my bug and posted it somewhere and didn't want to deal with it any more *shrug*
[20:14] <gregaf1> this thing makes me wonder if something deeper happened
[20:14] <Tv> gregaf1: please don't assume i do subpar work :(
[20:16] <Tv> i wonder about 0c2bee1514c1b1e65ca5d52459062e5a45da2d7b
[20:16] <Tv> yes
[20:16] <Tv> >>> def f():
[20:16] <Tv> ... try:
[20:17] <Tv> ... raise RuntimeError('foo')
[20:17] <Tv> ... finally:
[20:17] <Tv> ... return
[20:17] <Tv> ...
[20:17] <Tv> >>> f()
[20:17] <Tv> >>>
[20:17] <Tv> that eats exceptions for breakfast
[20:17] <Tv> gregaf1: all your fault :-p
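Tv's REPL snippet shows the core bug: a `return` inside a `finally` block discards any in-flight exception, which is exactly how the workunit failure got silently eaten. Removing the `return` (doing only cleanup in `finally`) lets the exception propagate again, as this sketch shows:

```python
def eats_exceptions():
    try:
        raise RuntimeError('foo')
    finally:
        return            # swallows the RuntimeError entirely

def propagates():
    try:
        raise RuntimeError('foo')
    finally:
        pass              # cleanup only; the exception escapes normally

eats_exceptions()         # returns None, no error visible to the caller
try:
    propagates()
    raised = False
except RuntimeError:
    raised = True
assert raised
```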
[20:17] * bchrisman (~Adium@ has joined #ceph
[20:43] * adjohn (~adjohn@ Quit (Quit: adjohn)
[21:14] * adjohn (~adjohn@ has joined #ceph
[21:26] * lxo (~aoliva@19NAADIU0.tor-irc.dnsbl.oftc.net) Quit (Quit: later)
[21:36] <Tv> new docs: http://ceph.newdream.net/docs/latest/ops/install/ http://ceph.newdream.net/docs/latest/ops/autobuilt/
[21:38] <ajm> a general idea, a "installing ceph without mkcephfs" would be a nice addition
[21:39] <Tv> ajm: yes, but i hope to tackle that via making it simpler first...
[21:39] <slang> man mkcephfs
[21:39] <ajm> or even a mkcephfs that has a "pretend" mode
[21:39] <slang> there's some steps in there about not using mkcephfs
[21:39] <ajm> hrm
[21:39] <ajm> I missed that man page :)
[21:39] <Tv> ajm: the non-ssh mode of mkcephfs is somewhat better at pretending
[21:40] <Tv> but honestly, i don't want too many people to learn the intricacies of monmap & osdmap just to install ceph; that part needs to become simpler first
[21:40] * jmlowe (~Adium@mobile-166-137-141-170.mycingular.net) Quit (Quit: Leaving.)
[21:43] * lxo (~aoliva@9KCAAAUAB.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:47] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[21:47] <darkfader> slang: man pages do NOT replace a manual
[21:47] <darkfader> :)
[21:48] <darkfader> just think of the really bad IT books where they include "all important linux command man pages" :)
[21:48] <Tv> and 484-line shell scripts do NOT replace a good installation mechanism ;)
[21:48] <darkfader> hehe
[21:48] <darkfader> did you find my last kickstart scripts?
[21:48] <Tv> i'm looking at mkcephfs
[21:49] <Tv> this'll all get better, it just needs more TLC
[21:51] <sjust> bchrisman: got logs for that crash?
[21:54] <bchrisman> sjust: I've just pulled & built latest.
[21:54] <sjust> ok
[22:06] <sagewk> anybody used the openSUSE Open Build Service before?
[22:14] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[22:46] * jmlowe (~Adium@mobile-166-137-141-170.mycingular.net) has joined #ceph
[22:58] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Quit: Leaving)
[23:02] <bchrisman> new logfile naming scheme? 'osd.3.3.log' or was that there before and I didn't notice it?
[23:07] <bchrisman> sjust: http://pastebin.com/1mehJWjG
[23:07] <bchrisman> hmm.. probably my osd logging not full..
[23:08] * jmlowe (~Adium@mobile-166-137-141-170.mycingular.net) has left #ceph
[23:08] <cmccabe> bchrisman: probably you want something like log_file = $name.log
[23:09] <bchrisman> log file = /cephlogs/$name.$id.log
[23:09] <bchrisman> that's changed?
[23:09] <sagewk> $name=$type.$id
[23:09] <sagewk> so that's $type.$id.$Id
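sagewk's point is that in ceph.conf `$name` already expands to `$type.$id`, so a `log file` of `$name.$id.log` doubles the id. A toy expansion illustrating this (hypothetical, not Ceph's actual config parser):

```python
def expand(template, type_, id_):
    # $name is itself defined as $type.$id, so substitute it first,
    # then fill in the concrete type and id values.
    s = template.replace("$name", "$type.$id")
    return s.replace("$type", type_).replace("$id", id_)

# bchrisman's setting reproduces the doubled id he noticed:
print(expand("$name.$id.log", "osd", "3"))   # -> osd.3.3.log
```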
[23:09] <bchrisman> ahh okay.. thx
[23:10] <bchrisman> sjust: http://pastebin.com/J1k9w6Vs
[23:10] <bchrisman> with osd/filestore=20
[23:11] <bchrisman> my conf file: http://pastebin.com/HfBL2Fim
[23:19] <bchrisman> yeah.. I don't see anything suspicious in the log files that look out of place.
[23:22] <Tv> so mkcephfs -a requires the ceph.conf host= to be short form hostname (hostname=`hostname | cut -d . -f 1` in ceph_common.sh), and thus also tries to ssh to short form hostnames.. that's annoying :(
[23:22] <Tv> my dns is configured purely by dhcp, and thus i don't have the right subdomain in my list of domains to search
[23:23] <Tv> and there's ~100 sepia nodes, i don't want to set up ~/.ssh/config aliases for each one!
[23:23] <sjust> bchrisman, be with you in a bit, sorry for the delay
[23:24] <Tv> sagewk: before i go on a rampage to solve this somehow, what's your vision of "right" here
[23:25] <sagewk> ideally either would work, i think
[23:29] <Tv> and it thinks local tmp path is always valid on remote :-(
[23:29] <Tv> no wait not tmp, local cwd?
[23:30] <Tv> "sshdir" huh?
[23:31] <Tv> sagewk: mkcephfs has too many "sage convenience" features to work reliably :-/
[23:33] <sagewk> that can be disabled by default
[23:33] <sagewk> the goal was to work in shared root nfs-type environment
[23:33] <sagewk> namely, my ~/ceph/src in my nfs-mounted home directory :)
[23:34] <Tv> it's more that it's frustrating to hit all the speed bumps
[23:34] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[23:34] <Tv> sagewk: this is why i'm constantly tempted to just write something simpler from scratch
[23:34] <ajm> fwiw, I got around that in my setup by doing: CEPH_COMMON=(--hostname $(hostname)) in conf.d/ceph (i'm a gentoo)
[23:34] <ajm> so it doesn't use the short-form hostnames
[23:35] <Tv> sagewk: fundamentally, it's easier to build something complex from simple components, than something simple from a complex component
[23:35] <sagewk> tv: i think most of the pain comes from the iterate-over-hosts part. all the pieces are broken apart now
[23:38] <sagewk> we're almost to the point where we can separate adding osds from this process entirely
[23:39] <sagewk> maybe we should focus on how to make the mon cluster setup simpler
[23:39] <Tv> oh yes
[23:39] <Tv> i'm just trying to get *something* in place
[23:39] <Tv> but my list of "why mkcephfs breaks on sepia" is already 4 entries long
[23:54] <Tv> whee segfault too
[23:58] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[23:59] <Tv> throwing end_of_buffer in CrushWrapper::decode, backtrace: http://pastebin.com/KjGtuCRF

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.