#ceph IRC Log

IRC Log for 2013-02-04

Timestamps are in GMT/BST.

[0:01] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[0:02] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[0:44] * GBurton (~GBurton@46-65-88-85.zone16.bethere.co.uk) has left #ceph
[1:04] * LeaChim (~LeaChim@027ee384.bb.sky.com) Quit (Ping timeout: 480 seconds)
[1:16] * ScOut3R (~scout3r@540079A1.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[1:20] * loicd (~loic@2a01:e35:2eba:db10:accc:dd91:8c6d:d542) Quit (Quit: Leaving.)
[1:20] <BillK> 2013-02-04 07:13:40.352932 7fc06ecfa700 0 -- 192.168.44.90:6807/12680 >> 192.168.44.90:6826/10601 pipe(0x1546a00 sd=30 :47544 pgs=0 cs=0 l=0).connect claims to be 192.168.44.90:6826/13152 not 192.168.44.90:6826/10601 - wrong node!
[1:21] <BillK> tried to remove (set "out") some osds and it's hung
[1:21] <BillK> how can I fix it?
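
For anyone hitting the same symptom: the "wrong node!" lines usually just mean a peer osd restarted and came back with a new address/nonce, and a hung "set out" is normally diagnosed from the monitor side first. A minimal sketch of the usual first checks (osd.12 is only an example id; the service invocation varies by distro/init system):

    ceph -s                                 # overall health; are PGs stuck peering/degraded?
    ceph osd tree                           # which osds are up/in right now
    ceph osd out 12                         # re-issue the out for the example osd id
    sudo /etc/init.d/ceph restart osd.12    # on that osd's host, restart the stuck daemon
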
[1:25] * scuttlemonkey (~scuttlemo@217.64.252.30.mactelecom.net) Quit (Quit: This computer has gone to sleep)
[1:28] * andret (~andre@pcandre.nine.ch) Quit (Ping timeout: 480 seconds)
[1:36] * The_Bishop (~bishop@2001:470:50b6:0:6157:b33e:c40d:437b) Quit (Ping timeout: 480 seconds)
[1:37] * andret (~andre@pcandre.nine.ch) has joined #ceph
[1:44] * The_Bishop (~bishop@2001:470:50b6:0:71af:768b:b89e:2209) has joined #ceph
[1:56] * andret (~andre@pcandre.nine.ch) Quit (Ping timeout: 480 seconds)
[1:56] * andret (~andre@pcandre.nine.ch) has joined #ceph
[2:47] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[2:47] * ChanServ sets mode +o elder
[2:55] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[3:15] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[3:18] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[3:34] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[3:47] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:48] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[3:52] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[3:59] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[4:01] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:01] * wschulze1 (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:03] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Read error: Connection reset by peer)
[4:04] * wschulze1 (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit ()
[4:12] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[4:31] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:46] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[4:59] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[5:00] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:01] * lxo (~aoliva@lxo.user.oftc.net) Quit ()
[5:02] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:16] * maxia (~sensa@114.91.100.100) Quit (Ping timeout: 480 seconds)
[5:17] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[5:21] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[5:22] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[5:28] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:35] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has left #ceph
[5:46] * gaveen (~gaveen@112.135.9.175) has joined #ceph
[5:50] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[5:51] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:54] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[6:04] * drokita1 (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[6:09] * drokita2 (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[6:10] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[6:14] * drokita1 (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[6:25] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[6:31] * drokita2 (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[6:39] * xiaoxi (~xiaoxiche@134.134.137.71) has joined #ceph
[7:04] * mattch (~mattch@pcw3047.see.ed.ac.uk) Quit (Read error: Connection reset by peer)
[7:25] * andret (~andre@pcandre.nine.ch) Quit (Ping timeout: 480 seconds)
[7:25] * andret (~andre@pcandre.nine.ch) has joined #ceph
[7:45] * Ryan_Lane (~Adium@10.113-78-194.adsl-static.isp.belgacom.be) has joined #ceph
[8:05] * xiaoxi (~xiaoxiche@134.134.137.71) Quit (Ping timeout: 480 seconds)
[8:10] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[8:28] * gucki (~smuxi@84-72-10-109.dclient.hispeed.ch) Quit (Remote host closed the connection)
[8:44] * trond (~trond@trh.betradar.com) Quit (Remote host closed the connection)
[8:55] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[9:00] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: The early bird may get the worm, but the second mouse gets the cheese)
[9:03] * raso (~raso@deb-multimedia.org) Quit (Quit: WeeChat 0.3.8)
[9:03] * darktim (~andre@pcandre.nine.ch) has joined #ceph
[9:03] * andret (~andre@pcandre.nine.ch) Quit (Read error: Connection reset by peer)
[9:07] * raso (~raso@deb-multimedia.org) has joined #ceph
[9:11] * loicd (~loic@lvs-gateway1.teclib.net) has joined #ceph
[9:11] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:20] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[9:21] * gohko (~gohko@natter.interq.or.jp) Quit ()
[9:24] * ScOut3R (~ScOut3R@212.96.47.215) has joined #ceph
[9:26] * LeaChim (~LeaChim@027ee384.bb.sky.com) has joined #ceph
[9:32] * loicd (~loic@lvs-gateway1.teclib.net) Quit (Ping timeout: 480 seconds)
[9:33] * Ryan_Lane (~Adium@10.113-78-194.adsl-static.isp.belgacom.be) Quit (Quit: Leaving.)
[9:37] * yoshi (~yoshi@p6124-ipngn1401marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[9:46] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) has joined #ceph
[9:49] * sleinen (~Adium@2001:620:0:46:fd9f:fac7:123f:35e) has joined #ceph
[9:53] * Morg (d4438402@ircip1.mibbit.com) has joined #ceph
[9:56] * ninkotech (~duplo@89.177.137.236) Quit (Quit: Konversation terminated!)
[9:57] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[10:04] * hybrid512 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[10:07] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:15] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[10:19] * gaveen (~gaveen@112.135.9.175) Quit (Ping timeout: 480 seconds)
[10:28] * gaveen (~gaveen@112.135.13.180) has joined #ceph
[10:34] * scuttlemonkey (~scuttlemo@217.64.252.30.mactelecom.net) has joined #ceph
[10:34] * ChanServ sets mode +o scuttlemonkey
[10:37] * Morg (d4438402@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[10:45] * sjustlaptop (~sam@fw.office-fra1.proio.com) has joined #ceph
[10:50] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[10:58] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[10:59] * yoshi (~yoshi@p6124-ipngn1401marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:11] * joshd1 (~jdurgin@2602:306:c5db:310:7dd2:ff03:e100:4283) Quit (Quit: Leaving.)
[11:12] <Kdecherf> oh yeah, a segfault on mds
[11:59] * xiaoxi (~xiaoxiche@134.134.139.72) has joined #ceph
[12:02] * The_Bishop (~bishop@2001:470:50b6:0:71af:768b:b89e:2209) Quit (Ping timeout: 480 seconds)
[12:11] * The_Bishop (~bishop@2001:470:50b6:0:6157:b33e:c40d:437b) has joined #ceph
[12:12] <loicd> Kdecherf: :-D
[12:20] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[12:30] * Ryan_Lane (~Adium@212-123-27-210.iFiber.telenet-ops.be) has joined #ceph
[12:36] * mattch (~mattch@pcw3047.see.ed.ac.uk) has joined #ceph
[12:37] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[12:39] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[12:41] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[12:51] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[12:58] * low (~low@188.165.111.2) has joined #ceph
[13:25] * maxia (~sensa@114.91.121.183) has joined #ceph
[13:39] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[13:39] * foxhunt (~richard@office2.argeweb.nl) has joined #ceph
[13:50] * jluis (~JL@89.181.149.199) has joined #ceph
[13:52] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[13:55] * Ryan_Lane (~Adium@212-123-27-210.iFiber.telenet-ops.be) Quit (Quit: Leaving.)
[13:56] * joao (~JL@89.181.159.56) Quit (Ping timeout: 480 seconds)
[14:00] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[14:06] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:07] <Kdecherf> Is having a lot of "handle_client_file_setlock" messages in the logs a good thing?
[14:09] * sleinen (~Adium@2001:620:0:46:fd9f:fac7:123f:35e) Quit (Ping timeout: 480 seconds)
[14:10] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[14:14] * itamar (~itamar@82.166.185.149) has joined #ceph
[14:18] <loicd> Would someone be willing to quickly review https://github.com/ceph/ceph-build/pull/1/files in the context of http://tracker.ceph.com/issues/3788 ?
[14:29] <Kdecherf> Can I manually take the currently active mds of a cluster out of service? It currently hangs and generates a lot of "missing file" errors
[14:30] * jefferai (~quassel@quassel.jefferai.org) has joined #ceph
[14:34] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[14:36] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[14:38] * gaveen (~gaveen@112.135.13.180) Quit (Quit: Leaving)
[14:40] * gammacoder (~chatzilla@23-25-46-123-static.hfc.comcastbusiness.net) has joined #ceph
[14:43] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[14:43] <xiaoxi> hi, when ceph performs a read, which thread gets blocked? The osd op thread, it seems?
[14:43] <sjustlaptop> xiaoxi: yes
[14:44] <xiaoxi> but for a write, it's the filestore op thread, right?
[14:44] <sjustlaptop> right
[14:45] <xiaoxi> sjustlaptop
[14:46] <xiaoxi> sjustlaptop: I have seen some strange performance data: my setup can write at ~1700MB/s, but read at only ~1300MB/s
[14:46] <xiaoxi> is it due to not enough osd op threads?
[14:46] <sjustlaptop> that is one possibility
[14:46] <sjustlaptop> is that mixed read/write?
[14:47] <xiaoxi> no, pure read. We're doing the test on top of rbd using aio-stress, keeping all the aio parameters the same between the read and write tests
[14:49] * Ryan_Lane (~Adium@LPuteaux-156-16-1-107.w80-14.abo.wanadoo.fr) has joined #ceph
[14:49] <sjustlaptop> ok
[14:49] <sjustlaptop> I'd be interested to see what would happen if you increased the number of osd op threads
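
For reference, the setting being discussed here is the osd's "osd op threads" option. A minimal sketch of raising it, assuming a test cluster where restarting the osds is acceptable (the value 8 is just an example):

    # ceph.conf on the osd hosts, then restart the osds
    [osd]
        osd op threads = 8
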
[14:49] <jluis> sjustlaptop, how was your flight?
[14:50] <sjustlaptop> jluis: transatlantic
[14:50] <sjustlaptop> which is my new word for "tedious and long"
[14:50] <jluis> yeah, that's what it takes :p
[14:50] <sjustlaptop> mostly uneventful though
[14:51] <jluis> enjoying Germany already?
[14:51] <sjustlaptop> sure
[14:51] * aliguori (~anthony@cpe-70-112-157-151.austin.res.rr.com) has joined #ceph
[14:51] <sjustlaptop> I'll be enjoying it more when the cluster is happy :)
[14:51] <jluis> bumped into any issues yet?
[14:53] <xiaoxi> sjustlaptop:I will test it now
[14:55] <sjustlaptop> jluis: not really, the window is at 9pm tonight
[14:55] <jluis> cet?
[14:55] <sjustlaptop> most of the work will happen after 9pm
[14:57] <jluis> let me know if there's anything I can help with; I should be around until 3am CET anyway
[14:57] <sjustlaptop> k
[14:57] * itamar_ (~itamar@82.166.185.149) has joined #ceph
[14:58] * itamar_ (~itamar@82.166.185.149) Quit ()
[15:00] <xiaoxi> sjust: increasing the osd op thread count doesn't seem to help
[15:05] <Kdecherf> has anyone observed high cpu load/crashes/corruption on the mds (Ceph 0.56.1)? I have a lot of "2013-02-04 14:39:18.998132 7f7f8a17a700 10 mds.0.server FAIL on ESTALE but attempting recovery" in debug mode and some connected clients are hanging
[15:05] <Kdecherf> A restart of the mds doesn't fix this issue
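
Not a fix for the ESTALE errors themselves, but a sketch of the usual way to see what the mds is doing around such a restart (the log path and debug level are typical defaults, not prescriptions):

    ceph -s            # overall health, including the mds state
    ceph mds dump      # which mds holds the active rank and what state it is in
    # raise mds logging by putting "debug mds = 20" under [mds] in ceph.conf,
    # restart the mds, then watch its log under /var/log/ceph/ for the ESTALE messages
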
[15:12] * itamar (~itamar@82.166.185.149) Quit (Remote host closed the connection)
[15:12] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: Leaving.)
[15:15] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[15:33] * rlr219 (43c87e04@ircip1.mibbit.com) has joined #ceph
[15:33] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[15:35] <rlr219> Hi All. I am slowly upgrading my ceph cluster from 0.56.1 to 0.56.2 and I noticed an error message in my ceph -w output this morning. Coincidentally it is from 2 of the OSDs I upgraded this morning. Just wondering if I should be worried
[15:36] <sjustlaptop> what message?
[15:36] <rlr219> osd.9 [ERR] 2.406 osd.9 inconsistent snapcolls on be172406/rb.0.168a.1b99b739.000000005a3c/53//2 found 53 expected 17,53 2013-02-04 09:27:12.671442 osd.9 [ERR] 2.406 osd.17 inconsistent snapcolls on be172406/rb.0.168a.1b99b739.000000005a3c/53//2 found 53 expected 17,53 2013-02-04 09:27:13.718475 osd.9 [ERR] 2.406 osd.9 inconsistent snapcolls on 5e2a7c06/rb.0.
[15:36] <rlr219> Hi sjust. let me put it in paste bin real quick.
[15:37] <rlr219> http://mibpaste.com/IGh20w
[15:37] <sjustlaptop> it's not a big deal, we need an automated way to fix it
[15:38] <rlr219> I saw the scrub and figured that it was correcting it, but I was a little leery.
[15:39] * gucki (~smuxi@77-56-36-164.dclient.hispeed.ch) has joined #ceph
[15:39] <rlr219> I upgraded and rebuilt those 2 OSDs on Friday. They ran over the weekend just fine, so I am starting the process of upgrading all the OSDs and MONs to 0.56.2
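
For anyone else seeing the same [ERR] lines: the inconsistency is reported per placement group, so one low-risk step is to re-scrub the affected PG and watch whether the error recurs; pg 2.406 below is taken from the paste above.

    ceph pg scrub 2.406    # schedule another scrub of the reported pg
    ceph -w                # watch for the scrub result / further [ERR] lines
    # "ceph pg repair 2.406" also exists, but whether it handles this particular
    # snapcolls inconsistency is not established in this conversation
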
[15:48] * drokita (~drokita@199.255.228.128) has joined #ceph
[15:48] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[15:52] * trond (~trond@trh.betradar.com) has joined #ceph
[15:56] * loicd1 (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[15:57] * loicd1 (~loic@3.46-14-84.ripe.coltfrance.com) Quit ()
[15:57] * sleinen (~Adium@130.59.94.82) has joined #ceph
[15:59] * sleinen1 (~Adium@2001:620:0:25:305e:6c40:6644:c2ab) has joined #ceph
[16:03] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[16:03] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[16:05] * sleinen (~Adium@130.59.94.82) Quit (Ping timeout: 480 seconds)
[16:06] <xiaoxi> nhm: are you here?
[16:08] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[16:09] <janos> xiaoxi returns! how's the cluster?
[16:10] * vata (~vata@2607:fad8:4:6:2c77:ee92:9bfa:3f07) has joined #ceph
[16:12] <xiaoxi> janos: Thanks~ It's good now... there were a lot of HW issues; fixing all of them made ceph work well
[16:13] <janos> glad to hear it
[16:13] <xiaoxi> for rados bench I could get 1.9GB/s for write
[16:13] <janos> nice!
[16:13] <xiaoxi> but the strange thing is that the read performance is lower than the write performance
[16:14] <xiaoxi> I did a 128-RBD test some hours ago: I created 128 VMs and attached 128 RBDs to them via qemu, running AIO-STRESS on top of the rbds.
[16:15] <xiaoxi> I saw ~1.7GB/s for writes but only 1.3GB/s for reads..
[16:15] <xiaoxi> this is really strange
[16:15] <janos> yeah i'm not sure how that works out
[16:15] <janos> glad you're further along though
[16:16] <xiaoxi> janos: thanks a lot for all of your help~
[16:17] <xiaoxi> I was once about to give up and blame ceph :) but it turned out I had to blame my hw
[16:34] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) Quit (Remote host closed the connection)
[16:34] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) has joined #ceph
[16:35] * scalability-junk (~stp@188-193-201-35-dynip.superkabel.de) has joined #ceph
[16:43] * loicd (~loic@magenta.dachary.org) has joined #ceph
[16:49] * scalability-junk (~stp@188-193-201-35-dynip.superkabel.de) Quit (Quit: Leaving)
[16:54] * mattch (~mattch@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[16:56] <elder> sage, is there a tracker issue for that xfs hang issue?
[16:56] <elder> I'm testing the fix now but would like to update it (or create one)
[16:57] * jamespage (~jamespage@tobermory.gromper.net) has joined #ceph
[17:02] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[17:04] * sleinen1 (~Adium@2001:620:0:25:305e:6c40:6644:c2ab) Quit (Quit: Leaving.)
[17:04] * sleinen (~Adium@130.59.94.82) has joined #ceph
[17:04] <elder> sage nevermind, I just opened http://tracker.ceph.com/issues/3997
[17:06] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:07] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[17:08] * BillK (~BillK@124.150.34.96) Quit (Ping timeout: 480 seconds)
[17:09] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[17:12] * sleinen (~Adium@130.59.94.82) Quit (Ping timeout: 480 seconds)
[17:13] * ScOut3R (~ScOut3R@212.96.47.215) Quit (Ping timeout: 480 seconds)
[17:14] * maxia (~sensa@114.91.121.183) Quit (Remote host closed the connection)
[17:17] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[17:18] <nhm> xiaoxi: Hi! That's very interesting regarding reads! I have seen something similar recently. Do you see long periods of stalls in the read tests?
[17:21] <xiaoxi> nhm: hi, I can see slow requests, but there seem to be no long stalls.
[17:23] <xiaoxi> nhm: oh yes, I just looked at the in-VM iostat data; here is a sample of BW (MB/s) over 9 consecutive seconds: (42, 27, 25, 4, 1, 1, 23, 5, 11)
[17:25] * BillK (~BillK@124.150.63.178) has joined #ceph
[17:25] * jlogan (~Thunderbi@2600:c00:3010:1:5c67:1323:2b43:da43) has joined #ceph
[17:27] <nhm> xiaoxi: On another platform I am testing for another customer, I see 3.3GB/s writes but also around 1.3GB/s reads, with long stalls on the reads. I haven't determined the cause yet.
[17:28] <nhm> xiaoxi: I'll be doing more testing on that system later this week.
[17:29] <xiaoxi> nhm:well, are they also using XFS?
[17:30] <nhm> xiaoxi: I think I saw it with both XFS and EXT4.
[17:30] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[17:30] <nhm> xiaoxi: Right now they are reconfiguring the hardware for other reasons. I'll know more later.
[17:30] <xiaoxi> but as I asked on the mailing list, there is no reason for BTRFS to outperform its counterparts for sequential reads
[17:33] <nhm> xiaoxi: ah, that's a good question. Too bad Sam is in Germany right now, he might have ideas about if we do anything differently for BTRFS reads.
[17:34] * jlogan1 (~Thunderbi@2600:c00:3010:1:852f:a2dd:c540:fa16) has joined #ceph
[17:35] <nhm> xiaoxi: Unfortunately I don't know the answer right now. I've been much more focused on the write path. It is strange that reads are going slow.
[17:35] <nhm> xiaoxi: or at least that they are not scaling well.
[17:36] <nhm> xiaoxi: one thing I have noticed is that the number of OSD op threads seems to have a significant impact on read performance. Perhaps there is some thread contention problem only present in reads.
[17:36] * foxhunt (~richard@office2.argeweb.nl) Quit (Remote host closed the connection)
[17:37] <nhm> It seems like that would be more of a per-osd thing though, not a per-node issue.
[17:38] <xiaoxi> nhm: yes, I guessed the osd op thread count may be too small, since osd op threads are blocked during reads, but increasing it seems to make no difference on my setup
[17:40] <xiaoxi> nhm: I also put most of my attention on the write path in the past; I used to believe read = 2x write... but that's been proven wrong now
[17:41] * jlogan (~Thunderbi@2600:c00:3010:1:5c67:1323:2b43:da43) Quit (Ping timeout: 480 seconds)
[17:46] <nhm> xiaoxi: you may find that it is with fewer OSDs, but there may be some bottleneck blocking it.
[17:47] <nhm> xiaoxi: I'm hopeful that if it is a problem with Ceph, it should be fairly easy to fix since we can get good write performance and writes are much more complicated than reads.
[17:48] <nhm> xiaoxi: Interestingly my supermicro test node can do about 2GB/s for both reads and writes.
[17:50] <nhm> actually, more like 1.5GB/s for each with XFS.
[17:51] <nhm> BTRFS can do 2GB/s for both, and EXT4 can only do 2GB/s for writes.
[17:54] * nwl (~levine@atticus.yoyo.org) Quit (Remote host closed the connection)
[17:54] * nwl (~levine@atticus.yoyo.org) has joined #ceph
[17:54] * nwl (~levine@atticus.yoyo.org) Quit ()
[17:55] <xiaoxi> nhm: but we have to admit that writes can potentially benefit from the pagecache, which makes them more sequential; for reads that seems harder
[17:57] * nwl (~levine@atticus.yoyo.org) has joined #ceph
[17:57] <nhm> xiaoxi: that's true, but it doesn't seem like that would show up like this? IE with fewer OSDs/node reads are faster, but then cease to scale with lots of OSDs?
[17:58] <nhm> xiaoxi: that's the behavior I have noticed so far anyway.
[17:58] <nhm> xiaoxi: Maybe when I do my next set of tests I'll run blktrace to see IO pattern on the OSDs during the reads.
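
A minimal blktrace invocation for that kind of check, assuming /dev/sdb is an osd data disk (purely an example device, run on the osd host during the read test):

    sudo blktrace -d /dev/sdb -o - | blkparse -i -    # live view of every request hitting the device
    # or capture to files and post-process later:
    sudo blktrace -d /dev/sdb -o osd-read-trace
    blkparse -i osd-read-trace | less
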
[17:59] <xiaoxi> nhm: I do have some iostat data showing that the avgrq-size is significantly smaller than that of writes
[17:59] * nwl (~levine@atticus.yoyo.org) has left #ceph
[17:59] <nhm> xiaoxi: interesting
[18:00] <nhm> xiaoxi: blktrace might reveal lots of seeks
[18:00] * nwl (~levine@atticus.yoyo.org) has joined #ceph
[18:00] <xiaoxi> this is also strange: for a single write it has to write the pg_log, pg_info, and then the object, but for a read it reads the object directly, so why does it have an even smaller avgrq-sz?
[18:00] <nhm> xiaoxi: Like you said, I wonder if we are doing something else during reads. Lots of little IOs that drag the avgrq-size down.
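
The avgrq-size numbers being compared here come from extended iostat output; a sketch of collecting them during a run (avgrq-sz is reported in 512-byte sectors, so 256 sectors corresponds to 128K requests):

    iostat -x 1    # per-second extended device stats; watch the avgrq-sz column
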
[18:00] <nwl> morning all
[18:01] <nhm> xiaoxi: I know, that's why this is confusing.
[18:01] <nhm> nwl: good morning!
[18:02] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[18:02] <xiaoxi> nhm: yes, if sam can give us some input, that would be great. From the code I really cannot find anything unusual; maybe I need to turn on debug filestore and read the log.
[18:02] <xiaoxi> nhm: when you said 1.3GB/s read, did you mean librados performance, RBD, or something else?
[18:03] <nhm> xiaoxi: that was doing rados bench "seq" test.
[18:03] <nhm> xiaoxi: 8 rados bench instances, each reading from its own pool.
[18:03] <xiaoxi> with 4M req size?
[18:03] <nhm> xiaoxi: yes
[18:03] <nhm> xiaoxi: I probably have average request size results.
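
For comparison, the test nhm describes is roughly the following per instance (the pool name and runtime are made up here, and the object-retention flag may differ between versions):

    rados -p testpool bench 60 write -b 4194304 --no-cleanup   # 4M writes, keep the objects around
    rados -p testpool bench 60 seq                             # sequential reads of the objects written above
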
[18:04] <nhm> xiaoxi: what I noticed though was periods of high throughput followed by multi-second stalls.
[18:04] <nhm> I was wondering if rados bench was broken.
[18:05] <xiaoxi> Maybe I will see the same if I use rados bench. What I've seen with RBD is periods of high throughput followed by very low throughput.
[18:06] <xiaoxi> nhm:have you found the avgrq-size results?
[18:06] <nhm> xiaoxi: one sec, let me check my results on the SC847a chassis.
[18:09] <xiaoxi> BTW: several days ago some guy advised me to mark an OSD out; it's bad to do so. Marking an osd out results in ceph removing all the data residing on that osd and remapping it to other osds, and when you want to bring the osd back... you have to wait quite a long time. I would bet that killing the osd process is better than marking an osd out
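
The alternative xiaoxi describes (stopping the daemon without triggering a full re-replication) is usually done with the noout flag; a sketch, with osd.3 as an example id and a sysvinit-style service call that varies by distro:

    ceph osd set noout                  # don't automatically mark down osds out
    sudo /etc/init.d/ceph stop osd.3    # stop the daemon on its host
    # ... maintenance ...
    sudo /etc/init.d/ceph start osd.3
    ceph osd unset noout
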
[18:09] <nhm> xiaoxi: looks like I'm seeing pretty consistent 128K reads from the collectl results.
[18:09] <xiaoxi> nhm: that's the same as my result
[18:10] <nhm> xiaoxi: what does /sys/block/<device>/queue/hw_sector_size say? Probably 512?
[18:10] <xiaoxi> mine are not that consistent, but only 2~4 bytes away from 128K
[18:11] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[18:11] <xiaoxi> nhm:512
[18:14] <nhm> xiaoxi: hrm, no good answer right now. Probably need to look at the code again.
[18:14] * BillK (~BillK@124.150.63.178) Quit (Ping timeout: 480 seconds)
[18:16] <xiaoxi> yeah, could you please keep me updated if you get any input from sam or sage on this issue?
[18:16] <nhm> sure
[18:16] <xiaoxi> thx
[18:16] <nhm> xiaoxi: Sage just mentioned it may be related to readahead.
[18:17] * low (~low@188.165.111.2) Quit (Quit: Leaving)
[18:17] <xiaoxi> I doubt that; readahead always makes requests larger, not smaller
[18:18] * The_Bishop (~bishop@2001:470:50b6:0:6157:b33e:c40d:437b) Quit (Ping timeout: 480 seconds)
[18:20] <nhm> xiaoxi: Might be worth playing with /sys/block/<device>/queue/read_ahead_kb since it does default to 128
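
A sketch of checking and raising that setting on an osd data disk (the device name and the new value are examples; the setting is per device and is not persistent across reboots by default):

    cat /sys/block/sdb/queue/read_ahead_kb                     # typically 128
    echo 4096 | sudo tee /sys/block/sdb/queue/read_ahead_kb    # larger readahead for the read test
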
[18:23] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:24] * alram (~alram@38.122.20.226) has joined #ceph
[18:28] * xiaoxi (~xiaoxiche@134.134.139.72) Quit (Remote host closed the connection)
[18:30] * leseb (~leseb@mx00.stone-it.com) Quit (Remote host closed the connection)
[18:31] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[18:32] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[18:32] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[18:34] * BillK (~BillK@58-7-218-209.dyn.iinet.net.au) has joined #ceph
[18:34] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit ()
[18:38] * gucki (~smuxi@77-56-36-164.dclient.hispeed.ch) Quit (Remote host closed the connection)
[18:41] <elder> nhm, it's now 11:41.
[18:41] <elder> We haven't discussed when (or now, if) to meet for lunch today.
[18:42] <elder> If you want to meet at Rosedale I think I could do that by 12:30.
[18:44] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[18:45] <nhm> elder: Yikes, you are correct
[18:45] <elder> What do you say?
[18:45] <nhm> elder; One sec
[18:46] <nhm> elder: yeah, lets do it
[18:46] <elder> 12:30?
[18:47] <nhm> sounds good
[18:47] <elder> How about by Big Bowl. We can choose something else but at least meet there. I'm going to get ready now.
[18:47] <nhm> Ok, works for me
[18:51] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:51] * leseb_ (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:52] * danieagle (~Daniel@177.97.251.251) has joined #ceph
[18:54] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[18:56] * chutzpah (~chutz@199.21.234.7) has joined #ceph
[18:57] <sagewk> everyone: blew away the teuthology queue. there is likely a problem with the testing ceph-client.git branch...
[18:58] * lx0 is now known as lxo
[18:59] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[19:01] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Always try to be modest, and be proud about it!)
[19:02] * hybrid512 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[19:06] * Cube (~Cube@12.248.40.138) has joined #ceph
[19:08] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:08] * Cube (~Cube@12.248.40.138) Quit (Read error: Connection reset by peer)
[19:08] * Cube (~Cube@12.248.40.138) has joined #ceph
[19:11] * drokita1 (~drokita@199.255.228.128) has joined #ceph
[19:14] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) Quit (Read error: Connection reset by peer)
[19:14] * mdxi_ (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[19:14] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) has joined #ceph
[19:15] * drokita (~drokita@199.255.228.128) Quit (Ping timeout: 480 seconds)
[19:16] * ScOut3R (~scout3r@540079A1.dsl.pool.telekom.hu) has joined #ceph
[19:16] * jamespage (~jamespage@tobermory.gromper.net) Quit (Ping timeout: 480 seconds)
[19:16] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[19:16] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[19:17] * jamespage (~jamespage@tobermory.gromper.net) has joined #ceph
[19:22] * BillK (~BillK@58-7-218-209.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[19:33] * snaff (~z@81-86-160-226.dsl.pipex.com) Quit (Quit: Leaving)
[19:36] * danieagle (~Daniel@177.97.251.251) Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[19:45] * Cube1 (~Cube@12.248.40.138) has joined #ceph
[19:45] * Cube (~Cube@12.248.40.138) Quit (Read error: Connection reset by peer)
[19:49] * rlr219 (43c87e04@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[19:54] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[19:55] <noob2> the librbd for python is great! works in rhel6 no problem
[19:56] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[19:57] <dmick-away> noob2: awesome
[19:57] * dmick-away is now known as dmick
[19:57] <dmick> it's a quick way to do stuff with images, for sure
[19:57] <absynth_> evening, everyone
[19:57] * sleinen1 (~Adium@2001:620:0:25:a169:920a:3509:21ef) has joined #ceph
[19:58] <janos> evening, absynth_
[19:58] <janos> though it's more like 2pm here ;)
[19:59] <absynth_> it's waiting-for-bobtail pm here
[19:59] <janos> is the .56.x branch sort of a pre-bobtail?
[20:00] <dmick> pretty much
[20:04] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[20:05] * Cube (~Cube@12.248.40.138) has joined #ceph
[20:09] * Cube1 (~Cube@12.248.40.138) Quit (Ping timeout: 480 seconds)
[20:10] * rturk is now known as rturk-away
[20:13] * Cube (~Cube@12.248.40.138) Quit (Remote host closed the connection)
[20:13] * Cube (~Cube@12.248.40.138) has joined #ceph
[20:14] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[20:17] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:17] * rturk-away is now known as rturk
[20:19] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[20:19] * Cube1 (~Cube@12.248.40.138) has joined #ceph
[20:19] * Cube (~Cube@12.248.40.138) Quit (Read error: Connection reset by peer)
[20:22] <elder> gregaf, are you around?
[20:22] <gregaf> yeah
[20:22] <gregaf> what's up?
[20:22] <elder> I want to help with whatever's bothering ceph teuthology runs but I don't really have much info.
[20:22] <elder> Sage just said there were problems and supplied a stack trace.
[20:22] <elder> Ian said you might have more info.
[20:22] <gregaf> ah, yeah
[20:22] <gregaf> I haven't looked into the details at all yet
[20:23] <elder> Are you looking at it for now?
[20:23] <elder> I'm available to help, I just want to coordinate.
[20:23] <gregaf> not at the moment
[20:23] <gregaf> I just have tried several times over the last couple days to get a kernel mount on a cluster via teuthology
[20:23] <gregaf> (with an interactive: stuck at the end so I can go in and do some stuff on it)
[20:23] <elder> So can you easily reproduce the problem? Can you tell me how?
[20:24] <gregaf> and it hasn't succeeded; looks to be failing on the call to mount
[20:24] <gregaf> yeah
[20:24] <elder> Send me something and I'll see if I can get some more info.
[20:24] <gregaf> http://pastebin.com/0StyYh8i
[20:24] <elder> Maybe bisect. I have some stuff in the latest testing branch and want to either fix it or rule it out.
[20:24] <gregaf> I've just been using that teuthology conf
[20:24] <elder> OK, I'll go see what I can do with it, thanks.
[20:25] <gregaf> Sage said that it looked like runs were hanging on all the kclient tasks as well and so I imagine it's the same thing, but I don't know at all :)
[20:25] <gregaf> ty!
[20:25] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Remote host closed the connection)
[20:25] <elder> But first, some lunch.
[20:29] * Cube1 (~Cube@12.248.40.138) Quit (Ping timeout: 480 seconds)
[20:29] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[20:30] * Cube (~Cube@12.248.40.138) has joined #ceph
[20:44] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[20:55] * ScOut3R (~scout3r@540079A1.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[21:04] * Cube1 (~Cube@12.248.40.138) has joined #ceph
[21:04] * Cube (~Cube@12.248.40.138) Quit (Ping timeout: 480 seconds)
[21:10] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[21:31] * Cube (~Cube@12.248.40.138) has joined #ceph
[21:34] * drokita1 (~drokita@199.255.228.128) Quit (Ping timeout: 480 seconds)
[21:37] * Cube1 (~Cube@12.248.40.138) Quit (Ping timeout: 480 seconds)
[21:40] <elder> gregaf are you very familiar with the mount path in ceph?
[21:41] <elder> I'm trying to interpret the debug output I've got, but it's all new to me. If you aren't, it's OK, just looking for a shortcut.
[21:43] <gregaf> elder: what do you mean "the mount path"?
[21:43] <gregaf> the server code involved?
[21:43] <elder> Yes, I suppose so.
[21:43] <elder> Well, I don't know. I'm looking at the kernel client.
[21:43] <gregaf> reasonably with the server parts, not much with the kernel client
[21:43] <elder> I'm just trying to know what things it does when it's trying to mount a file system.
[21:43] <elder> OK.
[21:43] <gregaf> what kind of debug output do you have?
[21:43] <gregaf> looking at the messages?
[21:44] <elder> [ 1929.733665] ceph: mount opening root
[21:44] <elder> [ 1929.733666] ceph: open_root_inode opening ''
[21:44] <elder> [ 1929.733670] ceph: do_request on ffff880221606800
[21:44] <elder> [ 1929.733673] ceph: reserve caps ctx=ffff880221606ba8 need=2
[21:44] <elder> [ 1929.733680] ceph: reserve caps ctx=ffff880221606ba8 2 = 0 used + 2 resv + 0 avail
[21:44] <elder> [ 1929.733681] ceph: __register_request ffff880221606800 tid 1
[21:44] <elder> [ 1929.733683] ceph: __choose_mds (null) is_hash=0 (0) mode 0
[21:44] <elder> [ 1929.733685] ceph: choose_mds chose random mds-1
[21:44] <elder> [ 1929.733685] ceph: do_request no mds or not active, waiting for map
[21:44] <elder> [ 1929.733686] ceph: do_request waiting
[21:44] <elder> That sort of thing
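
For anyone wanting the same kind of output: those "ceph:" lines are the kernel client's dynamic debug messages; a sketch of enabling them, assuming a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs available:

    sudo mount -t debugfs none /sys/kernel/debug                      # if not already mounted
    echo 'module ceph +p'    | sudo tee /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
    dmesg | tail                                                      # the dout output goes to the kernel log
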
[21:44] * rturk is now known as rturk-away
[21:44] <gregaf> okay, that looks to be where the MDS hasn't come up all the way so it's waiting on a new MDSMap from the monitor
[21:45] <gregaf> that actually shouldn't be happening in teuthology I don't think, as there's normally a wait-for-healthy that happens before it moves on to the client mounting tasks
[21:46] <elder> However that follows this, so it maybe should already have a map? [ 1926.652922] ceph: handle_map epoch 2 len 54
[21:46] <elder> [ 1926.686786] ceph: mdsmap_decode success epoch 2
[21:46] <elder> [ 1926.720323] ceph: check_new_map new 2 old 0
[21:46] <gregaf> you could check by looking at cpeh-s output
[21:46] <gregaf> yeah, if the map epoch is 2 then it ought to have an MDS already that it can talk to
[21:46] <elder> The test is over, I'm just browsing the output it produced.
[21:47] <gregaf> it looks like either the MDS wasn't running (unlikely but possible), or else the kernel client erroneously believed one wasn't running, or the kernel client was started when the MDS wasn't running and then failed to wake up when it did start
[21:48] <gregaf> so I'd get a live system and go browse around while the mount command is hanging
[21:49] <gregaf> brb
[21:59] <gregaf> here again
[22:00] <elder> OK. The test quits fairly quickly.
[22:00] <elder> And I don't really know what I'm looking for.
[22:01] <gregaf> the test quits because mount eventually fails, right?
[22:01] <gregaf> it seemed to take quite a while when I was doing it, but perhaps not
[22:01] <gregaf> do you have debugging from the servers?
[22:02] <gregaf> you could look through and see if the client is connecting to one or both of the monitor and MDS
[22:03] <elder> This is pretty new to me. Yes, I think the mount times out (well, something gets EIO before it quits)
[22:03] <elder> I don't know if I have debugging fom the servers, but could get it. I haven't ever looked at the server side stuff really.
[22:04] <elder> I used the yaml file you supplied, basically.
[22:07] <gregaf> yeah, so I'd run it and watch the teuthology output until it gets to the point where it's called mount (but hasn't timed out)
[22:07] <gregaf> run ceph -s on one of the servers at that point and see what it says
[22:07] <gregaf> that'll let you know if it's a setup issue or a kernel issue
[22:07] <gregaf> other tests are succeeding so I assume it's a kernel issue, but maybe there's some teuthology config issue or something else
[22:09] <elder> OK.
[22:11] * Cube (~Cube@12.248.40.138) Quit (Ping timeout: 480 seconds)
[22:11] * leseb_ (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:13] <elder> Shoot, I haven't updated my tools to understand the changes Sam L put in for /tmp/cephtest/<another_level>
[22:13] <nhm> is it bad I just typed "git vim <src file>"?
[22:13] <elder> Ahhhhhh!!!!
[22:13] <elder> YOU BROKE EVERYTHING!
[22:14] <elder> Did it give an error, nhm?
[22:14] <elder> (I got git: 'vim' is not a git command. See 'git --help'.
[22:14] <elder> )
[22:14] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:14] <nhm> elder: same. :)
[22:15] <elder> Then I expect all is well.
[22:15] * Cube (~Cube@12.248.40.138) has joined #ceph
[22:23] * Cube1 (~Cube@66-87-66-89.pools.spcsdns.net) has joined #ceph
[22:27] <elder> OK, gregaf I am getting this, but it could be due to the change in /tmp/cephtest:
[22:27] <elder> 2013-02-04 13:26:26.708576 7ff6f3184780 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
[22:27] <elder> 2013-02-04 13:26:26.708605 7ff6f3184780 -1 ceph_tool_common_init failed.
[22:27] <gregaf> ah, that could definitely be the problem
[22:28] <gregaf> slang1: looks like maybe the kclient task didn't get updated properly?
[22:28] <gregaf> elder: check what keyring path it's using and whether there's a file there, and if there should be in the teuthology code?
[22:28] <elder> Well, I mean it could be my own setup that's at fault, but yeah, if it's the code, well, fine...
[22:28] <elder> How do I check that?
[22:29] <gregaf> I think the teuthology log will have printed out what it's using, and you can ssh into the box and see if it's there, or look through the teuthology task code to see if that path is being added by anything
[22:29] <elder> OK.
[22:29] <nhm> elder: I more meant, "Does this officially man I've become insane?"
[22:29] <nhm> s/man/mean
[22:30] <elder> Only if you had said "vim git <path>"
[22:30] <slang1> elder: if you want the old behavior, you should be able to put:
[22:30] <slang1> test_path: /tmp/cephtest
[22:30] <slang1> in your .teuthology.yaml
[22:30] <elder> I just updated taht.
[22:30] <elder> that
[22:30] <slang1> I would probably do that for now
[22:30] * Cube (~Cube@12.248.40.138) Quit (Ping timeout: 480 seconds)
[22:30] <slang1> ok
[22:30] <elder> Oh
[22:30] <elder> I didn't do that, but I will.
[22:31] <slang1> elder: there's definitely some bits of the qa suite that still point to /tmp/cephtest
[22:31] <slang1> elder: are you running a specific yaml or just an entire suite?
[22:31] <elder> Specific yaml
[22:32] <elder> Greg pastebin'd it above
[22:32] <elder> I'll find it shortly
[22:32] * slang1 looks
[22:32] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:32] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:33] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:33] <elder> http://pastebin.com/0StyYh8i
[22:33] <dmick> wow, wtf
[22:33] <dmick> ImportError: No module named pexpect
[22:34] <slang1> dmick: re-bootstrap
[22:34] <dmick> I guess
[22:34] <dmick> sometimes I don't like venvs
[22:35] <elder> OK, rewind. I'm not sure I'm doing the right thing here. I'm normally sitting on the client, and whatever I normally do might not be working here.
[22:36] <elder> What should I do in order to be able to run "ceph" in the teuthology environment on the target node?
[22:36] <elder> slang1? gregaf?
[22:36] <slang1> elder: everything should work the same as before
[22:37] <elder> OK, but let's assume I have never done this before.
[22:37] <slang1> elder: (assuming you put test_path: /tmp/cephtest in your .teuthology.yaml)
[22:37] <dmick> I just discovered /tmp/cephtest/<name>/
[22:37] <dmick> that's new elder
[22:37] <slang1> elder: you have a ~/.teuthology.yaml?
[22:37] <slang1> dmick: that won't happen if you set test_path
[22:37] <gregaf> elder: there's a song-and-dance routine with setting the LD_LOAD_PATH and everything; check the logs to see how it's calling executables and use that
[22:38] <elder> slang1, yes
[22:38] <elder> Here is what I normally do:
[22:38] <elder> #!/bin/bash
[22:38] <elder> CROOT=/tmp/cephtest/elder
[22:38] <elder> CROOT=/tmp/cephtest
[22:38] <elder> cd "${CROOT}"
[22:38] <elder> export CEPH_ARGS="--conf ${CROOT}/ceph.conf"
[22:38] <elder> export CEPH_ARGS="${CEPH_ARGS} --keyring ${CROOT}/data/client.0.keyring"
[22:38] <elder> export CEPH_ARGS="${CEPH_ARGS} --name client.0"
[22:38] <dmick> slang1: guess I should read...something...more carefully
[22:38] <elder> export LD_LIBRARY_PATH="${CROOT}/binary/usr/local/lib:${LD_LIBRARY_PATH}"
[22:38] <elder> export PATH="${CROOT}/binary/usr/local/bin:${PATH}"
[22:38] <elder> export PATH="${CROOT}/binary/usr/local/sbin:${PATH}"
[22:38] <elder> (But that CROOT part is new)
[22:38] <slang1> dmick: :-)
[22:38] <elder> I just read it :)
[22:39] <elder> Ah, I think I have to change that --keyring part for something other than the client.
[22:39] <elder> Right?
[22:39] <slang1> btw, josh just pushed a fix for the case where you *don't* set test_path
[22:39] <slang1> well, not just, but...
[22:40] <elder> Sorry guys, I have to go pick up my son. I'll be back in half an hour or so and then I'll pick up where I left off.
[22:40] <slang1> elder: so if you set test_path: /tmp/cephtest, you won't have to change anything there
[22:41] * sleinen1 (~Adium@2001:620:0:25:a169:920a:3509:21ef) Quit (Ping timeout: 480 seconds)
[22:41] <dmick> I had a cheat-file with LD_LIBRARY_PATH, PATH, and CEPH_CONF settings to allow me to operate seminormally in an interactive shell
[22:42] <dmick> (I'd just paste the contents)
[22:42] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[22:42] * scuttlemonkey (~scuttlemo@217.64.252.30.mactelecom.net) Quit (Quit: This computer has gone to sleep)
[22:42] * slang1 will start putting at the top of emails like that: If you don't read to the end, it's going to bite you in the ass later
[22:42] <dmick> now I guess it would be nice to have an env file left behind to use for that, since the dirs are (at least by default) dynamically named
[22:43] <slang1> dmick: could be generated, yeah
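
A sketch of what such a generated env file could look like; everything here mirrors elder's cheat file above, and the per-run directory name is hypothetical:

    # /tmp/cephtest/<user-host-date>/env -- hypothetical, written by teuthology when the job starts
    CROOT=/tmp/cephtest/elder-plana01-20130204
    export CEPH_ARGS="--conf ${CROOT}/ceph.conf --keyring ${CROOT}/data/client.0.keyring --name client.0"
    export LD_LIBRARY_PATH="${CROOT}/binary/usr/local/lib:${LD_LIBRARY_PATH}"
    export PATH="${CROOT}/binary/usr/local/bin:${CROOT}/binary/usr/local/sbin:${PATH}"
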
[22:43] <dmick> it's my fault; I read that one on the train before I was fully awake and before I could browse the diffs to get it stuck in my head
[22:44] * jlogan1 (~Thunderbi@2600:c00:3010:1:852f:a2dd:c540:fa16) Quit (Ping timeout: 480 seconds)
[22:46] * jlogan (~Thunderbi@72.5.59.176) has joined #ceph
[22:50] * sleinen (~Adium@2001:620:0:25:813a:7dbe:d2d0:263b) has joined #ceph
[22:55] * sleinen1 (~Adium@2001:620:0:25:c57e:54:50ee:c031) has joined #ceph
[22:56] * Cube1 (~Cube@66-87-66-89.pools.spcsdns.net) Quit (Read error: Connection reset by peer)
[22:56] * Cube (~Cube@66-87-66-89.pools.spcsdns.net) has joined #ceph
[22:57] * Cube1 (~Cube@66-87-66-89.pools.spcsdns.net) has joined #ceph
[22:57] * Cube (~Cube@66-87-66-89.pools.spcsdns.net) Quit (Read error: Connection reset by peer)
[22:58] * sleinen (~Adium@2001:620:0:25:813a:7dbe:d2d0:263b) Quit (Ping timeout: 480 seconds)
[22:58] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[23:01] * Cube (~Cube@66-87-66-89.pools.spcsdns.net) has joined #ceph
[23:01] * Cube1 (~Cube@66-87-66-89.pools.spcsdns.net) Quit (Read error: Connection reset by peer)
[23:02] * sjustlaptop (~sam@fw.office-fra1.proio.com) Quit (Ping timeout: 480 seconds)
[23:04] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[23:05] * sleinen1 (~Adium@2001:620:0:25:c57e:54:50ee:c031) Quit (Ping timeout: 480 seconds)
[23:08] * Cube1 (~Cube@12.248.40.138) has joined #ceph
[23:11] * Cube (~Cube@66-87-66-89.pools.spcsdns.net) Quit (Ping timeout: 480 seconds)
[23:12] * sleinen (~Adium@2001:620:0:25:c57e:54:50ee:c031) has joined #ceph
[23:13] <slang1> http://pastebin.com/WLLH4u1S
[23:13] <slang1> that seems to be the error coming from the monitor
[23:14] <slang1> definitely the wrong key, but there's a key file in /tmp/cephtest/data/client.0.secret
[23:16] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[23:20] * BillK (~BillK@124-148-101-34.dyn.iinet.net.au) has joined #ceph
[23:25] * ScOut3R (~scout3r@540079A1.dsl.pool.telekom.hu) has joined #ceph
[23:25] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[23:25] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[23:28] * sleinen (~Adium@2001:620:0:25:c57e:54:50ee:c031) Quit (Ping timeout: 480 seconds)
[23:30] <elder> slang1, I do have "test_path: /tmp/cephtest" in my ~/.teuthology.yaml
[23:30] <elder> dmick, that thing I pasted is basically my own cheat file.
[23:30] <slang1> elder: yeah I'm able to reproduce the error with the same setup
[23:30] <slang1> elder: I don't see how its a /tmp/cephtest issue yet though
[23:31] <slang1> elder: the mon is getting a key from the client, but its the wrong key
[23:31] <elder> dmick, slang1, I agree it would be nice if the name of the directory would be in an environment variable or something. That was my first reaction. I like the user/host/date stamp, but I want to have access to it.
[23:31] <elder> I was getting that error with /tmp/cephtest so yes, I agree it's something else.
[23:32] <gregaf> is the key stuff actually being parsed correctly? eg is mount saying "key" instead of "keyfile" or something, maybe?
[23:33] <elder> Like I said before, I'm used to running on the client, hence my CEPH_ARGS includes "--name client.0" and so on. Maybe that's not right for running on an osd node.
[23:34] <elder> (I don't really know how all those arguments are used, I just have been using them...)
[23:34] <gregaf> oh, yeah, you need stuff that it has locally if you're trying to do that
[23:35] <dmick> <musing> it might be nice if ~ubuntu/.profile included /tmp/cephtest/env, which was written with the right path when a job runs
[23:36] * sleinen (~Adium@user-28-9.vpn.switch.ch) has joined #ceph
[23:36] <elder> Not a bad idea dmick
[23:37] <elder> Except what if there are multiple jobs?
[23:37] <elder> (I thought that was one advantage of creating the subdirectories)
[23:38] <dmick> I dunno.
[23:38] <dmick> if there are multiple jobs with different ceph configs, my head hurts
[23:38] <elder> But teuthology's doesn't
[23:40] <slang1> elder: teuthology does: sudo /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/sbin/mount.ceph 10.241.131.5:6789:/ /tmp/cephtest/mnt.0 -v -o name=client.0,secretfile=/tmp/cephtest/data/client.0.secret
[23:41] <slang1> elder: the multiple subdirs was more about saving the state of previous jobs
[23:41] <slang1> elder: would be tricky to do multiple _simultaneous_ jobs
[23:41] <elder> Seems like it should be possible though. But it maybe doesn't much matter.
[23:41] <slang1> elder: we used to have to blow away the whole /tmp/cephtest dir to run another job on that node
[23:42] <slang1> elder: yeah, I agree, its possible, I think sage is thinking of doing vms instead
[23:42] <slang1> elder: which solves the problem somewhat differently
[23:42] <elder> That's a different test scenario.
[23:42] <elder> But yeah, it loads the node in similar ways I suppose.
[23:43] * vata (~vata@2607:fad8:4:6:2c77:ee92:9bfa:3f07) Quit (Quit: Leaving.)
[23:51] * yasu` (~yasu`@dhcp-59-166.cse.ucsc.edu) has joined #ceph
[23:53] * sleinen (~Adium@user-28-9.vpn.switch.ch) Quit (Quit: Leaving.)
[23:53] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.