#ceph IRC Log

IRC Log for 2013-12-28

Timestamps are in GMT/BST.

[0:09] <sagewk> kbader: re http://lkcl.net/reports/ssd_analysis.html : is that really a representative sample of ssds on the market?
[0:09] <sagewk> also, the report doesn't seem to be dated?
[0:10] * bandrus (~Adium@107.222.156.198) Quit (Quit: Leaving.)
[0:12] * dlan (~dennis@116.228.88.131) Quit (Remote host closed the connection)
[0:16] * mattbenjamin (~matt@aa2.linuxbox.com) Quit (Quit: Leaving.)
[0:18] * nwat (~textual@ip-64-134-25-189.public.wayport.net) has joined #ceph
[0:21] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[0:22] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) Quit (Remote host closed the connection)
[0:22] <andreask> what exactly does that journal_force_aio option do?
[0:23] <andreask> gives me a huge performance improvement on my dm-crypt osds with journals on the same filesystem
[0:25] * nwat (~textual@ip-64-134-25-189.public.wayport.net) Quit (Read error: Operation timed out)
[0:25] * rudolfsteiner (~federicon@200.68.116.185) Quit (Quit: rudolfsteiner)
[0:25] * nwat (~textual@ip-64-134-25-189.public.wayport.net) has joined #ceph
[0:29] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[0:30] <sagewk> andreask: at some point we hit a bug in xfs or ext4 (i think xfs?) when doing aio to the journal. so we stopped using aio for file-based journals by default.. didn't have time to track it down
[0:31] <andreask> sagewk: but its not "dangerous" in the sense of potential data-loss?
[0:31] <sagewk> trying to find the original bug, hold on
[0:32] <sagewk> issue 4910
[0:32] <kraken> sagewk might be talking about http://tracker.ceph.com/issues/4910 [journal Unable to read past sequence 337 but header indicates the journal has committed up through 348, journal is corrupt]
[0:32] <andreask> sagewk: thanks
[0:33] <sagewk> so, yes, possibly it can cause data loss. we would need to retest against a newer kernel
[0:33] <sagewk> fwiw you should just put the journal on a dedicated parition on the same drive and you'll get better performance
[0:34] <andreask> no, I don't
[0:34] <sagewk> oh? it's faster with xfs, a file journal, and that option enabled?
[0:34] <andreask> yes ... but there is also dm-crypt enabled
[0:35] * nwat (~textual@ip-64-134-25-189.public.wayport.net) Quit (Quit: My MacBook has gone to sleep. ZZZzzz…)
[0:35] <andreask> and it's a pure ssd osd node
[0:36] <sagewk> ah
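The two journal layouts being compared here could be expressed in ceph.conf roughly as follows; this is a hypothetical fragment (the paths and device name are placeholders), not a recommendation from the log:

```ini
[osd]
; layout andreask is running: file-based journal on the OSD's own
; (dm-crypt-backed) filesystem, with aio forced back on
osd journal = /var/lib/ceph/osd/$cluster-$id/journal
journal force aio = true

; layout sagewk suggested: journal on a dedicated partition of the
; same drive (comment the lines above and use something like this)
; osd journal = /dev/sdb2
```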
[0:36] <andreask> looks like these intel ssds don't like dio
[0:37] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) Quit (Remote host closed the connection)
[0:38] <andreask> but not using dio if the ssds have a non-volatile cache can be assumed to be safe, I assume?
[0:40] <sagewk> dio vs not should not affect safety in any way (modulo kernel bugs :), only performance.
[0:40] <andreask> ah .. good to know ;-)
[0:41] <andreask> so journal is forcefully flushed anyway?
[0:45] <sagewk> yeah. if it's buffered io we use fdatasync or fsync; if direct_io we use an O_DSYNC flag. i think.. something like that.
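The flushing behaviour sage describes (fdatasync/fsync for buffered journal writes, O_DSYNC for direct writes) can be illustrated with plain POSIX file APIs. A minimal Python sketch of the two modes — not Ceph's actual FileJournal code, and omitting O_DIRECT with its buffer-alignment requirements:

```python
import os

def write_buffered(path, data):
    """Buffered write: data may sit in the page cache until an explicit
    fdatasync (or fsync) pushes it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fdatasync(fd)  # flush file data (and only the metadata needed to read it)
    finally:
        os.close(fd)

def write_dsync(path, data):
    """O_DSYNC write: each write() returns only once the data is on
    stable storage, so no separate flush call is required."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```

Either way the entry is durable when the call sequence completes, which is why buffered versus direct io is a performance rather than a safety choice (kernel bugs aside).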
[0:55] * zerick (~eocrospom@190.187.21.53) Quit (Remote host closed the connection)
[0:58] <andreask> great
[0:58] <andreask> hmm ... dm-crypt with ceph-deploy does not encrypt the journal if dedicated journal devices are used?
[1:10] * BillK (~BillK-OFT@124.149.111.175) Quit (Ping timeout: 480 seconds)
[1:15] * bandrus (~Adium@107.222.156.198) has joined #ceph
[1:21] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[1:23] * Pedras (~Adium@216.207.42.132) has joined #ceph
[1:26] * rudolfsteiner (~federicon@181.167.96.123) has joined #ceph
[1:29] * rudolfsteiner (~federicon@181.167.96.123) Quit ()
[1:30] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[1:35] * rudolfsteiner (~federicon@181.167.96.123) has joined #ceph
[1:37] * rudolfsteiner (~federicon@181.167.96.123) Quit ()
[1:41] * rudolfsteiner (~federicon@181.167.96.123) has joined #ceph
[1:42] * jesus (~jesus@emp048-51.eduroam.uu.se) Quit (Read error: Operation timed out)
[1:44] * rudolfsteiner (~federicon@181.167.96.123) Quit ()
[1:48] <bjornar> sagewk, is listomapvals only supposed to return first 512 entries?
[1:48] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) has joined #ceph
[1:48] <sagewk> iirc it is chunked.. so you call again with an offset and it'll give you the next 512
[1:49] <bjornar> I mean from rados tool
[1:49] <sagewk> oh, hmm
[1:49] <bjornar> at least it does, so if it's not supposed to, I guess it's a bug (if not fixed last week)
[1:50] <sagewk> yeah looks like a pretty trivial bug
[1:50] <sagewk> see tools/rados/rados.cc around line 1750
[1:51] <sagewk> probably s/ret/values.size()/ ?
[1:54] * wschulze1 (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[1:55] * wschulze1 (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit ()
[1:56] <bjornar> in the while .. probably
[1:56] <bjornar> don't know what omap_get_vals returns..
[1:56] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) Quit (Ping timeout: 480 seconds)
[1:57] <sagewk> bjornar: yeah i forget :)
[1:59] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:01] <joshd> bjornar: looks like just 0 or negative error code
[2:02] <bjornar> I don't know.. it returns the r from op.omap_get_vals(start_after, filter_prefix, max_return, out_vals, &r) if > 0
[2:02] * jesus (~jesus@emp048-51.eduroam.uu.se) has joined #ceph
[2:02] * sagelap1 (~sage@2600:1012:b02b:bde6:94da:65ac:7e18:cadd) has joined #ceph
[2:04] <joshd> yeah, and on the osd side it looks like that result isn't set to anything but the default (0) or an error code (https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L3719)
[2:04] <bjornar> so one could use values.size (out_vals.size) as sage suggests?
[2:05] <joshd> yeah, that sounds good to me
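The fix being converged on is the usual pagination pattern: since the op's return code is 0 on success, the loop has to be driven by the size of the returned map, resuming after the last key seen. A hypothetical Python sketch of that pattern — get_vals stands in for omap_get_vals; none of these names are from the Ceph API:

```python
def get_vals(store, start_after, max_return):
    """Stand-in for omap_get_vals: return up to max_return (key, value)
    pairs with keys strictly greater than start_after, in key order."""
    keys = sorted(k for k in store if k > start_after)
    return [(k, store[k]) for k in keys[:max_return]]

def list_all_omap_vals(store, chunk_size=512):
    """Drain the whole omap by paging: the loop condition uses the size
    of the returned chunk, not the call's (always-zero) return code."""
    out = []
    start_after = ""
    while True:
        chunk = get_vals(store, start_after, chunk_size)
        out.extend(chunk)
        if len(chunk) < chunk_size:  # the s/ret/values.size()/ idea
            break
        start_after = chunk[-1][0]  # resume after the last key seen
    return out
```

Using `ret` instead of the chunk size would stop the loop after the first 512 entries, which matches the symptom bjornar reported.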
[2:06] * sagelap (~sage@38.122.20.226) Quit (Ping timeout: 480 seconds)
[2:09] * BillK (~BillK-OFT@124.149.111.175) has joined #ceph
[2:12] * rudolfsteiner (~federicon@181.167.96.123) has joined #ceph
[2:20] * wer (~wer@206-248-239-142.unassigned.ntelos.net) has joined #ceph
[2:20] * rudolfsteiner (~federicon@181.167.96.123) Quit (Ping timeout: 480 seconds)
[2:22] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[2:22] * bjornar (~bjornar@ti0099a340-dhcp0395.bb.online.no) Quit (Ping timeout: 480 seconds)
[2:30] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[2:35] * Sysadmin88 (~IceChat77@2.218.8.40) Quit (Quit: Pull the pin and count to what?)
[2:36] * Pedras1 (~Adium@216.207.42.134) has joined #ceph
[2:36] * Pedras1 (~Adium@216.207.42.134) Quit ()
[2:42] * Pedras (~Adium@216.207.42.132) Quit (Ping timeout: 480 seconds)
[3:06] * Hakisho (~Hakisho@0001be3c.user.oftc.net) has left #ceph
[3:22] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[3:23] * bandrus (~Adium@107.222.156.198) Quit (Quit: Leaving.)
[3:30] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[3:36] <Psi-Jack> Heh, now that's just sad, but kinda funny. :)
[3:37] <Psi-Jack> I have been running ceph dumpling, but only included the cuttlefish rpm repo and not the dumpling repo, so I wasn't getting updates on ceph.
[3:39] * JC (~JC@71-94-44-243.static.trlk.ca.charter.com) Quit (Quit: Leaving.)
[3:44] * victoria24 (~waperasr4@77.211.20.144) has joined #ceph
[3:47] * victoria24 (~waperasr4@77.211.20.144) Quit ()
[3:50] * mattbenjamin (~matt@76-206-42-105.lightspeed.livnmi.sbcglobal.net) has joined #ceph
[3:50] * diegows (~diegows@190.190.17.57) Quit (Ping timeout: 480 seconds)
[4:02] * BillK (~BillK-OFT@124.149.111.175) Quit (Ping timeout: 480 seconds)
[4:03] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[4:11] * imjustmatthew (~imjustmat@pool-72-84-198-246.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[4:17] * angdraug (~angdraug@c-67-169-181-128.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[4:22] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[4:31] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[4:40] * zidarsk8 (~zidar@89-212-28-144.dynamic.t-2.net) Quit (Quit: Leaving.)
[4:58] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[5:06] * fireD_ (~fireD@93-139-153-186.adsl.net.t-com.hr) has joined #ceph
[5:07] * fireD (~fireD@93-142-227-137.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:23] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[5:26] * Vacum (~vovo@88.130.197.64) has joined #ceph
[5:31] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[5:32] * Vacum_ (~vovo@88.130.205.51) Quit (Read error: Connection reset by peer)
[6:08] * dmsimard (~Adium@69-165-206-93.cable.teksavvy.com) has joined #ceph
[6:08] * dmsimard (~Adium@69-165-206-93.cable.teksavvy.com) Quit ()
[6:15] * Cube (~Cube@66-87-131-175.pools.spcsdns.net) has joined #ceph
[6:23] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[6:31] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[7:10] * codice (~toodles@71-80-186-21.dhcp.lnbh.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:14] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[7:23] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[7:30] * codice_ (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[7:30] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) Quit (Read error: Operation timed out)
[7:31] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[7:40] * codice_ is now known as codice
[7:47] * BillK (~BillK-OFT@106-69-9-248.dyn.iinet.net.au) has joined #ceph
[7:48] * codice_ (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[7:50] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) Quit (Ping timeout: 480 seconds)
[8:00] * cronix (~cronix@5.199.139.166) has joined #ceph
[8:06] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Quit: Leaving.)
[8:09] * cronix (~cronix@5.199.139.166) Quit (Ping timeout: 480 seconds)
[8:14] * codice_ (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) Quit (Read error: Operation timed out)
[8:14] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[8:23] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[8:26] * BillK (~BillK-OFT@106-69-9-248.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[8:26] * sagelap (~sage@cpe-23-242-158-79.socal.res.rr.com) has joined #ceph
[8:31] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[8:33] * sagelap1 (~sage@2600:1012:b02b:bde6:94da:65ac:7e18:cadd) Quit (Ping timeout: 480 seconds)
[8:34] * BillK (~BillK-OFT@106-69-9-248.dyn.iinet.net.au) has joined #ceph
[8:49] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) Quit (Read error: Operation timed out)
[8:55] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[8:58] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) has joined #ceph
[9:02] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) Quit (Remote host closed the connection)
[9:02] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) has joined #ceph
[9:10] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) has joined #ceph
[9:10] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) Quit (Ping timeout: 480 seconds)
[9:14] * jeffhung_ (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (Read error: Connection reset by peer)
[9:14] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[9:16] * rendar (~s@host12-176-dynamic.1-87-r.retail.telecomitalia.it) has joined #ceph
[9:27] * codice_ (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[9:29] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) Quit (Ping timeout: 480 seconds)
[9:42] * jesus (~jesus@emp048-51.eduroam.uu.se) Quit (Read error: Operation timed out)
[9:50] * codice_ (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) Quit (Ping timeout: 480 seconds)
[9:52] * Cube (~Cube@66-87-131-175.pools.spcsdns.net) Quit (Quit: Leaving.)
[9:55] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[9:57] * jesus (~jesus@emp048-51.eduroam.uu.se) has joined #ceph
[10:10] * thorusH (~Thorus@p4FD0225A.dip0.t-ipconnect.de) has joined #ceph
[10:13] <thorusH> One of my OSDs is always dying shortly after startup, with basically 2013-12-28 06:35:08.489431 7fc9eccd5700 -1 osd/ReplicatedPG.cc: In function 'ReplicatedPG::RepGather* ReplicatedPG::trim_object(const hobject_t&)' thread 7fc9eccd5700 time 2013-12-28 06:35:08.487862 osd/ReplicatedPG.cc: 1379: FAILED assert(0). Can someone please help...
[10:51] * allsystemsarego (~allsystem@86.121.85.58) has joined #ceph
[11:19] * codice_ (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[11:20] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) Quit (Ping timeout: 480 seconds)
[11:24] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[11:29] * dlan (~dennis@116.228.88.131) has joined #ceph
[11:32] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[11:43] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) Quit (Remote host closed the connection)
[11:49] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) has joined #ceph
[11:54] * thorusH (~Thorus@p4FD0225A.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[11:54] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[12:02] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[12:20] * pvsa (~pvsa@89.204.138.186) has joined #ceph
[12:33] * pvsa (~pvsa@89.204.138.186) Quit (Remote host closed the connection)
[12:38] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[12:43] * bjornar (~bjornar@ti0099a340-dhcp0395.bb.online.no) has joined #ceph
[12:55] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[13:03] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[13:05] * diegows (~diegows@190.190.17.57) has joined #ceph
[13:06] * gaveen (~gaveen@175.157.10.103) Quit (Remote host closed the connection)
[13:08] * mschiff (~mschiff@port-35811.pppoe.wtnet.de) has joined #ceph
[13:09] * lightspeed (~lightspee@82-68-190-217.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[13:15] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) has joined #ceph
[13:23] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[13:46] * scuttlemonkey (~scuttlemo@96-42-139-47.dhcp.trcy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[13:55] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[14:00] * diegows (~diegows@190.190.17.57) Quit (Ping timeout: 480 seconds)
[14:03] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[14:13] * gucki (~smuxi@p549F843F.dip0.t-ipconnect.de) has joined #ceph
[14:46] * cronix (~cronix@5.199.139.166) has joined #ceph
[14:49] * zidarsk8 (~zidar@84-255-203-33.static.t-2.net) has joined #ceph
[14:50] <gucki> hi guys
[14:50] * zidarsk8 (~zidar@84-255-203-33.static.t-2.net) has left #ceph
[14:50] <gucki> how can i repair an inconsistent pg "reported by deep scrub" in ceph emperor? the command "ceph osd repair" now seems to take an osd-id instead of a pg?
[14:55] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[14:56] * cronix (~cronix@5.199.139.166) Quit (Ping timeout: 480 seconds)
[15:03] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[15:16] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) has joined #ceph
[15:24] * BillK (~BillK-OFT@106-69-9-248.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[15:24] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[15:38] * mmmucky (~mucky@mucky.socket7.org) Quit (Ping timeout: 480 seconds)
[15:42] * mmmucky (~mucky@mucky.socket7.org) has joined #ceph
[15:56] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[16:04] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[16:17] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) has joined #ceph
[16:25] * fouxm (~fouxm@AOrleans-258-1-53-232.w90-24.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[16:50] * nwat (~textual@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[16:56] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[17:04] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[17:28] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[17:29] * sleinen1 (~Adium@2001:620:0:25:41f1:a344:c333:5ee2) has joined #ceph
[17:30] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[17:30] * mnash (~chatzilla@vpn.expressionanalysis.com) Quit (Ping timeout: 480 seconds)
[17:30] * simulx (~simulx@vpn.expressionanalysis.com) Quit (Ping timeout: 480 seconds)
[17:33] * simulx (~simulx@vpn.expressionanalysis.com) has joined #ceph
[17:36] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[17:41] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) has joined #ceph
[17:42] * jesus (~jesus@emp048-51.eduroam.uu.se) Quit (Read error: Operation timed out)
[17:55] * jesus (~jesus@emp048-51.eduroam.uu.se) has joined #ceph
[18:03] * bjornar (~bjornar@ti0099a340-dhcp0395.bb.online.no) Quit (Ping timeout: 480 seconds)
[18:07] * sleinen1 (~Adium@2001:620:0:25:41f1:a344:c333:5ee2) Quit (Quit: Leaving.)
[18:07] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[18:15] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[18:15] * BillK (~BillK-OFT@106-69-9-248.dyn.iinet.net.au) has joined #ceph
[18:35] * zjohnson (~zjohnson@guava.jsy.net) Quit (Remote host closed the connection)
[18:43] * diegows (~diegows@190.190.17.57) has joined #ceph
[18:50] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) Quit (Remote host closed the connection)
[18:50] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) has joined #ceph
[18:54] * lightspeed (~lightspee@2001:8b0:16e:1:216:eaff:fe59:4a3c) has joined #ceph
[18:58] * zjohnson (~zjohnson@guava.jsy.net) has joined #ceph
[18:58] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) Quit (Ping timeout: 480 seconds)
[18:58] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[19:07] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[19:08] * mozg (~andrei@46.229.149.194) has joined #ceph
[19:09] * Fomina (~Fomina@192.162.100.197) has joined #ceph
[19:15] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[19:19] <Psi-Jack> Is it "OK" or not recommended to have ceph emperor servers with a few (not all) ceph dumpling clients, namely clients like Proxmox VE which still uses the ceph dumpling clients to provision space, allocate snapshots, etc?
[19:20] <sage> Psi-Jack: shouldn't make any difference.
[19:21] <Psi-Jack> Shouldn't? So it would be fine? Protocol compatible and all?
[19:22] <Psi-Jack> Heh, if that's the case, that'll be part of my new years maintenance, to upgrade my servers to emperor. :)
[19:22] <sage> yeah
[19:23] <Psi-Jack> Cool. Good to know. I've been holding off on it. heh
[19:23] <sage> fwiw emperor doesn't add much in the way of new functionality (it's more of a way-point toward firefly), and probably won't get long-term support as long as dumpling
[19:23] <Psi-Jack> Ahhhh.
[19:23] <sage> but if you move on to firefly after, should be fine.
[19:24] <Psi-Jack> hmmm, will there be an upgrade path from dumpling to firefly?
[19:25] * bjornar (~bjornar@ti0099a340-dhcp0395.bb.online.no) has joined #ceph
[19:27] <sage> absolutely
[19:27] <Psi-Jack> Okay then. :)
[19:27] <mozg> sage, hi. I was hoping you could help me with figuring out what is causing my vms to stall for several minutes at a time. I am running dumpling 0.67.4 with qemu 1.5.0 on ubuntu 12.04 servers.
[19:27] <mozg> i can't figure out what the problem is
[19:27] <mozg> i do not see any errors in the logs of ceph and qemu
[19:28] <mozg> but my vms stop responding for several minutes at a time
[19:28] <mozg> and after that resume as normal
[19:28] <Psi-Jack> Thanks, sage. Pretty much nailed what I was looking for. I usually keep with the LTS versions because I use ceph in my home production cluster farm. :)
[19:28] <mozg> this happens randomly about 5-10 times a day
[19:28] <sage> hmm. if you enable the admin socket for librbd ('admin socket = /var/run/ceph/$name.$pid.asok') you can query the socket while the vm is stalled to see if it is blocked on io
[19:29] <sage> ceph daemon <path> objecter_requests
[19:29] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[19:30] * cronix (~cronix@5.199.139.166) has joined #ceph
[19:30] <mozg> sage, do i enable it in the ceph.conf?
[19:30] * mattbenjamin (~matt@76-206-42-105.lightspeed.livnmi.sbcglobal.net) Quit (Quit: Leaving.)
[19:30] <sage> yeah, in [client] section
[19:30] <sage> after the vm restarts you should see it in /var/run/ceph
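As a ceph.conf fragment, the setting sage describes would look something like this (a sketch of what was pasted above, not an authoritative example):

```ini
[client]
    ; $name and $pid are expanded by ceph at runtime;
    ; the /var/run/ceph directory must already exist and be
    ; writable by the qemu process
    admin socket = /var/run/ceph/$name.$pid.asok
```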
[19:30] <Psi-Jack> heh, one of these days I should try to make use of radosgw for S3, but.. the main purpose I use S3 for currently is to offload stuff to geoip storage off my home bandwidth. heh
[19:31] <mozg> what sort of thing should I look for?
[19:36] * cronix (~cronix@5.199.139.166) Quit (Read error: Operation timed out)
[19:43] <mozg> sage, should I change the $name and $pid values, or will ceph automatically recognise and substitute the values?
[19:43] <mozg> because i've added that to the [client] section and restarted a couple of vms, but I do not see the /var/run/ceph folder
[19:43] <sage> ceph will fill those in
[19:44] <mozg> i guess i should create the ceph folder
[19:44] * Cube (~Cube@66-87-131-175.pools.spcsdns.net) has joined #ceph
[19:44] <sage> bah, i was afraid of that.. i think qemu doesn't by default tell librbd to parse the config file. you may need to shove it in the qemu rbd:pool/image:... line somewhere
[19:44] <sage> oh, yeah, the folder needs to exist
[19:44] <sage> try that first
[19:44] <mozg> will do now
[19:45] <rendar> so, ceph fs runs in user-mode, right?
[19:46] <mozg> sage, thanks that has created the file
[19:50] <mozg> sage, I've got the following output when running the command: http://ur1.ca/g9x6q
[19:50] <mozg> is this what is supposed to happen?
[19:50] <mozg> what should I be looking for during the stalls?
[20:03] <liiwi> >/win 12
[20:03] <liiwi> blerp
[20:03] * mattbenjamin (~matt@aa2.linuxbox.com) has joined #ceph
[20:04] <sage> mozg: that is normal when there is no io in progress. when it hangs, look for stuff in the ops section that is old (there is a field with the request age)
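Checking for old ops can be scripted against the admin socket's JSON. A rough Python sketch, assuming the `ceph daemon <asok> objecter_requests` dump has an "ops" list with a per-op age field — the field names here are assumptions from the conversation, not a documented schema:

```python
import json

def find_stale_ops(dump_json, max_age_sec=30.0):
    """Return the ops from an objecter_requests-style JSON dump whose
    (assumed) 'age' field exceeds max_age_sec."""
    dump = json.loads(dump_json)
    return [op for op in dump.get("ops", [])
            if op.get("age", 0) > max_age_sec]
```

During a stall, any requests this surfaces point at io blocked in rados rather than a problem inside the guest.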
[20:08] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[20:12] * ScOut3R (~ScOut3R@c83-253-234-122.bredband.comhem.se) has joined #ceph
[20:16] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[20:24] <mozg> sage, okay, so if i see the old information, what would it tell me? what can i deduce from it? Sorry for stupid questions, but i've not used this before
[20:30] <mozg> sage, i've not mentioned this; perhaps it's relevant, but when the stall happens, all my vms across all of my hosts hang at the same time and resume at the same time as well
[20:31] <mozg> it takes between 2-10 minutes of hang period
[20:32] <mozg> and this is happening about 5-10 times a day
[20:32] <mozg> i can't find any correlation at all
[20:41] * Steki (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[20:46] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Ping timeout: 480 seconds)
[21:08] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[21:15] <gucki> sage: yt?
[21:16] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[21:26] <gucki> sage: "ceph pg scrub 6.29f" doesn't seem to do anything. neither does "ceph pg repair 6.29f"
[21:29] <aarontc> gucki: how did you determine those commands aren't doing anything?
[21:30] <gucki> aarontc: there's nothing at all in the logs specific to these commands (no scrubbing XXX or repairing XXX messages). and the pg is still inconsistent as it was before...
[21:30] <aarontc> it's kind of funny when ceph is confused about stats.. "client io 6166 MB/s rd, 1960 MB/s wr, 7093 op/s"
[21:30] <aarontc> gucki: I've "solved" that problem in the past by restarting the affected OSDs
[21:31] <aarontc> they seem to realize there is a problem and get it fixed after a fresh start
[21:31] <gucki> aarontc: I restarted osd 8 twice, no change. but i can try to restart osd 10 to see if it helps..
[21:31] <aarontc> gucki: yes, all OSDs holding that placement group need to be restarted for my "solution" to work :)
[21:32] <gucki> aarontc: i'll try now. did you file a bug report for it?
[21:33] <aarontc> gucki: no, I haven't been able to collect enough data to make a useful bug report. I can't find any way to reproduce the problem, either
[21:34] <aarontc> oh, are you Corin, gucki?
[21:34] <gucki> aarontc: yeah, why do you wonder? ;-)
[21:35] <aarontc> Just looking at the mailing list archives ;)
[21:35] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[21:35] * ChanServ sets mode +o scuttlemonkey
[21:35] <gucki> aarontc: :-)
[21:37] <gucki> aarontc: omg, restarting osd 10 made my whole cluster hang.. :(
[21:37] <aarontc> gucki: can you pastebin "ceph health detail"?
[21:39] <gucki> aarontc: ok, cluster works again...was some strange lag.
[21:39] <gucki> aarontc: pg is still incosistent
[21:40] <aarontc> gucki: now you can "ceph pg repair 6.29f"
[21:40] <aarontc> after which, "ceph health detail" should show that pg is repairing
[21:40] <gucki> aarontc: mh... i still have lots of slow requests, 417 requests hung/stuck
[21:41] <aarontc> that is likely to continue until the repair finishes
[21:41] <gucki> aarontc: there's definitely something really strange / bad going on..
[21:43] <gucki> aarontc: repair is still ignored... and slow requests are still coming in, going... coming again..
[21:43] <gucki> aarontc: guess i need to restart osd 8 again
[21:47] <gucki> aarontc: mh, one osd was hanging, consuming 100% cpu... i restarted that one now..
[21:47] <aarontc> gucki: if it comes down to it, you can try restarting all the ceph daemons cluster-wide
[21:48] <gucki> aarontc: mh, i don't really want to just play around...it's a production system which needs to be running.. :)
[21:50] * codice_ (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) Quit (Read error: Operation timed out)
[21:50] * tomaw_ (tom@basil.tomaw.net) Quit (Quit: Quitting)
[21:53] <gucki> aarontc: mh, the cluster is coming back but very very slowly.... :(
[21:55] <aarontc> gucki: you can change priorities on the recovery process in the ceph.conf
[21:56] <gucki> aarontc: no, there has to be something wrong with ceph, because the restarted osd is again taking 100% cpu and the recovery is hanging because of this.. :(
[21:58] <gucki> aarontc: even after doing kill pid it takes 100% cpu and doesn't stop. only a second kill kills it...
[21:59] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:04] <aarontc> gucki: well if you're super worried about uptime maybe best to talk to an expert and not take my suggestions :)
[22:06] * codice (~toodles@75-140-64-194.dhcp.lnbh.ca.charter.com) has joined #ceph
[22:08] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[22:11] <gucki> aarontc: wooow :) after restarting that osd around 5 times it finally managed to recover and now the cluster is healthy again. quite strange, but at least i'll have a good night (i hope *g*)
[22:11] <aarontc> good luck :)
[22:15] <gucki> aarontc: thanks :)
[22:16] <gucki> aarontc: for your help too :)
[22:16] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[22:28] * allsystemsarego (~allsystem@86.121.85.58) Quit (Quit: Leaving)
[22:47] * gucki (~smuxi@p549F843F.dip0.t-ipconnect.de) Quit (Remote host closed the connection)
[22:54] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:59] * mozg (~andrei@46.229.149.194) Quit (Ping timeout: 480 seconds)
[23:09] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[23:17] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[23:21] * mattbenjamin (~matt@aa2.linuxbox.com) Quit (Quit: Leaving.)
[23:24] * mschiff_ (~mschiff@port-35811.pppoe.wtnet.de) has joined #ceph
[23:24] * mschiff (~mschiff@port-35811.pppoe.wtnet.de) Quit (Read error: Connection reset by peer)
[23:37] * rendar (~s@host12-176-dynamic.1-87-r.retail.telecomitalia.it) Quit ()
[23:54] <bjornar> I don't find any reference to filestore [min|max] sync interval in the docs... it is mentioned, but that's it..

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.