#ceph IRC Log


IRC Log for 2013-05-18

Timestamps are in GMT/BST.

[0:01] * davidzlap (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[0:02] * Meths (rift@2.25.191.72) Quit (Read error: Connection reset by peer)
[0:02] * Meths (rift@2.25.191.72) has joined #ceph
[0:03] * loicd (~loic@magenta.dachary.org) Quit (Ping timeout: 480 seconds)
[0:06] * rustam (~rustam@90.216.255.245) has joined #ceph
[0:06] * Meths (rift@2.25.191.72) Quit (Read error: Connection reset by peer)
[0:07] * Meths (rift@2.25.191.72) has joined #ceph
[0:07] * Meths (rift@2.25.191.72) Quit (Read error: Connection reset by peer)
[0:08] * rustam (~rustam@90.216.255.245) Quit (Remote host closed the connection)
[0:09] * vata (~vata@2607:fad8:4:6:f9b0:68a5:e595:8675) Quit (Quit: Leaving.)
[0:12] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[0:13] * Tamil (~tamil@38.122.20.226) Quit (Quit: Leaving.)
[0:16] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Read error: Operation timed out)
[0:16] * portante is now known as portante|afk
[0:17] * Meths (rift@2.25.189.26) has joined #ceph
[0:18] <cjh_> what's the min journal size you can create?
[0:18] <cjh_> i tried 100 and it fails in mkcephfs
[0:18] * rturk-away is now known as rturk
[0:18] <elder> Let me see if I can find out.
[0:19] <cjh_> elder: 512 seems to work
[0:19] <elder> Is that good enough?
[0:19] <cjh_> elder: just experimenting with what happens if i do a tiny journal
[0:19] <cjh_> so i can smooth out the humps in the write throughput
[0:20] <elder> OK. If I find out the minimum I'll report back.
[0:21] <cjh_> elder: thx :)
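
For reference, a minimal sketch of the journal-size setting discussed above, in mkcephfs-era ceph.conf form; the value is in MB, and 512 is simply what cjh_ reports as working:

    [osd]
        # journal size in MB; 512 worked for cjh_ above, 100 did not
        osd journal size = 512
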
[0:24] * sjustlaptop (~sam@38.122.20.226) Quit (Ping timeout: 480 seconds)
[0:27] <nhm> cjh_: instead of making a super tiny journal, you may want to play around with the journal settings here: http://ceph.com/docs/next/rados/configuration/journal-ref/
[0:28] * sh_t (~sht@lu.privatevpn.com) Quit (Read error: No route to host)
[0:28] <nhm> you can control the max number of bytes and ops it will allow in the journal at any one time. Likely you couldn't set it lower than 100 due to the max bytes setting.
[0:29] * sh_t (~sht@lu.privatevpn.com) has joined #ceph
[0:31] <nhm> also, see: http://ceph.com/docs/next/rados/configuration/filestore-config-ref/#synchronization-intervals
[0:33] <cjh_> nhm: thanks i'll check that out!
[0:33] <cjh_> i'm closing in on the optimal config i think
[0:35] <cjh_> so the max bytes are the max allowed in the journal before it is flushed?
[0:36] <cjh_> can i play with these journal settings and just restart or do i have to rebuild the cluster?
[0:36] <nhm> Yeah, so you can control all of: bytes, ops, min sync interval, max sync interval
[0:37] <nhm> I don't think you should have to rebuild at all.
[0:37] * glowell (~glowell@38.122.20.226) Quit (Quit: Leaving.)
[0:37] <cjh_> nhm: ok cool i'll modify, reboot and test a few times
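
A sketch of the journal and filestore knobs nhm points to (option names are from the linked docs; the values here are illustrative placeholders, not recommendations):

    [osd]
        # cap how much data / how many ops may sit in the journal at once
        journal max write bytes = 10485760
        journal max write entries = 100
        journal queue max bytes = 33554432
        journal queue max ops = 500
        # how often the filestore flushes journaled writes to the backing FS (seconds)
        filestore min sync interval = 0.01
        filestore max sync interval = 5
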
[0:37] <cjh_> nhm: have you tried s3backer yet? it looks interesting
[0:37] <cjh_> it looks like a fuse mount of the rados gw s3 store
[0:38] <nhm> cjh_: that sounds a bit like the old s3 fuse thing.
[0:39] <nhm> cjh_: at a very high level that's sort of how RBD works in ceph with RADOS too.
[0:39] <cjh_> really?
[0:39] <cjh_> but rbd is in the kernel which is nice
[0:40] <dmick> rbd is both in and out of the kernel
[0:40] <cjh_> yeah i saw the fuse mount for that now
[0:40] <cjh_> which is awesome
[0:40] <dmick> and iscsi
[0:40] <cjh_> right
[0:40] <cjh_> tgt
[0:40] <dmick> and qemu direct support
[0:40] <cjh_> :)
[0:40] <dmick> which is by far the most-used
[0:40] <cjh_> do you know if there's plans to add user support to ceph-deploy or does it just use the user you're currently logged in as?
[0:41] <cjh_> so i can say use user x
[0:41] <cjh_> if not i might try to hack it in
[0:41] <dmick> no plans I've heard of
[0:41] <cjh_> ok
[0:41] <cjh_> it won't work for me because i don't have sudo
[0:41] <cjh_> nor can i get it
[0:41] <dmick> you can do that with .ssh/config, too
[0:41] <cjh_> that's true
[0:41] <cjh_> i'll try that first
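
A sketch of the .ssh/config approach dmick mentions; the host pattern and user name are hypothetical:

    # ~/.ssh/config on the machine running ceph-deploy
    Host ceph-node*
        # connect as this user instead of the local login
        User cephadmin
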
[0:43] * markbby (~Adium@168.94.245.1) Quit (Quit: Leaving.)
[0:46] * Tamil (~tamil@38.122.20.226) has joined #ceph
[0:50] * tnt (~tnt@91.177.224.32) Quit (Read error: Operation timed out)
[0:51] <nwl> cjh_: it's on the todo list (in the tracker)
[0:51] <nwl> cjh_: patches welcome to the python script :)
[0:52] * coyo|2 (~unf@71.21.193.106) has joined #ceph
[0:53] <cjh_> nwl: cool :)
[0:53] <cjh_> nwl: i like hacking on python so maybe i'll take a look at it this weekend
[0:54] <nwl> cjh_: http://tracker.ceph.com/issues/3347
[0:55] <cjh_> cool thanks :)
[0:55] <cjh_> if you create a pool in ceph after bringing up a cluster but before it settles will it stay in a stuck state?
[0:56] <dmick> wow. so it is.
[0:56] * glowell (~glowell@2607:f298:a:607:d9d4:fffb:b830:4db8) has joined #ceph
[0:56] <dmick> and, cjh_: it would surprise me, especially if there are no objects in the pool
[0:56] <cjh_> yeah me too
[0:57] <cjh_> it's a 61.2 cluster
[0:57] <cjh_> the mon log says: pgmap v226: 67392 pgs: 40554 creating, 26614 active+clean, 224 peering;
[0:57] * rustam (~rustam@90.216.255.245) has joined #ceph
[0:57] <cjh_> 40K are stuck unclean
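
A few commands that may help inspect a state like this (assuming this era's CLI):

    ceph -s                      # overall status and pgmap summary
    ceph health detail           # lists the stuck/degraded PGs
    ceph pg dump_stuck unclean   # just the PGs stuck unclean
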
[0:58] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[0:59] * PerlStalker (~PerlStalk@72.166.192.70) Quit (Quit: ...)
[0:59] * coyo (~unf@00017955.user.oftc.net) Quit (Ping timeout: 480 seconds)
[0:59] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[1:00] <cjh_> there we go. it just started syncing up
[1:00] <cjh_> it's knocking them down now
[1:02] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[1:03] * sjustlaptop (~sam@2607:f298:a:697:c8e1:c368:a42c:1d) has joined #ceph
[1:05] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit ()
[1:08] <cjh_> dmick: this is staying stuck unclean for a lot longer than i usually see
[1:08] <cjh_> the only thing i did was reduce the journal size to 512
[1:15] * rustam (~rustam@90.216.255.245) Quit (Remote host closed the connection)
[1:17] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[1:17] * sjustlaptop (~sam@2607:f298:a:697:c8e1:c368:a42c:1d) Quit (Ping timeout: 480 seconds)
[1:20] * sagelap (~sage@2600:1010:b000:ae47:9d3b:a069:66e9:a44d) has joined #ceph
[1:21] <cjh_> i think there's a bug in 61.2
[1:21] <cjh_> when i restart my cluster i see a ton of messages fly through the monitor log and then it crashes
[1:26] <dmick> cjh_: dunno, from that; it's possible
[1:27] <cjh_> lemme post the log
[1:30] <cjh_> http://fpaste.org/12878/68833409/
[1:30] * rustam (~rustam@90.216.255.245) has joined #ceph
[1:30] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[1:31] <cjh_> i see a bunch of failed lossy con dropping message logs
[1:31] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[1:31] <dmick> sjust: ^ that might interest you?
[1:31] * rustam (~rustam@90.216.255.245) Quit (Remote host closed the connection)
[1:31] <cjh_> the conn shouldn't be lossy. it's 10Gb :)
[1:32] * sagelap (~sage@2600:1010:b000:ae47:9d3b:a069:66e9:a44d) Quit (Quit: Leaving.)
[1:33] <cjh_> i think my in-rack testing with rados bench isn't representative of what clients would actually see. i'm competing for bandwidth with replication because i only have 1 link
[1:43] * LeaChim (~LeaChim@176.250.188.136) Quit (Ping timeout: 480 seconds)
[1:43] * rturk is now known as rturk-away
[1:44] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:44] * diegows (~diegows@200-081-038-002.wireless.movistar.net.ar) has joined #ceph
[1:48] * alram (~alram@38.122.20.226) Quit (Quit: leaving)
[1:55] * coyo|2 (~unf@71.21.193.106) Quit (Quit: F*ck you, I'm a daemon.)
[1:56] * sagelap (~sage@156.39.10.21) has joined #ceph
[1:57] * diegows (~diegows@200-081-038-002.wireless.movistar.net.ar) Quit (Ping timeout: 480 seconds)
[1:58] * sjustlaptop (~sam@38.122.20.226) has joined #ceph
[1:59] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[2:12] * themgt (~themgt@24-177-232-33.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[2:14] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:22] * sagelap (~sage@156.39.10.21) Quit (Ping timeout: 480 seconds)
[2:25] * The_Bishop_ (~bishop@e179011252.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[2:29] * sjustlaptop (~sam@38.122.20.226) Quit (Ping timeout: 480 seconds)
[2:30] * The_Bishop (~bishop@e179011252.adsl.alicedsl.de) has joined #ceph
[2:41] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[2:49] * newbie (~kvirc@74-61-8-52.war.clearwire-wmx.net) has joined #ceph
[2:49] * newbie is now known as saras
[2:50] <saras> any smart people still here?
[2:50] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) has joined #ceph
[2:51] <saras> aka someone that is smart about how ceph uses libatomic-ops
[2:51] <saras> or at least where
[2:53] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[2:55] <saras> anyway when compiling libatomic-ops on the Pi i ran the check and got this output
[2:55] <saras> https://github.com/sarasfox/ceph/blob/master/output%20of%20libatomic-ops%20on%20PI
[2:57] <saras> as all the tests pass I think this is something I can work around, but i would love some idea of where to start in the ceph code
[3:02] * sagelap (~sage@184.169.31.254) has joined #ceph
[3:03] * rustam (~rustam@90.216.255.245) has joined #ceph
[3:05] * rustam (~rustam@90.216.255.245) Quit (Remote host closed the connection)
[3:05] <dmick> that's a problem with libatomic, not ceph
[3:06] <saras> where does ceph use libatomic
[3:07] <dmick> you've been down this road already
[3:08] <dmick> include/atomic.h
[3:09] * glowell (~glowell@2607:f298:a:607:d9d4:fffb:b830:4db8) Quit (Quit: Leaving.)
[3:09] <saras> sweet now only need to look at 17 files
[3:09] <saras> dmick: thanks
[3:10] <dmick> sure
[3:11] <saras> dmick: now i have a sane place to start making sense of this mess
[3:18] * sagelap (~sage@184.169.31.254) Quit (Read error: Connection reset by peer)
[3:24] * Tamil (~tamil@38.122.20.226) Quit (Quit: Leaving.)
[3:26] <saras> dmick: what package is dot looking for
[3:36] <dmick> dunno; you mean what package is it in?
[3:37] * sagelap (~sage@2600:1012:b01f:5cf0:9d3b:a069:66e9:a44d) has joined #ceph
[3:37] <saras> it is one of the deps in the readme but there is no package called just dot in ubuntu or debian
[3:38] <dmick> graphviz: /usr/bin/dot
[3:39] <dmick> maybe it used to be its own package at one point; dunno
[3:39] <dmick> could just be a mistake
[3:39] <saras> dmick: i think it is just mistake
[3:40] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[3:41] <saras> as graphviz pulls all of this stuff in itself
[3:42] <dmick> don't know what the state of the packages was at the time this was written
[3:42] <dmick> wrong for quantal at least
[3:42] <saras> it is not in any version of debian either
[3:42] <dmick> ok
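
In other words, on Ubuntu/Debian the "dot" dependency in the readme maps to the graphviz package mentioned above:

    sudo apt-get install graphviz   # provides /usr/bin/dot
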
[3:43] <dmick> have to run, ttyl
[3:43] * dmick (~dmick@2607:f298:a:607:9067:2df2:f863:6490) Quit (Quit: Leaving.)
[3:46] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Quit: Leaving.)
[3:50] * diegows (~diegows@190.190.2.126) has joined #ceph
[4:00] * diegows (~diegows@190.190.2.126) Quit (Read error: Operation timed out)
[4:15] * sagelap (~sage@2600:1012:b01f:5cf0:9d3b:a069:66e9:a44d) Quit (Ping timeout: 480 seconds)
[4:28] * sagelap (~sage@2600:1012:b01f:5cf0:9d3b:a069:66e9:a44d) has joined #ceph
[4:33] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[4:36] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 20.0.1/20130409194949])
[4:50] * rovar (~oftc-webi@pool-96-246-17-104.nycmny.fios.verizon.net) has joined #ceph
[4:50] <rovar> hrrm
[5:05] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has joined #ceph
[5:05] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has left #ceph
[5:06] * sagelap (~sage@2600:1012:b01f:5cf0:9d3b:a069:66e9:a44d) Quit (Ping timeout: 480 seconds)
[5:08] * [fred] (fred@konfuzi.us) Quit (Ping timeout: 480 seconds)
[5:26] * loicd (~loic@magenta.dachary.org) has joined #ceph
[5:42] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[5:42] * noahmehl_ (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[5:44] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:44] * noahmehl_ is now known as noahmehl
[5:48] * rovar (~oftc-webi@pool-96-246-17-104.nycmny.fios.verizon.net) Quit (Quit: Page closed)
[6:07] * john_barbee (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[6:10] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[6:14] * rustam (~rustam@90.216.255.245) has joined #ceph
[6:15] * rustam (~rustam@90.216.255.245) Quit (Remote host closed the connection)
[6:28] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[6:36] * lightspeed (~lightspee@81.187.0.153) Quit (Ping timeout: 480 seconds)
[6:45] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Quit: noahmehl)
[7:18] * ScOut3R (~ScOut3R@dsl51B614D7.pool.t-online.hu) has joined #ceph
[7:23] * ScOut3R (~ScOut3R@dsl51B614D7.pool.t-online.hu) Quit (Remote host closed the connection)
[7:28] * athrift (~nz_monkey@222.47.255.123.static.snap.net.nz) has joined #ceph
[8:24] * eschnou (~eschnou@249.73-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:26] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[8:31] <Kioob> Once again : cluster full
[8:31] <Kioob> and of course, "ceph mon tell '*' injectargs '--mon-osd-full-ratio 0.98'" doesn't work
[8:31] <Kioob> neither "ceph osd tell '*' injectargs '--mon-osd-full-ratio 0.98'"
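
One approach that may apply here, assuming the pre-Luminous command set where the full ratio is tracked in the PG map rather than per-daemon config (which may explain why injectargs alone has no effect):

    # raise the full/nearfull ratios cluster-wide, as a temporary measure while freeing space
    ceph pg set_full_ratio 0.98
    ceph pg set_nearfull_ratio 0.95
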
[8:35] * eschnou (~eschnou@249.73-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:19] * alrs (~lars@cpe-142-129-65-37.socal.res.rr.com) Quit (Read error: Operation timed out)
[9:21] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[9:26] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[9:36] * esammy (~esamuels@host-2-102-68-228.as13285.net) has joined #ceph
[9:50] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) has joined #ceph
[10:05] * saras (~kvirc@74-61-8-52.war.clearwire-wmx.net) Quit (Ping timeout: 480 seconds)
[10:12] * esammy (~esamuels@host-2-102-68-228.as13285.net) has left #ceph
[10:24] * The_Bishop_ (~bishop@f052097055.adsl.alicedsl.de) has joined #ceph
[10:27] * The_Bishop (~bishop@e179011252.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[10:28] * vipr (~vipr@78-23-113-37.access.telenet.be) Quit (Ping timeout: 480 seconds)
[10:30] * dcasier (~dcasier@223.103.120.78.rev.sfr.net) has joined #ceph
[10:31] * alrs (~lars@71-80-161-15.static.lsan.ca.charter.com) has joined #ceph
[10:45] * LeaChim (~LeaChim@176.250.188.136) has joined #ceph
[10:57] * loicd (~loic@host-78-149-198-157.as13285.net) has joined #ceph
[11:09] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[11:14] * tnt (~tnt@91.177.224.32) has joined #ceph
[11:35] * alrs (~lars@71-80-161-15.static.lsan.ca.charter.com) Quit (Ping timeout: 480 seconds)
[11:38] * rustam (~rustam@90.216.255.245) has joined #ceph
[12:05] * rustam (~rustam@90.216.255.245) Quit (Remote host closed the connection)
[12:14] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[12:27] * rustam (~rustam@90.216.255.245) has joined #ceph
[12:42] * diegows (~diegows@190.190.2.126) has joined #ceph
[12:52] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[13:12] * portante|ltp (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[13:21] * eschnou (~eschnou@249.73-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[13:34] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[13:36] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Read error: Operation timed out)
[13:38] * eschnou (~eschnou@249.73-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[13:45] * bergerx_ (~bekir@78.188.204.182) has joined #ceph
[13:50] * humbolt (~elias@91-113-46-139.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[14:01] * humbolt (~elias@213-33-5-146.adsl.highway.telekom.at) has joined #ceph
[14:01] * john_barbee (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:07] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:25] * rustam (~rustam@90.216.255.245) Quit (Remote host closed the connection)
[14:29] * amb (~amb@82-69-2-201.dsl.in-addr.zen.co.uk) Quit (Quit: Leaving)
[14:54] * alrs (~lars@cpe-142-129-65-37.socal.res.rr.com) has joined #ceph
[14:58] * markbby (~Adium@168.94.245.1) has joined #ceph
[15:02] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[15:09] * diegows (~diegows@190.190.2.126) Quit (Ping timeout: 480 seconds)
[15:12] * spekzor (~rens@90-145-135-59.bbserv.nl) Quit (Ping timeout: 480 seconds)
[15:20] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[15:30] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:34] * eschnou (~eschnou@249.73-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[15:47] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[15:53] * madkiss (~madkiss@2001:6f8:12c3:f00f:f4d2:eae4:fd37:f894) Quit (Quit: Leaving.)
[15:54] * bergerx_ (~bekir@78.188.204.182) Quit (Remote host closed the connection)
[15:57] * madkiss (~madkiss@2001:6f8:12c3:f00f:79b6:c614:f44d:c3a6) has joined #ceph
[16:02] * madkiss (~madkiss@2001:6f8:12c3:f00f:79b6:c614:f44d:c3a6) Quit ()
[16:04] * madkiss (~madkiss@2001:6f8:12c3:f00f:4df4:543e:9789:981e) has joined #ceph
[16:06] * sileht (~sileht@gizmo.sileht.net) Quit (Quit: WeeChat 0.4.0)
[16:11] * kyle_ (~kyle@216.183.64.10) has joined #ceph
[16:12] * ghartz (~ghartz@33.ip-5-135-148.eu) has joined #ceph
[16:13] * Steki (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[16:13] * kmekil (~kyle@216.183.64.10) Quit (Read error: Connection reset by peer)
[16:13] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Remote host closed the connection)
[16:17] * eschnou (~eschnou@249.73-201-80.adsl-dyn.isp.belgacom.be) Quit (Read error: Operation timed out)
[16:47] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[16:48] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[17:12] * coyo (~unf@71.21.193.106) has joined #ceph
[17:13] * ghartz (~ghartz@33.ip-5-135-148.eu) Quit (Remote host closed the connection)
[17:21] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[17:27] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[17:29] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) has joined #ceph
[17:35] * The_Bishop_ (~bishop@f052097055.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[17:38] * The_Bishop (~bishop@f052102039.adsl.alicedsl.de) has joined #ceph
[17:42] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[17:56] * amb (~amb@82-69-2-201.dsl.in-addr.zen.co.uk) has joined #ceph
[17:59] <amb> I'm trying to simulate an automated ceph deployment, first by adding 2 x MON (done), then by adding 2 x OSDs using the manual add-OSD procedure. The OSDs go up & in, but of the 192 original 'creating' pgs, I am left with "HEALTH_WARN 14 pgs degraded; 23 pgs stuck unclean". I'm using default setup everywhere and 'ceph osd dump' shows all pools have size 2. Any ideas?
[18:04] * sileht (~sileht@gizmo.sileht.net) has joined #ceph
[18:06] <mrjack> amb: 2 mon won't work
[18:07] <amb> sorry, 3 x mon
[18:07] <mrjack> amb: you need an odd number of mons to get quorum..
[18:07] <mrjack> ah ok
[18:07] <amb> my mistake
[18:07] <mrjack> ceph health details?
[18:07] <amb> http://pastebin.com/56iVwPLn
[18:08] <amb> What I haven't got is every OSD in every ceph.conf, on the basis it should be retrieving these from the cluster map. Is that wrong?
[18:09] <amb> IE my ceph.conf on the OSDs lists the MONs + that OSD only (to get the daemon to start)
[18:09] <amb> I'm trying to simulate a large cluster where copying around ceph.conf over ssh is impractical.
[18:16] <tnt> having the mon and local osd in ceph.conf is sufficient, it's what I use.
[18:17] <amb> tnt, that's what I thought.
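
A minimal sketch of the kind of per-node ceph.conf amb and tnt describe, in the mkcephfs-era INI layout; hostnames and addresses are placeholders:

    [global]
        # shared settings (auth, networks, ...) would go here
    [mon.a]
        host = mon-a
        mon addr = 192.0.2.1:6789
    [mon.b]
        host = mon-b
        mon addr = 192.0.2.2:6789
    [mon.c]
        host = mon-c
        mon addr = 192.0.2.3:6789
    [osd.0]
        # only this node's own OSD is listed; the rest comes from the cluster maps
        host = osd-node-0
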
[18:17] <amb> crushmap is here: http://pastebin.com/T0CitAW6
[18:17] <tnt> it's a test cluster right ?
[18:17] <amb> tnt, yep
[18:18] <amb> It's had nothing written to it
[18:18] <tnt> try "ceph osd crush tunables optimal"
[18:18] <amb> tnt, that fixed it. What did that do?
[18:19] <tnt> basically the crush algorithm had some issues with very small (i.e. 2 OSD) cluster sizes and sometimes failed to generate a valid placement for all the PGs.
[18:20] <tnt> so in later versions, it was "tuned" by changing some values of the algo to allow for better distribution and better handling of corner cases.
[18:20] <tnt> but by default the old values are used because when you use those "tunables" you need clients that support them ... (i.e. recent enough).
[18:20] <tnt> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
[18:20] <amb> Is there a disadvantage to running that? What I'm trying to do is to get a base automated deployment.
[18:20] <amb> Oh I can guarantee all my clients are cuttlefish (or I suppose whatever comes next)
[18:21] <tnt> well, also the kernel client ? (if you use the RBD kernel client).
[18:21] <amb> No kernel clients here.
[18:21] <tnt> then there is no problem.
[18:22] <amb> tnt, thanks. Is it just 'random bad luck' that the default example config (which has 2 OSDs in it) can do this?
[18:27] <tnt> pretty much.
[18:27] <amb> thanks :-)
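
A sketch of the tunables step tnt suggests, plus one way to inspect the resulting CRUSH map; file names are placeholders:

    # only safe when all clients are recent enough, per the docs link above
    ceph osd crush tunables optimal
    # dump and decompile the crush map to review the tunables and rules
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
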
[18:31] <joao> <mrjack> amb: you need an odd number of mons to get quorum.. <- this is not true
[18:31] <joao> a quorum is achieved by a majority
[18:32] <tnt> s/need/want/
[18:32] <joao> two monitors can achieve a majority just as well as 3
[18:33] <amb> 2 x MON seems a bit pointless, because if one is down you can't use the other one.
[18:33] <joao> having two monitors just happens to be as good as one though
[18:33] <amb> 1 x MON would be better here.
[18:33] <tnt> joao: it's worse than 1 ... if either goes down, the cluster is down.
[18:34] <joao> yes, but my point is that there's this misconception that an even-numbered monitor cluster doesn't work
[18:34] <joao> I've seen it every now and then and just wanted to make that clear :)
[18:35] <joao> tnt, technically it's just as good as 1, given that if your one monitor goes down on a single-monitor cluster you're in the woods
[18:35] <joao> :p
[18:35] <amb> I suspect tnt meant that the probability of either of 2 machines going down is greater than the probability of one machine going down.
[18:35] <amb> i.e. adding a second MON makes your cluster more likely to fail.
[18:36] <tnt> yes. Since your cluster depends on two machines being up, the combined MTBF of two machines will be lower than that of a single machine.
[18:38] <joao> yeah, that's right; I'm clearly a bit too thick-headed this afternoon :)
[18:49] * Steki (~steki@fo-d-130.180.254.37.targo.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[18:50] <mrjack> joao: yes, you are right,.. but having two monitors results in no majority if one goes down, right?
[18:50] <joao> yes
[18:50] <joao> see what tnt pointed out
[18:50] <mrjack> so it's a bad idea to have only 2...
[18:50] <mrjack> yeah
[18:56] <amb> Can I move an OSD from one host to another? Imagine a cluster with 10 hosts each with 8 drives, and one host blows up - can I distribute the 8 drives amongst the other hosts?
[19:04] <tnt> amb: you don't "move OSD". Once ceph detects the failure, it will redistribute data by itself.
[19:06] <amb> tnt, yeah I know. I'm considering what happens if I try to make it quicker by moving the hard disks. And I suppose I am trying to work out what happens if people move hard disks around in my scheme to automagically mount them.
[19:07] <tnt> yeah, you can move the hard drive, but you'll have to modify the ceph.conf file and the crushmap (although I think in cuttlefish the startup script does that for you).
[19:08] <amb> tnt, thought you should be able to. What actually happens is SIGABORT
[19:08] <tnt> But ceph will most likely move data around anyway because some data that was previously on 2 different hosts will be on the same host now and ceph will want to redistribute.
[19:09] <amb> Yeah I moved one of my 2 OSDs (with 2 replicas) to prevent that, so it had nowhere to move it to.
[19:11] <amb> Well, I suppose whatever I did, SIGABORT is not the right result, so I should report that.
[19:17] * coyo (~unf@00017955.user.oftc.net) Quit (Ping timeout: 480 seconds)
[19:28] * markbby (~Adium@168.94.245.1) Quit (Quit: Leaving.)
[19:32] <amb> Apparently ceph doesn't like scp stripping all the xattrs off. Who'd have thought. Doh.
[19:33] <tnt> no kidding :)
[19:33] <amb> rsync -aHAX is your friend :-)
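
A sketch of the xattr-preserving copy amb lands on, assuming the default OSD data path; the OSD id and target host are placeholders:

    # -a archive, -H hard links, -A ACLs, -X extended attributes
    rsync -aHAX /var/lib/ceph/osd/ceph-0/ newhost:/var/lib/ceph/osd/ceph-0/
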
[19:41] * alrs (~lars@cpe-142-129-65-37.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[19:45] * jksM (~jks@3e6b5724.rev.stofanet.dk) Quit (Read error: Connection reset by peer)
[20:00] <mrjack> is there a way to find out scrubbing statistics? how long the scrub takes, what % of the work is done, or so?
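
As far as I know there is no built-in progress percentage in this era, but two places to look (assuming this era's tooling) are the per-PG scrub timestamps and the cluster log:

    ceph pg dump    # includes last scrub / deep-scrub stamps per PG
    ceph -w         # watch the cluster log for scrub messages
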
[20:08] * BillK (~BillK@124-169-186-145.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[20:08] * [fred] (fred@konfuzi.us) has joined #ceph
[20:11] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[20:11] * sileht (~sileht@gizmo.sileht.net) Quit (Quit: WeeChat 0.4.0)
[20:20] * coyo (~unf@ip-64-134-48-199.public.wayport.net) has joined #ceph
[20:23] * sileht (~sileht@gizmo.sileht.net) has joined #ceph
[20:27] * alrs (~lars@ip-64-134-231-144.public.wayport.net) has joined #ceph
[20:28] <mrjack> somehow, rbd rm <image> forces slow requests on 0.61.2 on my setup... then one osd gets kicked out, joins again, it resyncs and everything is fine again... what can i do?
[20:36] * dcasier (~dcasier@223.103.120.78.rev.sfr.net) Quit (Read error: Connection reset by peer)
[20:36] <tnt> well rbd rm is pretty IO intensive ... but still shouldn't kick an osd out. Are you sure there is nothing wrong with that OSD's drives?
[20:40] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[20:42] <mrjack> no, the osd is on raid10
[20:43] <mrjack> i can check...
[20:43] <mrjack> no, the disks are clean
[20:43] <mrjack> i can reproduce it easily
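
One way to reproduce this while watching for the slow-request and osd-down messages described above; substitute the actual image name:

    # terminal 1: follow the cluster log
    ceph -w
    # terminal 2: trigger the deletion
    rbd rm <image>
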
[20:46] * loicd (~loic@host-78-149-198-157.as13285.net) Quit (Ping timeout: 480 seconds)
[20:47] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[21:02] * loicd (~loic@host-78-149-198-157.as13285.net) has joined #ceph
[21:03] * loicd (~loic@host-78-149-198-157.as13285.net) Quit ()
[21:04] * BillK (~BillK@124-169-186-145.dyn.iinet.net.au) has joined #ceph
[21:11] * loicd (~loic@host-78-149-198-157.as13285.net) has joined #ceph
[21:15] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[21:18] * markbby (~Adium@168.94.245.1) has joined #ceph
[21:19] * markbby (~Adium@168.94.245.1) Quit ()
[21:30] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[21:31] * eschnou (~eschnou@249.73-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:34] <tnt> mrjack: is it always the same osd ?
[21:35] * alrs (~lars@ip-64-134-231-144.public.wayport.net) Quit (Ping timeout: 480 seconds)
[21:42] * jks (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[21:46] * loicd (~loic@host-78-149-198-157.as13285.net) Quit (Ping timeout: 480 seconds)
[21:59] * markbby (~Adium@168.94.245.3) has joined #ceph
[22:00] * loicd (~loic@host-78-149-198-157.as13285.net) has joined #ceph
[22:08] <mrjack> tnt: i'll check that
[22:09] * KindTwo (KindOne@h98.41.28.71.dynamic.ip.windstream.net) has joined #ceph
[22:09] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:10] * KindTwo is now known as KindOne
[22:10] <mrjack> tnt: it is always the same osd that says that it was wrongly marked down, but the slow requests seem to happen to all osds
[22:10] <mrjack> well
[22:10] <mrjack> no
[22:11] <mrjack> it is not always the same osd
[22:12] <mrjack> oh wait
[22:12] <mrjack> it seems that it was not always the same osd on bobtail, but now since upgrading to cuttlefish it is indeed always the same osd wrongly marked down..
[22:13] * BillK (~BillK@124-169-186-145.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[22:28] * coyo (~unf@00017955.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:50] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[22:50] * ChanServ sets mode +v andreask
[22:56] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[22:58] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[23:01] * loicd1 (~loic@host-78-149-198-157.as13285.net) has joined #ceph
[23:02] * loicd (~loic@host-78-149-198-157.as13285.net) Quit (Read error: Connection reset by peer)
[23:10] * markbby (~Adium@168.94.245.3) Quit (Remote host closed the connection)
[23:12] * loicd1 trying to remember the URL of the teuthology builds
[23:25] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[23:41] * eschnou (~eschnou@249.73-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:53] * coyo (~unf@mdf2636d0.tmodns.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.