#ceph IRC Log

IRC Log for 2013-08-04

Timestamps are in GMT/BST.

[0:01] <loopy> also, I can't seem to keep my clocks in sync between them - any suggestions there?
[0:01] <loopy> and trying to access the cluster with a client via their public interface is giving 'fault'
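The usual cure for drifting clocks on Ceph nodes is an NTP daemon on every host; the monitors warn once skew exceeds mon_clock_drift_allowed (0.05 s by default). A minimal sketch, assuming Debian/Ubuntu hosts:

    sudo apt-get install ntp      # or chrony
    sudo service ntp restart
    ntpq -p                       # confirm the daemon is actually syncing against its peers
    ceph health detail            # any remaining "clock skew detected" warning names the offending mon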
[0:12] * sleinen (~Adium@2001:620:0:25:5927:903d:7065:4f90) Quit (Quit: Leaving.)
[0:12] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[0:14] * saabylaptop (~saabylapt@1009ds5-oebr.1.fullrate.dk) Quit (Quit: Leaving.)
[0:20] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[0:41] * LeaChim (~LeaChim@2.122.178.96) Quit (Ping timeout: 480 seconds)
[0:51] * Vincent_Valentine (~Vincent_V@49.206.158.155) Quit (Ping timeout: 480 seconds)
[1:34] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[1:39] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[1:52] * AfC (~andrew@2001:44b8:31cb:d400:cc05:1c07:9192:74e2) Quit (Quit: Leaving.)
[2:14] * rongze (~quassel@754fe8ea.test.dnsbl.oftc.net) Quit (Read error: Connection reset by peer)
[2:14] * rongze (~quassel@li565-182.members.linode.com) has joined #ceph
[2:40] <cfreak201> mhm I just restarted one of my 2 ceph nodes (clean shutdown), everything set up to be replicated between both nodes.. after restarting, out of the 24 disks sometimes 13 are up, sometimes 17, or just 14... it's toggling around.. any hints? (there has been no read/write access to the storage for hours)
[2:55] * terje (~joey@97-118-167-16.hlrn.qwest.net) Quit (Read error: Operation timed out)
[2:56] * terje (~joey@184-96-143-206.hlrn.qwest.net) has joined #ceph
[3:14] * joao (~JL@216.1.187.162) has joined #ceph
[3:14] * ChanServ sets mode +o joao
[3:33] * leseb (~leseb@88-190-214-97.rev.dedibox.fr) Quit (Killed (NickServ (Too many failed password attempts.)))
[3:33] * leseb (~leseb@88-190-214-97.rev.dedibox.fr) has joined #ceph
[3:56] * rturk-away is now known as rturk
[4:03] * joao (~JL@216.1.187.162) Quit (Ping timeout: 480 seconds)
[4:18] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[4:30] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[4:32] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Read error: Operation timed out)
[4:36] * rturk is now known as rturk-away
[5:05] * fireD_ (~fireD@93-139-173-101.adsl.net.t-com.hr) has joined #ceph
[5:07] * fireD (~fireD@93-142-230-25.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:16] * gentleben (~sseveranc@c-98-207-40-73.hsd1.ca.comcast.net) Quit (Quit: gentleben)
[5:23] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[5:28] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[5:32] * gentleben (~sseveranc@c-98-207-40-73.hsd1.ca.comcast.net) has joined #ceph
[5:50] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[5:58] * gentleben (~sseveranc@c-98-207-40-73.hsd1.ca.comcast.net) Quit (Quit: gentleben)
[6:13] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[6:13] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit ()
[6:26] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[6:26] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit ()
[6:33] * gentleben (~sseveranc@c-98-207-40-73.hsd1.ca.comcast.net) has joined #ceph
[6:36] * gentleben (~sseveranc@c-98-207-40-73.hsd1.ca.comcast.net) Quit ()
[6:41] * gentleben (~sseveranc@c-98-207-40-73.hsd1.ca.comcast.net) has joined #ceph
[6:49] * gentleben (~sseveranc@c-98-207-40-73.hsd1.ca.comcast.net) Quit (Quit: gentleben)
[7:08] * AfC (~andrew@2001:44b8:31cb:d400:cc05:1c07:9192:74e2) has joined #ceph
[7:11] * gwapo (~oftc-webi@103.11.50.249) has joined #ceph
[7:12] * gwapo (~oftc-webi@103.11.50.249) Quit ()
[7:20] * yanzheng (~zhyan@134.134.139.74) has joined #ceph
[7:24] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[7:29] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[7:30] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[7:36] * Vincent_Valentine (~Vincent_V@49.206.158.155) has joined #ceph
[8:04] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[8:08] * yanzheng (~zhyan@134.134.139.74) Quit (Remote host closed the connection)
[8:18] * Vincent_Valentine (~Vincent_V@49.206.158.155) Quit (Ping timeout: 480 seconds)
[8:22] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[8:46] * gentleben (~sseveranc@c-98-207-40-73.hsd1.ca.comcast.net) has joined #ceph
[9:07] * yanzheng (~zhyan@134.134.139.74) has joined #ceph
[9:31] * odyssey4me (~odyssey4m@41-133-58-101.dsl.mweb.co.za) has joined #ceph
[9:53] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[9:55] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit ()
[10:03] * odyssey4me (~odyssey4m@41-133-58-101.dsl.mweb.co.za) Quit (Ping timeout: 480 seconds)
[11:11] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[11:12] * sage (~sage@76.89.177.113) Quit (Ping timeout: 480 seconds)
[11:23] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[11:34] * yanzheng (~zhyan@134.134.139.74) Quit (Remote host closed the connection)
[11:47] * yanzheng (~zhyan@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[12:21] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[12:28] * allsystemsarego (~allsystem@188.25.130.190) has joined #ceph
[12:28] * sleinen (~Adium@2001:620:0:25:c883:5ee0:6129:cc7a) has joined #ceph
[12:34] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[12:38] * BillK (~BillK-OFT@124-148-246-233.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[12:41] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Read error: Operation timed out)
[12:55] * LeaChim (~LeaChim@2.122.178.96) has joined #ceph
[13:04] <cfreak201> how can I tell ceph: that OSD is lost, a meteor crashed on it, just forget it and use the new ones?
[13:08] * dwm (~dwm@northrend.tastycake.net) has joined #ceph
[13:21] <phantomcircuit> cfreak201, ceph osd lost
[13:21] <phantomcircuit> :)
[13:21] <cfreak201> phantomcircuit: I did that.. added new OSDs just to see those "wrongly marked me down" messages on the new OSDs..
[13:22] <phantomcircuit> cfreak201, ceph osd dump, ceph osd tree -> pastebin
[13:26] <cfreak201> phantomcircuit: http://pastie.org/8205148 http://pastie.org/8205150
[13:26] * yanzheng (~zhyan@jfdmzpr05-ext.jf.intel.com) Quit (Remote host closed the connection)
[13:28] <phantomcircuit> cfreak201, are all of the ones listed as down actually up and alive?
[13:28] <phantomcircuit> it looks like you reused the old OSDs' numbers
[13:28] <cfreak201> osd0 - osd11 are "dead" as in disks died...
[13:28] <phantomcircuit> it's probably best to use new ones
[13:28] * dpippenger (~riven@cpe-75-85-17-224.socal.res.rr.com) has joined #ceph
[13:28] <phantomcircuit> oh you did
[13:29] <phantomcircuit> cfreak201, what happens when you do ceph osd lost 0
[13:29] <cfreak201> osd.0 is not down or doesn't exist
[13:30] <phantomcircuit> oh i see you used ceph osd rm
[13:30] <phantomcircuit> hmm
[13:30] <phantomcircuit> they're listed as DNE
[13:31] <cfreak201> since it's just an evaluation of ceph and there's no production data on it, I figured I'd rather try to resolve it myself than wait for a response :-)
[13:31] <dwm> Hmm, applying a ceph osd reweight appears to have left some PGs stuck in active+remapped
[13:31] <cfreak201> what does DNE stand for ?
[13:31] <phantomcircuit> cfreak201, are the OSDs listed as 'down' on storage1 actually down or just incorrectly marked as down
[13:31] <dwm> DNE == Does Not Exist?
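For reference, the removal sequence for a dead OSD from the docs of that era, sketched with osd.0 standing in for each failed id:

    ceph osd out 0                              # stop CRUSH from placing new data on it
    ceph osd lost 0 --yes-i-really-mean-it      # declare its objects unrecoverable; only works while the id is still down and present in the osdmap
    ceph osd crush remove osd.0                 # drop it from the CRUSH map so weights rebalance
    ceph auth del osd.0                         # delete its authentication key
    ceph osd rm 0                               # remove the id from the osdmap

Running "ceph osd rm" without the crush remove step typically leaves the id showing as DNE (does not exist) in "ceph osd tree", which matches the pastes above.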
[13:32] <cfreak201> phantomcircuit: processes are not running, the physical disks have been wiped (dd /dev/zero) and re-added as new ones
[13:32] <phantomcircuit> lol then how is 35 up/in
[13:32] <phantomcircuit> o.o
[13:32] <cfreak201> 35 is one of the new ones that are up
[13:33] <phantomcircuit> cfreak201, ok so ceph-osd 27
[13:33] <phantomcircuit> is there a ceph-osd process for osd 27 on storage1
[13:36] <cfreak201> oh well, I had them shut down... now the processes are running..
[13:36] <cfreak201> but 24 - 35 are flapping down/up
[13:40] <cfreak201> phantomcircuit: the current status with actually running processes.. http://pastie.org/8205181 http://pastie.org/8205182
[13:40] <phantomcircuit> cfreak201, run ceph -w, wait a little while, and then pastebin it
[13:40] <cfreak201> ok
[13:43] * yanzheng (~zhyan@134.134.139.74) has joined #ceph
[13:44] <cfreak201> phantomcircuit: http://pastie.org/8205186
[13:46] <phantomcircuit> cfreak201, try bringing them up one at a time more slowly
[13:46] <cfreak201> phantomcircuit: ok
[13:48] <cfreak201> I've added 5 min of delay between each of the OSDs... hopefully that should be enough
[13:50] * dpippenger (~riven@cpe-75-85-17-224.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[14:02] <cfreak201> phantomcircuit: that doesn't seem to quite do it :/ http://pastebin.com/4vTeftFr
[14:19] <dwm> Ah, CRUSH weights and reweights are different things.
[14:19] <dwm> Reweights are a float 0..1 that throttles utility; the weight is an arbitrary float that represents storage capacity.
[14:20] <dwm> Using reweights to specify storage capacity doesn't seem to work cleanly..
[14:20] <dwm> (Resulting in the active+remapped condition.)
[14:31] * sage (~sage@76.89.177.113) has joined #ceph
[14:34] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[14:38] * yanzheng (~zhyan@134.134.139.74) Quit (Remote host closed the connection)
[14:38] <phantomcircuit> dwm, iirc crush weights are where to set capacity
[14:39] <phantomcircuit> ceph more or less assumes that all the OSDs have the same throughput/capacity ratio
[14:39] <phantomcircuit> which is pretty much never true
[14:40] * leseb (~leseb@88-190-214-97.rev.dedibox.fr) Quit (Killed (NickServ (Too many failed password attempts.)))
[14:41] * leseb (~leseb@88-190-214-97.rev.dedibox.fr) has joined #ceph
[14:41] * Vincent_Valentine (~Vincent_V@49.206.158.155) has joined #ceph
[14:42] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[14:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:44] <cfreak201> phantomcircuit: the ramp-up is about to finish but I still see it bouncing back to 12 OSDs, then back up to ~23 and down to 14... same behaviour as before
[14:44] <cfreak201> any further ideas?
[14:45] <phantomcircuit> cfreak201, is the network saturated?
[14:46] <cfreak201> phantomcircuit: I don't think so... dedicated 1 Gbit switch with just those ceph nodes connected
[14:47] <cfreak201> ~600kbit/s "load"
[14:47] <phantomcircuit> I'm out of ideas
[14:48] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[14:53] <cfreak201> paravoid: thanks for your input anyway.. Seems like I just can't get it to recover from any kind of failure, which makes it pretty useless :/
[15:00] <cfreak201> oh sorry paravoid, wrong autocomplete, I meant phantomcircuit
[15:05] <dwm> phantomcircuit: Hmm, indeed --- my error seems to have been trying to use 'reweight' rather than 'weight'.
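The distinction in command form, as a sketch (osd.12 and the values are placeholders):

    ceph osd crush reweight osd.12 1.82    # CRUSH weight: capacity, arbitrary float, conventionally the device size in TiB
    ceph osd reweight 12 0.8               # override reweight: a 0..1 factor that throttles how much of that capacity is used

Data placement follows the CRUSH weights; the 0..1 reweight is only meant for temporarily shifting load off an over-full OSD.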
[15:23] <cfreak201> phantomcircuit: I figured it out.. ceph status shows "HEALTH_OK".... after rebooting, that server didn't allow incoming connections/packets for ceph.. -.-
[15:24] <phantomcircuit> lol
[15:24] <cfreak201> yes..
[15:24] <phantomcircuit> :)
[15:25] <phantomcircuit> don't worry, I pulled one of those yesterday
[15:25] <cfreak201> I would have expected some "connection refused" output somewhere..
[15:25] <phantomcircuit> i couldn't figure out why postgresql-server was failing to build on gentoo hardened
[15:25] <phantomcircuit> i knew i had to emerge postgresql-base again, but forgot to do it
[15:25] <phantomcircuit> proceeded to dig through tons of perfectly functional code
[15:25] <cfreak201> haha
[15:25] <phantomcircuit> for hours
[15:25] <phantomcircuit> and then remembered that i hadn't
[15:26] <phantomcircuit> checklists
[15:26] <phantomcircuit> infinitely valuable
[15:26] <cfreak201> i just adjusted the puppet class..
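The flapping here turned out to be a host firewall dropping Ceph traffic after the reboot. A minimal iptables sketch for the defaults of that era (6789/tcp for monitors, 6800-7100/tcp for OSD daemons; distribution firewalls vary):

    iptables -A INPUT -p tcp --dport 6789 -j ACCEPT        # monitor
    iptables -A INPUT -p tcp --dport 6800:7100 -j ACCEPT   # OSD and heartbeat ports

OSDs that can reach the monitors but not each other's heartbeat ports get marked down by their peers and then log "wrongly marked me down", which is exactly the flapping pattern above.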
[15:26] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[15:57] * BillK (~BillK-OFT@124-148-246-233.dyn.iinet.net.au) has joined #ceph
[16:05] * sprachgenerator (~sprachgen@50.44.40.223) has joined #ceph
[16:20] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[16:20] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:22] * sprachgenerator (~sprachgen@50.44.40.223) Quit (Quit: sprachgenerator)
[16:27] * Vincent_Valentine (~Vincent_V@49.206.158.155) Quit (Ping timeout: 480 seconds)
[16:45] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: Not that there is anything wrong with that)
[16:47] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) Quit (Remote host closed the connection)
[16:53] * yanzheng (~zhyan@134.134.139.74) has joined #ceph
[17:30] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[17:32] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[17:35] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[17:36] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[17:44] * joao (~JL@216.1.187.162) has joined #ceph
[17:44] * ChanServ sets mode +o joao
[17:47] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:48] * allsystemsarego (~allsystem@188.25.130.190) Quit (Quit: Leaving)
[18:02] * leseb (~leseb@88-190-214-97.rev.dedibox.fr) Quit (Killed (NickServ (Too many failed password attempts.)))
[18:02] * leseb (~leseb@88-190-214-97.rev.dedibox.fr) has joined #ceph
[18:02] * allsystemsarego (~allsystem@188.25.130.190) has joined #ceph
[18:13] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[18:14] * Vincent_Valentine (~Vincent_V@49.206.158.155) has joined #ceph
[18:42] * yanzheng (~zhyan@134.134.139.74) Quit (Ping timeout: 480 seconds)
[18:47] * dpippenger (~riven@cpe-75-85-17-224.socal.res.rr.com) has joined #ceph
[18:50] * DarkAce-Z (~BillyMays@50.107.55.36) has joined #ceph
[18:55] * DarkAceZ (~BillyMays@50.107.55.36) Quit (Ping timeout: 480 seconds)
[18:59] * DarkAce-Z is now known as DarkAceZ
[19:10] * joao (~JL@216.1.187.162) Quit (Remote host closed the connection)
[19:13] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[19:21] * nwat (~nwat@eduroam-251-132.ucsc.edu) has joined #ceph
[19:36] * dpippenger (~riven@cpe-75-85-17-224.socal.res.rr.com) Quit (Remote host closed the connection)
[19:36] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[19:42] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Ping timeout: 480 seconds)
[19:43] <chamings> good morning (or whatever). Does anyone else have trouble getting 'osd mount options xfs' recognized in the cuttlefish release? It seems to ignore the values I place there altogether and mount with only 'rw'.
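For comparison, the option normally lives in ceph.conf like this (a sketch; the option values are illustrative):

    [osd]
    osd mount options xfs = rw,noatime,inode64
    osd mkfs options xfs = -f -i size=2048

It only takes effect when Ceph itself mounts the OSD data partition (ceph-disk / the init scripts); data directories mounted via /etc/fstab keep whatever options fstab specifies.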
[19:48] <Vincent_Valentine> Any idea when the developer summit is scheduled.. just need an approximate time
[19:54] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Quit: Leaving...)
[19:55] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[19:55] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit ()
[20:27] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:33] * saabylaptop (~saabylapt@1009ds5-oebr.1.fullrate.dk) has joined #ceph
[20:33] * saabylaptop (~saabylapt@1009ds5-oebr.1.fullrate.dk) Quit ()
[20:34] * saabylaptop (~saabylapt@1009ds5-oebr.1.fullrate.dk) has joined #ceph
[20:44] * Vincent_Valentine (~Vincent_V@49.206.158.155) Quit (Ping timeout: 480 seconds)
[20:58] * grepory (~Adium@209.119.62.120) has joined #ceph
[21:06] * saabylaptop (~saabylapt@1009ds5-oebr.1.fullrate.dk) Quit (Quit: Leaving.)
[21:06] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:07] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:12] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:13] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:17] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:17] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:20] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:20] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:21] <sleinen> A colleague had to reformat four OSDs last week. I cannot quite get the cluster back into a clean state. There are two unfound objects in one PG. I cannot declare them as lost, because one of the OSDs of the PG is always reported as "querying".
[21:21] <sleinen> https://gist.github.com/sleinen/6151553
[21:21] <sleinen> Any ideas on how to get this unstuck?
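The usual inspection and repair commands for that situation look roughly like this (2.5 is a placeholder pgid; mark_unfound_lost keeps refusing while any candidate OSD is still listed as "querying"):

    ceph health detail                       # which PGs have unfound objects
    ceph pg 2.5 query                        # recovery state, including peers still being queried
    ceph pg 2.5 list_missing                 # enumerate the unfound objects
    ceph pg 2.5 mark_unfound_lost revert     # or "delete"; only once no might-have-unfound peer is left querying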
[21:39] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[21:47] * mschiff (~mschiff@port-3598.pppoe.wtnet.de) has joined #ceph
[21:48] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:49] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:52] * grepory (~Adium@209.119.62.120) Quit (Quit: Leaving.)
[21:59] * grepory (~Adium@209.119.62.120) has joined #ceph
[22:18] * allsystemsarego (~allsystem@188.25.130.190) Quit (Quit: Leaving)
[22:21] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[22:22] * ninkotech (~duplo@static-84-242-87-186.net.upcbroadband.cz) Quit (Remote host closed the connection)
[22:22] * ninkotech (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[22:26] * dalgaaf (~dalgaaf@nrbg-4dbe2844.pool.mediaWays.net) has joined #ceph
[22:46] * grepory (~Adium@209.119.62.120) Quit (Quit: Leaving.)
[22:54] * mozg (~andrei@host109-151-35-94.range109-151.btcentralplus.com) has joined #ceph
[22:54] <mozg> hello guys
[22:54] <mozg> I need to rebuild one of my servers, which acts as one of the ceph-mon servers
[22:54] <mozg> does anyone know the right procedure?
[22:55] <mozg> should I remove that mon server from ceph and leave my cluster with 2 mon servers while I rebuild the mon?
[22:55] <mozg> or should I just rebuild the server, install ceph, and add the mon afterwards?
[23:08] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[23:10] <mikedawson> mozg: I have had success with http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors
[23:11] <mozg> does it matter if my cluster will temporarily have 2 mons instead of 3?
[23:13] <mikedawson> mozg: as long as you have quorum, you'll be fine. Two of three is a majority, so you should keep quorum.
[23:13] <mozg> cheers
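Condensed, those two doc sections amount to roughly the following (a sketch; "mon-c", the address, and the paths are placeholders, and it assumes the remaining mons keep quorum throughout):

    # remove the old monitor before rebuilding its host
    sudo service ceph stop mon.mon-c
    ceph mon remove mon-c
    # ...rebuild the host and reinstall ceph, then add the monitor back...
    ceph mon getmap -o /tmp/monmap
    ceph auth get mon. -o /tmp/mon.keyring
    sudo ceph-mon -i mon-c --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add mon-c 192.0.2.13:6789
    sudo service ceph start mon.mon-c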
[23:14] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[23:30] * mschiff (~mschiff@port-3598.pppoe.wtnet.de) Quit (Remote host closed the connection)
[23:39] * sleinen (~Adium@2001:620:0:25:c883:5ee0:6129:cc7a) Quit (Ping timeout: 480 seconds)
[23:42] <mozg> strange
[23:42] <mozg> i've just added a 4th mon to my cluster using ceph-deploy
[23:42] <mozg> and my cluster froze
[23:43] <mozg> how do I recover?
[23:43] <mozg> i had a quorum with 3 mons
[23:44] <mozg> the mon log file has the following messages:
[23:44] <mozg> 2013-08-04 22:44:04.511021 7fa0616a9700 1 mon.arh-cloud11-ib@1(electing).elector(523) init, last seen epoch 523
[23:44] <mozg> every 5 seconds
[23:45] <mozg> could someone please help me with the problem
[23:47] <sage> mozg: still no quorum even when the 4th mon daemon is stopped?
[23:47] <mozg> sage: no (((
[23:47] <mozg> ceph -s gives me:
[23:47] <mozg> 2013-08-04 22:47:02.825539 7f1eee74c700 0 -- 192.168.168.200:0/22402 >> 192.168.168.2:6789/0 pipe(0x7f1ed80088b0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
[23:48] <mozg> I do have this process running on the new mon: /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i arh-cloud11-ib
[23:48] <mozg> it's been running for about 20 mins
[23:49] <sage> that's harmless and can be ignored
[23:49] <sage> can you do a 'ceph --admin-daemon /var/run/ceph/ceph-mon.*.asok config set debug_mon 20'
[23:49] <sage> and debug_ms 1
[23:49] <sage> on one of the other mons so we can see why the election isn't completing?
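Spelled out for a single monitor (the asok filename carries the mon id, so the "*" above needs expanding; the mon name here is taken from later in the conversation):

    ceph --admin-daemon /var/run/ceph/ceph-mon.arh-ibstorage1-ib.asok config set debug_mon 20
    ceph --admin-daemon /var/run/ceph/ceph-mon.arh-ibstorage1-ib.asok config set debug_ms 1
    # and back down again once the log has been captured
    ceph --admin-daemon /var/run/ceph/ceph-mon.arh-ibstorage1-ib.asok config set debug_mon 1
    ceph --admin-daemon /var/run/ceph/ceph-mon.arh-ibstorage1-ib.asok config set debug_ms 0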
[23:49] <mozg> will do
[23:50] <mozg> not on the new mon server?
[23:51] <mozg> sage: done the debug commands
[23:51] <mozg> got success in return
[23:51] <sage> post the resulting /var/log/ceph/ceph-mon.*.log somewhere
[23:51] <mozg> will do
[23:52] <mozg> could you please remind me of the url for posting to your sftp server?
[23:54] <mozg> done the upload
[23:55] <sage> thanks, looking
[23:55] <sage> mozg: are you sure 3 mons are actually up?
[23:56] <sage> with a monmap of size 3, 2 is a quorum. If you add a 4th, you need 3 up ceph-mon daemons
[23:56] <sage> i only see election messages from mon.0 and mon.2
[23:56] <mozg> i will double check, but i've not stopped them before adding the 4th mon
[23:56] <sage> mon.1 could have been down the whole time
[23:57] <mozg> all 3 ceph-mon processes are running
[23:57] <mozg> and actually I've checked that before adding the 4th mon
[23:58] <mozg> and ceph -s showed all 3 mons were up
[23:58] <mozg> should i restart ceph-mon on all 4 servers?
[23:58] <mozg> by the way, i am using 0.61.7
[23:58] <mozg> on ubuntu 12.04
[23:58] <sage> hmm
[23:58] <sage> can you do the same debug commands on the mon.1 and post that log?
[23:59] <sage> curious why it's not participating in the election
[23:59] <sage> my guess is that restarting it will fix, btw.
[23:59] <sage> if you are in a hurry :)
[23:59] <mozg> i will try to restart it
[23:59] <mozg> by the way, I do not have a mon.1. The first mon is called arh-ibstorage1-ib
[23:59] <mozg> the second one is arh-ibstorage2-ib
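On 0.61.x / Ubuntu 12.04 the restart command depends on how the mon was deployed; a sketch covering both styles, using the mon name just given:

    sudo restart ceph-mon id=arh-ibstorage2-ib        # upstart jobs, as used by ceph-deploy
    sudo service ceph restart mon.arh-ibstorage2-ib   # sysvinit deployments (mkcephfs / manual)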

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.