#ceph IRC Log

IRC Log for 2013-04-23

Timestamps are in GMT/BST.

[0:04] * jskinner (~jskinner@69.170.148.179) Quit (Remote host closed the connection)
[0:08] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[0:11] * houkouonchi-work (~linux@12.248.40.138) Quit (Remote host closed the connection)
[0:14] * tnt (~tnt@91.176.19.114) Quit (Ping timeout: 480 seconds)
[0:16] * houkouonchi-work (~linux@12.248.40.138) has joined #ceph
[0:26] * DarkAceZ (~BillyMays@50.107.54.92) Quit (Ping timeout: 480 seconds)
[0:29] * DarkAceZ (~BillyMays@50.107.54.92) has joined #ceph
[0:35] * gmason (~gmason@hpcc-fw.net.msu.edu) Quit (Quit: Computer has gone to sleep.)
[0:52] * rturk is now known as rturk-away
[1:01] * PerlStalker (~PerlStalk@72.166.192.70) Quit (Quit: ...)
[1:06] * calebamiles (~caleb@c-50-138-218-203.hsd1.vt.comcast.net) has joined #ceph
[1:06] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[1:09] * mrjack (mrjack@office.smart-weblications.net) has joined #ceph
[1:23] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[1:29] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[1:33] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[1:35] * KevinPerks2 (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[1:38] <Kioob> once again, for the record: after a lot of time, I found every version of my data on all nodes. All seems fine at the Ceph level: I have exactly the same data on each replica, but my data is not there
[1:38] <Kioob> So if there is a problem, it's not in the OSDs
[1:38] <Kioob> well.... good night !
[1:47] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:54] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[1:54] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[2:08] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[2:08] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[2:15] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[2:15] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[2:17] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit ()
[2:17] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[2:17] * John (~john@astound-64-85-225-33.ca.astound.net) Quit (Quit: Leaving)
[2:18] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[2:18] * rustam (~rustam@94.15.91.30) has joined #ceph
[2:18] * LeaChim (~LeaChim@176.250.220.3) Quit (Read error: Operation timed out)
[2:19] * alram (~alram@38.122.20.226) Quit (Read error: Operation timed out)
[2:20] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[2:20] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Read error: Connection reset by peer)
[2:21] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[2:24] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit ()
[2:24] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[2:36] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:41] * gmason (~gmason@173.241.208.122) has joined #ceph
[2:41] * capri (~capri@pd95c3284.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[2:41] * capri (~capri@pd95c3283.dip0.t-ipconnect.de) has joined #ceph
[2:48] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[2:51] * capri_on (~capri@pd95c3284.dip0.t-ipconnect.de) has joined #ceph
[2:51] * capri (~capri@pd95c3283.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[2:55] * rustam (~rustam@94.15.91.30) has joined #ceph
[2:55] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[2:56] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[3:01] * rustam (~rustam@94.15.91.30) has joined #ceph
[3:01] * coyo (~unf@00017955.user.oftc.net) Quit (Quit: F*ck you, I'm a daemon.)
[3:02] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[3:03] <dmick> http://www.thepetitionsite.com/986/064/112/vietnam-stop-the-illegal-trade-in-pangolins/?cid=fb_lg_pangolinvietnam2
[3:18] <mikedawson> dmick: have you seen monitor admin sockets that hang for a long time?
[3:18] <dmick> er...not usually
[3:19] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Leaving.)
[3:19] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:20] <mikedawson> dmick: I'm fighting several mon issues with 0.60. Not sure if this is another issue or just a symptom
[3:21] * noob2 (~cjh@173.252.71.2) Quit (Quit: Leaving.)
[3:23] * mjevans (~mje@209.141.34.79) Quit (Ping timeout: 480 seconds)
[3:23] * dosaboy (~dosaboy@72.11.113.122) has joined #ceph
[3:26] <dmick> seems strange. what command, and does it eventually complete?
[3:28] <mikedawson> dmick: any command ... ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok help . I give up after several minutes
[3:28] <mikedawson> this daemon also isn't logging or spinning CPU
[3:29] <dmick> yeah, that should be immediate unless the mon is wedged
[3:29] <mikedawson> it starts up very slowly, then nothing
[3:29] * mcclurmc_laptop (~mcclurmc@cpc1-oxfd21-2-0-cust70.4-3.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[3:29] <dmick> strace?
[3:30] <mikedawson> I'll try to strace it after the kids' bedtime. Is there a link anywhere to remind me how to do it?
[3:31] <dmick> man strace, but: strace -f -p <pid>
[3:31] <dmick> just to see if it's doing anything at all
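A minimal sketch of the check dmick suggests, assuming a single ceph-mon process on the box (otherwise pick the pid of the stuck mon by hand):

    strace -f -p $(pidof ceph-mon)
    # a responsive mon typically shows a steady stream of epoll/read/write calls;
    # a wedged one just sits there with no activity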
[3:32] <mikedawson> thanks, dmick
[3:32] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Leaving.)
[3:33] * Cube (~Cube@12.248.40.138) Quit (Quit: Leaving.)
[3:33] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:38] <mikedawson> dmick: when I do service ceph -a start, sometimes I get stuff like: global_init: unable to open config file from search list /tmp/ceph.conf.37e4d4203d44734d9f8dde2314ca5389
[3:38] * rustam (~rustam@94.15.91.30) has joined #ceph
[3:39] <mikedawson> do you know what's going on there?
[3:39] <dmick> that seems ungood; root-ssh to all boxes from that one working OK?
[3:39] <dmick> (passwordless root)
[3:40] * gmason (~gmason@173.241.208.122) Quit (Quit: Computer has gone to sleep.)
[3:40] <mikedawson> yep. it started two OSDs on the host in question, but failed on the third
[3:40] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[3:40] <mikedawson> If I rerun, it works
[3:40] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Leaving.)
[3:40] <mikedawson> but may fail on another OSD
[3:42] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[3:43] <dmick> doesn't make sense. Are you perhaps suffering from flappy networks?
[3:45] <mikedawson> perhaps
[3:45] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:46] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit ()
[3:46] <mikedawson> dmick, is the /tmp dir on the box with the OSD or the box from which I ran "service ceph -a start"?
[3:47] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:47] <dmick> I think the OSD, but I'm not 100% sure; I think the master distributes ceph.conf to the slaves by scp'ing it to /tmp
[3:48] <dmick> lemme browse the script
[3:48] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:48] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit ()
[3:49] <mikedawson> dmick: it seems to be the osd... root 2503 1 3 01:42 ? 00:00:11 /usr/bin/ceph-osd -i 39 --pid-file /var/run/ceph/osd.39.pid -c /tmp/ceph.conf.61ea75c41365f2cd892094880d76fc8e
[3:49] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Read error: Connection reset by peer)
[3:50] <dmick> yep
[3:50] <mikedawson> running "service ceph start" in a for loop works every time (and uses the local /etc/ceph/ceph.conf)
[3:50] <dmick> sure; without the -a, it doesn't do any remote ops
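A sketch of the workaround mikedawson describes: starting the daemons locally on each host over ssh instead of using -a, so every daemon reads its own /etc/ceph/ceph.conf rather than an scp'd /tmp copy. Hostnames are placeholders:

    for host in node1 node2 node3; do
        ssh root@$host "service ceph start"
    done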
[3:51] <dmick> I did see some discussion earlier today about similar hostnames causing issues...lemme see if I can find that
[3:51] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:52] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:52] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Read error: Connection reset by peer)
[3:52] <dmick> http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/14566
[3:53] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:53] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit ()
[3:53] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:53] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Read error: Connection reset by peer)
[3:54] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:54] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Read error: Connection reset by peer)
[4:01] <dmick> mikedawson: I don't know if that could be your problem but it seems plausible
[4:02] * treaki__ (afe4ff0994@p4FDF78D1.dip0.t-ipconnect.de) has joined #ceph
[4:02] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[4:07] * treaki_ (53241a3ae7@p4FDF70FE.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[4:09] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[4:09] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[4:19] * diegows (~diegows@190.190.2.126) Quit (Ping timeout: 480 seconds)
[4:26] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[4:28] <mikedawson> dmick: that's not my problem, but I can see where that would be an issue
[5:01] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[5:02] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[5:03] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:08] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[5:17] * nhm (~nh@65-128-150-185.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[5:21] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:20] <mikedawson> matt_, mega_au: ping
[6:20] <mega_au> Here I am.
[6:21] <mikedawson> mega_au: did you deploy 0.60 from scratch or upgrade? I may have found the problem if you deployed 0.60 or 0.59 with mkcephfs
[6:22] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[6:23] <mega_au> Upgraded from 0.39 all the way through to 0.60. Got a complaint when upgrading to 0.59 that the store is pre-0.52
[6:24] <mega_au> On one monitor. Dropped that monitor and created it afresh. The rest upgraded more or less OK.
[6:24] <mega_au> My message yesterday should have read "too far ahead" - sorry. Did you check your logs?
[6:25] <mikedawson> mega_au: if you are still having monitor / quorum issues, can you check your monitor keyrings for the presence of a caps line?
[6:25] <mikedawson> mega_au: yeah, I've seen just about everything that can go wrong with monitors so far
[6:26] <mega_au> caps are OK.
[6:27] <mikedawson> ok, thanks
[6:27] <mega_au> Same here. Currently my issue is that I cannot add an empty OSD - the mons crash when I try to start an OSD which was down.
[6:28] <mega_au> I was going through the code to understand how it works, and I believe the new mon code is better than the old one. Bugs are inevitable but they will be ironed out soon, and by the next stable release it should be good.
[6:30] <mega_au> What sort of problem do you have with mkcephfs?
[6:31] <mikedawson> http://tracker.ceph.com/issues/4756 which causes http://tracker.ceph.com/issues/4752
[6:32] <mikedawson> and it may contribute to http://tracker.ceph.com/issues/4784
[6:36] <mega_au> Did not use ceph-create-keys - so I did not hit 4752. But I have the same stuff in the logs as in 4784. I was wondering why.
[6:39] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[6:40] * rustam (~rustam@94.15.91.30) has joined #ceph
[6:42] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[6:53] * rustam (~rustam@94.15.91.30) has joined #ceph
[6:55] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[7:01] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 20.0.1/20130409194949])
[7:17] * eschnou (~eschnou@131.165-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:21] * norbi (~nonline@buerogw01.ispgateway.de) has joined #ceph
[7:24] * norbi (~nonline@buerogw01.ispgateway.de) Quit (Read error: Connection reset by peer)
[7:34] * eschnou (~eschnou@131.165-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[8:02] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:03] * tnt (~tnt@91.176.19.114) has joined #ceph
[8:18] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: Not that there is anything wrong with that)
[8:22] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) Quit (Ping timeout: 480 seconds)
[8:33] * rustam (~rustam@94.15.91.30) has joined #ceph
[8:35] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[8:50] * Romeo_ (~romeo@198.144.195.85) has joined #ceph
[8:53] * Romeo (~romeo@198.144.195.85) Quit (Ping timeout: 480 seconds)
[8:59] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[9:00] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:03] * madkiss (~madkiss@2001:6f8:12c3:f00f:75ae:96dd:448f:746b) has joined #ceph
[9:06] * niklas (~niklas@2001:7c0:409:8001::32:115) has joined #ceph
[9:16] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[9:16] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:16] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:22] * leseb (~Adium@83.167.43.235) has joined #ceph
[9:23] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) has joined #ceph
[9:24] * loicd1 (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:24] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[9:28] * tnt (~tnt@91.176.19.114) Quit (Ping timeout: 480 seconds)
[9:28] * loicd1 (~loic@3.46-14-84.ripe.coltfrance.com) Quit ()
[9:28] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:34] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Read error: Connection reset by peer)
[9:34] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:36] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:42] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[9:50] * l0nk (~alex@83.167.43.235) has joined #ceph
[9:51] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) Quit (Ping timeout: 480 seconds)
[9:52] * esammy (~esamuels@host-2-102-68-79.as13285.net) has joined #ceph
[9:55] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[10:10] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Quit: Leaving.)
[10:12] * LeaChim (~LeaChim@176.250.220.3) has joined #ceph
[10:12] * ScOut3R (~ScOut3R@212.96.47.215) has joined #ceph
[10:15] * rahmu (~rahmu@83.167.43.235) has joined #ceph
[10:18] * mega_au_ (~chatzilla@94.137.213.1) has joined #ceph
[10:20] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:20] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:21] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:21] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:22] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:22] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:22] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:23] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:23] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:23] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:24] * mega_au (~chatzilla@94.137.213.1) Quit (Ping timeout: 480 seconds)
[10:24] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:24] * mega_au_ is now known as mega_au
[10:24] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:24] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:25] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:25] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:25] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:26] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:26] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:26] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:26] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:27] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:27] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:27] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:28] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:28] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:29] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:29] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:29] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[10:30] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[10:36] * hatman5468 (d4af59a2@ircip3.mibbit.com) has joined #ceph
[10:37] * hatman5468 (d4af59a2@ircip3.mibbit.com) Quit (autokilled: spambot. Dont mail support@oftc.net with questions. (2013-04-23 08:37:13))
[10:47] * rustam (~rustam@94.15.91.30) has joined #ceph
[10:49] * vo1d (~v0@194-118-211-45.adsl.highway.telekom.at) has joined #ceph
[10:49] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[10:56] * v0id (~v0@62-46-175-181.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[11:01] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Ping timeout: 480 seconds)
[11:10] * rustam (~rustam@94.15.91.30) has joined #ceph
[11:11] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[11:21] * joelio (~Joel@88.198.107.214) Quit (Ping timeout: 480 seconds)
[11:23] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[11:23] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[11:24] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[11:24] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[11:25] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[11:25] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[11:25] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[11:26] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[11:26] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[11:27] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[11:30] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[11:30] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[11:31] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[11:31] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[11:32] * dxd828 (~dxd828@195.191.107.205) has joined #ceph
[11:32] <leseb> hi guys
[11:54] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) has joined #ceph
[12:15] * calebamiles (~caleb@c-50-138-218-203.hsd1.vt.comcast.net) Quit (Quit: Leaving.)
[12:15] * calebamiles (~caleb@c-50-138-218-203.hsd1.vt.comcast.net) has joined #ceph
[12:18] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[12:18] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:22] * diegows (~diegows@190.190.2.126) has joined #ceph
[12:27] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[12:29] * infernix (nix@5ED33947.cm-7-4a.dynamic.ziggo.nl) has joined #ceph
[12:48] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[12:49] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[12:49] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) has joined #ceph
[12:50] * mcclurmc_laptop (~mcclurmc@client-7-201.eduroam.oxuni.org.uk) Quit (Max SendQ exceeded)
[13:03] <tnt> Mmm, when doing a dd to check perf, I would have expected to get sequential requests of the blocksize at the block layer ... but it seems not.
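For reference, the kind of dd run tnt is describing; the device path, block size and count are illustrative, and oflag=direct keeps the page cache out of the way (note this overwrites the device):

    dd if=/dev/zero of=/dev/rbd0 bs=4M count=256 oflag=direct
    # one way to watch the request sizes actually hitting the block layer:
    blktrace -d /dev/rbd0 -o - | blkparse -i -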
[13:08] * andreask (~andreas@212.101.205.2) has joined #ceph
[13:09] * andreask (~andreas@212.101.205.2) has left #ceph
[13:16] * portante|ltp (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[13:20] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[13:32] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[13:32] <rtek> 172.16.172.9:6823/2732 >> 172.16.172.9:6800/25048 pipe(0x76b6500 sd=41 :35519 s=2 pgs=969 cs=1 l=0).fault with nothing to send, going to standby
[13:32] <rtek> heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f264ecc9700' had timed out after 60
[13:32] <rtek> what does this, generally speaking, indicate?
[13:33] * samppah (hemuli@namibia.aviation.fi) has left #ceph
[13:34] <rtek> some OSDs seem unable to communicate with each other, even on the same host
[13:34] <rtek> which isn't particularly loaded or something
[13:46] * esammy (~esamuels@host-2-102-68-79.as13285.net) has left #ceph
[13:52] <benner> How do I see the new rbd block device size on the client side after an rbd resize operation?
[13:52] <mrjack_> is "filestore fiemap" a per osd config option?
[13:52] <leseb> benner: fdisk -l /dev/rbd?
[13:53] <benner> leseb: yes.
[13:54] <benner> i just found that unmounting did the trick... is there a way to do this online?
[13:55] <leseb> only if you use btrfs; otherwise neither xfs nor ext4 can do it… :/
[13:58] <darkfaded> no blkdev --rereadpt or similar?
[13:59] <benner> leseb: do you mean btrfs on the osd?
[14:01] <leseb> darkfaded: unfortunately not.. I tried almost everything
[14:01] <leseb> benner: nop, btrfs for the rbd fs
[14:02] <darkfaded> leseb: ok thanks for clarifiying
[14:02] <leseb> darkfaded: np :)
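A sketch of the resize-and-check sequence being discussed, with a made-up pool/image name and assuming the image is kernel-mapped at /dev/rbd0. Whether the mounted filesystem can then grow online is the separate question leseb answers above:

    rbd resize --size 20480 rbd/myimage    # new size in MB
    blockdev --getsize64 /dev/rbd0         # size of the block device in bytes
    fdisk -l /dev/rbd0                     # what leseb suggested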
[14:03] <jerker> what file system on the OSDs currently gives the best performance?
[14:04] <wogri_risc> jerker: as far as I've read on the mailinglist the answer is: it depends.
[14:04] <wogri_risc> if you have either a lot of small reads / writes or write in big blocks
[14:04] <wogri_risc> also I've read that ext4 degrades in performance over time, which is somewhat important to know
[14:04] <leseb> jerker: probably btrfs but it's not stable, thus xfs seems to be the better choice for now
[14:05] <wogri_risc> I agree with leseb.
[14:05] * humbolt (~elias@194-024-138-227.nat.orange.at) has joined #ceph
[14:05] * Yen (~Yen@ip-81-11-244-122.dsl.scarlet.be) Quit (Ping timeout: 480 seconds)
[14:07] * rahmu (~rahmu@83.167.43.235) Quit (Remote host closed the connection)
[14:08] <jerker> wogri_risc: why would ext4 degrade in performance?
[14:08] * Yen (~Yen@ip-83-134-116-177.dsl.scarlet.be) has joined #ceph
[14:08] <wogri_risc> I don't know but I've read it on the mailinglist.
[14:08] <wogri_risc> maybe I can find the post
[14:10] <jerker> wogri_risc: i found this regarding degraded btrfs performance https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1016435
[14:12] <wogri_risc> hm. don't nail me on that one, I thought I had read it on the mailing list, but I can't find the post.
[14:13] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[14:13] * humbolt (~elias@194-024-138-227.nat.orange.at) Quit (Quit: humbolt)
[14:13] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[14:16] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:31] * BillK (~BillK@58-7-139-175.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[14:38] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[14:38] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[14:40] * BillK (~BillK@124-169-105-67.dyn.iinet.net.au) has joined #ceph
[14:44] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[14:50] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:02] * joelio (~Joel@88.198.107.214) has joined #ceph
[15:02] * lofejndif (~lsqavnbok@199.48.147.36) has joined #ceph
[15:04] <benner> leseb: so you are saying that this is correct behavior: http://p.defau.lt/?Y4AAK2GqC8vSqmspg2yldQ ?
[15:05] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) Quit (Remote host closed the connection)
[15:07] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:07] <leseb> benner: yes it is
[15:08] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[15:08] <leseb> benner: do this same test with btrfs, then re-read the block device with blockdev while the fs is mounted, and you will notice the change
[15:09] * gmason (~gmason@hpcc-fw.net.msu.edu) has joined #ceph
[15:10] <Kdecherf> gregaf: are you here?
[15:11] <elder> Kdecherf, try again in a few hours.
[15:11] <elder> It is about 6am in his part of the world.
[15:15] <benner> leseb: why does it behave like that? for example i can see the resized LV after lvresize (LVM)
[15:15] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[15:16] <leseb> lvm uses the device mapper which is quite different; I assume that somehow with a raw device, filesystems other than btrfs block the kernel calls that retrieve the block size
[15:16] <leseb> benner: ^
[15:18] <benner> ok, and what if i re-export this rbd0 over iscsi? will the behaviour be the same (can't do online resize)?
[15:18] <jmlowe> benner: you need some mechanism to reload the disk metadata; lvm has it, rbd doesn't have hooks to take action on resize
[15:18] * capri_on (~capri@pd95c3284.dip0.t-ipconnect.de) Quit (Quit: Verlassend)
[15:19] <jmlowe> benner: I think there is an iscsi adaptor written that will cut out the kernel client
[15:19] <Kdecherf> elder: oh, ok thx
[15:19] <jmlowe> http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
[15:20] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[15:20] <jmlowe> I think it needs some work, but that's probably your path forward
[15:21] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: REALITY.SYS Corrupted: Re-boot universe? (Y/N/Q))
[15:25] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Quit: Ex-Chat)
[15:27] <benner> yes, i saw the rbd module for stgt, but my question is more about online rbd resize. if rbd0 is the disk for a vm, can i resize the rbd online, or do i have to shut down the vm and then resize the rbd?
[15:27] <thelan> Hello
[15:29] <jmlowe> benner: for a vm, my guess is that if you use kvm with virtio-scsi you can rescan the scsi bus and pickup the changed size
[15:29] <thelan> I've seen ticket #3454. Is there a way to retrieve a file via SWIFT with a temporary key or something like that?
[15:33] * virsibl (~virsibl@94.231.117.244) has joined #ceph
[15:34] * virsibl (~virsibl@94.231.117.244) has left #ceph
[15:34] <tnt> Ah, finally found why RBD performance in Xen is really bad ...
[15:34] <tnt> seems even large sequential requests by the VM are split into tiny (44k) requests ...
[15:37] <darkfaded> tnt: are they always 44k?
[15:37] <darkfaded> and could you by chance report that to xen-devel?
[15:38] <darkfaded> i had a "beer discussion" with i think ian jackson 2 years ago and he was saying that this was definitely not "intentional"
[15:38] <tnt> They're 44k or smaller. (basically whatever size IO is done by userspace is split into requests so that they are smaller than PAGE_SIZE * 11)
[15:38] <darkfaded> i know why it happens but i'd rather have them think on their own
[15:38] * mnash (~chatzilla@vpn.expressionanalysis.com) Quit (Remote host closed the connection)
[15:39] <tnt> I just asked the question on the ML. I know they set max_segments to 11 and max_segment_size to PAGE_SIZE ... which is why the requests are split.
[15:39] <tnt> But not sure why they set those limits.
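That limit is consistent with the observed request size: with the usual 4 KiB page size, 11 segments * 4 KiB = 44 KiB, which matches the ~44k requests tnt sees.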
[15:41] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[15:41] <jmlowe> benner: this will definitely pick up a disk size change in a vm if it's presented as scsi 'echo 1 > /sys/block/sda/device/rescan'
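Spelled out, the in-guest step jmlowe describes, assuming the resized rbd is presented to the guest as /dev/sda via virtio-scsi:

    echo 1 > /sys/block/sda/device/rescan    # ask the SCSI layer to re-read the device size
    dmesg | tail                             # the kernel normally logs the detected capacity change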
[15:42] <tnt> The only thing I see I could do is to merge the requests myself before submitting them ...
[15:42] <tnt> but that means when I get a request, I must wait a bit before actually sending it out.
[15:43] <darkfaded> tnt: the fb-flashcache thing showed the same issues
[15:43] <darkfaded> i would've thought it's the blkback ringbuffer
[15:44] <darkfaded> well, ring-thing. i don't know anything about stuff like that if it goes into details
[15:44] <tnt> darkfaded: yes, those limits are related to the ring_buffer. The '11' is because that's how many segments they can put in one page.
[15:44] <darkfaded> ah :)
[15:45] <tnt> comment says: /* Ensure a merged request will fit in a single I/O ring slot. */
[15:45] <darkfaded> i'd love to see one of their devs really start to think about the implications. it wasn't a big deal in 2005 ;)
[15:46] <benner> jmlowe: thanks
[15:46] <darkfaded> tnt: can you give me an archive link to your mail?
[15:47] <tnt> darkfaded: http://lists.xen.org/archives/html/xen-devel/2013-04/msg02296.html
[15:48] <jmlowe> benner: what specifically is your setup, I'm running 78 kvm based vm's against 18 osd's split between two data centers
[15:50] <benner> jmlowe: my setup is pre-alpha (3 osds on the first server). i'm just learning now but have 6x servers with 12x 2TB SATA each for my ceph lab.
[15:50] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[15:51] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[15:53] <jmlowe> benner: primarily for vm hosting?
[15:56] <benner> jmlowe: primarily yes, but not limited to that. as i said, for now i'm just doing research to get hands-on experience. on paper everything always looks better than in reality :)
[15:58] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[15:58] <benner> as example, in time of talking here i meet other problem: http://p.defau.lt/?Ul9Y1jOB0Mmb3GDzxTz6MQ
[15:59] <benner> bonnie++ on rbd
[16:00] <jmlowe> that's odd, is that inside a vm and which filesystem was on the rbd device?
[16:00] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[16:00] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[16:00] <tnt> benner: are you sure there is enough free space on the rbd drive?
[16:01] <mikedawson> joao: have you grabbed the log files yet? I want to rm them after you get them
[16:01] <benner> jmlowe: it's on a physical node (separate from the ceph nodes) and the fs is ext3
[16:01] <jmlowe> benner: dmesg give any hints as to what went wrong?
[16:01] * john_barbee_ (~jbarbee@17192e61.test.dnsbl.oftc.net) has joined #ceph
[16:02] <benner> nope, but i'll investigate what tnt said
[16:07] <jmlowe> benner: I've created a production service hosting vm's for xsede.org using rbd devices as the storage and libvirt/qemu for virtualization, I'd be happy to help you out in any way that I can
[16:07] * john_barbee_ (~jbarbee@17192e61.test.dnsbl.oftc.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 21.0/20130408165307])
[16:07] <janos> jmlowe: cool
[16:08] <benner> jmlowe: yes, cool, thanks
[16:08] * gmason_ (~gmason@hpcc-fw.net.msu.edu) has joined #ceph
[16:09] * portante|ltp (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Ping timeout: 480 seconds)
[16:09] <barryo> If SSDs for journals were out of the question, what would be the next best option? a decent controller cache, multiple journals shared across a few nearline SAS disks, or journals on the OSDs?
[16:10] * gmason (~gmason@hpcc-fw.net.msu.edu) Quit (Ping timeout: 480 seconds)
[16:10] * gmason_ is now known as gmason
[16:10] <fghaas> barryo: next best would be one journal per sas disk, but that's usually not an option if you have many journals
[16:10] <benner> tnt: you were right about space.
[16:11] <barryo> yeah, one journal per disk might be a bit too costly for us
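For context, journal placement is just a per-OSD setting in ceph.conf; a rough sketch of pointing one OSD's journal at a directory on a shared SAS journal disk (path and size are made up):

    [osd.2]
        osd journal = /srv/journal/osd.2/journal    ; file on the shared journal disk
        osd journal size = 1024                     ; journal size in MB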
[16:12] * aliguori (~anthony@20616e33.test.dnsbl.oftc.net) has joined #ceph
[16:13] <tnt> benner: :) I just ran a bunch of bonnie++ bench yesterday and was hit by the same thing.
[16:14] <joao> mikedawson, yep
[16:16] <joao> Karcaw, around?
[16:17] <mikedawson> joao: I created this 0.60 deployment with mkcephfs and identified http://tracker.ceph.com/issues/4756. Then Gary told me what caps should be in my mon. keyring; after manually adding those caps and restarting the mons, I've had all three in quorum for the past 10 hours
[16:18] <joao> ah, caps issues then
[16:19] <joao> well, mon issues, but wrt caps
[16:19] <joao> not as bad as obscure bugs introduced/flushed during the rework
[16:19] <mikedawson> joao: checked with mega_au and he confirmed he's also seeing http://tracker.ceph.com/issues/4784 and he believes he didn't have the caps issue
[16:20] <joao> oh
[16:20] <joao> I was really hoping that one was a side effect of the caps issue
[16:21] <joao> not sure why it should though; call it wishful thinking
[16:21] <mikedawson> joao: me too, I keep waiting for it to lose quorum now that I have better caps, so far it's been solid
[16:21] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[16:22] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[16:31] * Yen (~Yen@ip-83-134-116-177.dsl.scarlet.be) has left #ceph
[16:37] * loicd (~loic@185.10.252.15) has joined #ceph
[16:42] * vata (~vata@2607:fad8:4:6:a050:a385:75f0:262a) has joined #ceph
[16:51] * wschulze (~wschulze@38.98.115.249) has joined #ceph
[17:02] * jskinner (~jskinner@69.170.148.179) has joined #ceph
[17:02] * timmclaughlin (~timmclaug@69.170.148.179) has joined #ceph
[17:08] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) Quit (Remote host closed the connection)
[17:12] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[17:12] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[17:16] * mtk (~mtk@44c35983.test.dnsbl.oftc.net) has joined #ceph
[17:16] * mtk (~mtk@44c35983.test.dnsbl.oftc.net) Quit (Remote host closed the connection)
[17:17] * mtk (~mtk@44c35983.test.dnsbl.oftc.net) has joined #ceph
[17:17] * mtk (~mtk@44c35983.test.dnsbl.oftc.net) Quit (Remote host closed the connection)
[17:21] * Cube (~Cube@4c5fd981.test.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[17:21] * mtk (~mtk@44c35983.test.dnsbl.oftc.net) has joined #ceph
[17:23] <matt_> Is it possible to bootstrap a monitor by copying a store from another monitor whilst they're offline? (0.60)
[17:24] <matt_> I need to move a monitor due to faulty hardware and I'm not able to add a new one normally due to bugs
[17:24] <joao> you could just move the store to the new location, inject a new monmap on all the monitors if you have indeed changed IPs, and start the monitor
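A hedged sketch of that sequence, with the mon id, paths and address as placeholders; the monmap only needs editing if the monitor's IP actually changed:

    # grab the current monmap from the cluster (or from a surviving mon)
    ceph mon getmap -o /tmp/monmap
    # if the IP changed, rewrite the entry for the moved monitor
    monmaptool --rm <id> /tmp/monmap
    monmaptool --add <id> <new-ip>:6789 /tmp/monmap
    # with the monitor stopped and its store copied to the new location:
    ceph-mon -i <id> --inject-monmap /tmp/monmap
    service ceph start mon.<id>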
[17:26] <matt_> so I have mon 1,2,3 and 3 is new. I can shutdown 1,2 and copy a store to 3, then start 2,3 to achieve quorum?
[17:27] <matt_> The mon map is up to date but lameness ensued - mon/Monitor.cc: In function 'void Monitor::sync_timeout(entity_inst_t&)' thread 7f09835bc700 time 2013-04-23 23:2$
[17:27] <matt_> mon/Monitor.cc: 1099: FAILED assert(0 == "Unable to find a new monitor to connect to. Not cool.")
[17:30] <joao> hmm
[17:31] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Quit: noahmehl)
[17:34] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[17:34] <matt_> I may have Mike Dawson's paxos issue also...
[17:37] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:37] <joao> err
[17:38] <joao> well, I'm not sure if you can in fact copy 'a' store to 3, but it's worth a try I guess
[17:39] <joao> I can't think of anything right now that would be a problem
[17:39] <joao> btw
[17:39] <joao> wrt to that sync_timeout stuff, you could try running that monitor with 'mon sync debug provider = <another monitor name>'
[17:40] <matt_> I'll give it a go now, thanks!
[17:40] <joao> that's not something you're supposed to use though
[17:41] * BManojlovic (~steki@91.195.39.5) Quit (Remote host closed the connection)
[17:41] <joao> matt_, are you positive you have injected a proper monmap?
[17:42] <joao> or that the monitor's monmap has more than one monitor in it?
[17:42] <matt_> Sorry, I didn't clarify that. It was injecting, I went down to two monitors by removing one first and then I've added a 3rd on a new host
[17:43] <matt_> wasn't injecting*
[17:43] <jmlowe> how long is it taking to convert the mons up to 0.60 anyway, are we talking seconds, minutes, hours?
[17:52] * wschulze (~wschulze@38.98.115.249) Quit (Quit: Leaving.)
[17:55] * eschnou (~eschnou@131.165-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:56] <mikedawson> matt_: about the paxos issue on 0.60, do you have caps listed in your mon keyrings (/var/lib/ceph/mon/ceph-a/keyring)? I have a theory I'm trying to validate
[17:57] <matt_> mikedawson, no I don't. It's just the key
[17:58] <mikedawson> matt_: did you deploy with mkcephfs on version 0.59 or 0.60?
[17:59] <matt_> mikedawson, nope. The original cluster was built on 0.56 I think
[18:00] <Karcaw> joao: i'm running the fix again, as you mentioned in bug 4521, and i get a ' osdmap ver 6300 does not exist' error
[18:01] <joao> hmm
[18:03] <joao> Karcaw, can you pastebin the result from 'ceph_test_store_tool store.db list osdmap' and from 'ls -1 mon/osdmap' and 'ls -1 mon/osdmap_full' ?
[18:03] <mikedawson> matt_: OK. I manually added 'caps mon = "allow *"' to all of my monitor keyrings, then restarted the mons. I've had luck since then
[18:03] <mikedawson> matt_: if you try that, please report back if it helps or not
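For reference, the keyring mikedawson is describing ends up looking roughly like this, with the key left exactly as mkcephfs generated it (shortened here) and only the caps line added by hand:

    [mon.]
        key = AQBs...==
        caps mon = "allow *"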
[18:03] <matt_> excellent, I'll give it a go and let you know
[18:04] <matt_> dammit, I'm rhyming again. Must be sleep deprived...
[18:04] <joao> Karcaw, also, the result from 'ceph_test_store_tool store.db get osdmap first_committed' and 'cat osdmap/first_committed'
[18:04] <mikedawson> matt_: for reference http://tracker.ceph.com/issues/4752 and http://tracker.ceph.com/issues/4756
[18:06] <mikedawson> matt_: if it really works, I suspect you can just add the third monitor following normal procedures (no need to try copying the store like you mention above)
[18:08] * BillK (~BillK@124-169-105-67.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:08] * ScOut3R (~ScOut3R@212.96.47.215) Quit (Ping timeout: 480 seconds)
[18:10] <Karcaw> joao: http://pastebin.com/Tj5q9pwC
[18:10] <joao> Karcaw, thanks :)
[18:11] * alram (~alram@267a14e2.test.dnsbl.oftc.net) has joined #ceph
[18:12] <joao> Karcaw, can I bother you for the output of the fix as well?
[18:14] * eschnou (~eschnou@50c9a583.test.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[18:16] <Karcaw> sure
[18:17] <Karcaw> http://pastebin.com/6RzYdfh7
[18:18] <Karcaw> adding --debug-none does not seem to change the output much
[18:19] <joao> Karcaw, add '--debug-none 20'
[18:19] <Karcaw> it dosent change the output
[18:20] <Karcaw> i'm running 'src/ceph_mon_kvstore_fix --i-am-sure --for-real --debug-none 20 /data/mon/ /data/mon/'
[18:20] <joao> that's weird
[18:20] <joao> it worked when I first added that to the test
[18:20] <joao> err, to the program
[18:21] * BillK (~BillK@58-7-48-223.dyn.iinet.net.au) has joined #ceph
[18:22] * timmclaughlin (~timmclaug@69.170.148.179) Quit (Remote host closed the connection)
[18:22] * jskinner (~jskinner@69.170.148.179) Quit (Remote host closed the connection)
[18:22] <joao> you're right, it's not working
[18:22] <Karcaw> 30 maybe?
[18:22] <joao> 40 didn't do the trick
[18:22] * l0nk (~alex@83.167.43.235) Quit (Quit: Leaving.)
[18:22] <joao> I'll look into that later
[18:22] <matt_> mikedawson, no luck adding the new monitor unfortunately
[18:22] <matt_> 2013-04-24 00:20:39.853167 7f397529c700 -1 mon/Monitor.cc: In function 'void Monitor::sync_timeout(entity_inst_t&)' thread 7f397529c700 time 2013-04-24 00:20:39.852535
[18:22] <matt_> mon/Monitor.cc: 1056: FAILED assert(sync_role == SYNC_ROLE_REQUESTER)
[18:23] <mikedawson> matt_: I've seen that one, too
[18:23] <mikedawson> matt_: did you get quorum between the first two mons?
[18:24] <matt_> mikedawson, Yep. I get that far and the new mon appears to be syncing based on its store.db increasing in size
[18:25] <mikedawson> joao: should I enter a bug for mon/Monitor.cc: 1056: FAILED assert(sync_role == SYNC_ROLE_REQUESTER)? Both matt_ and I have seen it. He saw it with the mon. caps in place
[18:25] <joao> yes please
[18:25] <joao> mark it Urgent
[18:26] <mikedawson> joao: will do
[18:29] * yehudasa (~yehudasa@2607:f298:a:607:953a:9b8d:c1db:2b84) Quit (Ping timeout: 480 seconds)
[18:35] <mikedawson> joao, matt_: http://tracker.ceph.com/issues/4793
[18:36] <mikedawson> joao: It got put under devops accidentally, and I don't have permissions to change it
[18:36] <joao> mikedawson, fixed, thanks
[18:37] * yehudasa (~yehudasa@2607:f298:a:607:e918:deb4:5e7:63ec) has joined #ceph
[18:38] <joao> mikedawson, any chance you can provide more of the log, at least up until some 20 lines prior to the first mention of:
[18:39] <joao> mon.a@0.*sync( leader state none ))
[18:39] <joao> ?
[18:40] * tnt (~tnt@91.176.19.114) has joined #ceph
[18:40] <mikedawson> joao: will look for it. matt_: do you have logs with high levels of logging?
[18:41] <matt_> mikedawson, the log I have was just on default. I can try again with higher for you
[18:41] <mikedawson> matt_: Great. I will tell you I haven't been able to repeat this one myself
[18:42] <matt_> What debug settings are best for this?
[18:42] <mikedawson> matt_: joao is the man to ask
[18:43] <joao> debug mon = 20, debug ms = 1
[18:44] <matt_> No problems, I'll attach it to the bug report once it crashes out again
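Those two settings are plain ceph.conf options; a sketch of the [mon] section form, restarting the monitor afterwards so they take effect:

    [mon]
        debug mon = 20
        debug ms = 1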
[18:51] * leseb (~Adium@83.167.43.235) Quit (Quit: Leaving.)
[18:51] * leseb (~Adium@83.167.43.235) has joined #ceph
[18:52] * leseb (~Adium@83.167.43.235) Quit ()
[18:53] <gregaf> Kdecherf: what's up?
[18:55] * diegows (~diegows@190.190.2.126) Quit (Ping timeout: 480 seconds)
[18:58] <joao> Karcaw, still around?
[18:58] <Karcaw> yep
[18:58] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[19:01] <joao> can you retry with the new patch I just pushed to wip-4521?
[19:01] <joao> Karcaw, ^
[19:06] <matt_> joao, When sync'ing, is it normal for the monitor store size to hover around 260MB when the other stores are 9GB and 3GB on the two other monitors?
[19:07] <joao> holy crap, you have a 9GB store?
[19:07] <Karcaw> yes
[19:08] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[19:08] <matt_> joao, 9.3GB...
[19:09] <joao> matt_, if a sync fails and the monitor is restarted, it will clear the whole store to avoid a corrupted state
[19:09] <joao> hence the 260MB probably
[19:09] <joao> I mean, the disparity in sizes
[19:10] <joao> I think the sync should be made more intelligent to work better for stores that big
[19:11] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Read error: Operation timed out)
[19:12] <matt_> joao, it appears to be clearing the store every few minutes. 2013-04-24 01:11:05.633366 7fb27f925700 10 mon.KVM10@2(synchronizing sync( requester state chunks )) e7 sync_requester_abort mon.0 172.16.0.3:6789/0 mon.1 172.16.0.17:6789/0 clearing potentially inconsistent store
[19:13] * BillK (~BillK@58-7-48-223.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[19:14] <joao> have to reboot the router; brb
[19:15] * loicd (~loic@b90afc0f.test.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[19:16] * gmason_ (~gmason@hpcc-fw.net.msu.edu) has joined #ceph
[19:17] * jluis (~JL@89.181.147.69) has joined #ceph
[19:18] * gmason (~gmason@23090c02.test.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[19:18] * gmason_ is now known as gmason
[19:22] * joao is now known as Guest3186
[19:22] * jluis is now known as joao
[19:23] * Guest3186 (~JL@89.181.154.215) Quit (Ping timeout: 480 seconds)
[19:25] * diegows (~diegows@190.190.2.126) has joined #ceph
[19:25] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[19:26] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[19:27] <joao> matt_, can you show us your 'ceph -s' ?
[19:27] * nhm (~nhm@65-128-150-185.mpls.qwest.net) has joined #ceph
[19:27] <matt_> health HEALTH_WARN mds Storage2 is laggy; 1 mons down, quorum 0,1 KVM08,Storage3
[19:27] <matt_> monmap e7: 3 mons at {KVM08=172.16.0.17:6789/0,KVM10=172.16.0.19:6789/0,Storage3=172.16.0.3:6789/0}, election epoch 544, quorum 0,1 KVM08,Storage3
[19:27] <matt_> osdmap e45461: 100 osds: 97 up, 97 in
[19:27] <matt_> pgmap v8224463: 5500 pgs: 5500 active+clean; 15583 GB data, 51320 GB used, 77101 GB / 127 TB avail; 39830KB/s wr, 61op/s
[19:27] <matt_> mdsmap e157521: 1/1/1 up {0=Storage2=up:replay(laggy or crashed)}
[19:29] * BillK (~BillK@124-169-44-130.dyn.iinet.net.au) has joined #ceph
[19:31] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[19:34] * dosaboy (~dosaboy@72.11.113.122) Quit (Quit: leaving)
[19:38] <joao> Karcaw, any joy?
[19:39] * timmclaughlin (~timmclaug@69.170.148.179) has joined #ceph
[19:39] <Karcaw> not sure, got distracted by a cooling failure in the datacenter
[19:40] <Karcaw> my compile is failing with:
[19:40] <Karcaw> ./mon/PaxosService.h:968: error: ‘bool PaxosService::exists_version(const std::string&, version_t)’ cannot be overloaded
[19:40] <Karcaw> ./mon/PaxosService.h:957: error: with ‘bool PaxosService::exists_version(const std::string&, version_t)’
[19:40] <Karcaw> the function is in the file twice
[19:44] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[19:44] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[19:47] <joao> eh
[19:47] <joao> oops
[19:47] <joao> forgot to stage a file for commit
[19:47] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[19:48] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[19:50] <joao> uh, nop; did forget to stage something for commit, but nothing of the sorts
[19:50] <Karcaw> new output: http://pastebin.com/Qg74bVwt
[19:51] <Karcaw> i fixed the file.. not sure how that happened..
[19:52] <joao> here's the thing: that version is indeed present on the store output you showed me earlier
[19:53] <joao> Karcaw, can you send me your store?
[19:53] <joao> I really need to crank up all kinds of unimaginable debug to figure out what the hell is happening
[19:54] <Karcaw> i'll attach the current one to the bug...
[19:54] <joao> ty
[19:54] <Karcaw> i'll be gone to lunch for a bit
[19:54] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:55] <joao> np, I'm heading out for a couple of hours myself in an hour or so
[20:00] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[20:00] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[20:09] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[20:12] * jskinner (~jskinner@69.170.148.179) has joined #ceph
[20:19] * lofejndif (~lsqavnbok@83TAAASAE.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[20:21] * BillK (~BillK@124-169-44-130.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[20:29] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:31] * Cube (~Cube@12.248.40.138) has joined #ceph
[20:38] * timmclau_ (~timmclaug@69.170.148.179) has joined #ceph
[20:38] * timmclaughlin (~timmclaug@69.170.148.179) Quit (Read error: Connection reset by peer)
[20:41] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[20:43] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) has joined #ceph
[20:44] * wer (~wer@206-248-239-142.unassigned.ntelos.net) Quit (Remote host closed the connection)
[20:44] * wer (~wer@206-248-239-142.unassigned.ntelos.net) has joined #ceph
[20:47] * LeaChim (~LeaChim@176.250.220.3) Quit (Ping timeout: 480 seconds)
[20:56] * LeaChim (~LeaChim@176.250.159.86) has joined #ceph
[21:12] * eschnou (~eschnou@131.165-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:22] * calebamiles (~caleb@c-50-138-218-203.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[21:22] * coyo (~unf@pool-71-164-242-68.dllstx.fios.verizon.net) has joined #ceph
[21:23] * b1tbkt (~Peekaboo@68-184-193-142.dhcp.stls.mo.charter.com) Quit (Remote host closed the connection)
[21:31] * vata (~vata@2607fad800040006a050a38575f0262a.test.dnsbl.oftc.net) Quit (Quit: Leaving.)
[21:38] <Kdecherf> gregaf: i'm back
[21:38] <Kdecherf> gregaf: our cluster is now on 0.60
[21:39] <gregaf> okay, my memory's not that good so I don't know what discussion we're having right now :)
[21:39] <Kdecherf> gregaf: but we still have this strange latency on some files
[21:40] <gregaf> ah
[21:40] <Kdecherf> :)
[21:41] <gregaf> as I said in our email discussion, when spot-checking, all the files with any latency looked to be shared cache and php files which were opened read-write by separate nodes
[21:44] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[21:44] <Kdecherf> it's possible that separate nodes access the same files, but that is not the only case where we observe latency
[21:44] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[21:44] <Kdecherf> Even with one node, we observe latency in some clients
[21:45] <gregaf> how do you have multiple clients with only one node?
[21:47] <Kdecherf> clients mount a subfolder of the cluster; each is alone (on its own subfolder) by default
[21:48] * Havre (~Havre@2a01:e35:8a2c:b230:e0ba:7dd8:8fa9:7050) Quit (Remote host closed the connection)
[21:49] <gregaf> "by default"
[21:49] <gregaf> the reason I ask is that when I last looked at this, it was pretty clear that everything slow was being accessed from multiple nodes
[21:50] <gregaf> that was the only reason I saw for anything which had any latency associated with it
[21:50] <gregaf> so I think you need to check what your clients have access to, and are accessing
[21:57] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[21:59] <gregaf> mikedawson: I'm a bit out of the loop you and joao have going, but I'm looking at http://tracker.ceph.com/issues/4784 now
[21:59] <gregaf> do you have more logs available somewhere?
[22:05] * danieagle (~Daniel@177.133.173.20) has joined #ceph
[22:08] <Kdecherf> gregaf: hm ok, I will check some laggy instances
[22:09] <Kdecherf> does the latency concern only the files accessed from multiple nodes, or can it concern the entire folder?
[22:09] <Kdecherf> (theoretically)
[22:17] <mikedawson> gregaf: I'll cull some more complete logs. joao wanted logs before the first occurrence of mon.a@0.*sync( leader state none )). Sound good?
[22:23] <gregaf> Kdecherf: I believe it was only files that I saw, but folders can cause it too
[22:23] <gregaf> mikedawson: I don't think sync is involved on this one? (at least that you included in the bug report)
[22:24] <gregaf> the part where it first transitions to two of them both reporting as leader is the interesting one
[22:25] <mikedawson> gregaf: log on the first mon is 1.2gb. Do you want the whole thing?
[22:25] <gregaf> yeah, that's fine (it should compress nicely)
[22:26] <mikedawson> give me a few. Upload it to you or give you a link?
[22:26] <gregaf> either way
[22:26] <mikedawson> where do I put it / what protocol?
[22:27] <gregaf> if it compresses down to less than…73.4MB (wherever that number came from) you can put it on the tracker
[22:27] <gregaf> as an attachment
[22:27] <gregaf> otherwise cephdrop@ceph.com; I assume you've used it before?
[22:27] <mikedawson> nope
[22:28] <mikedawson> too big for the tracker
[22:29] <gregaf> okay
[22:30] <Kdecherf> gregaf: as for me, the second case is more likely to occur in our configuration than the first
[22:30] <Kdecherf> (thx)
[22:33] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:46] * Havre (~Havre@2a01:e35:8a2c:b230:e8a8:e15:1197:808c) has joined #ceph
[22:51] * eschnou (~eschnou@131.165-201-80.adsl-dyn.isp.belgacom.be) Quit (Quit: Leaving)
[22:54] <PerlStalker> I'm looking at the latency stats from 'ceph --admin-daemon ceph-osd.0.asok perf dump'. What are the units on the sum?
[22:55] <gregaf> seconds; iirc that's the cumulative total and you'll need to divide by count
[22:56] <PerlStalker> gregaf: Sure but what is it a cumulative of?
[22:56] <PerlStalker> Is it seconds, bytes, camels?
[22:56] <dmick> camels, for sure
[22:56] <gregaf> latency is going to be in seconds, and I believe it's time elapsed between when the OSD sees the message and when it sends the reply
[22:57] <PerlStalker> Shiny
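A sketch of reading one of those figures, assuming the default admin socket path; each latency counter in the dump is an avgcount/sum pair, and the numbers here are purely illustrative:

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
    # a latency entry looks like {"avgcount": 1234, "sum": 5.678};
    # average latency in seconds = sum / avgcount  (~0.0046 s, i.e. ~4.6 ms here)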
[23:07] * timmclaughlin (~timmclaug@69.170.148.179) has joined #ceph
[23:07] * timmclau_ (~timmclaug@69.170.148.179) Quit (Read error: Connection reset by peer)
[23:07] <PerlStalker> I'm having too much fun graphing ceph performance stats in graphite.
[23:07] <dmick> graphs are the very definition of shiny
[23:08] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[23:09] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[23:15] * timmclaughlin (~timmclaug@69.170.148.179) Quit (Remote host closed the connection)
[23:28] * tnt (~tnt@91.176.19.114) Quit (Ping timeout: 480 seconds)
[23:31] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[23:38] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:39] * terje- (~terje@184-96-148-241.hlrn.qwest.net) has joined #ceph
[23:39] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[23:39] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[23:40] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:43] * tnt (~tnt@91.176.19.114) has joined #ceph
[23:52] <Kdecherf> gregaf: well, I made another test with only one node on the same folder and the latency remains
[23:53] <gregaf> if you can grab debug logs from the client and MDS while observing the latency I might be able to take a look
[23:57] * jskinner (~jskinner@69.170.148.179) Quit (Remote host closed the connection)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.