#ceph IRC Log


IRC Log for 2013-06-28

Timestamps are in GMT/BST.

[15:31] -coulomb.oftc.net- *** Looking up your hostname...
[15:31] -coulomb.oftc.net- *** Checking Ident
[15:31] -coulomb.oftc.net- *** No Ident response
[15:31] -coulomb.oftc.net- *** Found your hostname
[15:31] * CephLogBot (~PircBot@rockbox.widodh.nl) has joined #ceph
[15:31] * Topic is 'Latest stable (v0.61.4 "Cuttlefish") -- http://ceph.com/get || http://wiki.ceph.com Live! || "Geek on Duty" program -- http://goo.gl/f02Dt'
[15:31] * Set by joao!~JL@89.181.151.112 on Tue Jun 25 12:51:26 CEST 2013
[15:31] <wido> done :)
[15:31] <ccourtaut> wido: thanks!
[15:31] <loicd> wow
[15:31] <loicd> that was quick :-)
[15:33] <fridudad> sage sagewk do you have a minute to talk about http://tracker.ceph.com/issues/5401?
[15:34] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[15:34] * redeemed (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[15:35] * gregaf1 (~Adium@cpe-76-174-249-52.socal.res.rr.com) Quit (Quit: Leaving.)
[15:40] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[15:41] * hybrid5121 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[15:41] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[15:42] * sagelap (~sage@2600:1001:b121:63c2:f59c:45f0:21e4:a1b2) Quit (Ping timeout: 480 seconds)
[15:48] * sagelap (~sage@162.sub-70-208-79.myvzw.com) has joined #ceph
[15:54] * drokita (~drokita@199.255.228.128) has joined #ceph
[16:06] * redeemed_ (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[16:10] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[16:11] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[16:11] * ChanServ sets mode +v andreask
[16:13] * sagelap (~sage@162.sub-70-208-79.myvzw.com) Quit (Read error: Operation timed out)
[16:13] * redeemed (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[16:13] * iggy2 is now known as iggy
[16:14] * fridudad (~oftc-webi@fw-office.allied-internet.ag) Quit (Remote host closed the connection)
[16:20] * redeemed_ (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Read error: Operation timed out)
[16:21] * redeemed (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[16:24] * redeemed_ (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[16:29] * BMDan (~BMDan@74.121.199.170) has joined #ceph
[16:29] * tkensiski1 (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has joined #ceph
[16:30] <BMDan> Sooooo... who wants to help me overcome LevelDB sucking up all my disk space? I tried http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-June/002000.html but was unable to fix the issue.
[16:30] <BMDan> I pay in delicious virtual {root,} beer!
[16:31] <saaby> what ceph version are you on?
[16:31] * redeemed (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[16:32] <BMDan> Actually, that was one of my questions. The monitor that is unable to come back up is 0.61.4-1. The other two, which I'm afraid to restart, are on that same version, but I'm unsure if they're *running* that version, or if that's merely what is installed.
[16:32] <BMDan> Is there a way to see the version of a running ceph binary?
[16:32] <BMDan> mon_stat does not show it.
[16:34] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[16:36] <saaby> ok
[16:36] <saaby> this should tell you if the installed version is the running version:
[16:36] <saaby> dpkg-query -Wf \${Version}\\n ceph; for i in $(pgrep -x ceph-mon); do readlink /proc/$i/exe; done | grep deleted
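(The same check expanded and commented, as a sketch; assumes Debian/Ubuntu packaging:)

    dpkg-query -Wf '${Version}\n' ceph   # version currently installed on disk
    for pid in $(pgrep -x ceph-mon); do
        # a "(deleted)" suffix means the on-disk binary was replaced
        # after this daemon started, i.e. it is running an older build
        readlink /proc/$pid/exe
    done | grep deleted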
[16:37] <BMDan> Yeah, did something similar just now; it is definitely not the running version.
[16:37] <saaby> if it returns "deleted" it's not the same
[16:37] <BMDan> Finding out its version as best I can, one sec...
[16:37] <saaby> ok
[16:37] * markbby1 (~Adium@168.94.245.2) has joined #ceph
[16:38] <saaby> ceph --admin-daemon /var/run/ceph/ceph-mon.<monitorname>.asok version
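(For example, with an illustrative monitor name; the admin socket replies with a small JSON blob, roughly:)

    ceph --admin-daemon /var/run/ceph/ceph-mon.1.asok version
    {"version":"0.61.2"}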
[16:39] <BMDan> Much simpler than my method, but produced the same results: 0.61.2
[16:39] <saaby> ok
[16:39] <saaby> I would:
[16:39] <BMDan> (My method was to look at the process start date and work backwards from apt logs.)
[16:39] <saaby> upgrade to 0.61.4
[16:40] <saaby> put: "mon compact on start = true" in ceph.conf
[16:40] <saaby> hope for the best
[16:40] <saaby> and restart the mon
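(A minimal sketch of that ceph.conf change; placing it under [mon] is an assumption, [global] would also work:)

    [mon]
        mon compact on start = true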
[16:40] <BMDan> Okay, so I've effectively done that on one of my mons.
[16:40] <saaby> .4 has better leveldb storage management
[16:40] <saaby> ok?
[16:41] <BMDan> I have three mons. On mon.1, I set that per http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-June/002000.html and restarted. It didn't shrink. I then issued an explicit "compact" via "ceph mon tell". That resulted in it getting down to 5% disk space and killing itself.
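(The explicit compaction BMDan describes, as a sketch in the cuttlefish-era "ceph mon tell" syntax, with an illustrative mon id:)

    ceph mon tell 1 compact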
[16:42] <BMDan> I couldn't find a way out of that box, so I decided to simply let it reinitialize from another mon; I deleted the contents of store.db. It never completed initialization and dropped to 5% and killed itself again.
[16:42] <BMDan> So now I have two working 0.61.2 mons and a dead 0.61.4 mon that I'm unclear how to recover.
[16:42] <saaby> right ok
[16:42] * markbby (~Adium@168.94.245.3) Quit (Remote host closed the connection)
[16:42] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[16:42] <BMDan> I can remove mons from the monmap and run on just mon3 while I do something to mon2, but I'm hesitant to do that without knowing that I won't make the situation worse.
[16:42] <saaby> yeah, you should probably focus on getting that dead one up first then
[16:43] <BMDan> I'd rather fix mon1, yes, exactly. :)
[16:43] <saaby> and you tried starting from a blank/recreated store.db on 0.61.4 on that?
[16:43] <BMDan> Yeah, lemme grab you the message, one sec...
[16:44] <BMDan> !pastebin
[16:44] <BMDan> Hmm, not a standard pastebin, eh?
[16:44] * markbby (~Adium@168.94.245.3) has joined #ceph
[16:45] <BMDan> http://pastebin.com/T1vpDhmf
[16:46] <saaby> "reached critical levels of available space on data store -- shutdown!" <- thats probably why?
[16:47] <BMDan> Yes, but that's because it appears to be grabbing every transaction since ever.
[16:47] <BMDan> # du -sh /opt/data/mon.1/store.db/
[16:47] <BMDan> 221G /opt/data/mon.1/store.db/
[16:48] <saaby> ok, so you start from a fresh store.db, start on 0.61.4, and the store grows to 220+ GB?
[16:48] <BMDan> Yep.
[16:48] <saaby> hm
[16:49] <saaby> have you tried: "mon compact on trim = true" ?
[16:49] <BMDan> I have compact on start turned off on this one, but that doesn't seem relevant given that there is no data when it starts.
[16:49] * markbby1 (~Adium@168.94.245.2) Quit (Remote host closed the connection)
[16:49] <BMDan> No, I have not.
[16:49] <BMDan> Shall I?
[16:49] <saaby> not really sure it has an effect on that scenario though
[16:49] <BMDan> I do have mon debug dump transactions = false
[16:50] <BMDan> But I think that was a misset default variable in 0.61.2, not 0.61.4.
[16:50] <saaby> yes
[16:50] <BMDan> (If I read the patch notes correctly.)
[16:50] <BMDan> I also injected that at runtime into the other mons so that they will hopefully not grow quite as fast until I fix this.
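(A sketch of that runtime injection, again in the cuttlefish-era syntax; the mon ids are illustrative and the flag is the dashed form of the config option:)

    ceph mon tell 2 injectargs '--mon-debug-dump-transactions=false'
    ceph mon tell 3 injectargs '--mon-debug-dump-transactions=false'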
[16:50] <saaby> right
[16:51] <BMDan> Anyway, I can certainly throw "mon compact on trim = true" at it, delete the store.db, and let it reinit.
[16:51] <BMDan> Is there a reasonable max size it should hit, so I'd quickly know whether it's growing without bound?
[16:51] <saaby> it *probably* wont hurt anything..
[16:52] <saaby> I think most mon stores are in the low GBs, 1-10.
[16:52] <BMDan> Okelie dokelie.
[16:52] * mschiff (~mschiff@tmo-104-255.customers.d1-online.com) has joined #ceph
[16:53] <BMDan> Nothing like running "rm -f" on production data, eh? ;)
[16:53] <saaby> hehe.. no.
[16:53] <BMDan> Okay, starting 'er up.
[16:53] * mschiff (~mschiff@tmo-104-255.customers.d1-online.com) Quit (Remote host closed the connection)
[16:53] * lyncos (~chatzilla@208.71.184.41) Quit (Remote host closed the connection)
[16:53] <saaby> how are you starting over the mon store?
[16:54] <BMDan> rm -f /opt/data/mon.1/store.db/*
[16:54] <saaby> ceph-mon -i <mon_id> --debug-mon 20 --mon-data /var/lib/ceph/mon/<mon_id>/ -d --monmap <monmap_file> --mkfs --keyring <keyring_file>
[16:54] <saaby> is how I have rebuilt the mon stores
[16:55] <BMDan> Do you rm -rf /opt/data/mon.1 first, then?
[16:55] <saaby> yes
[16:55] <saaby> eh
[16:55] <BMDan> Well, saving the keyring first.
[16:55] <saaby> if /opt/data/mon.1 is your mon store, yes.
[16:55] <BMDan> Yeah, it is.
[16:55] <saaby> yes save the keyring or get it from one of the other mons
[16:55] <BMDan> Right, makes sense.
[16:55] <saaby> and get the monmap with
[16:56] <saaby> ceph mon getmap -o monmap
[16:56] <BMDan> Do you think there will be a substantive difference? That is, should I cancel this start attempt and do it your way?
[16:56] <saaby> I don't know actually.. never tried what you do.
[16:57] <BMDan> It has burned 10 GB already, so I'm quite flexible. If it hits 15, I'll cancel and do yours. :)
[16:57] <saaby> ok
[16:57] <saaby> the ceph-mon command with --mkfs will return after a few secs. after that you should be able to start the mon normally
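(saaby's steps pulled together as one sequence, purely as a sketch; the paths, mon id, and init invocation are illustrative:)

    ceph mon getmap -o /tmp/monmap              # on a healthy mon
    cp /opt/data/mon.1/keyring /tmp/keyring     # save the mon keyring first
    rm -rf /opt/data/mon.1                      # wipe the broken store
    ceph-mon -i 1 --mon-data /opt/data/mon.1 --mkfs \
             --monmap /tmp/monmap --keyring /tmp/keyring
    service ceph start mon.1                    # then start it normally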
[16:58] * BillK (~BillK@220-253-132-55.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[17:00] <BMDan> Okay: ceph-mon: created monfs at /opt/data/mon.1 for mon.1
[17:00] <BMDan> Starting...
[17:01] <BMDan> It has the monmap (same as the one I provided it originally). It's now in requester state start.
[17:02] <BMDan> Now state chunks.
[17:03] <BMDan> Chunking down data; looks very similar to what I saw before.
[17:03] <saaby> hm
[17:04] <BMDan> store.db on mon2 is 160 GB, so if it's pulling all of that, that might be reasonable.
[17:04] <BMDan> But I'm guessing it's not.
[17:05] <BMDan> 160 GB on mon3, as well. (That might be obvious to you, but I had to check. :>)
[17:07] <saaby> right.. I'm not sure if you maybe have to compact the store on the current leader, to avoid what you are seeing. Someone more knowledgeable than me will probably have to answer that.
[17:07] <saaby> but I can see that that could be a bit scary in your situation..
[17:08] <BMDan> Indeed.
[17:08] <BMDan> Is compact(ing/ion? not sure the right tense here) blocking?
[17:08] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[17:11] <saaby> that is my experience, yes.
[17:11] * link0_ is now known as link0
[17:13] * Maskul (~Maskul@host-78-148-87-112.as13285.net) Quit (Quit: Leaving)
[17:14] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:19] <BMDan> Okay, so I have a terrible, terrible idea.
[17:19] <BMDan> I should be shot for thinking it.
[17:19] <janos> haha
[17:20] <BMDan> If this uses more than 160 GB (that is, the size of store.db on mon2/mon3), then what would happen if I copied store.db, file-for-file, from mon2 onto this mon (mon1) and started with that?
[17:20] <BMDan> Wouldn't that give me a consistent state and it would grab the latest and go from there?
[17:20] * janos has no idea, but runs in fear
[17:20] <BMDan> Seriously, I'm not proud of this, tell me why I'm wrong.
[17:21] <saaby> I have no idea :)
[17:21] <BMDan> That's the spirit, fellas!
[17:21] <BMDan> Okay, so, serious version of the question: is store.db on the various mons *identical* in a "at-rest" state?
[17:22] <janos> on the surface sounds fine
[17:22] <janos> but i have no idea if data in them is keyed in any way
[17:22] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[17:23] <BMDan> The files themselves are definitely not the same; mon2 is writing to 3418570 while mon3 writes to 3422164.
[17:23] <BMDan> But I cannot help but suspect that the data *is*.
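(As a sketch only, with ceph-mon stopped on both nodes first; whether leveldb tolerates a file-level copy like this is exactly the open question here:)

    rsync -a --delete mon2:/opt/data/mon.2/store.db/ /opt/data/mon.1/store.db/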
[17:29] <janos> nod
[17:31] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[17:38] * tnt (~tnt@pd95bae9b.dip0.t-ipconnect.de) has joined #ceph
[17:50] * DarkAceZ (~BillyMays@50.107.52.142) has joined #ceph
[17:51] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[17:54] * haomaiwang (~haomaiwan@117.79.232.227) Quit (Remote host closed the connection)
[17:55] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[17:56] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[17:57] * Cube (~Cube@66-87-81-43.pools.spcsdns.net) has joined #ceph
[18:01] * bergerx_ (~bekir@78.188.204.182) Quit (Quit: Leaving.)
[18:04] * X3NQ (~X3NQ@195.191.107.205) Quit (Quit: Leaving)
[18:07] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Quit: Leaving.)
[18:10] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[18:11] * Cube (~Cube@66-87-81-43.pools.spcsdns.net) Quit (Quit: Leaving.)
[18:14] * gary (~gary@host86-172-49-182.range86-172.btcentralplus.com) has joined #ceph
[18:15] * gary (~gary@host86-172-49-182.range86-172.btcentralplus.com) has left #ceph
[18:25] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:29] * LeaChim (~LeaChim@2.217.235.48) Quit (Ping timeout: 480 seconds)
[18:30] * l0uis (~l0uis@madmax.fitnr.com) has left #ceph
[18:33] * tkensiski1 (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:38] * Machske (~Bram@d5152D87C.static.telenet.be) has joined #ceph
[18:39] * LeaChim (~LeaChim@2.125.92.224) has joined #ceph
[18:39] * tnt (~tnt@pd95bae9b.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[18:50] * missing (~lotreck@65.202.229.36) has joined #ceph
[18:51] <missing> Are OSDs deployed per node or per disk? The docs advise 1 OSD per node, but in the ceph slides I've seen OSD diagrams in per-disk configurations. Which is 'correct'?
[18:53] <trhoden> missing: generally, people are deploying OSDs per disk
[18:53] <trhoden> that is certainly the most common and intended way.
[18:54] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[18:55] <missing> trhoden: Thanks. What is the OSD-per-disk config syntax? The docs only show examples for a single OSD per node: http://ceph.com/docs/master/rados/configuration/ceph-conf/#osds
[18:56] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Quit: Konversation terminated!)
[18:56] * Tamil (~tamil@38.122.20.226) has joined #ceph
[18:56] * sagelap (~sage@2600:1001:b11e:79e3:f59c:45f0:21e4:a1b2) has joined #ceph
[18:56] <trhoden> missing: you can list as many [osd.<xx>] entries as you want in there -- each specifying a host. That host could be the same, or different.
[18:57] <trhoden> however, those steps aren't really needed if you choose to use ceph-deploy
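(A sketch of the per-disk layout trhoden describes, in the style of the linked docs page; hostnames and ids are made up:)

    [osd.0]
        host = node-a    # first disk on node-a
    [osd.1]
        host = node-a    # second disk on node-a
    [osd.2]
        host = node-b    # first disk on node-b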
[18:59] * fghaas (~florian@91-119-131-155.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[19:06] * alram (~alram@38.122.20.226) has joined #ceph
[19:09] * sagelap (~sage@2600:1001:b11e:79e3:f59c:45f0:21e4:a1b2) Quit (Ping timeout: 480 seconds)
[19:17] * mattbenjamin (~matt@aa2.linuxbox.com) Quit (Quit: Leaving.)
[19:23] * rturk-away is now known as rturk
[19:23] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[19:35] * doubleg (~doubleg@69.167.130.11) has joined #ceph
[19:36] * tkensiski (~tkensiski@209.66.64.134) has joined #ceph
[19:37] * sagelap (~sage@2600:1001:b11e:79e3:c685:8ff:fe59:d486) has joined #ceph
[19:38] * tkensiski (~tkensiski@209.66.64.134) has left #ceph
[19:39] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:39] * ChanServ sets mode +v andreask
[19:41] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Quit: Leaving.)
[19:48] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[19:52] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[19:54] * mattbenjamin (~matt@76-206-42-105.lightspeed.livnmi.sbcglobal.net) has joined #ceph
[19:56] <redeemed_> howdy. ubuntu 12.04 LTS has linux kernel 3.5.0-34-generic as the latest and no 3.4.x. however, ceph recommends 3.4.x or 3.6.x. Does 3.5.x have issues? i want to use 3.6.11 but it seems some other packages do not like 3.6.
[19:58] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[19:59] * allsystemsarego (~allsystem@188.25.134.34) Quit (Quit: Leaving)
[20:01] <Gugge-47527> redeemed_: you could use the lts-raring kernel ... 3.8.x
[20:02] <Gugge-47527> I've had some strange issues with xfs on md raid5 devices on that kernel though :)
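(Installing that backport on 12.04, assuming the standard HWE meta-package:)

    sudo apt-get install linux-generic-lts-raring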
[20:03] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Quit: Leaving.)
[20:06] <missing> trhoden: Thanks.
[20:06] <ofu> Gugge-47527: why do you use md raid5? Why not use a separate osd for each disk?
[20:07] * missing (~lotreck@65.202.229.36) Quit (Quit: missing)
[20:07] <redeemed_> thanks, Gugge-47527 , installed via apt-get, right?
[20:11] <redeemed_> don't answer my last question, Gugge-47527 :)
[20:21] * sagelap (~sage@2600:1001:b11e:79e3:c685:8ff:fe59:d486) Quit (Ping timeout: 480 seconds)
[20:26] * LeaChim (~LeaChim@2.125.92.224) Quit (Ping timeout: 480 seconds)
[20:26] * fghaas (~florian@91-119-131-155.dynamic.xdsl-line.inode.at) has joined #ceph
[20:32] * sagelap (~sage@2600:1001:b11e:79e3:adee:d7fc:bdb:3c) has joined #ceph
[20:34] * julian (~julianwa@125.70.135.148) Quit (Quit: afk)
[20:35] * LeaChim (~LeaChim@90.201.192.169) has joined #ceph
[20:35] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[20:36] * oddomatik (~Adium@12.248.40.138) has joined #ceph
[20:36] * hijacker (~hijacker@213.91.163.5) Quit (Read error: Connection timed out)
[20:36] * hijacker (~hijacker@213.91.163.5) has joined #ceph
[20:37] <redeemed_> looks like ubuntu 12.04 users will need to stick with the stock 3.5 kernel if they want to use openvswitch
[20:43] * Maskul (~Maskul@host-78-148-87-112.as13285.net) has joined #ceph
[20:44] * mattbenjamin (~matt@76-206-42-105.lightspeed.livnmi.sbcglobal.net) Quit (Quit: Leaving.)
[20:49] * LeaChim (~LeaChim@90.201.192.169) Quit (Ping timeout: 480 seconds)
[20:53] * madkiss (~madkiss@089144192006.atnat0001.highway.a1.net) has joined #ceph
[20:53] * sagelap1 (~sage@38.121.228.2) has joined #ceph
[20:56] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[20:58] * LeaChim (~LeaChim@90.221.247.164) has joined #ceph
[20:59] * sagelap (~sage@2600:1001:b11e:79e3:adee:d7fc:bdb:3c) Quit (Ping timeout: 480 seconds)
[21:01] * madkiss (~madkiss@089144192006.atnat0001.highway.a1.net) Quit (Quit: Leaving.)
[21:07] * madkiss (~madkiss@089144192006.atnat0001.highway.a1.net) has joined #ceph
[21:08] * Tamil (~tamil@38.122.20.226) Quit (Quit: Leaving.)
[21:17] * oddomatik (~Adium@12.248.40.138) Quit (Quit: Leaving.)
[21:19] * oddomatik (~Adium@12.248.40.138) has joined #ceph
[21:20] <doubleg> Gugge-47527: always wondered how an md raid-backed OSD configuration would work. Are you using this currently?
[21:25] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Quit: Leaving.)
[21:26] * Maskul (~Maskul@host-78-148-87-112.as13285.net) Quit (Quit: Leaving)
[21:29] * portante (~user@nat-pool-rdu-t.redhat.com) has joined #ceph
[21:29] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[21:29] * ChanServ sets mode +v andreask
[21:29] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[21:30] * tnt (~tnt@pd95bae9b.dip0.t-ipconnect.de) has joined #ceph
[21:34] <BMDan> Is there a way to mark a mon as down, but leave it in the pool so it can catch up?
[21:34] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[21:35] <BMDan> Or to set "mon nodown" like you can set for osds?
[21:37] <gregaf> BMDan: if I understand your question correctly, it's already doing all that stuff implicitly if it's out of date; is there some behavior you're trying to avoid?
[21:38] * Tamil (~tamil@38.122.20.226) has joined #ceph
[21:38] * oddomatik1 (~Adium@12.248.40.138) has joined #ceph
[21:39] <BMDan> gregaf: Yeah; I brought up mon1 with mon2's store.db, and I think it's thrashing a wee bit, so I was trying to avoid having it in the pool, since it seems to have made the other mons go a bit screwy.
[21:40] <gregaf> you brought it up with another mon's store? you mean you copied the store into its directory, or...?
[21:40] <BMDan> Yeah, rsync'd the store.db directory. Left everything else as it was.
[21:40] <gregaf> oh….
[21:40] <gregaf> did you do that while mon.1 was running?
[21:40] <BMDan> No; I'm insane, but not THAT insane.
[21:41] <gregaf> hrm
[21:41] <BMDan> Kept trying to bring it up the "right" way and it would use 200+ GB and die when it hit 5% disk free.
[21:41] <BMDan> This despite the fact that the 0.61.2 store.db's on mon2 and mon3 were only 160 GB.
[21:41] <BMDan> mon1 is 0.61.4.
[21:42] <gregaf> I'm not quite sure what the impact of rsyncing like that would be; it depends a lot on the prior disk state of mon.1 and on what leveldb is doing internally
[21:42] <BMDan> Yeah, nobody else was sure, either. ;)
[21:43] <gregaf> the sync should have just worked; I suspect the reason it didn't is because of the large store size
[21:43] <gregaf> which should be fixed by upgrading the other monitors
[21:43] <BMDan> Right, so I'm going to try that now.
[21:43] <BMDan> If there's a way to set nodown, I'd like to do that, so I don't lose quorum if mon3 takes a while to restart.
[21:43] <BMDan> Is there?
[21:43] * portante (~user@nat-pool-rdu-t.redhat.com) Quit (Quit: give way to znc)
[21:43] * portante_ (~portante@nat-pool-bos-t.redhat.com) has joined #ceph
[21:44] <gregaf> ah, I see — no, the fundamental algorithm isn't going to allow anything like that
[21:44] * oddomatik (~Adium@12.248.40.138) Quit (Ping timeout: 480 seconds)
[21:45] <BMDan> Okay. If mon3 dies in a ditch, I can use mon2 to remove mon1 and mon3 from the map and get a quorum back, yes?
[21:45] * portante_ (~portante@nat-pool-bos-t.redhat.com) Quit ()
[21:45] <gregaf> it'll require a bit of work, but yes
[21:46] <gregaf> (if you're that nervous you can also copy mon.3's whole directory somewhere and then revert back if the upgrade goes badly)
[21:46] <BMDan> http://eu.ceph.com/docs/v0.47.1/ops/manage/grow/mon/#removing-a-monitor-from-an-unhealthy-or-down-cluster ?
[21:46] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[21:46] * portante (~portante@nat-pool-bos-t.redhat.com) has joined #ceph
[21:46] <BMDan> Issue is that I don't have the binaries for 0.61.2, so it's probably faster to go with ^ ^ if it works.
[21:46] <gregaf> yep, that one
[21:46] <BMDan> Okie dokie. Let's see what happens.
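(Roughly what that doc prescribes, as a sketch with illustrative mon ids and paths; mon.2 is the survivor:)

    service ceph stop mon.2
    ceph-mon -i 2 --extract-monmap /tmp/monmap   # dump mon.2's local monmap
    monmaptool /tmp/monmap --rm 1                # drop the dead mons
    monmaptool /tmp/monmap --rm 3
    ceph-mon -i 2 --inject-monmap /tmp/monmap    # write the pruned map back
    service ceph start mon.2                     # mon.2 now forms quorum alone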
[21:48] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit ()
[21:52] * LPG_ (~oftc-webi@c-76-104-197-224.hsd1.wa.comcast.net) has joined #ceph
[21:55] * LPG_ (~oftc-webi@c-76-104-197-224.hsd1.wa.comcast.net) Quit ()
[22:02] * LPG|2 (~kvirc@c-76-104-197-224.hsd1.wa.comcast.net) has joined #ceph
[22:03] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:04] * nwat (~Adium@eduroam-251-132.ucsc.edu) has joined #ceph
[22:07] * nwat (~Adium@eduroam-251-132.ucsc.edu) Quit ()
[22:07] * andrei (~andrei@host86-155-31-94.range86-155.btcentralplus.com) has joined #ceph
[22:18] * oddomatik1 (~Adium@12.248.40.138) Quit (Quit: Leaving.)
[22:20] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[22:21] * oddomatik (~Adium@12.248.40.138) has joined #ceph
[22:21] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Quit: Leaving.)
[22:25] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[22:30] * fghaas (~florian@91-119-131-155.dynamic.xdsl-line.inode.at) has left #ceph
[22:32] * madkiss (~madkiss@089144192006.atnat0001.highway.a1.net) Quit (Quit: Leaving.)
[22:33] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[22:37] * trhoden (~trhoden@pool-108-28-184-124.washdc.fios.verizon.net) Quit (Quit: trhoden)
[22:39] * LPG|2 (~kvirc@c-76-104-197-224.hsd1.wa.comcast.net) Quit (Ping timeout: 480 seconds)
[23:06] * LPG|2 (~kvirc@c-76-104-197-224.hsd1.wa.comcast.net) has joined #ceph
[23:19] * miniyo (~miniyo@0001b53b.user.oftc.net) Quit (Ping timeout: 480 seconds)
[23:20] * rtek (~sjaak@rxj.nl) Quit (Remote host closed the connection)
[23:20] * LPG|2 (~kvirc@c-76-104-197-224.hsd1.wa.comcast.net) Quit (Ping timeout: 480 seconds)
[23:24] * LPG|2 (~kvirc@c-76-104-197-224.hsd1.wa.comcast.net) has joined #ceph
[23:24] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[23:25] * rtek (~sjaak@rxj.nl) has joined #ceph
[23:25] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[23:25] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[23:28] * miniyo (~miniyo@0001b53b.user.oftc.net) has joined #ceph
[23:28] * LPG|2 (~kvirc@c-76-104-197-224.hsd1.wa.comcast.net) has left #ceph
[23:30] * sagelap (~sage@2600:1001:b11e:79e3:adee:d7fc:bdb:3c) has joined #ceph
[23:31] * rtek (~sjaak@rxj.nl) Quit (Remote host closed the connection)
[23:31] * rtek (~sjaak@rxj.nl) has joined #ceph
[23:33] * leseb (~Adium@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[23:35] * sjustlaptop (~sam@rrcs-50-75-163-254.nys.biz.rr.com) has joined #ceph
[23:37] * sagelap1 (~sage@38.121.228.2) Quit (Ping timeout: 480 seconds)
[23:49] * BMDan (~BMDan@74.121.199.170) Quit (Quit: Leaving.)
[23:49] * tnt (~tnt@pd95bae9b.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[23:50] * sjustlaptop (~sam@rrcs-50-75-163-254.nys.biz.rr.com) Quit (Ping timeout: 480 seconds)
[23:52] * tnt (~tnt@pd95bae9b.dip0.t-ipconnect.de) has joined #ceph
[23:54] * sagelap (~sage@2600:1001:b11e:79e3:adee:d7fc:bdb:3c) Quit (Ping timeout: 480 seconds)
[23:55] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Read error: Connection reset by peer)
[23:55] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[23:57] * drokita1 (~drokita@199.255.228.128) has joined #ceph
[23:57] * rturk is now known as rturk-away

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.