#ceph IRC Log


IRC Log for 2013-05-30

Timestamps are in GMT/BST.

[0:03] * drokita1 (~drokita@ has joined #ceph
[0:04] * rturk-away is now known as rturk
[0:05] <mrjack> tnt: 2x1GE?
[0:05] <mrjack> 15mb/sec seems slow yes
[0:08] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[0:10] <tnt> mrjack: yes 2x1G and when benchmarking RBD I can get to 100 MB/s write and 200 MB/s read with a single thread / client.
[0:11] <tnt> and it's also generating "Slow request" for things that are currently running on the cluster which is definitely not ideal since it can 'freeze' a VM for 30s or so.
[0:11] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[0:11] * BillK (~BillK@124-148-82-128.dyn.iinet.net.au) has joined #ceph
[0:12] * drokita1 (~drokita@ Quit (Ping timeout: 480 seconds)
[0:13] * jeff-YF (~jeffyf@ Quit (Ping timeout: 480 seconds)
[0:15] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[0:20] * fridudad (~oftc-webi@p5B09D334.dip0.t-ipconnect.de) Quit (Remote host closed the connection)
[0:21] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) Quit (Ping timeout: 480 seconds)
[0:21] * redeemed (~redeemed@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Quit: bia)
[0:23] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:31] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:57] * redeemed (~redeemed@cpe-192-136-224-78.tx.res.rr.com) has joined #ceph
[1:26] * tkensiski (~tkensiski@70.sub-70-211-66.myvzw.com) has joined #ceph
[1:27] * tkensiski (~tkensiski@70.sub-70-211-66.myvzw.com) has left #ceph
[1:30] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[1:35] <mrjack> tnt: have you added many osds at once?
[1:35] * BManojlovic (~steki@fo-d- Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:36] <tnt> mrjack: 2
[1:37] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) has joined #ceph
[1:39] <tnt> mrjack: anyway, now the cluster is finally back to HEALTH_OK so I can go to sleep :p I'll take a closer look at the IO logs tomorrow to see what went on exactly.
[1:44] <sagewk> need a quick review for https://github.com/ceph/ceph/commit/c59580a71d76110272905c7f5b54e47fafb46cea
[1:44] * alop (~al592b@71-80-139-200.dhcp.rvsd.ca.charter.com) Quit (Quit: alop)
[1:44] <sagewk> am i right that the stl containers don't initialize int types?
[1:44] <sagewk> fwiw it fixes my symptoms
[1:49] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[1:49] <davidzlap> sagewk: Reviewed-by: David Zafman <david.zafman@inktank.com>
[1:49] <sagewk> thanks. i'm right about the stl int thing right?
[1:49] <davidzlap> not sure, but the code doesn't break anything.
[1:50] <sagewk> yeah
[1:51] <davidzlap> I thought that "new" initializes memory to 0, so I would have thought that an STL container that creates a new item would have int types of 0.
[1:51] <sagewk> it probably doesn't under the assumption that the ctor will do that work
[1:51] <dmick> map[] says it inserts a new element if the key doesn't exist
[1:51] <dmick> the claim is it's constructed with the default constructor
[1:51] <davidzlap> I may be wrong about the "new" thing
[1:52] <sagewk> ...and there is no ctor for int.
[1:52] <dmick> so yeah, I suspect it would use uint8_t's constructor, and yes.
[1:52] * rturk is now known as rturk-away
[1:53] <dmick> that....might almost be worth some sort of autoreview
[1:53] <gregaf> yeah, stl and memory initialization is a pain, but it definitely doesn't init ints of any kind
[1:54] <gregaf> that's what was causing the down OSDs? *cry*
[1:54] <sagewk> paravoid: yay, fixed your too-many-osds-marked-down bug
[1:55] <sagewk> it was marking one osd down, and any others that had seen any reports (where the stl map was instantiated and looked at) were also marked down
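A minimal sketch of the C++ initialization rules touched on in the exchange above, assuming a standard-conforming compiler and library. Per the standard, `std::map::operator[]` value-initializes a newly inserted mapped value, so a bare `int` or `uint8_t` starts at zero, and `new int()` is zeroed while plain `new int` is not. A struct whose user-provided constructor skips a member, however, leaves that member indeterminate even inside a map. The struct and function names here are hypothetical; they are not taken from the Ceph commit linked above, and the sketch does not claim to reproduce that fix.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical record type: the user-provided constructor forgets `count`,
// so `count` is indeterminate after construction, including when the
// object is created by std::map::operator[]. Reading it would be undefined
// behaviour, so this sketch never does.
struct ReportEmptyCtor {
  int count;
  ReportEmptyCtor() {}
};

// Hypothetical corrected type: the constructor initializes every member.
struct ReportFixed {
  int count;
  ReportFixed() : count(0) {}
};

// operator[] inserts a value-initialized mapped value for a missing key;
// for a plain int that means zero.
int fresh_map_int() {
  std::map<int, int> m;
  return m[42];
}

// Same rule for uint8_t, the type mentioned in the discussion.
int fresh_map_u8() {
  std::map<int, std::uint8_t> m;
  return m[42];
}

// With an explicit constructor the member is well defined.
int fresh_map_fixed_count() {
  std::map<int, ReportFixed> m;
  return m[42].count;
}

// `new int()` value-initializes (zero); `new int` without parentheses
// would leave the value indeterminate.
int new_with_parens() {
  int* p = new int();
  int v = *p;
  delete p;
  return v;
}

// Small self-check of the defined cases.
void check_defaults() {
  assert(fresh_map_int() == 0);
  assert(fresh_map_u8() == 0);
  assert(fresh_map_fixed_count() == 0);
  assert(new_with_parens() == 0);
}
```

Under these rules, uninitialized values in a map usually come from a class-type mapped value whose constructor leaves a field unset, or from objects created outside the container with default-initialization, rather than from bare integer mapped values.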
[1:55] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[1:55] * buck (~buck@bender.soe.ucsc.edu) Quit (Quit: Leaving.)
[2:00] <dmick> sweet!
[2:02] <paravoid> sagewk: hey!
[2:02] * sagewk (~sage@2607:f298:a:607:b8b1:2d0c:7124:f5a7) has left #ceph
[2:02] <paravoid> lol
[2:02] * sagewk (~sage@2607:f298:a:607:b8b1:2d0c:7124:f5a7) has joined #ceph
[2:03] <paravoid> sagewk: hey
[2:03] <sagewk> hey
[2:04] <paravoid> sneaky bug
[2:05] <nhm> new linkedin title: sneaky bug hunter
[2:07] <paravoid> if only #5084 was fixed now
[2:08] <paravoid> I'm greedy aren't I
[2:08] <sagewk> so greedy!
[2:11] <paravoid> sagewk: no bobtail backport for 4967?
[2:11] <paravoid> ah, you just did that
[2:32] * Cube (~Cube@ Quit (Read error: Operation timed out)
[2:33] <nhm> sagewk: still seeing some pretty extreme CPU utilization and mons dropping/joining during PG creation with wip-5176-cuttlefish. We'll see how it does once PG creation is over.
[2:33] <sagewk> k
[2:36] <joao> nhm, do you think what you're seeing may be related to this? https://code.google.com/p/leveldb/issues/detail?id=174
[2:37] <sagewk> nhm: does 'perf top' tell you anything?
[2:37] <nhm> interesting, I'm not sure.
[2:38] <nhm> sagewk: I need to get a kernel that has perf compiled with libunwind
[2:38] <nhm> sagewk: as of kernel 3.8 it can use unwind/dwarf which may finally get us a good stacktrace.
[2:39] <dmick> sagewk: ceph -w fix pushed
[2:42] <sagewk> cool
[2:49] <nhm> sagewk: http://pastie.org/7981963
[2:50] <sagewk> hmm, all leveldb reads...
[2:50] <sagewk> try with 'mon compact on trim = false' and see what happens?
[2:50] <dmick> weird. my first four reads got "no such pastie", and then it finally showed up
[2:51] <dmick> eventual consistency on a CDN?...
[2:51] <nhm> dmick: pastie seems to be having issues I think.
[2:52] <dmick> it's loadbalanced, that probably explains it
[2:52] <nhm> sagewk: I'll give it a try. I went through and did some investigation into the leveldb code to see where PosixRandomAccessFile could be getting instantiated. Looks like repair and compaction are big ones.
[2:52] <sagewk> s
[2:56] * diegows (~diegows@ has joined #ceph
[3:11] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[3:24] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[3:34] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[3:37] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) Quit (Quit: Leaving.)
[3:40] * yasu` (~yasu`@dhcp-59-219.cse.ucsc.edu) Quit (Remote host closed the connection)
[3:51] * xmltok (~xmltok@pool101.bizrate.com) Quit (Ping timeout: 480 seconds)
[3:51] <nhm> sagewk: disabling compact on trim may be helping. The mons are still using a lot of CPU, but PG creation seems to be going along faster and it's not always pegged. I'll have to see what the throughput looks like once PG creation finishes.
[4:02] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[4:05] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[4:07] * redeemed (~redeemed@cpe-192-136-224-78.tx.res.rr.com) Quit (Quit: bia)
[4:08] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[4:08] * loicd (~loic@2a01:e35:2eba:db10:405:7eb1:2d74:a21f) has joined #ceph
[4:14] <nhm> sagewk: so again the mons are using lots of CPU time due to PosixRandomAccessFile, but it seems slightly faster, and the ratios are different. So we are still spending basically all of our time doing reads, but presumably less time doing reads due to compactions on trim.
[4:17] * dosaboy (~dosaboy@host86-163-9-169.range86-163.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[4:17] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: Leaving.)
[4:17] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[4:20] * dosaboy (~dosaboy@host86-164-82-60.range86-164.btcentralplus.com) has joined #ceph
[4:22] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[4:41] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[4:41] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[4:49] * loicd (~loic@2a01:e35:2eba:db10:405:7eb1:2d74:a21f) Quit (Quit: Leaving.)
[4:49] * loicd (~loic@magenta.dachary.org) has joined #ceph
[4:53] * alexxy[home] (~alexxy@2001:470:1f14:106::2) has joined #ceph
[4:53] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Read error: Connection reset by peer)
[5:08] * Vanony (~vovo@ has joined #ceph
[5:15] * Vanony_ (~vovo@ Quit (Ping timeout: 480 seconds)
[5:17] * hflai_ is now known as hflai
[5:17] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[5:17] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[5:26] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[5:46] * The_Bishop (~bishop@2001:470:50b6:0:6dd1:495c:667:a5e6) Quit (Ping timeout: 480 seconds)
[5:50] * yehuda_hm (~yehuda@2602:306:330b:1410:6885:1334:8c26:70e9) Quit (Read error: Connection timed out)
[5:51] * yehuda_hm (~yehuda@2602:306:330b:1410:6885:1334:8c26:70e9) has joined #ceph
[5:55] * The_Bishop (~bishop@2001:470:50b6:0:11f9:94a4:9de5:a2bd) has joined #ceph
[6:04] * The_Bishop (~bishop@2001:470:50b6:0:11f9:94a4:9de5:a2bd) Quit (Ping timeout: 480 seconds)
[6:13] * The_Bishop (~bishop@2001:470:50b6:0:6dd1:495c:667:a5e6) has joined #ceph
[6:23] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[6:24] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[6:27] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Leaving...)
[6:50] * san (~san@ has joined #ceph
[6:59] * KindTwo (~KindOne@h143.171.17.98.dynamic.ip.windstream.net) has joined #ceph
[7:02] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[7:02] * KindTwo is now known as KindOne
[7:05] * TiCPU|Home (jerome@p4.i.ticpu.net) has joined #ceph
[7:05] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[7:06] <TiCPU|Home> I don't know what is going on but I restarted 3 OSD and they stay down, logging doesn't help much
[7:07] <TiCPU|Home> I tried resetting one OSD and using --mkfs, this worked, it rejoined, resynced, all went healthy then I restarted it and it doesn't join again
[7:07] * tnt (~tnt@ has joined #ceph
[7:08] <TiCPU|Home> the OSD also segfaults on stop :/
[7:08] <TiCPU|Home> -1> 2013-05-30 05:08:36.416010 7f425994e700 -1 osd.2 1290 *** Got signal Terminated ***
[7:08] <TiCPU|Home> 0> 2013-05-30 05:08:36.418124 7f425994e700 -1 *** Caught signal (Segmentation fault) **
[7:08] <TiCPU|Home> in thread 7f425994e700
[7:10] * dpippenger1 (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[7:11] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[7:13] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Read error: Operation timed out)
[7:21] * tkensiski (~tkensiski@99-196-196-10.cust.wildblue.net) has joined #ceph
[7:21] * tkensiski (~tkensiski@99-196-196-10.cust.wildblue.net) has left #ceph
[7:25] * julian (~julianwa@ Quit (Quit: afk)
[7:28] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:32] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[7:33] * loicd (~loic@magenta.dachary.org) has joined #ceph
[7:34] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) Quit (Remote host closed the connection)
[7:36] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) has joined #ceph
[7:37] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) Quit (Remote host closed the connection)
[7:40] * Vjarjadian (~IceChat77@ Quit (Quit: If at first you don't succeed, skydiving is not for you)
[7:41] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) has joined #ceph
[7:53] <dmick> TiCPU|Home: if there's a segfault, there's a stack backtrace up there
[7:53] * KindTwo (KindOne@h74.213.89.75.dynamic.ip.windstream.net) has joined #ceph
[7:54] <TiCPU|Home> yep
[7:55] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[7:55] * KindTwo is now known as KindOne
[7:55] <dmick> pastebin, maybe? or is it an assert failure?
[7:55] <TiCPU|Home> yes, sorry to keep you waiting, my log has grown a lot
[7:55] <dmick> no worries
[7:56] <TiCPU|Home> http://pastebin.ca/2384904
[7:56] <TiCPU|Home> I was able to make them rejoin by removing those lines: filestore min sync interval = 60; filestore max sync interval = 300; filestore journal parallel = true;
[7:57] <TiCPU|Home> removing only filestore journal or only filestore min failed to make it up again
[7:57] <TiCPU|Home> back in HEALTH_OK state
[7:59] <TiCPU|Home> just tried again, uncommented filestore * sync interval lines and OSD couldn't join the cluster
[7:59] <dmick> hm. not a lot of info in that backtrace
[7:59] <TiCPU|Home> ceph-dbg is installed
[8:01] <TiCPU|Home> what is weird is that I added all those 3 lines this afternoon, then later in the evening I restarted an OSD which had corruption, killed it, --mkfs'ed it, then any OSD I restarted stopped working except when freshly wiped.
[8:01] <TiCPU|Home> (added those 3 lines and restarted all mon/osd one-by-one)
[8:04] * agh (~oftc-webi@gw-to-666.outscale.net) Quit (Quit: Page closed)
[8:07] <dmick> looking at the code, must have been in osdmap->get_epoch() or load_pgs() in OSD::init
[8:07] <dmick> which I suppose makes some sense
[8:07] <dmick> although I don't know why those settings would really matter to either of those
[8:08] <TiCPU|Home> dmick, the thing is, it crashes on sigterm even without those settings, I just noticed
[8:08] <dmick> might be interesting to try starting with debug osd >= 10
[8:08] <dmick> load_pgs() gives some progress at 10
[8:09] <dmick> if you're not aware, you can add that on the cmdline --debug-osd=10
[8:10] <TiCPU|Home> I had debug OSD == 20 and didn't get much details, I can pastebin it
[8:10] <dmick> sure
[8:11] <TiCPU|Home> 500M of logs, damn it
[8:12] <dmick> heh
[8:12] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:14] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[8:17] <TiCPU|Home> dmick, http://paste.ubuntu.com/5715754/
[8:17] <TiCPU|Home> extracted the lines between 2 OSD boots
[8:17] <dmick> ok
[8:18] <dmick> are all the OSDs v0.61.2?
[8:18] * dpippenger1 (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[8:19] <TiCPU|Home> yes, upgraded 3 days ago
[8:19] <TiCPU|Home> all 6 servers have been rsync'ed to be the same
[8:19] <TiCPU|Home> kernel 3.9.4 btw
[8:25] <dmick> taxing my git skills over here and interruptions; sorry
[8:25] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[8:26] <TiCPU|Home> no problem, I still didn't peek at ceph's code, I'm still getting the hang of it
[8:26] <dmick> there have been some changes in clear_temp, which is where it died
[8:26] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Read error: Connection reset by peer)
[8:27] <TiCPU|Home> I had to modify libvirt and qemu a bit but not ceph yet
[8:27] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[8:27] * ChanServ sets mode +o scuttlemonkey
[8:31] <dmick> I'm certain sjust would be interested in seeing this tomorrow if you can save logs and perhaps the filesystem state of the breaking OSD
[8:32] <dmick> but I don't have anything quick I can do tonight
[8:33] <TiCPU|Home> well, the state seems to be persistent
[8:33] <TiCPU|Home> I can just change back the parameter and restart the OSD to get it not to join the cluster
[8:34] <dmick> oh I thought you said it was dying even with those config settings changed?
[8:34] <TiCPU|Home> well, thanks for checking it out :)
[8:34] <TiCPU|Home> dmick, it dies on exit
[8:34] <dmick> er..
[8:34] <TiCPU|Home> I issue SIGTERM then it SIGSEGV
[8:34] <dmick> oh
[8:34] <dmick> hm
[8:34] <dmick> I know there were recent changes to try to handle signals more cleanly
[8:35] <dmick> sounds like a bug crept in
[8:35] <dmick> but, yeah, dunno if they're connected
[8:35] <dmick> if you can come around during business hours tomorrow hit up sjust with this one
[8:35] <TiCPU|Home> certainly, I'll be there tomorrow on my other nickname
[8:36] <TiCPU|Home> well, 2:30 in the morning, I should get some sleep
[8:41] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:06] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[9:12] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:13] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[9:17] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:19] * davidzlap (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[9:19] * dcasier (~dcasier@ has joined #ceph
[9:27] * madkiss (~madkiss@089144192093.atnat0001.highway.a1.net) has joined #ceph
[9:29] * joshd (~jdurgin@2602:306:c5db:310:459d:83f:2de:a3f7) Quit (Quit: Leaving.)
[9:30] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:31] * eschnou (~eschnou@ has joined #ceph
[9:34] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Read error: Operation timed out)
[9:36] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:36] * ScOut3R (~ScOut3R@ has joined #ceph
[9:51] * maximilian (~maximilia@ has joined #ceph
[9:53] <maximilian> hi folks, I hope that somebody can help me out from my problem (HEALTH_WARN 19 pgs degraded; 19 pgs stuck unclean; recovery 228/3814 degraded (5.978%))
[9:58] * madkiss (~madkiss@089144192093.atnat0001.highway.a1.net) Quit (Ping timeout: 480 seconds)
[9:58] <maximilian> I'm coming from a drbd environment and want to move to ceph... installed & configured ceph on 2 servers, each with 1 osd; everything worked fine up to the first failover scenario.. I just rebooted the os.1 server and restarted ceph again
[10:00] * BManojlovic (~steki@ has joined #ceph
[10:00] <maximilian> still waiting for self-healing
[10:01] <maximilian> ceph pg dump_stuck unclean returns
[10:01] <maximilian> ok
[10:01] <maximilian> pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
[10:01] <maximilian> 0.68 0 0 0 0 0 0 0 active+degraded 2013-05-29 21:56:05.167623 0'0 4'22 [2] [2] 0'0 2013-05-28 22:46:29.958220 0'0 2013-05-28 22:46:29.958220
[10:01] <maximilian> 1.67 0 0 0 0 0 0 0 active+degraded 2013-05-29 21:56:05.168347 0'0 4'22 [2] [2] 0'0 2013-05-28 22:50:55.003394 0'0 2013-05-28 22:50:55.003394
[10:02] <maximilian> any help would be appreciated
[10:03] <tnt> it's probably because you only have 2 servers
[10:04] <tnt> try "ceph osd crush tunables argonaut"
[10:07] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[10:07] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[10:08] * MrNPP (~MrNPP@ Quit (Read error: Operation timed out)
[10:11] * MrNPP (~MrNPP@ has joined #ceph
[10:12] * LeaChim (~LeaChim@ has joined #ceph
[10:15] * virsibl (~virsibl@ has joined #ceph
[10:18] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[10:22] * virsibl (~virsibl@ has left #ceph
[10:23] * virsibl (~virsibl@ has joined #ceph
[10:26] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[10:26] <maximilian> I use cuttlefish ... "ceph osd crush tunables cuttlefish" does this work for me???
[10:28] <tnt> maximilian: I don't think the cuttlefish profile exist yet. You can use the 'optimal' profile but beware that you then need kernel >= 3.9 if you use the kernel client for rbd or cephfs
[10:28] <tnt> using the 'argonaut' profile (even if you use cuttlefish) will most likely fix your issue.
[10:29] * virsibl (~virsibl@ Quit (Quit: KVIrc 4.1.3 Equilibrium http://www.kvirc.net/)
[10:30] * virsibl (~virsibl@ has joined #ceph
[10:32] <maximilian> I used the "ceph osd crush tunables argonaut" ...health status hasn't changed yet
[10:33] <tnt> pastebin ceph osd tree
[10:34] <maximilian> # id weight type name up/down reweight
[10:34] <maximilian> -1 2 root default
[10:34] <maximilian> -3 1 rack unknownrack
[10:34] <maximilian> -2 0 host kysrv1_10g
[10:34] <maximilian> -4 1 host kysrv2_10g
[10:34] <maximilian> 2 1 osd.2 up 1
[10:34] <maximilian> -5 1 host kysrv1
[10:34] <maximilian> 1 1 osd.1 up 1
[10:36] <tnt> I said pastebin
[10:38] <maximilian> sorry never use that before... http://pastebin.com/CAP2M8DV
[10:38] <tnt> that crushmap is just weird
[10:39] <tnt> first, the hostnames don't seem to match: kysrv1 vs kysrv1_10g. Each time you use a 'host' definition in ceph.conf it _must_ match whatever 'uname -n' of the machine returns.
[10:40] <maximilian> I see
[10:40] <maximilian> I have 2 network adapters on each server 1G and 10G
[10:41] <tnt> second, the hierarchy is just weird. Your best bet is probably to export it, decompile it, fix it up by hand to reflect a proper hierarchy, recompile it and reinject it. Alternatively, you can just try to fix the 'hosts' settings everywhere and restart the cluster, it might fix it automatically on start.
[10:41] <maximilian> both 10G are connected via crossover and in ceph.conf I use only 10G...nowhere I have mentioned other hostname
[10:41] <tnt> well the 'host' just selects the machine, not the network at all. If you want to use specific addresses / networks, there are options for that but basically it will by default use the IP as seen from the mons.
[10:42] <tnt> if the hostname returned by 'uname -n' is kysrv1 you _must_ use that.
[10:47] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:56] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[10:56] * ChanServ sets mode +v andreask
[10:57] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[11:01] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[11:01] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[11:06] * ScOut3R_ (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[11:09] * fridudad (~oftc-webi@fw-office.allied-internet.ag) has joined #ceph
[11:10] <fridudad> rbd snap rollback does not show progress since cuttlefish? but there is only a no-progress option
[11:10] <fridudad> how can i show the progress?
[11:10] <maximilian> thx tnt just decompiled and fixed the crushmap..health is Ok now
[11:10] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[11:26] <tnt> maximilian: good.
[11:40] * ScOut3R_ (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Remote host closed the connection)
[11:43] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[11:44] * tziOm (~bjornar@ has joined #ceph
[11:55] * Meths_ (rift@ has joined #ceph
[12:00] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[12:12] * andrei (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[12:12] <andrei> hello guys
[12:12] <andrei> I am planning to change the filesystem on the osds in my small PoC ceph setup
[12:13] <andrei> i currently have 2 servers with 17 osds between them
[12:13] <andrei> with xfs file system
[12:13] <andrei> i would like to move to the btrfs
[12:13] <andrei> and also would like to add an ssd disk for journal cache
[12:13] <andrei> could someone please suggest a way to do this?
[12:15] * jbd_ (~jbd_@34322hpv162162.ikoula.com) Quit (Remote host closed the connection)
[12:18] <tnt> to switch to ssd journal, you can just create the partitions on the ssd (one per osd), then for each OSD: stop it, flush the journal, edit the configuration to point to the ssd partition for that osd, format journal, start the osd again.
[12:18] <tnt> look at http://wiki.skytech.dk/index.php/Ceph_-_howto,_rbd,_lvm,_cluster#Add.2Fmove_journal_in_running_cluster
[12:19] <tnt> to switch fs, I'm not sure what's the better option.
[12:28] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) has joined #ceph
[12:29] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[12:34] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[12:45] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[12:48] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[12:50] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:50] * ChanServ sets mode +v andreask
[12:52] * BillK (~BillK@124-148-82-128.dyn.iinet.net.au) Quit (Ping timeout: 481 seconds)
[12:53] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[12:57] <andrei> tnt: thanks
[12:57] <andrei> could I have a mix of filesystems in the cluster?
[12:58] <andrei> so, let's say I can switch osds one by one without redoing the cluster
[12:58] <andrei> and ceph would automatically rebalance the data?
[12:58] <andrei> so, I would remove one osd from ceph and add it with a new fs
[12:58] <andrei> and do this one by one?
[12:58] <andrei> or is there a better option?
[12:58] <andrei> perhaps doing it one server at a time?
[13:00] * BillK (~BillK@124-148-124-185.dyn.iinet.net.au) has joined #ceph
[13:01] * nhm (~nhm@ Quit (Ping timeout: 480 seconds)
[13:02] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[13:06] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[13:07] * nhm (~nhm@ has joined #ceph
[13:11] * madkiss1 (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[13:16] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[13:36] * fridudad (~oftc-webi@fw-office.allied-internet.ag) Quit (Quit: Page closed)
[13:40] <andrei> does anyone know what is the proper way of replacing fs on osds?
[13:47] <sha> you mean change xfs to some other fs?
[13:59] <andrei> sha: may i have a cluster with a mix of btrfs and xfs?
[13:59] <andrei> or do i need to have a consistency?
[14:00] <tnt> joao: ping
[14:00] * jeff-YF (~jeffyf@ip-64-134-70-136.public.wayport.net) has joined #ceph
[14:01] * nhm (~nhm@ Quit (Ping timeout: 480 seconds)
[14:04] <sha> did you read this? http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
[14:07] * LeaChim (~LeaChim@ Quit (Read error: Operation timed out)
[14:10] <andrei> sha: yes i have
[14:10] <andrei> however, it's not really clear how would I transition from one fs to another
[14:11] * schwarzenegro (~schwarzen@ Quit (Quit: I´ll be back)
[14:11] <jjgalvez> mark the osd as out, let the cluster become healthy, format the drive/partition to a new fs, then add it back to the cluster as a new osd
[14:11] <jjgalvez> ceph will backfill data to the osd when it comes back up
[14:11] <andrei> sha: i've read the benchmark results and there is a rather big difference between the xfs and btrfs
[14:12] <andrei> especially with small reads
[14:12] <sha> yes it is good idea
[14:12] <jjgalvez> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
[14:12] <andrei> jjgalvez: do I need to update the ceph.conf ? I do have an entry there to use xfs
[14:12] * joelio now has puppet-ceph in jenkins, lint checked and rspec'd for goodness - Time to extend that baby :)
[14:13] <andrei> or do I update ceph.conf once i've transitioned to btrfs?
[14:13] <tnt> andrei: btrfs is also a whole lot less stable afaik.
[14:14] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[14:14] <andrei> tnt: do you mean the fs itself is less stable?
[14:14] <jjgalvez> yeah, we still don't recommend using btrfs in production, xfs is the recommended fs
[14:14] <andrei> or do you mean ceph + btrfs is less stable?
[14:14] <andrei> coz i've seen a lot of distros adding btrfs
[14:15] <andrei> and they usually don't do that unless there has been a lot of tests done
[14:15] <tnt> andrei: afaik ceph uses some advanced features of btrfs to have those better perf and those aren't the most stable
[14:15] <sha> fs btrfs is less stable
[14:15] <tnt> and of course you need a much more recent kernel for btrfs like 3.8/3.9 or so.
[14:15] <joao> tnt, what's up?
[14:16] <andrei> i do have the 3.8 kernel on my servers, so that shouldn't be an issue
[14:16] <andrei> has anyone here had any negative experience with ceph + btrfs?
[14:16] <jjgalvez> andrei: either way, if you feel like experimenting, I'd say go ahead and modify the ceph.conf when you remove the osd and put the new details back in place when you add it
[14:17] <andrei> I do intend to use this for production after i've finished with testing, so I guess i should leave xfs for the time being
[14:17] <tnt> joao: I had the same weird behavior as last night where the leader went into sync and essentially took down the cluster because the peons never took over. And nothing was really going on the cluster at that time http://pastebin.com/raw.php?i=BWQzKKbv
[14:17] <andrei> just need to enable the ssd journaling for now
[14:17] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has left #ceph
[14:18] <tnt> joao: mmm, actually the mon.b thinks he took over ...
[14:19] <joao> tnt, what about the other monitor?
[14:19] <tnt> there is interesting stuff, let me pastebin it.
[14:19] <sha> same picture i guess
[14:20] <sha> or something about quorum on other mon
[14:21] * LeaChim (~LeaChim@ has joined #ceph
[14:21] <tnt> joao: mon.b http://pastebin.com/raw.php?i=QMGEwFPb
[14:22] <tnt> so it became leader ... for all of 10 sec ... then peon, but mon.a wasn't leader.
[14:22] <sha> tnt: - no quorum
[14:22] <tnt> yeah, the question is why did it get in that state with all 3 mon processes running.
[14:23] <joao> tnt, can you pastebin the contents of the other monitor's (mon.c?) logs around that time?
[14:23] <tnt> yup, doing that atm
[14:23] <sha> we have same picture after removing 1 mon....
[14:24] <joao> sha, ?
[14:24] <tnt> joao: mon.c http://pastebin.com/raw.php?i=7dbsTKZ6
[14:25] <tnt> at ~ 11:52 is when I restarted all mons to get it working again.
[14:25] <tnt> (people screaming so I had to just restart ASAP :p)
[14:26] * sha (~kvirc@ Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[14:27] <joao> (unrelated: funny how the data_health stats show the store growing some 10-20MB per minute)
[14:28] <joao> tnt, these log levels don't really provide much insight on what's causing this behavior :\
[14:29] <joao> I'll try to figure it out once I look over a couple of standing logs regarding other bugs
[14:29] <mrjack> joao: i let the mon which is not in quorum try to synchronize for 12 hours, but still the mon has not joined quorum...
[14:29] <joao> mrjack, still synchronizing?
[14:29] <mrjack> jep
[14:29] <joao> do you have logs for it?
[14:29] <mrjack> yes, but there is nothing interesting in the logs:
[14:30] <mrjack> mon.2@0(synchronizing sync( requester state chunks )).data_health(0) update_stats avail 89% total 476774544 used 24278936 avail 428276824
[14:30] <mrjack> ... a dozen of lines...
[14:30] <mrjack> store.db is now 3.1G
[14:31] <mrjack> i see from leveldb LOG that it is still compacting etc.. but i feel like that monitor is unable to keep up with the other two (the other two are on ssd, the one not in quorum has normal sata disks only - i am still on cuttlefish 0.61.2)
[14:32] <joao> mrjack, you should upgrade to 0.61.3 whenever you can
[14:32] <mrjack> hm? when was it released, i missed that?
[14:32] <joao> yesterday, I think
[14:32] <joao> or 2 days ago
[14:33] <mrjack> that was 0.63?
[14:33] <joao> oh
[14:33] <joao> yeah
[14:33] <joao> sorry
[14:33] <joao> there was a 3 in it
[14:33] <mrjack> yeah should i upgrade to 0.63? then it is not cuttlefish anymore?
[14:33] <joao> source of confusion
[14:33] <joao> no
[14:33] <joao> better wait for .3
[14:34] <mrjack> joao: i saw the patches etc for mon and rbd rm etc.. if that is all going to be backported i'll wait
[14:34] <joao> okay, so, any chance you can get me a log for that monitor with mon debug = 10?
[14:34] <mrjack> joao: or someone could point me to a howto for squeeze to build my own packages with cherry-picked patches?
[14:34] <mrjack> joao: i can
[14:36] <mrjack> joao: i think mon debug = 10 won't help much.. the monitor is just endlessly syncing
[14:37] <mrjack> joao: on the two surviving monitors, i have 1.2g store.db which does not compact to a smaller size... maybe something got messed up when the disk was full on mon.2 that corrupted the leveldb?
[14:37] <joao> debug mon = 10 should provide much needed insight to the code path and some debug messages on the sync stuff
[14:38] <tnt> joao: Can that be related to the going_to_bootstrap change ?
[14:38] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[14:38] <joao> mrjack, it's likely that you're being bit by #4895
[14:39] <joao> tnt, your issues?
[14:39] <tnt> joao: yes. Since I never saw it before deploying that hot-fix.
[14:40] <joao> tnt, I doubt it
[14:40] <mrjack> joao: well on my other cluster, i do compact on startup, and do daily restarts of the mons... there all will be fine after a few seconds.. but this cluster just seems to endless compact/sync/whatever..
[14:40] <mrjack> joao: always these two lines: 2013-05-30 12:38:41.209518 7f1995713700 10 mon.2@0(synchronizing sync( requester state chunks )) e1 handle_sync mon_sync( chunk bl 1043868 bytes last_key ( paxos,12366794 ) ) v1
[14:40] <mrjack> 2013-05-30 12:38:41.209530 7f1995713700 10 mon.2@0(synchronizing sync( requester state chunks )) e1 handle_sync_chunk mon_sync( chunk bl 1043868 bytes last_key ( paxos,12366794 ) ) v1
[14:40] <mrjack> endless.. ;)
[14:40] <joao> mrjack, yeah, I bet the whole problem is a bunch of paxos versions that need to be synchronized
[14:41] <joao> mrjack, dirty solution here is probably to restart the leader
[14:41] <joao> kill mon.2, restart the leader, restart mon.2 once quorum is formed
[14:42] <joao> that's the quick and dirty workaround for #4895
[14:44] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[14:56] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[14:56] * Vjarjadian (~IceChat77@ has joined #ceph
[14:56] * diegows (~diegows@ has joined #ceph
[14:56] * jeff-YF (~jeffyf@ip-64-134-70-136.public.wayport.net) Quit (Quit: jeff-YF)
[15:05] * virsibl (~virsibl@ Quit (Quit: KVIrc 4.1.3 Equilibrium http://www.kvirc.net/)
[15:08] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[15:08] <andrei> guys, I am just running some rados benchmarks on the cluster and I am seeing the following latency: Average Latency: 0.036261
[15:09] <andrei> is that 3.6ms?
[15:09] <andrei> or 0.36ms?
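Nobody answers andrei's unit question in the log. As far as I know, `rados bench` prints its latency figures in seconds, in which case neither reading is right. A minimal sketch of the conversion, under that assumption:

```python
# Assumption: rados bench reports latency in seconds.
average_latency_s = 0.036261  # value quoted above

# Convert seconds to milliseconds.
average_latency_ms = average_latency_s * 1000.0
print(f"{average_latency_ms:.3f} ms")  # prints "36.261 ms"
```

Under that assumption the figure is roughly 36 ms per operation, not 3.6 ms or 0.36 ms.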
[15:18] <andrei> Guys, could you please recommend me xfs formatting options which I should use if I plan to store kvm image files accessed via librados?
[15:19] <mrjack> joao: ok, but then i have no io until quorum is formed again?
[15:19] <joao> mrjack, that is right
[15:19] <mrjack> hmhm
[15:20] <joao> you could always try adding a new mon, but it's moot as your main issue here is that one of your already existing monitors is unable to sync
[15:20] <tnt> does IO completely stop when there is no quorum ? I thought currently connected client still worked for a couple minutes or so.
[15:20] <joao> probably due to all those paxos versions that have gone untrimmed
[15:20] <joao> tnt, yeah, I think that's right
[15:21] <mrjack> joao: i think it is a bad idea to let people still download and install 0.61.2 despite the known problems with leveldb and monitors - i would not speak of it as a stable version :(
[15:21] <andrei> mrjack: do you know when 0.63 is hitting ubuntu ppa?
[15:22] <andrei> i am currently on 0.61.2
[15:22] <joao> mrjack, we did have leveldb issues prior to cuttlefish's release, but the bulk of it has only been popping up now; and we've been fixing things as we've been able to close in on the issues
[15:22] <tnt> yeah, unless you have monitors on like SSD and the compaction doesn't trigger an election, you risk hitting 4895 easily.
[15:24] <mrjack> joao: i didn't want to offend anyone, you are doing great work, but i think when people download and install 0.61.2 and they have trouble, then it is bad for reputation of ceph, now?
[15:24] * tnt 's testing wip-5176-cuttlefish atm
[15:24] <mrjack> andrei: i don't know, i am no developer ;)
[15:25] <joao> mrjack, no offense taken really; just trying to share a bit of why things are this way atm
[15:25] <joao> mrjack, 0.61.2 has some much needed fixes; 0.61.3 will have a couple others too
[15:25] <tnt> hopefully 0.61.3 will soon be out. I mean 4895 is fixed essentially.
[15:26] <mrjack> tnt: that's the point,... i see 0.63 with all those fixes etc, but still no 0.61.3
[15:26] <tnt> I think maybe they're waiting for wip-5176 to get some testing to include it.
[15:27] <mrjack> hm maybe
[15:27] <mrjack> joao: ok i restarted all mons
[15:27] <mrjack> joao: the surviving ones have 1.1G store.db - does not compact any more
[15:28] <tnt> that seems a bit big.
[15:28] <mrjack> yes
[15:28] <mrjack> tnt: it happened when one monitor died because it ran out of disk
[15:28] <tnt> ah but it'll need to trim in addition.
[15:28] * jahkeup (~jahkeup@ has joined #ceph
[15:28] <mrjack> tnt: i set compact on startup, but it does not compact to smaller size than 1.1g
[15:28] * PerlStalker (~PerlStalk@ has joined #ceph
[15:29] <mrjack> and the third monitor still does sync, that never finishes and is unable to join quorum...
[15:29] <joao> mrjack, can you send the third monitor's log my way?
[15:29] <tnt> mrjack: there are two separate issues, one is the need for compaction because leveldb is badly behaved in some use cases. The other is ceph didn't trim (i.e. delete from leveldb) some old versions.
[15:29] <mrjack> tnt: yeah i know these two issues
[15:30] <mrjack> tnt: but i think something else bad happened to the monitors when one mon died with disk-full
[15:30] <joao> damn, gotta go afk for a bit :\
[15:30] <joao> be back as soon as possible
[15:31] <tnt> mrjack: ah yes, possible. When it first happened to me one of the mons was just bad and I deleted/recreated it. (luckily the other 2 were fine and so I had quorum)
[15:32] <tnt> mrjack: and they just don't start at all ?
[15:32] <tnt> none of them ?
[15:32] <mrjack> tnt: they all start
[15:32] <mrjack> tnt: get quorum... but the third monitor won't join quorum
[15:33] <mrjack> tnt: it is trying to sync endlessly...
[15:33] <tnt> mrjack: can't you just kill it and recreate ?
[15:33] <mrjack> tnt: i have tried several times
[15:33] <tnt> ah ok.
[15:33] <mrjack> i now see in the log
[15:33] <mrjack> 2013-05-30 13:32:54.167554 7fc6609de700 10 mon.2@0(synchronizing sync( requester state stop )) e1 handle_sync_finish_reply mon_sync( finish_reply ) v1
[15:33] <mrjack> so the sync seems finished?
[15:34] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[15:34] <mrjack> "quorum": [
[15:34] <mrjack> 1,
[15:34] <mrjack> 2],
[15:34] <mrjack> "outside_quorum": [],
[15:34] <mrjack> looks strange to me
[15:34] <tnt> well, honestly if you get to quorum I think it should trim down much smaller than 1G. And then it might be easier to sync.
[15:34] <mrjack> i cannot trim smaller
[15:35] <mrjack> i set mon compact on start = true
[15:35] <mrjack> and did /etc/init.d/ceph -a restart mon
[15:35] <mrjack> but store.db does not get smaller than 1.1G
[15:35] <mrjack> next problem is that it seems sync has finished, but mon gets no quorum
[15:35] <tnt> compact is not trimming.
[15:36] <mrjack> and from ceph mon_status it looks to me as if somehow the monitor id's got messed up
[15:36] <mrjack> it says quorum 1,2 but i think it should be 0,1
[15:36] <mrjack> because 2 is not running...
[15:36] <tnt> mrjack: can you shutdown the one that doesn't work. Restart the two others and set debug to mon 10 paxos 10, let it run for ~10 min and pastebin both logs ?
[15:36] <mrjack> and it does not show the third monitor outside quorum
[15:37] <tnt> heh ... that's why I named my mons 'a' 'b' 'c'
[15:38] <mrjack> when i setup this cluster i read somewhere that it is best to use id's instead of names to not get confused ;)
[15:38] <tnt> what's the debug command you typed to get the output above.
[15:38] <mrjack> on mon.0 if i tail the ceph-mon.0.log i see normal ceph log so something is wrong here
[15:38] <mrjack> ceph mon_status
[15:39] <tnt> can you pastebin the rest of the output ?
[15:40] <mrjack> tnt: http://pastebin.com/vbtHz1hw
[15:40] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[15:40] <tnt> mrjack: 1,2 only refers to index inside the mons array below I think.
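tnt's reading can be sketched as follows: the numbers in `"quorum"` are ranks, and each rank maps to a monitor name through the `mons` array of the monmap. The JSON below is a hypothetical, trimmed-down stand-in for `ceph mon_status` output (the addresses and the rank-to-name assignment are assumptions, not the contents of the pastebin). It is chosen to agree with the `mon.2@0` prefix in the logs above, which appears to mean "monitor named 2 at rank 0".

```python
import json

# Hypothetical mon_status excerpt; names and addresses are made up
# for illustration and are NOT taken from the pastebin above.
sample = """
{
  "quorum": [1, 2],
  "outside_quorum": [],
  "monmap": {
    "mons": [
      {"rank": 0, "name": "2", "addr": "192.0.2.10:6789/0"},
      {"rank": 1, "name": "0", "addr": "192.0.2.11:6789/0"},
      {"rank": 2, "name": "1", "addr": "192.0.2.12:6789/0"}
    ]
  }
}
"""

def quorum_names(status):
    """Translate the rank numbers in "quorum" into monitor names."""
    by_rank = {m["rank"]: m["name"] for m in status["monmap"]["mons"]}
    return [by_rank[rank] for rank in status["quorum"]]

status = json.loads(sample)
print(quorum_names(status))  # prints ['0', '1']
```

With numeric monitor names, a quorum of ranks 1,2 can therefore be exactly the monitors named 0 and 1, which would explain why the output looked wrong to mrjack.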
[15:40] <mrjack> tnt: is not within quorum
[15:40] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[15:40] <saaby> joao: could you use the logs+store dumps from yesterday's mon crashes?
[15:41] <mrjack> tnt: i enabled logging
[15:41] <mrjack> and debugging
[15:42] <mrjack> tnt: currently waiting for that 1.1g to sync again :(
[15:42] * redeemed (~redeemed@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[15:45] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Quit: office)
[15:46] <tnt> mrjack: what code are you running btw ?
[15:46] <redeemed> howdy. i have deployed ceph (cuttlefish, latest, on ubuntu 12.04), with cephx, and am trying to use CephFS for shared storage between OpenStack (grizzly 2013.1) compute nodes. i am able to create a file in the shared storage, after mounting on all the compute nodes, and add contents to the file; but as soon as i attempt to access the file from another node, the contents are wiped and "Operation not permitted" results from any c
[15:47] <redeemed> anyone ever encounter this? i have scoured the mailing list for ceph users and devl, as well as google, for about a week. drawing a blank!
[15:48] <tnt> redeemed: compute and storage node are physically distinct ?
[15:49] * eegiks (~quassel@2a01:e35:8a2c:b230:b593:6630:8f9d:771) Quit (Ping timeout: 480 seconds)
[15:49] <redeemed> tnt, yes sir.
[15:50] <redeemed> tnt, oh! this is on virtualbox. kernel is 3.5.0 #31
[15:50] <tnt> and the compute nodes have network access to the storage nodes (osds) ? They also have a ceph.conf with a keyring that has the required access permissions ?
[15:50] <tnt> redeemed: but they're running on physically different servers ?
[15:51] <redeemed> yes sir! verified they are using the proper keyring. if the access permissions were not proper would i be able to create a file and add content?
[15:51] <tnt> well, maybe cephfs has some sort of local cache and doesn't notice it can't write :p
[15:52] <tnt> (disclaimer: I'm not all that familiar with cephfs)
[15:52] <redeemed> txt, let me clarify further, they are on different virtual servers, same physical server.
[15:52] <redeemed> tnt*
[15:52] <redeemed> lol
[15:52] <tnt> if you umount and remount, are the files still there ?
[15:53] <tnt> Ah ... I explicitly asked about physical server for a reason :)
[15:53] <redeemed> txt, good good. what is the reason? hopefully that is something i need to understand
[15:53] <redeemed> tnt* geeze
[15:54] <redeemed> kernel is "3.5.0-31-generic #52" got my numbers confused
[15:54] <mrjack> tnt: 0.61.2
[15:55] <mrjack> tnt: on another cluster after a monitor restart i got:
[15:55] <mrjack> -4> 2013-05-30 15:54:19.268798 7f33ea2fe780 10 mon.0@-1(probing).log v16295968 check_subs
[15:55] <mrjack> -3> 2013-05-30 15:54:19.357839 7f33ea2fe780 10 mon.0@-1(probing).auth v9026 update_from_paxos
[15:55] <mrjack> -2> 2013-05-30 15:54:19.358100 7f33ea2fe780 10 mon.0@-1(probing).auth v9026 update_from_paxos version 9026 keys ver 0 latest 0
[15:55] <mrjack> -1> 2013-05-30 15:54:19.358112 7f33ea2fe780 10 mon.0@-1(probing).auth v9026 update_from_paxos key server version 0
[15:55] <mrjack> 0> 2013-05-30 15:54:19.358697 7f33ea2fe780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos()' thread 7f33ea2fe780 time 2013-05-30
[15:55] <mrjack> mon/AuthMonitor.cc: 147: FAILED assert(ret == 0)
[15:55] <tnt> Well ... I'm not sure, but when I was using xen I had some weirdness due to some virtualization optimization that didn't behave that well when one VM kernel triggered some action in another vm kernel.
[15:55] <mrjack> i am frustrated
[15:55] <mrjack> no free minute since 0.61.2
[15:55] <mrjack> :(
[15:55] <mrjack> the system failing and f*****g up all time...
[15:56] <redeemed> tnt, ok. thank you. i will continue with the advice you have given already :) time to graduate to physical hardware and do a few new test (you mentioned). gracias
[15:56] <redeemed> mrjack, i'm sorry to hear that, dude.
[15:56] * eegiks (~quassel@2a01:e35:8a2c:b230:5413:28c3:bb66:db34) has joined #ceph
[15:57] <tnt> mrjack: yeah, my last 2 weeks have been really busy as well ... but fortunately the cluster still stayed up most of the time, just required a lot of attention.
[15:58] <mrjack> my wife hates ceph
[15:59] <tnt> mrjack: that auth error is on another cluster ?
[15:59] <mrjack> yes
[15:59] <mrjack> tnt: i maintain three installations currently...
[15:59] <mrjack> tnt: the two i have on 0.61.2 don't let me sleep, eat, nor do any other normal activity in real life without interruption
[15:59] <mrjack> it is like hell :(
[16:00] <mrjack> i read that there are fixes for the issues
[16:00] <mrjack> but not backported...
[16:00] <tnt> I'd recommend deploying the wip-4895-ceph on at least the leader mon.
[16:00] <tnt> wip-4895-cuttlefish sorry
[16:00] <mrjack> tnt: i cannot build it, because of missing leveldb-dev etc etc
[16:01] <mrjack> tnt: i am on debian squeeze...
[16:01] <andrei> does anyone know if I use the disk partition for the osd journal (like osd journal = /dev/sda5) would ceph automatically create a journal and use it on server reboot without me having to mount /dev/sda5 manually?
[16:02] <mrjack> tnt: what makes things even worse... if there are monitor issues and no io because of dying mons etc, it makes my ocfs2 nodes fence themselves to death
[16:02] <tnt> mrjack: mmm, there is an autobuilder building .deb for all those branches.
[16:03] <mrjack> tnt: hmmm
[16:04] <mrjack> tnt: but that branches there.. they don't include all the fixes i'd like to have it included... ;)
[16:05] <tnt> well maybe not, but that's better than nothing :p and what about the cuttlefish branch ? It has everything that's finished and already ready for the future 0.61.3
[16:08] <mrjack> tnt: there seems to still be bugs in the monitor
[16:08] <tnt> mrjack: mmm, btw about the Auth thing seems to be a damaged store from what I can gather in the source
[16:08] <mrjack> tnt: yeah
[16:09] <mrjack> what really gives me headaches is that i cannot resurrect monitors anymore
[16:09] <mrjack> i deleted them
[16:09] <mrjack> recreated them
[16:09] <mrjack> they run a few seconds
[16:09] <mrjack> crash
[16:09] <tnt> mrjack: well I'm running it and there are still "issues" but as I said, it can keep the cluster alive well enough to be way more reliable than 0.61.2
[16:09] <mrjack> crash the leveldb
[16:09] <mrjack> ...
[16:10] <mrjack> 2013-05-30 16:07:46.139352 7f6fd14ea700 -1 mon/Monitor.h: In function 'void Monitor::SyncEntityImpl::sync_init()' thread 7f6fd14ea700 time 2013-05-30 16:07:46.13883
[16:10] <mrjack> mon/Monitor.h: 620: FAILED assert(synchronizer->has_next_chunk())
[16:10] <tnt> which cluster is that ?
[16:11] <mrjack> my prod cluster..
[16:11] <mrjack> there are 5 mons
[16:11] <mrjack> 3 are alive
[16:11] <mrjack> mon.0 is dead with br0ken store
[16:11] <mrjack> mon.2 is dead
[16:11] <mrjack> when i try to launch mon.2 it is endlessly trying to sync from mon.0
[16:11] <mrjack> which is also dead
[16:12] <mrjack> it does not choose another mon
[16:12] <tnt> you can remove mon.0 from the monmap altogether
[16:12] <tnt> or maybe removing it just from /etc/ceph/ceph.conf will suffice
[16:13] <mrjack> i'm now trying to recreate both down monitors
[16:13] <mrjack> health HEALTH_WARN 1 mons down, quorum 1,2,3,4 1,2,3,4
[16:13] <mrjack> hmhm
[16:13] <mrjack> 2013-05-30 16:13:38.301456 mon.0 [INF] mon.0@0 won leader election with quorum 0,1,2,3,4
[16:13] <mrjack> ...
[16:13] <mrjack> ok i hope it will run now for the next 3 hours
[16:14] <mrjack> my wife wants to go eat sushi...
[16:14] <mrjack> bbl
[16:19] * vata (~vata@2607:fad8:4:6:c93e:7f92:f3fb:7cd9) has joined #ceph
[16:25] <mrjack> ceph -s
[16:25] <mrjack> health HEALTH_OK
[16:25] <mrjack> rbd ls
[16:25] <mrjack> 2013-05-30 16:25:50.426564 7fdf6eb08700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption
[16:26] <mrjack> hmpfg!
[16:26] <tnt> it's not been 3h
[16:26] <mrjack> yes
[16:26] * diegows (~diegows@ Quit (Quit: Leaving)
[16:26] <mrjack> because ticket-system forced me to go back
[16:26] <mrjack> :(
[16:27] * diegows (~diegows@ has joined #ceph
[16:27] <tnt> :(
[16:27] <mrjack> rbd ls not working anymore
[16:27] <mrjack> i don't know why
[16:27] <mrjack> all mons up
[16:27] <mrjack> quorum
[16:27] <mrjack> :(
[16:30] <tnt> huh ... never seen that.
[16:30] <mrjack> i have never seen that, too
[16:30] <mrjack> and my wife hates me now
[16:31] <mrjack> are there stats on people committing suicide because of ceph failures?! ;)
[16:31] <tnt> not yet
[16:31] <tnt> you might soon be the first datapoint for divorce rate though ...
[16:32] <tnt> I guess somehow the auth info is not properly synced
[16:32] <mrjack> tnt: if that would be true, then why is there quorum?
[16:33] <tnt> I didn't say it was logical ... i mean obviously something is wrong here.
[16:33] * KindTwo (KindOne@h12.182.130.174.dynamic.ip.windstream.net) has joined #ceph
[16:33] <saaby> mrjack: are all your pg's active?
[16:33] <mrjack> yes
[16:34] <saaby> k
[16:34] <mrjack> and i can see there is io
[16:34] <mrjack> but unable to start new guests
[16:35] <saaby> can you: # rados lspools on that host ?
[16:35] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:36] <mrjack> yes
[16:36] * KindTwo is now known as KindOne
[16:36] <tnt> mrjack: does the mon log show anything ?
[16:36] <mrjack> node04:~# rados lspools
[16:36] <mrjack> data
[16:36] <mrjack> metadata
[16:36] <mrjack> rbd
[16:36] <saaby> ok
[16:37] <mrjack> nothing exciting in the logs
[16:37] <saaby> hm, I don't really know rbd
[16:38] <saaby> but ot looks as though your rados cluster is working (including auth)
[16:38] <saaby> it*
[16:38] <mrjack> yeah
[16:38] <mrjack> it is working i can see io
[16:38] <mrjack> and all VMs which are using rbd still work
[16:38] <tnt> mrjack: increase logs for mon and auth. (you can do that on-the-fly with injectargs)
[16:38] <mrjack> but unable to startup VMs
[16:39] <tnt> which is kind of typical of mon failing to assign a session key I think.
[16:41] <mrjack> hm
[16:41] <mrjack> i restarted and retried with logging enabled
[16:42] <mrjack> 2013-05-30 16:42:07.974605 7f0306400700 10 mon.0@0(leader).auth v9039 update_from_paxos
[16:42] <mrjack> 2013-05-30 16:42:07.974628 7f0306400700 10 mon.0@0(leader).auth v9039 auth
[16:42] <mrjack> 2013-05-30 16:42:08.359705 7f0305bff700 10 mon.0@0(leader).auth v9039 update_from_paxos
[16:42] <mrjack> ...
[16:42] <mrjack> nothing to fancy in the logs
[16:42] <mrjack> at least no errors
[16:44] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:44] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[16:44] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[16:44] <tnt> try with msg log set to 1 to see messages being exchanged
[16:45] <tnt> ceph mon tell 0 injectargs '--debug-ms 1'
[16:45] <tnt> and rbd ls just stopped working 30 min ago ? it worked before ?
[16:45] <mrjack> now i am unable to tell from the logoutput what is relevant
[16:45] <mrjack> tnt: yeah
[16:45] <mrjack> tnt: it worked before
[16:46] <mrjack> tnt: i have to restart all mons twice a day
[16:46] <mrjack> tnt: because of store.db size
[16:46] <mrjack> tnt: so when i restarted all mons about 12:00 all was working
[16:46] <tnt> mrjack: technically if you restart the leader, it will be sufficient.
[16:46] <mrjack> tnt: that is not true
[16:46] <mrjack> tnt: i have to restart all twice so that all store.db gets compacted
[16:47] <mrjack> tnt: this time, /etc/init.d/ceph -a restart mon killed mon.0 and mon.3
[16:47] <mrjack> these two mons were unable to start and asserted
[16:47] <tnt> that should be true. the leader will compact, and a few minutes (like 10 or less) later will trigger a trim which will trigger a compact on all mons.
[16:47] <mrjack> so i removed the mon directory, fetched monmap, monauth, recreated store and started again
[16:47] <mrjack> got quorum again
[16:47] <mrjack> but rbd not working anymore from the console
[16:48] <tnt> can you compare the keyring file between one of the mon you recreated and one of the other ?
[16:49] <mrjack> identical
[16:49] <tnt> also you can try to shutdown 0 and 3 again and see if rbd starts working again. (3 out of 5 should still be quorum).
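The "3 out of 5 should still be quorum" remark follows from the usual strict-majority rule for monitor quorums. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def has_quorum(total_mons, mons_up):
    """A quorum needs a strict majority of the monitors in the monmap."""
    needed = total_mons // 2 + 1
    return mons_up >= needed

print(has_quorum(5, 3))  # mon.0 and mon.3 stopped, three left -> True
print(has_quorum(5, 2))  # one more failure would lose quorum -> False
print(has_quorum(3, 2))  # the three-monitor cluster discussed earlier -> True
```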
[16:49] <joao> saaby, still around?
[16:50] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[16:50] <joao> mrjack, what are they asserting on?
[16:50] * KindTwo (~KindOne@h220.213.89.75.dynamic.ip.windstream.net) has joined #ceph
[16:50] <saaby> joao: yep
[16:50] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[16:50] <joao> saaby, the rebuilt monitor of which you provided me logs and store, did you inject a monmap into it?
[16:50] <saaby> yes
[16:50] * xmltok (~xmltok@pool101.bizrate.com) Quit ()
[16:50] <saaby> on creation
[16:51] <joao> from where did you obtain it?
[16:51] <mrjack> joao: look above i pasted
[16:51] <saaby> ceph getmonmap on one of the other mons
[16:51] <mrjack> joao what i did:
[16:52] <mrjack> 1. /etc/init.d/ceph stop mon
[16:52] <joao> mrjack, can you restart those monitors with mon debug = 20 and send the log my way please?
[16:52] <mrjack> 2. cd /data/ceph/ ; mv mon mon.b0rked6 ; mkdir mon; ceph-mon -i 0 --mkfs --monmap /.. --keyring /.. ; /etc/init.d/ceph start mon
[16:52] <joao> saaby, are you able to reobtain that map and drop it somewhere for me to download?
[16:53] <mrjack> joao: i cant
[16:53] <mrjack> joao: my wife
[16:53] <mrjack> help :(
[16:53] <mrjack> egna
[16:53] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:53] * KindTwo is now known as KindOne
[16:53] <saaby> joao: that rebuilt mon which couldn't start - I ended up deleting my test pool, and recreated it again with the exact same command and monmap/monkey. - after that it worked fint
[16:53] <joao> mrjack, okay; whenever you can, just poke me and I'll take a look
[16:53] <mrjack> should i kill it again and recreate it?
[16:54] <saaby> fint == fine*
[16:54] <joao> mrjack, if you could do that after grabbing some logs, that would be appreciated; it's your call though :)
[16:54] <joao> you should make sure however that you have a quorum before doing that
[16:55] <mrjack> joao: i have quorum
[16:55] <mrjack> ceph health
[16:55] <mrjack> HEALTH_OK
[16:55] <mrjack> rbd ls
[16:55] <mrjack> 2013-05-30 16:55:22.745694 7fce4c1b2700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption
[16:55] <mrjack> 2013-05-30 16:55:22.745706 7fce4c1b2700 0 -- >> pipe(0x7fce480012f0 sd=4 :42496 s=1 pgs=0 cs=0 l=1).failed verifying authorize reply
[16:55] <mrjack> so i now shut down mon.0
[16:56] <mrjack> health HEALTH_WARN 1 mons down, quorum 1,2,3,4 1,2,3,4
[16:56] <mrjack> rbd ls
[16:56] <mrjack> 2013-05-30 16:56:06.470477 7fbd9439d700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption
[16:56] <mrjack> 2013-05-30 16:56:06.470567 7fbd9439d700 0 -- >> pipe(0x1dbd090 sd=4 :42533 s=1 pgs=0 cs=0 l=1).failed verifying authorize reply
[16:56] <mrjack> still using mon.0?!?
[16:56] <joao> I don't know which one is mon.0
[16:56] <mrjack> the one i just stopped
[16:56] <mrjack>
[16:57] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[16:58] <joao> I'm not sure what can be causing that; my guess would be a misconfigured keyring?
[16:58] <joao> mrjack, try running that command as 'rbd ls -m IP:PORT', varying ip and port for each of your available monitors
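joao's suggestion amounts to pointing the client at one monitor at a time to see whether only some of them hand out bad auth replies. A small sketch that only builds the command lines, assuming hypothetical monitor addresses (replace them with the ones from your own monmap):

```python
import shlex

# Hypothetical monitor addresses; these are placeholders, not from the log.
monitors = ["192.0.2.10:6789", "192.0.2.11:6789", "192.0.2.12:6789"]

def rbd_ls_commands(mon_addrs):
    """Build one 'rbd ls -m ADDR' argument list per monitor address."""
    return [["rbd", "ls", "-m", addr] for addr in mon_addrs]

for cmd in rbd_ls_commands(monitors):
    # Printed rather than executed, so the sketch is safe to run anywhere.
    print(shlex.join(cmd))
```

On a real client each argument list could be passed to `subprocess.run`; comparing which monitors produce the cephx error would narrow down where the bad key material lives.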
[17:01] <mrjack> gna
[17:01] <mrjack> hell
[17:01] <mrjack> 2013-05-30 17:01:46.809867 mon.1 [INF] pgmap v16820370: 768 pgs: 298 active+clean, 14 stale+active+clean, 423 active+degraded, 33 incomplete; 1734 GB data, 3466 GB used, 2707 GB / 6359 GB avail; 11045B/s rd, 145KB/s wr, 33op/s; 233240/912995 degraded (25.547%)
[17:01] <mrjack> :/
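The numbers in that pgmap line are internally consistent, which is a quick sanity check worth doing when reading such output. A short sketch using only the figures pasted above:

```python
# Figures copied from the pgmap line above.
pg_states = {
    "active+clean": 298,
    "stale+active+clean": 14,
    "active+degraded": 423,
    "incomplete": 33,
}
total_pgs = 768
degraded_count, total_count = 233240, 912995

# The per-state counts should add up to the total number of PGs.
assert sum(pg_states.values()) == total_pgs

# The percentage is simply the ratio of the two counters.
degraded_pct = 100.0 * degraded_count / total_count
print(f"{degraded_pct:.3f}%")  # prints "25.547%"

# PGs in any state other than plain active+clean.
not_clean = total_pgs - pg_states["active+clean"]
print(not_clean)  # prints 470
```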
[17:02] <mrjack> now i have two ceph-osd processes in state D
[17:03] <mrjack> welll
[17:03] <mrjack> that FUCKED UP ALL
[17:03] <saaby> joao: do you want the current monmap - or the file used to recreate the mon?
[17:03] <joao> saaby, that would be nice
[17:04] <saaby> joao: both?
[17:04] <joao> yeah
[17:04] <saaby> ok
[17:04] <mrjack> top - 17:04:38 up 2 days, 16:11, 1 user, load average: 136.91, 102.28, 51.40
[17:04] <mrjack> ...
[17:05] <mrjack> [229873.044591] INFO: task ceph-osd:8949 blocked for more than 300 seconds.
[17:05] <mrjack> ocfs2 fenced 4 nodes
[17:05] <mrjack> well
[17:05] <mrjack> FUCK
[17:05] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:05] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[17:06] <mrjack> complete offline again :(
[17:06] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[17:07] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[17:07] <saaby> joao: http://www.saaby.com/files/monmap <- that is the file used to recreate the mon. - I just grabbed the current map, and that is identical to this one according to md5sum
[17:07] <joao> k thx
[17:08] <mrjack> joao: would ceph support fix those issues if i had a support contract?
[17:09] <joao> scuttlemonkey, ^ ?
[17:12] <mrjack> hm
[17:12] <mrjack> ok after rebooting all nodes
[17:12] <mrjack> rbd ls works again
[17:13] <saaby> :)
[17:13] <mrjack> yeah
[17:13] <mrjack> downtime again
[17:13] <mrjack> wife angry
[17:13] <mrjack> no holiday :/
[17:18] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[17:18] * ChanServ sets mode +v andreask
[17:18] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[17:19] <scuttlemonkey> joao: not quite sure what the issue is on a quick glance at backscroll
[17:20] <scuttlemonkey> mrjack: support would definitely help you through it to get you up and running...that's what we do :)
[17:20] <mrjack> scuttlemonkey: i can get it up and running by rebooting all nodes
[17:20] <mrjack> scuttlemonkey: what i would need is someone looking at the cluster when things go horribly wrong
[17:21] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[17:22] <scuttlemonkey> mrjack: yeah, crawling through logs and diving into clusters is (as I understand it) what support spends a lot of time doing :)
[17:22] <mrjack> scuttlemonkey: i think i am hitting a bug which only happens if you use ceph, rbd, ocfs2 ontop of rbd on the ceph-osd node...
[17:23] <scuttlemonkey> hmm
[17:23] <scuttlemonkey> it's possible
[17:23] <scuttlemonkey> I know almost nothing about ocfs2
[17:24] <scuttlemonkey> I remember a bunch of people complaining to me about it a few years ago...but never investigated further
[17:26] <tnt> mrjack: wait, you have a rbd kernel client on the same node as an osd ?
[17:26] <mrjack> yes
[17:27] <mrjack> inside a VM
[17:27] <tnt> what virtualization system ? And is it the same VM for OSD and the client ?
[17:28] <mrjack> kvm is using rbd, osd on physical host
[17:28] <mrjack> i read that this shouldn't be a problem?
[17:30] <tnt> huh ... wait you're using kvm with the kernel client ? so it's running inside the VM and you're not using the KVM built-in userspace rbd driver /
[17:30] <tnt> ?
[17:30] * The_Bishop (~bishop@2001:470:50b6:0:6dd1:495c:667:a5e6) Quit (Ping timeout: 480 seconds)
[17:30] <mrjack> no i use kvm with qemu librbd
[17:31] <tnt> ah ok. I see.
[17:32] <TiCPU> if my pool is size=3, and I have 6 OSD, and HEALTH is OK, if I kill 2 random OSD by force, is it normal I keep stale PGs ?
[17:33] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[17:33] <TiCPU> crushmap says they are all in the same rack too
[17:34] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[17:36] <tnt> depends on the crushmap
[17:36] <TiCPU> it looks like this # ceph osd tree : http://pastebin.com/eqdQRGiX
[17:38] * The_Bishop (~bishop@2001:470:50b6:0:c9df:6318:fdca:966) has joined #ceph
[17:41] * loicd (~loic@magenta.dachary.org) has joined #ceph
[17:48] * danieagle (~Daniel@ has joined #ceph
[17:48] <jgallard> loicd, http://dachary.org/?p=2009
[17:48] <jgallard> very nice thanks :)
[17:48] <jgallard> and thanks for the update
[17:50] * tnt (~tnt@212-166-48-236.win.be) Quit (Read error: Operation timed out)
[17:54] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[17:55] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[17:58] * aliguori (~anthony@ has joined #ceph
[18:02] * dty (~derek@proxy00.umiacs.umd.edu) has joined #ceph
[18:03] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[18:03] <loicd> jgallard: you're welcome
[18:05] * tnt (~tnt@ has joined #ceph
[18:07] * BillK (~BillK@124-148-124-185.dyn.iinet.net.au) Quit (Read error: Operation timed out)
[18:07] * eschnou (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:10] <andrei> wido: ping
[18:14] <andrei> does anyone use ceph + kvm ? apart from wido who seems to be away ?
[18:15] <joao> saaby, which version of ceph were you using when you hit those asserts?
[18:16] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[18:16] * eschnou (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[18:16] <scuttlemonkey> andrei: quite a few people are using ceph+kvm
[18:17] <joao> saaby, nevermind
[18:18] <andrei> I am having some issues with ceph + kvm
[18:18] <joelio> andrei: I do, heavily!
[18:18] <andrei> the errors that I am getting:
[18:18] <joelio> mainly with OpenNebula middleware
[18:18] <joao> saaby, ping me when you're around :)
[18:19] <joelio> andrei: pastebin them if large
[18:19] <andrei> http://ur1.ca/e3sqr
[18:19] <dty> is there a way to have ceph-deploy run more verbosely showing the commands it is running remotely through pushy? I am trying to track down a "No such file or directory" exception being thrown when running 'mon create'. Maybe I am missing something obvious in the stack trace but it is being obfuscated by the compiling of the sudo call
[18:20] <andrei> here is the error message
[18:20] <andrei> so, I can see that qemu can't connect to the monitor
[18:20] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[18:21] <andrei> however, qemu-img info command gives me the details about the image file
[18:22] <andrei> I am also able to run ceph -s from the kvm host and i can see ceph status
[18:22] <redeemed> dty, the only thing i know is add "-v" argument to your ceph-deploy command.
[18:24] * loicd (~loic@magenta.dachary.org) Quit (Ping timeout: 480 seconds)
[18:26] * Maskul (~Maskul@host-89-241-174-13.as13285.net) has joined #ceph
[18:27] * Tamil (~tamil@ has joined #ceph
[18:28] <andrei> anyone have an idea what's wrong?
[18:28] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Quit: Leaving)
[18:28] <andrei> i've compiled the latest version of libvirt and qemu from sources with rbd support
[18:29] <sagewk> tnt: ping!
[18:31] * loicd (~loic@magenta.dachary.org) has joined #ceph
[18:32] <tnt> sagewk: pong
[18:32] <sagewk> tnt: did you get a chance to try the mon wip branch by chance?
[18:32] <tnt> sagewk: yes, it's been running for a few hours now.
[18:33] <tnt> sagewk: the space usage seems bounded, it doesn't grow out of control like if you disable compact on trim.
[18:33] <tnt> it can get a bit bigger at times than previously but it gets back down progressively in a few minutes.
[18:34] <elder> sjust, sagewk, I updated my ceph tree and now I'm getting feature mismatch with OSDHASHPSPOOL missing.
[18:35] <sagewk> what kernel are you running?
[18:35] <elder> Looks like Sam just committed "pg_pool_t: enable FLAG_HASHPSPOOL by default" which is probably related.
[18:35] <elder> Master?
[18:35] <elder> Wait.
[18:35] <sagewk> it should be supported by current kernel tho
[18:35] <sjusthm> elder: yeah, should be supported in recent kernels/
[18:35] <sjusthm> ?
[18:35] <elder> Well, it's based on current testing.
[18:36] <elder> The error seems to indicate that feature is *missing* from the osd side.
[18:36] <elder> And present on the kernel osd client side.
[18:36] <tnt> sagewk: IO rate is lower, but not all that much. about 20-30% or so.
[18:36] <sagewk> that sounds about right
[18:37] <elder> Here comes a long line:
[18:37] <elder> 40134700 0 -- >> pipe(0xe65310 sd=3 :37577 pgs=0 cs=0 l=1).connect protocol feature mismatch, my 7fffff < peer 407fffff missing 40000000
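The "missing" value in the mismatch line above is just the set of feature bits the peer advertises that the local side does not. A minimal sketch of that arithmetic, using only the two masks quoted in the log; the function name is hypothetical and this is not Ceph's own code:

```python
# Feature masks copied from the log line above (hexadecimal).
my_features = 0x7fffff
peer_features = 0x407fffff


def missing_features(mine: int, peer: int) -> int:
    """Return the bits set in the peer's mask but not in ours."""
    return peer & ~mine


missing = missing_features(my_features, peer_features)
print(hex(missing))              # 0x40000000
print(missing.bit_length() - 1)  # 30, the index of the single missing bit
```

The result matches the "missing 40000000" in the log, a single bit, which elder reports as OSDHASHPSPOOL.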
[18:37] <sjusthm> hmm
[18:37] <elder> I can go back to an earlier version, just wanted to make sure it wasn't something that needed fixing.
[18:38] <sjusthm> how new are your osds?
[18:38] <elder> Oh, they are probably old. But I just started my ceph stuff.
[18:38] <elder> Let me see if it blows them away.
[18:39] <sjusthm> I mean the version
[18:39] <sjusthm> current master?
[18:39] <elder> Yes
[18:40] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[18:40] <tnt> sagewk: the good news though is that I haven't had any spurious election due to time out since I deployed it.
[18:40] <elder> Back in a bit.
[18:41] * davidzlap (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[18:42] <tnt> I'll be adding two osds to the cluster tonight like I did yesterday and see how it goes. Yesterday it yielded a lot of IO on the mon which caused various things to timeout and a bunch of issues ...
[18:43] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has joined #ceph
[18:48] * KippiX (~kippix@coquelicot-a.easter-eggs.com) Quit (Quit: leaving)
[18:53] * rturk-away is now known as rturk
[18:56] <elder> sjusthm, I'm going to just back off a few commits and make sure that makes the problem go away.
[18:56] <sjusthm> k
[18:56] * jahkeup (~jahkeup@ Quit (Ping timeout: 480 seconds)
[18:58] <scuttlemonkey> andrei: sry, got pulled away
[18:58] <scuttlemonkey> when do you get the error that you pasted?
[18:59] <andrei> scuttlemonkey: thanks for coming back
[18:59] <andrei> i get this error when I try to start the vm with rbd disk
[19:02] * eegiks (~quassel@2a01:e35:8a2c:b230:5413:28c3:bb66:db34) Quit (Ping timeout: 480 seconds)
[19:02] <andrei> this is the full qemu command: http://ur1.ca/e3taw
[19:03] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[19:04] <scuttlemonkey> getting a bit out of my depth on qemu/kvm stuff
[19:04] <scuttlemonkey> the extent of my knowledge is pretty much what's in our doc
[19:04] <scuttlemonkey> http://ceph.com/docs/master/rbd/libvirt/
[19:04] <scuttlemonkey> http://ceph.com/docs/master/rbd/qemu-rbd/
[19:05] <scuttlemonkey> andrei: and there was a blog entry a while back...a bit dated perhaps, but still some good tidbits
[19:05] <andrei> are you using ubuntu or centos?
[19:05] <scuttlemonkey> http://blog.bob.sh/2012/02/basic-ceph-storage-kvm-virtualisation.html
[19:06] <scuttlemonkey> I use ubuntu almost exclusively
[19:06] <scuttlemonkey> but I know there are folks out there using it on cent
[19:06] * dcasier (~dcasier@ Quit (Ping timeout: 480 seconds)
[19:07] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:12] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[19:15] * andrei (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[19:37] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has joined #ceph
[19:39] * terje__ (~joey@184-96-148-241.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[19:39] * terje- (~terje@184-96-148-241.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[19:40] * rturk is now known as rturk-away
[19:40] * rturk-away is now known as rturk
[19:44] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[19:46] <loicd> sjust I don't understand why the stats are overridden only if last_backfill.is_max() in https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L611 . I would very much appreciate a hint :-)
[19:49] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[19:51] * rturk is now known as rturk-away
[19:54] <sjusthm> loicd: during backfill, stats are "faked" to indicate only the backfilled portion
[19:54] <sjusthm> the primary then uses that to estimate backfill progress
[19:54] <loicd> oh.... that explains it :-)
[19:54] <sjusthm> and the number of remaining degraded objects
[19:54] <sjusthm> a second backfill_stats would probably have been clearer
[19:54] * Meths_ is now known as Meths
[19:55] * scuttlemonkey_ is now known as scuttlemonkey
[19:55] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:55] * ChanServ sets mode +v andreask
[19:57] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[19:59] * dcasier (~dcasier@ has joined #ceph
[20:01] * loicd gets the "unicorn" github page
[20:02] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:09] <loicd> sjust I now understand enough of the PG::merge_log implementation to write tests for it. I tend to think that such tests would only be slightly dependent on the API ( paste PGLog.h URL here when github is back ). There is a delicate logic behind the PG::proc_replica_log + PG::rewind_divergent_log + PG::merge_log functions. If I was to write tests for these three, it would capture the expected behavior, and I suspect it will require a significant amount of wo
[20:10] * dty (~derek@proxy00.umiacs.umd.edu) Quit (Read error: Operation timed out)
[20:10] <loicd> https://github.com/dachary/ceph/blob/d301f9791eef42ec198704bf85bb05222ffc1e8d/src/osd/PGLog.h#L351 ( github is back ;-)
[20:11] <sjusthm> yeah
[20:12] <sjusthm> testing that logic is definitely one of the main goals
[20:17] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[20:18] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[20:19] <saaby> joao: ping
[20:21] <loicd> sjusthm: if you still think defining the PGLog API should be done before writing tests, I'm happy to continue in this direction. I tried to better understand the merge_log & replica_log logic to propose a first API draft that makes sense. And in the process I realized that tests would not be completely useless if written at this stage. At least for the complex logic, not for data members accessors.
[20:22] <sjusthm> if you think it can be done reasonably, then it would be super valuable
[20:22] * eschnou (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:23] * terje- (~terje@63-154-132-97.mpls.qwest.net) has joined #ceph
[20:23] <loicd> I'll give it a shot to show what I have in mind. sjusthm. Writing code is better than explaining with words :-)
[20:24] <sjusthm> yup
[20:25] * mech422 (~guest@ has joined #ceph
[20:27] <loicd> sjusthm: Do you think https://github.com/dachary/ceph/commit/d301f9791eef42ec198704bf85bb05222ffc1e8d is a reasonable first step ? I would like to revise it according to your review before building on top of it ;-)
[20:28] * loicd meant https://github.com/ceph/ceph/pull/308
[20:28] <mech422> Hi all - is the problem with ceph-deploy choking on wheezy nodes that don't have ca-certificates installed known ? or should I bug it ?
[20:28] <sagewk> mech422: bug it please :)
[20:28] <mech422> kk
[20:28] <sagewk> mech422: isn't that a standard package you would normally have?
[20:28] <sagewk> or do you have it uninstalled for a reason?
[20:29] <mech422> i do a 'bare' install on servers, and only add in what I need
[20:29] <mech422> no 'tasksel' packages
[20:29] <sagewk> is it ceph or ceph-deploy that needs it?
[20:29] <mech422> btw - Say Hi to Russ K. for me... we worked together at olliance
[20:29] <sagewk> any objection to making it a Requires: for the ceph-deploy deb package?
[20:30] <mech422> umm - its the wget for adding the key that chokes
[20:30] <mech422> pushy.protocol.proxy.ExceptionProxy: Command 'wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key add -' returned non-zero exit status 2
[20:30] <sagewk> got it, ok. so ceph-deploy install should install it before running wget for the ceph keys.
[20:30] <sagewk> and wget for that matter :)
[20:30] <mech422> yeah
[20:30] <sagewk> cool, please bug it, thanks!
[20:30] <mech422> np :-)
[20:31] * terje- (~terje@63-154-132-97.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[20:32] * loicd going to grab a pizza down the street
[20:33] <sjusthm> loicd: going to do some brief testing and merge that pull request
[20:33] <sjusthm> that'll save you endless rebasing
[20:33] <loicd> sjusthm: much appreciated :-D
[20:34] <sjusthm> yep, I think it's the right first step
[20:39] <mech422> sagewk: btw - you might want to add 'lsb-release' for the same reason...
[20:39] * dcasier (~dcasier@ Quit (Read error: No route to host)
[20:49] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[20:49] * rturk-away is now known as rturk
[20:50] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has joined #ceph
[20:50] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has left #ceph
[20:55] * cking (~king@cpc3-craw6-2-0-cust180.croy.cable.virginmedia.com) has joined #ceph
[20:55] <sagewk> sjusthm: i forget, did you look at the leveldb compaction thing already? with the trim updates?
[20:55] <sagewk> sjusthm: https://github.com/ceph/ceph/pull/335
[20:55] <sjusthm> I looked at one of them, one sec
[20:55] <paravoid> so, what do you suggest to do with the slow peering issue?
[20:56] <paravoid> sjust said something about the patch not being well tested and to install it on a test cluster
[20:56] <paravoid> and while I could set a test cluster up, I'm not sure I can easily reproduce the issue
[20:56] <sjusthm> paravoid: ah
[20:56] <paravoid> I mean, it doesn't happen on your test clusters, does it :)
[20:56] <sjusthm> paravoid: bobtail?
[20:57] <paravoid> I'm running bobtail now, yes
[20:57] <sjusthm> k
[20:57] <paravoid> if cuttlefish fixes this, I'll upgrade
[20:57] <sjusthm> let me kick off a suite on it and get back to you tomorrow
[20:57] <paravoid> but you weren't sure about that either
[20:57] <sjusthm> paravoid: it does reduce the problem simply because the updates in question are cheaper
[20:57] <sjusthm> but avoiding them is obviously better still
[20:58] <paravoid> I've been waiting for cuttlefish to stabilize a bit
[20:58] <sjusthm> we may end up putting that patch in bobtail/cuttlefish/master
[20:58] <paravoid> also the slow peering issue is going to make upgrading a bit harder
[21:00] <paravoid> so what makes my cluster special in this regard?
[21:01] <sjusthm> not sure, this particular effect has been observed to a much smaller degree on other clusters
[21:01] * dxd828_ (~dxd828@host-92-24-117-118.ppp.as43234.net) has joined #ceph
[21:01] <paravoid> sage was surprised by my report, I'm guessing this behavior isn't normal
[21:01] <sjusthm> no, the other cluster I've seen this on was much less severe
[21:01] <sjusthm> assuming this is even the root cause of your slow peering
[21:02] <paravoid> how can I help track this down further?
[21:02] <sjusthm> we can try this branch once I've gotten it merged into bobtail tomorrow
[21:02] <sjusthm> going to start our qa suite on it once github lets me finish this push
[21:03] <paravoid> github's having issues
[21:03] <sjusthm> yeah
[21:04] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:04] * eegiks (~quassel@2a01:e35:8a2c:b230:8127:c1ba:673b:69e8) has joined #ceph
[21:06] * eschnou (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) Quit (Quit: Leaving)
[21:07] * Tamil (~tamil@ Quit (Quit: Leaving.)
[21:09] <sjusthm> sagewk: my only concern is that the compact_queue can grow without bound as the monitor outpaces the compaction thread
[21:11] <sagewk> sjusthm: hmm, it could merge adjacent ranges..
[21:11] <sjusthm> or just blast the queue and do a complete compaction if the queue gets too big
[21:12] <sagewk> or periodically do a wait if the queue is > some size
[21:13] <sjusthm> yeah, depends on whether you want to block the monitor
[21:13] <sagewk> by 'blast' you mean block and compact, or do an async full compaction?
[21:13] <sjusthm> in the compaction thread, if it sees that the queue is large, it can just empty it and do a full compaction
[21:13] <sjusthm> as a simple hack
[21:14] <sjusthm> which would degenerate to constantly doing full compactions if the monitor generates too much work
[21:14] * rturk is now known as rturk-away
[21:14] <sjusthm> not sure if that's actually desirable
[21:15] <sagewk> yeah.. the store can legitimately be very large, and compaction rewrites everything, right?
[21:15] <sjusthm> yeah
[21:15] <sjusthm> the real answer would be to combine ranges, I suppose
[21:15] <mrjack> re
[21:15] <mrjack> sushi was good
[21:15] <mrjack> :)
[21:16] <sagewk> shouldn't be too hard
[21:16] <sagewk> i'll do that
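The "combine ranges" idea discussed above can be sketched as a standard interval merge over the pending compaction queue. This is only an illustration under stated assumptions: ranges are modelled as hypothetical (start_key, end_key) string pairs, and nothing here reflects the actual monitor or leveldb code.

```python
from typing import List, Tuple

# Hypothetical representation of one queued compaction request.
Range = Tuple[str, str]  # (start_key, end_key)


def merge_ranges(queue: List[Range]) -> List[Range]:
    """Coalesce overlapping or touching key ranges so the queue stays short."""
    merged: List[Range] = []
    for start, end in sorted(queue):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it in place.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


pending = [("a", "c"), ("b", "f"), ("f", "h"), ("m", "p")]
print(merge_ranges(pending))  # [('a', 'h'), ('m', 'p')]
```

Merging keeps the amount of queued work proportional to the number of distinct key regions rather than to the number of trims, which is the unbounded-growth concern sjusthm raises.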
[21:18] <mrjack> i read a patch, librbd read from local replicas... this will improve performance?
[21:19] <mrjack> with this patch enabled, it won't be possible to ack writes to clients with only the master-replica written...
[21:20] * eschnou (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:20] <mrjack> as then a read from a local replica could deliver old data?
[21:21] <mech422> Hmm - odd - ceph-deploy mon create ran fine one time, then failed trying to set the ulimit on open files to 8192 the second time
[21:21] <joshd1> mrjack: it's only enabled for snapshots, so there's no danger of reading old data
[21:21] <mech422> not sure why it didn't fail the first time ?
[21:22] * joshd1 (~joshd@2607:f298:a:607:ac93:ff05:d54d:d7b4) Quit (Quit: Leaving.)
[21:22] * joshd (~joshd@2607:f298:a:607:ac93:ff05:d54d:d7b4) has joined #ceph
[21:24] <mrjack> joshd1 ic
[21:26] * Tamil (~tamil@ has joined #ceph
[21:30] * nhm (~nhm@ has joined #ceph
[21:30] * cking (~king@cpc3-craw6-2-0-cust180.croy.cable.virginmedia.com) Quit (Quit: It's BIOS Jim, but not as we know it..)
[21:30] * cking (~king@cpc3-craw6-2-0-cust180.croy.cable.virginmedia.com) has joined #ceph
[21:30] <mrjack> mech422: is ceph-mon already running?
[21:31] <mrjack> mech422: what is in the log?
[21:31] <mech422> mrjack: no - just did a reboot after increasing the limits and still gettin an error with :
[21:31] <mech422> failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i n11 --pid-file /var/run/ceph/mon.n11.pid -c /etc/ceph/ceph.conf '
[21:32] <mrjack> yeah, what is in the log? /var/log/ceph/ceph-mon.n11.log?
[21:32] <mrjack> mech422: it does not necessarily mean that there is a problem with ulimit -n 8192 :)
[21:32] * skm (~smiley@ has joined #ceph
[21:33] <mech422> 2013-05-30 12:30:42.138217 7f978fecc700 -1 asok(0x2ae0000) AdminSocket: request 'mon_status' not defined
[21:33] <mech422> 2013-05-30 12:30:42.186825 7f979402b780 0 mon.n11 does not exist in monmap, will attempt to join an existing cluster
[21:33] <mech422> 2013-05-30 12:30:42.187184 7f979402b780 -1 no public_addr or public_network specified, and mon.n11 not present in monmap or ceph.conf
[21:33] <skm> I just set up a 3 node cluster and this is the current cluster health:
[21:33] <skm> root@ceph1:/var/log/ceph# ceph -s
[21:33] <skm> health HEALTH_WARN 576 pgs stuck inactive; 576 pgs stuck unclean; mds cluster is degraded; 2/2 in osds are down
[21:33] <skm> monmap e1: 1 mons at {a=}, election epoch 1, quorum 0 a
[21:33] <skm> osdmap e13: 2 osds: 0 up, 2 in
[21:33] <skm> pgmap v14: 576 pgs: 576 creating; 0 bytes data, 0 KB used, 0 KB / 0 KB avail
[21:33] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) Quit (Remote host closed the connection)
[21:33] <skm> mdsmap e9: 1/1/1 up {0=a=up:replay}
[21:34] <skm> can anyone suggest what I should do next?
[21:34] <mrjack> mech422: i would check ceph.conf if there is a mon entry for that monitor
[21:34] * eternaleye (~eternaley@2002:3284:29cb::1) has joined #ceph
[21:34] <mrjack> skm: start osds?
[21:35] <mech422> ceph.conf only has the 'initial' monitor - I'm trying to run 'ceph-deploy mon create storage11' atm .. (along with the other mons)
[21:36] <mrjack> mech422: oh, i know nothing about ceph-deploy... sorry
[21:36] <mech422> thanks ...maybe I'll just go get lunch :-)
[21:37] * Cube (~Cube@ has joined #ceph
[21:37] <skm> they are already started
[21:38] * cking (~king@cpc3-craw6-2-0-cust180.croy.cable.virginmedia.com) Quit (Quit: It's BIOS Jim, but not as we know it..)
[21:39] <mrjack> skm: what is in the logs?
[21:39] <mrjack> skm: is anything in /var/log/ceph/ceph-osd*.log?
[21:40] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[21:41] <skm> http://pastebin.com/qPdxSTMc
[21:41] * nhm (~nhm@ Quit (Ping timeout: 480 seconds)
[21:43] * nhm (~nhm@ has joined #ceph
[21:44] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Quit: ChatZilla 0.9.90 [Firefox 19.0/2013021500])
[21:49] * eternaleye (~eternaley@2002:3284:29cb::1) Quit (Quit: ZNC - http://znc.in)
[21:50] * eternaleye (~eternaley@2002:3284:29cb::1) has joined #ceph
[21:53] * andrei (~andrei@host86-155-31-94.range86-155.btcentralplus.com) has joined #ceph
[21:55] * nhm (~nhm@ Quit (Ping timeout: 480 seconds)
[21:56] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[22:01] <mrjack> skm: hm has it worked before? what have you been doing to get ceph to that state?
[22:01] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[22:01] <mrjack> skm: your mon is running, can the osd hosts reach the mon?
[22:02] <skm> this is a new cluster
[22:02] <mrjack> skm: ceph-deploy?
[22:02] <skm> I was trying to follow the quick start guide....
[22:02] <mrjack> skm: sorry don't know about ceph-deploy
[22:03] <skm> i used mkcephfs
[22:03] <skm> I can try ceph-deploy
[22:04] <tnt> sagewk: Mmm, I just added an osd to the cluster and the leader mon is suddenly very busy ...
[22:04] * nhm (~nhm@ma22436d0.tmodns.net) has joined #ceph
[22:05] <sagewk> tnt: can you turn on logs and capture a snapshot of what is going on?
[22:05] <sagewk> ceph mon tell \* injectargs '--debug-ms 1 --debug-mon 20'
[22:06] <tnt> sagewk: yup. just injected it. I'll wait a minute or so. to collect some data
[22:12] <tnt> sagewk: http://ge.tt/8o1f85i/v/0
[22:12] <tnt> The log starts when I started the osd and it got added to the crushmap
[22:13] <sagewk> tnt: doesn't look like it got the debug options
[22:13] <tnt> scroll down
[22:13] * terje- (~terje@63-154-137-79.mpls.qwest.net) has joined #ceph
[22:13] <tnt> I enabled the option a few minutes after that
[22:14] <tnt> 2013-05-30 20:05:37.849798 7f8477ad8700 0 mon.a@0(leader) e1 handle_command mon_command(injectargs --debug-ms 1 --debug-mon 20 v 0) v1
[22:14] <sagewk> oh i see
[22:14] <sagewk> grr, the encode_full logic is broken, that's what's going on.
[22:17] <dmick> https://github.com/hutkev/wireshark-ceph just popped up in my google alerts
[22:18] <tnt> sagewk: well, I'm glad you see what's wrong at least :)
[22:18] <sagewk> i may have spoken too soon :/ can you add --debug-paxos 20 to that?
[22:19] <tnt> yup
[22:21] * terje- (~terje@63-154-137-79.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[22:22] <tnt> sagewk: http://ge.tt/3zQbB5i/v/0
[22:23] * terje_ (~joey@63-154-137-79.mpls.qwest.net) has joined #ceph
[22:25] * eternaleye (~eternaley@2002:3284:29cb::1) Quit (Read error: Connection reset by peer)
[22:26] * eternaleye (~eternaley@2002:3284:29cb::1) has joined #ceph
[22:31] * terje_ (~joey@63-154-137-79.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[22:32] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[22:33] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:33] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[22:33] * ChanServ sets mode +o scuttlemonkey
[22:34] * eternaleye (~eternaley@2002:3284:29cb::1) Quit (Remote host closed the connection)
[22:35] * eternaleye (~eternaley@2002:3284:29cb::1) has joined #ceph
[22:35] * nhm (~nhm@ma22436d0.tmodns.net) Quit (Read error: Connection reset by peer)
[22:35] <tnt> Damn, and IO on all the RBD disk seems to be pretty much frozen damnit.
[22:36] <mrjack> :/
[22:36] <mrjack> tnt i feel with you
[22:36] <sagewk> what does ceph pg stat say?
[22:37] <tnt> v14643306: 12808 pgs: 10814 active+clean, 298 active+remapped+wait_backfill, 175 active+degraded+wait_backfill, 1218 active+recovery_wait, 84 peering, 9 active+remapped+backfilling, 15 active+degraded, 158 active+degraded+remapped+wait_backfill, 10 remapped+peering, 27 active+recovering; 739 GB data, 1623 GB used, 2171 GB / 3795 GB avail; 3446B/s wr, 0op/s; 475295/2745100 degraded (17.314%)
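The status line above packs a lot into one string. A rough sketch of pulling the per-state counts and the degraded ratio out of it, assuming the text layout shown in the log; the regular expressions are ad hoc, not a supported ceph output format:

```python
import re

# Status text copied from the log line above.
status = (
    "12808 pgs: 10814 active+clean, 298 active+remapped+wait_backfill, "
    "175 active+degraded+wait_backfill, 1218 active+recovery_wait, 84 peering, "
    "9 active+remapped+backfilling, 15 active+degraded, "
    "158 active+degraded+remapped+wait_backfill, 10 remapped+peering, "
    "27 active+recovering; 739 GB data, 1623 GB used, 2171 GB / 3795 GB avail; "
    "3446B/s wr, 0op/s; 475295/2745100 degraded (17.314%)"
)

total = int(re.match(r"(\d+) pgs:", status).group(1))

# Everything between "pgs:" and the first ";" is the list of state counts.
states_part = status.split("pgs:")[1].split(";")[0]
states = {name: int(count)
          for count, name in re.findall(r"(\d+) ([a-z_+]+)", states_part)}

# PGs whose state includes "peering" cannot serve I/O until peering completes.
peering = sum(n for name, n in states.items() if "peering" in name)

degraded, objects = map(int, re.search(r"(\d+)/(\d+) degraded", status).groups())
pct = 100 * degraded / objects

print(sum(states.values()) == total)  # True
print(peering)                        # 94
print(round(pct, 3))                  # 17.314
```

The 94 PGs in a peering state are the ones sagewk suggests querying, since requests that map to them stay blocked until they go active.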
[22:37] <tnt> [WRN] 53 slow requests, 1 included below; oldest blocked for > 434.805111 secs
[22:38] <sagewk> you can query the peering pgs to see why they are blocked
[22:38] <sagewk> ceph pg dump | grep peering ; ceph pg <pgid> query
[22:38] <mrjack> tnt: i had a osd timeout a few minutes before you...
[22:38] * terje (~joey@63-154-137-79.mpls.qwest.net) has joined #ceph
[22:41] * danieagle (~Daniel@ Quit (Read error: Operation timed out)
[22:42] <tnt> sagewk: they actually seem to vary (i.e some peer and some unpeer). The CPU load on the OSD is also very high.
[22:42] <tnt> (as in using pretty much all CPU)
[22:43] * todin (tuxadero@kudu.in-berlin.de) Quit (Read error: Operation timed out)
[22:44] * eternaleye (~eternaley@2002:3284:29cb::1) Quit (Remote host closed the connection)
[22:45] * eternaleye (~eternaley@2002:3284:29cb::1) has joined #ceph
[22:45] <sagewk> are the osds logging?
[22:46] * terje (~joey@63-154-137-79.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[22:46] <tnt> they don't have any debug option enabled no
[22:50] <sagewk> what does the pg query say? or is it a rotating set of pgs that are in the peering state?
[22:51] <tnt> one of the physical servers just died completely (memory / cpu just exploded), have to restart it
[22:53] <dmick> I hate it when cpus explode. gold wire and ceramic chunks all in my wall
[22:55] * danieagle (~Daniel@ has joined #ceph
[22:55] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Bye!)
[22:56] * andrei (~andrei@host86-155-31-94.range86-155.btcentralplus.com) Quit (Read error: Operation timed out)
[22:59] * nhm (~nhm@ has joined #ceph
[23:02] * eegiks (~quassel@2a01:e35:8a2c:b230:8127:c1ba:673b:69e8) Quit (Ping timeout: 480 seconds)
[23:09] * eegiks (~quassel@2a01:e35:8a2c:b230:499:a2c0:7e4d:7601) has joined #ceph
[23:09] <sagewk> tnt: it is normal to see a few pgs transitioning through peering as backfill finishes
[23:09] <sagewk> but you shouldn't see requests stalled for that long. you can pick out which pg the stalled request is on and query that one to see what is going on
[23:10] <tnt> sagewk: well right now nothing is really working, one of the osd is refusing to come up (process runs but it never comes as 'up' in the osd tree). They all (mon & osd) are using high CPU and memory and IO ...
[23:11] * eschnou (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:12] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[23:19] <tnt> Ok, I think it's coming back alive ..
[23:19] <tnt> 2013-05-30 21:19:40.860099 mon.0 [INF] pgmap v14644727: 12808 pgs: 10803 active+clean, 448 active+remapped+wait_backfill, 77 active+degraded+wait_backfill, 1289 active+recovery_wait, 6 active+degraded+backfilling, 140 active+degraded+remapped+wait_backfill, 14 active+recovery_wait+remapped, 31 active+recovering; 742 GB data, 1763 GB used, 2308 GB / 4071 GB avail; 3867KB/s wr, 92op/s; 439051/2794272 degraded (15.713%); recovering 154 o/s, 49178KB/s
[23:20] <tnt> that's a much more reasonable status.
[23:20] <saaby> all active.. :)
[23:20] <sagewk> good to hear. sorry, limited amount i can do from this end to help over irc :(
[23:20] <tnt> But man... the amount of load generated on the cluster because I went from 14 to 15 OSD is huge ... something definitely wrong there.
[23:21] <sagewk> i don't think there should be much in the way of increased mon load, certainly
[23:23] * dxd828_ (~dxd828@host-92-24-117-118.ppp.as43234.net) Quit (Quit: Computer has gone to sleep.)
[23:23] <tnt> sagewk: did you see anything in the mon log that would indicate why it's using 100% of cpu ? (and a 3 times larger store and 3 times the IO load)
[23:24] <sagewk> nope.
[23:24] <sagewk> there is the pgmap rewrites that will go away in the next dev release or two, but nothing otherwise out of the ordinary
[23:25] * dxd828_ (~dxd828@host-92-24-117-118.ppp.as43234.net) has joined #ceph
[23:29] <saaby> sagewk: is it normal to have reached pgmap v1210865 already after a few weeks now?
[23:29] <sagewk> yeah
[23:29] <saaby> == a pgmap rewrite ~ every second
[23:29] <saaby> ok
[23:29] <sagewk> an incremental change is written, yeah.
[23:29] <saaby> right
[23:29] <sagewk> the whole map is currently rewritten every 100 iterations or something; that's the part we need to fix to be more efficient.
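As a quick sanity check on the numbers above: if one incremental pgmap version is written roughly every second, the version counter doubles as an approximate count of busy seconds. A small sketch, where the one-per-second rate and the 100-version full-rewrite interval are taken loosely from the conversation ("every 100 iterations or something") and should be read as assumptions:

```python
pgmap_version = 1210865          # from saaby's message
seconds_per_day = 24 * 60 * 60

# Assumption: about one new pgmap version per second while there is I/O.
days_of_activity = pgmap_version / seconds_per_day
print(round(days_of_activity, 1))  # 14.0

# Assumption: a full map rewrite roughly every 100 versions.
full_rewrites = pgmap_version // 100
print(full_rewrites)  # 12108
```

About fourteen days of continuous updates is consistent with a cluster that has been running for "a few weeks".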
[23:30] <saaby> ok, so a new map every second is good and normal, and how it should be?
[23:31] <sagewk> yeah
[23:31] <sagewk> unless there is no io on the cluster
[23:31] <saaby> right
[23:31] <lurbs> Or you hit 2^32-1 seconds since a cluster was built? ;)
[23:31] <sagewk> you can slow it down by changing the paxos_propose_interval to 3 or 5, but it may delay publishing of osdmaps in certain cases, and will make ceph -w output slower.
[23:31] <sagewk> 2^64
[23:31] <saaby> ok
[23:32] <saaby> as long as it doesn't pose a problem
[23:32] <lurbs> sagewk: That's no fun, no epoch time problem to look forward to.
[23:33] <sagewk> we'll have to make do
[23:33] <sagewk> the osdmap epoch is 2^32, but they don't update as often, and 32 bits is enough for a few decades iirc
[23:33] * terje- (~terje@63-154-145-97.mpls.qwest.net) has joined #ceph
[23:34] <saaby> I think I can live with having a deadline in 2^64 seconds.. :)
[23:34] * vata (~vata@2607:fad8:4:6:c93e:7f92:f3fb:7cd9) Quit (Quit: Leaving.)
[23:35] <saaby> and yeah, osdmap has only reached e63255 so far
[23:35] <tnt> sagewk: and how large is the map? I mean, 4 MB/s seems a lot, that'd mean the map is like 400 MB.
[23:35] <sagewk> osdmap is usually pretty small. few MB on a cluster with 1000s of osds
[23:39] <tnt> I meant pgmap, the one rewritten every 100 iterations.
[23:40] <sagewk> that one is much bigger. 10s of MB or more
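tnt's estimate above follows from multiplying the observed write rate by the rewrite interval. A back-of-the-envelope sketch, assuming one pgmap version per second and a full rewrite every 100 versions; the 40 MB map size is a hypothetical stand-in for sagewk's "10s of MB":

```python
write_rate_mb_per_s = 4          # tnt's observed mon write rate (about 4 MB/s)
rewrite_every_n_versions = 100   # "every 100 iterations or something"
seconds_per_version = 1          # assumption: one pgmap version per second

# Map size that would explain the write rate if full rewrites were the only writes.
implied_map_mb = write_rate_mb_per_s * rewrite_every_n_versions * seconds_per_version
print(implied_map_mb)  # 400

# Conversely, the rate a hypothetical 40 MB map would generate from full rewrites alone.
assumed_map_mb = 40
expected_rate = assumed_map_mb / (rewrite_every_n_versions * seconds_per_version)
print(expected_rate)  # 0.4
```

Under these assumptions, full rewrites of a map in the tens of megabytes account for only a fraction of the observed 4 MB/s, so the rest would have to come from other writes.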
[23:41] * terje- (~terje@63-154-145-97.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[23:49] * terje-_ (~terje@63-154-145-97.mpls.qwest.net) has joined #ceph
[23:50] <mech422> ok..had a good lunch and feeling brave...gonna skip ceph-deploy and try to install manually
[23:51] * nhm (~nhm@ Quit (Ping timeout: 480 seconds)
[23:54] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[23:56] <elder> joshd I'm looking at the rbd/kernel.sh workunit
[23:56] <elder> If you do a "rbd snap rollback" is that an operation on the underlying image?
[23:56] <joshd> yes
[23:57] <joshd> it's effectively a write to every object
[23:57] * terje-_ (~terje@63-154-145-97.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[23:57] <elder> So it's making the image match that snapshot?
[23:57] <joshd> yes
[23:57] <elder> If that occurs, won't that cause an event to fire, forcing a refresh?
[23:58] <elder> The test script is forcing one using /sys/bus/rbd/.../refresh
[23:58] <joshd> it does
[23:58] <elder> OK
[23:58] <elder> I thought the manual one wasn't necessary.
[23:58] <mrjack> tnt: has your cluster recovered?
[23:58] <elder> (And apparently it is unnecessary)
[23:58] <joshd> yeah, it might have just been there to work around a bug in earlier kernels
[23:58] <elder> It seems a little dangerous to be doing a rollback on a mapped image.
[23:58] <elder> More than a little.
[23:59] <mrjack> ^^
[23:59] <elder> But this is just a test, after all.
[23:59] <sagewk> elder: the fs will probably corrupt immediately
[23:59] <tnt> mrjack: it's in progress ... data is moving in to the new OSD, but things are running fairly well.
[23:59] <elder> Right.
[23:59] <sagewk> although you can always rollback again..
[23:59] <elder> Repeatedly!
[23:59] <joshd> certainly it's unsafe if it's mounted
[23:59] <elder> (But by then the damage is done)
[23:59] <mrjack> tnt: do you have io again?
[23:59] <tnt> mrjack: yes, which is definitely nice.
[23:59] <elder> OK. I'm just fixing the script because it doesn't work now that snapshots don't have directories in /sysfs

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.