#ceph IRC Log


IRC Log for 2013-01-03

Timestamps are in GMT/BST.

[0:09] * miroslav (~miroslav@adsl-67-127-52-236.dsl.pltn13.pacbell.net) has joined #ceph
[0:11] <jks> 2013-01-03 00:11:29.405040 mon.0 265639 : [DBG] osd.0 reported failed by osd.2
[0:11] <jks> 2013-01-03 00:11:29.405213 mon.0 265640 : [DBG] osd.1 reported failed by osd.2
[0:11] <jks> 2013-01-03 00:11:30.963650 mon.0 265641 : [DBG] osd.2 reported failed by osd.1
[0:12] <jks> anyone seen stuff like this before?
[0:12] <jks> ceph is still running on those osds, and each osd is reporting failures of the other osds
[0:12] <jks> the network connection between the servers seems fine, so it is a bit odd
[0:13] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[0:13] * agh (~agh@www.nowhere-else.org) has joined #ceph
[0:14] <jks> "ceph status" reports HEALTH_OK.. and I can access the stored data without problems
[0:17] * slang (~slang@c-71-239-8-58.hsd1.il.comcast.net) Quit (Quit: slang)
[0:19] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[0:23] <dmick> jks: what version is this?
[0:24] <dmick> and are the machines very very busy, or thrashing? (it basically means the OSDs aren't answering their heartbeat messages fast enough for some reason)
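The heartbeat situation dmick describes can be checked from the monitor side; a hedged sketch (log path assumes a default install with the cluster name "ceph"):

```shell
# Overall state, and which osds the monitors currently consider up/in
ceph -s
ceph osd tree

# Failure reports like the ones jks pasted also land in the central
# cluster log on the monitor host
grep 'reported failed' /var/log/ceph/ceph.log | tail
```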
[0:28] * LeaChim (~LeaChim@b01bde88.bb.sky.com) Quit (Remote host closed the connection)
[0:31] * gucki (~smuxi@46-126-114-222.dynamic.hispeed.ch) has joined #ceph
[0:31] <gucki> hi
[0:31] <gucki> i've a big problem
[0:32] <gucki> ceph is reporting a full error, but the fullest disk is at 95% (due to a recovery)
[0:32] <gucki> how can i make ceph continue to work? so increase the full ratio?
[0:33] <sjust> gucki: increasing the full ratio could turn a small problem into a very big one
[0:33] <sjust> or a big problem into a very big one
[0:33] <sjust> ceph can't really function if any osds are actually full
[0:33] <sjust> hence the 95% write block
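A hedged sketch of confirming which osd tripped the full threshold sjust mentions (the exact health wording varies by version, and the data paths are whatever the cluster actually mounts):

```shell
# HEALTH_ERR with "full osd(s)" names the offender
ceph health detail

# Cross-check actual filesystem usage on the osd hosts
# (replace with your real osd data mount points)
df -h /var/lib/ceph/osd/*
```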
[0:34] <gucki> sjust: so what can i do?
[0:34] <gucki> sjust: my whole cluster is frozen atm :(
[0:34] <sjust> yes
[0:34] <sjust> are the other osds not also close to full?
[0:36] <gucki> no, but some are near their "near full" ratio
[0:36] <sjust> right
[0:36] <sjust> can you add more osds?
[0:36] <gucki> sjust: no :-(
[0:36] <sjust> will you be able to add more osds soon?
[0:37] <gucki> sjust: some disks are only 50% used...
[0:37] <sjust> oh, that's the real problem
[0:37] <gucki> sjust: the algorithm does not really seem to spread the data equally according to the weights..
[0:37] <nhm> gucki: how many OSD and PGs?
[0:37] <sjust> how many pgs do you have?
[0:38] <gucki> sjust: so i could add some more osds, but then only using the same disk
[0:38] <sjust> hang on
[0:38] <gucki> sjust: would it help if i reduce the weight of the full osd?
[0:38] <sjust> can you post the output of ceph osd tree?
[0:39] <gucki> sjust: yes, it's here http://pastie.org/5613461
[0:40] <gucki> how can i get the pg number?
[0:40] <sjust> ceph osd dump
[0:40] <gucki> i'm not sure..
[0:40] <gucki> http://pastie.org/5613467
[0:41] <sjust> gucki: your crushmap is odd, I think
[0:41] <sjust> ceph osd getcrushmap -o /tmp/map
[0:41] <gucki> sjust: why? because of the different weights?
[0:41] <gucki> sjust: the weight is set according to disk size...
[0:41] <gucki> sjust: i have some big disks and some small disks...that's why there's such huge difference
[0:42] <gucki> sjust: now one of the small disks is full, while some of the big disks only have 50% used :(
[0:42] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Quit: Leaving)
[0:42] <sjust> then crushtool -d /tmp/map
[0:42] <sjust> and post that
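sjust's two steps combined, as a sketch (file names are just the ones used above):

```shell
# Fetch the compiled (binary) crush map from the cluster...
ceph osd getcrushmap -o /tmp/map

# ...and decompile it to editable text; without -o, crushtool
# prints the decompiled map to stdout
crushtool -d /tmp/map -o /tmp/map.txt
```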
[0:42] <gucki> http://pastie.org/5613474
[0:44] <gucki> example of unequal disk usage: http://pastie.org/5613477
[0:44] <gucki> puh, more and more vms stop working :(
[0:44] <gucki> i really need to get it back up..
[0:44] <gucki> shit :(
[0:45] <gucki> sjust: as a fast measure
[0:45] <gucki> sjust: would it help if i shut down the osd and set "noout"?
[0:45] <gucki> sjust: then it should all work again, right?
[0:45] <gucki> sjust: then we could fix it with more time and calm :)
[0:46] <sjust> the weight of each osd is the size of the disk in TB?
[0:47] <gucki> sjust: yes
[0:48] <gucki> sjust: what do you think of my fast recovery idea? shut down the full osd?
[0:54] <gucki> sjust: osd.13 has 96% disk usage
[0:54] <gucki> sjust: osd.12 has 84%... but it's a much smaller disk ...
[0:55] <gucki> sjust: do you see any way we can get it back running fast?
[0:58] <gucki> sjust: here are all disk usages: http://pastie.org/5613513
[1:00] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[1:00] <gucki> sjust: are you still here? :)
[1:00] <sjust> sorry, one sec
[1:06] <gucki> sjust: can you let me know if i can shut down the full osd and everything will continue (in degraded mode?)? then we have more time to fix the real cause...
[1:07] <gucki> sjust: right now more and more ceph clients start to hang every minute, it's a real bad situation right now :(
[1:07] <gucki> sjust: and i'm worried some more nodes go down or ceph crashes :(
[1:08] <sjust> are the full ones also the small ones?
[1:09] <gucki> sjust: here are all disk usages: http://pastie.org/5613513
[1:09] <gucki> sjust: i'd set noout, so no redistribution would start
[1:10] <gucki> sjust: so yeah, 3 osds at 85% are also small and are likely to get overloaded
[1:10] <sjust> you should probably just mark noout and take the full one out
[1:10] <sjust> how large is the full one?
[1:11] <gucki> sjust: 160 gig, the smallest have 80 gig
[1:11] <gucki> ok i take it out now
[1:11] <sjust> yep
[1:12] <gucki> sjust: ah wait... if i take it out, wouldn't it trigger a rebalance?
[1:12] <sjust> yes, to the bigger disks, for the most part
[1:12] <gucki> sjust: shouldn't i just shut it down but leave it in?
[1:12] <sjust> you could do that too
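The two options being weighed here can be sketched as follows (osd id 13 taken from the conversation):

```shell
# Option 1: stop the daemon but leave the osd "in"; noout keeps the
# monitors from marking it out and triggering a rebalance
ceph osd set noout
# then stop the ceph-osd daemon for osd.13 on its host

# Option 2: mark it out explicitly, which *does* migrate its data
# to the remaining osds
ceph osd out 13

# Remember to clear the flag once there is capacity again
ceph osd unset noout
```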
[1:12] <gucki> sjust: the strange thing is... there was no disk "near full"
[1:12] <gucki> sjust: so it went directly from around 82% to 96%.. :(
[1:12] <sjust> what?
[1:13] <gucki> sjust: i had a disk failure around 2 hours ago. before that failure no osd was near full
[1:14] <gucki> sjust: when osd.11 crashed (due to a disk io fault), the rebalance started...and just after it finished the one disk had 96% usage :(
[1:14] <gucki> sjust: so i really wonder why the distribution is not equal to the weights?
[1:15] <gucki> sjust: ok, i took osd.13 out..now 6% degraded
[1:18] <gucki> sjust: it doesn't seem to help, the vms are still not responding
[1:19] <sjust> you may have to mark it out
[1:19] <gucki> sjust: what if another osd gets marked full because of this?
[1:19] <sjust> same thing
[1:20] <gucki> sjust: but i cannot shut down that many osds... then there might be some objects without any copies left
[1:20] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[1:23] <sjust> joao: is there something gucki needs to do to remove the full flag?
[1:24] <sjust> or sagewk: ^
[1:27] <gucki> sjust: is there any way i could alter the full threshold?
[1:27] <sjust> yes, but that would probably make things worse
[1:28] <gucki> sjust: why? no rebalance should start, so the cluster would not get fuller atm
[1:28] <sjust> your vms are writing
[1:28] <gucki> sjust: yes, but only a few megs
[1:28] <gucki> sjust: i know the data wont increase a lot within the next hours
[1:29] <gucki> sjust: so everything would at least work again and we have more time to redistribute the data
[1:31] * jlogan (~Thunderbi@2600:c00:3010:1:c1cc:53b1:28d6:4bc2) Quit (Ping timeout: 480 seconds)
[1:32] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[1:32] <sjust> ceph mon tell \* -- injectargs '--mon-osd-full-ratio=0.98'
[1:33] <sjust> that'll bump it to 0.98
[1:33] <sjust> you really don't want to leave it like that
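Per sjust's warning not to leave it like that, a hedged sketch of raising the ratio and then restoring it (0.95 assumed as the prior default):

```shell
# Emergency bump so blocked writes can proceed
ceph mon tell \* -- injectargs '--mon-osd-full-ratio=0.98'

# Restore the default as soon as space has been freed up
ceph mon tell \* -- injectargs '--mon-osd-full-ratio=0.95'
```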
[1:33] <gucki> sjust: ok, now something REALLY strange happened!
[1:33] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:33] <gucki> sjust: i restarted osd.13 (without doing anything)
[1:34] <gucki> sjust: and now disk usage went down to 66%!
[1:34] <sjust> ok
[1:34] <sjust> cool
[1:34] <gucki> sjust: but why?!
[1:35] <gucki> sjust: does ceph delete temporary files or something like this?
[1:35] <sjust> actually, yes
[1:35] <gucki> sjust: on startup?
[1:35] <sjust> sort of
[1:35] <gucki> sjust: but 50 gig of temporary files?
[1:35] <sjust> it would be really hard to generate more than a few megs of temps
[1:35] <sjust> mostly, the temps are actually temporary links
[1:36] <sjust> so 50 gigs is not really possible
[1:36] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[1:36] <sjust> unless something had recently caused a bunch of pgs to be removed from the osd
[1:36] <iggy> that would be a metric ass ton of links
[1:36] <sjust> iggy: indeed
[1:41] <gucki> sjust: well it's like i said. disk usage after restarting osd.13 is "/dev/sdd1 149G 99G 51G 66% /xfs-drive3"
[1:42] <gucki> sjust: before it was 96% as you can see from http://pastie.org/5613513
[1:42] <sjust> yeah
[1:42] <gucki> sjust: i really hope ceph didn't get messed up and now deleted a lot of data by accident.. :(
[1:42] <gucki> sjust: so far everything looks good
[1:45] <gucki> sjust: what can i do so ceph properly uses the weights/ disks? :)
[1:45] <gucki> sjust: otherwise i'll have a problem again soon i guess :(
[1:46] <joshd> gucki: are the logs for that osd going to its data disk or something?
[1:49] <gucki> joshd: no, the logs go to a different disk
[1:52] <gucki> sjust: ok, seems like i'll have a problem soon again
[1:52] <gucki> sjust: the disk is now again /dev/sdd1 149G 109G 41G 73% /xfs-drive3
[1:52] <gucki> sjust: so when recovery will continue like this, it'll be full in a few minutes again :(
[1:53] <gucki> sjust: guess i'd better update the weights now real quick?
[1:55] <gucki> sjust: 75% now :(
[2:03] * jlogan (~Thunderbi@ has joined #ceph
[2:22] <gucki> sjust: i have an idea why the distribution is that bad: when an osd goes down, ceph does not remove its weight from the host. so when the host has two osds each having 0.5, the host has 1.0. now one osd goes down, but the host still has 1.0, so in fact the remaining osd gets a weight of 1.0 :-(
[2:23] <gucki> sjust: can you confirm this...and this is not really intended, is it?
[2:24] * miroslav (~miroslav@adsl-67-127-52-236.dsl.pltn13.pacbell.net) Quit (Ping timeout: 480 seconds)
[2:26] * sagelap (~sage@2607:f298:a:607:60d2:4e79:ba9:158c) Quit (Ping timeout: 480 seconds)
[2:28] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[2:30] <sjust> gucki: that is true to an extent
[2:31] <gucki> sjust: should i file a bug report/ issue for it?
[2:31] <sjust> usually you don't want the weight on the inner nodes to change automatically
[2:31] <sjust> that would cause additional rebalancing
[2:31] <sjust> I might also be wrong about this
[2:32] <sjust> you will want to reduce the weight on the osds which are filling
[2:32] <sjust> gucki: also, the new weight would actually be less than 1.0 with the crush settings you have (from the example above)
[2:32] <sjust> in fact, the effective weight ends up close to 0.5
[2:33] <sjust> oh, no
[2:33] <sjust> I'm quite wrong
[2:33] <sjust> the weight does actually end up being 0.5
[2:33] <sjust> if crush chooses the down osd, it will retry from the top
[2:33] <sjust> with the settings you have
[2:36] <gucki> sjust: mh, ok... so you think it's ok? do you know why the disk usage is so different compared to the weights? there must be a bug for sure...?
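sjust's suggestion to reduce the weight on the osds which are filling would look roughly like this (the target weight 0.10 is only an illustration; the convention in this cluster is weight ≈ disk size in TB):

```shell
# Lower the crush weight so less data maps to the over-full osd;
# expect some data movement as a result
ceph osd crush reweight osd.13 0.10
```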
[2:37] * sagelap (~sage@172.sub-70-197-142.myvzw.com) has joined #ceph
[2:39] <sjust> I need to head out, I'll be on again in an hour or so
[2:40] <gucki> sjust: hopefully i'll be sleeping by this time ;). just waiting for recovery to finish. but i'll be back again tomorrow... ;)
[2:41] <sjust> real quick, is 'kvm1' where most of your data is?
[2:41] <sjust> gucki: ^
[2:56] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[2:56] <gucki> sjust: sry, yes
[2:58] <gucki> sjust: http://pastie.org/pastes/5613831/text
[2:59] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit ()
[3:11] * dpippenger (~riven@ Quit (Remote host closed the connection)
[3:14] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[3:16] * fzylogic (~fzylogic@ Quit (Quit: fzylogic)
[3:19] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:23] <dmick> gucki: he had to leave for the day, but may see this later in the day or tomorrow
[3:24] <gucki> dmick: ok :)
[3:25] <gucki> dmick: one last question: i removed the faulty osds using "ceph osd rm ..". now ceph osd tree shows them as state "DNE"
[3:25] <gucki> dmick: however their weight is still accounted in the host's weight, which is too high due to this. how can i remove them completely from the crushmap so the host weight is equal to the sum of the remaining osds?
[3:27] <joshd> you have to remove them from crush too, like 'ceph osd crush rm osd.5'
[3:30] <gucki> joshd: oh, i totally overlooked this :(. just did it, now a really huge rebalance started..
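joshd's point is that "ceph osd rm" alone leaves the entry (and its weight) in the crush map; a hedged sketch of the complete removal sequence (osd.5 is joshd's example id):

```shell
ceph osd out 5            # stop mapping data to it
# stop the ceph-osd daemon on its host, then:
ceph osd crush rm osd.5   # remove it, and its weight, from the crush map
ceph auth del osd.5       # drop its key
ceph osd rm 5             # finally remove the osd id itself
```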
[3:36] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:42] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[3:42] * agh (~agh@www.nowhere-else.org) has joined #ceph
[3:43] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[4:03] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[4:09] * jlogan (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[4:11] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:17] <paravoid> so, a single OSD died (that's a bug, but that's another story)
[4:17] <paravoid> on a cluster that's ~26% degraded and recovering
[4:18] <paravoid> now it's been peering for at least 20' and the cluster is completely unresponsive
[4:18] <paravoid> that is, radosgw traffic has completely ceased
[4:19] <paravoid> peering count doesn't go down, I wonder what it's waiting for
[4:20] <paravoid> (that's 0.56)
[4:22] <paravoid> ok it suddenly started peering and peered all 116 of them in like 3 seconds
[4:24] <paravoid> nope, up to 99 peering
[4:25] <paravoid> sigh.
[4:32] * nolan_ (~nolan@2001:470:1:41:20c:29ff:fe9a:60be) has joined #ceph
[4:33] <paravoid> tons of
[4:33] <paravoid> 2013-01-03 03:33:10.805948 7f40c6144700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f40b7927700' had timed out after 30
[4:33] <paravoid> 2013-01-03 03:33:10.805970 7f40c6144700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f40b8128700' had timed out after 30
[4:33] <paravoid> 2013-01-03 03:33:15.806044 7f40c6144700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f40b7927700' had timed out after 30
[4:33] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[4:33] * nolan (~nolan@2001:470:1:41:20c:29ff:fe9a:60be) Quit (Remote host closed the connection)
[4:33] * nolan_ is now known as nolan
[4:33] <paravoid> 2013-01-03 03:33:15.806066 7f40c6144700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f40b8128700' had timed out after 30
[4:34] * MooingLe1ur (~troy@phx-pnap.pinchaser.com) has joined #ceph
[4:34] * `10 (~10@juke.fm) Quit (Ping timeout: 480 seconds)
[4:34] * MooingLemur (~troy@phx-pnap.pinchaser.com) Quit (Read error: Connection reset by peer)
[4:34] <paravoid> which is in the same function as the one that produced the crash
[4:34] <paravoid> (hit suicide timeout)
[4:34] * agh (~agh@www.nowhere-else.org) Quit (Ping timeout: 480 seconds)
[4:34] * `10 (~10@juke.fm) has joined #ceph
[4:34] * iggy2 (~iggy@theiggy.com) has joined #ceph
[4:35] * iggy (~iggy@theiggy.com) Quit (Read error: Connection reset by peer)
[4:36] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * jtangwk (~Adium@2001:770:10:500:1cdd:1a59:92ca:c0f3) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * imjustmatthew (~imjustmat@pool-173-53-54-22.rcmdva.fios.verizon.net) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * tezra (~rolson@ Quit (resistance.oftc.net charm.oftc.net)
[4:36] * Kioob (~kioob@luuna.daevel.fr) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * morse (~morse@supercomputing.univpm.it) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * MK_FG (~MK_FG@00018720.user.oftc.net) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * lurbs (user@uber.geek.nz) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * dok (~dok@static-50-53-68-158.bvtn.or.frontiernet.net) Quit (resistance.oftc.net charm.oftc.net)
[4:36] * KindTwo (~KindOne@ has joined #ceph
[4:36] * KindOne (~KindOne@ Quit (Ping timeout: 480 seconds)
[4:36] * KindTwo is now known as KindOne
[4:37] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:37] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[4:37] * jtangwk (~Adium@2001:770:10:500:1cdd:1a59:92ca:c0f3) has joined #ceph
[4:37] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[4:37] * imjustmatthew (~imjustmat@pool-173-53-54-22.rcmdva.fios.verizon.net) has joined #ceph
[4:37] * tezra (~rolson@ has joined #ceph
[4:37] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[4:37] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) has joined #ceph
[4:37] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[4:37] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) has joined #ceph
[4:37] * MK_FG (~MK_FG@00018720.user.oftc.net) has joined #ceph
[4:37] * lurbs (user@uber.geek.nz) has joined #ceph
[4:37] * dok (~dok@static-50-53-68-158.bvtn.or.frontiernet.net) has joined #ceph
[4:40] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[4:40] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[4:44] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:45] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[4:52] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[5:08] * Cube (~Cube@pool-71-108-128-153.lsanca.dsl-w.verizon.net) has joined #ceph
[5:12] <paravoid> still getting heartbeat_map "had timed out after 30" and can't recover that osd
[5:12] <paravoid> I've tried writing to the filesystem and it looks fine
[5:12] <paravoid> no kernel messages about it either
[5:13] <nhm> paravoid: were you able to catch up with sjust at all?
[5:13] <paravoid> no
[5:14] <nhm> paravoid: He and Sage are probably the two guys that you want to talk to.
[5:15] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[5:17] <paravoid> thanks
[5:22] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[5:25] * Cube (~Cube@pool-71-108-128-153.lsanca.dsl-w.verizon.net) Quit (Quit: Leaving.)
[5:43] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[5:44] <sage> paravoid: in that case you should look for anything in dmesg about hung kernel tasks or syscalls; usually it's a kernel or disk problem. if not, you can attach with gdb and see what the filestore op threads are stuck doing.
[5:45] * iggy2 is now known as iggy
[5:46] <paravoid> it's not
[5:46] <paravoid> I already looked at that
[5:46] <paravoid> it's two threads in particular
[5:46] <paravoid> I think maybe they're deadlocked or something
[5:46] <paravoid> I was about to attach gdb, any hints on what to do there?
[5:47] <paravoid> sage: ^
[5:48] <sage> thr app all bt
[5:48] <sage> and pastebin the output?
[5:49] <paravoid> the timeout messages have the thread id
[5:49] <sage> yeah
[5:49] <paravoid> so I'm thinking to get a bt from just those
[5:49] <paravoid> I could get it for all of them, but are you sure you want to? :)
[5:49] <sage> yeah that is likely enough
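sage's gdb step spelled out as a sketch ("thr app all bt" abbreviates the second command; assumes a single ceph-osd process on the host):

```shell
# Attach to the stuck osd without killing it
gdb -p "$(pidof ceph-osd)"

# Then, at the (gdb) prompt:
#   thread apply all bt    # backtrace every thread
#   detach                 # let the daemon keep running
#   quit
```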
[5:50] <paravoid> sec, it committed suicide again
[5:53] <paravoid> http://pastebin.com/i7zsYSuN
[5:54] <paravoid> I did a dd oflag=direct on the filesystem in question, worked fine
[5:54] <paravoid> a ls -lR also worked
[5:54] <paravoid> no stuck processes or fs errors in dmesg
[5:55] <paravoid> the osd warns about "timed out after 30" then after a while commits suicide
[5:55] <paravoid> this has been going on for hours
[5:56] <paravoid> any ideas? :)
[5:57] <paravoid> sage: ^ (sorry, not sure if you prefer notifies or not)
[5:57] <sage> thinking...
[5:57] <sage> my question is whether it is really stuck or is making progress.
[5:58] <sage> after you restart it and it gets stuck, can you try ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_osd 20
[5:58] <sage> and see if it is spinning or making progress?
[5:58] <sage> you can also adjust the suicide timeout up to give it more of a chance to make progress.
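sage's suggestions as one hedged sketch (osd.15 substituted for NNN; the suicide-timeout option name is an assumption for this ceph version, so verify it with "config show" first):

```shell
A=/var/run/ceph/ceph-osd.15.asok

# Crank osd debugging up, and back down later; it is very noisy
ceph --admin-daemon $A config set debug_osd 20
ceph --admin-daemon $A config set debug_osd 0

# Give the op threads longer before the daemon aborts itself
# (option name assumed; check: ceph --admin-daemon $A config show)
ceph --admin-daemon $A config set osd_op_thread_suicide_timeout 300
```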
[5:58] <sage> just one stuck osd?
[5:59] <paravoid> yeah
[5:59] <sage> (and notifies are good, lest i get distracted :)
[6:00] <paravoid> note that the peering count as seen by ceph -w never decreases
[6:00] <paravoid> these are peering threads, right?
[6:00] <paravoid> or is this about peering a single pg?
[6:02] <paravoid> sage: lots of 2013-01-03 05:02:00.352139 7f32a6b34780 20 osd.15 pg_epoch: 15076 pg[3.232( v 13891'2369 (1602'1088,13891'2369] local-les=14017 n=1169 ec=14 les/c 14017/14017 15076/15076/14016) [] r=0 lpr=0 pi=10700-15075/48 (info mismatch, log(1602'1088,0'0]) (log bound mismatch, actual=[1602'1089,9702'2326]) lcod 0'0 mlcod 0'0 inactive] read_log 207332 9702'2327 (9702'2326) modify 8af88232/6451.17__shadow__x73f5SoxRFMLVlJfg40pYrMjVm1jkhN_1/head//3 by cli
[6:03] <sage> that's just loading the log on startup.. i'd turn debug down for a bit and then back up; it is very noisy and the logging makes it go way slower
[6:03] <paravoid> heh, seems so yeah
[6:03] <paravoid> it's past that now
[6:03] <sage> peering means the pg is sending messages between replicas to agree on the current state of the pg
[6:03] <paravoid> 2013-01-03 05:03:40.894603 7f32949b4700 20 osd.15 15165 get_map 11889 - loading and decoding 0x37b46700
[6:04] <paravoid> 2013-01-03 05:03:40.894612 7f32951b5700 10 osd.15 pg_epoch: 15094 pg[3.3153( v 13875'2451 (1602'1154,13875'2451] local-les=10845 n=1186 ec=14 les/c 10845/6929 10707/15093/10700) [32,5]/[5,32] r=-1 lpr=15093 pi=1904-15092/82 lcod 0'0 remapped NOTIFY] handle_advance_map [32,5]/[5,32]
[6:04] <paravoid> 2013-01-03 05:03:40.894644 7f32951b5700 10 osd.15 pg_epoch: 15095 pg[3.3153( v 13875'2451 (1602'1154,13875'2451] local-les=10845 n=1186 ec=14 les/c 10845/6929 10707/15093/10700) [32,5]/[5,32] r=-1 lpr=15093 pi=1904-15092/82 lcod 0'0 remapped NOTIFY] state<Reset>: Reset advmap
[6:04] <paravoid> 2013-01-03 05:03:40.894655 7f32951b5700 10 osd.15 pg_epoch: 15095 pg[3.3153( v 13875'2451 (1602'1154,13875'2451] local-les=10845 n=1186 ec=14 les/c 10845/6929 10707/15093/10700) [32,5]/[5,32] r=-1 lpr=15093 pi=1904-15092/82 lcod 0'0 remapped NOTIFY] _calc_past_interval_range: already have past intervals back to 6929
[6:04] <paravoid> 2013-01-03 05:03:40.894766 7f32949b4700 10 osd.15 15165 add_map_bl 11889 401869 bytes
[6:04] <paravoid> still startup?
[6:04] <sage> there are worker threads, but its all event-based, so the threads won't normally block
[6:04] <sage> oh.. that's it catching up on old osdmaps. has this osd been down for a long time?
[6:04] <paravoid> about 2h
[6:05] <paravoid> when it started crashing
[6:05] <sage> ceph osd dump 11889 | grep modified
[6:05] <sage> how old is it?
[6:05] <paravoid> it's been flapping, so it doesn't help much
[6:05] <sage> vs ceph osd dump | grep 15165
[6:05] <paravoid> no results
[6:05] <sage> er, ceph osd dump 15165 | grep modified
[6:06] <sage> sigh.. 'modifed'
[6:06] <sage> should probably fix that
[6:06] <paravoid> haha!
[6:06] <nhm> sage: :)
[6:06] <paravoid> root@ms-be1002:~# ceph osd dump 15165 | grep modifed
[6:06] <paravoid> modifed 2013-01-03 05:02:33.891761
[6:06] <paravoid> root@ms-be1002:~# ceph osd dump 11889 | grep modifed
[6:06] <paravoid> modifed 2013-01-02 19:16:43.760220
[6:06] <paravoid> strange
[6:06] <paravoid> btw, the cluster is degraded
[6:06] <paravoid> has been for a while
[6:07] <paravoid> I added two more boxes
[6:07] <sage> yeah.
[6:07] <paravoid> on monday iirc
[6:07] <paravoid> it's trying to catch up, not making particularly quick progress though
[6:08] <sage> is osd.15 up or down in the current osdmap? (ceph osd dump | grep osd.15\ )
[6:08] <paravoid> it's up
[6:08] <paravoid> until it aborts, then gets down again, then upstart starts it again and so on :)
[6:09] <paravoid> 2013-01-03 05:08:57.103506 mon.0 [INF] pgmap v590507: 16952 pgs: 1 active, 6643 active+clean, 6981 active+remapped+wait_backfill, 431 active+degraded+wait_backfill, 9 active+recovery_wait, 119 peering, 23 active+remapped, 75 active+remapped+backfilling, 1 active+degraded+backfilling, 1 stale+peering, 751 active+degraded+remapped+wait_backfill, 1107 active+recovery_wait+remapped, 9 active+recovery_wait+degraded, 713 remapped+peering, 29 incomplete, 13 activ
[6:09] <paravoid> fwiw
[6:09] <paravoid> the 119 peering figure is for osd.15
[6:09] <paravoid> I think :)
[6:10] <sage> ok, that's the main problem.. it shouldn't be marking itself up if it's behind on the maps. or else the flapping just makes things worse.
[6:10] <paravoid> aha
[6:10] <sage> did you generate a completeish log (since the log loading part)?
[6:11] <sage> between the two bits you pasted/
[6:11] <sage> ?
[6:11] <paravoid> 2013-01-03 05:07:52.040494 7f32949b4700 10 osd.15 pg_epoch: 14573 pg[3.ded( v 13392'2566 (1602'1241,13392'2566] local-les=10734 n=1229 ec=14 les/c 10734/6929 10706/14167/14167) [27,21]/[21,27] r=-1 lpr=14359 pi=1878-14166/41 lcod 0'0 inactive NOTIFY] _calc_past_interval_range: already have past intervals back to 6929
[6:11] <paravoid> 2013-01-03 05:07:52.040843 7f32951b5700 10 osd.15 15165 add_map_bl 13880 367757 bytes
[6:11] <paravoid> 2013-01-03 05:07:52.040965 7f32a31d1700 -1 *** Caught signal (Aborted) **
[6:12] <paravoid> that was for 2013-01-03 05:07:52.040460 7f32951b5700 20 osd.15 15165 get_map 13880 - loading and decoding 0x50493a80
[6:12] <paravoid> so, dies while catching up?
[6:12] <sage> can you post the full osd.log somewhere so i can take a closer look?
[6:12] <paravoid> sure
[6:12] <sage> it's supposed to do the hard work of processing old maps *before* it marks itself up.. but something is awry
[6:13] <paravoid> let me filter it with only that run
[6:13] <paravoid> it's currently 599M :-)
[6:17] <paravoid> bzipping
[6:18] <paravoid> btw, I'm guessing I can't just downgrade to 0.55.1, right?
[6:20] <paravoid> sage: http://noc.wikimedia.org/ceph-crash-debug.log.bz2
[6:20] <sage> paravoid: in this case you can
[6:20] <sage> no data type encoding changes
[6:20] <paravoid> I tried it and failed miserably
[6:20] <paravoid> it spews lots of errors on boot
[6:21] <sage> hmm, maybe we did rev something..
[6:21] <paravoid> and another thing, kinda unrelated
[6:22] <paravoid> all ops are stuck now
[6:22] <paravoid> since this has been going on
[6:22] <paravoid> 2013-01-03 05:20:28.830342 osd.34 [WRN] slow request 4463.977058 seconds old,
[6:22] <paravoid> etc.
[6:22] <paravoid> lots and lots of them
[6:22] <sage> that is probably just because of the peering pgs
[6:23] <paravoid> well, when I stop osd.15 there are no peering pgs anymore
[6:23] <paravoid> I think
[6:23] <paravoid> there are very few of them unfound
[6:24] <paravoid> got the log?
[6:24] * sagelap (~sage@172.sub-70-197-142.myvzw.com) Quit (Ping timeout: 480 seconds)
[6:24] <sage> that part is good ..
[6:25] <paravoid> hm?
[6:25] <jjgalvez> Hi guys, I'm getting a failed assert: https://gist.github.com/0a8dbb24a0a2b87cd37d when I try and start up the mds. This just started when I upgraded to 0.56 and was previously working. I've got the full log if anyone has a chance to take a look.
[6:26] * Cube (~Cube@pool-71-108-128-153.lsanca.dsl-w.verizon.net) has joined #ceph
[6:26] <sage> paravoid: hmm, what version are the other osds in this cluster running?
[6:26] <paravoid> all of them 0.56
[6:27] <paravoid> btw, there's a 0.55.1<->0.56 problem that I encountered while upgrading
[6:27] <paravoid> all of the 0.55.1 ones crashed at the same time when they peered with 0.56
[6:27] <nhm> paravoid/sage: good luck, hope you guys can track it down
[6:27] <sage> jj: this is an mds bug hopefully fixed by yan's patches, which went into 0.56.
[6:28] <paravoid> had to do an emergency upgrade to 0.56 to all of them to make it work again
[6:28] <sage> jj: but the problem is likely now in the on-disk data, introduced some time ago
[6:28] <paravoid> maybe that's why I can't downgrade osd.15 to 0.55
[6:28] <sage> paravoid: can you post the stack trace in a bug in the tracker?
[6:28] <nhm> time for me to get to bed.
[6:28] <paravoid> which one?
[6:29] <sage> the upgrade problem
[6:29] <paravoid> the osd.15 one or the 0.55->0.56 one?
[6:29] <sage> 0.55->0.56 one
[6:29] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[6:30] <sage> ok, i see the osd.15 problem. nontrivial fix. if you can just leave that one down for now, that will be better
[6:30] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[6:30] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit ()
[6:30] <paravoid> oops :)
[6:31] <paravoid> well, I don't mind having one less OSD
[6:31] <paravoid> however as I said all ops are stuck
[6:31] <sage> they should unstick with osd.15 down...?
[6:31] <sage> if you keep it down i mean, and it isn't allowed to uselessly thrash?
[6:32] <paravoid> yeah that's what I mean too
[6:32] <paravoid> I tried it
[6:32] <sage> what is the pg status at that point?
[6:32] <paravoid> at one point it had no peering pgs
[6:33] <paravoid> 2013-01-03 05:32:45.086559 mon.0 [INF] pgmap v591391: 16952 pgs: 6717 active+clean, 6911 active+remapped+wait_backfill, 516 active+degraded+wait_backfill, 9 active+recovery_wait, 1 active+recovering+remapped, 36 active+remapped, 164 down+peering, 82 active+remapped+backfilling, 170 active+degraded, 2 active+degraded+backfilling, 1143 active+degraded+remapped+wait_backfill, 1103 active+recovery_wait+remapped, 9 active+recovery_wait+degraded, 11 active+degra
[6:33] <paravoid> that's how it is now
[6:33] <paravoid> and getting better
[6:34] <paravoid> it's been about 12' since I stopped osd.15
[6:34] * Cube (~Cube@pool-71-108-128-153.lsanca.dsl-w.verizon.net) Quit (Ping timeout: 480 seconds)
[6:34] <sage> http://tracker.newdream.net/issues/3714
[6:34] <paravoid> 2013-01-03 05:34:32.464344 mon.0 [INF] pgmap v591484: 16952 pgs: 6736 active+clean, 6909 active+remapped+wait_backfill, 646 active+degraded+wait_backfill, 9 active+recovery_wait, 1 active+recovering+remapped, 1 peering, 21 active+remapped, 82 active+remapped+backfilling, 62 active+degraded+backfilling, 1185 active+degraded+remapped+wait_backfill, 1102 active+recovery_wait+remapped, 13 active+recovery_wait+degraded, 59 incomplete, 48 active+degraded+remappe
[6:35] <sage> ceph pg dump | grep down, and ceph pg <pgid> query on one of them to see who they need
[6:35] <sage> the down+peering are the ones to worry about
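sage's inspection loop as a sketch (pgid is a placeholder; the interesting fields sit near the bottom of the query output):

```shell
# List pgs that are stuck down or incomplete, and pick one to interrogate
ceph pg dump | grep -E 'down|incomplete'

# Query it; look for "down_osds_we_would_probe" and the recovery state
ceph pg <pgid> query
```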
[6:35] <paravoid> 2013-01-03 05:35:46.746078 mon.0 [INF] pgmap v591546: 16952 pgs: 6744 active+clean, 6907 active+remapped+wait_backfill, 644 active+degraded+wait_backfill, 9 active+recovery_wait, 1 active+recovering+remapped, 21 active+remapped, 79 active+remapped+backfilling, 64 active+degraded+backfilling, 1185 active+degraded+remapped+wait_backfill, 1102 active+recovery_wait+remapped, 13 active+recovery_wait+degraded, 59 incomplete, 46 active+degraded+remapped+backfilli
[6:36] <paravoid> no down ones anymore
[6:36] <paravoid> unfound otoh...
[6:36] <paravoid> operations still fail
[6:37] <paravoid> both new ones and "slow" old ones
[6:37] <sage> incomplete?
[6:38] <paravoid> sorry?
[6:38] <sage> what about the 'incomplete' pgs? ceph pg dump | grep incomplete, and query one of those
[6:38] <paravoid> oh you mean
[6:38] <paravoid> yeah
[6:40] <paravoid> 3.3d89 0 0 0 0 0 0 0 incomplete 2013-01-03 05:33:26.366281 0'0 15362'124 [17,29] [17,29] 0'0 2012-12-21 16:34:01.951393 0'0 0.000000
[6:40] <paravoid> what's the argument to "ceph pg query" supposed to be?
[6:40] <paravoid> 3.3d89 doesn't work
[6:40] <paravoid> does it need it in decimal?
[6:41] <paravoid> ah
[6:41] <paravoid> ceph pg <id> query
[6:41] <paravoid> my bad
[6:42] <paravoid> sage: what should I look for there?
[6:42] <paravoid> it shows two osds, none of them is 15
[6:42] <paravoid> same for up & acting
[6:42] <sage> near the bottom it should say what it is waiting for
[6:42] <sage> pastebin?
[6:43] <paravoid> sure :-)
[6:44] <paravoid> sage: http://pastebin.com/XQEw0iCt
[6:44] <paravoid> not my lucky day is it?
[6:44] <paravoid> down_osds_we_would_probe 15
[6:45] <sage> not so much. you can probably get away with marking osd.15 lost, though, given that it has been down for a while. let's check
[6:46] <sage> ceph osd dump 771 | grep modifed
[6:46] <sage> is that before or after osd.15 first went down?
[6:46] <paravoid> btw, it hasn't been down for too long, just about 3h now
[6:46] <paravoid> checking
[6:46] <paravoid> modifed 2012-12-23 11:17:46.985453
[6:46] <paravoid> heh
[6:46] <sage> ok, that's not helpful :)
[6:47] <paravoid> where does 771 come from?
[6:47] <sage> osd.17 and osd.29 are up and happy?
[6:47] <paravoid> a past interval
[6:47] <paravoid> they are
[6:47] <paravoid> but that was just one pg
[6:47] <paravoid> I have more incomplete ones
[6:48] <paravoid> 59 to be exact
[6:48] <sage> first past interval. should be a lower bound on the past osdmap intervals we aren't sure about yet. strange that it is so far back in time.. presumably the cluster hasn't been degraded for that long?
[6:48] <paravoid> no
[6:48] <sage> maybe check a few more, they are probably also waiting for osd.15
[6:48] <paravoid> it has been degraded since monday
[6:48] <paravoid> (12/31)
[6:48] <sage> oh, do ceph osd dump 10705 | grep modifed
[6:48] <sage> "last_epoch_started": 10705,
[6:49] <paravoid> modifed 2013-01-02 15:16:55.447864
[6:49] <sage> so if that is *after* when you know osd.15 was not being useful (couldn't have been processing writes), then it's safe to mark down.
[6:50] <sage> it was the only osd freaking out?
[6:50] <paravoid> that date is before osd.15 started to freak out
[6:51] <paravoid> and yes, it's the only one so far
[6:51] <paravoid> although I'm not inspired with confidence about the rest :-)
[6:52] <paravoid> 2013-01-03 02:44:29.684485 7f22df90f700 -1 *** Caught signal (Aborted) **
[6:52] <paravoid> osd.15's first crash
[6:52] <sage> well, if you can wait until we fix the map startup thing, that's ideal. or, you can risk doing 'ceph osd lost 15', which could in theory mean that an acked write is lost.
[6:53] <janos> woohoo, wrote my first python ever. to dump out all my osd sizes, how much used/avail/ and % full
[6:53] <paravoid> heh
[6:54] <paravoid> sucks to be me
[6:54] <janos> hey i have to start somewhere
[6:54] <janos> paravoid: sounds sucky
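janos's first-Python script might look something like this; a sketch under assumed inputs (the per-osd kb totals would come from the osd stat section of `ceph pg dump`; the numbers here are made up):

```python
def osd_usage_report(osd_stats):
    """osd_stats: dict of osd id -> (kb_total, kb_used, kb_avail).

    Returns one formatted line per osd with size, used, avail and % full.
    """
    lines = []
    for osd, (kb, used, avail) in sorted(osd_stats.items()):
        pct = 100.0 * used / kb if kb else 0.0
        lines.append("osd.%d  size=%dKB used=%dKB avail=%dKB  %.1f%% full"
                     % (osd, kb, used, avail, pct))
    return lines

# made-up stats for two osds: 1 GB each, half and a quarter full
stats = {0: (1048576, 524288, 524288), 1: (1048576, 262144, 786432)}
for line in osd_usage_report(stats):
    print(line)
```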
[6:54] <jjgalvez> sage: thanks I found the bug in the tracker and the code commit on github. Is there anything that can be done to bring the mds back up once the problem is in the disk data? I don't see any info on getting around the problem.
[6:55] <sage> paravoid: may be a quick fix, working on it now.
[6:55] <paravoid> oh wow
[6:55] <sage> jj: probably need a hex editor. or patch up the mds code to silently tolerate those decrements maybe.. might work
[6:56] <paravoid> sage: so, question. why is osd.15 so behind? is it because of the cluster being degraded?
[6:56] <paravoid> the issue you're working on is going to fix osd.15's boot, but why did it crash in the first place
[6:56] <sage> i suspect most of the degradation is from osd.15 flapping. once it fell behind it dug a deeper hole. not sure how it got behind to begin with
[6:57] <sage> good question, we haven't looked at that yet :)
[6:57] <paravoid> the initial crash was also from a heartbeat
[6:57] <paravoid> I've seen a bunch of those today
[6:57] <paravoid> after 0.56
[6:57] <paravoid> much earlier though
[6:57] * dmick is now known as dmick_away
[6:57] <paravoid> when I was upgrading 0.56
[6:57] <paravoid> several hours passed, then out of the blue osd.15 died :)
[6:58] <paravoid> the cluster was degraded even before 0.56, as it's rebalancing from me adding two more boxes
[7:05] <paravoid> sage: 0.55->0.56 issue: http://tracker.newdream.net/issues/3715
[7:05] <sage> paravoid: great, thanks!
[7:06] <paravoid> thank you for all the help
[7:08] <paravoid> sage: final question: what do you mean by "lost data"? entire objects, contents, both?
[7:08] <paravoid> I don't mind losing a few objects as this will be detected in another layer
[7:08] <paravoid> corrupted data on the other hand...
[7:09] <sage> if you tell the cluster that osd.15 is lost, it won't wait to query it to see what changes it may have. so if at some point osd.15 *was* able to process a write, and no other replicas survived, then that write would be lost (either the update or object creation or deletion).
[7:10] <sage> so lost in that case might mean an object's contents warp back in time, or the deletion didn't happen, or the creation didn't happen.
[7:10] <sage> hold off on that, though, i think this fix will work.
[7:10] <paravoid> oh heh :)
[7:12] <sage> also i think that by default the cluster now refuses to process writes unless 2 of 3 replicas are up, so if everyone else is healthy it should be ok. want to confirm that though
[7:12] <paravoid> that pool only has 2 replicas
[7:13] <paravoid> for now, I'm planning to increase it soon
[7:13] <sage> k
[7:13] <paravoid> I added two boxes, I was waiting for it to recover and then increase replicas
[7:13] <paravoid> and add more boxes before or after that
[7:14] <paravoid> I'm at 48 osds, about to make it to 144 within the month
[7:15] <sage> nice
[7:16] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[7:16] <paravoid> if it lets me :)
[7:16] <sage> sjustlaptop: i have a patch for you to review :)
[7:16] <sjustlaptop> which one?
[7:16] <sage> i'll push it now
[7:17] <sage> wip-3714
[7:21] <paravoid> oh wow, is it 10pm there?
[7:21] <paravoid> sorry :-)
[7:21] <sage> where are you?
[7:21] <paravoid> Greece
[7:21] <paravoid> not much better here, it's 8am
[7:21] <sage> :)
[7:22] <sjustlaptop> hmm
[7:22] <sage> sjustlaptop: repushed. rebased on top of testing, with a proper commit msg.
[7:22] <sjustlaptop> why drain peering_wq in activate_map() ?
[7:23] <paravoid> sage: s/works/worse/
[7:23] <sage> we need the peering_wq to process the maps before we boot.
[7:23] <sage> we could slurp down all the maps and then do one huge wait before sending the boot msg, but that doesn't let us accomplish incremental work
[7:23] <sage> we could do it in caller instead
[7:24] <sage> paravoid's cluster is thousands of osdmaps behind, but the osd is marking itself up before it processes the maps (to update past_intervals etc.)
[7:26] * jlogan1 (~Thunderbi@2600:c00:3010:1:519e:fddc:b274:5689) has joined #ceph
[7:27] <paravoid> sage: going to appear in http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-3714 right?
[7:27] <sage> paravoid: yep
[7:27] <paravoid> should I?
[7:28] <sage> let's wait for sjustlaptop. i only did a very basic test on my box.
[7:28] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[7:28] <sjustlaptop> hang on, need to remember osd startup
[7:28] <sage> low risk change, but still
[7:29] <sjustlaptop> highest risk is a deadlock as far as I can tell, one sec
[7:29] <sage> yeah
[7:29] <sage> should rename a bunch of these methods and move some code around, but functionally it should be about right
[7:31] <paravoid> sage: are cluster operations halted because of those incomplete pgs?
[7:31] <sage> at least some of them are. hopefully all, but we haven't verified.
[7:31] <paravoid> is that normal?
[7:31] <paravoid> 59 pgs out of 16k
[7:32] <paravoid> and everything going down?
[7:32] <sage> that they are incomplete? more so for 2x, when there is thrashing.
[7:33] <paravoid> yeah, the pool has 16k pgs and ceph pg dump | grep -c incomplete is 59
[7:34] <sage> what's the cluster storing?
[7:34] * jlogan1 (~Thunderbi@2600:c00:3010:1:519e:fddc:b274:5689) Quit (Ping timeout: 480 seconds)
[7:35] <paravoid> radosgw
[7:35] <paravoid> images mostly
[7:37] <sjustlaptop> sage: ok, that looks good
[7:37] * sagelap (~sage@ has joined #ceph
[7:37] <sjustlaptop> sagelap: looks good
[7:37] <sage> paravoid: give it a go!
[7:37] <paravoid> upgrading :)
[7:38] <sjustlaptop> activate_map as it now stands should be split at the early return point into a notify_pgs and an activate_map probably
[7:38] <sage> sjustlaptop: pushing a nicer branch that refactors things a bit.
[7:38] <sjustlaptop> but that can definitely wait
[7:38] <sage> doing that :)
[7:38] <sjustlaptop> cool
[7:40] <paravoid> how does your release policy work?
[7:40] <paravoid> I saw master was merged into next
[7:40] <paravoid> is next going to be 0.57?
[7:40] <sjustlaptop> that's the notion
[7:40] <paravoid> is 0.56 bobtail? :)
[7:40] <sjustlaptop> it will be
[7:41] <sjustlaptop> probably 0.56.1 I guess
[7:41] <paravoid> okay, osd started
[7:41] * sagelap1 (~sage@ has joined #ceph
[7:41] <sage> it will take a long time to chew through those maps before it marks itself up
[7:41] <paravoid> yeah, got that much
[7:41] <sage> but once it marks itself up things should peer quickly
[7:42] <paravoid> doesn't seem to be marked as up yet, so looks good so far
[7:42] <sage> thanks sam!
[7:42] <paravoid> may I suggest having some nice messages that say that?
[7:42] <sage> you may :)
[7:42] <paravoid> so that other people won't wonder why their osd isn't marked up although they started it
[7:43] <sage> off to bed... good luck!
[7:43] <paravoid> heh
[7:43] <paravoid> thanks a lot
[7:43] <paravoid> really appreciated
[7:43] <paravoid> and sorry for the ungodly hour, didn't realize it until too late
[7:44] <sjustlaptop> paravoid: better to fix it now than after bobtail is released!
[7:44] <paravoid> something's wrong though
[7:44] <paravoid> peering pg count is up to 200 like before
[7:44] <paravoid> although osd is marked as down
[7:45] <paravoid> and still getting heartbeat timeouts
[7:45] <sjustlaptop> oh crud
[7:46] <sage> is it sending peering messages?
[7:46] <sjustlaptop> the osd already marked itself at the updated osd epoch
[7:46] <sjustlaptop> the superblock already reflects the new epoch
[7:46] <sjustlaptop> so it probably booted immediately
[7:46] <paravoid> peering is over now
[7:46] <paravoid> I'm also getting
[7:46] <paravoid> 2013-01-03 06:46:00.987594 7fa3d1b76780 0 monclient: wait_auth_rotating timed out after 30
[7:46] <paravoid> 2013-01-03 06:46:00.987653 7fa3d1b76780 -1 osd.15 15376 unable to obtain rotating service keys; retrying
[7:46] <paravoid> that I didn't before
[7:47] <paravoid> no peering pgs now, pgs still unfound
[7:47] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[7:47] <paravoid> and incomplete
[7:47] <sjustlaptop> is the osd in?
[7:47] <paravoid> it's down
[7:48] <paravoid> and spewing heartbeat timeouts like before
[7:48] <sjustlaptop> hmm
[7:48] <sjustlaptop> I was probably wrong before
[7:48] <sage> oh, we probably need to increase the thread timeout
[7:48] <sjustlaptop> it is still going to probably have each pg fully advance in a single workqueue item
[7:49] <sjustlaptop> you'll need to pump the heartbeat timeout way up
[7:49] <paravoid> okay
[7:49] <paravoid> and I'll stop
[7:49] <sage> 'osd op thread timeout = 0' to disable the timeout
[7:49] <paravoid> if it doesn't work, I'll ping you tomorrow
[7:49] <paravoid> I feel bad already :)
[7:50] <paravoid> thanks again
[7:51] <sage> sjustlaptop: i wonder if it wouldn't be good to fully flush the osdmaps through before we activate the first time anyway.
[7:51] <sage> there isn't an easy way to force it to do that first without making the map messages from the monitor very small
[7:52] <sjustlaptop> sage: in this case, the problem is merely that the workqueue is just grabbing the OSD's current map epoch and advancing the pg to that
[7:53] <sjustlaptop> we could, instead, have it advance up to a maximum number of maps and queue another null (without activating)
[7:53] <paravoid> 2013-01-03 06:53:44.017085 7f275dbfc700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f274f3df700' had timed out after 0
[7:53] <paravoid> haha
[7:53] <paravoid> I'll just put a very large value I guess :)
[7:54] <sage> or -1, let me check
[7:54] <sjustlaptop> sage: alternately, we could add a way for a running workqueue process to ping the heartbeat timeout
[7:55] <sage> paravoid: large number
[7:55] <sage> hmm
[7:56] <paravoid> i put 50400
[7:56] <paravoid> should be enough
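For reference, the setting paravoid ended up with, as a ceph.conf fragment (note that 0 did not disable the timeout here, contrary to the first suggestion; a very large value did the trick):

```ini
[osd]
    ; large value in seconds; 0 made the heartbeat map report
    ; "had timed out after 0" immediately
    osd op thread timeout = 50400
```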
[7:57] <sage> sjustlaptop: it's sort of moot. it's only behind because we weren't processing them as we go, but with the current fix that won't happen anymore
[7:57] <sjustlaptop> sage: yeah
[7:59] <sage> paravoid: if you see it immediately mark itself up again (it might), a workaround is to 'ceph osd set noup', start it up, and once it's chewing on things 'ceph osd unset noup'.
[7:59] <paravoid> I see it as down
[8:00] <sage> ok good
[8:00] <paravoid> the first time I noticed peering count going up to 200
[8:00] <paravoid> for a while
[8:00] <paravoid> so it might have done that
[8:00] <paravoid> thanks :-)
[8:05] <sage> paravoid: when you started moving to 0.56, did you upgrade any of the client-side code, or only one osd?
[8:06] <paravoid> oh yeah, forgot about that
[8:06] <paravoid> I think the crashes started when I upgraded radosgw
[8:06] <sage> ok that makes more sense.
[8:07] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:09] <paravoid> also, I'm pretty sure I had a memory leak bug in the monitor process with 0.55.1
[8:10] <paravoid> I'll know if it's in 0.56 tomorrow or so I guess :)
[8:10] <paravoid> happened only on the active mon
[8:10] <paravoid> and the leak was about 15G of RAM or so :)
[8:10] <sage> there is 1 known leak, but it is small
[8:11] <paravoid> small per what?
[8:11] <paravoid> maybe I had too many pgs? or peerings? or...?
[8:11] <paravoid> that multiplied the effect
[8:11] <sage> many pgs does consume more ram
[8:11] <sage> but it should be stable
[8:12] <paravoid> it isn't
[8:12] <sage> k
[8:12] <paravoid> and 15G RAM for 16k pgs sounds a bit excessive :-)
[8:12] <paravoid> but you know best
[8:12] <sage> a bit :)
[8:12] <paravoid> I'll get memory profiler stats if/when that happens again
[8:12] <sage> ok, now i'm really going to bed.
[8:12] <paravoid> bye!
[8:12] <sage> that'd be great!
[8:12] <sage> 'night
[8:12] <paravoid> thanks a bunch
[8:16] * Cube (~Cube@pool-71-108-128-153.lsanca.dsl-w.verizon.net) has joined #ceph
[8:20] * agh (~agh@www.nowhere-else.org) has joined #ceph
[8:22] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[8:23] <agh> Hello to all
[8:25] <agh> I've a question about hardware. Do you think a single-socket AMD host with 6 disks will be OK for XFS OSDs (6 OSDs, one per disk) ?
[8:28] * andret (~andre@pcandre.nine.ch) has joined #ceph
[8:29] * `gregorg` (~Greg@ Quit (Quit: Quitte)
[8:30] * low (~low@ has joined #ceph
[8:35] * Cube (~Cube@pool-71-108-128-153.lsanca.dsl-w.verizon.net) Quit (Quit: Leaving.)
[8:36] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:51] * The_Bishop (~bishop@e177089165.adsl.alicedsl.de) has joined #ceph
[9:04] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:13] * IceGuest_75 (~IceChat7@buerogw01.ispgateway.de) has joined #ceph
[9:16] <IceGuest_75> good mornin #ceph
[9:16] * IceGuest_75 is now known as norbi
[9:16] <norbi> new year, new crash, after upgrading ceph to 0.56
[9:36] * ScOut3R (~ScOut3R@ has joined #ceph
[9:46] * ninkotech (~duplo@ip-94-113-217-68.net.upcbroadband.cz) has joined #ceph
[9:47] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[9:47] * loicd (~loic@magenta.dachary.org) has joined #ceph
[9:47] <norbi> scout3r, have u upgraded from 0.55.1 to 0.56 ?
[9:48] <ScOut3R> norbi: i'm still running 0.48.2
[9:48] <norbi> oh ok :)
[9:53] <agh> Hello to all. I absolutely need a shared FS, like CephFS. But CephFS seems to be very unstable. Would you have any suggestion for doing a shared fs with Ceph RBD and another tool ? (GFS2, OCFS2, etc. ?)
[9:56] * richard (~richard@office2.argeweb.nl) has joined #ceph
[9:57] * richard (~richard@office2.argeweb.nl) Quit ()
[9:57] * foxhunt (~richard@office2.argeweb.nl) has joined #ceph
[10:01] <norbi> ceph is logging some things to /tmp/memlog, how can i disable this ?
[10:02] * Morg (d4438402@ircip1.mibbit.com) has joined #ceph
[10:04] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[10:12] * Leseb (~Leseb@ has joined #ceph
[10:22] * LeaChim (~LeaChim@b01bde88.bb.sky.com) has joined #ceph
[10:23] * low (~low@ Quit (Ping timeout: 480 seconds)
[10:28] <foxhunt> hello, i am trying to build a POC for using ceph, but i have a few little questions
[10:28] <foxhunt> i have hardware to set up a 3 node ceph cluster (with multiple osd per node)
[10:29] <foxhunt> i am trying to do the setup with puppet
[10:30] <foxhunt> i thought it would be nice to include the hostname in the osd setup like this
[10:30] <foxhunt> [osd.ceph-c1n1-01]
[10:30] <foxhunt> host = ceph-c1n1
[10:30] <foxhunt> [osd.ceph-c1n1-02]
[10:30] <foxhunt> host = ceph-c1n1
[10:30] <foxhunt> but that doesn't seem to work, can the id for an osd only be numeric?
[10:32] * jtangwk (~Adium@2001:770:10:500:1cdd:1a59:92ca:c0f3) Quit (Quit: Leaving.)
[10:33] <norbi> many scripts parse the ceph.conf; you can patch these files on your own :)
[10:33] <foxhunt> @norbi how do you mean?
[10:33] <cephalobot`> foxhunt: Error: "norbi" is not a valid command.
[10:34] <foxhunt> norbi: how do you mean?
[10:34] <norbi> one problem is /etc/init.d/ceph
[10:35] * jtangwk (~Adium@2001:770:10:500:1cdd:1a59:92ca:c0f3) has joined #ceph
[10:37] <norbi> and other scripts or /usr/lib64/ceph/ceph_common.sh
[10:37] * low (~low@ has joined #ceph
[10:37] <foxhunt> norbi: so it is better to use osd.numericid
[10:38] <norbi> yes. because "osd create" will generate a number for u if you want to expand the cluster
[10:38] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[10:40] <foxhunt> norbi: could i use something like 101, 102, 103, for node 1 and 201, 202, 203 for node 2, etc?
[10:42] <norbi> yes that works
[10:42] <norbi> but
[10:42] <norbi> if u expand the cluster later u will get the numbers 0, 1, 2 ...
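Putting the two pieces of advice together, a minimal ceph.conf sketch with numeric osd ids (the hostnames reuse foxhunt's examples; the layout is illustrative):

```ini
; numeric ids, as the init scripts and "ceph osd create" expect
[osd.0]
    host = ceph-c1n1
[osd.1]
    host = ceph-c1n1
[osd.2]
    host = ceph-c1n2
```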
[10:44] <foxhunt> which version do you advise for the POC, the 0.48 or the new 0.55?
[10:45] <norbi> 0.56 is the newest
[10:45] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[10:45] <norbi> i'm testing here with 0.56 and that seems to be stable
[10:57] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[10:57] * ScOut3R (~ScOut3R@ has joined #ceph
[11:00] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[11:00] <tnt> mmm, the topic needs updating.
[11:02] * jtangwk (~Adium@2001:770:10:500:1cdd:1a59:92ca:c0f3) Quit (Ping timeout: 480 seconds)
[11:02] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[11:03] * agh (~agh@www.nowhere-else.org) has joined #ceph
[11:04] <agh> Hello to all. I've been testing CephFS, but it's really not stable (unfortunately). So I'm going in another direction: using NFS, GFS2, OCFS2 or whatever on top of RBD to export a POSIX FS. Do you have any recommendation ?
[11:05] <tnt> well nfs is going to create a SPOF ... the nfs server. Not sure if it matters for you.
[11:05] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[11:06] <agh> tnt: yes... not good NFS.
[11:07] <agh> tnt: but the fact is I really need a distributed FS (not only block device). Have you ever heard of an installation of this type ?
[11:08] * ScOut3R (~ScOut3R@ has joined #ceph
[11:08] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[11:08] * ScOut3R_ (~ScOut3R@ has joined #ceph
[11:15] <tnt> agh: well lustre is pretty common, but that doesn't use ceph ... it also doesn't do redundancy on its own, you have to handle that yourself.
[11:15] <tnt> we just moved away from lustre to ceph, but we adapted the whole application to use object storage rather than posix fs.
[11:18] <agh> tnt: Yes, I love the ceph way, CephFS is great too but very unstable. So, is there software to build a CephFS-like fs on top of rbd ?
[11:18] <agh> even cLVM ?
[11:19] <agh> i gonna try OCFS2
[11:21] <tnt> I never used cLVM so can't say anything ...
[11:22] <agh> tnt: I'll look at it
[11:24] <darkfaded> agh: try to aim for the simplest possible solution. if you build a large stack of things running on things on top of things it'll fail too much
[11:24] * low (~low@ Quit (Read error: No route to host)
[11:26] * low (~low@ has joined #ceph
[11:28] * jtangwk (~Adium@2001:770:10:500:48e:a072:72ca:9727) has joined #ceph
[11:36] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[11:38] * jluis (~JL@ has joined #ceph
[11:38] * sagewk (~sage@2607:f298:a:607:3507:8d8b:f292:9944) Quit (Read error: Operation timed out)
[11:43] * joao (~JL@89-181-159-175.net.novis.pt) Quit (Ping timeout: 480 seconds)
[11:44] * joao (~JL@ has joined #ceph
[11:44] * ChanServ sets mode +o joao
[11:50] * jluis (~JL@ Quit (Ping timeout: 480 seconds)
[11:54] * sagewk (~sage@2607:f298:a:607:e5f3:fe46:26e2:2fe) has joined #ceph
[12:09] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:12] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[12:13] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[12:13] * loicd (~loic@magenta.dachary.org) has joined #ceph
[12:16] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:16] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[12:24] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:24] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[12:24] * loicd (~loic@magenta.dachary.org) has joined #ceph
[12:29] <jks> dmick: hmm, well they are not more busy than usually... It's a test system and it has been running the same test program for weeks... but the first time I saw these messages
[12:29] <jks> dmick: I'm running 0.55.1
[12:32] <agh> is there anyone using CephFS with success ?
[12:34] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[12:35] <norbi> yes :)
[12:37] <agh> norbi: can you tell me ?
[12:38] <norbi> what ? using ceph here with 4 servers, 21 OSDs, 3 MONs, 2 MDS and 3 clients mounting ceph with mount.ceph
[12:38] <norbi> 21tb
[12:38] <norbi> its only a test system, but it's stable
[12:38] <agh> norbi: ok, what sort of throughput do you get on CephFS ?
[12:39] <agh> norbi: (because, it's very unstable for me :'( )
[12:39] <agh> norbi: my MDS are always getting crashed or laggy
[12:39] * mtk (bDdbBhx3Rd@panix2.panix.com) has joined #ceph
[12:39] <norbi> u can send the crash log to the devel mailinglist ?
[12:40] <norbi> sort of throughput ?
[12:40] <agh> (sorry for my English... I'm French)
[12:41] <agh> yes, what do you do on your CephFS mount points ? Big files ? little files ? A lot of transfers ?
[12:41] <norbi> <-- germany :)
[12:41] <norbi> we are saving backup files
[12:41] <norbi> big files, about 5GB up to 500GB
[12:41] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[12:41] <agh> and you never have troubles with CephFS ?
[12:42] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[12:42] <norbi> and today we have 3 clients that have mounted ceph and are storing files into ceph at the same time. the 3 backup clients get the traffic from about 30 webservers
[12:42] <norbi> never :)
[12:43] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit ()
[12:43] <agh> norbi: i'm jealous
[12:43] <norbi> after version 0.48, before it was terrible
[12:43] <agh> norbi: do you have any specific configuration for Ceph ?
[12:43] <agh> norbi: i've the latest version (0.56)
[12:43] <norbi> no, not really
[12:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:44] <agh> norbi: could you send me your ceph.conf file ? (if you don't mind)
[12:45] <norbi> the only thing i have done, the osds are using SSDs for the journals
[12:45] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[12:46] <agh> norbi: ok. And by the way, what sort of hardware do you have for osd ?
[12:47] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:48] <norbi> http://pastebin.com/Ama3JKUt
[12:48] <norbi> the config
[12:48] <agh> norbi: thanks a lot
[12:49] <norbi> one host has LSI Raid controller and 8 disks
[12:49] <agh> norbi: why "max open files" ?
[12:49] <norbi> + 2 ssd via md-softwareraid1
[12:50] <norbi> another has only an mptsas controller and 3 disks, and the other one an lsi controller with 8 disks
[12:51] <agh> norbi: ok, thanks
[12:51] <norbi> mon clock drift allowed = 789
[12:51] <norbi> (max open files) don't know why i have configured this :)
[12:52] <norbi> have seen this long time ago
[12:53] <agh> ;)
[13:00] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:01] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[13:01] * loicd (~loic@magenta.dachary.org) has joined #ceph
[13:04] <tnt> agh: and you're using the latest kernel ?
[13:04] <tnt> (wrt to cephfs)
[13:14] * yehudasa (~yehudasa@2607:f298:a:607:b90e:a84b:2655:e11b) Quit (Read error: Operation timed out)
[13:14] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:20] * joao sets mode -o joao
[13:24] * maxiz (~pfliu@ has joined #ceph
[13:29] * yehudasa (~yehudasa@2607:f298:a:607:44f5:f47d:52c5:c82d) has joined #ceph
[13:48] * Morg (d4438402@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[13:51] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[13:51] * agh (~agh@www.nowhere-else.org) has joined #ceph
[13:51] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[13:51] * loicd (~loic@magenta.dachary.org) has joined #ceph
[13:54] <agh> tnt: sorry for the delay, yes, i use the last kernel
[13:54] <agh> tnt: (3.7)
[13:56] * cgm_tco (~cgm_tco@ has joined #ceph
[13:58] <cgm_tco> hello guys
[13:59] <cgm_tco> im looking for the bobtail packages for ubuntu
[14:00] <tnt> http://ceph.com/debian-testing/
[14:00] <tnt> although it's not 'bobtail' officially as I understand.
[14:09] * mtk (bDdbBhx3Rd@panix2.panix.com) Quit (Remote host closed the connection)
[14:09] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[14:09] <jtang> hmm
[14:10] <jtang> just reading the ceph blog, looks like the scrubber just got better
[14:10] <jtang> along with better fuse support
[14:10] <jtang> *nice*
[14:16] <foxhunt> hello, i restarted one of the nodes of the poc as a test to see what happened, and now i have the following health
[14:16] <foxhunt> HEALTH_WARN 291 pgs down; 291 pgs peering; 291 pgs stuck inactive; 291 pgs stuck unclean
[14:16] * cgm_tco (~cgm_tco@ Quit (Quit: irc2go)
[14:18] <norbi> u have forgotten to start the osds ? :)
[14:22] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[14:23] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[14:24] * andret (~andre@pcandre.nine.ch) has joined #ceph
[14:30] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:33] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[14:33] * loicd (~loic@magenta.dachary.org) has joined #ceph
[14:33] <foxhunt> all the ceph-osd are running
[14:34] <foxhunt> or do i need to use some ceph command?
[14:34] <norbi> "ceph -s" says that all OSDs are in ?
[14:35] <foxhunt> root@ceph-c1n1:/mnt/cephfs# ceph -s
[14:35] <foxhunt> health HEALTH_WARN 291 pgs down; 291 pgs peering; 291 pgs stuck inactive; 291 pgs stuck unclean
[14:35] <foxhunt> monmap e1: 3 mons at {ceph-c1n1=,ceph-c1n2=,ceph-c1n3=}, election epoch 30, quorum 0,1,2 ceph-c1n1,ceph-c1n2,ceph-c1n3
[14:35] <foxhunt> osdmap e115: 36 osds: 34 up, 34 in
[14:35] <foxhunt> pgmap v4825: 60096 pgs: 59805 active+clean, 291 down+peering; 17479 bytes data, 36824 MB used, 94931 GB / 94967 GB avail
[14:35] <foxhunt> mdsmap e16: 1/1/1 up {0=ceph-c1n3=up:active}, 1 up:standby
[14:35] <norbi> 36 osds: 34 up, 34 in
[14:35] <norbi> 2 OSDs are down
[14:35] <norbi> or not configured yet ?
[14:35] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[14:38] <foxhunt> found it, 2 osd down of another node
[14:38] <norbi> :)
[14:40] <foxhunt> but i found them by checking the osd processes on the nodes, weird that ceph health detail didn't mention them by name
[14:41] <norbi> u see it via "ceph osd tree"
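The gap foxhunt spotted ("36 osds: 34 up, 34 in") can also be checked mechanically by parsing the osdmap summary line from `ceph -s`; a small Python sketch (the regex is an assumption about the 0.5x output format):

```python
import re

def osd_counts(osdmap_line):
    """Parse '... N osds: U up, I in' and return (total, up, in_)."""
    m = re.search(r"(\d+) osds: (\d+) up, (\d+) in", osdmap_line)
    if not m:
        raise ValueError("no osdmap summary found in: %r" % osdmap_line)
    return tuple(int(g) for g in m.groups())

line = "osdmap e115: 36 osds: 34 up, 34 in"
total, up, in_ = osd_counts(line)
print(total - up, "osd(s) down")  # prints: 2 osd(s) down
```

`ceph osd tree`, as norbi says, then tells you which osds those are.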
[14:42] * gregorg (~Greg@ has joined #ceph
[14:47] <Kioob`Taff> is there an ARM version of Ceph ? :p
[14:49] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[14:49] * loicd (~loic@magenta.dachary.org) has joined #ceph
[14:49] <foxhunt> thanks health is OK
[14:54] <tnt> Anyone using radosgw under 0.56 ? I'm getting segfaults :(
[14:58] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[15:00] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[15:00] * ChanServ sets mode +o elder
[15:09] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:16] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[15:16] * loicd (~loic@magenta.dachary.org) has joined #ceph
[15:25] <mikedawson> Is the setting osd_journal_size used when using a partition for the Journal? I'm not setting it in ceph.conf, but it appears to default to 5120 despite my SSD partitions for journals being 10GB. Should I set it to 10240?
[15:26] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: slang)
[15:26] <wido> mikedawson: You shouldn't set osd journal size for a block device
[15:29] <mikedawson> wido: thanks. Is there any decent documentation about tuning an SSD Journal partition to speed up small writes? My use case IO is ~16K, 95% write totaling about 350Mbps
[15:30] <mikedawson> Right now I'm getting about 75% of max IOPS from my 7200rpm drives. I'd like to tune the journals to take the IOPS and write the data to OSDs in a more sequential manner, if possible.
[15:31] <mikedawson> Basically SSD journals take the IOPS load and write to OSDs that can't take the IOPS load.
[15:31] <mikedawson> Seems feasible, if Ceph journaling can be tuned for that purpose
[15:32] <Kioob`Taff> mikedawson: for a similar problem, I'd increase "journal queue max ops" and "journal queue max bytes", then "filestore max sync interval" and "filestore min sync interval". But the journal is still far from fully used
[15:34] <mikedawson> Kioob`Taff: Thanks! I'll experiment. Do you know of any documentation about tuning for small io?
[15:34] <Kioob`Taff> no... just a thread on the mailinglist
[15:37] <Kioob`Taff> mikedawson: http://www.spinics.net/lists/ceph-devel/msg08534.html
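The knobs Kioob`Taff lists would live in the [osd] section of ceph.conf; the values below are purely illustrative starting points for experimentation, not recommendations (and, per wido's earlier point, "osd journal size" is left unset when the journal is a raw partition):

```ini
[osd]
    ; raw SSD partition journal: do NOT set "osd journal size"
    osd journal = /dev/sdb1
    ; let more small writes queue up in the journal
    journal queue max ops = 500
    journal queue max bytes = 104857600
    ; sync to the filestore less eagerly, more sequentially
    filestore min sync interval = 0.01
    filestore max sync interval = 15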
[15:39] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[15:39] * nhorman (~nhorman@nat-pool-rdu.redhat.com) has joined #ceph
[15:58] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[15:58] * loicd (~loic@magenta.dachary.org) has joined #ceph
[16:01] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[16:05] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[16:05] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[16:19] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[16:29] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[16:30] * agh (~agh@www.nowhere-else.org) has joined #ceph
[16:37] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[16:38] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[16:44] * norbi (~IceChat7@buerogw01.ispgateway.de) Quit (Quit: Make it idiot proof and someone will make a better idiot.)
[16:46] * vata (~vata@ has joined #ceph
[16:53] * nwat1 (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[16:55] * sagelap1 (~sage@ Quit (Read error: Operation timed out)
[16:56] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) Quit (Remote host closed the connection)
[16:56] <mikedawson> I'd like to speed up recovery of my cluster after changing a pool from 2x to 3x replication. 2013-01-03 10:56:46.839527 mon.0 [INF] pgmap v17726: 4688 pgs: 3146 active+clean, 68 active+degraded+wait_backfill, 1452 active+recovery_wait, 18 active+degraded, 4 active+recovering; 124 GB data, 264 GB used, 22083 GB / 22348 GB avail; 14149/136915 degraded (10.334%)
[16:57] <mikedawson> Are there settings I can tweak to increase concurrency so there are fewer items listed in a wait state?
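As an aside, the degraded percentage in the pgmap line above is just degraded object instances divided by total object instances; a quick sanity check in Python:

```python
# Numbers taken from the pgmap line: "14149/136915 degraded (10.334%)"
degraded, total = 14149, 136915
pct = degraded / total * 100
print(f"{pct:.3f}%")  # → 10.334%
```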
[17:01] * The_Bishop_ (~bishop@e177089165.adsl.alicedsl.de) has joined #ceph
[17:02] <wido> mikedawson: "osd recovery max active"
[17:03] <wido> mikedawson: By default it's set to 5
[17:03] * The_Bishop (~bishop@e177089165.adsl.alicedsl.de) Quit (Read error: Operation timed out)
[17:04] <mikedawson> wido: does osd_recovery_threads matter? It seems to default to 1
[17:04] <wido> mikedawson: Not sure. I'm not exactly sure what the setting does
[17:08] <ScOut3R_> mikedawson: if you are running the osds on more than 1 host then the network connection between them can be a huge bottleneck
[17:09] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:09] <mikedawson> ScOut3R_: I am, but my storage network traffic is very bursty right now
[17:10] <mikedawson> 1GigE links maxing at peaks of ~25MiB with large pauses between the spikes of transmission
[17:10] * jlogan (~Thunderbi@2600:c00:3010:1:519e:fddc:b274:5689) has joined #ceph
[17:11] <ScOut3R_> mikedawson: and what are the hosts doing in the meantime?
[17:12] <mikedawson> ScOut3R_: Not sure... seems like they are largely idle
[17:13] <mikedawson> it's like the Ceph config defaults to putting a very conservative priority on re-balancing
[17:13] <mikedawson> so conservative, in fact, that my hardware is largely idle and rebalancing is taking a long time
[17:14] <ScOut3R_> mikedawson: interesting, i'm using the default settings and my hosts are going nuts during a rebalance even with very little data (14TB cluster, 30GB useful data)
[17:15] <mikedawson> hrm ... my current setup is 4 nodes w/ two 3TB osds each and roughly 50GB useful data
[17:17] <ScOut3R_> hm
[17:17] <ScOut3R_> how have you changed the replication number?
[17:18] <mikedawson> rebalance after changing the pool from 2x to 3x is taking 2-3 hours with my hardware (CPU, RAM, and network utilization) seemingly idle
[17:18] <ScOut3R_> 2-3 hours?
[17:18] <ScOut3R_> hm
[17:18] <ScOut3R_> have you checked i/o utilization?
[17:18] <mikedawson> [osd]
[17:18] <mikedawson> osd recovery threads = 4
[17:18] <mikedawson> osd recovery max active = 10
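Those two settings can also be injected into running OSDs without a restart; the sketch below uses the injectargs syntax from the bobtail-era docs, so double-check it against your version:

```
ceph osd tell \* injectargs '--osd-recovery-max-active 10'
ceph osd tell \* injectargs '--osd-recovery-threads 4'
```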
[17:19] * gucki_ (~smuxi@46-126-114-222.dynamic.hispeed.ch) has joined #ceph
[17:19] <mikedawson> ScOut3R_: how do you check i/o utilization?
[17:19] <ScOut3R_> mikedawson: i use iotop
[17:20] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[17:20] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[17:20] <mikedawson> iotop is bouncing up and down sometimes up to ~20M/s
[17:21] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[17:21] * aliguori (~anthony@ has joined #ceph
[17:22] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[17:25] <mikedawson> ceph-mon is actually higher on iotop than either ceph-osd most of the time on the box running MON
[17:26] <ScOut3R_> well, a bouncing value does not mean a bottleneck, at least not always
[17:27] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[17:27] * loicd (~loic@magenta.dachary.org) has joined #ceph
[17:28] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[17:30] <tnt> So is nobody using 0.56 with RGW ? or just nobody gets a segfault but me ?
[17:30] <paravoid> sagewk, sjust: osd.15 recovered; according to graphs it took about 1h40
[17:31] <paravoid> tnt: http://tracker.newdream.net/issues/3715
[17:31] * low (~low@ Quit (Quit: Leaving)
[17:33] <tnt> paravoid: interesting. Not exactly the same issue, because here it's the radosgw process itself that crashes, not the osd.
[17:33] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[17:33] * gregaf (~Adium@2607:f298:a:607:8d9:37d1:d5f7:46d3) has joined #ceph
[17:33] <paravoid> aha
[17:33] <paravoid> works for me other than that
[17:33] * sagelap (~sage@185.sub-70-197-142.myvzw.com) has joined #ceph
[17:34] <tnt> but the end of the stack trace is similar, ending in (clone()+0x6d) [0x7ffa4e0a9cbd] from some Thread stuff ...
[17:41] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[17:44] * ScOut3R_ (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:44] * foxhunt (~richard@office2.argeweb.nl) Quit (Remote host closed the connection)
[17:46] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[17:48] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:49] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[17:53] <noob2> tnt: i haven't tried it yet
[17:58] * tnt (~tnt@11.97-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:58] <nhm> paravoid: good news it sounds like?
[17:59] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:00] <paravoid> nhm: well, worked for me, but only after I increased heartbeat timeout to a very large value (per sage/sjust's suggestion)
[18:00] <paravoid> so I guess more work is needed?
[18:00] <paravoid> dunno :)
[18:01] <nhm> paravoid: interesting!
[18:01] * MooingLe1ur is now known as MooingLemur
[18:01] <nhm> paravoid: I wonder what's clobbering the heartbeat.
[18:01] <paravoid> nothing, it was just taking a while
[18:01] <paravoid> and they said something about how the replay all happens in a single workqueue
[18:02] <paravoid> that osd was far behind, so 30' was not enough
[18:02] <nhm> paravoid: hrm
[18:03] <nhm> paravoid: seems like the heartbeat shouldn't be delayed due to slow replay? Not that I really know how any of that code works.
[18:08] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[18:08] * sagelap (~sage@185.sub-70-197-142.myvzw.com) Quit (Read error: Connection reset by peer)
[18:08] <sstan> is there a way to set a maximum pool size (ex. 100Gb) ?
[18:08] <sstan> for a given pool
[18:10] <gregaf> sstan: nope; no quotas anywhere in Ceph yet
[18:10] <sstan> so the max size per pool is what's available ..
[18:11] <sstan> one would have to run more than one ceph cluster in parallel
[18:11] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[18:12] <sstan> that wasn't clear. Max size = the storage space available to the osds
[18:12] <sstan> thanks for the info, gregaf
[18:13] <gregaf> sstan: well, if you're willing to use that as a blocker then you can restrict the pool to a subset of the OSDs using CRUSH rules, rather than setting up whole disparate clusters
[18:13] <sstan> true; I forgot about that
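gregaf's CRUSH approach, sketched: add a rule to the decompiled crush map that only draws from one subtree (the bucket and rule names here are made up), recompile with crushtool, then point the pool at it with 'ceph osd pool set <pool> crush_ruleset <n>':

```
rule limited_pool {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take small-root
        step chooseleaf firstn 0 type host
        step emit
}
```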
[18:15] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:23] * sagelap (~sage@2607:f298:a:607:b12b:c856:f3f1:71e5) has joined #ceph
[18:25] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[18:26] * ScOut3R (~ScOut3R@2E6BADF9.dsl.pool.telekom.hu) has joined #ceph
[18:30] * nwat1 (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:30] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[18:33] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[18:38] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[18:38] * agh (~agh@www.nowhere-else.org) has joined #ceph
[18:44] * fzylogic (~fzylogic@ has joined #ceph
[18:49] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:52] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[18:55] <joao> sagewk, around?
[18:55] <sagewk> yeah
[18:56] <joao> do you think we should extend #3633 to also report clock skews on the osds via 'ceph status/health'?
[18:56] <joao> or on the other components, beside the mons
[18:57] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Remote host closed the connection)
[18:58] <joao> or maybe just slightly adjust the existing implementation to allow easy expansion for that purpose if need arises
[18:59] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[19:00] <phantomcircuit> i have two osd's on the same system for two disks
[19:00] <phantomcircuit> journals on separate devices
[19:01] <phantomcircuit> the first osd (by id#) seems to have more activity than the second
[19:01] <phantomcircuit> even though they have the same weight
[19:01] <sjustlaptop> what is the workload?
[19:01] <phantomcircuit> virtual machines
[19:02] <phantomcircuit> pool default -> datacenter -> host -> osd.0/osd.1 both up with weight 1
[19:03] <sjustlaptop> how much more activity?
[19:03] <sjustlaptop> and how many pgs?
[19:04] <phantomcircuit> roughly double the activity
[19:05] <phantomcircuit> pg_num is 128
[19:07] <nhm> phantomcircuit: how many osds total, and what is the overall difference in activity?
[19:07] <phantomcircuit> currently there are just the two
[19:07] <phantomcircuit> which is why i find it kind of odd
[19:08] <sjustlaptop> is the total amount of activity relatively low?
[19:08] <phantomcircuit> total iops is low but throughput is high
[19:08] <sjustlaptop> what are the vms doing?
[19:08] <phantomcircuit> im copying entire vm images
[19:10] <nhm> ooh, ruby on rails SQL injection flaw.
[19:11] <phantomcircuit> nhm, they're all over the place trying to argue that it isn't a real flaw
[19:11] <phantomcircuit> it's hilarious
[19:11] <janos> sql injection tends to be a flaw in the developers
[19:11] <janos> ;)
[19:12] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[19:13] <phantomcircuit> janos, this is a rails flaw
[19:13] <phantomcircuit> period
[19:13] <phantomcircuit> anybody saying differently is a dumbass
[19:13] <janos> i agree (reading up on it now) but i still think my statement holds for most cases
[19:14] <phantomcircuit> for most cases yes
[19:14] <phantomcircuit> in this case lol no
[19:14] <janos> esp. considering my experience that RoR (and activeRecord) generally approach databases as some ugly-but-necessary tech part
[19:15] <janos> some of the db's i've seen.... let's just say clotheslining or chest-stomping is inadequate to describe my feelings
[19:16] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[19:17] <dmick_away> joao: yt?
[19:18] * dmick_away is now known as dmick
[19:18] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[19:18] <joao> assuming that means 'you there', yep, here :)
[19:19] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[19:19] <dmick> so the guy reporting mon problems on the list: he said he'd experienced them while upgrading to 0.56, but you said the bug was fixed in 0.56
[19:20] <dmick> can't help but think there's more to the story
[19:20] <dmick> (norbi@rocknob.de)
[19:21] <dmick> but maybe you meant the non-upgraded monitors were crashing and not surprisingly?
[19:22] <joao> dmick, the monitors that were failing were the ones still on 0.55.1 as far as I could tell
[19:23] <dmick> yeah, read a bit more carefully and decided that's what you must have meant
[19:23] <dmick> ok. disregard me :)
[19:23] <joao> the problem would present itself under those conditions, whenever a new monitor was promoted to leader even for just a little while
[19:24] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[19:29] <phantomcircuit> ceph-osd[3767]: segfault at 18 ip 0000033ee05cc235 sp 0000033ec07f08f0 error 4 in libc-2.15.so[33ee0550000+1a2000]
[19:29] <phantomcircuit> uh oh
[19:31] * sagelap (~sage@2607:f298:a:607:b12b:c856:f3f1:71e5) Quit (Read error: Operation timed out)
[19:32] * sagelap (~sage@2607:f298:a:607:34fc:1d2c:684c:76dd) has joined #ceph
[19:34] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[19:34] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[19:34] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[19:39] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[19:40] <phantomcircuit> http://imgur.com/81F07
[19:41] <phantomcircuit> kernel panic when the network died
[19:41] <phantomcircuit> 3.5.7-gentoo
[19:43] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[19:45] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[19:46] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[19:47] * agh (~agh@www.nowhere-else.org) has joined #ceph
[19:47] <elder> phantomcircuit, that image isn't enough to really know what happened.
[19:47] <elder> (That imgur link)
[19:48] <phantomcircuit> elder, yeah i figured but it's all i've got
[19:48] <elder> Chances are it'
[19:48] <phantomcircuit> for some reason when the kernel panic'd it reset the resolution to 800x600
[19:48] <elder> s in the ceph messenger, but there's no guarantee of that.
[19:48] <elder> Lots of strange things can happen when the kernel panics...
[19:50] <tnt> don't shoot the messenger ...
[19:50] <phantomcircuit> yeah i could attach a virtual serial port over IPMI
[19:53] <phantomcircuit> nope wouldn't help i have kernel panics to serial disabled :/
[19:53] <tnt> netconsole
[19:55] <phantomcircuit> hmm osd.0 keeps crashing
[20:02] * korgon (~Peto@isp-korex- has joined #ceph
[20:02] * Cube1 (~Cube@ has joined #ceph
[20:04] <phantomcircuit> using ceph-fuse how can i mount a different pool?
[20:07] <korgon> anybody up to discuss Ceph as an alternative HPC file system, both capacity (HDDs) and performance (SSDs)?
[20:07] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[20:11] <korgon> what do you suggest as architecture for CEPH for HPC environment?
[20:11] <korgon> is there some support for RDMA?
[20:12] <korgon> I think of capacity nodes with 36-60x 3.5" drives and SSD nodes with 24+ SSDs (every 8ssds per lsi2308 controller) or bunch of pci-e SSDs per node
[20:15] <korgon> what do you suggest for journals? Memory for SSDs and SSDs for HDDs?
[20:16] <korgon> did you meet some IOPs saturation because of CEPH? Can CEPH saturate 2.5M IOPs from 24x SSDs per node?
[20:19] * nhorman (~nhorman@nat-pool-rdu.redhat.com) Quit (Quit: Leaving)
[20:19] <nhm> korgon: Hi! How big of a deployment are you thinking about?
[20:20] * ScOut3R (~ScOut3R@2E6BADF9.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[20:20] <korgon> tens of PB for HDDs hundreds of TB for SSDs
[20:22] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) has joined #ceph
[20:22] <nhm> korgon: Nice! So basically the status right now is that the posix filesystem component in Ceph is mostly stable in a single metadata server configuration, but we are mostly looking at deployments of cephfs on a per-customer basis. To be perfectly honest, we don't have anyone running a cephfs deployment with 10s of petabytes yet.
[20:22] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[20:23] <nhm> korgon: We are however working with a couple of National Laboratories in the US to evaluate cephfs as a potential competitor for lustre.
[20:24] <korgon> it's planned for 2nd half 2013
[20:24] <nhm> korgon: currently there is no RDMA support, though several organizations have expressed interest in developing it. Right now we support IB via IPoIB and have some folks starting to look into rsocket.
[20:25] <korgon> should be used for computation cluster as filesystem and there is alternative to use nodes also for hadoop and replace HDFS with CEPHfs or direct use of RADOS
[20:26] <korgon> but so far none of the vendors know anything about Ceph :), though I'd like to see it there as a very promising technology
[20:27] <nhm> korgon: I have been doing some testing on a 36 drive supermicro chassis with cheap SAS9207-8i controllers. I top out at about 2.8GB/s with client on the localhost (that's including journal writes). I haven't had time yet to really dig into why I'm hitting a wall there.
[20:27] <korgon> RDS?
[20:27] <nhm> korgon: I'll be testing on some larger DDN chassis soon.
[20:27] * agh (~agh@www.nowhere-else.org) Quit (Ping timeout: 480 seconds)
[20:28] <nhm> korgon: that was doing 8 concurrent rados bench instances streaming 4MB writes over librados.
[20:28] <nhm> korgon: there may be some additional overhead with the posix layer.
[20:30] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[20:30] <korgon> did you try to test direct access for those 36 drives to be sure no hw or driver problems?
[20:31] <nhm> korgon: nope, it was just a quick 1-off test. Real tests are in the pipeline.
[20:31] <korgon> 2.8GB/s is very similar to what you can get from a PCIe 2.0 slot
[20:31] <nhm> korgon: that was with 4 controllers each in 8x slots
[20:32] <nhm> korgon: on a dual E5 Xeon board
[20:32] <korgon> so no expanders? not e16 version?
[20:33] <nhm> no expanders, didn't want to deal with them.
[20:33] <korgon> 32 drives from drives and 4 to MB?
[20:33] <nhm> btw, that setup was 24 OSDs on spinning disks with 8 SSDs for journals.
[20:34] <nhm> 6 spinning disks and 2 SSDs per controller.
[20:34] <nhm> just system disk on MB.
[20:34] <nhm> so had 4 bays empty in that test.
[20:34] <korgon> so 24 OSDs and 8SSDs?
[20:34] <nhm> yep
[20:34] <nhm> 3 journals per SSD. Intel 520s
[20:35] <korgon> that 8 SSDs are bottleneck
[20:35] <nhm> was only 1.4GB/s aggregate to the journals, they shouldn't be.
[20:35] <nhm> See the tests here: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[20:36] <nhm> that's the same setup, but with only 6 spinning disks and 2 SSDs on a single controller in that chassis.
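Rough arithmetic behind "they shouldn't be": splitting the 1.4GB/s of journal traffic across the 8 SSDs leaves each drive well below the ~500MB/s sequential-write figure commonly quoted for an Intel 520 (that spec number is an assumption here):

```python
# 2.8 GB/s total includes a journal copy, so ~1.4 GB/s hits the 8 SSDs
journal_mb_per_s = 1400
ssds = 8
per_ssd = journal_mb_per_s / ssds
print(per_ssd)  # 175.0 MB/s per SSD, far from saturating the drive
```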
[20:36] <korgon> try to use 32 HDDS with journals per HDD
[20:36] <nhm> korgon: yeah, that's coming too. Right now I'm doing parametric sweeps of ceph parameters and different IO schedulers. Then I'll probably try doing some real scaling tests with more drives in the chassis.
[20:37] <nhm> korgon: btw, mind if I ask what organization you are with?
[20:40] <korgon> I just right know consult one project with one university in my country
[20:47] * BManojlovic (~steki@ has joined #ceph
[20:50] <noob2> what's the general solution people have settled on for monitoring their disks?
[20:52] <noob2> i'm not sure if i should be watching logs for errors or having sitescope or some other crap monitor all the drives
[20:52] <Kioob> smartmontools... if it's available
[20:53] <noob2> yeah i was thinking that also
[21:06] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[21:06] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:07] * cdblack (c0373628@ircip1.mibbit.com) has joined #ceph
[21:10] * sagelap (~sage@2607:f298:a:607:34fc:1d2c:684c:76dd) Quit (Ping timeout: 480 seconds)
[21:10] * sagelap (~sage@2607:f298:a:607:b12b:c856:f3f1:71e5) has joined #ceph
[21:28] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[21:28] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[21:30] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[21:37] * ScOut3R (~ScOut3R@2E6BADF9.dsl.pool.telekom.hu) has joined #ceph
[21:38] * darkfaded (~floh@ Quit (Ping timeout: 480 seconds)
[21:41] <mikedawson> pg 4.722 is stuck unclean for 69484.181620, current state active+degraded, last acting [7,6]
[21:41] <mikedawson> Does the 4 in 4.722 mean this placement group is in Pool #4?
[21:42] <sjustlaptop> sagewk: wip_3699 is ready for a review
[21:42] <sagewk> k
[21:42] <mikedawson> And does [7,6] mean that it lives on osd.7 and osd.6?
[21:42] <joshd> mikedawson: yes to both
[21:43] <dmick> sjustlaptop: can we PM about the issue I emailed last night?
[21:43] <joshd> well, the 7,6 means the last reported acting set for it was 7 and 6
[21:43] <sjustlaptop> dmick: sure
[21:43] <sjustlaptop> sagewk: wip-3714-b looks good
[21:44] <joshd> mikedawson: 'ceph pg dump' will show the current mapping
[21:44] <paravoid> sjustlaptop: did you see my note?
[21:44] <wido> wow, 0.56 really boosts RBD performance :)
[21:44] <mikedawson> joshd: I changed pool 4 from 2x to 3x replication this morning. I am now stuck with HEALTH_WARN 18 pgs degraded; 18 pgs stuck unclean; recovery 106/90035 degraded (0.118%) but I can't decide how to get these 18 PGs moving towards HEALTH_OK
[21:44] <wido> VMs feel a lot more snappier
[21:44] <paravoid> sjustlaptop: it took about 1h40 to recover, so that 30s heartbeat was really off
[21:44] <sjustlaptop> good god
[21:45] <sjustlaptop> paravoid: indeed
[21:45] <joshd> wido: that's good to hear. is that compared to argonaut?
[21:45] <sjustlaptop> that is a lot of maps
[21:45] <wido> joshd: Yes, upgraded from 0.48.2
[21:45] <paravoid> sjustlaptop: so I think that wip-3714 is only solving part of the problem
[21:45] <sjustlaptop> paravoid: actually, the reason wip-3714 didn't appear to work is that when it came up the first time, it caught up to the monitor
[21:45] <wido> just running "sync" is much faster, used to take about 500 ~ 800ms
[21:46] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[21:46] <sjustlaptop> if wip-3714 had been on the node at that time, it should have worked ok
[21:46] <paravoid> how come?
[21:46] <paravoid> is this the "dig the hole deeper" thing?
[21:46] <mikedawson> Just deleted a volume from RBD, now I have 2013-01-03 15:45:16.166272 mon.0 [INF] pgmap v24772: 4672 pgs: 4654 active+clean, 18 active+degraded; 23336 MB data, 48425 MB used, 22300 GB / 22348 GB avail; -106/11702 degraded (-0.906%)
[21:46] <joshd> mikedawson: does 'ceph pg dump | grep degraded' show them mapped to only two osds? if so, enabling the crush tunables may fix it
[21:47] <mikedawson> negative numbers seem a bit odd
[21:47] * jbarbee (17192e61@ircip1.mibbit.com) has joined #ceph
[21:47] <joshd> mikedawson: negative numbers are a bug
[21:47] <sagewk> sjutlaptop: branch looks ok, modulo the 2 comments
[21:48] <sagewk> sjustlaptop: and it should go in next, and cherry-pick -x to testing
[21:48] <sagewk> incidentally, we should s/testing/last/
[21:48] <joshd> mikedawson: http://ceph.com/docs/master/rados/operations/crush-map/#tunables is what I mean about crush tunables
[21:49] <sagewk> joshd: let's change the crush defaults in master now that v0.56 is out
[21:49] <mikedawson> joshd: yes, these PGs only show two mappings. Thanks for the link. I'll give it a try
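The workflow for the tunables doc joshd links is roughly the following; the --set-* flag names come from that bobtail-era documentation, so verify them against your crushtool version:

```
ceph osd getcrushmap -o /tmp/crush
crushtool -i /tmp/crush \
    --set-choose-local-tries 0 \
    --set-choose-local-fallback-tries 0 \
    --set-choose-total-tries 50 \
    -o /tmp/crush.new
ceph osd setcrushmap -i /tmp/crush.new
```

Note the doc's warning that older clients and kernels may not understand the adjusted tunables.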
[21:49] <joshd> sagewk: sounds good
[21:50] <mikedawson> joshd: know bug or should I enter something in the tracker?
[21:50] <wer> Hey, is adjusting the replication size of a pool no big deal? I just figured I was going to have to blow away my pool... but 'ceph osd pool set <pool> size 3' seemed to work. :)
[21:50] <joshd> mikedawson: I'm not aware of any offhand
[21:50] <joshd> wer: yeah, it shouldn't be a problem
[21:51] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[21:51] <wer> joshd: Nice. I was all shuddering from having to adjust the pg_nums before.. .which required moving a lot of data :) ty
[21:53] <joshd> wer: yeah, that's the only thing that can't change easily. experimental split is in 0.56
[21:53] <wer> k. ossm.
[21:53] * ScOut3R (~ScOut3R@2E6BADF9.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[21:55] <dmick> joshd: are new replicas created as objects are touched/scrubbed (lazily)?
[21:56] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[21:56] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) has joined #ceph
[21:57] <joshd> dmick: I think it's just handled through the normal peering/recovery process as the osdmap with the new pool size is propagated
[22:00] <wer> joshd: yup. It is. Things are degraded until things are no longer degraded. Sweet. So I reduced the replication count and was hoping to see the 'GB used' change in ceph -w output. Should I not see that decrease/increase when messing with replication?
[22:02] <joshd> wer: it should change, but it's updated lazily, so it could be a little while (and I think old replicas might not be deleted until the pg is clean again)
[22:02] <wer> k. I will reduce and watch for a bit. Thanks!
[22:03] <joshd> you're welcome :)
[22:03] <mikedawson> joshd: Issue added http://tracker.newdream.net/issues/3720
[22:04] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) has joined #ceph
[22:05] <joshd> mikedawson: thanks
[22:06] <CloudGuy> hey all .
[22:06] <CloudGuy> happy new year
[22:06] <noob2> happy new years
[22:09] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[22:10] * jefferai (~quassel@quassel.jefferai.org) has joined #ceph
[22:11] <wer> Anyone know what when I do 'rados df', my bucket that has 162 objects reports as having 258 objects?
[22:11] <wer> s/what/why/
[22:12] * darkfader (~floh@ has joined #ceph
[22:16] * darkfaded (~floh@ has joined #ceph
[22:16] * darkfader (~floh@ Quit (Read error: Connection reset by peer)
[22:16] <joshd> wer: radosgw can use more than one rados object per radosgw object (I don't know the current details)
[22:18] <wer> Hmm. Ok. Also, like an idiot... the reason that GB used number didn't change much is because I have my journals on the osds :) So a lot of the used space is actually the journals. So I don't think the number is that lazy.
[22:19] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Read error: Operation timed out)
[22:22] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[22:22] * agh (~agh@www.nowhere-else.org) has joined #ceph
[22:22] * nhm rocks out to some Michael Jackson.
[22:24] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[22:28] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[22:29] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[22:31] <mikedawson> joshd: HEALTH_OK after the tunables
[22:31] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[22:31] <mikedawson> joshd: and the negative numbers stuff is gone as well
[22:31] <mikedawson> Thanks!
[22:31] <joshd> mikedawson: great!
[22:32] <mikedawson> What exactly do the choose-local-tries settings do?
[22:32] * agh (~agh@www.nowhere-else.org) has joined #ceph
[22:33] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[22:34] <joshd> when crush tries a placement, and fails to get the desired number of replicas, local tries don't back up the entire tree, so it could get stuck retrying a subtree that wouldn't satisfy the requirements (like 3 replicas)
[22:34] <joshd> instead iirc it retries from the root
[22:37] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Quit: Leaving.)
[22:38] * korgon (~Peto@isp-korex- has joined #ceph
[22:42] <sjustlaptop> paravoid: basically, the peering_wq thing just updates the pg from wherever it is to wherever the OSD is
[22:43] <sjustlaptop> so when the osd came up the first time, the osd was mostly recent even though the pgs hadn't caught up
[22:43] * jbarbee (17192e61@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[22:43] <sjustlaptop> the wip branch would have forced the OSD to wait for the pgs on each set of maps before advancing to the next set of maps
[22:43] <sjustlaptop> it's still a bit of a hack, but under normal circumstances, none of this can happen
[22:45] <paravoid> what was abnormal with my case?
[22:45] <paravoid> just wondering, so I can make it normal again :-)
[22:51] <sjustlaptop> paravoid: if I understand correctly, you had one machine which had been marked out for a while?
[22:52] <sjustlaptop> or rather, one osd which had been down for a while
[22:52] <sjustlaptop> ?
[22:52] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[22:52] <sjustlaptop> anyway, the odd part was the large number of maps which had transpired while that osd had been down
[22:52] <sjustlaptop> that's not really unusual in itself (wip-3714 should take care of that case)
[22:53] <sjustlaptop> but there aren't many other cases where an osd would be that many maps behind
[22:53] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit ()
[22:54] <paravoid> no
[22:54] <paravoid> it wasn't
[22:54] <paravoid> so what happened is
[22:54] <paravoid> the cluster was degraded (28% or so) because of new hardware being introduced
[22:54] <paravoid> at one random point osd.15 died
[22:55] <paravoid> then upstart started it again, only to commit suicide again, and so on for a while
[22:55] <paravoid> from the first crash and until I deployed wip-3714 it was something like 4-5h
[22:56] * BManojlovic (~steki@ has joined #ceph
[22:59] <sjustlaptop> ah... the crashes were generating a lot of maps, that's pretty much what happened
[22:59] <sjustlaptop> wip-3714 should prevent such a pattern
[22:59] <paravoid> okay
[23:00] <paravoid> as long as you guys have the whole picture, I have faith :)
[23:02] * gucki_ (~smuxi@46-126-114-222.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[23:02] * gucki (~smuxi@46-126-114-222.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[23:04] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[23:07] * ScOut3R (~ScOut3R@2E6BADF9.dsl.pool.telekom.hu) has joined #ceph
[23:13] * mistur_ (~yoann@kewl.mistur.org) has joined #ceph
[23:14] * nwat (~Adium@soenat3.cse.ucsc.edu) Quit (Quit: Leaving.)
[23:15] * mistur (~yoann@kewl.mistur.org) Quit (Ping timeout: 480 seconds)
[23:23] * jskinner (~jskinner@ has joined #ceph
[23:25] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[23:32] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[23:36] * aliguori (~anthony@ Quit (Remote host closed the connection)
[23:37] * ScOut3R (~ScOut3R@2E6BADF9.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[23:51] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[23:53] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[23:53] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[23:57] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:57] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[23:57] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) Quit ()
[23:58] * glowell1 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[23:58] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.