#ceph IRC Log


IRC Log for 2013-04-10

Timestamps are in GMT/BST.

[0:03] * portante|afk (~user@ Quit (Quit: rebooting)
[0:03] * BillK (~BillK@58-7-53-210.dyn.iinet.net.au) has joined #ceph
[0:08] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[0:16] * Cube (~Cube@ Quit (Remote host closed the connection)
[0:16] * Cube (~Cube@ has joined #ceph
[0:19] * BManojlovic (~steki@fo-d- Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:20] * ivotron (~ivo@eduroam-238-87.ucsc.edu) Quit (Ping timeout: 480 seconds)
[0:27] <Kioob> so, when I run "ceph -w", it displays a very old error message. Where can that message be stored?
[0:28] <dmick> probably a log; that's what -w shows
[0:28] <dmick> maybe it starts off with "the last logged message"; don't recall
[0:29] <Kioob> ok, so mc.sub_want() get last logged messages, via mon client, right ?
[0:30] <Kioob> and mc.renew_subs() clear that list ?
[0:30] <dmick> this is my theory. I know it gets new messages.
[0:30] <dmick> it's 'subscribing to log events'
[0:30] <dmick> so it's a bit buried, exactly how it happens
[0:31] * vata (~vata@2607:fad8:4:6:dcad:e235:1569:718a) Quit (Quit: Leaving.)
[0:31] <Kioob> oh ok :/
[0:31] <Kioob> so... I really don't know how to "debug" that
[0:33] <dmick> is it a problem?
[0:33] <Kioob> well... "ceph -w" never worked on my cluster
[0:33] <Kioob> and since I have often "stability" problems, maybe it's related
[0:34] <dmick> an extra message is almost certainly not a sign of stability problems; getting nothing at all on -w is a bit odd
[0:34] <dmick> but it depends on the log level you have enabled
[0:34] <Kioob> I start : ceph --verbose --watch --watch-debug
[0:34] <ron-slc> gregaf1: It looks like my osd pool permissions don't work when set like: osd 'allow rwx pool=data,allow rwx pool=data2x,allow rwx pool=data3xsata', but do work when set like: osd 'allow rwx *'. I see no proper convention for multiple OSDs in the auth documentation. What am I missing?
[0:34] <Kioob> then I see the status
[0:35] <Kioob> followed by one very old message : 2012-12-10 11:45:35.407734 unknown.0 [INF] mkfs de035250-323d-4cf6-8c4b-cf0faf6296b1
[0:35] <Kioob> then nothing else
[0:36] <gregaf1> ron-slc: I think you want semicolons instead of commas? multiple OSDs doesn't have anything to do with it; multiple pools are separate grants but otherwise don't impact each other
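[Editor's note: the grant syntax under discussion can be sketched with the `ceph auth caps` command; the client name here is hypothetical, and the separator (comma vs. semicolon) varied between releases of that era, so verify with `ceph auth list` afterwards.]

```shell
# Sketch: one grant per pool, applied to a hypothetical client key.
# After setting, inspect the stored caps to confirm the parser accepted them.
ceph auth caps client.example \
    mon 'allow r' \
    osd 'allow rwx pool=data, allow rwx pool=data2x, allow rwx pool=data3xsata'
ceph auth list | grep -A3 client.example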
[0:43] * ivotron (~ivo@dhcp-59-235.cse.ucsc.edu) has joined #ceph
[0:45] <mrjack> how can i change osd max backfills on the fly?
[0:46] <Kioob> probably with : ceph osd tell 0 injectargs '--max-backfills X'
[0:47] <mrjack> probably?
[0:47] <Kioob> not sure, I'm just rereading the wiki, and saw that
[0:47] <dmick> http://ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=injectargs#runtime-changes
[0:47] <dmick> injectargs as a search would lead you right there
[0:49] * rustam (~rustam@ has joined #ceph
[0:49] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[0:52] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[0:53] * MarkN (~nathan@ has joined #ceph
[0:53] * MarkN (~nathan@ has left #ceph
[0:55] <ron-slc> gregaf1: Well, looks like it is supposed to be commas. I found the problem was due to a pool rename from data2xsata to data3xsata, wrecking auth.
[0:58] <mrjack> failed to parse arguments: --max-backfills,5
[0:58] <mrjack> hm
[1:00] <Kioob> and without comma ?
[1:02] <dmick> http://ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=injectargs#runtime-changes no commas there
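[Editor's note: the parse failure above came from passing the arguments comma-separated and from a truncated option name; a working invocation in the style of the linked docs would look roughly like this. The option is `osd_max_backfills` (dashes and underscores are interchangeable); treat this as a sketch against a live cluster.]

```shell
# Runtime change, no commas between injected arguments:
ceph osd tell 0 injectargs '--osd-max-backfills 5'

# Or apply to every OSD:
ceph osd tell \* injectargs '--osd-max-backfills 5'
```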
[1:02] <ron-slc> gregaf1: I renamed pool back to data2xsata, all worked. I had this happen long ago, but blamed it on a hyphen in the pool name... so I found the solution is to rename the pool, restart all related OSDs, and then auths will work on renamed pools.
[1:02] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[1:02] <ron-slc> gregaf1: gotta love issues, which manifest in other areas....
[1:03] <gregaf1> yeah; I think that's fixed for cuttlefish iirc
[1:03] <ron-slc> awesome! I've been suckered into it twice.. ;)
[1:04] <ron-slc> Seems I'm physically incapable of naming a new pool properly the 1st time.
[1:04] <dmick> ron-slc: it's a bogus limitation, thrust on us by surprise by a library change
[1:04] <dmick> so what I'm saying is don't feel bad
[1:04] <ron-slc> ah was it libboost again? That got me in argonaut
[1:05] <dmick> boost-spirit, yes
[1:06] <dmick> http://tracker.ceph.com/issues/4122
[1:06] <ron-slc> gotta love it. One thing I'm beyond happy about: no matter how brutally-horribly I've purposely treated my development cluster, it has never lost data. Some problems caused on purpose, and some by accident.
[1:07] <ron-slc> My production cluster; though.. I'm careful with, and test in dev first.. ;)
[1:16] * ivotron (~ivo@dhcp-59-235.cse.ucsc.edu) Quit (Ping timeout: 480 seconds)
[1:23] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[1:25] <Kioob> yeah... 2 of 3 rbd (kernel) client nodes just segfaulted.
[1:27] * slang (~slang@c-71-239-8-58.hsd1.il.comcast.net) Quit (Ping timeout: 480 seconds)
[1:29] <gregaf1> joao, do we have a task in the tracker yet for splitting up the PG map on-disk format in the monitors?
[1:29] * jlogan (~Thunderbi@2600:c00:3010:1:64ea:852f:5756:f4bf) Quit (Ping timeout: 480 seconds)
[1:33] <mrjack> how does replication size affect performance? i had rep size 2 on a 4 node cluster, now 7 nodes, should i increase rep size?
[1:35] <mrjack> ron-slc: i had data loss when i used btrfs instead of xfs/ext4 when i started to test ceph...
[1:35] <mrjack> ron-slc: problem was that sometimes btrfs was unmountable after reboot, and i tested what happens when all nodes are shut off at the same time... 3 of 4 btrfs fs didn't make it, so ceph was unable to recover any data :(
[1:36] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:37] <pioto> so, http://ceph.com/docs/master/rados/operations/auth-intro/#ceph-authorization-caps mentions "cephfs filesystem access controls"... where are those documented? i haven't seen anything other than a general "you can talk to an mds" capability?
[1:40] <elder> joshd, are you around?
[1:41] <joshd> yeah
[1:41] <elder> I'm trying to make sense of what I'm getting back from osd reads. I'm getting lots of result 0, 0 bytes/4096 transferred.
[1:41] <elder> I was expecting to see ENOENTs
[1:41] <elder> Is that reasonable?
[1:42] <ron-slc> mrjack: yes, I as well with BTRFS. I moved to XFS on dev. Prod has always been XFS.
[1:42] <ron-slc> mrjack: but, as said, no data-loss was due to Ceph, nor its recommended uses. :)
[1:43] <elder> I still get some result -2 (ENOENT) but most of them don't say that.
[1:44] <joshd> elder: yes, if the object doesn't exist, you should get -ENOENT. it's possible to get zero bytes back if you create an object without writing to it, or write to it and then truncate it, but that seems more likely to be a bug
[1:45] <elder> OK, well that's sort of how I understood it. I'm going to instrument things a bit more to make sure I know what I'm looking at.
[1:46] <elder> It seems to be why my layered reads aren't working, I'm getting 0 back and that means zero-fill rather than redirect to parent.
[1:46] <elder> But I need to validate what I'm using. I could have screwed something up myself...
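[Editor's note: joshd's distinction above — a zero-length read from an object that exists versus -ENOENT for one that doesn't — can be reproduced directly with the rados CLI; the pool and object names here are hypothetical, not from the log.]

```shell
# Object exists but holds no data: reads succeed with 0 bytes.
rados -p rbd create emptyobj          # create without writing any data
rados -p rbd stat emptyobj            # reports size 0
rados -p rbd get emptyobj /tmp/out    # succeeds; /tmp/out is empty

# Object was never created: reads fail with -ENOENT.
rados -p rbd get missingobj /tmp/out  # error (2) No such file or directory
```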
[1:49] <dmick> pioto: not sure. gregaf1? Are those xattrs? and are they documented?
[1:52] <mrjack> how does ceph work with rbd when it reads - does it read from all replications simultaneously?
[1:52] * slang (~slang@c-71-239-8-58.hsd1.il.comcast.net) has joined #ceph
[1:53] <pioto> also, the concept of cross-pool rbd clones seems weird to me, but maybe it's because i think of 'pool' in the sense that ZFS uses it, which i know isn't correct for a ceph cluster
[1:54] <pioto> basically, if i'm reading things right, it's still basically copy-on-write, even if the source image is in one pool, and the new one is in another?
[2:00] <dmick> pioto: yes
[2:00] <dmick> I tend to think of "pool" as "the one level of directory you get"
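[Editor's note: the cross-pool copy-on-write clone pioto asks about looks roughly like the following, assuming a format 2 parent image; the pool and image names are hypothetical.]

```shell
# Snapshot the parent, protect it (clones require a protected snapshot),
# then clone it into a *different* pool -- still copy-on-write:
rbd snap create data/parent@base
rbd snap protect data/parent@base
rbd clone data/parent@base backup/child
```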
[2:02] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Bye!)
[2:03] * slang (~slang@c-71-239-8-58.hsd1.il.comcast.net) Quit (Ping timeout: 480 seconds)
[2:15] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[2:24] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[2:25] * diegows (~diegows@ has joined #ceph
[2:30] * Qten (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[2:35] <joao> gregaf1, I think there is, not sure though
[2:35] <joao> let me look
[2:40] <joao> gregaf1, ah, found it!
[2:40] <joao> http://tracker.ceph.com/issues/4200
[2:43] <elder> joshd, I figured out my problem.
[2:43] <elder> Working now...
[2:46] <gregaf1> thanks joao
[2:54] <mrjack> is recovery in general faster if i have a higher replication level?
[2:57] * l0uis (~l0uis@madmax.fitnr.com) Quit (Quit: leaving)
[3:10] * rturk is now known as rturk-away
[3:17] * rustam (~rustam@ Quit (Remote host closed the connection)
[3:30] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[3:32] * alram (~alram@ Quit (Quit: leaving)
[3:33] <mrjack> what will perform better, one raid10 with journal and osd data on it, or two raid1, one for journal and one for osd?
[3:34] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has left #ceph
[3:35] * dpippenger (~riven@ Quit (Remote host closed the connection)
[3:37] <lurbs> If it's spinning disks, I'd guess the latter. But if you only have four (spinning) disks per machine then I'd suggest having each as a separate OSD and adding an SSD or two as journal for them. Faster, more capacity.
[3:47] <mrjack> well i would do that if i could, but most of the nodes are 1U with no space left for an SSD, unfortunately
[3:49] <mrjack> i would have to make the raid10 degraded mhmmmm
[3:52] <mrjack> and do you know if recovery will be faster if i have higher replication size?
[4:09] <lurbs> Slower, I'd imagine. More replicas to check against, and fix from, the primary object.
[4:32] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:32] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[4:33] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[4:33] * winston-d (~Miranda@fmdmzpr02-ext.fm.intel.com) has joined #ceph
[4:34] * ivotron (~ivo@69-170-63-251.static-ip.telepacific.net) has joined #ceph
[4:35] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Read error: Operation timed out)
[4:36] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[4:38] <Qten> in ceph.conf with radosgw can anyone tell me what "rgw keystone accepted roles = {accepted user roles}" should be or is?
[4:41] <Qten> i imagine its just the admin/member roles from keystone
[4:42] <dmick> I can tell you the default is "Member, admin"
[4:42] <dmick> the comment says
[4:42] <dmick> // roles required to serve requests
[4:42] <Qten> yeah is it just listing them as admin,member?
[4:42] <Qten> i'm confused on formatting etc
[4:42] <dmick> that is literally the default value there
[4:43] <Qten> i'll try it with the Member, Admin
[4:44] <Qten> without the "" i guess
[4:44] <dmick> it's "Member, admin"
[4:44] <Qten> ah ok
[4:44] <dmick> I gotta assume case is significant
[4:45] <Qten> cheers i'll give it a whirl
[4:45] <Qten> yeah Member is
[4:45] <dmick> I presume you've seen http://ceph.com/docs/master/radosgw/config/#integrating-with-openstack-keystone
[4:45] <Qten> weirdos
[4:45] <Qten> http://ceph.com/docs/master/radosgw/config/#integrating-with-openstack-keystone
[4:45] <Qten> yeah looking at that now
[4:45] * rustam (~rustam@ has joined #ceph
[4:46] <Qten> thanks dmick
[4:46] <dmick> ta
[4:49] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[4:52] * rustam (~rustam@ Quit (Remote host closed the connection)
[5:05] * jamespage (~jamespage@culvain.gromper.net) Quit (Quit: Coyote finally caught me)
[5:06] * jamespage (~jamespage@culvain.gromper.net) has joined #ceph
[5:09] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[5:22] * chutzpah (~chutz@ Quit (Quit: Leaving)
[5:27] * BillK (~BillK@58-7-53-210.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[5:31] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[5:36] * BillK (~BillK@124-148-226-41.dyn.iinet.net.au) has joined #ceph
[5:42] * dmick (~dmick@2607:f298:a:607:9e6:1dc2:5b27:f931) Quit (Quit: Leaving.)
[5:44] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:53] * jlogan (~Thunderbi@2600:c00:3010:1:59e7:37c9:7f2a:f7e2) has joined #ceph
[6:03] * ivotron (~ivo@69-170-63-251.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[6:19] * gazoombo (uid6629@id-6629.hillingdon.irccloud.com) has joined #ceph
[6:19] * gazoombo (uid6629@id-6629.hillingdon.irccloud.com) has left #ceph
[6:24] * yasu` (~yasu`@dhcp-59-149.cse.ucsc.edu) Quit (Remote host closed the connection)
[6:36] * rustam (~rustam@ has joined #ceph
[6:38] * rustam (~rustam@ Quit (Remote host closed the connection)
[7:08] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[7:10] * Cube (~Cube@ Quit (Ping timeout: 481 seconds)
[7:13] * eschnou (~eschnou@173.213-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:18] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[7:27] * l0nk (~alex@87-231-111-125.rev.numericable.fr) has joined #ceph
[7:31] * ivotron (~ivo@adsl-76-254-17-170.dsl.pltn13.sbcglobal.net) has joined #ceph
[7:32] * l0nk (~alex@87-231-111-125.rev.numericable.fr) Quit (Quit: Leaving.)
[7:35] * The_Bishop (~bishop@2001:470:50b6:0:18a9:d14e:1316:a12f) Quit (Ping timeout: 480 seconds)
[7:36] * eschnou (~eschnou@173.213-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[7:40] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) has joined #ceph
[7:43] * The_Bishop (~bishop@2001:470:50b6:0:b4f7:e8b9:9017:7966) has joined #ceph
[7:55] * loicd (~loic@magenta.dachary.org) has joined #ceph
[7:56] * tnt (~tnt@228.204-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:09] * l0nk (~alex@87-231-111-125.rev.numericable.fr) has joined #ceph
[8:10] * norbi (~nonline@buerogw01.ispgateway.de) has joined #ceph
[8:27] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[8:36] * sleinen (~Adium@2001:620:0:26:ad73:2c4f:b04d:6007) has joined #ceph
[8:37] * barryo (~borourke@cumberdale.ph.ed.ac.uk) Quit (Quit: Leaving.)
[9:02] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[9:11] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[9:11] * BManojlovic (~steki@ has joined #ceph
[9:12] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[9:15] * eschnou (~eschnou@ has joined #ceph
[9:15] * l0nk (~alex@87-231-111-125.rev.numericable.fr) Quit (Quit: Leaving.)
[9:24] * rustam (~rustam@ has joined #ceph
[9:28] * tnt (~tnt@228.204-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:28] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:34] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[9:35] * leseb (~Adium@ has joined #ceph
[9:35] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:37] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:38] * ScOut3R (~ScOut3R@ has joined #ceph
[9:40] * l0nk (~alex@ has joined #ceph
[9:42] * loicd (~loic@ has joined #ceph
[9:48] * rustam (~rustam@ Quit (Remote host closed the connection)
[9:54] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[9:58] * Vjarjadian_ (~IceChat77@ Quit (Quit: For Sale: Parachute. Only used once, never opened, small stain.)
[10:02] * LeaChim (~LeaChim@ has joined #ceph
[10:07] * rustam (~rustam@ has joined #ceph
[10:26] * rahmu (~rahmu@ has joined #ceph
[10:28] * jgallard (~jgallard@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[10:28] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:29] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[10:29] * schlitzer|work (~schlitzer@ Quit (Remote host closed the connection)
[10:31] <loicd> jgallard: for a ceph cluster, you define pools in which objects / RBD are created
[10:31] <loicd> when used with OpenStack, there usually is a single RBD pool
[10:32] <loicd> each pool has control ( via a set of rules in the crush map ) over the object placement ( http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds )
[10:34] <loicd> however, once an object ( or many objects in the case of RBD ) are created within a pool, the placement is controlled by ceph and the user cannot influence the placement
[10:34] <loicd> the user can however ask on which host the primary OSD of a given object is located
[10:36] <loicd> since a RBD is made of multiple objects, the list of primary OSDs for all objects composing it may very well be all hosts used by the pool
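[Editor's note: "ask on which host the primary OSD of a given object is located" corresponds to the `ceph osd map` command; the pool and object names below are hypothetical illustrations.]

```shell
# Show the placement of one object: the output includes the acting set,
# whose first OSD is the primary.
ceph osd map rbd rb.0.1234.000000000000

# To enumerate the backing objects of an RBD image first, list the pool:
rados -p rbd ls | head
```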
[10:36] <jgallard> ok
[10:37] <jgallard> but, all these primary OSD can be configured to be in the same rack?
[10:38] <jgallard> loicd, ^
[10:38] <loicd> no
[10:39] <jgallard> it's not possible to configure several pools? one per rack?
[10:39] <loicd> You can define a pool that will allocate within the same rack.
[10:40] <loicd> and therefore you have primary OSD within the same rack.
[10:40] <jgallard> ok, so in that case, all the OSD will be in this rack, right?
[10:40] <Gugge-47527> but why? you would lose the pool if you lose the rack :)
[10:40] <loicd> yes
[10:40] <loicd> Gugge-47527: exactly :-)
[10:40] <jgallard> at this time the rack is not our granularity of failure :D
[10:41] <loicd> Gugge-47527: for the record jgallard is working on the cinder multi-backend implementation but he is new to ceph ;-)
[10:41] <Gugge-47527> okay :)
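[Editor's note: the rack-local pool loicd describes would be expressed as a CRUSH rule like the following sketch in a decompiled crush map; the rack name and ruleset number are hypothetical.]

```
# Keep all replicas of a pool inside rack1: descend from the rack bucket
# and pick distinct hosts beneath it.
rule rack1_local {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take rack1
    step chooseleaf firstn 0 type host
    step emit
}
```

A pool would then be pointed at it with something like `ceph osd pool set <pool> crush_ruleset 3` (bobtail-era option name).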
[10:42] * rustam (~rustam@ Quit (Remote host closed the connection)
[10:45] <jgallard> loicd, in your link ( https://etherpad.openstack.org/instance-volume-collocation ) you talk about availability zone, but aggregates should be most appropriate, no? ( https://wiki.openstack.org/wiki/Host-aggregates )
[10:45] <Gugge-47527> kernel rbd still dont support format 2 images in any kernels right?
[10:45] * fghaas (~florian@212095007101.public.telering.at) has joined #ceph
[10:46] <loicd> Gugge-47527: i've seen pull requests sent a few weeks ago but I've not checked if they are upstreamed yet.
[10:47] <loicd> jgallard: aggregates would be fine too, yes
[10:47] <loicd> jgallard: I suggested availability zones because I'm familiar with them and I'm positive it's doable
[10:47] <Gugge-47527> loicd: okay :)
[10:47] * v0id (~v0@91-115-225-5.adsl.highway.telekom.at) has joined #ceph
[10:50] * jgallard still continues to read the etherpad
[10:52] * checka (~v0@91-115-228-64.adsl.highway.telekom.at) Quit (Read error: Operation timed out)
[10:52] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[10:55] * winston-d (~Miranda@fmdmzpr02-ext.fm.intel.com) Quit (Quit: Miranda IM! Smaller, Faster, Easier. http://miranda-im.org)
[11:19] * dxd828 (~dxd828@ has joined #ceph
[11:21] <dxd828> Does anyone know about a problem with Ceph-Fuse on CentOS 6.4 where the client just locks up and makes the terminal completely unresponsive? Our ceph cluster is fine and at full health. Any ideas?
[11:24] <diffuse> have you tried attaching to the process with gdb?
[11:24] <diffuse> That might indicate where it is hanging.
[11:25] * eschnou (~eschnou@ Quit (Ping timeout: 480 seconds)
[11:29] <dxd828> diffuse, I have not used GDB before, but will look into it.
[11:30] <absynth> or strace -p <pid> to it
[11:30] <absynth> maybe you can see what it's trying to do
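[Editor's note: the two suggestions above amount to the following commands; the PID is hypothetical, and attaching with gdb briefly stops the process.]

```shell
# Dump backtraces of every thread in the hung ceph-fuse process, then detach:
gdb -batch -p 1234 -ex 'thread apply all bt'

# Or follow its system calls (and those of its threads) to a file
# to see which call it blocks in:
strace -f -p 1234 -o ceph-fuse.strace
```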
[11:31] * ninkotech_ (~duplo@static-84-242-87-186.net.upcbroadband.cz) Quit (Quit: Konversation terminated!)
[11:31] * fghaas (~florian@212095007101.public.telering.at) Quit (Quit: Leaving.)
[11:32] * ninkotech_ (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[11:33] <dxd828> absynth, ok will have a go.. Just checked the logs and I'm getting this a lot http://pastebin.com/AVGU6a91
[11:34] * thorus (~jonas@pf01.intranet.centron.de) Quit (Remote host closed the connection)
[11:35] <wogri_risc> is your ceph client also a cluster member at the same time?
[11:35] <wogri_risc> your logs look like networking issues to me.
[11:35] <absynth> that dmesg can come from anything
[11:36] <wogri_risc> absynth: true
[11:36] <dxd828> wogri_risc, nope just a fuse client
[11:36] <absynth> it's the typical entry when a task cannot access the file system for more than hung_task_timeout_secs seconds
[11:37] <absynth> we used to see this a lot when having issues on the cluster, but you already ruled that out
[11:38] <wogri_risc> dxd828 - what happens if you have a ping running on this machine to the mon you're talking to
[11:42] <dxd828> wogri_risc, well we have three mons on the same gbit network, so I just get lots of responses under 0.1ms
[11:42] <wogri_risc> even when the client hangs? duh.
[11:42] <wogri_risc> or when the client starts to hang
[11:44] <dxd828> wogri_risc, yep. When i try and ls or cd into the mounted directory the commands just hang and then I have to kill the terminal
[11:47] <absynth> hm, this is cephfs, right?
[11:47] <absynth> MDS issues?
[11:50] <dxd828> absynth, yep. I am running the MDS on the same server as the OSD but on a different drive.
[11:50] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Read error: Connection reset by peer)
[11:51] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[11:54] <absynth> hmmm, i have no experience with CephFS
[11:54] <absynth> is colocating OSD and MDS supported?
[11:54] <wogri_risc> although I don't have any experience myself I think this is supported.
[12:06] * ninkotech (~duplo@static-84-242-87-186.net.upcbroadband.cz) Quit (Quit: Konversation terminated!)
[12:06] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[12:07] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:10] * mcclurmc_laptop (~mcclurmc@firewall.ctxuk.citrix.com) has joined #ceph
[12:14] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[12:19] <dxd828> Does anyone know if it is possible to get the CephFS kernel client working on CentOS 6.4 2.6.32?
[12:19] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Quit: ...)
[12:20] * jlogan (~Thunderbi@2600:c00:3010:1:59e7:37c9:7f2a:f7e2) Quit (Quit: jlogan)
[12:21] <dxd828> ^ nevermind, just realised kernel is too old.
[12:28] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[12:43] <loicd> is it possible to merge two pools together or split a pool in two ?
[12:46] <wido> loicd: I don't think so. Merging can only be done by simply do a cp from pool A to B
[12:47] <wido> loicd: Since the PG name start with the pool ID they are in
[12:47] <wido> you'd have to rename all PGs to do so
[12:47] * eschnou (~eschnou@ has joined #ceph
[12:47] <loicd> wido: thanks, that makes sense indeed ;-)
[12:49] <wido> loicd: I saw you have a pull request open on Github
[12:49] <loicd> yes
[12:49] <wido> did you have to ping Sage or somebody for it? Or do they notice them?
[12:49] <loicd> nothing urgent though
[12:50] <wido> Mine isn't urgent either, but just wondering if they get some kind of notification
[12:50] <loicd> I did not do that recently. I rebased it a week ago to help with review. I don't want to be pushy for a marginal test ;-)
[12:50] * KindOne (~KindOne@h8.176.130.174.dynamic.ip.windstream.net) has joined #ceph
[13:00] <joao> wido, loicd, we do get notifications from gh
[13:01] <joao> but sage is on vacation, so that might be why things have been going slower
[13:01] <wido> joao: Thanks! No hurry though, was just wondering
[13:01] <loicd> joao: thanks for reassuring us :-)
[13:01] <wido> Sage needs vacation as well
[13:01] <joao> he does indeed
[13:01] <joao> :p
[13:01] <absynth> what do you mean, vacation?
[13:02] * tserong (~tserong@124-171-119-73.dyn.iinet.net.au) has joined #ceph
[13:02] <absynth> is that what you give against flu and smallpox?
[13:02] <loicd> absynth: :-)
[13:11] * verwilst (~verwilst@dD576962F.access.telenet.be) has joined #ceph
[13:12] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[13:13] * diegows (~diegows@ has joined #ceph
[13:13] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[13:19] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[13:40] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:00] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[14:14] <joelio> /wi4
[14:14] <joelio> balls
[14:14] * joelio really needs to fix irssi autocomplete.. or just read
[14:17] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[14:21] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[14:29] * gaveen (~gaveen@ has joined #ceph
[14:35] * vanham (~vanham@ has joined #ceph
[14:37] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[14:43] <Azrael> health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
[14:43] <Azrael> hmmm
[14:44] <Azrael> i can't seem to figure out how to get this out of inactive/unclean
[14:46] <Azrael> i think i need to adjust my crush map... i'll try that.
[14:47] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[14:48] * brambles (lechuck@s0.barwen.ch) Quit (Read error: Connection reset by peer)
[14:48] * brambles (lechuck@s0.barwen.ch) has joined #ceph
[14:48] * capri (~capri@ Quit (Read error: Connection reset by peer)
[14:48] * capri (~capri@ has joined #ceph
[14:49] * JohansGlock (~quassel@kantoor.transip.nl) has joined #ceph
[14:49] <vanham> Azrael, I'm also getting started at ceph but
[14:49] <vanham> Can you post your full ceph -w here?
[14:50] <vanham> (until first bottom status line)
[14:51] <Azrael> vanham: sure
[14:51] <Azrael> health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
[14:51] <Azrael> monmap e3: 3 mons at {ceph-jsw-mon0=,ceph-jsw-mon1=,ceph-jsw-mon2=}, election epoch 18, quorum 0,1,2 ceph-jsw-mon0,ceph-jsw-mon1,ceph-jsw-mon2
[14:51] <Azrael> osdmap e24: 6 osds: 6 up, 6 in
[14:51] <Azrael> pgmap v60: 192 pgs: 192 creating; 0 bytes data, 197 MB used, 23730 MB / 23928 MB avail
[14:51] <Azrael> mdsmap e1: 0/0/1 up
[14:52] <vanham> There
[14:52] <vanham> did you just start this cluster?
[14:52] <Azrael> 3 mon nodes and 3 osd nodes. each osd node has two osd daemons (two disks)
[14:52] <Azrael> yes
[14:52] <joao> it would be much appreciated if you wouldn't paste huge walls of logs in the channel
[14:52] <wogri_risc> Azrael: this is usually a crushmap problem.
[14:52] * _are_ (~quassel@2a01:238:4325:ca00:f065:c93c:f967:9285) Quit (Ping timeout: 480 seconds)
[14:52] * _are_ (~quassel@2a01:238:4325:ca00:f065:c93c:f967:9285) has joined #ceph
[14:52] <vanham> it is initializing
[14:52] <Azrael> sorry joao
[14:52] <vanham> oh
[14:52] <Azrael> vanham: its not doing anything however. i just started it a couple hours ago.
[14:52] <vanham> how long ago?
[14:52] <vanham> ok
[14:52] <vanham> sorry
[14:52] <dxd828> diffuse, absynth following your advice from earlier I did a strace of the fs fuse client when a lookup occurred, the log is 265MB and most of it looks the same. Is there anything in particular I should look for?
[14:53] <Azrael> its running in virtualbox
[14:53] <vanham> did you alter your crushmap?
[14:53] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[14:53] <Azrael> yes
[14:53] <Azrael> i added the devices
[14:53] <vanham> ahhh, so this is what the mkceph script does
[14:53] <vanham> hummm
[14:53] <vanham> Did you think of using it?
[14:53] <Azrael> i didn't use mkcephs
[14:54] <vanham> There is documentation at its beginning on how to use it to add OSDs one by one
[14:54] <Azrael> which is probably my issue
[14:54] <Azrael> i'm using chef instead
[14:54] <Azrael> ok
[14:54] <vanham> it is easier to start with it
[14:54] <vanham> ok them
[14:54] <Azrael> indeed. i have another cluster, deployed with mkcephfs, running ok.
[14:55] <Azrael> ok
[14:55] <Azrael> i'll poke around mkcephfs man page etc
[14:55] <Azrael> gather up the steps
[14:55] <vanham> Although I don't have that much experience with it I see two possibilities
[14:55] <vanham> 1 - OSD can't write to the disk
[14:55] <vanham> 2 - Nodes can't communicate with each other
[14:55] <vanham> as far as the manual setup goes I dont know anything about it
[14:56] <vanham> but
[14:56] * JohansGlock_ (~quassel@kantoor.transip.nl) Quit (Ping timeout: 480 seconds)
[14:56] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[14:56] <vanham> ceph -w already says osds are up and in
[14:56] <Azrael> hmm
[14:56] <vanham> the pgmap already have the pieces defined
[14:56] <Azrael> right
[14:56] <vanham> so I would bet at 1 or 2
[14:57] * l0uis (~l0uis@madmax.fitnr.com) has joined #ceph
[14:57] * jgallard_ (~jgallard@ has joined #ceph
[14:57] <vanham> did you try the logs at /var/log/ceph/*.log?
[14:58] <Azrael> 2013-04-10 14:50:48.553214 7f1242965780 -1 journal _check_disk_write_cache: fclose error: (61) No data available
[14:58] <Azrael> 2013-04-10 14:50:48.553278 7f1242965780 1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 26: 1049604096 bytes, block size 4096 bytes, directio = 1, aio = 1
[14:58] <Azrael> seeing stuff like that
[14:59] <vanham> "No data available" at fclose()? It doesn't even make sense from a programmer's perspective
[14:59] <Azrael> hehe
[15:00] <vanham> 1 sec
[15:00] * imjustmatthew (~imjustmat@c-24-127-107-51.hsd1.va.comcast.net) has joined #ceph
[15:01] <Azrael> i suspect the issue is the crushmap
[15:01] <vanham> du -sh /var/lib/ceph/osd at the osd nodes. What comes up?
[15:01] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:01] <Azrael> 332K /var/lib/ceph/osd/ceph-4
[15:01] <vanham> is the journal below it?
[15:01] <Azrael> so
[15:01] <vanham> or it is at another device
[15:01] <Azrael> i used ceph-disk-prepare
[15:01] <vanham> ?
[15:02] <Azrael> which makes it another device
[15:02] <vanham> ok
[15:02] <Azrael> /var/lib/ceph/osd/ceph-*/journal is just a symlink to /dev/disk/by-partuuid/<uuid>
[15:02] <Azrael> which is in turn a symlink to /dev/sdX2
[15:02] <vanham> ok
[15:02] <Azrael> where X is the dev...
[15:04] * jgallard (~jgallard@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[15:04] <vanham> one last thing
[15:04] <vanham> does ceph osd dump prints the right ip addresses?
[15:04] <vanham> (as the nodes can use those to communicate)
[15:04] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has left #ceph
[15:05] * jgallard_ (~jgallard@ Quit (Ping timeout: 480 seconds)
[15:05] <Azrael> woooo
[15:06] <Azrael> fixed the map
[15:06] <Azrael> disk going bananas
[15:06] <vanham> great
[15:06] <vanham> !
[15:06] <vanham> sorry my questions didn't help
[15:06] <vanham> what is wrong with the map?
[15:07] <vanham> (this is my way of learning what mistakes I'll make in the future, hehe)
[15:08] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[15:11] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[15:12] * ljonsson (~ljonsson@ext.cscinfo.com) has joined #ceph
[15:12] <Azrael> hehe
[15:12] <Azrael> well
[15:12] <Azrael> it was the stock default map
[15:13] <Azrael> types were defined, default rules there
[15:13] <Azrael> and the 'root' type was declared but that was it
[15:13] <Azrael> nothing under root
[15:13] <Azrael> ie. no racks, hosts, and their osds
[15:13] <vanham> hummmm
[15:13] <vanham> thanks!
[15:13] <Azrael> sure thing
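[Editor's note: the missing hierarchy Azrael describes — a `root` with no hosts or OSDs under it — would be filled in with bucket entries like this sketch in the decompiled map; all names, ids, and weights here are hypothetical.]

```
# A minimal hierarchy under root: one host bucket holding two OSDs,
# referenced by the default root.
host node0 {
    id -2
    alg straw
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}
root default {
    id -1
    alg straw
    item node0 weight 2.000
}
```

The usual edit cycle is: `ceph osd getcrushmap -o map.bin`, `crushtool -d map.bin -o map.txt`, edit, `crushtool -c map.txt -o new.bin`, `ceph osd setcrushmap -i new.bin`.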
[15:19] <joao> this is annoying; I keep seeing unicorns on gh
[15:29] <imjustmatthew> joao: are you around this morning?
[15:29] <joao> I am
[15:30] <imjustmatthew> joao: Awesome, do you have a sec to glance at http://pastebin.com/KAEEvd20 and see if that's likely the same issue from Monday or a different issue?
[15:30] <joao> different issue
[15:30] <joao> that one is on the mds
[15:32] <imjustmatthew> K, I only have light logging on it, and I don't know how to reproduce it, but the log I have is at http://goo.gl/VAIFh
[15:32] <imjustmatthew> do you want me to drop it over into a new issue at the tracker?
[15:33] <joao> imjustmatthew, I don't seem to find any similar issue on the tracker, so that would probably be best
[15:33] <joao> please file it under 'fs'
[15:37] * jgallard (~jgallard@ has joined #ceph
[15:38] <imjustmatthew> Opened as #4696
[15:38] <imjustmatthew> Thanks for your help!
[15:39] <joao> thank you
[15:44] * portante (~user@ has joined #ceph
[15:46] * loicd (~loic@ has joined #ceph
[15:46] * jskinner (~jskinner@ has joined #ceph
[15:51] * drokita (~drokita@ has joined #ceph
[16:00] * lerrie2 (~Larry@remote.compukos.nl) has joined #ceph
[16:01] <Elbandi_> if a raid full stripe size is 256k, what is the good value for ceph stripe unit/count and object size?
[16:05] <matt_> Elbandi_, probably anything that is divisible by 256k
[16:08] <Elbandi_> hmm, currently this is 4M (default value)
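For matching CephFS striping to a 256k RAID full stripe, the layout of a directory could be inspected and set with the `cephfs` helper tool that shipped with ceph at the time. A sketch — the mount point is hypothetical, and the values (256k stripe unit, default 4M object size) are just one choice divisible as matt_ suggests:

```shell
# Show and set the file layout on a CephFS directory (hypothetical mount path).
cephfs /mnt/ceph/dir show_layout
# -u stripe_unit (bytes), -c stripe_count, -s object_size (bytes)
cephfs /mnt/ceph/dir set_layout -u 262144 -c 4 -s 4194304
```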
[16:09] * PerlStalker (~PerlStalk@ has joined #ceph
[16:10] <Elbandi_> but the issue is, that there is 4x incoming bandwidth, while serving files from cephfs with webserver (random read)
[16:11] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[16:12] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[16:26] * norbi (~nonline@buerogw01.ispgateway.de) Quit (Quit: Miranda IM! Smaller, Faster, Easier. http://miranda-im.org)
[16:26] <dxd828> Does anyone know why you would get "closing stale session client" in the mds log? It shows the ip of the client which is experiencing the hanging.
[16:41] * vata (~vata@2607:fad8:4:6:4d:fbbc:e8dd:b327) has joined #ceph
[16:42] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[16:43] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[16:44] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[16:44] <loicd> matt_: Hi ! FYI here is the use case related to the questions I asked yesterday about OSD / RBD placement https://etherpad.openstack.org/instance-volume-collocation . And thanks again for the feedback :-)
[16:45] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) Quit (Quit: Leaving.)
[16:48] * fghaas (~florian@213162068047.public.t-mobile.at) has joined #ceph
[16:49] * tchmnkyz (~jeremy@0001638b.user.oftc.net) has joined #ceph
[16:50] <tchmnkyz> hey guys, i keep seeing an error every now and then when i run a ceph -s
[16:50] <tchmnkyz> i get 2013-04-10 09:45:44.296220 7f4e4c591700 0 -- :/16558 >> pipe(0x22f2500 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
[16:50] <matt_> loicd, good find. Thanks for sharing
[16:51] * fghaas (~florian@213162068047.public.t-mobile.at) Quit ()
[16:52] <loicd> matt_: yw :-)
[16:52] <matt_> tchmnkyz, that usually means you have a down/unavailable monitor
[16:55] <tchmnkyz> it shows up
[16:56] <tchmnkyz> maybe its hung i will restart that mon daemon
[16:56] <matt_> do you have multiple monitors? When you have 3 it might fault on the first then try the second and succeed
[16:56] <tchmnkyz> yea
[16:56] <tchmnkyz> i have 3 right now
[16:56] <tchmnkyz> i have 3 servers that all they do is mon/mds
[16:56] <matt_> it probably means on is down or out of quorum
[16:57] <matt_> one*
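Whether one of the three mons is down or out of quorum, as matt_ suspects, can be confirmed from any client:

```shell
# Check monitor quorum membership.
ceph mon stat        # one-line summary: monmap epoch, mon addresses, quorum set
ceph quorum_status   # JSON detail: which mons are in quorum, who is leader
ceph -s              # the monmap line also lists the current quorum
```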
[16:57] <tchmnkyz> ok
[16:57] <tchmnkyz> i just restarted the one mon daemon
[16:57] <dxd828> Can anyone make sense of this line in the MDS log? >> pipe(0x22bf6b80 sd=36 :6804 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
[17:09] * drokita1 (~drokita@ has joined #ceph
[17:12] * drokita (~drokita@ Quit (Read error: Operation timed out)
[17:14] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[17:15] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) has joined #ceph
[17:17] * drokita1 (~drokita@ Quit (Ping timeout: 480 seconds)
[17:20] <ron-slc> Hello all. quick question on SATA disk failures. I had a SATA disk become unresponsive with read timeouts, and subsequent XFS Kernel Bug notices with bad page states, Then osd had bad page states, which ended with GPF, and Kernel Panic.
[17:21] * verwilst (~verwilst@dD576962F.access.telenet.be) Quit (Quit: Ex-Chat)
[17:21] <ron-slc> The reason I ask, I'm trying to plan if I should isolate SATA disks (junky), from servers with my Enterprise SAS disks (more reliable).
[17:22] * Vjarjadian (~IceChat77@ has joined #ceph
[17:25] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[17:26] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:27] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Leaving.)
[17:27] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[17:27] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[17:27] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:28] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) has joined #ceph
[17:33] * portante|lt (~user@ has joined #ceph
[17:38] * tnt (~tnt@228.204-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:38] * loicd (~loic@magenta.dachary.org) has joined #ceph
[17:39] <andreask> separate them in different pools
[17:39] <sleinen> ron-slc: SATA disks tend to stubbornly retry reading from bad sectors. Some SATA disks have a feature called "TLER" (Time Limited Error Recovery) that makes them play better with RAID-style systems such as Ceph. Enabling that should help you.
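The TLER/ERC setting sleinen mentions can be queried and set with smartctl on drives that support SCT Error Recovery Control; the timer values are in tenths of a second, so 70 means 7.0s. A sketch (`/dev/sdX` is a placeholder, and the setting does not persist across power cycles on most drives):

```shell
# Query the drive's current read/write error-recovery timers.
smartctl -l scterc /dev/sdX
# Cap both read and write recovery at 7.0 seconds (70 deciseconds).
smartctl -l scterc,70,70 /dev/sdX
```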
[17:44] * jskinner_ (~jskinner@ has joined #ceph
[17:47] * sleinen (~Adium@2001:620:0:26:ad73:2c4f:b04d:6007) Quit (Quit: Leaving.)
[17:47] * sleinen (~Adium@ has joined #ceph
[17:48] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:51] * jskinner (~jskinner@ Quit (Ping timeout: 480 seconds)
[17:52] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) Quit (Quit: Leaving.)
[17:52] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[17:55] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[17:57] * gregorg (~Greg@ has joined #ceph
[17:58] <alexxy> joao: upgrade was from version 0.56.x
[17:58] <joao> ?
[17:59] <alexxy> since i still cannot login to http://tracker.ceph.com/issues/4644
[17:59] <scuttlemonkey> alexxy: can you try creating an account again, I'd like to see what's going on there (hopefully just an issue of getting sent to spam or something)
[18:00] <alexxy> scuttlemonkey: can you delete my old one?
[18:00] <alexxy> alexxy
[18:00] <scuttlemonkey> email?
[18:00] <alexxy> mail alexxy at gentoo dot org
[18:00] * jskinner_ (~jskinner@ Quit (Remote host closed the connection)
[18:00] <alexxy> or add openid adress
[18:01] <scuttlemonkey> I don't see either alexxy username or that email in user list
[18:01] <alexxy> ohh
[18:01] <alexxy> strange
[18:01] <alexxy> lets try again in a hour or so
[18:01] <alexxy> i should go for some time
[18:02] <ron-slc> sleinen: thanks for the advice!! I'm hoping to make failures a little more graceful.
[18:02] <scuttlemonkey> alexxy: sure np, just shout if you need my help
[18:03] <tchmnkyz> w
[18:03] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:04] <SvenPHX> I have a fun problem in my dev cluster. The data partitions on all three of my mon servers got blown away
[18:04] <SvenPHX> is it possible to recover?
[18:04] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[18:05] <joao> eh, don't think so
[18:06] * portante` (~user@ has joined #ceph
[18:06] <SvenPHX> Ok
[18:06] <SvenPHX> NBD, just wanted to try in case it every happened to my production cluster
[18:08] * BillK (~BillK@124-148-226-41.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:09] <joao> imjustmatthew, around?
[18:09] <imjustmatthew> yes
[18:10] <joao> are you able to reliably reproduce the mon crash from the other day?
[18:10] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Quit: Leaving.)
[18:10] * portante|lt (~user@ Quit (Ping timeout: 480 seconds)
[18:10] <imjustmatthew> reproduce might be a strong word.... it keeps happending and I don't know why
[18:11] <joao> well, I do know why, and am working on a patch; just wanted to know if you could give it a spin when it's finished
[18:12] <imjustmatthew> Yes, definitely
[18:12] <joao> also, imjustmatthew, http://tracker.ceph.com/issues/3495
[18:13] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[18:16] <imjustmatthew> k, thanks
[18:16] * leseb (~Adium@ Quit (Quit: Leaving.)
[18:23] * BillK (~BillK@58-7-112-44.dyn.iinet.net.au) has joined #ceph
[18:23] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:32] * mattch (~mattch@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:35] * noahmehl_away (~noahmehl@cpe-75-186-45-161.cinci.res.rr.com) has joined #ceph
[18:40] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[18:40] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[18:45] * chutzpah (~chutz@ has joined #ceph
[18:45] * jgallard (~jgallard@ Quit (Quit: Leaving)
[18:51] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[18:51] * loicd (~loic@magenta.dachary.org) has joined #ceph
[18:51] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:52] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[18:55] * mcclurmc_laptop (~mcclurmc@firewall.ctxuk.citrix.com) Quit (Ping timeout: 480 seconds)
[18:57] * yasu` (~yasu`@dhcp-59-149.cse.ucsc.edu) has joined #ceph
[19:01] * Cube (~Cube@ has joined #ceph
[19:01] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[19:03] <yasu`> Hi, does anyone have a pointer to a doc that can tell me how to recover a Ceph cluster that hanged up at full usage space ?
[19:05] <yasu`> I have repsize=3 and one reserved OSD, so either changing to repsize=2 or adding the reserved OSD might add some space for the cluster to breathe. Is this possible ?
[19:05] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[19:07] <scuttlemonkey> yasu`: http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#no-free-drive-space
[19:07] <scuttlemonkey> probably the best place to start
[19:08] <yasu`> scuttlemonkey: thanks, I read and come back if further question.
[19:08] <scuttlemonkey> cool, good luck
[19:11] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:13] <elder> joshd, what sets up the symlinks under /dev/rbd/rbd/?
[19:14] * gregaf (~Adium@cpe-76-174-249-52.socal.res.rr.com) has joined #ceph
[19:15] <yasu`> I didn't have spare OSD ...
[19:15] * BillK (~BillK@58-7-112-44.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[19:17] <yasu`> scuttlemonkey: could you tell me which instructions the doc refers to by 'deleting some PGs in the full OSD' ?
[19:17] * rahmu (~rahmu@ Quit (Remote host closed the connection)
[19:17] <elder> joshd, it looks like format 2 images don't get their symlink created in /dev/rbd/rbd, though format 1 images do.
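On the `/dev/rbd/<pool>/<image>` question elder raises: those symlinks are created by a udev rule shipped with ceph that invokes the `ceph-rbdnamer` helper on rbd block devices. A sketch for checking that the machinery is in place (the rules path is the usual install location, but may vary by distro/package):

```shell
# The symlinks come from ceph's udev rule calling ceph-rbdnamer.
cat /lib/udev/rules.d/50-rbd.rules
# Re-run the rules if links are missing for already-mapped devices.
udevadm trigger --subsystem-match=block
```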
[19:17] <yasu`> In my understanding, if we remove some PGs but the OSD gets back into the cluster, it will receive the removed PGs from the other OSDs again, no ?
[19:18] <scuttlemonkey> yasu`: guess I misinterpreted your 'one reserved OSD' statement...sounded like a spare to me
[19:19] <yasu`> You thought right; I also thought I had a spare. :)
[19:19] <scuttlemonkey> if the cluster is absolutely deadlocked and you want to free up space after you change it to replication level of 2...you can manually delete pgs from the OSDs individually
[19:19] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:19] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[19:19] <yasu`> I confirmed afterwards that I don't have it
[19:19] <scuttlemonkey> ahh
[19:20] <yasu`> so I guess changing the repsize=2 is the easiest way ?
[19:20] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[19:21] <scuttlemonkey> if you can't add more space, then that would be the cheapest way to free up space, yes
[19:21] <scuttlemonkey> unless you want to delete data
[19:21] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[19:22] <yasu`> Okay, seeking for the instruction to change repsize ...
[19:22] <elder> joshd, strike that. I don't have version 2 image suppot compiled in the kernel I'm using. The "rbd map" command isn't reporting an error though...
[19:23] <scuttlemonkey> yasu`: http://ceph.com/docs/master/rados/operations/pools/#set-the-number-of-object-replicas
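The replica-count change from the doc scuttlemonkey links, applied to yasu`'s case (dropping the `data` pool from 3 replicas to 2 to free space):

```shell
# Lower the replica count on the data pool, then confirm it took.
ceph osd pool set data size 2
ceph osd dump | grep 'rep size'   # bobtail-era osd dump shows 'rep size' per pool
```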
[19:23] <yasu`> I feel like I read somewhere that I cannot change repsize without deleting the pool ... ?
[19:24] <scuttlemonkey> I think you are referring to increasing the number of placement groups in a pool
[19:25] <scuttlemonkey> not the replication level
[19:25] <scuttlemonkey> pg splitting, while it now exists, is still a very experimental feature
[19:25] <yasu`> okay. that's good
[19:27] <yasu`> the command gets accepted, ceph osd dump shows the rep size =2 for data pool,
[19:27] <yasu`> but ceph -s shows HEALTH_ERR 1 full osd(s); 5 near full osd(s)
[19:27] <yasu`> do I need to scrub or something ?
[19:28] <yasu`> ah it's changing gradually
[19:28] <scuttlemonkey> yeah, it takes a bit of time to rebalance
[19:28] * BillK (~BillK@124-148-214-105.dyn.iinet.net.au) has joined #ceph
[19:29] <yasu`> I learned that having spare OSD and having the repsize rather larger might serve a good buffer for emergency. :)
[19:31] <scuttlemonkey> yeah, you can also change the health warning to give you a bit more lead time
[19:32] <scuttlemonkey> think it defaults to 95% full...but you could change that to 90% or lower
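The nearfull/full thresholds scuttlemonkey mentions could be lowered at runtime with the pg ratio commands of that era; the 0.80/0.90 values here are examples, not recommendations:

```shell
# Warn earlier (default nearfull ~85%) and stop writes earlier (default full ~95%).
ceph pg set_nearfull_ratio 0.80
ceph pg set_full_ratio 0.90
```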
[19:32] <yasu`> okay, I'll set it if it recovers, thanks.
[19:32] <scuttlemonkey> np
[19:33] <yasu`> lovely, it became HEALTH_OK automatically. :)
[19:33] <scuttlemonkey> sweet, glad to hear it
[19:35] * jskinner (~jskinner@ has joined #ceph
[19:38] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[19:39] * Steki (~steki@fo-d- has joined #ceph
[19:39] * jskinner (~jskinner@ has joined #ceph
[19:41] * jboster (~james@ has joined #ceph
[19:42] <jboster> Hi, Just wondering about ceph hadoop integration. Is it robust enough to spend some time trying out?
[19:43] * noahmehl_away is now known as noahmehl
[19:43] * rturk-away is now known as rturk
[19:43] * rturk is now known as rturk-away
[19:43] * rturk-away is now known as rturk
[19:45] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[19:46] <ljonsson> Hello everyone, I have a newbie question about RBD
[19:47] <gregaf> hope it is, jboster ;)
[19:47] <ljonsson> The other day one of my ceph nodes went down, but came back up fine, repairing itself ok
[19:47] <gregaf> I know nwatkins and Joe (they're not online right now) run it fairly often; our regression tests are limited but exist and are expanding
[19:47] <ljonsson> Since then I can't mount my RBD volumes anymore
[19:47] <jboster> gregaf: Ok Great! It looks like you have to patch hadoop? Does that mean running Cloudera isn't going to work?
[19:48] <gregaf> jboster: I don't think there's any patching hadoop with the new stuff; it's just a new FileSystem JAR that you drop in place and add a few config options for
[19:49] <ljonsson> Various errors in syslog like this: http://pastebin.com/3Y0ZqMTV
[19:49] <ljonsson> Any thoughts?
[19:49] <scuttlemonkey> jboster: I know there are people using ceph w/ cloudera
[19:49] <ljonsson> please
[19:51] <gregaf> I think you just follow these instructions http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/13339 to get the JAR, and then drop it into the right places in Cloudera
[19:52] <gregaf> scuttlemonkey: have you actually run this? I'm afraid I haven't
[19:52] <scuttlemonkey> I haven't, although I think John ran through it when he wrote the doc at: http://ceph.com/docs/master/cephfs/hadoop/
[19:52] * dmick (~dmick@2607:f298:a:607:607c:b82e:2508:2c50) has joined #ceph
[19:54] * gregaf (~Adium@cpe-76-174-249-52.socal.res.rr.com) Quit (Quit: Leaving.)
[19:55] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[19:55] * ChanServ sets mode +o elder
[19:55] <scuttlemonkey> ljonsson: lookin
[19:55] <jboster> gregaf,scuttlemonkey: Thanks, I was looking at the src/client/hadoop directory in the ceph repo. I guess that is old stuff
[20:16] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Remote host closed the connection)
[20:25] * Steki (~steki@fo-d- Quit (Quit: Ja odoh a vi sta 'ocete...)
[20:26] <scuttlemonkey> ljonsson: sry, got pulled away
[20:27] <scuttlemonkey> so you can't mount your block devices at all?
[20:29] <scuttlemonkey> oh I see, you're getting timeout msgs
[20:29] * cephuser1 (4fc2db7d@ircip2.mibbit.com) has joined #ceph
[20:30] * jskinner_ (~jskinner@ has joined #ceph
[20:30] * jskinner (~jskinner@ Quit (Read error: Connection reset by peer)
[20:30] * BillK (~BillK@124-148-214-105.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[20:30] <scuttlemonkey> ljonsson: what version of ceph are you running?
[20:31] <ljonsson> 0.56.2
[20:32] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[20:33] <cephuser1> hi i'm using ceph 0.56.4 and i have to replace some drives. But while ceph is backfilling / recovering all VMs have high latency and sometimes they're even offline
[20:35] <scuttlemonkey> cephuser1: did you mark the osds out and let the cluster balance before you ripped the drives out?
[20:35] <scuttlemonkey> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
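The manual removal sequence from the doc linked above, sketched for one OSD (replace `12` with the real id; the `service` invocation assumes the sysvinit scripts of the period, and you should wait for `active+clean` after the `out` before stopping the daemon):

```shell
# Drain and remove one OSD cleanly before pulling the drive.
ceph osd out 12               # mark out; data migrates off — watch ceph -w
service ceph stop osd.12      # once the cluster is active+clean again
ceph osd crush remove osd.12  # remove it from the crushmap
ceph auth del osd.12          # delete its auth key
ceph osd rm 12                # remove it from the osdmap
```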
[20:36] <paravoid> so I've instructed my cluster to do deep scrubs on all OSDs
[20:36] <paravoid> about ~30 pgs are being scrubbed at any time
[20:36] <paravoid> the cluster has 135 OSDs, so it shouldn't be 'osd max scrubs'
[20:38] <cephuser1> scuttlemonkey i reweight all disks from 1 to 0.1/0.0 and then ceph osd down.
[20:38] <cephuser1> scuttlemonkey then i put in the new drives and i'm reweighting them from 0.0 to 1.0 in 0.1 steps
[20:39] * frank9999 (~frank@kantoor.transip.nl) Quit ()
[20:39] <scuttlemonkey> cephuser1: you shouldn't have to ramp up slowly
[20:39] <scuttlemonkey> with bobtail (I believe) you should be able to just add the osd at 1.0 and let the cluster balance
[20:39] <cephuser1> scuttlemonkey i already lowered osd recovery max active = 2 and osd max backfills = 3
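The recovery settings cephuser1 mentions can also be changed at runtime with injectargs, without restarting daemons or editing ceph.conf. A sketch for one OSD — repeat per osd id, and the values shown are illustrative throttles, not tuned recommendations:

```shell
# Throttle recovery/backfill on a running OSD (repeat for each osd id).
ceph tell osd.0 injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1'
```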
[20:40] <scuttlemonkey> ljonsson: lemme poke one of the experts, I'm not quite groking the issue here
[20:40] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[20:40] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit ()
[20:40] <cephuser1> scuttlemonkey but when i put them back at 1.0 the vms are nearly all down. So right now i'm reweighting in 0.1 steps. The VMs are then "just" down for 5-7min
[20:41] <cephuser1> scuttlemonkey i then wait 20 min and then reweight by 0.1 again... just an ugly workaround...
[20:41] * Steki (~steki@fo-d- has joined #ceph
[20:43] * danieagle (~Daniel@ has joined #ceph
[20:43] <scuttlemonkey> cephuser1: yeah, something about that doesn't make sense...these balancing issues were fixed in bobtail
[20:43] <scuttlemonkey> did you replace a ton of osds all at once?
[20:43] * BillK (~BillK@124-148-221-217.dyn.iinet.net.au) has joined #ceph
[20:44] <scuttlemonkey> ljonsson: what kernel are you using?
[20:44] <ljonsson> Linux plcephd01 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
[20:45] <cephuser1> scuttlemonkey mhm right now some drives are SSDs so they're a lot faster than the HDDs i'm going to replace them too. May this be the problem?
[20:46] <dmick> ljonsson: http://ceph.com/docs/master/install/os-recommendations/#linux-kernel
[20:46] <dmick> one of the reasons is the version of rbd in 3.2 is pretty old
[20:46] <dmick> I'm not sure if that bug is one we know about, but basically we've stopped looking at bugs in code that old
[20:47] * dpippenger (~riven@ has joined #ceph
[20:47] <scuttlemonkey> thanks dmick
[20:48] <scuttlemonkey> cephuser1: hmmm, I don't think that would cause issue
[20:48] <scuttlemonkey> I'd be more inclined to think something like the io setting on the pools was at fault if you're ripping and replacing lots of osds
[20:48] <cephuser1> scuttlemonkey just ONE at a time not more OSDs
[20:49] <cephuser1> so right now HEALTH_OK and everything is in balance
[20:49] <cephuser1> scuttlemonkey so right now HEALTH_OK and everything is in balance
[20:49] <scuttlemonkey> gotcha
[20:49] <cephuser1> scuttlemonkey i then just add 0.1 weight for ONE osd at a time
[20:49] <ljonsson> I understand. It's just curious that it stopped working all of a sudden after an incident
[20:50] <ljonsson> I'll look at upgrading
[20:50] <ljonsson> Thanks
[20:50] * alram (~alram@ has joined #ceph
[20:50] <cephuser1> scuttlemonkey and then the VMS get high latencies or down
[20:53] <scuttlemonkey> anything in the logs when they go down?
[20:54] <scuttlemonkey> if you're only removing 1 osd at a time and letting the cluster rebalance there is no reason for it
[20:54] <scuttlemonkey> unless your min_size is high on the pool or something
[20:55] * mcclurmc_laptop (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[20:55] <cephuser1> scuttlemonkey: no, nothing in the logs, but it is recovering at 3700MB/s, which clearly isn't possible on SATA HDDs
[20:56] <cephuser1> scuttlemonkey 2013-04-10 20:55:33.711289 mon.0 [INF] pgmap v9293315: 8128 pgs: 233 active, 7876 active+clean, 19 active+recovery_wait; 557 GB data, 1168 GB used, 7003 GB / 8171 GB avail; 2108KB/s wr, 329op/s; 31/309692 degraded (0.010%); recovering 840 o/s, 3278MB/s
[20:56] <alexxy> scuttlemonkey: i'm trying to register again to bugzie
[20:57] <alexxy> scuttlemonkey: it told me that this user already exists, as well as the mail
[20:58] <scuttlemonkey> cephuser1: dunno then...I guess it could be something strange with a mixed pool of SSD/HDD
[20:59] <scuttlemonkey> but I haven't played with that...I usually split the hardware by pool
[20:59] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has joined #ceph
[20:59] <scuttlemonkey> alexxy: looking again
[20:59] <scuttlemonkey> alexxy: (have my big monitor now)
[20:59] <dmick> alexxy: yes, you exist; the confirmation email was trapped in your spam folder
[21:00] <scuttlemonkey> alexxy: oh, I didn't grab all users...just active
[21:00] <scuttlemonkey> account deleted
[21:00] <scuttlemonkey> try now
[21:03] <cephuser1> scuttlemonkey maybe due to that one? http://tracker.ceph.com/issues/3737
[21:04] <scuttlemonkey> cephuser1: possible, although beyond my level of expertise
[21:04] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:05] * jskinner_ (~jskinner@ Quit (Remote host closed the connection)
[21:08] * jskinner (~jskinner@ has joined #ceph
[21:13] <alexxy> scuttlemonkey: from what email i should get confirmation?
[21:14] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[21:14] <scuttlemonkey> think it should be redmine@tracker.ceph.com
[21:15] <dmick> yes
[21:15] <dmick> subject "Your Ceph account activation"
[21:15] <alexxy> scuttlemonkey: i have no such mails at all
[21:15] <scuttlemonkey> I activated your account
[21:15] <alexxy> nor in spam
[21:15] <alexxy> etc
[21:15] <dmick> no To: line
[21:16] <rturk> I don't see anything in the mail queue
[21:16] <dmick> alexxy: something about your email is broken, and it might be worth investigating
[21:16] <dmick> i.e. someone is dumping false-positive-for-spam messages
[21:16] <alexxy> problem is that gentoo org mail sever forward all mails
[21:16] <alexxy> to my work mail
[21:17] <alexxy> and i dont see any rejects in mail serverlog
[21:17] <scuttlemonkey> well your account is active
[21:17] <scuttlemonkey> but I would test a few mail messages
[21:17] <scuttlemonkey> otherwise all watch notifications are going to be lost in the same way
[21:18] <dmick> it's also multipart mime, which exacerbates the situation (the .sig is a separate part)
[21:18] <rturk> Hm..
[21:18] <dmick> but still.
[21:18] <rturk> looks like it's on our side: status=bounced (host mail.gentoo.org[] said: 550 5.1.7 <redmine@tracker.ceph.com>: Sender address rejected: undeliverable address
[21:18] <rturk> let's talk to sandon about making sure redmine@tracker.ceph.com is deliverable - some mail servers actually do sender address verification
[21:18] <alexxy> ghmm
[21:19] <dmick> it just sent one to me...
[21:19] <rturk> dmick: you are one of the lucky people who use a mail server that doesn't do sender address verification?
[21:19] <rturk> http://en.wikipedia.org/wiki/Callback_verification
[21:19] <dmick> perhaps. It is my own server.
[21:19] * cephuser1 (4fc2db7d@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[21:19] <dmick> I could try with gmail I suppose
[21:20] <rturk> don't think gmail does callback verification either. It's somewhat rare
[21:20] <rturk> used to work for a company that did it, ran into this somewhat frequently
[21:21] <alexxy> scuttlemonkey: thank you
[21:22] <scuttlemonkey> alexxy: np, glad we could get you sorted.
[21:23] <dmick> RCPT TO:<redmine@tracker.ceph.com>
[21:23] <dmick> 554 5.7.1 <redmine@tracker.ceph.com>: Recipient address rejected: Access denied
[21:23] <dmick> indeed
[21:23] <alexxy> so my work server will also mark it as spam
[21:24] <dmick> also, I think we can theoretically accept issue updates by email, but not if the address doesn't work
[21:25] <dmick> rturk or scuttlemonkey, will you follow up with Sandon?
[21:25] <rturk> yes
[21:25] <rturk> doing it now
[21:25] <dmick> tnx
[21:25] <dmick> note that we might need to reconfigure in redmine as a result to avoid bogosity
[21:28] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: Leaving.)
[21:30] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:35] * BillK (~BillK@124-148-221-217.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[21:35] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:38] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[21:43] * sagewk (~sage@2607:f298:a:607:c995:b982:794a:d0eb) Quit (Ping timeout: 480 seconds)
[21:44] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) Quit (Read error: Connection reset by peer)
[21:51] * sagewk (~sage@2607:f298:a:607:dcf3:d317:b771:4962) has joined #ceph
[21:54] * jskinner (~jskinner@ Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * alram (~alram@ Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * dpippenger (~riven@ Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * portante` (~user@ Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * vanham (~vanham@ Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * stiller1 (~Adium@2001:980:87b9:1:30f7:19d2:ed14:a775) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * tjikkun_ (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * via (~via@smtp2.matthewvia.info) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * darkfaded (~floh@ Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * TMM (~hp@535240C7.cm-6-3b.dynamic.ziggo.nl) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * Svedrin (svedrin@2a01:4f8:121:3a8:0:1:0:2) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * liiwi (liiwi@idle.fi) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * Romeo (~romeo@ Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * Zethrok (~martin@ Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * Ormod (~valtha@ohmu.fi) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * wonko_be (bernard@november.openminds.be) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * mistur (~yoann@kewl.mistur.org) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * thelan (~thelan@paris.servme.fr) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * ivoks (~ivoks@jupiter.init.hr) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * Meyer__ (meyer@c64.org) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * sbadia (~sbadia@yasaw.net) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * fred1 (~fredl@2a00:1a48:7803:107:8532:c238:ff08:354) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * soren (~soren@hydrogen.linux2go.dk) Quit (reticulum.oftc.net solenoid.oftc.net)
[21:54] * jskinner (~jskinner@ has joined #ceph
[21:54] * alram (~alram@ has joined #ceph
[21:54] * dpippenger (~riven@ has joined #ceph
[21:54] * portante` (~user@ has joined #ceph
[21:54] * vanham (~vanham@ has joined #ceph
[21:54] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[21:54] * stiller1 (~Adium@2001:980:87b9:1:30f7:19d2:ed14:a775) has joined #ceph
[21:54] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[21:54] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[21:54] * tjikkun_ (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[21:54] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[21:54] * via (~via@smtp2.matthewvia.info) has joined #ceph
[21:54] * darkfaded (~floh@ has joined #ceph
[21:54] * soren (~soren@hydrogen.linux2go.dk) has joined #ceph
[21:54] * TMM (~hp@535240C7.cm-6-3b.dynamic.ziggo.nl) has joined #ceph
[21:54] * Svedrin (svedrin@2a01:4f8:121:3a8:0:1:0:2) has joined #ceph
[21:54] * liiwi (liiwi@idle.fi) has joined #ceph
[21:54] * Romeo (~romeo@ has joined #ceph
[21:54] * ivoks (~ivoks@jupiter.init.hr) has joined #ceph
[21:54] * fred1 (~fredl@2a00:1a48:7803:107:8532:c238:ff08:354) has joined #ceph
[21:54] * thelan (~thelan@paris.servme.fr) has joined #ceph
[21:54] * mistur (~yoann@kewl.mistur.org) has joined #ceph
[21:54] * sbadia (~sbadia@yasaw.net) has joined #ceph
[21:54] * wonko_be (bernard@november.openminds.be) has joined #ceph
[21:54] * Ormod (~valtha@ohmu.fi) has joined #ceph
[21:54] * Zethrok (~martin@ has joined #ceph
[21:54] * Meyer__ (meyer@c64.org) has joined #ceph
[21:54] * frank9999 (~frank@kantoor.transip.nl) has joined #ceph
[21:55] * TMM (~hp@535240C7.cm-6-3b.dynamic.ziggo.nl) Quit (Quit: Bye)
[21:56] * TMM (~hp@535240C7.cm-6-3b.dynamic.ziggo.nl) has joined #ceph
[21:59] * ljonsson1 (~ljonsson@pool-74-103-170-242.phlapa.fios.verizon.net) has joined #ceph
[21:59] * ctrl (~ctrl@ has joined #ceph
[21:59] <ctrl> Hi all!
[22:01] <scuttlemonkey> hey ctrl
[22:01] * ljonsson2 (~ljonsson@ext.cscinfo.com) has joined #ceph
[22:01] <ctrl> Is anyone using ceph as storage for mssql? Or is that a bad idea?
[22:02] <PerlStalker> It works just fine for my sql vm
[22:02] <fghaas> as in sql server?
[22:02] <ctrl> im talking about a mssql as vm in rbd
[22:03] <janos> IN a vm
[22:03] <janos> mssql isn't an OS (yet)
[22:03] <ctrl> yep
[22:04] <janos> what version win server are you hosting on? (tangent)
[22:04] <ctrl> win2003 or win2008
[22:04] * jskinner_ (~jskinner@ has joined #ceph
[22:05] * ljonsson (~ljonsson@ext.cscinfo.com) Quit (Ping timeout: 480 seconds)
[22:05] * jskinner (~jskinner@ Quit (Read error: No route to host)
[22:07] * ljonsson1 (~ljonsson@pool-74-103-170-242.phlapa.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[22:08] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:09] <ctrl> so, anyone?
[22:09] * amatter (~oftc-webi@ has joined #ceph
[22:10] <vanham> ctrl, I'll have to do it in a couple of months. Maybe before that.
[22:10] <amatter> I've noticed there's a fork of samba in the ceph repository but haven't found any documentation on how this differs from the standard samba source?
[22:11] <vanham> I'm thinking about going with it currently
[22:11] <PerlStalker> ctrl: I already told you that it works for me.
[22:11] <ctrl> PerlStalker: Cool! What's your setup?
[22:02] <PerlStalker> kvm with rbd blocks running win 2012 and sql server 2012/2008 express
[22:13] * danieagle (~Daniel@ Quit (Quit: Later :-) and Thank You Very Much For Everything!!! ^^)
[22:14] <ctrl> PerlStalker: how large your ceph cluster? and what kind of disks?
[22:14] * sleinen (~Adium@2001:620:0:26:949d:c332:f5d7:ed5e) has joined #ceph
[22:14] <PerlStalker> 6 osd nodes, 4 sas disks in raid 5.
[22:15] * drokita (~drokita@ has joined #ceph
[22:15] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[22:16] <ctrl> PerlStalker: what's performance per rbd?
[22:16] <PerlStalker> Approx 9TB of raw storage.
[22:16] <PerlStalker> ctrl: I don't have the benchmarks handy.
[22:17] <PerlStalker> Given that we were moving off of NFS on SATA, the words "Fast as hell" may have been used.
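For anyone who wants concrete numbers rather than "fast as hell", Ceph ships a simple benchmark in the `rados` tool. A minimal sketch, assuming a pool named `rbd` exists and the client has admin credentials:

```shell
# 30-second write benchmark of 4 MB objects against the "rbd" pool.
# Reports throughput (MB/s) and latency; benchmark objects are
# removed automatically when the run ends.
rados bench -p rbd 30 write
```

This measures raw object-store throughput from one client, so it is an upper bound for what a single RBD-backed VM will see, not a guarantee.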
[22:17] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[22:17] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit ()
[22:17] <amatter> Also, I'm a bit nervous about upgrading a 50 terabyte production ceph cluster from argonaut to bobtail. Has there been any significant issues with the upgrade or is data loss a significant possibility?
[22:18] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[22:18] <ctrl> PerlStalker: ok, how large mssql base?
[22:18] <PerlStalker> amatter: I had stability problems on 0.56.3. Which, I think, have been fixed in .4
[22:19] <PerlStalker> ctrl: Small.
[22:19] <ctrl> PerlStalker: What kind of network do u use?
[22:19] <ctrl> PerlStalker: well, my db is around 600Gb (
[22:20] * jluis (~JL@a94-133-44-47.cpe.netcabo.pt) has joined #ceph
[22:21] <nhm__> amatter: I know others have upgraded successfully, but that isn't really my area.
[22:21] <amatter> thanks for the advice, guys!
[22:21] <PerlStalker> ctrl: bridges through openvswitch
[22:22] <nhm__> amatter: might want to wait and see if anyone of the other guys on channel have done it.
[22:23] <PerlStalker> amatter: My upgrade was smooth but, as I said, I had stability problems with 0.56.3. I'm still waiting to see if those same problems show up now that I've upgraded to 0.56.4.
[22:24] <amatter> have either of you been using cephfs on the clusters you've upgraded?
[22:24] <PerlStalker> amatter: No.
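The argonaut-to-bobtail upgrade being discussed is a rolling one: monitors first, then OSDs, then any MDS daemons, restarting one daemon at a time. A rough sketch using the sysvinit script of that era, with daemon names (`mon.a`, `osd.0`, `mds.a`) as example values:

```shell
# On each node, install the new packages (Debian-style example):
apt-get install ceph

# Restart daemons in order, one at a time, waiting for the cluster
# to settle (ceph health) between each step:
service ceph restart mon.a    # each monitor in turn
service ceph restart osd.0    # then each OSD in turn
service ceph restart mds.a    # finally the MDS, if cephfs is in use
```

Checking `ceph health` between restarts keeps the window where data has reduced redundancy as short as possible.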
[22:25] * ctrl2 (~ctrl@ has joined #ceph
[22:25] <ctrl2> PerlStalker: where u store journals? on ssd?
[22:25] <PerlStalker> On the disks.
[22:26] <ctrl2> PerlStalker: super fast sas? :)
[22:26] <PerlStalker> Aye
[22:28] <ctrl2> PerlStalker: Ok! Thanks for answers! :)
[22:29] * jboster (~james@ Quit (Quit: Leaving)
[22:30] <PerlStalker> ctrl2: I'm only averaging .5 MB/s talking to the cluster though. It's not huge.
[22:31] <PerlStalker> But my cluster has taken spikes up to 16 MB/s without a noticeable slow down.
[22:31] * ctrl (~ctrl@ Quit (Ping timeout: 480 seconds)
[22:33] <ctrl2> PerlStalker: Right now i have basic ceph cluster connected via infiniband with sata disks, journals on ssd, 1 ssd per 3 disks. In *nix vms good write and read performance, but in win vms is poor (
[22:34] <ctrl2> PerlStalker: As basic cluster i mean a 3 nodes.
[22:34] <PerlStalker> ctrl2: Are you using virtio store on the windows boxes?
[22:35] <ctrl2> PerlStalker: Yeah, but speed is around 60 - 80 mb/s
[22:42] <PerlStalker> My windows boxes don't have that much of an i/o load.
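For the Windows-guest case above, throughput depends heavily on the guest actually using the virtio block driver rather than emulated IDE. A sketch of the libvirt disk stanza for attaching an RBD image over virtio; the pool/image name, monitor address, and secret UUID are placeholders, not values from this log:

```xml
<!-- Attach RBD image "rbd/win2008-sql" to the guest as a virtio disk.
     The Windows guest must have the virtio-blk drivers installed. -->
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='rbd/win2008-sql'>
    <host name='' port='6789'/>
  </source>
  <auth username='admin'>
    <secret type='ceph' uuid='...'/>
  </auth>
  <target dev='vda' bus='virtio'/>
</disk>
```

With `cache='writeback'`, enabling the RBD client cache on the hypervisor side is what usually closes most of the Windows-vs-Linux gap people report.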
[22:51] * ctrl2 (~ctrl@ has left #ceph
[22:54] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[22:54] * alram (~alram@ Quit (Quit: leaving)
[22:55] <nhm__> ctrl2: I've done a fair amount of testing with linux VMs, but honestly haven't done any testing with windows VMs.
[22:56] <nhm__> ctrl2: do you get any benefit from more VMs, or is it slow no matter how many windows VMs you use?
[23:04] <scheuk> I have an interesting operations question for the ceph community
[23:04] <scheuk> Our cluster is running 0.48.2 and we would like to upgrade the cluster to 0.56.4
[23:05] <scheuk> however our OSDs are getting full, I would say around 75%
[23:05] <scheuk> we are running 6 OSDs and we have the capacity to double that to a total of 12
[23:06] <scheuk> now my question is what should I do first, upgrade then expand, or expand then upgrade?
[23:06] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[23:06] <scheuk> I know when I expanded before with 0.48, it put writes into wait state until the objects were replicated.
[23:07] <janos> personally i'd upgrade first. but you'd want to hold out for someone more expert than myself to say
[23:08] <scheuk> ok
[23:08] <janos> introducing new OSD's has gotten smoother with the ramp-up in 56
[23:08] <loicd> nhm__: Hi, I worked on an openstack / ceph related optimization proposition today https://etherpad.openstack.org/instance-volume-collocation . Is it of interest to you ?
[23:08] <janos> and you'd be introducing new OSD's into an upgraded system
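The "ramp-up" janos mentions can also be done by hand: bring each new OSD in at a small CRUSH weight and raise it in steps, letting each rebalance settle before the next bump. A sketch, where `osd.6` and the weights are example values:

```shell
# Start the new OSD at a low CRUSH weight so only a little data
# migrates at first:
ceph osd crush reweight osd.6 0.2

# Watch recovery; once the cluster is back to HEALTH_OK, raise it:
ceph -w
ceph osd crush reweight osd.6 0.6
ceph osd crush reweight osd.6 1.0   # final weight
```

This trades a longer total rebalance for much smaller spikes in client write latency, which addresses the wait-state behaviour scheuk saw on 0.48.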
[23:09] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[23:22] * rustam (~rustam@ has joined #ceph
[23:23] * SubOracle (~quassel@00019f1e.user.oftc.net) has joined #ceph
[23:30] * diffuse (~diffuse@ Quit (Ping timeout: 480 seconds)
[23:38] * sleinen (~Adium@2001:620:0:26:949d:c332:f5d7:ed5e) Quit (Quit: Leaving.)
[23:46] * Yen (~Yen@ip-83-134-98-4.dsl.scarlet.be) Quit (Ping timeout: 480 seconds)
[23:47] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:47] * mattm_codero (~matt@108-95-148-196.lightspeed.austtx.sbcglobal.net) has joined #ceph
[23:48] * Yen (~Yen@ip-83-134-116-137.dsl.scarlet.be) has joined #ceph
[23:49] * drokita1 (~drokita@ has joined #ceph
[23:53] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[23:55] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[23:57] * drokita1 (~drokita@ Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.