#ceph IRC Log


IRC Log for 2012-12-11

Timestamps are in GMT/BST.

[0:01] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:11] * jefferai (~quassel@quassel.jefferai.org) has joined #ceph
[0:19] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) Quit (Remote host closed the connection)
[0:19] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[0:22] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[0:25] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[0:25] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[0:25] * Leseb_ is now known as Leseb
[0:27] <terje> I'm having trouble adding an osd to a pool I've created.
[0:27] <terje> ceph is telling me: (22) Invalid argument but I'm not sure what I'm doing wrong - looks right to me..
[0:27] <terje> here's what i'm doing: http://pastie.org/5508737
[0:28] <terje> do I need to remove the OSD from the default pool first I wonder?
[0:30] <terje> documentation suggests I don't
[0:30] * ebo^ (~ebo@ Quit (Ping timeout: 480 seconds)
[0:32] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[0:33] * Ryan_Lane (~Adium@ has joined #ceph
[0:35] <lurbs> terje: When you create a new pool it'll use the default CRUSH map by default, in which case it's really just a new namespace.
[0:36] <lurbs> And all of the existing OSDs that are referenced in the default CRUSH map are used for that pool.
[0:37] <terje> what is the point of having an additional pool I wonder?
[0:39] <lurbs> You can set different rules for it, for auth, number of replicas, CRUSH map, etc.
[0:39] <terje> ah, I wish to use a separate crush map for it.
[0:39] <lurbs> You'll need someone else's help for that, I've not done it. :)
[0:40] <terje> ok, thanks.
[0:42] <lurbs> http://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/
[0:42] <lurbs> May help.
[0:42] <terje> nice
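(Editor's note: lurbs's point above — that a pool carries its own replica count and CRUSH rule — looks roughly like this in the 0.55-era CLI. A hedged sketch: the pool name, PG count, and ruleset id are placeholders, not from this log.)

```
ceph osd pool create fast 512            # new pool with its own PG count
ceph osd pool set fast size 3            # per-pool replica count
ceph osd pool set fast crush_ruleset 3   # point the pool at a custom CRUSH rule
```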
[1:00] * gregorg_taf (~Greg@ has joined #ceph
[1:00] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[1:02] * gohko_ (~gohko@natter.interq.or.jp) has joined #ceph
[1:07] * gohko (~gohko@natter.interq.or.jp) Quit (Ping timeout: 480 seconds)
[1:07] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[1:13] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:22] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) Quit (Ping timeout: 480 seconds)
[1:28] * BManojlovic (~steki@242-174-222-85.adsl.verat.net) Quit (Remote host closed the connection)
[1:29] * KindTwo (KindOne@h59.26.131.174.dynamic.ip.windstream.net) has joined #ceph
[1:33] * jlogan1 (~Thunderbi@2600:c00:3010:1:14a3:ca45:3136:669a) Quit (Ping timeout: 480 seconds)
[1:33] * KindOne (KindOne@h183.63.186.173.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[1:33] * KindTwo is now known as KindOne
[1:36] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:39] * gaveen (~gaveen@ Quit (Quit: leaving)
[1:44] <paravoid> anyone around?
[1:44] <paravoid> I have a reproducible problem where ceph osds get a sigabrt
[1:45] <paravoid> and got a backtrace
[1:45] <paravoid> it's an assert(peer_missing.count(fromosd));
[1:45] <paravoid> from src/osd/ReplicatedPG.cc:4890
[1:45] <paravoid> this happens with ceph 0.55, but v0.55..HEAD doesn't look to have any related commits
[1:46] <slang> paravoid: can you post the log using pastebin.com or pastee.org
[1:46] <paravoid> what log?
[1:46] <paravoid> the backtrace?
[1:48] <slang> paravoid: ah you used gdb to get the backtrace?
[1:48] <paravoid> yes
[1:48] <slang> paravoid: yeah the full backtrace would be helpful I think
[1:49] * gucki (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) has joined #ceph
[1:49] <slang> paravoid: and if its reproducible, can you enable the osd logging, restart that osd, and post the log generated?
[1:50] <paravoid> hrm, okay
[1:51] <paravoid> it happens on multiple osds
[1:51] <slang> debug osd = 20 and debug ms = 1
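(Editor's note: in ceph.conf terms, slang's suggested debug levels would go under the affected daemon's section, followed by a restart of that daemon. A sketch; the section name is illustrative.)

```
[osd]
    debug osd = 20
    debug ms = 1
```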
[1:51] <slang> paravoid: right away or does it take a while?
[1:51] <paravoid> when adding new OSDs ceph becomes quite unstable
[1:51] <paravoid> it's on the peering/replication path
[1:52] <paravoid> the cluster has an unusually high number of PGs
[1:52] <paravoid> 64k, for 8 OSDs :-)
[1:52] <paravoid> I know that's too much
[1:53] <paravoid> however, from what I understand there's no way to increase PGs later (at least for now, I've read something about pg splitting and/or pgp?)
[1:53] <paravoid> so, planning ahead for a bigger cluster we have to allocate enough PGs from the start
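(Editor's note: for reference, the usual sizing heuristic — not stated in this log — is on the order of 100 PGs per OSD, divided by the replica count, rounded up to a power of two. `suggested_pg_count` is a hypothetical helper illustrating the arithmetic.)

```python
def suggested_pg_count(num_osds, replicas=2, pgs_per_osd=100):
    """Round (pgs_per_osd * num_osds / replicas) up to the next power of two."""
    target = pgs_per_osd * num_osds // replicas
    power = 1
    while power < target:
        power *= 2
    return power

# For paravoid's 8 OSDs with 2 replicas this suggests 512 PGs,
# a long way short of the 64k the cluster was created with.
```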
[1:54] * Ryan_Lane1 (~Adium@ has joined #ceph
[1:56] <via> is greg farnum in here?
[1:56] * joao points to gregaf
[1:56] <via> gregaf: pm?
[2:00] * Ryan_Lane (~Adium@ Quit (Ping timeout: 480 seconds)
[2:03] * LeaChim (~LeaChim@b0fafb7d.bb.sky.com) Quit (Remote host closed the connection)
[2:04] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[2:04] * rread (~rread@c-98-234-218-55.hsd1.ca.comcast.net) Quit (Quit: rread)
[2:10] * jjgalvez1 (~jjgalvez@ Quit (Quit: Leaving.)
[2:31] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[2:40] <mikedawson> Lost an osd today. The drive was knocked offline due to a backplane issue (the data on the drive is still good). I removed the OSD. Before resyncing, I lost another drive with the same issue (they are in a shared chassis).
[2:41] <mikedawson> The second OSD process will not start. Now that I know the first drive is good, is there a way to re-add it to Ceph while preserving the data?
[2:42] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:45] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has left #ceph
[2:45] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[2:58] <dmick> mikedawson: are you saying the second one won't start because it's still missing its disk?
[2:59] <glowell> I've got irc.oftc.net:6667 configured. I'm not sure what actual host it's connecting to. I'm on a mac laptop and have never bothered to get tcpdump and the like installed.
[2:59] <glowell> Odd that #ceph seems to work fine.
[2:59] <dmick> in fact, this is #ceph :)
[3:01] <glowell> I meant to do that :-)
[3:01] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has left #ceph
[3:02] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[3:02] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[3:03] <mikedawson> dmick: both disks are good. I removed the first one to fail (because I thought it was bad at the time). The second segfaults when you start the OSD process. sjust looked into it and the PGs are a bit off
[3:04] <dmick> I see
[3:04] <mikedawson> I just re-added the first one with the normal process to add a new OSD minus the line "ceph-osd -i {osd-num} --mkfs --mkkey"
[3:04] <dmick> if you put it back in the chassis in the same place, and start the OSD with the same parameters, it's just as if the OSD process died and restarted, right?
[3:05] <mikedawson> HEALTH_WARN 50 pgs down; 50 pgs peering; 50 pgs stuck inactive; 50 pgs stuck unclean
[3:06] * maxiz (~pfliu@ has joined #ceph
[3:06] <dmick> what was the process you used, specifically, to add it back?
[3:07] <mikedawson> Adding an OSD (Manual) http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
[3:08] <dmick> you didn't blow away the filesystems, though?
[3:08] <mikedawson> filesystem intact
[3:08] <dmick> so, I mean, there's a lot of those steps you omitted, I hope
[3:09] <mikedawson> My process:
[3:09] <mikedawson> ceph osd create
[3:09] <slang> paravoid: I think that should be ok. Were you able to get the logs with debugging enabled?
[3:10] <paravoid> I'm afraid it'd have to wait until tomorrow probably, sorry
[3:10] <mikedawson> that returned 18 (the original number of this OSD I'm trying to bring back online with its original data). So far so good
[3:10] <mikedawson> Then I skipped "ceph-osd -i {osd-num} --mkfs --mkkey" thinking I've already got this setup on the disk
[3:10] <mikedawson> ceph auth add osd.18 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-18/keyring
[3:10] <mikedawson> worked as expected
[3:11] <mikedawson> ceph osd crush set 18 osd.18 1.0 root=default rack=unknownrack host=node3
[3:11] <mikedawson> worked as expected
[3:11] <paravoid> slang: keep in mind this is a simple setup, 4 + 4 osds in two boxes, no strange settings or anything
[3:11] <mikedawson> /etc/init.d/ceph start
[3:11] <mikedawson> worked as expected
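(Editor's note: pulled together, the sequence mikedawson describes reads as below. This is a sketch of his account, not a verified recipe; the id, weight, and hostname are from his cluster, and the data-bearing filesystem is assumed to already be mounted at /var/lib/ceph/osd/ceph-18.)

```
ceph osd create                      # hands back the freed id, 18 in this case
# deliberately skipped: ceph-osd -i 18 --mkfs --mkkey   (would re-initialize the disk)
ceph auth add osd.18 osd 'allow *' mon 'allow rwx' \
    -i /var/lib/ceph/osd/ceph-18/keyring
ceph osd crush set 18 osd.18 1.0 root=default rack=unknownrack host=node3
/etc/init.d/ceph start
```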
[3:11] <dmick> mikedawson: ok; so you had removed the osd completely from the crush map and everything, and now it's back. All right.
[3:13] <mikedawson> yep. But its still not back to HEALTH_OK
[3:13] <slang> paravoid: except the 64k pgs
[3:13] <paravoid> yes
[3:14] <paravoid> that's the unusual part
[3:14] <paravoid> and it wouldn't be as unusual if there was a way to scale pgs as the cluster grows
[3:15] <paravoid> not complaining, just saying that it's unusual for a reason
[3:15] <slang> sjust: do you know of any problems with 64k pgs?
[3:15] <slang> paravoid: yeah I think the work to be able to split pgs will resolve that
[3:17] <nhm> slang: 65536 PGs is the max. check out the ceph_pg struct.
[3:17] <slang> nhm: ah ok
[3:18] <paravoid> so where 2012-12-11 02:18:16.576998 mon.0 [INF] pgmap v11750: 66728 pgs: 9581 active, 16659 active+clean, 2 active+remapped+wait_backfill, 28382 active+recovery_wait, 12085 peering, 4 active+remapped, 3 active+recovery_wait+remapped, 7 remapped+peering, 5 active+recovering; 79015 MB data, 166 GB used, 18433 GB / 18600 GB avail; 100586/230461 degraded (43.646%); 11716/115185 unfound (10.171%)
[3:18] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) Quit (Quit: wer)
[3:18] <paravoid> that's the status of the cluster
[3:18] <paravoid> not too happy
[3:18] <slang> paravoid: ah that's more than allowed
[3:19] <paravoid> if 65536 is the max, how did 66728 work?
[3:19] <paravoid> well, fsvo work
[3:19] <mikedawson> 2012-12-10 21:18:24.148318 osd.1 [WRN] slow request 240.089922 seconds old, received at 2012-12-10 21:14:24.058354: osd_op(client.8958.0:36 rb.0.17b3.24ad520.00000000001e [sparse-read 0~4194304] 4.aa9fa8e8 RETRY) v4 currently reached pg
[3:19] <slang> paravoid: yeah we should bail out when you try to create with more than that
[3:19] <slang> paravoid: but it sounds like that's your problem
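(Editor's note: nhm's limit above can be sanity-checked up front. A minimal sketch, assuming — per the ceph_pg struct he points at — that the PG seed is a 16-bit field; `pg_count_ok` is a hypothetical helper.)

```python
MAX_PGS = 2 ** 16  # 65536: the PG seed is a 16-bit field in struct ceph_pg

def pg_count_ok(requested):
    """True if the requested PG count fits in the 16-bit field."""
    return requested <= MAX_PGS

# paravoid's pgmap reports 66728 PGs, which is over the limit.
```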
[3:20] * guigouz (~guigouz@ Quit (Ping timeout: 480 seconds)
[3:20] <paravoid> I'm not so sure, it looks like general cluster unhappiness
[3:20] <paravoid> but I can try with less than that
[3:25] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[3:25] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[3:32] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[3:36] <mikedawson> Removed osd.17 (the one that is segfaulting), and I'm back to HEALTH_OK.
[3:40] * Wolfgang (~Adium@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[3:42] * Cube (~Cube@ Quit (Quit: Leaving.)
[3:48] * Ryan_Lane1 (~Adium@ Quit (Quit: Leaving.)
[3:51] <mikedawson> Formatted and re-added osd.17, and I'm back to HEALTH_OK.
[3:52] <mikedawson> Thanks Inktank guys! You guys kick ass!
[3:52] <nhm> mikedawson: woot, glad to hear it!
[4:02] <dmick> good show mikedawson
[4:04] * Wolfgang (~Adium@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:07] <mikedawson> right now I have root -> rack -> host in my ceph osd tree. I need to add a level for chassis because these hosts are 4 nodes in a 2U shared chassis. Any tips?
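(Editor's note: one way to get that extra level is to declare a new bucket type in the decompiled CRUSH map and nest the hosts under it — the stock type list has no "chassis" entry by default in this era. A sketch; all names and ids below are placeholders.)

```
# types  (inserting a chassis level between host and rack)
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 root

# buckets  (then point the rack at the chassis instead of the hosts)
chassis chassis1 {
        id -10            # placeholder bucket id
        alg straw
        hash 0            # rjenkins1
        item node1 weight 1.000
        item node2 weight 1.000
}
```

Decompile with `crushtool -d`, edit, recompile with `crushtool -c`, and inject with `ceph osd setcrushmap -i`, per the crush-map docs linked later in this log.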
[4:08] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:17] * BillK (~billk@124-169-198-193.dyn.iinet.net.au) Quit (Remote host closed the connection)
[4:29] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[4:33] <mikedawson> I have one host with OSD processes crashing frequently. All 3 ceph-osd processes are using quite a bit of CPU according to top. Processes on the other nodes are not using any CPU to speak of. Cluster is HEALTH_OK and basically idle.
[4:33] <mikedawson> 0.55
[4:35] <mikedawson> top on the offending host http://pastebin.com/LVerNnvc
[4:36] <mikedawson> restarting the osds on this host does not change the high CPU usage
[4:52] * dmick (~dmick@2607:f298:a:607:e07b:b1cf:23d:d5a0) Quit (Quit: Leaving.)
[5:12] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[5:22] * trhoden (~trhoden@pool-108-28-184-124.washdc.fios.verizon.net) Quit (Quit: trhoden)
[5:29] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[5:30] * yasu` (~yasu`@dhcp-59-224.cse.ucsc.edu) Quit (Remote host closed the connection)
[5:51] * joshd1 (~jdurgin@2602:306:c5db:310:1c6c:c7b2:9650:bc87) has joined #ceph
[6:11] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[7:04] * jlogan (~Thunderbi@2600:c00:3010:1:14a3:ca45:3136:669a) has joined #ceph
[7:06] * tryggvil (~tryggvil@nova032-254.cust.nova.is) has joined #ceph
[7:14] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[7:25] * tryggvil (~tryggvil@nova032-254.cust.nova.is) Quit (Quit: tryggvil)
[7:28] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[8:03] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:05] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Remote host closed the connection)
[8:13] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:24] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[8:29] * KindOne (KindOne@h59.26.131.174.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[8:31] * KindOne (KindOne@h141.20.131.174.dynamic.ip.windstream.net) has joined #ceph
[8:37] * ebo^ (~ebo@ has joined #ceph
[8:56] * rread (~rread@c-98-234-218-55.hsd1.ca.comcast.net) has joined #ceph
[8:57] * rread (~rread@c-98-234-218-55.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[8:57] * rread (~rread@ has joined #ceph
[8:59] * pmjdebruijn (~pmjdebrui@overlord.pcode.nl) Quit (Remote host closed the connection)
[9:06] * rread (~rread@ Quit (Ping timeout: 480 seconds)
[9:06] * jlogan (~Thunderbi@2600:c00:3010:1:14a3:ca45:3136:669a) Quit (Ping timeout: 480 seconds)
[9:06] * low (~low@ has joined #ceph
[9:10] * KindTwo (KindOne@h179.55.186.173.dynamic.ip.windstream.net) has joined #ceph
[9:11] * ebo^ (~ebo@ Quit (Read error: Operation timed out)
[9:12] * loicd (~loic@LPuteaux-156-16-100-112.w80-12.abo.wanadoo.fr) has joined #ceph
[9:15] * KindOne (KindOne@h141.20.131.174.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[9:15] * KindTwo is now known as KindOne
[9:39] * BManojlovic (~steki@ has joined #ceph
[9:41] * IceGuest_75 (~IceChat7@buerogw01.ispgateway.de) has joined #ceph
[9:43] <IceGuest_75> hi there, need a little bit of help using ceph with mount.ceph. i can read/write files but i can't use "df", it hangs at "statfs("/tmp/blub",...". i'm using ceph 0.55 with 3 OSDs, 2 MDS and 3 MONs. any hints ?
[9:43] * IceGuest_75 is now known as norbi
[9:44] <norbi> ceph status says "health HEALTH_OK" :)
[9:46] * Leseb (~Leseb@ has joined #ceph
[9:49] <joshd1> norbi: which kernel version are you using?
[9:49] <norbi> 3.6.9
[9:49] <joshd1> anything in syslog/dmesg?
[9:50] <joshd1> otherwise I'd check that all the ceph daemons are still running
[9:51] <joshd1> and see if it works with ceph-fuse
[9:52] <norbi> hm no messages in syslog/dmesg
[9:54] * ebo^ (~ebo@icg1104.icg.kfa-juelich.de) has joined #ceph
[9:54] * agh (~2ee79308@2600:3c00::2:2424) has joined #ceph
[9:55] <norbi> ceph daemons are running fine, no problems with reading and writing. i will try with ceph-fuse
[10:00] * fc (~fc@home.ploup.net) has joined #ceph
[10:00] * KindTwo (KindOne@h161.33.186.173.dynamic.ip.windstream.net) has joined #ceph
[10:01] <norbi> ok, after unmounting and mounting, "df" is working :) with ceph-fuse it is working too. problem "solved" :)
[10:03] * KindOne (KindOne@h179.55.186.173.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[10:03] * KindTwo is now known as KindOne
[10:04] <joshd1> cool. if it happens again, we'd probably need logs with debugging from the client and/or mds
[10:26] * LeaChim (~LeaChim@b0fafb7d.bb.sky.com) has joined #ceph
[10:28] <norbi> ok i will, think it was a firewall problem. by the way, a help command for "ceph" would be nice, example: "ceph osd help" or "ceph help" or "ceph mds help"
[10:31] <joshd1> I agree: http://tracker.newdream.net/issues/2894
[10:32] <norbi> hehe ok ;)
[10:35] * yoshi (~yoshi@ has joined #ceph
[10:40] * maxiz (~pfliu@ Quit (Quit: Ex-Chat)
[10:48] * nosebleedkt (~kostas@ has joined #ceph
[10:48] <nosebleedkt> hi everybody
[10:50] * ebo^ (~ebo@icg1104.icg.kfa-juelich.de) Quit (Quit: Verlassend)
[10:51] <nosebleedkt> I have a cluster. It's made of 2 PCs. 1st PC provides 3 OSDs. 2nd PC provides 2 OSDs. I ran a hypothetical disaster scenario and shut down the 2nd PC directly from the power button. But I see ceph cannot get to a healthy state after all. Still says: health HEALTH_WARN 223 pgs stale; 223 pgs stuck stale
[10:53] <joshd1> does your crush map separate replicas onto different hosts? ('ceph osd tree' will show you)
[10:54] * loicd (~loic@LPuteaux-156-16-100-112.w80-12.abo.wanadoo.fr) Quit (Quit: Leaving.)
[10:55] <jtang> i thought by default it put replicas on different osd's instead of hosts
[10:55] <nosebleedkt> 0 1 osd.0 up 1
[10:55] <nosebleedkt> 1 1 osd.1 up 1
[10:55] <nosebleedkt> 2 1 osd.2 up 1
[10:55] <nosebleedkt> 3 2 osd.3 down 0
[10:55] <nosebleedkt> 4 2 osd.4 down 0
[10:56] <nosebleedkt> that's the result of 'ceph osd tree'
[10:57] <norbi> osd.1 osd.2 and osd.3 are different osds :)
[10:58] <nosebleedkt> yes, they are. so ?
[10:59] <nosebleedkt> osd.0,1,2 are on 1st PC.
[10:59] <nosebleedkt> osd.3,4 are on 2nd PC
[10:59] <norbi> if you have only 2 replicas and both are stored on osd3 and osd4 ?
[10:59] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[11:00] <nosebleedkt> how do i see that?
[11:00] <nosebleedkt> and how should i fix that
[11:00] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[11:01] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[11:01] <norbi> i think you must set the osd in the crush map to the correct host, so ceph will not store the data on the same host. but i'm not really sure
[11:01] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[11:04] <joshd1> I think mkcephfs tries to set it up with replicas on different hosts by default, but if you added osds since then and didn't specify a host, crush wouldn't know the right host for the new one
[11:05] * Morg (d4438402@ircip4.mibbit.com) has joined #ceph
[11:07] <nosebleedkt> joshd, yes I added osd.3,4 later on.
[11:09] <nosebleedkt> what i did lately was to comment out the osd sections on 2nd ceph.conf
[11:09] <nosebleedkt> maybe thats the wrong thing
[11:09] <joshd1> the osd sections don't matter except on the node those osds run on
[11:10] <nosebleedkt> hmm
[11:10] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[11:11] <joshd1> but it sounds like you may have had pgs replicated only on 3 and 4, since crush didn't know which hosts they were on, which is the reason for the stale pgs
[11:11] <nosebleedkt> so, what should i do then joshd ?
[11:12] <nosebleedkt> i need to learn this
[11:12] <joshd1> bring your second node back up, and edit the crush map so osd 3 and 4 are inside a node for 'pc2' or whatever you want to call it
[11:13] <nosebleedkt> ok
[11:13] <nosebleedkt> 2nd node is up
[11:13] <nosebleedkt> ceph health is ok now
[11:13] <nosebleedkt> but how do i edit the crush map ?
[11:14] <joshd1> I'm not sure if the 'ceph crush move' command will do everything you need here, but you can edit the entire thing as a file: http://ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
[11:14] * loicd (~loic@ has joined #ceph
[11:15] <nosebleedkt> wow! wait.
[11:15] <nosebleedkt> do i have to do this everytime i add an OSD ?
[11:16] <joshd1> no, when you add it you just need to specify host=foo when you do 'ceph osd crush set ...'
[11:18] <nosebleedkt> hmm
[11:18] <nosebleedkt> i do ceph osd crush set ID weight root=default
[11:19] <norbi> weight = it's a number
[11:19] <norbi> ceph osd crush set 1 osd.1 1 pool=default rack=unknownrack host=bla
[11:19] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[11:19] <nosebleedkt> norbi, pool=default didnt work for me on ceph-0.55
[11:21] <norbi> just did it minutes ago; with 0.55 there are no problems with pool=default, ceph is now remapping
[11:22] <nosebleedkt> hmm let my try
[11:23] * joshd1 (~jdurgin@2602:306:c5db:310:1c6c:c7b2:9650:bc87) Quit (Quit: Leaving.)
[11:23] <nosebleedkt> norbi, seems to work
[11:23] <nosebleedkt> now ceph starts working like crazy
[11:23] <norbi> :)
[11:25] <nosebleedkt> ceph health ok now
[11:25] <nosebleedkt> lets shutdown the 2nd node
[11:27] <nosebleedkt> mon.0 [INF] osdmap e284: 5 osds: 3 up, 5 in
[11:27] <nosebleedkt> 10/48 degraded (20.833%)
[11:27] <nosebleedkt> and stopped there
[11:32] <nosebleedkt> 5/48 degraded (10.417%)
[11:35] <nosebleedkt> norbi, still nothing :(
[11:35] <norbi> he is remapping ?
[11:35] <norbi> ceph -w will show u
[11:36] <nosebleedkt> its there
[11:36] <nosebleedkt> on
[11:36] <nosebleedkt> 5/48 degraded (10.417%)
[11:36] <nosebleedkt> and not working
[11:36] <nosebleedkt> anymore
[11:36] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[11:39] <norbi> and "ceph status" shows stuck pgs ?
[11:39] * gohko_ (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[11:40] <nosebleedkt> root@masterceph:~# ceph status
[11:40] <nosebleedkt> health HEALTH_WARN 126 pgs degraded; 295 pgs stale; 295 pgs stuck stale; 126 pgs stuck unclean; recovery 5/48 degraded (10.417%)
[11:40] <nosebleedkt> i opened 2nd node again
[11:41] <nosebleedkt> got health ok now
[11:41] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[11:41] <norbi> http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#stuck-placement-groups
[11:41] <norbi> :)
[11:43] <norbi> with "ceph health detail" u can see the last state
[11:43] <norbi> the last column shows u where the PGs are stored
[11:44] <norbi> if u see there [3,4] the PG must be moved
[11:46] <nosebleedkt> yes
[11:46] <nosebleedkt> i see [3,4]
[11:46] <nosebleedkt> eg.
[11:46] <nosebleedkt> pg 0.24 is stale+active+clean, acting [3,4]
[11:46] <nosebleedkt> pg 2.26 is stale+active+clean, acting [4,3]
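(Editor's note: the check norbi describes — scanning `ceph health detail` for PGs whose acting set lives entirely on the lost node's OSDs — can be scripted. A sketch; `acting_osds` and `pgs_confined_to` are hypothetical helpers, not ceph tooling.)

```python
import re

def acting_osds(line):
    """Extract the acting OSD set from a 'ceph health detail' PG line."""
    m = re.search(r"acting \[([\d,]+)\]", line)
    return [int(x) for x in m.group(1).split(",")] if m else []

def pgs_confined_to(lines, osd_group):
    """Return PG ids whose acting set falls entirely inside osd_group."""
    bad = []
    for line in lines:
        m = re.match(r"pg (\S+) is", line)
        acting = acting_osds(line)
        if m and acting and set(acting) <= set(osd_group):
            bad.append(m.group(1))
    return bad
```

Fed nosebleedkt's two lines above with `osd_group={3, 4}`, this would flag pgs 0.24 and 2.26 as stored only on the downed host.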
[11:49] <nosebleedkt> i will manually remove the OSDs from the cluster, then add them again from the beginning.
[11:49] <nosebleedkt> with the new crush command. maybe something is failing now because of previous mixes of commands :d
[11:52] <norbi> the wrong way :) think this must be manageable in production systems
[11:54] <nosebleedkt> yes, but i added the OSDs in rather questionable way
[11:54] <norbi> if your ceph is now "health ok"
[11:55] <norbi> it must be possible to move the PG to another OSD
[11:56] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[11:56] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit ()
[12:00] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has left #ceph
[12:09] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[12:10] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[12:14] * ninkotech (~duplo@ has joined #ceph
[12:23] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[12:23] * tryggvil (~tryggvil@ has joined #ceph
[12:27] * tryggvil (~tryggvil@ Quit ()
[12:29] <nosebleedkt> what does "osd.3 [INF] 1.59 scrub ok" mean
[12:29] <nosebleedkt> ?
[12:31] * yoshi (~yoshi@ has joined #ceph
[12:34] * tryggvil (~tryggvil@ has joined #ceph
[12:38] * tryggvil_ (~tryggvil@ has joined #ceph
[12:38] * tryggvil (~tryggvil@ Quit (Read error: Connection reset by peer)
[12:38] * tryggvil_ is now known as tryggvil
[12:57] <nosebleedkt> norbi, no life here !
[12:57] <nosebleedkt> norbi, i made again from the start a whole new cluster
[12:58] <nosebleedkt> norbi, i switched off the 2nd node
[12:58] <norbi> lunch time here :)
[12:58] <nosebleedkt> ceph cannot get healthy :D
[12:58] <nosebleedkt> hehe, have a nice time
[13:01] <norbi> i have 9 PGs in state "stuck unclean" in my test setup, must discover why they don't get remapped
[13:05] <nosebleedkt> 0.55 ?
[13:05] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[13:07] <norbi> yes
[13:07] <nosebleedkt> also i use this command
[13:07] <nosebleedkt> ceph osd crush set ID weight pool=default rack=unknownrack host=myhost
[13:08] <nosebleedkt> but you said set ID osd.ID weight ?
[13:09] <norbi> yes
[13:09] <norbi> u can see it in the documentation
[13:09] <Psi-jack> Hmmm, setting osd journal = blah, you can set that directly to a raw block device, no?
[13:12] <Psi-jack> And 0.55 eh? Is that considered stable like 0.48 was?
[13:12] <Psi-jack> Stable as stable is, that is. ;)
[13:21] * deepsa (~deepsa@ Quit (Read error: Connection reset by peer)
[13:22] * deepsa (~deepsa@ has joined #ceph
[13:24] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[13:25] <Psi-jack> Ahh 0.56 is the next targeted "stable", named Bobtail? So for now, 0.48.2 argonaut is the most current stable? If I'm gathering this correctly.
[13:27] <joao> Psi-jack, argonaut is the latest LTS, but as we are closing in on bobtail there have been many fixes released
[13:27] * tontsa (~tontsa@solu.fi) Quit (Read error: Connection reset by peer)
[13:27] <joao> .56 should be out in a couple of weeks though
[13:29] * guigouz (~guigouz@201-87-100-166.static-corp.ajato.com.br) has joined #ceph
[13:29] <nosebleedkt> root@osdprovider:~# ceph osd crush set 3 osd.3 1.0 pool=default rack=unknownrack host=osdprovider
[13:29] <nosebleedkt> osd.3 does not exist. create it before updating the crush map
[13:29] <norbi> ceph osd create
[13:29] <Psi-jack> joao: Upgrading from 0.48.2 to 0.56 when it's ready will be relatively easy, or is it truly recommended, if building a new cluster, to /start/ with 0.55?
[13:30] <joao> oh, you should be able to upgrade to .56 from argonaut
[13:30] <Psi-jack> joao: And regarding the starting point? ;)
[13:31] <joao> well, if you're starting off fresh, maybe giving 0.55 a try would be best
[13:31] <joao> or even the latest master
[13:31] <Psi-jack> Hmmm. Heh.
[13:32] <norbi> i have upgraded this week from 0.48 to 0.55, was really easy
[13:32] <joao> 0.55 introduced new stuff, that introduced new bugs, that have been fixed since the release
[13:32] <Psi-jack> Heh. Yeah. Currently in Arch's AUR, 0.48, and ceph-git are the two currently available (yeah, not even 0.48.2, already flagged it for that.)
[13:32] <Psi-jack> Ahhhhh.
[13:32] <joao> we're ironing things out for 0.56
[13:32] * Psi-jack nods.
[13:33] <joao> besides, we'd love to have as much feedback on 0.55 as possible, as we are closing in on bobtail
[13:33] <joao> :)
[13:33] <Psi-jack> joao: Heh, yeah. That I can imagine... But I also want to be running this in a production environment. :)
[13:34] <Psi-jack> My hardware to support it isn't top-of-the-line, but definitely going to provide a crapton better than what I had setup for my VM cluster. :)
[13:35] <joao> oh, in that case, argonaut is still the LTS and the upgrade should be fairly straightforward (unless you created a cluster without cephx, in which case you'll have to either enable it before upgrading, or disable it explicitly)
[13:35] <Kioob`Taff> I'm not sure it can be easily fixed, but error messages in 0.55 are not really helpful
[13:35] <Psi-jack> I see. cephx, is the security layer?
[13:35] <joao> 0.55 however does have a bunch of fixes
[13:35] <Psi-jack> Kioob`Taff: There is an error displaying the error! :D
[13:36] <joao> so, if you think you can manage 0.55 for the next 2-3 weeks (it's pretty stable afaik), then I'd bet on 0.55
[13:36] <joao> Kioob`Taff, what kind of messages?
[13:36] <joao> Psi-jack, yeah, cephx is the security layer
[13:37] <joao> s/layer/mechanism
[13:37] <Psi-jack> joao: Hmmm.. I'll take some time to consider it. I'm currently building 0.48.2 at the moment. I still have to rebuild my final storage server and get the final 3 HDD's needed, but hopefully this will all come together by this weekend.
[13:38] <Kioob`Taff> with cephx for example, if you have not upgraded your configuration from 0.48 (with cephx enabled), it doesn't work, but the error message isn't useful... I don't remember... I should try to reinstall a cluster
[13:39] <Psi-jack> Going to move a couple VM's disks over to the backup server, which is a Netgear ReadyNAS, while moving the remainder to the one HDD not going into the ceph cluster. Just so I can start converting all the qcow2 disks to rbd's ;)
[13:39] <joao> Psi-jack, I wish I could have given you a straight answer, but at this point in time, so close to the release, I feel the need to argue both 0.48's and 0.55's case :)
[13:40] <Morg> hello
[13:40] <Morg> got one question
[13:40] <joao> Kioob`Taff, I believe those messages you saw were basically auth debug messages stating something like 'unknown provider' or something?
[13:41] <Morg> anyone knows what is minimum block size for ceph?
[13:41] <Kioob`Taff> If I remember well, it was "no such file or directory"
[13:41] <joao> oh
[13:41] <joao> when looking for the keyring?
[13:41] <Kioob`Taff> yes
[13:41] <Kioob`Taff> and the searched filepath is not displayed
[13:42] <joao> yeah, I think that a more detailed error message can be arranged
[13:42] <joao> will run it through the guys later on today
[13:42] <Psi-jack> joao: Oh no, that's good enough information no doubt. LOL
[13:42] <Kioob`Taff> I had to find the low level command behind «/etc/init.d/ceph start osd», to restart it with strace to find the «good» filepath
[13:43] <Kioob`Taff> I would love a «can't find keyring file in (A, B, C, D)»
[13:43] <joao> let me create a feature request for making error messages output paths whenever possible
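(Editor's note: the message Kioob`Taff asks for is easy to sketch. A hypothetical version of the keyring lookup that reports every path it tried; `find_keyring` is illustrative, not ceph's actual implementation.)

```python
import os

def find_keyring(paths):
    """Return the first existing keyring path, or raise an error
    naming every location that was searched."""
    for path in paths:
        if os.path.exists(path):
            return path
    raise FileNotFoundError(
        "can't find keyring file in (%s)" % ", ".join(paths))
```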
[13:44] <Psi-jack> joao: I'm a little reluctant to go with 0.55 if there's bugs, and I'm really reluctant to start off with a master checkout directly, buuut, if it's truly suggested, I could consider it. ;)
[13:44] <joao> Kioob`Taff, yeah, that would certainly be helpful regardless of the whole cephx thing
[13:44] <Kioob`Taff> also, maybe it's normal, but on a client, if you try «rbd ls» and that client doesn't have the privilege to access «mon», there is a timeout. No error message, just a timeout
[13:45] <joao> Kioob`Taff, there's a bunch of things regarding error message handling that probably should be rewritten on the monitor
[13:46] <joao> and you're probably being bit by one of them
[13:46] <Psi-jack> joao: In fact, I'm looking through the ceph-git AUR PKGBUILD now.
[13:46] <joao> I will check out on the monitor if there's any messages associated with lack of permissions though, just to make sure
[13:46] <Kioob`Taff> thanks joao :)
[13:47] <Kioob`Taff> for now, error handling is my main problem in ceph. I lose a lot of time guessing what's happening
[13:48] <Kioob`Taff> (and the fact that my hardware sucks :D)
[13:48] <nosebleedkt> so afterall i have the same issue
[13:48] <joao> Kioob`Taff, keep the suggestions comming
[13:48] <joao> *coming
[13:48] <Kioob`Taff> ok :)
[13:48] <nosebleedkt> I have a cluster. It's made of 2 PCs. 1st PC provides 3 OSDs. 2nd PC provides 2 OSDs. I ran a hypothetical disaster scenario and shut down the 2nd PC directly from the power button. But I see ceph cannot get to a healthy state after all. Still says: health HEALTH_WARN 223 pgs stale; 223 pgs stuck stale
[13:48] <joao> we're glad to know what isn't working so well on the usability side of things
[13:49] <nosebleedkt> joao, any help man ?
[13:49] <joao> just a sec
[13:49] <joao> dealing with lunch plans, sorry
[13:51] <Morg> nosebleedkt: what ver do you use?
[13:51] <Psi-jack> Heh
[13:51] <nosebleedkt> Morg, 0.55
[13:51] <Psi-jack> Looks like the PKGBUILD just gets the latest from git, and updates it.
[13:52] <Morg> im using same ver, with 5 nodes, 2 osd's per node, did the same disaster scenario
[13:52] <Morg> but when i try to rm osd from failed machine
[13:53] <Kioob`Taff> joao : a less important point. In internal bench, it can be handy to have a journal only bench. In my case the problem come from journal only (because of slow SSD)
[13:53] <Morg> i got only constant 2012-12-11 13:52:32.348895 7f18796d2700 0 -- >> pipe(0x7f186c032900 sd=4 :0 pgs=0 cs=0 l=1).fault msg
[13:53] <nosebleedkt> Morg, i use those commands to add an OSD later
[13:53] <nosebleedkt> (OSDHOST) Format disk with XFS: mkfs.xfs /dev/disk
[13:53] <nosebleedkt> (OSDHOST) Create the ceph directory: mkdir /var/lib/ceph/osd/ceph-N
[13:53] <nosebleedkt> (OSDHOST) Mount the disk: mount /dev/disk /var/lib/ceph/osd/ceph-N
[13:53] <nosebleedkt> (OSDHOST) & (MASTERHOST) Add OSD entry in ceph.conf
[13:53] <nosebleedkt> (OSDHOST) Init the directory: ceph-osd -i N --mkfs --mkkey -> Returns FSID
[13:53] <nosebleedkt> (OSDHOST) Create the OSD: ceph osd create FSID
[13:53] <nosebleedkt> (OSDHOST) Create a keyring: ceph auth add osd.ID osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-ID/keyring
[13:53] <nosebleedkt> (OSDHOST) Add OSD into CRUSH: ceph osd crush set ID weight pool=default rack=unknownrack host=myhost
[13:53] <nosebleedkt> (OSDHOST) Start the OSD daemon: /etc/init.d/ceph start osd.N
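[Editor's note: the manual OSD-add sequence nosebleedkt pasted can be consolidated into the sketch below. N, FSID, /dev/disk and myhost are placeholders, and the CRUSH weight of 1.0 is an example value, not something from the log.]

```shell
# Manual OSD add (ceph 0.5x era); N, FSID and /dev/disk are placeholders.
mkfs.xfs /dev/disk
mkdir -p /var/lib/ceph/osd/ceph-N
mount /dev/disk /var/lib/ceph/osd/ceph-N
# ...add the [osd.N] entry to ceph.conf on the OSD host and the master host...
ceph-osd -i N --mkfs --mkkey          # prints the new OSD's FSID
ceph osd create FSID
ceph auth add osd.N osd 'allow *' mon 'allow rwx' \
    -i /var/lib/ceph/osd/ceph-N/keyring
ceph osd crush set N 1.0 pool=default rack=unknownrack host=myhost
/etc/init.d/ceph start osd.N
```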
[13:56] <nosebleedkt> what am I doing wrong?
[13:57] <nosebleedkt> Morg, YES! I get the same msg.
[13:57] <nosebleedkt> that pipe .fault thing
[14:01] <norbi> how can i force remapping of stuck PGs? i have edited the crushmap and now some PGs are in state active+remapped, but ceph does nothing, the PGs are not unfound
[14:02] <joao> Morg, that's just verbose output with a misleading 'fault' message
[14:02] <joao> that's not an error
[14:02] <Morg> oh
[14:03] <Morg> but still, when i try to remove failed machine from cluster it does nothing
[14:03] <Morg> i mean, i only get those msg's and cant do anything really
[14:04] <joao> where is that happening? the monitors?
[14:04] <Morg> i did same scenario as nosebleedkt
[14:04] <Morg> pull power cord from one of the machines
[14:04] <joao> I mean, where did you see those messages?
[14:04] <nosebleedkt> I also have a big bug when trying to remove RBD images. eg: rbd rm foorbar. I get stuck and get those faulty messages Morg gets.
[14:04] <Morg> from monitor
[14:05] <joao> what does 'ceph -s' report?
[14:05] <Morg> sec.
[14:05] <nosebleedkt> goes up to 98% then it gets stuck
[14:05] <joao> nosebleedkt, you seem to be making everything right
[14:06] <joao> s/making/doing
[14:06] <Morg> joao: nothing, it's stuck
[14:06] <Morg> got only this msg spam
[14:07] <joao> so, 'ceph -s' hangs?
[14:07] <joao> okay
[14:07] <Morg> 'ceph health' 'ceph -w' also
[14:07] <joao> Morg, could you please run 'ceph --admin-daemon /path/to/mon.a.asok mon_status'?
[14:08] <joao> that should be run on the same machine as mon.a
[14:08] <joao> and the path is... (looking for it)
[14:09] * tryggvil (~tryggvil@ Quit (Quit: tryggvil)
[14:09] <joao> should be on /var/run/ceph/$cluster-mon.a.asok
[14:09] <nhm> good morning #ceph
[14:09] <joao> good morning nhm
[14:10] <joao> well, nhm arriving made me realize how hungry I am
[14:10] <joao> must be lunch time already
[14:10] <joao> ;)
[14:10] <norbi> how can i fix my stuck "active+remapped" PGs @joao ? :)
[14:11] <norbi> lunch time is over, time for hard work :D
[14:11] <nosebleedkt> lol
[14:11] <Morg> still nothing
[14:11] <joao> it's still 1pm here
[14:11] <nosebleedkt> joao, must respond to many problems ?
[14:11] <joao> that counts as lunch hour to me :p
[14:11] <nosebleedkt> he needs a lot of hands to type :d
[14:11] <nosebleedkt> :D
[14:12] * tryggvil (~tryggvil@ has joined #ceph
[14:12] <joao> norbi, I don't really know how to fix that. Typically you just have to wait for the osds to figure things out between themselves
[14:13] <Psi-jack> Heh. well, hmmm, blasted!
[14:13] <norbi> and i cant force ?, not so good
[14:13] <Psi-jack> ceph-git failed with an autoreconf error. LOL
[14:13] <Psi-jack> 'configure.ac' or 'configure.in' is required
[14:13] <joao> Morg, if you run 'ceph --admin-daemon /var/run/ceph/<don't-know-what>mon.a.asok mon_status' on the same node where mon.a is running, you should have a response
[14:13] <joao> assuming mon.a is up
[14:14] <Morg> it's up
[14:14] <Morg> and i got no response
[14:14] <joao> Psi-jack, have you initiated the submodules?
[14:14] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:14] <joao> git submodule init
[14:14] <joao> git submodule update
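[Editor's note: the submodule steps joao lists fit into a build sequence roughly like the sketch below, assuming a fresh checkout; the clone URL reflects the ceph GitHub repo of that era.]

```shell
git clone git://github.com/ceph/ceph.git
cd ceph
git submodule init      # registers src/leveldb and the other submodules
git submodule update    # fetches them; autoreconf needs src/leveldb/configure.ac
./autogen.sh
./configure
make
```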
[14:14] <joao> Morg, do you have the .asok present?
[14:14] <Psi-jack> joao: submodules? Hmmm... Maybe that's what is lacking in this PKGBUILD. I didn't see anything about submodules.
[14:14] <joao> norbi, I don't know if you can force it
[14:15] <Morg> http://pastebin.com/tME7gTAR
[14:15] <Morg> yup
[14:15] <Morg> got .asok for mon,mds, and 2 osd's
[14:16] <Morg> that im running on this node
[14:16] <Psi-jack> Trying now, after modifications. :)
[14:16] <joao> okay, that's fairly weird
[14:16] <joao> never seen an admin socket fail before
[14:16] <Psi-jack> Huh... Still doing it. :/
[14:16] <joao> Morg, can you make the logs available somewhere?
[14:16] <Morg> im using 0.55 on ubuntu 12,04
[14:17] <Morg> 5 nodes, 2 osd per node, 3 mons, 1 mds per node
[14:17] <joao> Psi-jack, did you run ./autogen.sh after updating the submodules?
[14:17] <Morg> i pulled power cord from node3, with mon3
[14:19] <joao> Morg, can you try the remaining asoks for the remaining monitors?
[14:19] <Psi-jack> joao: Yes. It's right when it tries to configure that it seems to fail.
[14:19] <Morg> sure, one sec
[14:19] <joao> if nothing works, it would be appreciated if you'd made the logs for the mons available somewhere
[14:20] <joao> I gotta run
[14:20] <joao> lunch
[14:20] <joao> bbiab
[14:30] * EmilienM (~my1@ks3274192.kimsufi.com) has joined #ceph
[14:30] * EmilienM (~my1@ks3274192.kimsufi.com) has left #ceph
[14:31] <norbi> where is a usable documentation about the command "ceph" :( what can i do with ceph pg ...
[14:37] <Psi-jack> joao: When you get back. The actual problem seems to be that autogen.sh itself has a return code of 1, not 0. http://pastebin.ca/2291420
[14:47] * Wolfgang (~Adium@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:50] * Wolfgang (~Adium@cpe-98-14-23-162.nyc.res.rr.com) Quit ()
[14:51] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:57] * yoshi (~yoshi@ has joined #ceph
[14:57] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) has joined #ceph
[15:05] * norbi (~IceChat7@buerogw01.ispgateway.de) Quit (Quit: Depression is merely anger without enthusiasm)
[15:07] * guigouz1 (~guigouz@201-87-100-166.static-corp.ajato.com.br) has joined #ceph
[15:09] <nosebleedkt> root@masterceph:~# ceph -w
[15:09] <nosebleedkt> health HEALTH_WARN 16 pgs degraded; 16 pgs stuck unclean
[15:09] <nosebleedkt> what happened after i killed one of my nodes which contain 2 OSDs
[15:09] * guigouz (~guigouz@201-87-100-166.static-corp.ajato.com.br) Quit (Ping timeout: 480 seconds)
[15:09] * guigouz1 is now known as guigouz
[15:09] <nosebleedkt> ceph doesn't want to respond health OK
[15:11] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:12] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[15:17] * guigouz (~guigouz@201-87-100-166.static-corp.ajato.com.br) Quit (Quit: Computer has gone to sleep.)
[15:18] * tryggvil (~tryggvil@ Quit (Quit: tryggvil)
[15:25] * Morg (d4438402@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[15:25] * tryggvil (~tryggvil@ has joined #ceph
[15:32] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[15:37] * gaveen (~gaveen@ has joined #ceph
[15:46] <Psi-jack> joao: Yep., The problem seems to be in src/leveldb, as per the last line of the autogen.sh: ( cd src/leveldb && mkdir -p m4 && autoreconf -fvi; ) it fails with: autoreconf: 'configure.ac' or 'configure.in' is required
[15:53] <joao> Psi-jack, I always have that happen to me whenever I forget to init and update the submodules; but after doing that, and after running autogen all over again, things Just Work
[15:54] <Psi-jack> joao: I think I found the issue just recently. ;)
[15:54] <joao> what you're going through is just weird
[15:54] <Psi-jack> joao: The way this PKGBUILD for ceph-git works is it checks out src/ceph, then locally clones that to src/ceph-build, and that's the directory that needs the submodule's put in.
[15:54] <Psi-jack> Bingo!
[15:55] <joao> oh
[15:55] <joao> nosebleedkt, what does your 'ceph -s' say?
[15:56] <nosebleedkt> root@masterceph:~# ceph -s
[15:56] <nosebleedkt> health HEALTH_WARN 16 pgs degraded; 16 pgs stuck unclean
[15:56] <nosebleedkt> monmap e1: 1 mons at {a=}, election epoch 2, quorum 0 a
[15:56] <nosebleedkt> osdmap e131: 4 osds: 3 up, 3 in
[15:56] <nosebleedkt> pgmap v764: 768 pgs: 752 active+clean, 16 active+degraded; 121 MB data, 599 MB used, 2442 MB / 3042 MB avail
[15:56] <nosebleedkt> mdsmap e4: 1/1/1 up {0=a=up:active}
[15:57] <Psi-jack> joao: Heh, yeah. Weird, definitely, but apparently cloning from ceph into ceph-build doesn't do the submodules either, and breaks the ability for init/update to work.
[15:58] <joao> nosebleedkt, I would bet that the unclean and stuck pgs you have are due to that 4th osd to be down
[15:58] <Psi-jack> So, I just opted not to init/update in the main clone, and only do so in the build clone. ;)
[15:58] <joao> but we can make sure of it by running 'ceph pg dump'
[15:58] <nosebleedkt> yes joao
[15:58] <joao> pastebin it somewhere and we'll take a look
[15:58] <joao> :)
[15:59] <joao> Psi-jack, afaik, the clone will only clone the ceph repo anyway
[15:59] <nosebleedkt> joao, http://pastebin.com/GbpPMJU4
[15:59] <joao> you'd have to init and update the submodules on the locally cloned 'ceph-build'
[15:59] <nosebleedkt> joao, isnt ceph supposed to automatically move the PGs to valid OSDs?
[15:59] * tryggvil (~tryggvil@ Quit (Quit: tryggvil)
[16:00] <nosebleedkt> why they still remain in OSD.4
[16:00] <Psi-jack> joao: Yes.. I had tried that, but it wasn't working for some odd reason. Not sure why, but either way, so far, the new method is working. Currently building ceph from master.
[16:00] <joao> if they are replicated on one of the osds still in the cluster, then yeah
[16:00] <Psi-jack> A LOT less gcc warnings, too.
[16:00] <joao> Psi-jack, cool, glad you got it working
[16:00] <Psi-jack> So far, haven't seen a single warning. ;)
[16:02] <joao> nosebleedkt, is that dump complete?
[16:02] <nosebleedkt> nop
[16:02] <joao> did you remove any pgs from it?
[16:02] <joao> it only reports the active+clean pgs
[16:02] <nosebleedkt> its what putty can hold in screen
[16:02] <nosebleedkt> oh
[16:02] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[16:03] <joao> nosebleedkt, try 'ceph pg dump >& somefile'
[16:03] <joao> and either drop it somewhere or just dcc it to me
[16:03] <Psi-jack> Hmmm. cephx.. Haven't even played at all with that yet. heh
[16:04] <Psi-jack> Guessing, obviously, it's preferred to use, since it's by default enabled in newer versions. heh
[16:04] <joao> yeah
[16:04] <joao> it's not big of a deal really
[16:04] <joao> oops
[16:04] <nosebleedkt> i dcc it
[16:04] <joao> nosebleedkt, send that again please
[16:04] <Psi-jack> Oh, question I had earlier, in the ceph.conf, the osd journal = blah, blah can be /dev/sdX, directly pointed to a raw block device?
[16:05] <nosebleedkt> joao, http://pastebin.com/A5mvVBMA
[16:06] <joao> Psi-jack, folks usually run into issues with cephx mainly when they didn't use it at all and, all of a sudden, either need to disable it explicitly on the conf or set it up before upgrading
[16:06] <joao> Psi-jack, yeah
[16:07] <joao> the journal can live on a raw block device or a file
[16:07] <joao> the latter can impact your performance though ;)
[16:07] <Psi-jack> joao: Yeah, since I'm going to be likely starting off with the git master version, as you recommended, I will be setting up cephx from the start, it seems. ;)
[16:07] <Psi-jack> joao: Heh yeah. I'm going to journal directly to an SSD GPT partition. :)
[16:08] <Psi-jack> osd journal size, I presume, is in MB?
[16:09] <Psi-jack> Funny, I'll be journaling CephFS to SSD, and XFS's logdev to SSD. :)
[16:09] <joao> think so, but I saw someone recommending not to set the journal size a while back; if you are going to use the whole device that is
[16:09] <Psi-jack> Hmmm
[16:09] <joao> but I'm probably not the best person to tell you what to do there
[16:09] <Psi-jack> I suppose, if it can determine size automagically.
[16:10] <joao> don't know :x
[16:10] * Psi-jack chuckles.
[16:11] <Psi-jack> Ahhh, here we go: If this is 0, and the journal is a block device, the entire block device is used.
[16:11] <Psi-jack> Poifect.
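[Editor's note: the journal setup Psi-jack lands on can be sketched in ceph.conf as below; the device path is an example, and per the doc snippet he quotes, a size of 0 with a block-device journal means "use the entire device".]

```ini
[osd.0]
    osd journal = /dev/sdb1   ; raw SSD partition, not a file
    osd journal size = 0      ; 0 + block device => whole device is used
```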
[16:12] <nosebleedkt> joao, i need to go. Will you mail me the results?
[16:12] <joao> nosebleedkt, I'm not sure if I can help you with this one
[16:12] <nosebleedkt> oh
[16:13] <joao> have a couple of things to do myself, but will inquire sjust later on to get an idea of what's happening
[16:13] <nosebleedkt> some PGs are stored in an OSD that is still down
[16:13] <nosebleedkt> and ceph does not move those PGs to a good OSD
[16:13] <joao> if the osd is not up, and the pgs are not available on other osds, there is no way to magically move them anywhere really :)
[16:14] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[16:14] <nosebleedkt> shouldnt ceph have replicas in other OSDs ?
[16:14] * gaveen (~gaveen@ has joined #ceph
[16:15] <joao> yeah, that's why I'll be talking with sjust, to get a better understanding of what is going on
[16:15] <joao> I believe it has something to do with osd.1 though
[16:15] <joao> is that the one that is down?
[16:15] <nosebleedkt> nop
[16:15] <nosebleedkt> osd.3 is down
[16:15] <nosebleedkt> osd.0,1,2 are up
[16:16] <joao> oh, I see
[16:16] <joao> pgs are degraded throughout 0 and 1
[16:16] <joao> (maybe 2 too, but haven't seen any yet)
[16:16] <joao> I'm assuming those were the pgs that were mapped on 3 too
[16:17] <nosebleedkt> could be
[16:17] <joao> just can't for the life of me understand why they are degraded
[16:17] <nosebleedkt> i must go now. can we continue tomorrow?
[16:17] <joao> let me take a moment here and revisit the days I used to use windows
[16:17] <joao> would you be willing to restart osd.1 and osd.0 and see what happens?
[16:17] <joao> :)
[16:17] <nosebleedkt> i closed the systems
[16:17] <nosebleedkt> loll
[16:18] <joao> okay
[16:18] <joao> np
[16:18] <joao> I have other stuff to do anyway; I'll talk to sam later and we can check up on this again tomorrow
[16:18] <nosebleedkt> ok
[16:18] <nosebleedkt> thank u
[16:18] <joao> np
[16:19] * nosebleedkt (~kostas@ Quit (Read error: Connection reset by peer)
[16:20] * l0nk (~alex@ has joined #ceph
[16:21] * loicd (~loic@magenta.dachary.org) has joined #ceph
[16:21] * gaveen (~gaveen@ Quit (Quit: leaving)
[16:23] <jtang> one of the guys here just bought 40tb of storage and he's wondering how he's gonna sync it to another machine for disaster recovery :P
[16:23] <jtang> 4k for 24tb of storage :P
[16:24] * gaveen (~gaveen@ has joined #ceph
[16:29] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[16:32] * agh (~2ee79308@2600:3c00::2:2424) Quit (Quit: TheGrebs.com CGI:IRC (Session timeout))
[16:33] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) Quit (Remote host closed the connection)
[16:35] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[16:47] * loicd (~loic@magenta.dachary.org) Quit (Ping timeout: 480 seconds)
[16:47] * vata (~vata@ Quit (Quit: Leaving.)
[16:51] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[16:53] * low (~low@ Quit (Quit: bbl)
[16:55] * loicd (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) has joined #ceph
[17:00] * mgalkiewicz (~mgalkiewi@staticline-31-182-128-114.toya.net.pl) has joined #ceph
[17:04] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:10] * aliguori (~anthony@ has joined #ceph
[17:11] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:27] <iggy> knowing that cephfs isn't considered production ready, is anybody using it on a daily basis (not necessarily production, but just actually using it)
[17:29] <Robe> dreamhost
[17:30] <darkfaded> iggy: i had an interesting experience with it just a week ago, hammering it from a dd loop for about a day without hickups. this is different from a few months ago. so even if noone is directly working to make it more stable, the cleanups of the lower layers really helped.
[17:31] <Robe> iggy: oh, cephfs directly
[17:31] <Robe> nvm
[17:31] <Robe> darkfaded: what hiccups did you run into in the past?
[17:32] <joao> darkfaded, there's people actively working on cephfs to make it much more stable
[17:32] <joao> just putting it out there :p
[17:32] <darkfaded> Robe: i'm good at breaking things, i.e. run a dd and an rsync from two clients made it topside back in feb
[17:32] <iggy> I'm considering replacing an aging media server with 2 little mini-itx's with a few TB worth of disk... priced it out last night at about $600 for the hardware
[17:32] <darkfaded> joao: oh?! i thought that was an item for '13 :)
[17:32] <Robe> darkfaded: how did topside look like?
[17:32] <Robe> and was that with fuse or in-kernel?
[17:33] <iggy> so it's just home use, but I'd still prefer not to lose stuff
[17:33] <joao> slang and gregaf (I think) have been working on it
[17:33] <darkfaded> Robe: in-kernel, and it looked like 4-5KB/s and then complete stall no write IO visible
[17:33] <Robe> darkfaded: ah
[17:33] <darkfaded> like maybe if i had lost the MDSses
[17:34] <darkfaded> takeaway is it seemed much happier this time
[17:34] <Robe> darkfaded: goody
[17:34] <Robe> which kernel did you use for client?
[17:34] <darkfaded> that was a fc16 with 3.6.2 fedora stock kernel
[17:34] <darkfaded> i hope 3.6.2 is right
[17:34] <darkfaded> but think so
[17:35] <darkfaded> and the fc17 rpms
[17:35] <Robe> same kernel in both cases?
[17:35] <darkfaded> :>
[17:35] <darkfaded> no, last time was 3.0-3.2 something
[17:35] <Robe> oh, ok
[17:35] <darkfaded> 10 months in between those tests
[17:36] <Robe> probably "a few" improvements in the kernel client aswell :)
[17:36] <darkfaded> yeah :))
[17:36] <darkfaded> just gnome3 still looked the same :)
[17:37] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[17:37] <Robe> hehe
[17:38] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) has joined #ceph
[17:39] * mgalkiewicz (~mgalkiewi@staticline-31-182-128-114.toya.net.pl) has left #ceph
[17:52] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) Quit (Quit: Computer has gone to sleep.)
[18:02] * terje__ (~joey@71-218-5-161.hlrn.qwest.net) has joined #ceph
[18:02] * jlogan1 (~Thunderbi@2600:c00:3010:1:14a3:ca45:3136:669a) has joined #ceph
[18:02] * terje___ (~terje@71-218-5-161.hlrn.qwest.net) has joined #ceph
[18:04] * terje_ (~joey@75-166-98-10.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[18:04] * terje (~terje@75-166-98-10.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[18:04] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[18:05] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[18:05] <noob2> can i use the rbd or rados python library to get a list of currently mapped devices?
[18:07] <elder> Anybody else have experience connecting gdb to a Qemu VM?
[18:08] <elder> It works for me on my 32-bit guest, but my 64-bit guest is not allowing me to supply the "-s" qemu flag via "virsh edit"
[18:19] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:22] <noob2> elder: have you tried cloudstack yet?
[18:22] <elder> Nope.
[18:22] <noob2> ok just curious
[18:22] <noob2> wondering how easy it is to put together
[18:29] <slang> lxo: see my update for #3597
[18:30] * guigouz (~guigouz@ has joined #ceph
[18:38] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:40] <phantomcircuit> i see that suse has decided btrfs is ready for real use
[18:40] <phantomcircuit> is it just me or is that a bit uh... crazy
[18:40] <noob2> it's a bit early
[18:40] <phantomcircuit> that's a more diplomatic word for it
[18:40] <phantomcircuit> :P
[18:40] <noob2> btr still has some rough edges to me
[18:40] <noob2> lol
[18:40] <noob2> yeah oracle is using it as the default now also
[18:41] <noob2> look at the bright side. they'll work out the bugs faster if they really push it
[18:41] <phantomcircuit> i was going to use cloudstack but it's really a very large and afaict complex system
[18:42] <Psi-jack> Heh
[18:42] <Psi-jack> I use Proxmox VE 2.2 and it's pretty solid, IMHO.
[18:42] <noob2> phantomcircuit: try openstack and then you'll change your mind :D
[18:42] <noob2> cloud looks easy compared to it
[18:43] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:45] <phantomcircuit> openstack is just like
[18:45] <phantomcircuit> rackspace took all their internal systems
[18:45] <phantomcircuit> changed the names
[18:45] <phantomcircuit> and dumped it out
[18:45] <phantomcircuit> enterprise ^ 3
[18:48] <noob2> yup
[18:48] <noob2> it's insanely complex
[18:54] <gregaf> no, that's how we got Swift
[18:54] * chutzpah (~chutz@ has joined #ceph
[18:54] <Psi-jack> heh
[18:54] <gregaf> the Nova stuff if I understand right came from some people who knew what they were doing
[18:55] <gregaf> the *rest* of it came from *other* companies who had internally-produced functionality
[18:55] <gregaf> and jammed it together
[18:55] <gregaf> ;)
[18:55] <Psi-jack> Sweet. ceph-git successfully built.
[18:57] <Psi-jack> Okay. So there's a cluster address, and a public address. But I want to know more about how those are actually utilized.
[18:58] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has joined #ceph
[18:58] <jtang> hrmmm
[18:59] <noob2> gregaf: that makes sense
[18:59] <noob2> nova is pretty decent
[18:59] <noob2> i liked using it
[19:02] <gregaf> the OSDs bind to different ports for:
[19:02] <gregaf> 1) the OSD-OSD communication channel,
[19:02] <gregaf> 2) the OSD heartbeating channel,
[19:02] <gregaf> 3) the communications channel for everybody else (monitors, MDSes, clients)
[19:02] <Psi-jack> Heh...
[19:02] <gregaf> if you specify a cluster address, it will bind to that address for 1 and 2
[19:02] <gregaf> and the public address for 3
[19:02] <Psi-jack> Well. My actual storage network I have combined with my hypervisors over its own private network.
[19:03] <gregaf> otherwise they'll all go together (generally on whatever's "closest" to the monitor IPs)
[19:03] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:03] <gregaf> it's not a problem if they're all the same, but some people have multiple NICs and want to split up the traffic, or firewall it a little bit, or whatever
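[Editor's note: gregaf's three-channel split can be expressed per-daemon in ceph.conf roughly as below; both addresses are placeholders.]

```ini
[osd.0]
    cluster addr =   ; channels 1+2: OSD-OSD replication and heartbeats
    public addr =    ; channel 3: monitors, MDSes, clients
```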
[19:04] <Psi-jack> So the cluster addr, I could point to the SAN IP. And the Public to the public IP and only use that when needed and don't want to include a VM into the cluster network.
[19:04] <Psi-jack> Right. ;)
[19:04] * sjustlaptop (~sam@ has joined #ceph
[19:04] <Psi-jack> I have split networks, dedicated to just the storage network.
[19:04] * houkouonchi-work (~linux@ has joined #ceph
[19:04] <Psi-jack> In fact, the storage side of the network has 2x1Gb, while the public network only has 1x1Gb
[19:05] * drokita (~drokita@ has joined #ceph
[19:06] <jtang> the latest mainline stable kernel on sl6 is doing great on our test system
[19:06] <Psi-jack> Hmmm... http://wiki.debian.org/OpenStackCephHowto I'm using this as a baseline template.. But I'm wondering what the "devs" lines are for?
[19:07] <gregaf> whee, I wonder where that came from
[19:08] <gregaf> devs let you specify devices to format and mount as the OSDs
[19:08] <gregaf> with some of the install methods
[19:08] <Psi-jack> Ahhh
[19:08] <gregaf> they aren't well-tested or -supported
[19:08] <Psi-jack> Gotcha. I'll have it all pre-planned and mounted myself anyway.
[19:08] <Psi-jack> Cause, I have XFS journaling to SSD anyway. ;)
[19:11] <Psi-jack> Okay. So, IF I set cluster addr, and public addr differently, will I still be able to have hosts coming in from cluster and public?
[19:11] * ssedov (stas@ssh.deglitch.com) Quit (Read error: Network is unreachable)
[19:11] * stass (stas@ssh.deglitch.com) has joined #ceph
[19:14] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[19:15] <gregaf> Psi-jack: the OSDs will bind to those addresses
[19:15] <gregaf> clients and monitors need to be able to route to the "public addr" and all the OSDs need to be able to route to the "cluster addr"
[19:16] <Psi-jack> So the OSD's will bind to the Cluster address, for internal traffic only, while mon/clients/etc will bind only to the public addr?
[19:17] <gregaf> the OSDs bind to both; everybody else binds to whatever they have that is "closest" to the monitor IPs (though if public addr or public network is specified they'll preferentially bind to those)
[19:17] <Psi-jack> I would prefer clients to be able to use the cluster address and the public.
[19:19] * gaveen (~gaveen@ Quit (Quit: leaving)
[19:19] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) has joined #ceph
[19:19] <Psi-jack> As in, I don't want bulky traffic going through public. ;)
[19:19] <gregaf> well they only use one IP, but it can be whatever IP you want it to be
[19:20] <gregaf> sounds like maybe you just want everything going through your beefy network though
[19:20] <Psi-jack> Mostly, yes.
[19:20] <Psi-jack> Stressing mostly. ;)
[19:21] <Psi-jack> Then again, thinking about it some... For my mail server store, I can just make an rbd disk for the storage and not care. I used to NFS mount the /home, but don't really need to do that anymore.
[19:21] <Psi-jack> Long as I can get it access to the network. ;)
[19:21] <Psi-jack> Err to the rbd
[19:22] <Psi-jack> I was thinking of having ALL rbd traffic go through the storage (cluster) network, and all CephFS/RBD mounts actually just going through the public network. That's what I was thinking.
[19:23] <Psi-jack> By all RBD, I mean, Hypervisor's directly providing a disk to the guest VM.
[19:23] <gregaf> no, you can't do that
[19:23] <Psi-jack> and CephFS/RBD mounts, having the guest VM OS mounting within it.
[19:23] <Psi-jack> Yeah.. I see..
[19:24] <Psi-jack> Either way, I could expand out my storage network subnet I suppose, so I could add multiple VM's onto their bridges into that network as needed then, and just have ALL traffic go through the dedicated storage network only.
[19:24] <wer> osd.3: running failed: '/usr/bin/ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok Ceph v.55 just dies. And it isn't doing anything :)
[19:25] <wer> 0 -- >> pipe(0x47e2900 sd=37 :6840 pgs=0 cs=0 l=0).accept connect_seq 8 vs exis
[19:25] <wer> ting 7 state standby This is what the osd's do.
[19:25] <Psi-jack> So, I'd set cluster network and public network to the same network, and cluster addr and public addr to each hosts's IP's?
[19:25] <gregaf> if you set the network you don't need to set the host addresses individually
[19:26] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: Leaving)
[19:26] <Psi-jack> Gotcha, just in [global] set both?
[19:26] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[19:26] * ChanServ sets mode +o scuttlemonkey
[19:27] <gregaf> yep
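[Editor's note: i.e. a minimal [global] sketch for Psi-jack's "everything on the storage network" case; the subnet is an example, not from the log.]

```ini
[global]
    ; same subnet for both => all traffic stays on the storage network
    cluster network =
    public network =
```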
[19:28] * gaveen (~gaveen@ has joined #ceph
[19:28] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:30] <Psi-jack> filestore fiemap = false ; solve rbd data corruption (sileht: disable by default in 0.48) Does this apply for 0.55 (git master checkout), as well?
[19:31] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[19:33] * Cube (~Cube@ has joined #ceph
[19:34] * Ryan_Lane (~Adium@ has joined #ceph
[19:35] <Psi-jack> mon is what the clients connect to, if I understand correctly?
[19:37] * gaveen (~gaveen@ Quit (Quit: leaving)
[19:37] * gaveen (~gaveen@ has joined #ceph
[19:38] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[19:39] <iggy> clients connect to mds and osd (iiuc)
[19:40] <Psi-jack> Hmm, interesting.. Heh, this example config I'm looking at specifically doesn't even define mds's
[19:41] <iggy> they are only required for cephfs, so you're probably looking at an rbd only config
[19:41] <Psi-jack> Nope. I might end up using CephFS.
[19:41] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) Quit (Quit: wer)
[19:42] <Psi-jack> The main thing is.. Can I mount either (or both), an rbd and/or cephfs on multiple systems at the same time?
[19:42] <iggy> I meant the example was probably for and rbd only setup
[19:42] <Psi-jack> And use the data at the same time, for use in like a mail server's Maildir store?
[19:42] <Psi-jack> iggy: Oh yeah. ;)
[19:43] <iggy> you can't mount an rbd on multiple systems
[19:43] * sjustlaptop (~sam@ has joined #ceph
[19:43] <iggy> well, not without using something like GFS or ocfs
[19:43] <Psi-jack> Kinda thought that would be the case for rbd.
[19:43] <janos> rbd exposes a block device
[19:43] <Psi-jack> Right. :)
[19:43] <iggy> the same reason you can't mount an iscsi volume multiple places
[19:43] <Psi-jack> But, CephFS should be able to do so?
[19:43] <iggy> yes
[19:43] <Psi-jack> Sweet!
[19:44] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:44] <Psi-jack> So that will definitely solve my NFSv4 problems. ;D
[19:44] <janos> CephFS is under heavy development, iirc
[19:44] <Psi-jack> janos: Correct, next LTS coming out in 2~3 weeks.
[19:45] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) has joined #ceph
[19:46] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[19:46] <Psi-jack> I was told once a while back that only 1 mds should generally be needed..
[19:46] <Psi-jack> How actually true or false is that? heh.
[19:48] <slang> Psi-jack: It depends on how large you expect your file namespace to grow
[19:48] <gregaf> s/large/active
[19:49] <Psi-jack> Been trying to find [mds] section settings at all, but so far, failing.
[19:49] <Psi-jack> I'm guessing it probably has an mds data for setting the datadir.
[19:51] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[19:52] * yasu` (~yasu`@dhcp-59-224.cse.ucsc.edu) has joined #ceph
[19:52] <via> is src/leveldb supposed to be missing in next on github?
[19:55] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[19:56] <gregaf> it's a submodule
[19:57] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[19:58] <via> i see
[20:03] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[20:04] * dshea (~dshea@masamune.med.harvard.edu) has joined #ceph
[20:05] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[20:07] <dshea> If I need to change the ip addresses of all the monitors in a cluster, which monmap file(s) do I need to clobber with the monmap tool? I tried last_committed for each mon in the cluster, but I am still getting the warning about the mon addr not matching the monmap file. But it doesn't say which monmap file. (-.-)
[20:14] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[20:17] <joao> you shouldn't clobber files on the mon store
[20:17] <joao> that's a really bad idea
[20:18] <joao> dshea, '--inject-monmap' to your 'ceph-mon' might be a better option than messing around in the mon store
[20:19] <dshea> is there any documentation on how to do this, searching the site only turned up the monmaptool manpage
[20:20] <gregaf> if you're trying to migrate IPs for your monitors, then (if it's available) a better choice is generally to do an incremental add of the new ones and then remove the old ones
[20:20] <Psi-jack> Okay. About how much space should I allocate for mon and mds stores, as I'm likely to just put those straight on the SSD.
[20:22] <dshea> gregaf: it's a test cluster, so the ip traffic is moving from a single vlan to 2 vlans, which actually is another question, mons should be listening on port 6789 on the public network to handle cephfs mount requests?
[20:22] <joao> Psi-jack, far from being a production cluster, but this monitor has been under pretty heavy load for the last couple of days
[20:22] <joao> ubuntu@plana41:~/master-ceph/src$ du -chs dev/mon.a/
[20:22] <joao> 35M dev/mon.a/
[20:23] <Psi-jack> Hmm, only 35M total utilization?
[20:23] <joao> I would think that the monitor store doesn't tend to grow way big, but maybe someone else can provide some more insight
[20:23] <joao> Psi-jack, that's the store
[20:23] <gregaf> dshea: the monitor IPs need to be accessible to everybody accessing the Ceph cluster, yes
[20:24] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:24] <gregaf> we don't have documentation on migrating monitor IPs because you're really not supposed to do it wholesale ;)
[20:25] <joao> gregaf, I'm starting to think that we probably should at least say that much on the docs
[20:25] <dshea> gregaf: thanks, so is there a way to in-place remap the mons or do I just need to rebuild the cluster from scratch?
[20:25] <joao> "you're not supposed to, but if you *really* need, here's a couple of ways that you'd be able to do it without messing everything up"
[20:26] <Kioob> with version 0.55 I still have these warnings: mount syncfs(2) syscall not supported / mount no syncfs(2), must use sync(2). / mount WARNING: multiple ceph-osd daemons on the same host will be slow
[20:26] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[20:26] <Kioob> I was thinking it was "solved" in 0.55, no?
[20:28] <gregaf> dshea: hmm, I haven't thought it through although it's nominally possible (but unpleasant)
[20:28] <gregaf> joao, you want to take a stab at it?
[20:28] <joao> gregaf, sure
[20:28] <dshea> I can easily envision cases where networking just walks up one day and says surprise! new vlans
[20:29] <dshea> I was thinking just put the new ips in the configuration file and then re-fire it up, but apparently there is a lot more to this under the covers
[20:30] <joao> the monitors keep their own monmaps stashed in the store; changing the config won't change that
[20:30] * sjustlaptop (~sam@ has joined #ceph
[20:30] <joao> dshea, if you really have to, '--inject-monmap <monmap-file>' should work
[20:30] <joao> do that on all your monitors
[20:31] <dshea> is this monmaptool option? or ceph?
[20:31] <joao> ceph-mon's option
[20:32] <joao> gotta run
[20:32] <joao> bbl
[20:33] <dshea> thanks joao
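A hedged sketch of the `--inject-monmap` procedure joao outlines above, assuming three monitors named a, b and c; every name, path and address below is a placeholder for illustration:

```shell
# Stop every monitor first -- never inject into a running mon.
service ceph stop mon

# Grab the current monmap (or extract it from a mon store).
ceph mon getmap -o /tmp/monmap

# Rewrite the entries with the new addresses.
monmaptool --rm a --rm b --rm c /tmp/monmap
monmaptool --add a --add b \
           --add c /tmp/monmap

# Inject the edited map into *every* monitor, then restart.
ceph-mon -i a --inject-monmap /tmp/monmap   # repeat for b and c
service ceph start mon
```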
[20:35] <Kioob> "osd: use syncfs(2) even when glibc is old" <== what should I understand?
[20:38] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[20:53] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[20:56] * fede1 (~fede@ has joined #ceph
[20:57] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[21:03] * sjustlaptop (~sam@2607:f298:a:607:b0b4:db53:526f:cac7) has joined #ceph
[21:11] <mikedawson> Is there a known bug affecting 0.55 where ceph-osd has high CPU usage? I'm seeing one node (out of eight) with 3 OSDs, where top shows 80%-95% CPU for each ceph-osd process. Eventually the processes will segfault (one at a time).
[21:11] <mikedawson> Processes restart and chew CPU forever until they die.
[21:12] * BManojlovic (~steki@242-174-222-85.adsl.verat.net) has joined #ceph
[21:12] <joshd> mikedawson: no, if you could get logs with debug osd = 20, debug filestore = 20, and debug journal =20 that'd be great
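For reference, the debug levels joshd asks for would look like this as a ceph.conf fragment on the affected node (a sketch; restart the OSDs after adding it):

```ini
; Verbose logging for the misbehaving OSDs, per joshd's request.
[osd]
    debug osd = 20
    debug filestore = 20
    debug journal = 20
```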
[21:12] <mikedawson> The other seven servers show 1%-2% cpu utilization for all OSD processes
[21:14] * ebo^ (~ebo@ has joined #ceph
[21:14] <Psi-jack> joao: Hmm, so I could definitely EASILY get away with 3 mon's with like 5GB each on SSD, probably.
[21:15] <Psi-jack> And mds.. Hmm, probably maybe 5~10GB, mostly all that'll be CephFS based is website content and Maildir storage.
[21:17] <joshd> mds doesn't use local storage except for logs, it stores everything in rados
[21:18] <Psi-jack> Logs... Journal logs?
[21:18] <sjust> mds logs
[21:18] <ebo^> that would be a great faq question: what does mds, mon, osd store on disk (in case of osd, apart from data)
[21:18] <Psi-jack> ebo^: Exactly! That's exactly why I'm asking. ;)
[21:18] <Psi-jack> I don't want to over-allocate too much when it's not needed.
[21:19] <Psi-jack> I'm using 3x120GB SSD's for journal/metadata stuff, spindles for data itself.
[21:19] <gregaf> sjust: Psi-jack: *debugging* logs
[21:19] <gregaf> not any persisted data
[21:20] <Psi-jack> Okay. So 512MB~1GB is more than enough?
[21:20] <gregaf> I wouldn't give the MDS its own partition at all
[21:20] <Psi-jack> Heh
[21:21] <mikedawson> joshd: logs -> http://pastebin.com/pCRXrDBJ
[21:22] <Psi-jack> hehe
[21:23] <Psi-jack> gregaf: Eh, well, it'll have a mount point, nothing major.
[21:23] <joshd> sjust: is that a bug in DBObjectMap?
[21:24] * sjustlaptop (~sam@2607:f298:a:607:b0b4:db53:526f:cac7) Quit (Ping timeout: 480 seconds)
[21:24] <Psi-jack> I can't even determine if mds itself has like "mds data = " or /any/ mds values at all, so far, from documentation.
[21:24] <sjust> joshd: one sec
[21:24] <Psi-jack> Unlike, mon and osd, which are very clearly documented.
[21:26] <ron-slc> Does anybody know if a CRUSH map is able to make a rule like this: Select minimum of 2 hosts (of 2 total, 4 OSD's each), with rep-size=3. When I chooseleaf based type=osd I see several results with all 3 placements on same host, and when I chooseleaf based on host, I only see 2 placements.
[21:26] <gregaf> depending on how you set it up you can specify an "mds data" directory, but the only thing it uses that for is assuming the keyring lives there and using it
[21:27] <Psi-jack> Hmmm, ahh, yes..
[21:27] <gregaf> ron-slc: unfortunately that's not really possible
[21:27] <Psi-jack> Which, I will probably just put the keyring in /etc/ceph/keyring.$name
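As a sketch of the keyring arrangement being discussed (paths are examples): since "mds data" is only used to locate the keyring, pointing the keyring option elsewhere avoids needing a dedicated MDS data directory at all:

```ini
; Keep MDS keyrings under /etc/ceph instead of a per-daemon data dir.
[mds]
    keyring = /etc/ceph/keyring.$name
```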
[21:27] <gregaf> you could mangle something, but CRUSH isn't designed for and doesn't do real well when the number of buckets is very close to the number of replicas
[21:29] * fede1 (~fede@ Quit (Quit: WeeChat 0.3.8)
[21:30] * fedepalla (~fede@ has joined #ceph
[21:31] <ron-slc> gregaf: thanks! that's what it's been looking like.
[21:31] <sjust> mikedawson: if they are all on the same server, is it possible that something wonky is going on with the disks?
[21:32] <gregaf> ron-slc: if you were feeling particularly devoted, you could do something like splitting up the disks on each server into groups A and B, where B is twice as large as A
[21:32] <mikedawson> Perhaps, but not really sure how to check
[21:32] <gregaf> and create a CRUSH map with buckets host1, host2, hostmixed
[21:33] <gregaf> where host1 has B from host1, host2 has B from host2, and hostmixed has A from both hosts
[21:33] <gregaf> and write the rule to select a disk from each group
[21:33] <gregaf> but that would be a pain to handle on any expansion, and with only 4 disks wouldn't balance real well anyway since you can't do a clean 2:1 split
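A hypothetical decompiled-CRUSH-map fragment for the host1/host2/hostmixed idea gregaf sketches, assuming 2 hosts with 4 OSDs each (osd.0-3 on host1, osd.4-7 on host2); all bucket names, ids and weights are illustrative, and as noted the 2:1 split can't be clean with 4 disks:

```
# Group B: two disks kept in each host's own bucket.
host group1 {
        id -1
        alg straw
        hash 0
        item osd.0 weight 1.0
        item osd.1 weight 1.0
}
host group2 {
        id -2
        alg straw
        hash 0
        item osd.4 weight 1.0
        item osd.5 weight 1.0
}
# Group A: the remaining disks from both hosts, pooled.
host groupmixed {
        id -3
        alg straw
        hash 0
        item osd.2 weight 1.0
        item osd.3 weight 1.0
        item osd.6 weight 1.0
        item osd.7 weight 1.0
}
# One replica from each group via separate take/emit blocks.
rule spread {
        ruleset 1
        type replicated
        min_size 3
        max_size 3
        step take group1
        step choose firstn 1 type osd
        step emit
        step take group2
        step choose firstn 1 type osd
        step emit
        step take groupmixed
        step choose firstn 1 type osd
        step emit
}
```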
[21:33] <mikedawson> sjust: these 3 OSDs share a single SSD for their three journal partitions. OS is on that same SSD. No apparent issue with the OS at present, though.
[21:34] * dmick (~dmick@2607:f298:a:607:cdcc:47f1:28cc:d2b7) has joined #ceph
[21:34] <ebo^> when i do the "getting started" replicas can be placed on the same host. what do i have to change to get no replicas on the same host?
[21:35] <gregaf> ebo^: you'll need to update your CRUSH map and rules to avoid that
[21:36] <sjust> mikedawson: what version are you running?
[21:36] <sjust> oh
[21:36] <sjust> wait
[21:36] <sjust> it's in the backtrace
[21:36] <mikedawson> sjust: this is a four node in 2U setup. The other four nodes share a common backplane. Wiring is split out to each node's SATA ports. The other three nodes in this chassis have normal CPU utilization.
[21:36] <gregaf> however, the steps you go through with getting started will avoid placing replicas on the same host if you have more than two or three of them, and if you've got a cluster that small then doing the split can have other undesirable effects on balance and such
[21:36] <mikedawson> 0.55 on Quantal
[21:38] <ebo^> i'm still not sure if i really want journals on ssds
[21:38] <ebo^> losing one ssd in my setup would mean half a node has to be rebuilt, and i have only 6 nodes
[21:38] <Psi-jack> ebo^: Why?
[21:39] <Psi-jack> Have a spare SSD? ;)
[21:39] <ebo^> that's not the point
[21:39] <Psi-jack> Those things are hotswappable. ;)
[21:40] <ebo^> losing the journal means rebuilding the associated osds
[21:40] <Psi-jack> Just by losing the journal, which should only contain temporary data not yet written to disk?
[21:40] <ebo^> yes
[21:41] <Psi-jack> Correct.
[21:41] <Psi-jack> This would be the same situation as if you didn't use an SSD, and a disk died.
[21:41] <Psi-jack> ANY disk.
[21:41] <ebo^> i have 8 disks, which would mean 4 journals on 2 ssd
[21:41] <ebo^> 1 ssd dead = osds dead
[21:42] <ebo^> *4 osds
[21:42] <Psi-jack> SSD's can take a lot of beating. :)
[21:43] <Psi-jack> Unless they're Intels, then you can kill those in 2 weeks or less. ;)
[21:43] <ebo^> i'm interested in situations that dont follow the plan
[21:44] <sjust> joshd: ugh, that assert is quite impossible
[21:44] <ebo^> i just lost 12 disks out of a 24 disk raid 6 ... in like 2 weeks
[21:44] <ebo^> needless to say, the data is gone
[21:44] <sjust> lower_bound and upper_bound both end in a call to adjust()
[21:44] <sjust> which ends in an assert(invalid || cur_iter->valid())
[21:45] <sjust> we immediately then fail an assert(!(!invalid && ready) || cur_iter->valid())
[21:45] <sjust> so cur_iter->valid() must be false
[21:46] <sjust> so invalid is true
[21:47] <Psi-jack> What is this CRUSH, anyway? And how do you set it? My minimal understanding is CRUSH is used to set how many replica's you have, and on how many hosts.
[21:48] <mikedawson> sjust: do you need me to provide any more debugging or try anything on my end?
[21:48] <sjust> anyway, it means we got an error in upper_bound or lower_bound
[21:48] <sjust> mikedawson: looks like corruption in leveldb
[21:48] <sjust> or the filesystem
[21:49] <mikedawson> filesystem at /var/lib/ceph/osd/ceph-16 is XFS and appears to be in good shape
[21:50] * guigouz (~guigouz@ Quit (Quit: Computer has gone to sleep.)
[21:50] <mikedawson> is there a leveldb instance per node? If so, how could I verify it is hosed on this node?
[21:51] <sjust> hmm
[21:51] <sjust> there is a leveldb instance per osd daemon
[21:52] <mikedawson> seems odd that the 3 osds on this one node are out of whack and no other nodes are affected
[21:53] <mikedawson> this is dev, so I'm happy to break anything if it'll help you
[21:54] * drokita (~drokita@ Quit (Read error: Connection reset by peer)
[21:54] <dmick> Psi-jack: CRUSH is the algorithm used for placing data in the cluster. It's what allows the clients to send almost all their traffic directly to the storage nodes (OSDs) rather than having to arbitrate through the central controller nodes (monitors)
[21:55] <Psi-jack> Hmmmm
[21:55] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[21:55] <dmick> default maps are created for you, but you can tune/adjust yourself: http://ceph.com/docs/master/rados/operations/crush-map/
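The usual edit cycle for tuning the default map dmick mentions looks like this (file names are arbitrary):

```shell
ceph osd getcrushmap -o crushmap.bin       # fetch the compiled map
crushtool -d crushmap.bin -o crushmap.txt  # decompile to editable text
$EDITOR crushmap.txt                       # adjust buckets/rules
crushtool -c crushmap.txt -o crushmap.new  # recompile (syntax-checks too)
ceph osd setcrushmap -i crushmap.new       # push it to the whole cluster
```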
[21:55] <Psi-jack> Okay, I'll be back in about 20~30 mins, heading home. But my ultimate question is, do I set up CRUSH stuff while making the initial CephFS OSDs, or is it per storage container (RBD for example), etc
[21:55] <Psi-jack> Cool. Will check it out when I get home. :)
[21:59] <dmick> sjust: I have a dumb little leveldb dumper if that will help you
[21:59] <dmick> (I imagine you do too, but jic)
[21:59] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[21:59] <mikedawson> dmick: let me know what to do, and I'll dump the leveldb
[22:00] <dmick> mikedawson: pm me with an email address, and I'll send you source and build instructions? It's a small program
[22:01] <mikedawson> dmick@... ?
[22:01] <dmick> well I meant PM here, but sure, dan.mick@inktank.com
[22:02] <mikedawson> oops. email sent
[22:02] <dmick> got it
[22:03] <dmick> it'll take me a few minutes to be sure it can work outside the Ceph source tree
[22:05] <ron-slc> gregaf: I see what you're saying on the hostmixed. And you're right expansion would be a little challenging. I think I'll add a 3rd host, seems much easier/safer to have minimum # of hosts >= replicas
[22:05] <dshea> I've segregated my two networks in ceph.conf. I have public on and cluster My mon processes have their mon addr on the public network. I can't seem to mount cephfs; I get an I/O error. ceph -s shows everything as being up and healthy. I think I'm missing something obvious here, but my understanding is the osds will speak to one another over the cluster network and my clients will mount by attaching to the mon process on the public network. All the nodes are dual nic and sit on both networks.
[22:05] <dmick> mikedawson: sjust: this has to happen with the cluster shut down, of course
[22:05] <ron-slc> also the weighting can be left alone.
[22:05] <dmick> because the db is locked by the osd
[22:05] <sjust> dmick: just the osd, right?
[22:06] <dmick> right
[22:06] <slang> dshea: how are you trying to mount?
[22:06] <dshea> mount.ceph /mnt/ceph -v
[22:06] <dshea> that's where mon.a sits on my public network
[22:08] <mikedawson> dmick: so do I shut down *all* osds local, and on other nodes? Or just shutdown the osd process for the leveldb I'm trying to dump?
[22:08] <dmick> just the one you want to dump should be enough
[22:08] <sjust> just the one for the leveldb you want to dump
[22:08] <slang> I think you want those to be and
[22:09] <dmick> mikedawson: sjust: mail away
[22:10] <dshea> no, I want a class C ( subnet mask), which is /24
[22:12] <slang> dshea: anything in the log besides EIO?
[22:12] <slang> dmesg
[22:12] <Psi-jack> Hmmm, interesting.
[22:13] <Psi-jack> So you have to "decompile", change, and recompile a crush map? Odd.. ;)
[22:13] <dshea> [412556.570085] libceph: loaded (mon/osd proto 15/24, osdmap 5/6 5/6)
[22:13] <dshea> [412556.603895] ceph: loaded (mds proto 32)
[22:13] <dshea> [412556.605630] libceph: mon0 connection failed
[22:13] <slang> dshea: yeah you're right /24
[22:13] <dshea> root@cluster-gw:~# nmap
[22:13] <dshea> Starting Nmap 5.21 ( http://nmap.org ) at 2012-12-11 16:13 EST
[22:13] <dshea> Nmap scan report for node00 (
[22:13] <dshea> Host is up (0.00029s latency).
[22:13] <dshea> Not shown: 998 closed ports
[22:13] <dshea> PORT STATE SERVICE
[22:13] <dshea> 22/tcp open ssh
[22:13] <dshea> 6789/tcp open ibm-db2-admin
[22:13] <dshea> MAC Address: 68:05:CA:08:A6:B1 (Unknown)
[22:13] <Psi-jack> I take it you only have to change it once, on any of the nodes, and it'll globally use it?
[22:13] <dshea> Nmap done: 1 IP address (1 host up) scanned in 1.65 seconds
[22:13] * gucki_ (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) has joined #ceph
[22:13] <dmick> Psi-jack: yep. I assume the compiler does some consistency checking that's useful to avoid nuking your cluster
[22:14] <dshea> /etc/services shows ibm for the port, but it is mon and not a db2 process ;)
[22:14] <gucki_> hi there
[22:14] <gucki_> do you have any idea what's the best way to move a ceph cluster (around 2tb data) from one colocation to another with minimal service interruption?
[22:15] <Psi-jack> hehe
[22:15] * gucki_ (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) Quit (Remote host closed the connection)
[22:15] <Psi-jack> dmick: Cool. Sounds good. And it has to be done after it's all running too, I presume.
[22:15] <Psi-jack> As in after all the mkcephfs stuff is done.
[22:15] <gucki> sry, was logged in twice ;)
[22:16] <dmick> Psi-jack: well if you want to look at the default one that the cluster installation has created for you, yes :)
[22:16] <Psi-jack> hehe
[22:17] <dmick> see the section of mkcephfs dealing with usecrushmap
[22:17] <Psi-jack> heh.
[22:17] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:18] <Psi-jack> I'll probably let it create its own first, then follow this doc to adjust it to my specific needs after, since that's mostly what I'll be doing anyway, if ever I need to edit it again.
[22:18] <dmick> sure
[22:18] <slang> dshea: anything in the monitor log?
[22:19] <gregaf> gucki: what kind of bandwidth do you have between the locations?
[22:19] <dshea> 2012-12-11 16:03:20.722278 b2756b40 0 -- >> pipe(0x9221380 sd=22 pgs=0 cs=0 l=0).accept peer addr is really (socket is
[22:20] <gucki> gregaf: hard to tell. the servers have 1gbit uplinks, but i doubt the interconnection between the two colos is that fast
[22:20] <gregaf> is it sufficient to handle your aggregate read and write bandwidth?
[22:20] <gucki> gregaf: i thought about rsyncing the osd directories (which will take 1-2 days?), then shutting down the osds and doing a final rsync again (hopefully only 1-2 hours?).
[22:21] <Psi-jack> Okay... So.. This is what I have so far for my ceph.conf for the initial built-out of a 3-node, 3 OSD, 3 MON, and 3 MDS cluster where mounts will all be laid out over /srv/ceph/{osd|mon|mds}$id: http://pastebin.ca/2291567
[22:21] <gregaf> gucki: I think that would work for the gross data movement, but let's wait until sjust gets back to confirm
[22:21] <gregaf> you'd need to be careful about xattrs and things
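A hedged sketch of the rsync approach gucki describes, with gregaf's xattr caveat addressed; all paths and host names are placeholders. `-X` carries xattrs (ceph stores object metadata there), `-H` hard links, `-A` ACLs:

```shell
# First pass while the old cluster is still running (slow, bulk copy).
rsync -avHAX /var/lib/ceph/osd/ceph-0/ newhost:/var/lib/ceph/osd/ceph-0/

# ...then stop the OSDs on the old site and do a final catch-up pass.
rsync -avHAX --delete /var/lib/ceph/osd/ceph-0/ newhost:/var/lib/ceph/osd/ceph-0/
```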
[22:22] <Psi-jack> if anyone has a moment to take a quick look, see if I have any flaws in the setup. Everything's all going to be on XFS for OSD, MON, and MDS, and the journal's are raw block device partitions.
[22:22] <gregaf> and moving over your monitors would be a bit trickier; you'd probably want to bring up the new OSDs, then add monitors in the new location and remove monitors in the old location once they're up
[22:22] <lurbs> Psi-jack: If you're using a block device for the journal you don't need the size specified, I think. It'll default to using the whole device.
[22:22] <gucki> gregaf: yes, it should be sufficient to handle the bandwidth usage when the system is running. it'd be ideal to make ceph use the local osd for reading, but write to local and remote osd
[22:22] <Psi-jack> lurbs: heh, the documentation says to set the size to 0.
[22:23] <lurbs> Hmmm, may have changed.
[22:23] * lurbs checks.
[22:23] <Psi-jack> heh
[22:23] <gucki> gregaf: so all data is in both datacenters and i can simply shutdown the cluster and restart it in the other dc. of course writes would be much slower, but that would be ok for 1-2 days..
[22:24] <slang> dshea: you are mounting from
[22:24] <lurbs> Psi-jack: "Since v0.54, this is ignored if the journal is a block device, and the entire block device is used."
[22:24] <gregaf> gucki: there's not a great way to make it do local reads, but what you could do is (and you'd want to test this on a small bit of data first to see how it impacts your use case, and to tune up the config options)
[22:24] <gucki> gregaf: when doing the rsync thing i'd completely shutdown the old cluster before starting the new one. i think i'll have to manually change the ips of the monitors (i had once to do that in a test setup)
[22:24] <lurbs> Depends on the version, then.
[22:24] <Psi-jack> lurbs: Heh. Funny. ;)
[22:24] <Psi-jack> Ohhh, I missed that blurb at the end. ;)
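An example of the journal setup from this exchange, as a ceph.conf sketch (device paths are placeholders):

```ini
; Raw-partition journal; since v0.54 the size setting is ignored for
; block devices and the entire partition is used.
[osd.0]
    osd journal = /dev/sde1
    osd journal size = 0
```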
[22:25] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[22:25] <gregaf> but you could add a separate CRUSH group which consists of the remote OSDs, and for all your CRUSH rules add a separate take/emit group that grabs something out of those OSDs
[22:25] <dshea> slang, yup, cluster-gw sits between the 10 nodes (node00-node09) and the "real world"
[22:25] <gregaf> and that would go through recovery to put one copy of all the data in the remote data center
[22:25] <Psi-jack> lurbs: Otherwise, looks good?
[22:26] <gregaf> and once that's quiesced you could make it do two remote and one local, and then eventually just put them all remote, and then your data would have moved
[22:26] <gregaf> that's the all-Ceph way to do it
[22:26] <Psi-jack> My goal is to have this launched this weekend, when hardly (if anyone) here in #ceph will be around. ;)
[22:26] <gregaf> but rsyncing might indeed be faster
[22:26] <dshea> slang: the nodes have two nics each provisioned on the cluster and public networks, cluster-gw has a public network ip and another nic that talks to the outside world
[22:27] <Psi-jack> About the only thing I'm not setting currently is "osd pg bits"
[22:27] <dshea> slang: gotta go, but I will return soon
[22:27] <gregaf> (or I guess really you would make the remote the primary rather than a replica, and then add more remote copies so that the rest of those copies were local transfers)
[22:27] * dshea is now known as dshea_afk
[22:29] <lurbs> Psi-jack: Which version are you using? The cephx auth stuff changed relatively recently.
[22:30] <Psi-jack> lurbs: I plan to be using today's current ceph-git build I just did.
[22:30] <Psi-jack> lurbs: So, yeah, cephx is something I obviously need to understand and have setup right.
[22:30] <Psi-jack> Almost forgot about that. ;)
[22:30] <lurbs> http://ceph.com/docs/master/rados/operations/authentication/
[22:30] <lurbs> See 6).
[22:31] <lurbs> Disclaimer: I'm not a Ceph dev, just a user.
[22:31] <gucki> gregaf: ok, so i'll have a deeper look at the crushmap the next few days and try to make a test setup. the problem with the real system is i'd then have to update all ip bindings (currently it's all on a local subnet) and i'm not really sure how stable the interconnection link is (and if ceph will properly handle this)
[22:31] <Psi-jack> lurbs: Understood. ;)
[22:31] <gregaf> gucki: ah
[22:31] <Psi-jack> Hmm
[22:32] <gregaf> yeah, if you aren't confident in your link and your networking I'd probably just do the rsync as that can all happen offline and be easily "reversed"
[22:32] <Psi-jack> I wonder how cephx will work from my Proxmox VE servers...
[22:32] <Psi-jack> Which reminds me, is it even wise to be using ceph 0.48 clients (for RBD primarily) with 0.55(master) servers? ;)
[22:33] <mikedawson> dmick, sjust email sent with leveldb dump for one of the high CPU OSDs (it was too big to paste). Thanks for your help. I'll be offline for a bit
[22:33] <gucki> gregaf: i mean it should be stable, but it's from one country to another (only 600km though), and you never know... ;)
[22:33] <Psi-jack> joao: Probably a gooooood question for you, since my clients are going to be using ceph 0.48 clients, while the servers would be using 0.55 from git's master checkout today.
[22:34] <gucki> gregaf: but thank you very much for you idea, i'll look at it for sure the next few days :)
[22:35] <gucki> gregaf: btw, do you know when the next stable will come out? 0.55 has some fixes i need, but i'd like to wait for a stable release.. :)
[22:36] <sjust> gucki: bobtail will be RealSoonNow
[22:36] <gucki> sjust: probably a christmas gift? :)
[22:37] <dmick> ok mikedawson, thanks. I cede the floor to sjust
[22:38] <gucki> sjust: did you read my rsync idea? do you think it'll work with xfs?
[22:38] <sjust> gucki: in principle, maybe, but for 2TB of data, a straight rados copy would be much easier
[22:38] * fedepalla (~fede@ Quit (Quit: WeeChat 0.3.8)
[22:39] <slang> dshea_afk: is there an interconnect between the two networks public/cluster?
[22:39] <sjust> which I guess we don't have a tool for, now that I think about it
[22:39] <lurbs> I vaguely recall there being a limitation with OSDs being numbered sequentially. Has that been lifted?
[22:40] <gucki> sjust: i'm using qemu-rbd, can i still use the rados copy? would it work incrementally like rsync? (so slow initial copy but fast final sync?)
[22:41] <dmick> lurbs: not really.
[22:43] * jjgalvez (~jjgalvez@ has joined #ceph
[22:44] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[22:48] <sjust> gucki: at present, there isn't a rados copy, I'm not sure of the implications of copying the cluster contents... joshd?
[22:49] <lurbs> dmick: So naming OSDs 'osd.XY', where X is the node, and Y is the OSD on the node, is a bad idea?
[22:51] * slang (~slang@cpe-66-91-114-250.hawaii.res.rr.com) Quit (Remote host closed the connection)
[22:51] * Cube (~Cube@ Quit (Quit: Leaving.)
[22:51] <Psi-jack> Hmm? :)
[22:51] * slang (~slang@cpe-66-91-114-250.hawaii.res.rr.com) has joined #ceph
[22:53] * Cube (~Cube@ has joined #ceph
[22:56] <Psi-jack> lurbs: Good question, since that's the way I was setting mine up, osd.{11,12,13}, osd.{21,22,23}, and osd.{31,32,33}
[22:58] <lurbs> Psi-jack: Yeah, mine are all sequential. Was wondering if you'd see that in a howto or doc somewhere.
[23:00] <Psi-jack> lurbs: I did, actually.
[23:01] <Psi-jack> lurbs: http://wiki.debian.org/OpenStackCephHowto
[23:01] <dmick> lurbs: yes, that's a bad idea. in general I'd recommend avoiding attaching any significance to osd names
[23:01] <Psi-jack> A debian howto doc, specifically. heh
[23:02] * dshea_afk is now known as dshea
[23:02] <Psi-jack> dmick: So, node1: 1, 2, 3; node2: 4, 5, 6; node3: 7, 8, 9 would be the better approach?
[23:02] <dmick> the gist is that data structures are sized based on the maximum number seen, and indexed by number
[23:02] <dshea> slang: they are non-routable
[23:02] <dmick> so if the integers are small enough, things work...oKAY...but bad stuff happens if you try to get fancy with it
[23:03] <dmick> Psi-jack: I'd say 0-based, even, but yeah, IMO
[23:03] <dshea> slang: they are all run into a 48-port switch that has them segregated into 2 VLANs
[23:03] <Psi-jack> hmmm, okay.
[23:03] <Psi-jack> dmick: What about mon.$id and mds.$id?
[23:03] <dmick> afaik mon ids are arbitrary
[23:03] <Psi-jack> ceph's docs demonstrate using a, b, etc.
[23:03] <dmick> I'm not certain about msd
[23:03] <dmick> *mds
[23:03] <Psi-jack> Okay.
[23:04] <joshd> mds and mon ids are both arbitrary, it's only osds that are tied directly to them
[23:04] <Psi-jack> Well, I'll renumber these and go with the suggested route with the osd's then. ;)
[23:04] <dshea> slang: I was under the impression they should be since the idea is to segregate the network traffic of the cluster osd processes from the client requests which go to the (I am thinking) mon procs
[23:05] <Psi-jack> As for ordering, would you stack 0,1,2 onto node1, or 0 on node1, 1 on node2, 2 on node4?
[23:05] <Psi-jack> node3*
[23:05] <Psi-jack> I have a slight odd-ball setup due to disk constraints. Each node will have a 320GB, 500GB, and 1TB HDD in each.
[23:05] <dmick> doesn't matter; call them what you like, as long as the crush map does what you're after
[23:05] <dshea> slang: let me see if I have a quick diagram I had done in dia
[23:05] <Psi-jack> dmick: Gotcha.
[23:06] <dmick> well
[23:06] <gucki> another question...when using librdb (ex python binding), how can i get the state/ progress of a copy operation?
[23:06] <dmick> place them where you like I mean :)
[23:06] <janos> woohoo i just made my first crushmap. this is damn cool
[23:06] <Psi-jack> dmick: Heh
[23:06] <janos> more hdd's being shipped
[23:06] <Psi-jack> lurbs: Thanks for catching that! :)
[23:07] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has left #ceph
[23:07] <joshd> gucki: there are c++ functions that include a progress callback mechanism, these could be wrapped in c and then python
[23:07] <joshd> oh, actually we already have c variants
[23:08] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) Quit (Quit: wer)
[23:08] <gucki> joshd: ah great. where can i find them? in the sources in the repo only, or...? :)
[23:08] <joshd> look at librbd.h
[23:08] <dmick> well they're definitely in the sources :)
[23:08] <joshd> they're the *_with_progress* functions
[23:09] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) has joined #ceph
[23:09] <Psi-jack> Heh
[23:09] <Psi-jack> After all SSD partitioning, I still have 77 GB free to use for whatever on this SSD. ;)
[23:11] <gucki> joshd: thanks :)
[23:11] <gucki> dmick: yeah i only looked at the docs pages so far and couldn't find them so i wondered if i just overlooked them :)
[23:12] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[23:12] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[23:12] * Leseb_ is now known as Leseb
[23:12] <dmick> gucki: yeah, I knew what you were asking, just having fun
[23:12] <Psi-jack> Hmmm.. Guess I could use the remaining 77GB for system backups, backup each of the node's boot partitions and store them on the remaining portion of the SSD before copying them to the backup servers. heh
[23:13] <slang> dshea: the clients need to be able to talk to the osds as well as the mons, so the osds need those public addresses as well
[23:13] * loicd (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) Quit (Quit: Leaving.)
[23:14] <slang> dshea: the osds communicate with the mons also, but they can use the cluster network for that
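A sketch of the split-network layout slang describes, using the subnets from dshea's setup (addresses are examples): clients reach mons and OSDs on the public network, while OSD-to-OSD replication stays on the cluster network:

```ini
[global]
    public network =
    cluster network =

[mon.a]
    mon addr =
```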
[23:14] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:14] <slang> dshea: can you send me your ceph.conf?
[23:14] <Psi-jack> lurbs: So, everything else you think looks good in that?
[23:15] <lurbs> Psi-jack: Or simply leave it unused. If there's not that much spare area by default on the drive then leaving a bunch of free area will help with wear leveling, write amplification, etc.
[23:15] <Psi-jack> lurbs: True, true.
[23:15] <dshea> slang: sure
[23:15] <lurbs> I didn't spot anything else, but you should probably refer to my disclaimer.
[23:15] <Psi-jack> hehe
[23:15] <Psi-jack> lurbs: Yes, yes. I know. :)
[23:16] <Psi-jack> lurbs: People with experience is more than I have.
[23:16] <Psi-jack> I've only tested a 2-node OSD cluster once, and the results were fair but not optimal at the time, due to the limitations I had, but most of those limits are resolved. ;)
[23:22] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[23:23] <dshea> slang: I guess in my case then I might have to turn on ip forwarding in the kernel to allow to route to, either that or get a third nic for my test gateway machine, so that he sits on both networks
[23:24] <slang> dshea: oh the client/gateway doesn't have a nic for the
[23:24] <slang> dshea: yeah
[23:24] <dshea> the client has two nics
[23:24] <dshea> one is on the public as
[23:25] <dshea> the other sits as the outside world accessible interface I use to connect to the test environment
[23:25] <Psi-jack> dmick: What about XFS mount options, such as the ones I'm doing for spindle disks, rw,nodev,noexec,noatime,nodiratime,attr2,logbufs=4,logdev=/dev/sde4,noquota; and for SSD (mon and mds), rw,nodev,noexec,noatime,nodiratime
[23:25] <sjust> joshd: can you take a look at (teuthology) wip_spilt
[23:25] <Psi-jack> dmick: Any problems you can think of ceph-wise with any of that?
[23:25] <sjust> *wip_split
[23:25] <sjust> adds pgnum and pgpnum testing to CephManager
[23:25] <slang> dshea: right - ok so the client can't directly reach the network
[23:25] <dshea> correct
[23:26] <Psi-jack> Hmmm, barrier=0 isn't showing in those. Odd.
[23:27] <slang> dshea: that should be ok, as long as the client can reach the osds over the public network (i.e. the osds are listening on the 192.168.0.x)
[23:27] <Psi-jack> Ahhhh.. xfs's barrier is different. LOL.
[23:28] <dshea> slang: ok, they can definitely do that
[23:28] <dshea> root@node00:/var/log/ceph# traceroute
[23:28] <dshea> traceroute to (, 30 hops max, 60 byte packets
[23:28] <dshea> 1 ( 0.116 ms 0.080 ms 0.083 ms
[23:29] <slang> dshea: are the osds configured to use that network for their public addr?
[23:30] <slang> dshea: did you send me your ceph.conf?
[23:30] <dshea> I pvt chatted it to you, I tried to DCC but it failed
[23:31] * slang (~slang@cpe-66-91-114-250.hawaii.res.rr.com) has left #ceph
[23:31] * slang (~slang@cpe-66-91-114-250.hawaii.res.rr.com) has joined #ceph
[23:31] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[23:31] <slang> sigh
[23:31] <slang> dshea: ok I got it
[23:32] <slang> I guess the empathy irc plugin isn't great with private chat, etc.
[23:34] <slang> dshea: and ceph -s from the client succeeds?
[23:35] * LeaChim (~LeaChim@b0fafb7d.bb.sky.com) Quit (Read error: Operation timed out)
[23:35] <dshea> slang: from the client? I thought you run that on the cluster side? Do I just point it at the mon or do I copy the config to my client?
[23:37] <dshea> slang: nvm I copied the cfg over and it seems to work ok
[23:37] <dshea> root@cluster-gw:~# ceph -s
[23:37] <dshea> health HEALTH_OK
[23:37] <dshea> monmap e1: 3 mons at {a=,b=,c=}, election epoch 6, quorum 0,1,2 a,b,c
[23:37] <dshea> osdmap e17: 10 osds: 10 up, 10 in
[23:37] <dshea> pgmap v727: 1920 pgs: 1920 active+clean; 0 bytes data, 10615 MB used, 37231 GB / 37242 GB avail
[23:37] <dshea> mdsmap e1: 0/0/1 up
[23:39] <slang> dshea: can you try mounting with fuse instead of the kernel?
[23:40] <dshea> slang: sure thing, one second
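The two mount paths slang is comparing, as command sketches (the monitor address and mountpoint here are hypothetical, not taken from this cluster):

```
# Kernel client: talk to a monitor directly
mount -t ceph 192.168.0.2:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret

# FUSE client: ceph-fuse runs in userspace, useful for isolating kernel-client issues
ceph-fuse -m 192.168.0.2:6789 /mnt/ceph
```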
[23:42] <gregaf> dshea: slang: there's no MDS running in that ceph -s
[23:42] <gregaf> CephFS isn't going to work well without that
[23:43] <dshea> gregaf: ? let me see, it must have died
[23:44] <slang> gregaf: yes that's very true
[23:44] <dshea> it core dumped
[23:45] <dshea> I guess Mark was lying to me on the phone the other day
[23:45] <dshea> cephfs needs some dev love
[23:45] <dshea> lol
[23:45] <dshea> this was a clean build, no data even written to it yet, if anybody needs the dumps and stuff, please let me know
[23:46] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) has joined #ceph
[23:46] <slang> dshea: yeah if you want to send the stacktrace that would be useful just in case it's something we haven't seen yet
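Since dshea mentioned a core dump, a common way to pull a full backtrace out of it when the log itself lacks one (binary and core paths are hypothetical):

```
# Print a backtrace for every thread of the crashed MDS, non-interactively
gdb /usr/bin/ceph-mds /path/to/core -batch -ex 'thread apply all bt'
```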
[23:46] <Psi-jack> Hmmm
[23:46] <Psi-jack> Now, I need to come up with systemd service files for CephFS. heh
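Ceph 0.55 shipped sysvinit scripts rather than systemd units, so a unit file like Psi-jack describes would be hand-rolled; a hypothetical sketch for the MDS daemon:

```
# /etc/systemd/system/ceph-mds@.service (hypothetical; not shipped with 0.55)
[Unit]
Description=Ceph metadata server daemon
After=network.target

[Service]
# -f keeps ceph-mds in the foreground so systemd can supervise it;
# %i is the daemon id, e.g. "systemctl start ceph-mds@a"
ExecStart=/usr/bin/ceph-mds -f -i %i
Restart=on-failure

[Install]
WantedBy=multi-user.target
```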
[23:47] <dshea> What's the best way to send the information on? We're in the process I think of setting up a pre-eval support thing, I don't have the contact details yet
[23:48] <Psi-jack> Heh, nice.. --with-ocf can create an OCF-compliant cluster resource agent.
[23:49] <dshea> slang: should I just pastebin it to the channel for now?
[23:50] <Psi-jack> Hmmm
[23:50] <Psi-jack> What would ceph need with JDK?
[23:50] <iggy> java bindings?
[23:50] <Psi-jack> Java bindings?
[23:50] <Psi-jack> Heh
[23:50] <slang> dshea: you can email it to me if you want
[23:50] <slang> dshea: otherwise pastebin should be fine
[23:51] <iggy> i can't think of anything in ceph that uses java directly
[23:52] <Psi-jack> Okay. So, extras, I don't really "need"
[23:52] <Psi-jack> but the OCF RA, that I might want to look at. :)
[23:52] <gregaf> java is for the bindings, and mostly for the Hadoop ones
[23:53] <Psi-jack> least, with pacemaker involved, I can stonith a node that's apparently failing. ;)
[23:54] <Psi-jack> But.. Ugh... Current bug in pacemaker 1.1.8 makes PTR DNS cause major havoc. :/
[23:58] <dshea> slang: http://pastebin.com/NjtsTEur
[23:59] <gregaf> dshea: slang: yehudasa just fixed that or a similar one — are all your daemons v0.55?
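Answering gregaf's question means checking the installed binaries on every node; one straightforward way:

```
# Run on each node; the daemons report something like "ceph version 0.55 (...)"
ceph-mon --version
ceph-osd --version
ceph-mds --version
```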

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.