#ceph IRC Log


IRC Log for 2012-12-06

Timestamps are in GMT/BST.

[0:03] * cblack101 (86868b4c@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[0:14] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[0:19] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:23] * aliguori (~anthony@ Quit (Remote host closed the connection)
[0:25] * synfin (~rahl@cv-nat-A-128.cv.nrao.edu) has left #ceph
[0:26] * plut0 (~cory@pool-96-236-43-69.albyny.fios.verizon.net) has joined #ceph
[0:37] * gucki (~smuxi@80-218-32-162.dclient.hispeed.ch) Quit (Remote host closed the connection)
[0:44] * roald (~Roald@ Quit (Read error: Connection reset by peer)
[0:49] * rino (~rino@ has joined #ceph
[0:49] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[1:00] <rino> Hi, does ceph have a mode of replication where the data is striped with parity? in the case I lose an osd in a cluster of 4 osds
[1:01] <fmarchand> gregaf ?
[1:02] <sjustlaptop> rino: not at this time
[1:02] <sjustlaptop> wait, you mean as opposed to straight replication?
[1:02] <sjustlaptop> we do do replication
[1:03] <rino> so as I understand mode 2 replication will copy my data to all 4 osds as copies
[1:03] <sjustlaptop> you mean pool size 2
[1:03] <sjustlaptop> ?
[1:03] <rino> yes, sorry
[1:03] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[1:03] <sjustlaptop> each object would be copied to 2 osds
[1:03] <sjustlaptop> pool size three would copy each object to three osds
[1:04] <sjustlaptop> et
[1:04] <sjustlaptop> *etc
[1:04] <fmarchand> gregaf :I made it work with the mds wipe sessions : now the mds starts and I can mount again my cephfs
[1:05] <rino> sjustlaptop: that would still mean that if I lose an osd, there will be data missing? the objects on that lost osd
[1:06] <rino> sorry if this is confusing, i'm not the best at explaining
[1:07] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has left #ceph
[1:07] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:08] <sjustlaptop> if you lose an osd with 4 osds and pool size 2, each object on the lost osd will be re-replicated to a second osd
[1:08] <sjustlaptop> so foo might be on [0, 2]
[1:08] <sjustlaptop> and bar might be on [1, 3]
[1:09] <sjustlaptop> if you lose osd 0, foo might be re-replicated to [2,3]
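The re-replication sjustlaptop describes can be sketched as a toy model in Python. This is not CRUSH: the function name `re_replicate` and the rule "top up from the lowest-numbered surviving OSD that does not already hold the object" are assumptions made purely for illustration, so the replacement OSD it picks may differ from what a real cluster would choose.

```python
def re_replicate(placement, lost_osd, all_osds, pool_size=2):
    """Return a new object->OSD placement after one OSD is lost.

    Toy model only: real Ceph uses CRUSH to choose replacement OSDs.
    """
    survivors = [o for o in all_osds if o != lost_osd]
    new_placement = {}
    for obj, osds in placement.items():
        # Keep the replicas that are still alive.
        kept = [o for o in osds if o != lost_osd]
        # Top up with surviving OSDs that do not already hold the object.
        candidates = [o for o in survivors if o not in kept]
        while len(kept) < pool_size and candidates:
            kept.append(candidates.pop(0))
        new_placement[obj] = kept
    return new_placement


# 4 OSDs, pool size 2, as in the discussion.
placement = {"foo": [0, 2], "bar": [1, 3]}
after = re_replicate(placement, lost_osd=0, all_osds=[0, 1, 2, 3])
print(after)
```

Objects that never lived on the lost OSD (here `bar`) are untouched; only `foo` gets a new second replica, so no data is missing as long as one copy survives.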
[1:12] <rino> thanks for the clarification
[1:13] <rino> I think I overcomplicated it for myself
[1:17] <sjustlaptop> no worries
[1:19] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:23] * jlogan1 (~Thunderbi@2600:c00:3010:1:7db5:bf2b:27d1:c794) Quit (Ping timeout: 480 seconds)
[1:24] * fmarchand (~fmarchand@ has left #ceph
[1:27] * BManojlovic (~steki@242-174-222-85.adsl.verat.net) Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:28] * vata (~vata@ Quit (Quit: Leaving.)
[1:31] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[1:32] * xiaoxi (~xiaoxiche@jfdmzpr06-ext.jf.intel.com) has joined #ceph
[1:58] * Lea (~LeaChim@b0fac111.bb.sky.com) Quit (Remote host closed the connection)
[2:05] <infernix> nhm: so the ssd-less benchmark
[2:05] <infernix> you ran this with 8 disks, correct?
[2:05] <infernix> but the box can take 36?
[2:06] <infernix> any reason other than money why you couldn't test with all 36?
[2:09] <infernix> and also, with 8 disks and 1GB cache that's 128MB per disk. if you take that up to 36 disks, that's down to 28MB
[2:12] <infernix> conversely if you look at the 16 concurrent ops, if you take the 2308 jbod for instance. btrfs does what, 3.5MB/s; put that at 900 IOPS, at 8 disks that makes perfect sense. about 75-100 IOPS per sata disk
[2:13] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) has joined #ceph
[2:15] <infernix> my guess is that the cache on the raid controllers is only going to help for a limited writing time, and diminishes with greater numbers of disks per box
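infernix's back-of-envelope numbers can be reproduced in a few lines. This is a sketch under two assumptions that the log does not state: 1GB is taken as 1024MB, and the 3.5MB/s figure is taken to come from 4KB operations (the `io_size_kb` default below is hypothetical).

```python
CACHE_MB = 1024  # 1GB controller cache, as stated in the discussion


def cache_per_disk_mb(disks, cache_mb=CACHE_MB):
    """Controller cache available per disk if shared evenly."""
    return cache_mb / disks


def iops_from_throughput(mbytes_per_s, io_size_kb=4):
    """Convert throughput to IOPS; io_size_kb=4 is an assumption."""
    return mbytes_per_s * 1024 / io_size_kb


print(cache_per_disk_mb(8))              # 128.0
print(round(cache_per_disk_mb(36), 1))   # 28.4
print(iops_from_throughput(3.5))         # 896.0
print(iops_from_throughput(3.5) / 8)     # 112.0
```

Under those assumptions the per-disk figure lands a little above the 75-100 IOPS range quoted for a SATA disk, which is still the same order of magnitude.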
[2:16] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[2:16] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Read error: Connection reset by peer)
[2:16] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[2:16] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[2:27] <sjustlaptop> hi
[2:27] <sjustlaptop> oops
[2:33] * fc____ (~fc@ has joined #ceph
[2:35] * fc___ (~fc@ Quit (Ping timeout: 480 seconds)
[2:47] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:51] <via> on the subject of mds's, is it better to have, say, 1 active mds and 2 standby's, or 3 active mds's?
[2:51] <via> i've gotten bitten in the past using two mds, since if either died the fs became unusable, so i imagine the more active you have the more likely one will die, so it's a performance tradeoff?
[2:52] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[2:52] <joshd> more than one active has many more bugs right now
[2:52] <via> okay, thank you
[2:52] <via> i'll stick with one
[2:56] <jefferai> sage/sagewk: around?
[2:56] <jefferai> started my upgrade to 0.55 and things are *not* going well
[2:56] <jefferai> I killed the daemons on one box ("service ceph stop" no longer worked, so I sent sigterms) and rebooted the box (kernel upgrade too)
[2:57] <jefferai> now that box (which is 0.55, the others are 0.54) keeps saying that it's discarding message auth and sending client elsewhere and that it's not in quorum
[2:57] <jefferai> and the other two boxes cycle between endless "have not heard from osd.X since Y" and "wrong node!" messages
[2:59] <jefferai> and, ceph -s and ceph -w never return
[2:59] <jefferai> regardless of which box
[3:03] <jefferai> joshd: would you know, if you're still around?
[3:06] <joshd> jefferai: did you not have cephx enabled before?
[3:06] <jefferai> I did
[3:06] <jefferai> had "auth supported = cephx"
[3:06] <jefferai> it's three out of the six OSDs on that box that the other nodes keep saying they can't connect to
[3:06] <jefferai> and the endless "wrong node!" errors
[3:07] <jefferai> granted those are on the machines still on 0.54
[3:07] <jefferai> but at this point I'm not sure what the right thing to do is
[3:07] <jefferai> upgrade another box to 0.55, and then hope that the two 0.55 ones can sync?
[3:08] <jefferai> should I just tell the cluster that all those OSDs are lost
[3:08] <joshd> it's probably unrelated, but message signing was added and turned on by default in 0.55
[3:08] <jefferai> and then try to readd them?
[3:08] <joshd> you could try turning it off (cephx sign messages = false)
[3:09] <joshd> also rm /etc/init/ceph.conf, and 'service ceph ...' will work again
[3:09] <jefferai> ah
[3:09] <joshd> upstart accidentally mixed with init didn't do well
[3:09] <jefferai> ah
[3:10] <jefferai> so basically my other two nodes alternate between two endless log messages:
[3:10] <jefferai> 2012-12-05 21:10:09.482958 7f70f2767700 -1 osd.6 313 heartbeat_check: no reply from osd.2 since 2012-12-05 20:39:04.063234 (cutoff 2012-12-05 21:09:49.482957)
[3:10] <jefferai> and
[3:10] <jefferai> 2012-12-05 21:10:06.833129 7f70cb012700 0 -- >> pipe(0xd9cd900 sd=31 :42719 pgs=4292040037 cs=32 l=0).connect claims to be not - wrong node!
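The heartbeat_check line pasted above carries enough information to work out how long a peer has been silent. A minimal sketch, assuming the log line layout matches the single sample in this log (the regular expression and the name `parse_heartbeat` are hypothetical, not part of any Ceph tooling):

```python
import re
from datetime import datetime

# Pattern derived from the one sample line in this log; other Ceph
# versions may lay the line out differently.
HEARTBEAT_RE = re.compile(
    r"(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ +-?\d+ "
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<target>osd\.\d+) since "
    r"(?P<since>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_heartbeat(line):
    """Return (reporter, target, silent_for) or None if the line does not match."""
    m = HEARTBEAT_RE.search(line)
    if m is None:
        return None
    logged = datetime.strptime(m["logged"], TS_FORMAT)
    since = datetime.strptime(m["since"], TS_FORMAT)
    return m["reporter"], m["target"], logged - since


line = (
    "2012-12-05 21:10:09.482958 7f70f2767700 -1 osd.6 313 heartbeat_check: "
    "no reply from osd.2 since 2012-12-05 20:39:04.063234 "
    "(cutoff 2012-12-05 21:09:49.482957)"
)
reporter, target, silent = parse_heartbeat(line)
print(reporter, target, silent)  # osd.6 osd.2 0:31:05.419724
```

For the sample line, osd.6 has heard nothing from osd.2 for roughly 31 minutes.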
[3:10] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[3:10] <jefferai> turning off message signing didn't help
[3:11] <jefferai> and mon.a which is on the now-upgraded-to-0.55 box keeps saying
[3:11] <jefferai> 2012-12-05 21:11:39.085248 7f09347ab700 1 mon.a@0(electing) e23 discarding message auth(proto 0 26 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[3:13] <jefferai> I could try going back to 0.54...
[3:13] <jefferai> unless that's a bad idea
[3:13] <joshd> rolling back should work fine
[3:14] <joshd> if you can get logs with debug ms = 20 and debug auth = 20 for one of those osds it would help track down the bug
[3:14] <jefferai> which osds?
[3:14] <jefferai> the osds on the 0.55 box aren't outputting any debug, at all
[3:14] <jefferai> after the initial startup
[3:15] <jefferai> I'm a bit worried about restarting OSDs on the other boxes as things are already in a failure state
[3:15] <joshd> the new ones
[3:15] <joshd> but it's very odd if they're not outputting to their logs
[3:15] <jefferai> they are upon first startup
[3:15] <jefferai> then, nothing
[3:16] <jefferai> also, one thing I don't understand is why one of the hosts going down makes all my VMs freeze up
[3:16] <jefferai> I imagine it's because it's waiting on I/O to write to the cluster
[3:16] <jefferai> but if those OSDs are down, shouldn't it write to the OSDs that are up and then let the cluster sync when the failure state goes away?
[3:16] <joshd> yeah, but it takes some time to notice the OSDs are down
[3:16] <jefferai> it's been about 20-30 minutes
[3:17] <joshd> any problems pinging the other boxes from the upgraded one?
[3:17] <jefferai> nope
[3:17] <joshd> the lack of joining the quorum for the new mon is suspicious
[3:18] <jefferai> I changed the debug, hang on
[3:19] <jefferai> here's one of the osd logs:
[3:19] <jefferai> http://paste.kde.org/620636/
[3:19] <joshd> is that the entire thing?
[3:20] <jefferai> here's 5000 lines of another
[3:20] <jefferai> http://paste.kde.org/620642/
[3:20] <jefferai> no, it's not
[3:20] <jefferai> the entire thing is up to about 230MB now
[3:21] <jefferai> but I can try to get one from the start if you want
[3:21] <jefferai> stop the service, truncate the log file, start
[3:21] <joshd> no, that's ok. I think debug ms = 20 was too much
[3:23] <jefferai> any ideas
[3:23] <jefferai> ?
[3:23] <joshd> is there anything in syslog or dmesg on the upgraded system? I wonder if the kernel upgrade had some bad effect
[3:23] <jefferai> hm
[3:23] <jefferai> hm
[3:23] <jefferai> [ 2227.493815] nf_conntrack: table full, dropping packet.
[3:23] <jefferai> lots of those
[3:23] <jefferai> that's interesting
[3:23] <joshd> that would explain ceph not being able to send messages
[3:24] <jefferai> yeah
[3:24] <jefferai> I wonder if the limits changed in the new kernel
[3:24] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) Quit (Quit: Leaving.)
[3:25] <jefferai> I'm using iptables but don't see a reason I need connection tracking
[3:27] <lurbs> Parts of /proc/sys aren't populated (because the relevant modules are not yet loaded) when the procps stuff kicks in.
[3:28] <lurbs> So you can't set nf_conntrack_max in /etc/sysctl.conf unless you get the module loaded via your initramfs.
[3:28] <lurbs> So the limits could be much lower after a reboot than you'd expect.
[3:29] * SpamapS (~clint@xencbyrum2.srihosting.com) Quit (Quit: leaving)
[3:29] <lurbs> I think you'd want to add nf_conntrack_ipv4 to /etc/modules and rebuild the initramfs (update-initramfs -k all -u).
[3:29] <lurbs> ...all assuming Ubuntu/Debian.
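Whether the connection-tracking table is actually close to its limit after a reboot can be checked by comparing the current entry count with the configured maximum. A minimal sketch, assuming the usual Linux locations under /proc/sys/net/netfilter (these files only exist while the nf_conntrack module is loaded); the function name `conntrack_usage` is hypothetical:

```python
from pathlib import Path

# Usual locations on Linux while the nf_conntrack module is loaded.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count_path=COUNT_PATH, max_path=MAX_PATH):
    """Return (count, maximum, fraction_used), or None if the files are unavailable."""
    try:
        count = int(Path(count_path).read_text().strip())
        maximum = int(Path(max_path).read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None  # module not loaded, or not readable
    fraction = count / maximum if maximum else 0.0
    return count, maximum, fraction


if __name__ == "__main__":
    usage = conntrack_usage()
    if usage is None:
        print("nf_conntrack does not appear to be loaded")
    else:
        count, maximum, fraction = usage
        print(f"{count}/{maximum} entries in use ({fraction:.0%})")
```

A fraction near 100% would be consistent with the "table full, dropping packet" messages; a much lower maximum than expected would be consistent with the sysctl setting not having been applied at boot.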
[3:30] <jefferai> lurbs: actually I'd prefer to turn it off entirely
[3:30] <jefferai> I don't see why I need conntrack
[3:30] <lurbs> Fair enough. :)
[3:30] <jefferai> ah
[3:30] <jefferai> I see
[3:30] <jefferai> I think it might want it for this iptables rule
[3:30] * SpamapS (~clint@xencbyrum2.srihosting.com) has joined #ceph
[3:30] <jefferai> -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[3:31] <lurbs> I tend to just set the stuff I don't care about to NOTRACK.
[3:31] <jefferai> wouldn't that still need a tracking rule?
[3:31] <via> -m state will pull in conntrack iirc, so just don't use it
[3:32] <jefferai> yeah
[3:32] <jefferai> related/established seem reasonable, but...
[3:33] <jefferai> I guess if things are fine without it, no need
[3:35] <jefferai> I wonder if I'll have to find a different workaround...how many things will break without the related, established
[3:35] <jefferai> although if connections that originate on the server aren't affected, then I guess it should be fine
[3:35] <jefferai> but for instance if a connection originates on the server and goes out port X, and there isn't an incoming rule destined for port X...that's kind of the point of ESTABLISHED it seems
[3:37] <jefferai> eh
[3:37] <jefferai> something loaded conntrack anyway
[3:38] <via> blacklist the module <_<
[3:40] <jefferai> it appears to have been a symptom anyways
[3:40] <jefferai> because after a reboot
[3:40] <jefferai> I have the same behavior
[3:40] <jefferai> but without any kernel messages
[3:40] <jefferai> same behavior in ceph though
[3:43] <xiaoxi> excuse me, I am a bit confused about whether "sync" stops writes?
[3:43] <joshd> if the mon is still not joining the quorum, I'd suspect some sort of network/firewall issue still
[3:43] <xiaoxi> It seems true from the doc (http://ceph.com/docs/master/rados/configuration/journal-ref/), but I can't find how it stops writes in the code. any lock?
[3:45] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: This computer has gone to sleep)
[3:47] <xiaoxi> joshd: any inputs?
[3:49] <jefferai> joshd: at this point I'll try downgrading the kernel
[3:49] <jefferai> because downgrading ceph didn't help
[3:49] <joshd> xiaoxi: I don't remember the locking there too well. if sjust/sjustlaptop is around he'd know
[3:51] <joshd> xiaoxi: looks like it might just be Filestore::lock
[3:53] <jefferai> joshd: how can I tell the other two boxes to stop trying to sort things out with the third box and just make sure they are consistent?
[3:53] <jefferai> because I'm just getting endless "wrong node!" messages
[3:54] <jefferai> and I know that restarting OSDs can fix that
[3:54] <jefferai> but I'm worried about restarting all OSDs
[3:54] <jefferai> without them syncing in the interim
[3:54] <joshd> you can manually do 'ceph osd out N' and 'ceph osd down N' for the osds on the bad node
[3:55] <joshd> then manually mark them in again when they're ready
[3:55] <jefferai> yeah, the problem is that I'm not sure which node is bad anymore
[3:55] <jefferai> they all seem bad
[3:55] <jefferai> like, if I get the node back to the same kernel/ceph version
[3:55] <joshd> did you do the kernel upgrade on all of them?
[3:55] <jefferai> nope
[3:55] <jefferai> I'm rebooting back to the older kernel version
[3:56] <jefferai> and I've already downgraded ceph
[3:56] <jefferai> but, that node isn't even up anymore and I'm still getting wrong node! messages
[3:56] <xiaoxi> joshd:thanks~
[3:56] <jefferai> and worse, now those two nodes have stopped being able to see some of each others' osds
[3:56] <jefferai> it's like all of the osds are going haywire
[3:57] <jefferai> and I have no idea how to make sure they're consistent
[3:57] <joshd> there was a bug with osds seeing wrong nodes like that fixed in 0.55
[3:57] <jefferai> yeah, I know
[3:57] <jefferai> that's why I was upgrading
[3:57] <jefferai> :-(
[3:57] <joshd> ah
[3:57] <jefferai> because I've had that before
[3:57] <jefferai> I didn't expect the upgrade to go so poorly though
[3:57] <jefferai> so I really don't know what to do here
[3:58] <jefferai> if there are two storage boxes left, and they can't see some of each others' osds...
[3:59] <joshd> if you want to just prevent things from getting worse, you can set the noout flag so they don't go crazy trying to recover data that doesn't need recovery
[3:59] <jefferai> I don't really know what that means
[3:59] <jefferai> set the noout flag
[3:59] <jefferai> I now have the other box on the same kernel version and back to 0.54
[3:59] <jefferai> and it's still unable to get into quorum
[3:59] <joshd> http://ceph.com/docs/master/rados/operations/troubleshooting-osd/?highlight=noout#flapping-osds
[4:00] <jefferai> and the other boxes still hang forever if I try ceph -s
[4:01] <jefferai> joshd: trying to run "ceph osd out 0" is just making the command hang forever
[4:02] <joshd> it's probably trying to talk to mon.a first, and having issues because it's not in the quorum. it should time out and try the next one though
[4:02] <jefferai> I doubt it
[4:02] <jefferai> "ceph -s" never returned
[4:02] <jefferai> ran for over half an hour
[4:02] <jefferai> on *any* box
[4:02] <jefferai> can I tell it to talk to a specific mon?
[4:03] <joshd> yeah, use -m ip:port
[4:03] <joshd> I think that overrides ceph.conf
[4:04] <jefferai> still hanging
[4:04] <jefferai> even when I gave it the local mon
[4:04] <jefferai> on that box
[4:19] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:20] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[4:20] * ChanServ sets mode +o scuttlemonkey
[4:46] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[4:47] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[4:47] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[4:47] * Leseb_ is now known as Leseb
[4:47] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[5:07] * madkiss (~madkiss@ Quit (Ping timeout: 480 seconds)
[5:08] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.89 [Firefox 17.0/20121119183901])
[5:14] * benner (~benner@ Quit (Ping timeout: 480 seconds)
[5:21] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[5:22] * benner (~benner@ has joined #ceph
[5:23] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[5:25] <xiaoxi> could anyone walk me through the call stack for a write request? I am backtracing from queue_transactions in filestore.cc, but cannot tell who calls this function
[5:31] * plut0 (~cory@pool-96-236-43-69.albyny.fios.verizon.net) has left #ceph
[5:34] * The_Bishop (~bishop@2001:470:50b6:0:8d5d:14da:c16f:c382) Quit (Ping timeout: 480 seconds)
[5:42] * The_Bishop (~bishop@2001:470:50b6:0:cc4:59c3:e25c:3536) has joined #ceph
[5:55] * tore_ (~tore@ has joined #ceph
[5:56] <tore_> yaye 0.55
[6:00] * yasu` (~yasu`@dhcp-59-168.cse.ucsc.edu) Quit (Remote host closed the connection)
[6:27] * yoshi (~yoshi@p4105-ipngn4301marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[6:29] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: Leaving.)
[6:46] <tore_> hmmm 0.55 doesn't seem to like having journal size redefined in my ceph.conf
[6:48] <via> joshd: i'm having my mds's crash every time they've started up, and produce this: https://pastee.org/8sfny
[6:49] <via> but i'm done dealing with it tonight, i'll check back in the morning
[6:50] <tore_> after the upgrade it looks like the mds, mon, osd start fine; however I do get a few repetitive errors related to the init script
[6:50] <tore_> 1.) /etc/init.d/ceph: 280: /etc/init.d/ceph: fs_type: not found
[6:50] <tore_> 2.) /etc/init.d/ceph: 303: [: unexpected operator
[6:50] <tore_> I'll need to search and see if someone already filed a bug report for these
[7:14] <tore_> doesn't seem to do any harm
[7:15] <tore_> anyway it doesn't look like anyone submitted it, so I put it in.
[7:22] * yoshi_ (~yoshi@p4105-ipngn4301marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[7:22] * yoshi (~yoshi@p4105-ipngn4301marunouchi.tokyo.ocn.ne.jp) Quit (Read error: Connection reset by peer)
[7:27] * gaveen (~gaveen@ has joined #ceph
[7:43] * madkiss (~madkiss@ has joined #ceph
[8:03] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[8:05] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:11] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:11] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[8:12] * madkiss (~madkiss@ Quit (Quit: Leaving.)
[8:14] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:21] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:33] * low (~low@ has joined #ceph
[8:36] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[8:39] * deepsa (~deepsa@ has joined #ceph
[9:00] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[9:02] * deepsa (~deepsa@ has joined #ceph
[9:11] * jks (~jks@3e6b7199.rev.stofanet.dk) has joined #ceph
[9:13] * loicd (~loic@ has joined #ceph
[9:13] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[9:20] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[9:23] * tnt (~tnt@207.171-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:31] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[9:36] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[9:37] * deepsa_ (~deepsa@ has joined #ceph
[9:38] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[9:38] * deepsa_ is now known as deepsa
[9:40] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:46] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[9:56] * ScOut3R (~ScOut3R@ has joined #ceph
[10:00] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[10:08] * xiaoxi (~xiaoxiche@jfdmzpr06-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[10:28] * LeaChim (~LeaChim@b0fac111.bb.sky.com) has joined #ceph
[10:29] * BManojlovic (~steki@ has joined #ceph
[10:29] * roald (~Roald@ has joined #ceph
[10:32] * MarcoA (5158e06e@ircip2.mibbit.com) has joined #ceph
[10:32] * Leseb (~Leseb@ has joined #ceph
[10:33] <MarcoA> Hi everyone.
[10:33] <MarcoA> I'm testing ceph, and I'm in trouble with cephx.
[10:35] <MarcoA> Without auth I can mount the CephFS and everything is ok, but with auth enabled i get a "error 5 = Input/output error"
[10:36] <MarcoA> This is ceph-mon.log: http://pastebin.com/UmFh0hjK
[10:36] <MarcoA> my client is, the monitor is
[10:37] <LeaChim> Have you used -o name=admin,secretfile=/whatever or something similar?
[10:38] <MarcoA> yes
[10:38] <MarcoA> this is the output of ceph auth list
[10:38] <MarcoA> http://pastebin.com/S89ztRj5
[10:38] * loicd1 (~loic@ has joined #ceph
[10:39] <MarcoA> i've tried with name=admin and i've created a foo user as you can see... no matter, still no luck
[10:39] <LeaChim> what's the command you're typing to mount it?
[10:40] <MarcoA> sudo mount.ceph ubuntu:/ /mnt/ceph/ -o name=admin,secret=AQDjfL9Q4FkjNBAAqq9R9ZXWProU+ZRUFbUJXw==
[10:40] <MarcoA> "ubuntu" is defined in my /etc/hosts file
[10:41] <LeaChim> try doing this: create a file /tmp/something, put AQDjfL9Q4FkjNBAAqq9R9ZXWProU+ZRUFbUJXw== in that file, nothing else, and try with -o name=admin,secretfile=/tmp/something
[10:43] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[10:46] <MarcoA> tried, nothing to do
[10:46] <MarcoA> mount error 5 = Input/output error
[10:53] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[11:00] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[11:03] * gaveen (~gaveen@ has joined #ceph
[11:03] <match> Having a problem getting kibana working with passenger. Have created a VirtualHost pointing to /opt/kibana/public (which is a symlink to /opt/kibana/static) - whenever I access the page I see 404 in the logs for /opt/kibana/public/js and /opt/kibana/public/api - neither of which exist (but seem to be 'auto-generated' inside lib/js/ajax.js). Any ideas what's going wrong?
[11:04] <match> sorry - wrong window - that was meant for #logstash - oops!
[11:05] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has left #ceph
[11:12] <MarcoA> new details: before the "mount error 5 = Input/output error" in my dmesg i have:
[11:12] <MarcoA> libceph: wrong peer, want, got
[11:12] <MarcoA> libceph: mds0 wrong peer at address
[11:14] <MarcoA> and every 10 seconds I have these lines:
[11:14] <MarcoA> libceph: no secret set (for auth_x protocol)
[11:14] <MarcoA> libceph: error -22 on auth protocol 2 init
[11:32] * The_Bishop (~bishop@2001:470:50b6:0:cc4:59c3:e25c:3536) Quit (Ping timeout: 480 seconds)
[11:41] * The_Bishop (~bishop@2001:470:50b6:0:2da4:3313:73d6:da2) has joined #ceph
[11:55] <roald> MarcoA, could you pastebin your ceph conf?
[11:58] <agh> hello to all
[11:58] <agh> i've a problem
[11:59] <agh> i have one osd server with 1 osd running : osd.11
[11:59] <agh> ok. So now, I want to add another OSD. So i follow the procedure in the doc
[11:59] <agh> and i do :
[11:59] <agh> ceph osd create 12
[12:00] <agh> it works, BUT my new OSD doesn't have the right name: it is called "osd.0"
[12:00] <agh> [root@ceph-osd-1 ~]# ceph osd tree
[12:00] <agh> dumped osdmap tree epoch 12
[12:00] <agh> # id weight type name up/down reweight
[12:00] <agh> -1 1 pool default
[12:00] <agh> -3 1 rack unknownrack
[12:00] <agh> -2 1 host ceph-osd-1
[12:00] <agh> 11 1 osd.11 up 1
[12:00] * deepsa_ (~deepsa@ has joined #ceph
[12:00] <agh> 0 0 osd.0 down 0
[12:00] <roald> it´s ceph osd create <fsid>, not ceph osd create <osd number>
[12:01] <roald> it automatically picks the first osd number it can find, starting from 0
[12:01] <agh> mm… is it possible to choose it ?
[12:01] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[12:01] <MarcoA> roald: here it is - http://pastebin.com/fffiRgqa
[12:01] * deepsa_ is now known as deepsa
[12:01] <agh> but in the doc, it is written :
[12:01] <agh> ceph osd create {osd-num}
[12:01] <agh> ceph osd create 123 #for example
[12:01] <roald> agh, i don´t think so
[12:02] <roald> yes, well.. that doesn´t work :-)
[12:02] <roald> maybe a dev knows that answer, or maybe there´s a bug about it already
[12:02] <roald> i usually just make my osd´s start at 0
[12:03] <roald> the osd number doesn´t really matter anyway
[12:03] <agh> yes, but how do you manage your ceph.conf, then?
[12:03] <agh> because OSDs have to be added by hand in ceph.conf.
[12:04] <roald> OSDNUM=$( ceph osd create )
[12:04] <roald> then i use $OSDNUM... i have a small script to create osd´s
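roald's approach (let the cluster allocate the id, then use it when writing the config) might look like the following in Python. This is a sketch, not roald's script: `create_osd` assumes that `ceph osd create` prints only the allocated id, as the shell one-liner above implies, and `osd_section` assumes a mkcephfs-style `[osd.N]` stanza with a `host` entry; both function names are hypothetical.

```python
import subprocess


def create_osd():
    """Run `ceph osd create` and return the id the cluster allocated.

    Assumes the command prints just the id, as the shell one-liner implies.
    """
    result = subprocess.run(
        ["ceph", "osd", "create"], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


def osd_section(osd_num, host):
    """Render a ceph.conf stanza for one OSD (mkcephfs-style layout assumed)."""
    return f"[osd.{osd_num}]\n\thost = {host}\n"


# On a node with admin access one would do something like:
#   osd_num = create_osd()
#   print(osd_section(osd_num, "ceph-osd-1"))
print(osd_section(0, "ceph-osd-1"))
```

The point is the ordering: the id is taken from the cluster's answer rather than chosen up front, so the config entry always matches the OSD that was actually created.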
[12:05] <agh> Mmm.. ok. Thanks
[12:05] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[12:05] <roald> agh; i agree that the osd num should be configurable, but i don´t think it is
[12:05] <roald> maybe there´s a secret option somewhere, i haven´t found it yet :)
[12:08] <roald> MarcoA, do you have a monitor running?
[12:11] <MarcoA> yes, one
[12:11] <roald> but it´s not in your config?
[12:11] <MarcoA> it's all on one machine. 1 mon, 2 osd, 1 mds
[12:11] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[12:11] <MarcoA> the conf file is done via the "ceph-deploy" script
[12:12] <roald> hmm i never used that script
[12:12] <roald> read this; http://ceph.com/docs/master/rados/configuration/ceph-conf/
[12:12] <roald> it might help
[12:12] <roald> esp. this; http://ceph.com/docs/master/rados/configuration/ceph-conf/#example-ceph-conf
[12:13] * ctrl (~Nrg3tik@ Quit (Ping timeout: 480 seconds)
[12:14] * MikeMcClurg1 (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[12:14] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) Quit (Read error: Connection reset by peer)
[12:15] <MarcoA> yes, i have read it, and in a red box they say that "... Also, this setting [the host = ...] is ONLY for mkcephfs and manual deployment. It MUST NOT be used with chef or ceph-deploy"
[12:15] <MarcoA> that's why I don't have a [mon] section with the host = ... directive
[12:16] <roald> but you don´t have a mon.<mon id> section, so if I were you I´d double check if the mon is actually running
[12:16] <MarcoA> monmap e1: 1 mons at {ubuntu=}, election epoch 1, quorum 0 ubuntu
[12:17] * MikeMcClurg1 (~mike@firewall.ctxuk.citrix.com) has left #ceph
[12:17] <absynth> dona's gona hate me. :D
[12:18] * ctrl (~Nrg3tik@ has joined #ceph
[12:19] <MarcoA> roald: and without auth I'm able to mount the ceph filesystem
[12:20] * yoshi_ (~yoshi@p4105-ipngn4301marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:26] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[12:36] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[12:39] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has left #ceph
[12:47] <joao> <absynth> dona's gona hate me. :D
[12:47] <joao> lol
[12:47] <joao> why's that? :p
[12:59] <absynth> contract negotiations :)
[13:18] <tnt> MarcoA: you can try feeding the secret using the kernel keyring
[13:20] * guigouz (~guigouz@201-87-100-166.static-corp.ajato.com.br) has joined #ceph
[13:24] <nhm> good morning #ceph
[13:27] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:35] <roald> good morning? it´s 1:30 pm over here... :-)
[13:41] <tnt> roald: http://www.total-knowledge.com/~ilya/mips/ugt.html
[13:43] * maxiz (~pfliu@ has joined #ceph
[13:43] <roald> wow - sometimes i´m stunned by the amount of useful stuff i learn each day
[13:43] <roald> good morning nhm :-)
[13:43] <tnt> hehe:)
[13:44] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) has joined #ceph
[13:46] * noob21 (~noob2@ext.cscinfo.com) has joined #ceph
[13:48] <absynth> morning, everyone
[13:49] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) has joined #ceph
[13:52] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[14:10] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[14:13] * BManojlovic (~steki@ has joined #ceph
[14:20] <noob21> morning
[14:23] <noob21> i'm working on converting from btrfs to xfs today by downing the osd's one at a time and reformatting them
[14:23] <noob21> from what i'm seeing, btrfs isn't stable enough yet
[14:25] <noob21> the problem so far seems to be that i get stuck unclean PGs when i bring down an osd because i have one dead already
[14:27] <MarcoA> tnt: please, can you explain with an example?
[14:30] <MarcoA> anyway, in the mon log I see that the req key and the expected key are equal, so I don't think it's a "key mismatch"
[14:30] <MarcoA> cephx server client.admin: checking key: req.key=bc007b00083b8a84 expected_key=bc007b00083b8a84
[14:30] <MarcoA> cephx server client.admin: checking key: req.key=bc007b00083b8a84 expected_key=bc007b00083b8a84
[14:31] <tnt> echo "your key in base64" | base64 -d | keyctl padd ceph rbd @u
[14:31] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: slang)
[14:31] <tnt> and use key=rbd argument instead of secret=xxxx
[14:32] <tnt> (note that 'rbd' here is just a string identifying the secret, I just cut & pasted from a script of mine but it can be 'foobar', it doesn't matter)
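Before piping a secret through `base64 -d` into the keyring as tnt suggests, it can help to confirm that the string really is valid base64 and to build the matching mount option. A small sketch; the helper names `decode_secret` and `mount_options` are hypothetical, and the placeholder below is 28 zero bytes rather than a real key:

```python
import base64
import binascii


def decode_secret(secret_b64):
    """Decode a base64 secret; raise ValueError if it is not valid base64."""
    try:
        return base64.b64decode(secret_b64, validate=True)
    except binascii.Error as exc:
        raise ValueError("secret is not valid base64") from exc


def mount_options(user, key_name):
    """Build the -o string that refers to a keyring entry instead of secret=."""
    return f"name={user},key={key_name}"


# Placeholder only: never paste a real key into a script or a chat.
placeholder = base64.b64encode(bytes(28)).decode("ascii")
print(len(decode_secret(placeholder)))  # 28
print(mount_options("admin", "rbd"))    # name=admin,key=rbd
```

The `key_name` must be the same identifier that was given to `keyctl padd`; as noted above, the string itself is arbitrary.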
[14:38] * MarcoA (5158e06e@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[14:38] * MarcoA (5158e06e@ircip2.mibbit.com) has joined #ceph
[14:42] * wubo (~wubo@nat-ind-inet.jhuapl.edu) has joined #ceph
[14:43] <tnt> Mmm, it's a shame there is no ceph wireshark dissector :(
[14:44] <MarcoA> tnt: I have the message: add_key: No such device
[14:44] <tnt> is the ceph kernel module loaded already ?
[14:47] * MarcoA (5158e06e@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[14:48] * MarcoA (5158e06e@ircip2.mibbit.com) has joined #ceph
[14:49] <MarcoA> My laptop has dumped!
[14:51] <tnt> what kernel version ?
[14:51] <MarcoA> 3.2.0-30-generic #48-Ubuntu SMP
[14:52] <MarcoA> x86_64
[14:52] <tnt> Ah yes, there are a couple of bugs in key management in 3.2 that were fixed in more recent kernel.
[14:53] <MarcoA> doh
[14:54] <tnt> to use kernel clients (either rbd or cephfs) you really need more recent version.
[14:56] <MarcoA> but, do you think that my problem is related to kernel client?
[14:58] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[14:59] <tnt> well ... I would definitely try to eliminate that first.
[15:00] <tnt> Of course if you're trying on your laptop it's not as easy changing kernel as it is on a VM ...
[15:00] * brainopia (~brainopia@ has joined #ceph
[15:01] <brainopia> is it efficient to use rados gateway to concurrently stream a lot of media files directly to users?
[15:03] <tnt> compared to what ?
[15:03] <brainopia> right now I have 10+ TB of media stored on JBOD and served with nginx, thinking about moving to ceph and wondering what should be used as frontend to serve media?
[15:03] <brainopia> from ceph*
[15:05] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[15:05] * deepsa (~deepsa@ has joined #ceph
[15:07] <absynth> how many parallel streams do you have?
[15:08] <brainopia> 120 max, but it will increase in future, we'd like to fully utilize 1gbps with 600+ connections
[15:09] <absynth> hm, might be interesting to hear a statement from the guys
[15:09] * absynth looks at gregaf and sagewk
[15:16] * Kioob (~kioob@luuna.daevel.fr) Quit (Ping timeout: 480 seconds)
[15:28] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[15:29] * drokita (~drokita@ has joined #ceph
[15:42] * MarcoA (5158e06e@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[15:42] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[15:44] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[15:59] <absynth> sage, gregaf, joshd around?
[16:01] * PerlStalker (~PerlStalk@ has joined #ceph
[16:13] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:17] * MarcoA (5158e06e@ircip3.mibbit.com) has joined #ceph
[16:18] <MarcoA> tnt: created vm with Ubuntu Server 12.04, kernel 3.2 --> kernel dumped
[16:19] <MarcoA> tnt: kernel upgraded to 3.4.0-030400-generic --> kernel dumped :(
[16:22] <tnt> Yeah, the fixes are like in 3.6 ...
[16:22] <tnt> 3.6.8 preferably.
[16:22] <absynth> try ... yep, what tnt said
[16:22] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:23] <absynth> 3.6.7 / 3.6.8 or such
[16:23] * ScOut3R (~scout3r@1F2E942C.dsl.pool.telekom.hu) has joined #ceph
[16:24] <tnt> I have a 3.6.9 amd64 in debian package form if you need ...
[16:25] * cblack101 (86868947@ircip2.mibbit.com) has joined #ceph
[16:25] * sagewk (~sage@2607:f298:a:607:611c:760a:a2d9:a4b1) Quit (Ping timeout: 480 seconds)
[16:28] <ScOut3R> i was wondering if there is a way to tell CRUSH to store more than one replica on a host if number of hosts < pool size
[16:29] <ScOut3R> take for example a two node cluster with a pool which has a size of 4
[16:29] <MarcoA> tnt: yes, it would be great
[16:32] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[16:34] <MarcoA> tnt: wait, i'm following the upubuntu.com howto
[16:35] * sagewk (~sage@2607:f298:a:607:bdbd:b82b:32ec:ace8) has joined #ceph
[16:36] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[16:37] <tnt> MarcoA: Ok. I just do a 'fakeroot make-kpkg --revision 1.0.x1 kernel_image' inside a git checkout of the kernel ... just need the config file.
[16:38] <absynth> sagewk: awake?
[16:40] <noob21> if i have drives of different sizes should i reweight them in crush or will it account for that ?
[16:44] <MarcoA> tnt: hm, with 3.6.9-030609-generic I have "add_key: Invalid argument" error
[16:44] <MarcoA> the modules are loaded (ceph and libceph)
[16:48] * Kioob`Taff1 (~plug-oliv@local.plusdinfo.com) has joined #ceph
[16:48] <Kioob`Taff1> Hi
[16:49] * Kioob`Taff1 is now known as Kiooby
[16:49] <nhm> absynth: Sage is probably going to be afk today/tomorrow, what's up?
[16:50] <Kiooby> from that page : http://ceph.com/docs/master/install/os-recommendations/ I understand that «Debian Squeeze» have a version of glibc which have the syncfs system call
[16:50] <tnt> MarcoA: what command do you type exactly ?
[16:50] <Kiooby> but in my logs I have : ceph-osd: 2012-12-06 16:43:07.038243 7f2d5b9b6780 0 filestore(/var/lib/ceph/osd/ceph-10) mount syncfs(2) syscall not support by glibc
[16:51] <nhm> Kiooby: what version of ceph are you using and which packages?
[16:51] <Kiooby> 0.48.2argonaut-1~bpo60+1
[16:52] <tnt> Kiooby: what kernel version is squeeze ?
[16:52] <Kiooby> I use my own kernel : 3.6.9-dae-intel
[16:52] <Kiooby> mmm.. bad kernel .config ?
[16:53] <nhm> Kiooby: that could be, but which version of glibc do you actually have?
[16:54] <Kiooby> ...
[16:54] <Kiooby> glibc is not installed :D
[16:55] <tnt> package is libc6
[16:55] <Kiooby> yes, I see... virtual package. So : libc6 2.11.3-4
[16:56] * low (~low@ Quit (Quit: bbl)
[16:56] * oliver2 (~oliver@p4FD07609.dip.t-dialin.net) has joined #ceph
[16:56] <nhm> syncfs stuff was added in 2.14
[16:56] <Kiooby> so, the doc is wrong ?
[16:56] <tnt> looks like it
[16:56] <nhm> looks like it. I'll tell our writers.
[16:56] <absynth> nhm: we have an issue with our ceph installation, which pertains to repairing stuff
[16:57] <tnt> also, you can tell them 0.55 is not bobtail I think.
[16:57] <nhm> absynth: ah, no good. You're working with Dona right?
[16:57] <Kiooby> for information Debian Wheezy have a 2.13* version of libc6...
[16:58] <nhm> tnt: yeah, that happened kind of suddenly. ;)
[16:58] <nhm> tnt: I'm releasing an argonaut vs bob^H^H^H 0.55 performance article some time in the next week or two too. :)
[16:58] <absynth> nhm: uh yeah, kinda
[16:59] <nhm> Kiooby: yeah, I didn't think either supported syncfs properly in glibc.
[16:59] <MarcoA> tnt: echo "AQDjfL9Q4FkjNBAAqq9R9ZXWProU+ZRUFbUJXw==" | base64 -d | keyctl padd ceph pwd @u
[16:59] <tnt> MarcoA: and you have the base64 utility installed right ?
[17:00] <absynth> nhm: we are seeing seemingly corrupted blocks although the status of our filesystem is OK
[17:01] <absynth> yesterday, we had a network outage and started seeing things like "-0,144% degraded"
[17:01] <absynth> now, should we do a repair per osd or does that make things even worse?
[17:01] <tnt> MarcoA: when I type this command here, it works just fine ...
[17:01] <MarcoA> tnt: base64 installed, yes
[17:02] <tnt> MarcoA: can you type echo "AQDjfL9Q4FkjNBAAqq9R9ZXWProU+ZRUFbUJXw==" | base64 -d | hexdump -C and pastebin the output
[17:02] <MarcoA> here it is: http://pastebin.com/dVZhTaUH
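The same check can be done without the shell pipeline. A minimal Python sketch, assuming only the key string pasted above; the helper name `hexdump` is made up for illustration and its layout only approximates what `hexdump -C` prints.

```python
import base64

# Key string exactly as pasted in the channel.
KEY_B64 = "AQDjfL9Q4FkjNBAAqq9R9ZXWProU+ZRUFbUJXw=="

def hexdump(data, width=16):
    """Return an offset / hex / ASCII dump of data (hypothetical helper)."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk).ljust(width * 3)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part} |{text}|")
    return "\n".join(lines)

# Decode the secret the same way `base64 -d` would, then dump it.
raw = base64.b64decode(KEY_B64)
print(len(raw), "bytes")
print(hexdump(raw))
```

If two machines print the same bytes here, the decoded secret itself is probably not what add_key is rejecting.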
[17:03] <Kiooby> nhm: I'm trying with the libc package from experimental... (2.16)
[17:04] <nhm> absynth: I wouldn't touch it personally until one of the guys who has done more of that kind of diagnosis gets on.
[17:05] <nhm> absynth: I mostly do performance, so when I break things I go "hrm" and reinstall. ;)
[17:06] <tnt> MarcoA: same here ... well I have no idea why it doesn't accept it ...
[17:06] <nhm> Kiooby: If this isn't production, you might try out 0.55. Theoretically you won't need to worry about glibc.
[17:07] <Kiooby> nhm / tnt : «mount syncfs(2) syscall not support by glibc», with the libc 2.16
[17:07] <Kiooby> it's for production... and I'm not really happy to play with libc from experimental :)
[17:07] <tnt> Kiooby: can you check with ldd that the right glibc is used ? (in case the previous is still present on the system)
[17:08] <nhm> Kiooby: ok, depending on how long you can wait, bobtail should be out in a couple of weeks.
[17:09] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:09] <Kiooby> ldd /usr/bin/ceph | grep libc
[17:09] <Kiooby> libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
[17:09] <absynth> nhm: hrm.
[17:09] <Kiooby> http://packages.debian.org/experimental/amd64/libc6/filelist
[17:10] <Kiooby> so it's the good one
[17:10] <absynth> nhm: from our experience, this kind of behavior was the foreplay for a HUGE fuckup
[17:10] <absynth> pardon my french
[17:10] <nhm> absynth: what company/organization are you with btw?
[17:11] <joao> I thought absynth was from filoo?
[17:11] <absynth> i am
[17:11] * joao nailed it
[17:11] <nhm> joao: yeah, I'm terrible with remembering anything. :)
[17:11] <absynth> 5 points for joao
[17:12] * fc____ (~fc@ Quit (Quit: leaving)
[17:12] <tnt> Kiooby: well, it's a symlink, check it points to libc-2.16.so ... maybe it got preserved.
[17:14] <tnt> Kiooby: don't bother ... it's detected at compile time.
[17:14] <Kiooby> it's the good one : /lib/x86_64-linux-gnu/libc.so.6 -> libc-2.16.so
[17:14] <Kiooby> and I also upgrade libc-bin to version 2.16, to be sure
[17:14] <Kiooby> I still have message «mount syncfs(2) syscall not support by glibc»
[17:14] <Kiooby> oh !
[17:15] <Kiooby> ok
[17:15] <Kiooby> so...
[17:15] <Kiooby> I have to stay with that for know
[17:15] <Kiooby> now*
[17:15] <tnt> you can rebuild the packages on a machine with the new glibc installed if you need to.
[17:16] <Kiooby> mm, for production usage, I'm not sure, but yes I can do that
[17:16] <Kiooby> the bobtail version doesn't need that syncfs syscall ?
[17:18] <tnt> I use my own packages in prod ... I have some custom patches I need.
[17:18] <oliver2> nhm: well, do you have the way by hand, how to check disk-image -> rbd-prefix ala rb.0.2c460.2f5bd10e, and then on which osd.X it resides? Would be helpful, cause one of header-less images is from our own VM,
[17:18] <tnt> Kiooby: I think in bobtail they just reimplemented the actual syscall call inside ceph IIUC
[17:19] <Kiooby> ok, thanks
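The difference between what the installed libc offers and what a binary was built against can be seen with a small runtime probe. This is only a sketch, assuming a Linux-like system where `ctypes.CDLL(None)` exposes the symbols of the running process's C library; `libc_exports` is a hypothetical helper name. It reports what the libc on the machine exports, which, as tnt points out, says nothing about a package that decided at compile time not to call syncfs.

```python
import ctypes

def libc_exports(symbol):
    """Return True if the C library loaded into this process exports symbol."""
    libc = ctypes.CDLL(None)  # None: look up symbols in the current process
    # Attribute lookup on a CDLL raises AttributeError for unknown symbols,
    # so hasattr() turns the lookup into a simple yes/no answer.
    return hasattr(libc, symbol)

print("syncfs exported by running libc:", libc_exports("syncfs"))
```

A True result with the "not support by glibc" message still in the OSD log would match the compile-time explanation: the symbol is there, but the packaged binary was never built to use it.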
[17:20] <MarcoA> Guys, the script "ceph-deploy" will replace mkcephfs, right?
[17:21] <MarcoA> in bobtail, the script will be able to manage mds too?
[17:21] * danieagle (~Daniel@ has joined #ceph
[17:21] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) Quit (Remote host closed the connection)
[17:23] <MarcoA> I wish to know what's the "official" tool (read: the tool that will be Long Term Supported) to install a ceph cluster :)
[17:23] <nhm> oliver2: I'm afraid I'm rather useless for this. I'm trying to track down someone who is less so. :)
[17:24] <oliver2> nhm: OK, understood. I know it has already been discussed in former threads on ceph-devel... but find it, if you _really_ need it...
[17:24] <jmlowe> nhm: you have any tips for interfaces dropping lots of packets?
[17:26] <nhm> jmlowe: hrm, what's a lot?
[17:26] <jmlowe> 8/sec
[17:26] <jmlowe> enough it is causing my osd's headaches
[17:27] <nhm> might want to look through the stuff in here and see if anything looks suspect: http://fasterdata.es.net/host-tuning/linux/
[17:27] <jmlowe> ethtool -i eth2
[17:27] <jmlowe> driver: be2net
[17:27] <jmlowe> version:
[17:27] <jmlowe> firmware-version: 2.102.517.7
[17:27] <jmlowe> it's an emulex 10GigE
[17:29] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[17:35] <ScOut3R> hm, just figured out the answer for my question, use two choose rules ;)
[17:36] * ScOut3R (~scout3r@1F2E942C.dsl.pool.telekom.hu) Quit (Quit: Lost terminal)
[17:36] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[17:37] * loicd1 is now known as loicd
[17:43] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) has joined #ceph
[17:44] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Remote host closed the connection)
[17:45] <Kiooby> it's not possible to use a raw device as journal ?
[17:45] <Kiooby> (I put : "osd journal = /dev/sdk2" in my conf)
[17:46] <jmlowe> hmm, better but still dropping packets
[17:46] * vata (~vata@ has joined #ceph
[17:46] <Kiooby> oh... ignore my last message
[17:46] <Kiooby> (I put sdk2 twice)
[17:47] <Kiooby> but that didn't solve anything
[17:53] <tnt> I use full devices as journal
[17:54] <Kiooby> so... I didn't use the good syntax ?
[17:54] <tnt> but it need to be that way from the start. If not you actually need to shutdown the osd, flush the old file journal, init the device as a journal and restart the osd.
[17:54] <Kiooby> it's a fresh new cluster, I can reset all data
[17:54] <tnt> your config is right "osd journal = /dev/xvdb1" is what I have.
[17:55] <Kiooby> ok, thanks
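The order tnt describes for moving an existing OSD to a device journal can be written down as a command list. A hedged Python sketch that only prints the commands and runs nothing: `journal_move_commands` is a hypothetical helper, the init script path and its `stop osd.N` / `start osd.N` form are assumptions about a sysvinit install, and the ceph.conf edit is left as a manual step. `--flush-journal` and `--mkjournal` are ceph-osd options.

```python
def journal_move_commands(osd_id, init_script="/etc/init.d/ceph"):
    """Return the shell commands, in order, for switching one OSD's journal.

    Nothing is executed; the list is meant to be reviewed by a human first.
    """
    name = f"osd.{osd_id}"
    return [
        f"{init_script} stop {name}",             # 1. shut down the osd
        f"ceph-osd -i {osd_id} --flush-journal",  # 2. flush the old file journal
        f"# edit ceph.conf: point 'osd journal' for {name} at the new device",
        f"ceph-osd -i {osd_id} --mkjournal",      # 3. init the device as a journal
        f"{init_script} start {name}",            # 4. restart the osd
    ]

for line in journal_move_commands(17):
    print(line)
```

On a fresh cluster that can be wiped, as in this case, none of this is needed; the journal device just has to be in place before the OSD is first created.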
[17:56] <Kiooby> in fact, I have :
[17:56] * jlogan1 (~Thunderbi@2600:c00:3010:1:7db5:bf2b:27d1:c794) has joined #ceph
[17:56] <Kiooby> 1 journal _open /dev/sdk3 fd 20: 20971520000 bytes, block size 4096 bytes, directio = 1, aio = 0
[17:56] <Kiooby> -1 journal read_header error decoding journal header
[17:56] <Kiooby> -1 filestore(/var/lib/ceph/osd/ceph-17) mount failed to open journal /dev/sdk3: (22) Invalid argument
[17:57] <rweeks> well
[17:57] <rweeks> your previous statement said /dev/sdk2
[17:57] <rweeks> :)
[17:57] * aliguori (~anthony@ has joined #ceph
[17:57] <Kiooby> I have 8 OSD, and 8 journal partitions
[17:57] <rweeks> ah ok
[17:57] <Kiooby> # grep journal /etc/ceph/ceph.conf
[17:57] <Kiooby> osd journal size = 20000 ; journal size, in megabytes
[17:57] <Kiooby> ;debug journal = 20
[17:57] <Kiooby> osd journal = /dev/sdh2
[17:57] <Kiooby> osd journal = /dev/sdh3
[17:57] <Kiooby> osd journal = /dev/sdi2
[17:57] <Kiooby> osd journal = /dev/sdi3
[17:57] <Kiooby> osd journal = /dev/sdj2
[17:57] <Kiooby> osd journal = /dev/sdj3
[17:57] <Kiooby> osd journal = /dev/sdk2
[17:57] <Kiooby> osd journal = /dev/sdk3
[17:57] * kbad (~kbad@malicious.dreamhost.com) has joined #ceph
[17:59] * kbad_ (~kbad@malicious.dreamhost.com) Quit (Ping timeout: 480 seconds)
[18:00] <tnt> Kiooby: don't set journal size
[18:01] <Kiooby> of course !
[18:06] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[18:17] * hhoover (~hhoover@of2-nat1.sat6.rackspace.com) has joined #ceph
[18:20] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[18:20] * yehuda_hm (~yehuda@2602:306:330b:a40:152f:4820:319d:f1a7) has joined #ceph
[18:21] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:21] * cblack101 (86868947@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[18:21] * hhoover (~hhoover@of2-nat1.sat6.rackspace.com) has left #ceph
[18:26] * loicd (~loic@magenta.dachary.org) has joined #ceph
[18:27] * oliver2 (~oliver@p4FD07609.dip.t-dialin.net) has left #ceph
[18:29] * LeaChim (~LeaChim@b0fac111.bb.sky.com) Quit (Ping timeout: 480 seconds)
[18:31] <noob21> i'm impressed with ceph's ability to figure out what needs to be repaired on drives
[18:31] <noob21> i shutdown one of my servers, wiped the filesystems on the osd's, changed to xfs and brought it back up. it's recovering just fine
[18:31] * tnt (~tnt@207.171-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:31] <noob21> it's about 10% away from being finished
[18:32] <nhm> noob21: excellent. We are actually going to do that for a customer soon who started out with btrfs but also wants to move to xfs.
[18:32] <noob21> yeah btrfs is proving too unstable on my old hardware at the moment
[18:33] <noob21> at first i was shutting down, letting it sync and then bringing up new stuff
[18:33] <noob21> this time i just tried to shutdown, wipe and bring back up right away and ceph doesn't seem to care
[18:33] <nhm> I'm curious how stable ext4 is. From a performance perspective it's often ahead of xfs for ceph.
[18:33] <noob21> yeah i know
[18:34] <noob21> i think with the newer kernels that might not be the case anymore according to the xfs dev's
[18:34] <nhm> noob21: for ceph specifically?
[18:34] <noob21> they said they fixed a lot of perf bugs
[18:34] <noob21> just for xfs itself
[18:34] <nhm> noob21: I'm still often seeing ext4 ahead for ceph with kernel 3.6.3
[18:35] <noob21> there was a video out there i was watching of an Australian xfs developer talking about the improvements they made in the new kernels
[18:35] <noob21> oh ok
[18:35] <noob21> maybe that's not the case for ceph then
[18:35] <noob21> how big is the performance difference?
[18:35] <nhm> noob21: oh, that might have been Dave Chinner's talk.
[18:35] <noob21> yeah that sounds right
[18:36] <nhm> noob21: don't have my latest results up yet, but some write results for kernel 3.4 with ceph 0.50 are here:
[18:36] <nhm> http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[18:36] <nhm> http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[18:36] <noob21> that supermicro is awesome :D
[18:37] <noob21> ah yeah you're right that is a big difference
[18:37] <noob21> 50% in some cases
[18:38] <infernix> that xfs talk is a couple months old
[18:38] <infernix> or more. i think they were talking about 3.3
[18:38] <noob21> gotcha
[18:38] <nhm> infernix: I think it was from last year
[18:38] <infernix> nhm, i've been racking my brain around the 2u, 3u and 4u supermicros
[18:38] <noob21> about which to go with?
[18:39] <noob21> depends on how badass your raid controller is :D
[18:39] <nhm> noob21: bah, just use lots of them. ;)
[18:39] <infernix> the 2U has a 12 disk option with 2 2.5" in the back. that would give 12 disks for ceph and a raid1 on the 2.5" in the back
[18:39] <infernix> that yields some 12*80=960MB/sec
[18:39] <noob21> infernix: what does that go for $$?
[18:39] <infernix> which still fits in 10gbit
[18:39] <darkfaded> nhm: we also had a lot of good results in class today, solid throughput even on ceph fs for hours. only one mon/mds though and i wonder how much it's related :) underlying xfs and single ssd for journal and osd and all
[18:39] <noob21> right
[18:40] <infernix> noob21: pricing is on the way, only have the pricing for the 36 disk 4U currently
[18:40] <noob21> what's that run?
[18:40] <noob21> i'm curious how bad hp ripped us off lol
[18:40] <infernix> the problem with the 3U 16 slot (14 effectively) but moreso with the 4U 24 slot and 4U 36 slot, is that you get
[18:40] <nhm> infernix: yeah, I like the 12 disk option personally. I'd do the "A" chassis with directly connected disks.
[18:41] <noob21> i went with the 12 disk options from hp also
[18:41] <infernix> 1760MB/s and 2720MB/s aggregate throughput
[18:41] <nhm> darkfaded: good deal!
[18:41] <noob21> yeah now you're bumping into a bandwidth limit on your network
[18:41] <infernix> now infiniband can easily sustain this, but IPoIB only does about 15gbit
[18:41] <infernix> and to get to 40gbit you need to use RDMA or other infiniband type protocols
[18:41] <infernix> like sdp and the like
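infernix's numbers follow from multiplying data disks by a per-disk rate and comparing against the link. A minimal Python sketch of that arithmetic, assuming 80 MB/s per disk as in the log, decimal units (1 Gbit/s = 125 MB/s), and no protocol overhead; the function names are made up for illustration, and the disk counts 22 and 34 are inferred from 1760/80 and 2720/80.

```python
def aggregate_mb_per_s(data_disks, per_disk_mb_s=80):
    """Combined sequential throughput of the data disks, in MB/s."""
    return data_disks * per_disk_mb_s

def link_mb_per_s(gbit_per_s):
    """Raw link capacity in MB/s, ignoring all protocol overhead."""
    return gbit_per_s * 1000 / 8

def fits(data_disks, link_gbit, per_disk_mb_s=80):
    """True if the disks cannot outrun the link."""
    return aggregate_mb_per_s(data_disks, per_disk_mb_s) <= link_mb_per_s(link_gbit)

# 12 data disks on 10GbE, 22 and 34 data disks on 10GbE and 15 Gbit/s IPoIB
for disks in (12, 22, 34):
    for link in (10, 15):
        print(disks, "disks,", link, "Gbit/s link:", fits(disks, link))
```

Real links lose some capacity to framing and TCP/IP overhead, so a configuration that only just fits on paper may not fit in practice.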
[18:42] <noob21> yeah
[18:42] <noob21> you could vPC your 10Gb ports if you wanted
[18:42] <noob21> i've done that with the gluster here
[18:42] <noob21> gets you close to 20Gb
[18:42] <infernix> now the other problem is rebuild times, with X disks of Y size, rebuild time can exceed 16 hours
[18:43] <infernix> the 36 disk box with 4TB is about 38 hours for a full rebuild
[18:43] <noob21> wow
[18:43] <infernix> i'm thinking ~16 hours is the acceptable limit
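The log does not say how the 38 hour figure was derived. One back-of-the-envelope model is total capacity divided by a sustained rebuild rate; the sketch below assumes that model, decimal units (1 TB = 1,000,000 MB), and a rate of 1050 MB/s, which is purely an assumption chosen because it lands near 38 hours for 36 x 4TB. Both function names are hypothetical.

```python
def rebuild_hours(disks, tb_per_disk, rate_mb_s):
    """Hours needed to re-replicate a full box at a constant rate."""
    total_mb = disks * tb_per_disk * 1_000_000
    return total_mb / rate_mb_s / 3600

def max_disks_within(limit_hours, tb_per_disk, rate_mb_s):
    """Largest disk count whose full rebuild finishes within limit_hours."""
    movable_mb = limit_hours * 3600 * rate_mb_s
    return int(movable_mb // (tb_per_disk * 1_000_000))

ASSUMED_RATE = 1050  # MB/s, an assumption, not a figure from the log

print(round(rebuild_hours(36, 4, ASSUMED_RATE), 1), "hours for 36 x 4TB")
print(max_disks_within(16, 4, ASSUMED_RATE), "x 4TB disks fit in 16 hours")
print(max_disks_within(16, 3, ASSUMED_RATE), "x 3TB disks fit in 16 hours")
```

Under this model the acceptable box size scales directly with the rebuild rate, so the 16 hour limit is as much a network and recovery-throttling question as a chassis question.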
[18:43] <nhm> infernix: also, keep in mind that we haven't pushed any nodes that hard yet. I get good scalability up to about 1.4GB/s writes (2.8GB/s if you count journal) per node without much tuning, but after that we plateau. In the next couple of months I'll have more time to actually figure out how to tune things right for that kind of a setup, but for now we don't know what the bottleneck is.
[18:43] <noob21> that's if you add a whole box right?
[18:43] <infernix> nhm: plateau how?
[18:43] <infernix> hm
[18:43] <nhm> infernix: you add more drives and osds, and performance doesn't improve.
[18:44] <noob21> are you talking aggregate for the cluster or just that machine?
[18:44] <nhm> noob21: per node
[18:44] <infernix> i'm looking at getting 6x2U with 12x3TB each
[18:44] <noob21> damn
[18:44] <noob21> that's way more than i'm pushing
[18:44] <infernix> node as in OSD server?
[18:44] <infernix> or node as in client?
[18:44] <noob21> oh i mean osd server
[18:44] <noob21> sorry about that
[18:45] <nhm> noob21: so once a node hits 2.8GB/s counting the journals we don't go faster with more disks. At those speeds you start hitting things like processor and memory affinity and sometimes pcie issues though, so it may not even be ceph related.
[18:45] <noob21> right
[18:45] <noob21> makes sense
[18:45] <infernix> but to push 2.8GB/s to the network you will need infiniband
[18:45] <noob21> i'm just thinking about what my crap little setup is doing
[18:45] <infernix> or how else are you doing that?
[18:45] <infernix> bonding of 10GE?
[18:46] <noob21> yeah bonded 10Gb 802.3ad
[18:46] <infernix> so doesn't work for single connections
[18:46] <nhm> infernix: that's counting journal writes, so the client sees 1.4GB/s. The client in this case was on localhost.
[18:46] <infernix> needs multiple
[18:46] <infernix> hm ok, so assuming no SSD journals, and at the end of a 7200 sata disk you get 80mb
[18:47] <infernix> should I count half that speed as a performance metric per disk? e.g. 40MB?
[18:47] <noob21> yeah
[18:47] <noob21> i get about 60MB from SATA Drives
[18:47] <nhm> infernix: that's pretty close to what we see depending on various factors.
[18:48] <noob21> give or take
[18:48] <infernix> obviously the journal goes to the start of the disk where the speed is fastest
[18:48] <nhm> infernix: all depends on how you set it up.
[18:48] <infernix> no SSD, 12 OSDs on 3TB disks per server, 6 servers
[18:48] <infernix> 2x 6 cores and 64gb ram
[18:48] <infernix> that's my current outline
[18:49] <noob21> that's very similar to what my gluster setup is
[18:49] * richardshaw (~richardsh@katya.aggress.net) has joined #ceph
[18:49] <nhm> 32GB of ram may be sufficient if you want to cut costs, or use the money elsewhere.
[18:50] <noob21> nhm: on the newer 0.55 ceph i'm seeing a lot of slow request messages.
[18:50] <noob21> oh nvm, i'm running 0.54
[18:51] <infernix> ram is cheap
[18:51] <noob21> it is
[18:51] <infernix> i just need to be sure that i can read with 2.5GB/sec from this
[18:51] <infernix> write can be whatever
[18:51] <infernix> 400MB is fine
[18:52] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:52] <noob21> are you going to enable jumbo frames
[18:52] <infernix> either by creating rbd block devices and doing dd if=bigblob of=rbd
[18:52] <infernix> or by writing some python to do it in a smarter way
[18:52] <infernix> no need, we're on infiniband
[18:52] <infernix> framesize anywhere from 4k to 64k
[18:53] * MarcoA (5158e06e@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[18:53] <noob21> i see
[18:53] <rweeks> so you're doing IBoIP?
[18:54] <rweeks> er
[18:54] <rweeks> strike that
[18:54] <rweeks> reverse it
[18:54] <jefferai> infernix: I wouldn't count on it
[18:55] <jefferai> I have a three-node cluster, disks are all SSDs, each has a dual 10GbE active/active connection
[18:55] <jefferai> benchmarking shows I can read quite fast, but nowhere near 2.5Gb/s
[18:55] <nhm> jefferai: how many osds
[18:55] <nhm> ?
[18:55] <jefferai> 6 per node
[18:56] <nhm> And what kind of read speed?
[18:56] <jefferai> In somewhat artificial benchmark tests, I could get about 700MB/s
[18:56] <jefferai> which is faster than 2.5Gb/s but not 2.5GB/s
[18:56] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[18:57] <jefferai> unless infernix meant 2.5Gb
[18:57] <jefferai> depends which speed infiniband he's using, I guess
[18:57] <nhm> jefferai: is this with XFS?
[18:57] <jefferai> yeah, XFS
[18:58] <jefferai> I mean, I'm quite happy with the performance
[18:58] <jefferai> also I'm using VMs and the choice of filesystem on the VM plays a massive role
[18:58] <jefferai> for instance I'm using ZFS and it is not speedy
[18:58] <jefferai> but I'm using it for the snapshotting capabilities
[18:58] <nhm> jefferai: Might take more OSDs to push the throughput.
[18:58] <jefferai> nhm: perhaps -- I'll be adding more eventually
[18:58] <noob21> i think i killed my cluster. resizing images works ok but when i mount they all show as 1GB haha
[18:59] <gregaf> does it scale if you add more VMs or hosts?
[18:59] <jefferai> still, 2.5GB/s (not Gb/s) is a *lot*
[18:59] <gregaf> 700MB/s I think I remember being one of the limits we see on a single client reading
[18:59] <jefferai> gregaf: ah, that could make sense
[18:59] <jefferai> I did that benchmark from one of my storage nodes
[18:59] <jefferai> so, it would have been single client
[18:59] <nhm> jefferai: I can do about 70-80MB/s per OSD for reads on XFS if I'm doing it from localhost
[18:59] <jefferai> it's possible if I ran from multiple clients I'd get above that
[18:59] <jefferai> even then, it might still come down to infernix's exact use case
[19:00] <jefferai> if he's trying to get that bandwidth to a single VM, for instance...
[19:00] <gregaf> yeah, I believe our messenger implementation tops out at around that speed, according to Sam's microbenchmarks
[19:00] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[19:00] <gregaf> and nhm has definitely seen it, which is why he uses multiple daemons when doing tests
[19:00] <richardshaw> Hi, i'm going round in circles slightly, I want to install ceph on a single host with 4 disks for storage, what's the best way of using that storage to include resilience?
[19:00] <jefferai> richardshaw: not using ceph
[19:00] <nhm> gregaf: I've been meaning to try to do some messenger benchmarking at some point.
[19:01] <jefferai> richardshaw: you're better off using zfs or, if you're feeling lucky, a newer kernel and btrfs
[19:01] <gregaf> nhm: Sam set up some stuff to do that several months ago; make sure you talk to him
[19:01] <jefferai> or using mdraid
[19:01] <gregaf> richardshaw: if, on the other hand, you were trying to get a feel for Ceph on a small system, then what you'd do is set up each disk as a separate OSD
[19:01] <noob21> from 0.54 can you rolling upgrade to 0.55 now?
[19:02] <infernix> jefferai: but how parallel is your access pattern?
[19:02] <jefferai> noob21: theoretically but I ran into massive problems last night
[19:02] <gregaf> and it will by default put a copy of whatever you write on two disks
[19:02] <noob21> ok i'll hold off then
[19:02] <infernix> jefferai: eg how are you benchmarking? dd if=/dev/zero?
[19:02] <gregaf> apparently we've screwed something up, which I'm sure will reveal itself in our argonaut-bobtail tests :/
[19:02] <jefferai> noob21: it's not yet clear to me if those problems were caused by bugs in the 0.54 nodes that manifested when I restarted the upgraded box, or a problem with 0.55
[19:02] <richardshaw> I want to use it as a backend to openstack that I may add additional nodes to, but right now, it's just one host
[19:02] <noob21> ok
[19:03] <noob21> so for the production cluster of ceph i'm building would it be best to stick with argonaut?
[19:03] <infernix> i really need to know if i can get the speed before i put down $30k
[19:03] <jefferai> infernix: using one of the built-in benchmarks
[19:03] <noob21> cause my dev setup is 0.54
[19:03] <jefferai> infernix: but again -- one client
[19:03] <jefferai> sounds like nhm has more experience
[19:03] <jefferai> or gregaf
[19:03] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[19:04] <infernix> i don't need it to a VM, basically i can write to an SRP storage device with 4.5GByte/s
[19:04] <infernix> when using directio
[19:04] <infernix> i need to get about 2.5gbyte/s from *something* to that and sustain it for about 10TB
[19:05] <noob21> high performance computing?
[19:05] <infernix> sorta
[19:05] <nhm> noob21: bobtail is right around the corner if you can hold out.
[19:05] <jefferai> infernix: it's possible that parallelizing requests will let you get that
[19:05] <noob21> nhm: yeah my servers are not coming for another 2 weeks
[19:06] <noob21> and after that it'll prob take a week to get a rack request done, etc
[19:06] <noob21> ports and all that
[19:06] <nhm> infernix: tough to say unfortunately. You might even get it at first but over time will stop hitting it due to fragmentation issues on the underlying filesystems.
[19:06] <jefferai> noob21: I will say that when things got resolved, I hadn't lost data (that I know of), but I was in a massive panic
[19:06] <infernix> nhm, I'll probably be throwing away that 10TB daily
[19:06] <infernix> and then readding it
[19:07] <noob21> jefferai: i can imagine :)
[19:07] <jefferai> nhm: although -- if you periodically take an OSD out, wipe it, re-add it, it should remove fragmentation issues
[19:07] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[19:07] <infernix> and i want to scale it up to, i don't know, 40TB in 6 months
[19:07] <jefferai> noob21: yeah, I need to figure out the proper way to take an entire node out of the cluster
[19:07] <noob21> instead of just /etc/init.d/ceph stop :D
[19:08] <jefferai> because one thing that happened is that the other nodes never timed out in terms of thinking that the node I took out of service wasn't around
[19:08] <jefferai> noob21: yeah
[19:08] <noob21> oh ok
[19:08] <jefferai> so all my VMs froze because they tried writing to e.g. log files
[19:08] <noob21> so they never started the resync
[19:08] <noob21> right
[19:08] <nhm> infernix: btw, what are the IO sizes?
[19:08] <nhm> infernix: and how random?
[19:08] <infernix> 1M to 4M
[19:08] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[19:08] <jefferai> so perhaps it's -- take the mon out of the cluster, then mark the osds as down/out
[19:08] <jefferai> then do the upgrade
[19:08] <jefferai> but I don't know for sure
[19:08] <jefferai> need to find out from someone that does
[19:09] <noob21> yeah i haven't figured out a way to do this either
[19:09] <infernix> nhm: i'm basically backing up logical volumes to ceph, and in disaster scenarios i need to sequentially read them and throw them on a high speed ssd device
[19:09] <infernix> that last bit needs to scream in order to meet SLA demands
[19:09] <jefferai> infernix: unfortunately it's really hard to promise anything
[19:10] <nhm> infernix: Ok. I guess I'd think about ceph and a couple of other options that seem plausible, and figure out whether or not you can build nodes that transfer well.
[19:10] <jefferai> I say this not at all being affiliated with Inktank, but: the Inktank guys do have services for designing clusters to meet specific needs
[19:10] <jefferai> I'm not sure it's cheap, but they'd know the best
[19:10] <infernix> the other plausible option is about 33 disks in a linux md raid 10
[19:11] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[19:11] <infernix> with striped reads that's 33*80mb=2.6gbyte/s
[19:11] <nhm> infernix: yes, but doesn't meet your data growth needs?
[19:11] <infernix> but scaling that kind of sucks
[19:11] <infernix> exactly
[19:11] <jefferai> right
[19:11] <infernix> i don't know of any other options
[19:12] <infernix> apart from stupid expensive ssd constructions
[19:13] <nhm> infernix: when do you need to decide by?
[19:13] <infernix> as for parallelism, i can basically try to write something that uses librados, and reads a rbd device in parallel and then write to the SSD SRP device in parallel too. the SRP end goes to 4.5gb/sec with 2 threads and 4MB io sizes
[19:14] <infernix> nhm: before years end
[19:14] <infernix> but preferably one or two weeks
[19:14] <jefferai> infernix: I'm not sure, but you may be better off not using rbd, and rather using librados directly
[19:14] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[19:14] * loicd (~loic@magenta.dachary.org) has joined #ceph
[19:14] <infernix> jefferai: wouldn't be a problem
[19:14] <jefferai> just saying, rbd may not be the fastest interface since it has to translate the objects into disk blocks
[19:15] <jefferai> if you can read and write objects directly you may be better off
[19:15] <infernix> well i am backing up block devices to it, but yeah i see your point
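The parallel copy infernix describes (several threads, 4MB I/O sizes, sequential source to a fast target) can be sketched with plain files standing in for both ends. This is an illustration only: it uses ordinary `os.pread`/`os.pwrite` on file paths, not librados or rbd bindings, and `parallel_copy` is a hypothetical name. It also relies on `os.path.getsize`, which works for regular files but not for block devices, so a real restore would need the device size from elsewhere.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4 * 1024 * 1024  # 4MB, the I/O size mentioned in the log

def parallel_copy(src_path, dst_path, chunk_size=CHUNK, workers=2):
    """Copy src to dst in fixed-size chunks using several threads."""
    size = os.path.getsize(src_path)
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.ftruncate(dst, size)  # size the target before writing at offsets

        def copy_chunk(offset):
            remaining = min(chunk_size, size - offset)
            while remaining > 0:
                # pread/pwrite take explicit offsets, so threads never
                # fight over a shared file position.
                data = os.pread(src, remaining, offset)
                if not data:
                    raise IOError("unexpected end of source")
                written = 0
                while written < len(data):
                    written += os.pwrite(dst, data[written:], offset + written)
                offset += len(data)
                remaining -= len(data)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces every chunk to finish and re-raises any error.
            list(pool.map(copy_chunk, range(0, size, chunk_size)))
    finally:
        os.close(src)
        os.close(dst)
    return size
```

Whether threads like this get anywhere near the rates discussed depends on the storage behind both paths; the sketch shows the access pattern, not a performance claim.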
[19:17] * chutzpah (~chutz@ has joined #ceph
[19:17] <nhm> infernix: Would it work to buy 1 12 disk chassis, test it out, and if it doesn't work, transfer all of the parts into a bigger chassis and then you are only out the cost of a 12 disk chassis?
[19:18] <infernix> but how can i test with one chassis only?
[19:18] <infernix> the point is to see how well it scales beyond one
[19:18] <jefferai> buy two then :-)
[19:18] <jefferai> can test with one
[19:18] <nhm> infernix: Yes, but if you don't like the results in 1 chassis, maybe that helps influence your decision. :)
[19:18] <jefferai> then test with two
[19:18] <infernix> will get 6 if i know it can work
[19:18] <infernix> nhm: i've seen your benchmarks
[19:19] <infernix> i would like to see benchmarks with more nodes
[19:19] <infernix> but i don't think they are out there
[19:20] <nhm> infernix: yeah. We've got a decent sized test cluster, but the nodes have strange performance issues.
[19:20] <nhm> infernix: it works better for QA than performance testing. ;(
[19:20] <infernix> old hardware?
[19:21] <nhm> infernix: lets say it's not new and vendor "optimized". :)
[19:22] <nhm> with their own secret sauce in various firmware/bioses.
[19:22] * jefferai hates that
[19:23] <infernix> well if i can figure out an alternative plan for the 12 disk 2u boxes i can probably justify it
[19:23] <nhm> that's actually why we have the 36 drive supermicro chassis.
[19:23] <infernix> but it's going to be hard to justify $30k without at least a reasonable certainty
[19:24] <nhm> I needed to get my hands on something I trusted.
[19:24] <infernix> yeah but as i calculated, 36 disks = 34 data disks * 80mb/s = 2.7GByte/s, which is more than even 15gbit infiniband
[19:24] <infernix> ipoib that is
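The arithmetic infernix quotes can be checked with a short sketch. It assumes decimal units (1 GB = 1000 MB), takes the 15 gbit figure for IPoIB from the log at face value, and ignores protocol overhead and replication traffic; the function names are illustrative only.

```python
def aggregate_disk_throughput_mb_s(data_disks: int, per_disk_mb_s: float) -> float:
    """Combined sequential throughput of all data disks, in MB/s."""
    return data_disks * per_disk_mb_s


def link_capacity_mb_s(gbit_per_s: float) -> float:
    """Convert a link rate in Gbit/s to MB/s (decimal units, no overhead)."""
    return gbit_per_s * 1000 / 8


# 36-bay chassis with 34 data disks at 80 MB/s each, as in the log.
disks_mb_s = aggregate_disk_throughput_mb_s(34, 80)
# 15 Gbit/s is the IPoIB figure quoted in the log (assumed usable rate).
link_mb_s = link_capacity_mb_s(15)

print(f"disks: {disks_mb_s / 1000:.2f} GB/s, link: {link_mb_s / 1000:.3f} GB/s")
print("network is the bottleneck" if disks_mb_s > link_mb_s else "disks are the bottleneck")
```

Under these assumptions the disks can source more data than a single link of that speed can carry, which is the point being made about one large chassis.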
[19:24] <jefferai> nhm: hah, my supermicros have been nothing but trouble
[19:25] <jefferai> infernix: one thing I'll tell you
[19:25] <jefferai> if performance isn't *everything*
[19:25] <jefferai> is that at this point, there are certainly issues with ceph
[19:25] <jefferai> there are bugs, daemons can go haywire
[19:25] <jefferai> but every time that's happened, I've never lost data
[19:25] <jefferai> I've been able to restart various daemons and watch the cluster heal itself
[19:26] <jefferai> even when I've been shitting bricks worried that I fucked something up
[19:26] <nhm> infernix: yeah, I don't strictly recommend that chassis unless you plan to buy a lot of them. I wanted it though because it lets me test 12-drive configurations, 16-drive configurations, 24-drive configurations, etc.
[19:26] <jefferai> yeah, in general you're better off buying more 12-drive 2U chassis
[19:26] <infernix> well that's the good thing, this is a DR system, so it's not used externally
[19:27] <jefferai> you can fit less drives, but you can better optimize memory availability for the daemons, and increase network bandwidth
[19:27] <infernix> e.g. no VMs running off of it, no end users talking to it
[19:27] <jefferai> or at least, that's what everyone in here told me :-D
[19:28] <infernix> oh, actually come to think of it, i am forced to use rbd
[19:28] <nhm> We've had a lot of people interested in building out ceph clusters with VMs running on the same boxes as the OSDs. I'm not particularly fond of it. ;(
[19:28] <infernix> we will want to attach those backed up LVs to a VM to do some basic fs checks
[19:29] <jefferai> nhm: yeah, I have separate compute boxes with higher core/memory counts
[19:29] <jefferai> those act as some mons too
[19:30] <infernix> so hrm, if ceph doesn't cut it, what else can i do. maybe convert them all to software raid 10s, export as SRP with SCST and stripe the boxes in a LVM-HA VG
[19:30] <nhm> jefferai: good for you. Feels too much like putting all of your eggs in one basket to have everything running on everything.
[19:31] <jefferai> yep
[19:31] <jefferai> infernix: if you're going for performance stay away from LVM
[19:31] <jefferai> it has a small performance hit and also has some racy code
[19:31] <infernix> jefferai: lvm does fine
[19:31] <nhm> infernix: If you can stomach it, lustre is quite fast.
[19:32] <infernix> jefferai: i'm pushing 4.5GB/sec writes through LVM
[19:32] <nhm> infernix: not particularly stable.
[19:32] <infernix> more for reads
[19:33] * ircolle1 (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[19:33] <nhm> infernix: it's a lot of work to maintain though and doesn't buy you a lot of the things ceph does.
[19:34] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: ircolle)
[19:34] <infernix> 2.5GB/sec reads is a requirement, whatever solution i build out. if ceph can do it, i'll buy the boxes
[19:34] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[19:38] <infernix> what does DreamObject do in terms of performance?
[19:39] <infernix> and do they use SSD?
[19:39] <noob21> i'd imagine it's fairly snappy. their network pipes are huge
[19:39] <noob21> i think it's all 10Gb, and a 40Gb backbone
[19:40] <nhm> infernix: they are using Dell R515s with 12 drives each.
[19:42] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[19:43] <nhm> infernix: some slides from ceph day: http://www.slideshare.net/Inktank_Ceph/20121102-dreamobjects
[19:44] <infernix> why 300gb sas as OS? is there anything going on there that warrants 10k or 15k?
[19:44] <noob21> prob not
[19:44] <noob21> anyone setup cloudstack 4 yet with ceph?
[19:44] <noob21> i'm curious how easy it is to build
[19:45] <nhm> infernix: maybe dell had a bunch of them
[19:45] <infernix> nhm: pretty much validates my conclusions
[19:45] <infernix> but no performance metrics in the slides unfortunately :)
[19:46] <nhm> infernix: no, and my guess is that they'll be limited by the rados gateways rather than the OSDs. They have like 7PB of those boxes.
[19:47] <infernix> yeah i wouldn't be bottlenecked by radosgw
[19:50] <infernix> NRC comes down to about $0.29/GB
[19:50] * infernix goes off to calculate MRC
[19:50] <noob21> not bad
[19:50] <noob21> my builds come out to about .75/GB
[19:51] <noob21> dell must be a lot cheaper
[19:52] <infernix> supermicro
[19:52] <infernix> not dell
[19:53] <infernix> but i haven't factored in port costs yet
[19:53] <infernix> network/power etc
[19:53] <noob21> oh ok
[19:53] <noob21> that's still really cheap
[19:54] <infernix> just buying the hardware
[19:54] <noob21> what kind of raid cards are you putting in there?
[19:54] <infernix> none
[19:54] <noob21> oh ok
[19:54] <infernix> LSI HBAs
[19:54] <noob21> for just the jbod support?
[19:54] <infernix> yes
[19:54] <infernix> might even skip those, there's some of that on board with some of the mainboards
[19:55] <noob21> lol
[19:55] <noob21> so ceph makes raid almost useless
[19:55] <nhm> infernix: which LSIs?
[19:55] <infernix> not sure yet as I haven't looked at this again. with the 36 drive box i planned to use one HBA per backplane, e.g. 2
[19:55] <benpol> infernix: with a lot of supermicro boxes you need the HBA to get 6Gb/s SAS/SATA
[19:55] <infernix> nhm: the 9207s,2308 chip i think
[19:55] <nhm> infernix: yep, that card worked pretty well with ceph if you have journals on SSDs.
[19:56] <infernix> benpol: 12 port backplane at 4x SAS 8087 = 24gbit = 2gbit per port = 200Mb/sec
[19:56] <infernix> plenty for SATA
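The per-port bandwidth estimate above can be reproduced as follows. This is a sketch that assumes 6 Gbit/s per SAS lane (the rate benpol mentions) and 8b/10b line encoding, so 10 bits on the wire carry one byte of payload; the constant names are illustrative.

```python
LANES_PER_CONNECTOR = 4   # one SFF-8087 connector carries 4 SAS lanes
GBIT_PER_LANE = 6         # assumed 6 Gbit/s SAS lanes
DRIVE_PORTS = 12          # 12-port backplane
WIRE_BITS_PER_BYTE = 10   # 8b/10b encoding: 10 line bits per payload byte

uplink_gbit = LANES_PER_CONNECTOR * GBIT_PER_LANE          # shared uplink
per_port_gbit = uplink_gbit / DRIVE_PORTS                  # fair share per drive
per_port_mb_s = per_port_gbit * 1000 / WIRE_BITS_PER_BYTE  # payload MB/s

print(f"{uplink_gbit} Gbit/s uplink, {per_port_gbit} Gbit/s per port, "
      f"{per_port_mb_s:.0f} MB/s per port")
```

The result is a megabyte figure, so the "200Mb/sec" in the log reads as 200 MB/s per drive when all twelve are busy at once.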
[19:56] <nhm> infernix: the HighPoint Rocket 2720SGL did about the same.
[19:56] <noob21> infernix: is that cost before or after replication?
[19:56] <infernix> nhm: no SSDs planned
[19:56] <jmlowe> noob21: more like Redundant Array of Inexpensive Servers RAIS instead of RAID
[19:56] <infernix> noob21: after. 2x rep
[19:56] <noob21> oh wow
[19:56] <noob21> ok
[19:57] <nhm> noob21: WB cache seems to be helpful if you don't have SSD journals.
[19:57] <infernix> or maybe i messed up my math
[19:57] <nhm> noob21: probably not helpful enough to justify the cost in a lot of cases though.
[19:57] <noob21> got some links where i can buy these? :D
[19:57] <infernix> 3x replication takes it to $0.44/GB
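The step from $0.29/GB at 2x replication to $0.44/GB at 3x can be sketched like this. It assumes cost scales linearly with replica count and rounds half-up to whole cents; the function name is illustrative, and Decimal is used so the rounding is not thrown off by binary floating point.

```python
from decimal import Decimal, ROUND_HALF_UP


def cost_per_usable_gb(known_cost: Decimal, known_replicas: int,
                       target_replicas: int) -> Decimal:
    """Rescale a per-usable-GB cost from one replica count to another."""
    raw_cost_per_gb = known_cost / known_replicas  # cost of one raw copy
    scaled = raw_cost_per_gb * target_replicas
    return scaled.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


two_x = Decimal("0.29")  # figure quoted in the log for 2 replicas
three_x = cost_per_usable_gb(two_x, 2, 3)
print(f"3x replication: ${three_x}/GB")
```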
[19:57] <noob21> i was calculating .75 cents for 3x replication
[19:57] <noob21> with hp stuff
[19:58] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:01] * guigouz (~guigouz@201-87-100-166.static-corp.ajato.com.br) Quit (Quit: Computer has gone to sleep.)
[20:01] * jlogan1 (~Thunderbi@2600:c00:3010:1:7db5:bf2b:27d1:c794) Quit (Ping timeout: 480 seconds)
[20:02] <noob21> so something like this? http://www.neweggbusiness.com/Product/Product.aspx?Item=N82E16816101675
[20:06] <infernix> CSE-826BE16-R920LPB chassis with the optional 2x2.5" drive trays in the back
[20:06] <infernix> i need to update some pricing here though so .29 might not be totally accurate
[20:07] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[20:07] * Ryan_Lane (~Adium@ has joined #ceph
[20:08] <infernix> but even if i add $2500 for chassis i'm at $0.43 per GB with 2 replicas
[20:09] <infernix> where chassis = full box without disks
[20:09] <infernix> i'm getting exact numbers in the mail today
[20:10] <noob21> cool
[20:14] * yasu` (~yasu`@dhcp-59-168.cse.ucsc.edu) has joined #ceph
[20:17] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[20:17] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:18] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[20:28] * ScOut3R (~scout3r@1F2E942C.dsl.pool.telekom.hu) has joined #ceph
[20:28] * jlogan1 (~Thunderbi@2600:c00:3010:1:7db5:bf2b:27d1:c794) has joined #ceph
[20:28] <elder> Back in a bit.
[20:31] * guigouz (~guigouz@ has joined #ceph
[20:31] * madkiss (~madkiss@ has joined #ceph
[20:31] <madkiss> cheers
[20:32] <madkiss> Robe: you about?
[20:35] <Robe> madkiss: yes
[20:36] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[20:36] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:39] * alan_ (~alan@ctv-95-173-34-17.vinita.lt) has joined #ceph
[20:41] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[20:42] <nhm> any of you guys happen to know if there is anything like xfs_bmap for btrfs that lets you see fragmentation statistics?
[20:43] * BManojlovic (~steki@242-174-222-85.adsl.verat.net) has joined #ceph
[20:44] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[20:46] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[20:50] * nwat (~Adium@soenat3.cse.ucsc.edu) has left #ceph
[20:54] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[20:54] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[21:03] * gaveen (~gaveen@ has joined #ceph
[21:04] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[21:06] <infernix> ok so i have 4 boxes with 8x 300gb 10k SAS
[21:06] <infernix> 28 usable disks if i blow that stuff up
[21:06] <infernix> how's ceph on debian 6?
[21:06] <infernix> should i upgrade to testing?
[21:08] <lurbs> infernix: I'd imagine that would depend on if you were going to use the kernel RBD driver, or BTRFS. Newer kernels, even the one in squeeze-backports, would probably be better for that.
[21:11] <infernix> XFS
[21:11] * infernix starts to blow some stuff up
[21:12] <nhm> infernix: debian doesn't have a new enough glibc for syncfs, so you need to make sure you use ceph 0.55+ and make sure it's properly using syncfs.
[21:13] <nhm> infernix: (newer ceph versions get around the glibc limitation)
[21:13] <nhm> infernix: also, newer kernels are better than older ones. If you can do 3.6.x that would be helpful.
[21:14] <lurbs> nhm: Are the newer kernels still desirable if you're just using XFS and the libvirt RBD implementation?
[21:14] <lurbs> And if so, why?
[21:15] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[21:15] <nhm> lurbs: might not need to in that case.
[21:15] <nhm> lurbs: not sure what kind of recent xfs improvements there are.
[21:16] <nhm> lurbs: for kernel rbd it sounds like you really do want new kernels (and for btrfs too!)
[21:21] <brainopia> guys, so I've wondered could rados gateway sustain 600+ concurrent streaming of videos to saturate 1gbps connection?
[21:22] <infernix> nhm: debian testing isnt either?
[21:22] <nhm> infernix: don't think so
[21:22] <nhm> infernix: you need 2.14
[21:23] <nhm> 2.14+ rather
[21:24] * The_Bishop (~bishop@2001:470:50b6:0:2da4:3313:73d6:da2) Quit (Ping timeout: 480 seconds)
[21:24] <infernix> that isnt' even in sid
[21:24] * infernix is dissapoint
[21:24] <infernix> so what does have it
[21:25] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[21:25] <infernix> ubuntu. so go with 12.04? or .10?
[21:26] <nhm> brainopia: I can push 100MB/s through the gateway with lots of concurrent 4MB objects.
[21:27] <nhm> brainopia: that was last summer on a much older ceph and kernel.
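For scale, the two figures in this exchange can be put side by side. The sketch assumes decimal units, an even split of bandwidth across streams, and no protocol overhead; the names are illustrative.

```python
def per_stream_mbit_s(total_mbit_s: float, streams: int) -> float:
    """Bandwidth available to each stream if the total is shared evenly."""
    return total_mbit_s / streams


STREAMS = 600
link_mbit_s = 1000        # the 1 gbps connection brainopia wants to saturate
gateway_mbit_s = 100 * 8  # nhm's 100 MB/s through one gateway, in Mbit/s

link_share = per_stream_mbit_s(link_mbit_s, STREAMS)
gateway_share = per_stream_mbit_s(gateway_mbit_s, STREAMS)

print(f"per stream on a full 1 Gbit/s link: {link_share:.2f} Mbit/s")
print(f"per stream at 100 MB/s gateway throughput: {gateway_share:.2f} Mbit/s")
```

So the throughput nhm measured on a single gateway is about 80% of the link rate in question, before any of the caching concerns come into play.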
[21:27] <jefferai> infernix: are you aware of the debian/debian-testing repos?
[21:27] <jefferai> they have ceph packages for both debian and ubuntu
[21:27] <brainopia> nhm: one gateway or a pool of them?
[21:27] <jefferai> that said
[21:27] <nhm> infernix: either should be fine
[21:28] <jefferai> The most well-tested version is 12.04
[21:28] <jefferai> it's the only version that the developers run all regression tests on
[21:28] <jefferai> and by version I mean "any distro, any version"
[21:28] <jefferai> they also have 3.6.3 kernels that they build available for 12.04
[21:28] <nhm> brainopia: that was 1 gateway.
[21:28] <jefferai> http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/linux-3.6.3/
[21:29] <jefferai> infernix: so - use 12.04, with ceph from the debian-testing repository (http://ceph.com/docs/master/install/debian/)
[21:29] <gregaf> nhm: brainopia: it's not a great fit for streaming media without having a caching layer in front, though
[21:29] <jefferai> and if you're using RBD -- that kernel
[21:29] <infernix> the client is a debian testing box with 3.6.8
[21:29] <infernix> that doesn't need the glibc 2.14+ right?
[21:30] <jefferai> sure it does
[21:30] <brainopia> gregaf: well, I have a lot of different media, so it won't fit into cache, and it's accessed pretty randomly, so only readahead is useful for caching
[21:30] <jefferai> although it sounds like 0.55+ doesn't need it
[21:30] <nhm> gregaf: hrm, I guess I assumed client buffering.
[21:30] <jefferai> (maybe? not sure)
[21:30] <infernix> jefferai: i'm talking about the client that will use rbd? i also have centos 5.8 clients
[21:30] <jefferai> nhm: so they worked around the syncfs issue with 0.55, without big performance hits?
[21:31] <jefferai> infernix: ah, the glibc/syncfs bits are for the OSDs
[21:31] <gregaf> the bigger problem is that it doesn't do any data caching on its own and so if you have 6 people watching the same video you can bottleneck pretty badly
[21:31] <infernix> right, got me worried there
[21:31] <jefferai> as for centos -- good luck
[21:31] <gregaf> it'll work *better* in bobtail or the .55 dev release
[21:31] <gregaf> but I wouldn't build a service out of it
[21:31] <infernix> i need to talk with the cluster from centos 5.5 clients too
[21:31] <infernix> *5.8
[21:31] <gregaf> 600 streams might be okay though, especially if you could make guarantees about concurrent access to large objects
[21:32] <jefferai> infernix: you won't find any official support for that
[21:32] <nhm> gregaf: ah, makes sense. I guess you'd probably have to stick something infront of apache.
[21:32] <jefferai> and probably not much unofficial support either
[21:32] <jefferai> it's so, so old
[21:32] <infernix> you don't need to convince me there
[21:32] * The_Bishop (~bishop@2001:470:50b6:0:21b0:97ac:b32b:7f5e) has joined #ceph
[21:32] <infernix> but legacy is legacy
[21:32] <infernix> i have to deal with it
[21:32] <jefferai> it may very well impact your performance
[21:32] <brainopia> gregaf: well, replications should help with concurrent access, right?
[21:32] <jefferai> if you can even get it working/running
[21:33] <nhm> brainopia: I think he was thinking cache on the gateway server.
[21:33] <gregaf> brainopia: not really — the problems are pretty intricate and your use case might be fine
[21:34] <infernix> if i can't even get librados built on centos 5.8 and talk to ceph in userspace, that'd be a dealbreaker
[21:34] <gregaf> actually I think in Bobtail it would work just fine as long as you've set up your server so it's powerful enough to drive those 600 connections
[21:34] <brainopia> gregaf: what changed in bobtail?
[21:35] <gregaf> RGW chunks up objects a lot more now than it did in argonaut, which means when you're streaming a movie you're distributing the reads instead of sending them all to one OSD
[21:35] <brainopia> is the chunk size configurable?
[21:36] <dmick> hey elder: do you know anything more about http://tracker.newdream.net/issues/3449 than is in the bug?
[21:36] <gregaf> I think so…yehudasa?
[21:36] <gregaf> he can talk about this in more detail than I can anyway
[21:36] <lurbs> Would it be possible to shut the /etc/logrotate.d/ceph script up by putting a >/dev/null after each of the 'if which' commands?
[21:37] <dmick> lurbs: I think that's been done?..
[21:37] <lurbs> Ah, not in the 0.55 packages.
[21:37] <dmick> 3ace9a7c66c79847d215f4bd38f40ca8b07bab8e
[21:38] <dmick> easy edit
[21:38] <lurbs> I'll check git before whining next time, thanks.
[21:38] <dmick> np
[21:38] <lurbs> It's done locally anyway.
[21:38] <yehudasa> brainopia: there are two different things, chunk size, and stripe size (in later versions)
[21:38] <dmick> asked and answered, and multiplied out to others who may have the same problem
[21:39] <yehudasa> brainopia: stripe size is configurable
[21:39] <yehudasa> chunks are the units that rgw uses to interact with the backend
[21:40] <yehudasa> not really relevant
[21:40] <brainopia> yehudasa: if a big-enough request comes asking for data near the stripe boundary it would hit two storage nodes?
[21:40] <yehudasa> brainopia: yes
[21:41] <brainopia> is there upper bound on stripe size?
[21:41] <yehudasa> not really
[21:41] <yehudasa> 5G due to protocol restrictions
[21:42] <brainopia> nice, it's a lot better then 1mb at most raid controllers :D
[21:42] <jefferai> gosh, that seems low
[21:42] <jefferai> ( :-D )
[21:42] <yehudasa> by default it's 4M
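The boundary case brainopia asks about can be illustrated with a simplified model: treat an object as a sequence of fixed-size stripes and list which stripes a byte range touches. This is a sketch only; it assumes "4M" means 4 MiB, does not model how RGW actually lays out or names its backing objects, and the function name is hypothetical.

```python
from typing import List

MIB = 1024 * 1024


def stripes_touched(offset: int, length: int,
                    stripe_size: int = 4 * MIB) -> List[int]:
    """Return the indices of the stripes covered by [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return list(range(first, last + 1))


# A 1 MiB read at the start of the object stays within one stripe.
inside = stripes_touched(0, MIB)
# A 1 MiB read starting 512 KiB before the 4 MiB boundary spans two stripes.
across = stripes_touched(4 * MIB - 512 * 1024, MIB)

print(inside, across)
```

Each stripe may be placed on a different OSD, so a read that crosses a boundary can involve two storage nodes, as yehudasa confirms.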
[21:43] <brainopia> yehudasa: do you think it's sane to use rados gateway as web server for static files (big media files) stored inside ceph?
[21:44] * Cube (~Cube@ has joined #ceph
[21:44] <yehudasa> brainopia: that's one use case
[21:44] <yehudasa> brainopia: though note that radosgw can't be configured yet to have alternative 404 pages, etc.
[21:45] <brainopia> yehudasa: good enough for me :)
[21:45] <brainopia> thanks
[21:45] <brainopia> I will try and come back with more questions ;)
[21:48] * Cube1 (~Cube@ has joined #ceph
[21:52] <wer> Should I grab the keyring for ceph as well as the ceph.conf when doing multiple nodes?
[21:52] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[21:52] * brainopia (~brainopia@ Quit (Quit: brainopia)
[21:53] <yehudasa> wer: you can have different keyring for each node
[21:53] <yehudasa> really depends on your setup
[21:54] <wer> :) So I have a working ceph install with 24 osd's, a mon and a radosgw all on one node for testing.
[21:54] <wer> I am bringing up three other nodes.
[21:56] <wer> My plan was to just create a config for two nodes... and I wanted to see mkcephfs run across the two. But I am unclear on what the second node needs to accomplish this.
[21:57] <joshd> wer: mkcephfs is only for initial setup. if you're reinitializing everything, that's fine, otherwise check out http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
[21:57] * jjgalvez (~jjgalvez@ has joined #ceph
[21:59] <absynth> joshd: did you read olivers mail on the list, by chance?
[22:00] <benpol> wer: and once the new OSDs are added, you might think about transitioning to some new pools with an increased number of placement groups
[22:00] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[22:00] <wer> joshd: yes. I am going to do an initial setup. But that doc doesn't clarify which osd's keys get checked in where :) One one node, don't the keys get checked into cephs keyring? So will mkcephfs ceck in the keys of the second node to the second nodes' keyring? And if so, is it ok to keep a master keyring with all the osd's keys on each node? Is that making sense?
[22:01] <wer> benpol: yeah, I need to do more homework on placement groups :)
[22:03] <wer> oh, actually, I don't even see where the osd's keys are being used. Yeah I am just confused.
[22:04] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[22:06] <benpol> wer: see step 8 on http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#adding-an-osd-manual
[22:07] <wer> where is ceph auth list keeping all these entries :) is it the mon?
[22:07] <benpol> wer: yes
[22:07] <wer> oh! ok. Then if I just get a consitent ceph.conf on there I can add osd's.... I think I get it now.
[22:10] <wer> so on the second new node is it possible to run mkcephfs with --init-local-daemons?
[22:10] * joao (~JL@ Quit (Read error: Connection reset by peer)
[22:11] <benpol> wer: see step 7 on the previously mentioned page ("ceph-osd -i {osd-num} --mkfs --mkkey"), that generates the key and intializes the directory layout for the new osd
[22:12] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[22:13] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[22:14] <wer> benpol: If I run mkcephfs --init-local-daemons it essentially does all that right?
[22:14] * ScOut3R (~scout3r@1F2E942C.dsl.pool.telekom.hu) Quit (Quit: Lost terminal)
[22:16] <wer> sorry if I am being dense. I know I can add each osd 1 by one... but I am trying to find the least path of resistance for turning up new osd's. I can create another script to do all this, but was hoping that is what mkcephfs was for :)
[22:17] <benpol> wer: from what I understand mkcephfs is generally recommended for use during the initial creation of a cluster, and that it's use might become deprecated. All of the docs for adding OSDs tend to specify the usage I just quoted.
[22:18] <wer> ok benpol. Fair enough. The human element of managing ceph has a lot of potential issues. Especially with this human.
[22:18] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[22:18] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has left #ceph
[22:19] <benpol> A lot of SSD users here presumably. Anyone used a Viking SATADIMM? Not sure if they're actually available anywhere, but it seems like a clever idea for use in a 1u server.
[22:21] <nhm> benpol: that's great
[22:21] <benpol> nhm: That's what I thought too, but all the posts about it seem to be a couple years old and I can't find them online for purchase.
[22:22] <absynth> uh, there are lengthy threads about SSD choices on the mailing list
[22:23] <benpol> absynth: fair enough
[22:23] <absynth> i was just saying - maybe it helps
[22:23] <nhm> benpol: now we just need an adapter that lets you put dimms in drive bays...
[22:23] <absynth> dunno if there's a web archive, though (and where)
[22:24] <rweeks> yes there is
[22:24] <benpol> several in fact! :)
[22:24] <rweeks> archives: http://dir.gmane.org/gmane.comp.file-systems.ceph.devel
[22:24] <rweeks> archives: http://marc.info/?l=ceph-devel
[22:24] <absynth> including one that allows re-sending... fucking nabble spammers ;)
[22:24] <rweeks> both are listed on the ceph.com resources page
[22:26] <benpol> And for the record no mention of such a product in the archive. Beginning to think it might just be vapor hardware.
[22:28] <nhm> benpol: bottom of page here: http://www.tweaktown.com/reviews/3941/viking_modular_satadimm_200gb_ssd_review/index2.html
[22:31] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[22:33] <benpol> nhm: The reviewer's puzzled too. Maybe Viking got hit by a patent troll.
[22:34] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:45] <dmick> woot, I'm a Steam for Linux beta tester
[22:47] <ircolle1> dmick: what game you going to play first? :-)
[22:47] <rweeks> Portal!
[22:47] <dmick> oh I'm a wimp. but yeah, probably
[22:47] <dmick> and other simple free games, when Portal chokes :)
[22:49] <Kioob> question, I was trying to start a fresh new RBD cluster, but can't fix some issues. When I try to start OSD, I obtain "cephx server osd.17: unexpected key: req.key=[...]", followed by "ERROR: osd init failed: (1) Operation not permitted#033[0m"
[22:53] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:53] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:53] * Leseb_ is now known as Leseb
[22:56] * Ryan_Lane (~Adium@ has joined #ceph
[22:56] <wer> in http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#adding-an-osd-manual is step 6 needed if the osd-num already exists in the config? Or is the config just assumed to also be consistent with any osd'd that have been created. What keeps the config in sync or what is it's use?
[22:57] <dmick> Kioob: so there's a key problem; have you investigated your keys?
[22:58] <Kioob> but... which keys ?
[22:58] * noob21 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[22:58] <Kioob> I have just done mkcephfs then "/etc/init.d/ceph start osd"
[22:58] <wer> There seems like a lot of ambiguity left up to a human... If I create new osd's do they need to be in the config at all? Or is the config only used by init?
[22:58] <dmick> Kioob: did you start the monitor(s)?
[22:58] <Kioob> yes
[22:59] <Kioob> === mon.a ===
[22:59] <Kioob> mon.a: running 0.48.2argonaut
[22:59] <dmick> so you've done mkcephfs, then ceph start mon, then ceph start osd?
[22:59] <Kioob> yes
[22:59] <dmick> so if mon started, and osd didn't, then it's a key used by the osd, and I'd look first at the monitor key
[23:00] <dmick> the osd talks to the monitor first and must do so with a key
[23:02] * chutzpah (~chutz@ Quit (Remote host closed the connection)
[23:02] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has left #ceph
[23:03] <Kioob> sorry, I didn't understand dmick. I'm looking in ceph.com/docs/, searching about what "keys" are, but...
[23:08] <via> has anyone else been having problems with osd's crashing under basically any load with the new release?
[23:08] <dmick> well you can read at http://ceph.com/docs/master/rados/operations/auth-intro/#ceph-authentication-cephx, but that may be more information than you want
[23:08] <Kioob> thanks dmick, it's a good starting point :)
[23:08] <wer> via I caught mine giving a really strange status.... but then suddenly they were fine a minute later....
[23:08] <via> oh... my ceph-osd processes actually die
[23:09] <via> i guess i'll get dumps from them. the mds's too
[23:09] <dmick> check the logs first via
[23:09] <via> i posted this mds dump last night: https://pastee.org/8sfny
[23:10] <via> it occurs whenever i restart the mds
[23:12] <alan_> Hi, i have this problem too, osd die
[23:12] <wer> so I shouldn't start ceph on a new node if all the OSD's have not been created using the manual method correct? Then make sure the config matches what I created? Then things are safe again?
[23:12] <via> dmick: here's a log https://pastee.org/f4dgd
[23:13] <via> any extra debug options that would help?
[23:15] <alan_> my dump very similar to via
[23:17] <alan_> via, how many servers you update? I 1/3, maybe remove failed osd and then add?
[23:17] <alan_> you try it?
[23:18] <via> alan_: i had so much trouble with auth related crashes, that combined with me getting parts for an extra node, i just started fresh from scratch
[23:18] <via> so all nodes are updated
[23:18] <via> i can add all the osd's back and they run until i start to actually use it for testing things, then they begin to crumble
[23:19] <via> but right now, ceph 0.55 is basically completely unusable
[23:19] <via> mds's crashed after a few minutes of rsyncing things, even took out my 2 standby's within seconds
[23:20] <via> osd's die similarly. with .54 the osd's were rather rock solid, mds's were still a bit iffy
[23:20] <alan_> hmm, my odd now work file, i readd him, i will try readd bad osds. I will post here results.
[23:20] <alan_> oi, my mds
[23:21] <dmick> via: that looks a lot like http://tracker.newdream.net/issues/3459, which was thought to be addressed; I've added a note
[23:22] <via> oh... a bug tracker. i should have looked
[23:22] <via> sorry
[23:22] <via> also, i pasted those with week time limits
[23:24] <alan_> dmick: yes, it is. my friend have this problem too.
[23:24] <dmick> alan_: yeah, sounds like it bears some investigation
[23:25] <dmick> it's definitely not unusable for everyone, via, we're testing it, but, yeah, sorry
[23:25] <via> no apologies necessary, i love the project
[23:25] <dmick> it's possible that turning off cephx would be a workaround for the moment; don't know if that's acceptable for you
[23:25] <alan_> dmick, what you think about read sod's?
[23:25] <via> i've been spending a lot of money on hardware for a personal setup
[23:26] <dmick> alan_: don't understand what you mean
[23:26] <alan_> dmick readd osd's
[23:26] <via> http://imgur.com/NYWvD <- is my bedroom. all for ceph
[23:26] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[23:27] <dmick> that's sick^H^H^H^Hawesome dude
[23:27] <janos> excellent via
[23:27] <janos> sloppy cables though!
[23:27] <janos> ;)
[23:27] <via> yeah, i really need to redo it
[23:27] <via> but i'm still building
[23:27] <via> and i'm pretty messy
[23:28] <rweeks> holy cow
[23:28] <janos> yeah. i need to tear mine down
[23:28] <janos> it's become a mess over time
[23:28] <nhm> what, no rapidraids? ;)
[23:28] <nhm> rails rather
[23:28] <janos> i have a ton here at home if you need!
[23:29] <via> i am using 2post machines so i wouldn't need rails, cause shit they're expensive
[23:29] <janos> i wish i had gotten square-hole posts at teh time
[23:29] <janos> mine are threaded
[23:30] <nhm> via: here's my setup: http://ceph.com/wp-content/uploads/2012/09/SC847a.jpg
[23:30] <janos> easier to go from square-to-thread than the other way around ;(
[23:30] <rweeks> via, what are you going to do with that when it's built?
[23:30] <via> nhm: holy crap
[23:30] <rweeks> aside from pay your power bill
[23:30] <janos> haha man that's a party in a box right there
[23:30] <via> they're amd apu's, 9Watts a piece
[23:30] <via> the goal was cheap pernode cost
[23:31] <rweeks> ahh
[23:31] <janos> via - are they pre-made?
[23:31] <via> janos: nope
[23:31] <via> its a tough fit
[23:31] <janos> i'm digging out my crawlspace into a basement - intend to move equipment there once i have it properly vapor-barrier'd, dust-free etc
[23:31] <via> i'm using one huge loud server with raid right now, and its a pita to expand
[23:31] <via> thats my main reason
[23:32] <janos> i drove myself nuts with rack machines in my home office
[23:32] <via> plus, i work at rackspace and want to eventually push for us to use it
[23:32] <janos> never again
[23:32] <dmick> Here in LA we can get Pernod at relatively low cost, $26/750ml
[23:32] <janos> lol
[23:32] <via> it being ceph
[23:32] <janos> yeah
[23:32] <nhm> janos: nice! how are you laying the slab/block?
[23:32] * jjgalvez1 (~jjgalvez@ has joined #ceph
[23:32] <janos> nhm: i'm not that far yet!
[23:33] <janos> i still want to remove/replace all sorts of supports
[23:33] <rweeks> dmick: there's a lovely absinthe made here in the bay area
[23:33] <janos> it's slow-going. that clay is HEAVY
[23:33] <janos> i'll end up doing a fair traditional slab i imagine - wire mesh, etc
[23:35] <janos> i still have a long way to go - access in/out still completely stinks and i have some structure work to do before i could think about getting something like a dingo in there
[23:35] <janos> so it's shovels, pick-axe, buckets, wheelbarrows and friends right now
[23:35] <janos> then i'll likely rewire and re-plumb most of the house
[23:35] <janos> the work under there looks like a spider web. horrible
[23:36] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[23:37] <dmick> yehudasa, maybe you could advise wer?
[23:38] * chutzpah (~chutz@ has joined #ceph
[23:38] <wer> yehuda_hm: I am sort of not getting it :) And actually my attempts to create a new osd on the new node are failing... and I am thinking my osd naming is problematic as well.
[23:39] <wer> yehudasa: I mean sorry.
[23:40] <infernix> oh joy. HP ILO is a trainwreck
[23:40] * infernix starts internet explorer
[23:40] <dmick> infernix: I'm so sorry
[23:41] <dmick> I don't know why *every* LOM device has to suck
[23:41] <infernix> supermicro is fine
[23:42] <infernix> java based, and they have a standalone java app that works
[23:42] <janos> oooh my new workstation parts just came in. supermicro server board ;)
[23:42] <infernix> including virtual media
[23:42] <infernix> i suppose I'll add ubuntu 12.04 to cobbler instead
[23:42] <infernix> fffff internet explorer
[23:42] <janos> haha
[23:44] <alan_> yes! it's word "osdmap e1572: 15 osds: 15 up, 15 in" and start rebuilding "degraded (9.986%)"
[23:46] <dmick> java. /me spits
[23:46] <wer> is osd.010 equivalent to osd.10 ?
[23:46] <alan_> i find where problem, after update ceph lost osd's from crush map . http://pastebin.com/drh3h2Er
[23:48] * chutzpah (~chutz@ Quit (Quit: Leaving)
[23:52] <dmick> alan_: cool
[23:57] <Kioob> dmick : so, by disabling ceph auth it works :D but it's not really what I want :p
[23:57] <dmick> Kioob: yeah, understood
[23:58] <Kioob> but any idea why after "mkcephfs" privileges are not "good enough" to make OSD talk with MON ?
[23:59] <dmick> other than that there's something wrong with the keys, no
[23:59] <alan_> dmick http://pastebin.com/Fi104zPU after update i have this error, it's my bad? where i can setup fs_type?

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.