#ceph IRC Log


IRC Log for 2013-04-12

Timestamps are in GMT/BST.

[0:00] <hox> dmick: logs at paste.ubuntu.com/5699904
[0:03] <dmick> b64 is supposed to literally match auth_sign, so that's the immediate problem
[0:04] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:04] <hox> b64=auth_sign for GET requests but not for PUT
[0:04] <dmick> right.
[0:04] <dmick> so ceph simply disagrees that the PUT is signed correctly
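Editor's sketch of the signature comparison discussed above, assuming radosgw follows the publicly documented AWS S3 "signature version 2" scheme: the server rebuilds a string to sign from the request, computes a base64 HMAC-SHA1 with the secret key, and compares it to the signature the client sent. The function and argument names are illustrative and are not taken from the radosgw source.

```python
import base64
import hashlib
import hmac


def sign_v2(secret_key, method, resource, date,
            content_md5="", content_type="", amz_headers=""):
    """Return the base64 HMAC-SHA1 signature for an S3 v2-style request.

    amz_headers is the already canonicalized x-amz-* block
    (each header as "name:value\n"), or "" when there are none.
    """
    string_to_sign = (
        "\n".join([method, content_md5, content_type, date])
        + "\n" + amz_headers + resource
    )
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


if __name__ == "__main__":
    # Made-up credentials and request, for illustration only.
    date = "Fri, 12 Apr 2013 00:00:00 GMT"
    client = sign_v2("not-a-real-secret", "PUT", "/bucket/key", date)
    # If the server sees a Content-Type the client did not sign, the two differ.
    server = sign_v2("not-a-real-secret", "PUT", "/bucket/key", date,
                     content_type="application/octet-stream")
    print(client == server)
```

A mismatch that shows up only on PUT is consistent with one side including a field (for example Content-Type or Content-MD5) that the other left out of the string to sign, though the log does not confirm the cause here.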
[0:05] <dmick> is the client known good?
[0:05] <noahmehl> i'm having a bit of trouble expanding my cluster
[0:05] <hox> "known good" meaning?
[0:05] <noahmehl> ceph osd crush set 3 1.0 pool=default
[0:06] <noahmehl> is returning: (22) Invalid argument
[0:06] <dmick> hox: does PUT work with this client on other S3 services or other objects
[0:06] <hox> using python client
[0:06] <dmick> "in other situations"
[0:07] <noahmehl> oh
[0:07] <noahmehl> needed root=default
[0:08] <dmick> yep. just saw a bug to fix the docs on that
[0:08] * vanham (~vanham@ Quit (Ping timeout: 480 seconds)
[0:08] <dmick> hox: so the first obvious thing to suspect is that the client is doing the wrong thing on PUT. Do you know that it is not?
[0:12] <dmick> Ah, this is Boto/2.8.0, I see
[0:14] <dmick> yehuda_hm: anything jump out at you from paste.ubuntu.com/5699904?
[0:14] * vata (~vata@2607:fad8:4:6:98b0:7f26:d620:3c4) Quit (Quit: Leaving.)
[0:14] <dmick> I'm a bit surprised to see only "PUT" in the auth_hdr:
[0:17] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[0:20] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[0:20] * tserong (~tserong@124-171-119-73.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[0:27] * BillK (~BillK@58-7-53-224.dyn.iinet.net.au) has joined #ceph
[0:31] * mcclurmc_laptop (~mcclurmc@82-69-96-152.dsl.in-addr.zen.co.uk) Quit (Read error: Operation timed out)
[0:31] * BManojlovic (~steki@fo-d- Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:33] * tnt (~tnt@228.204-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:35] * KindTwo (~KindOne@h140.19.131.174.dynamic.ip.windstream.net) has joined #ceph
[0:39] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[0:39] * KindTwo is now known as KindOne
[0:39] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:42] <mrjack> " Running GFS or OCFS on top of RBD will not work with caching enabled."
[0:42] <mrjack> why is that?
[0:43] <gregaf1> to start, that's discussing the userspace rbd caching, not your regular linux page cache etc (just in case that's not clear)
[0:43] <gregaf1> it won't work because systems like that generally issue a flush to get the data out to the drive, where it can be read by another user
[0:44] <gregaf1> notice that "to the drive" means "in the drive's cache", not "safe on disk"
[0:44] <gregaf1> in RBD with caching that flush gets the data into the local RBD cache, so another node who does a read won't see your updates
[0:44] <gregaf1> without caching, a flush gets it to the OSD where it can be read by other nodes
[0:49] <dmick> I *think*, but I'm not sure, that you can flush writes to the cluster, but I think the problem is likely to be that there's no "notify other read caches that they are now stale" callback
[0:52] <mrjack> ah ok and ocfs2 has ondisk heartbeat so this would be a bad idea
[0:52] <mrjack> i have problems with ocfs2 ontop of rbd
[0:53] <gregaf1> anyway, it should work without the cache enabled
[0:53] <mrjack> i have it setup and use it without cache
[0:53] <gregaf1> not that we've tried it ourselves
[0:54] <mrjack> i experience that nodes reboot (ocfs2 fencing) when there is no io from rbd
[0:54] <lurbs> I gave up on GFS/OCFS and ended up running Gluster inside a bunch of RBD backed KVM instances.
[0:54] <gregaf1> …you're kidding
[0:54] <mrjack> i also have gluster running but gluster has locking issues
[0:54] <mrjack> i only use gluster for mail
[0:54] <dmick> yikes, ocfs is a "wear out your disk" filesystem then? :)
[0:55] <gregaf1> mrjack: ah, ocfs2 might be latency sensitive in ways we aren't respecting well enough
[0:55] <mrjack> gregaf1: yeah
[0:55] <mrjack> gregaf1: on high load
[0:55] <dmick> perhaps that can be tuned?
[0:55] <mrjack> i switched to deadline scheduler
[0:55] <dmick> (in ocfs I mean)
[0:55] <mrjack> there is another problem i experienced
[0:55] <mrjack> i had setup about 100 kvms with rbd
[0:55] <mrjack> once debians default cronjobs kicked in all at the same time it killed ocfs2
[0:56] <lurbs> Good ol' 6:25 am. :)
[0:56] <mrjack> that one is lightweight
[0:56] <mrjack> there is the weekly one doing find . or so
[0:56] <mrjack> so i saw ocfs2 nodes reboot every 7 days
[0:56] <lurbs> locate's updatedb, maybe?
[0:57] <mrjack> yeah
[0:57] <lurbs> Tried mlocate instead?
[0:57] <mrjack> but this problem was not with 0.48.3
[0:57] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:57] <mrjack> lurbs: i cannot set nor predict what the user would do so this would be no solution
[0:57] <lurbs> You don't control the OS inside the VM's?
[0:57] <mrjack> i am thinking about getting 10GE
[0:58] <mrjack> lurbs no
[0:58] <lurbs> Ah, we do.
[0:58] <mrjack> i have control on some
[0:58] <mrjack> but most are customers kvms
[0:58] <mrjack> i have a 7 node cluster running...
[0:58] <gregaf1> mrjack: sorry, this worked on .48.3 and doesn't on bobtail?
[0:59] <mrjack> gregaf1: cannot tell for sure but i think yes..
[0:59] <mrjack> http://www.smart-weblications.com/vserver/status/
[1:01] <gregaf1> how odd
[1:01] <mrjack> ?
[1:02] <gregaf1> that it apparently got worse in the upgrade
[1:02] <mrjack> well i did not experience it before 0.54
[1:02] * rustam (~rustam@ Quit (Remote host closed the connection)
[1:02] <mrjack> but my load also increased
[1:02] <mrjack> so i cannot tell for sure
[1:02] <lurbs> mrjack: If they're customer VMs, have you tried using the various IO throttling options for qemu-kvm to prevent a single client from DOS'ing the cluster?
[1:03] <lurbs> Not saying the cluster itself shouldn't be tuned to survive as much as possible, of course.
[1:03] <mrjack> lurbs: no i have not yet, i tried with various cache settings and thats where i saw that i could not use cache with ocfs2
[1:03] <mrjack> i'll have a look
[1:04] <mrjack> but i think the problem is not bandwidth.. it is the high amount of small io i think
[1:04] <gregaf1> oh, it's probably load then; there's still a pretty steep cliff
[1:05] <lurbs> http://libvirt.org/formatdomain.html <-- iotune, and blkiotune.
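Editor's sketch of the iotune idea lurbs points to: libvirt's domain XML lets a disk carry an <iotune> element with limits such as total_bytes_sec and total_iops_sec. The snippet below only builds that fragment; the device name and the limit values are made up, and a real RBD disk definition needs more (source, driver, auth) than is shown here.

```python
import xml.etree.ElementTree as ET


def disk_iotune_xml(target_dev, total_bytes_sec, total_iops_sec):
    """Build a minimal <disk> fragment carrying per-disk I/O limits."""
    disk = ET.Element("disk", {"type": "network", "device": "disk"})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    iotune = ET.SubElement(disk, "iotune")
    # Throughput cap in bytes per second and a cap on operations per second.
    ET.SubElement(iotune, "total_bytes_sec").text = str(total_bytes_sec)
    ET.SubElement(iotune, "total_iops_sec").text = str(total_iops_sec)
    return ET.tostring(disk, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical limits: 50 MB/s and 500 IOPS for one customer disk.
    print(disk_iotune_xml("vda", 50 * 1000 * 1000, 500))
```

Since mrjack suspects small I/O rather than bandwidth, the IOPS limit is probably the more relevant of the two knobs.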
[1:05] <mrjack> thanks
[1:06] <mrjack> i find it interesting
[1:06] <mrjack> if i take an osd out
[1:07] <mrjack> and add it again (Reboot a node)
[1:07] <mrjack> i get about 200 - 300 MB/s recovery speed
[1:07] <mrjack> but when i try dd inside kvm with rbd i get max 70 MB/s
[1:07] * rustam (~rustam@ has joined #ceph
[1:08] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[1:08] * rustam (~rustam@ Quit (Remote host closed the connection)
[1:09] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[1:09] <gregaf1> recovery is much more parallel than a default dd is going to be
[1:10] <mrjack> why? the client could send to multiple osds in parallel?
[1:12] <gregaf1> it can but a dd workload won't, generally speaking
[1:13] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[1:13] <gregaf1> you might try running more than one dd at a time and seeing if you get more than 70MB/s, for instance
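Editor's sketch of gregaf1's suggestion: run several sequential writers at once and compare the aggregate rate with a single writer. This is a stand-in for running multiple dd processes, not a benchmark tool; the sizes are tiny placeholders and the target directory is a temp dir rather than an RBD-backed filesystem.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_stream(path, total_bytes, block_size=1 << 20):
    """Write total_bytes of zeros sequentially (like one dd), then fsync."""
    block = b"\0" * block_size
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())
    return written


def run_streams(directory, streams, bytes_per_stream):
    """Run several writers concurrently; return (total bytes, elapsed seconds)."""
    paths = [os.path.join(directory, "stream%d.bin" % i) for i in range(streams)]
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        totals = list(pool.map(lambda p: write_stream(p, bytes_per_stream), paths))
    return sum(totals), time.monotonic() - start


if __name__ == "__main__":
    for streams in (1, 4):
        with tempfile.TemporaryDirectory() as d:
            total, secs = run_streams(d, streams, 8 << 20)
            print(streams, "stream(s):", round(total / max(secs, 1e-9) / 1e6, 1), "MB/s")
```

If the aggregate rate with four streams is well above the single-stream rate, the 70 MB/s ceiling is more likely a per-stream latency limit than a cluster limit.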
[1:15] * tserong (~tserong@124-168-229-104.dyn.iinet.net.au) has joined #ceph
[1:16] <mrjack> i would like to use cephfs directly..
[1:18] <gregaf1> and we'd like to be able to tell you to ;)
[1:18] <mrjack> can i reformat the cephfs somehow if it messes up? i tried it a while ago, extracted kernel source, tried to delete the source and got stuck with files and directory which could not be removed anymore
[1:19] <pioto> iggy: to hijack another channel... context for 9p is... trying to use cephfs, w/o having to leak ceph credentials and such to a virtual machine (so that, if the machine gets rooted, they probably can't mess with any other parts of the filesystem)
[1:20] <iggy> 9p on top of cephfs (also experimental)... yikes
[1:20] <pioto> yeah
[1:20] <pioto> well.
[1:20] <iggy> does rbd not do it for you?
[1:20] <pioto> i don't have a better solution that fits my needs so far
[1:20] <pioto> i am using rbd too
[1:20] <pioto> but i also need a shared fs
[1:21] <pioto> the virtual machine is booting off of rbd, and that works fine
[1:21] * Forced (~Forced@ has joined #ceph
[1:21] <iggy> samba-ceph?
[1:21] <pioto> but if i want 2 virtual machines to be able to appear to have, say, the same home dir...
[1:21] <mrjack> ocfs2 ;)
[1:21] <pioto> hm? wazzat? just samba pointed to a cephfs mount? or something else?
[1:22] <iggy> i'm not sure what level it talks to ceph at
[1:22] <iggy> (objects or fs)
[1:22] <iggy> i don't have the link handy
[1:22] <pioto> do you have a link?
[1:22] <pioto> heh, ok
[1:22] * hox (~hox@ Quit (Ping timeout: 480 seconds)
[1:22] <pioto> i see there's https://github.com/ceph/samba ...
[1:22] <iggy> it's one of the git trees on ceph's github
[1:23] * ivotron (~ivo@dhcp-59-180.cse.ucsc.edu) Quit (Ping timeout: 480 seconds)
[1:23] <iggy> that, yeah
[1:23] <pioto> well. hm. i guess i'll see if there are any docs hidden in there
[1:23] <pioto> or i guess, just read this: https://github.com/ceph/samba/commit/e8e41e716b4dc2499019c1b7a03e43f1fb9c37e8
[1:23] <pioto> seems it's the main change there
[1:24] <iggy> (i'm on my phone, not exactly the best/quickest source of info)
[1:24] <pioto> it seems to use libcephfs
[1:24] <pioto> so i assume it's fs-layer
[1:24] <iggy> i would also
[1:25] * alram (~alram@ Quit (Quit: leaving)
[1:25] <pioto> hm. now... i wonder if i can do something crazy, like tunnel that over virtio :)
[1:26] <pioto> 9p also has the advantage of letting me set up things entirely within the domains' xml file, so i can spin up an image and it doesn't know/care which mountpoint to grab
[1:29] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[1:35] <Elbandi_> dmick: so if i have lots of small random reads, i have to set the stripe size small?
[1:35] <iggy> cifs+virt-net+vhost is going to be miles ahead of 9p
[1:36] <dmick> Elbandi_: or have some cache in the way, like, say, librbd's cache
[1:36] <iggy> i meant streets ahead
[1:37] <dmick> although it tries to cache only what's read, so if it's truly random, that might not help as much as you might like either
[1:38] <pioto> iggy: do you mean in terms of throughput? stability? or both?
[1:39] <Kioob> Hi
[1:40] <Elbandi_> we use cephfs to serve video files
[1:40] <iggy> pioto: both
[1:40] <Kioob> I'm trying to copy RBD image (format v1) from one pool to another pool, so I use "rbd export | rbd import". But on an image I obtain (after 20 minutes) : rbd: export error: (-2147483648) Unknown error 18446744071562067968
[1:41] <Kioob> is there a way to find what's happening ?
[1:42] * rustam (~rustam@ has joined #ceph
[1:43] * rustam (~rustam@ Quit (Remote host closed the connection)
[1:47] * jakku (~jakku@ad046161.dynamic.ppp.asahi-net.or.jp) has joined #ceph
[1:48] <dmick> Kioob: that's a 64-bit confusion; joshd just put back the fix. I think the image moved successfully IIRC
[1:50] * amatter (~oftc-webi@ Quit (Remote host closed the connection)
[1:51] <Kioob> great !
[1:55] <dmick> sorry, not actually fixed yet, but cause known, fix coming shortly
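Editor's sketch of the "64-bit confusion" in Kioob's error message: the two numbers in the message are the same bit pattern read two ways. The arithmetic below is certain; the guess about where the value came from (a 32-bit quantity wrapping at 2 GiB) is only one plausible reading and is not confirmed in the log.

```python
def as_unsigned64(value):
    """Reinterpret a (possibly negative) integer as an unsigned 64-bit value."""
    return value & 0xFFFFFFFFFFFFFFFF


def as_signed32(value):
    """Reinterpret the low 32 bits of an integer as a signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    reported_code = -2147483648               # from "export error: (-2147483648)"
    reported_unknown = 18446744071562067968   # from "Unknown error ..."
    # The negative code, sign-extended and printed as unsigned 64-bit:
    print(as_unsigned64(reported_code) == reported_unknown)
    # 2 GiB does not fit in a signed 32-bit integer and wraps to the same code:
    print(as_signed32(2 ** 31) == reported_code)
```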
[1:55] <dmick> I'd message-digest src and dest to be sure, but I think it's probably OK
[1:56] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[2:02] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[2:07] <Kioob> yes, that's what I'm doing
[2:07] <Kioob> thanks dmick
[2:07] <Kioob> (not sure how it will work with "holes")
[2:07] <dmick> shouldn't matter to the digest, but the export/import might have lost some sparseness
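Editor's sketch of the digest check dmick recommends, assuming both images have been exported to ordinary files. It hashes each file in chunks and compares the results; the demo also shows why holes do not affect the digest, since a sparse region reads back as zeros. SHA-256 and the file names are the editor's choices, not something named in the log.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=4 * 1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def demo_sparse_vs_dense(hole_size=1 << 20):
    """Write the same logical content with and without a hole; compare digests."""
    with tempfile.TemporaryDirectory() as d:
        sparse = os.path.join(d, "sparse.img")
        dense = os.path.join(d, "dense.img")
        with open(sparse, "wb") as f:
            f.seek(hole_size)          # skip ahead, leaving a hole if the FS allows
            f.write(b"data")
        with open(dense, "wb") as f:
            f.write(b"\0" * hole_size + b"data")   # explicit zeros, no hole
        return file_digest(sparse) == file_digest(dense)


if __name__ == "__main__":
    print(demo_sparse_vs_dense())
```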
[2:08] * rustam (~rustam@ has joined #ceph
[2:10] * rustam (~rustam@ Quit (Remote host closed the connection)
[2:12] <mrjack> i had this issue few days ago
[2:12] <mrjack> did diff on export and checksum was ok
[2:13] <Kioob> ok
[2:21] * rustam (~rustam@ has joined #ceph
[2:22] <mrjack> in ceph -w i can see 1173KB/s wr, 95op/s - what op/s is meant?
[2:22] <mrjack> can i see how much was read?
[2:22] <gregaf1> that's ops/s, and you should see read rates as well
[2:22] <mrjack> i only see wr
[2:23] <mrjack> ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
[2:23] * rustam (~rustam@ Quit (Remote host closed the connection)
[2:23] <mrjack> 2013-04-12 02:22:57.443784 mon.0 [INF] pgmap v14506419: 768 pgs: 768 active+clean; 819 GB data, 2458 GB used, 3714 GB / 6359 GB avail; 137KB/s wr, 25op/s
[2:23] <gregaf1> hmm, maybe it was left out of bobtail
[2:23] <gregaf1> it shows up in what-will-be-cuttlefish :)
[2:24] * xiaoxi (~xiaoxi@shzdmzpr01-ext.sh.intel.com) has joined #ceph
[2:24] <gregaf1> or maybe you just don't have any reads; it won't display if the cluster isn't serving any (within the window of time currently being reported; it's only a broad average really)
[2:24] <dmick> looks like op/s is the sum of both, but yeah, should have KB/s (which should be kB/s)
[2:24] <gregaf1> two subsequent lines of mine are:
[2:24] <gregaf1> 2013-04-11 17:13:09.722266 mon.0 [INF] pgmap v8: 24 pgs: 24 active+degraded; 86025 KB data, 93738 MB used, 727 GB / 863 GB avail; 0B/s rd, 8861KB/s wr, 5op/s; 42/84 degraded (50.000%)
[2:24] <gregaf1> 2013-04-11 17:13:17.769846 mon.0 [INF] pgmap v9: 24 pgs: 24 active+degraded; 176 MB data, 93822 MB used, 727 GB / 863 GB avail; 14084KB/s wr, 3op/s; 65/130 degraded (50.000%)
[2:25] <gregaf1> notice that the first one has "0B/s rd" before the wr/s and the second does not
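Editor's sketch that pulls the rd, wr and op/s figures out of pgmap lines like the two gregaf1 pasted, returning None for a field that is absent. The regular expressions are fitted to those sample lines only, and treating KB as 1024 bytes is an assumption.

```python
import re

UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
RATE_RE = re.compile(r"(\d+)\s*(B|KB|MB|GB)/s (rd|wr)")
OPS_RE = re.compile(r"(\d+)\s*op/s")

# The two lines quoted in the log above.
SAMPLE_WITH_RD = ("pgmap v8: 24 pgs: 24 active+degraded; 86025 KB data, "
                  "93738 MB used, 727 GB / 863 GB avail; 0B/s rd, 8861KB/s wr, "
                  "5op/s; 42/84 degraded (50.000%)")
SAMPLE_WITHOUT_RD = ("pgmap v9: 24 pgs: 24 active+degraded; 176 MB data, "
                     "93822 MB used, 727 GB / 863 GB avail; 14084KB/s wr, "
                     "3op/s; 65/130 degraded (50.000%)")


def parse_pgmap_rates(line):
    """Return {'rd': bytes/s or None, 'wr': bytes/s or None, 'ops': int or None}."""
    rates = {"rd": None, "wr": None, "ops": None}
    for amount, unit, direction in RATE_RE.findall(line):
        rates[direction] = int(amount) * UNIT_BYTES[unit]
    ops = OPS_RE.search(line)
    if ops:
        rates["ops"] = int(ops.group(1))
    return rates


if __name__ == "__main__":
    print(parse_pgmap_rates(SAMPLE_WITH_RD))
    print(parse_pgmap_rates(SAMPLE_WITHOUT_RD))
```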
[2:25] <mrjack> i never had "x B/s rd"
[2:28] <Kioob> never seen "B/s rd" too
[2:37] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[2:42] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[2:43] * chutzpah (~chutz@ Quit (Quit: Leaving)
[2:44] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[2:45] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[2:45] <tserong> is there an equivalent to ceph-mon-all-starter etc. for non-upstart systems?
[2:49] <tserong> i'm fiddling with https://github.com/ceph/ceph-cookbooks on openSUSE, and AFAICT the config deployed relies on being able to start
[2:49] <tserong> mons etc. based on their existence in /var/lib/ceph/mon
[2:49] <tserong> not on having [mon.a] etc. in ceph.conf.
[2:49] <dmick> I'm not certain but I think /etc/init.d/ceph tries to do similar things to all the upstart jobs
[2:49] <dmick> I'm actually going through all this stuff now
[2:50] <dmick> ceph-deploy, at least, attempts to mark things similarly for sysvinit as it does for upstart, leading me to believe that init.d/ceph can do conditional things based on the state of the disk
[2:50] <tserong> hrm
[2:51] <tserong> when i was looking at this last night, the init script seemed to just bail out if there weren't [mon.a] etc. sections in the conf
[2:51] <gregaf1> iirc the ceph-cookbooks are all upstart-only right now; I think generic init system support is the next thing up on that agenda
[2:51] <tserong> and i don't think it was looking at what's on disk, but might have missed something
[2:52] <gregaf1> tserong: work on being init-system-aware and agnostic is all very new so if you're looking at bobtail you're probably not seeing it
[2:52] <dmick> yeah....ceph deploy is marking the disk but I don't think init.d/ceph is consuming it yet
[2:53] <tserong> gregaf1, ah, ok thanks. i am looking at bobtail
[2:54] <tserong> here's some notes on how far i got yesterday (not very far) if anyone's interested
[2:54] <tserong> http://ourobengr.com/2013/04/the-ceph-chef-experiment/
[2:55] <gregaf1> dmick: sysvinit is aware in (at least) next; see src/ceph_common.sh and grep for sysvinit
[2:57] <dmick> oh! look at that
[2:57] * dpippenger (~riven@ Quit (Quit: Leaving.)
[2:58] <gregaf1> but like I said, very new
[2:58] <tserong> ooh, it's doing get_local_daemon_list first in get_local_name_list
[2:59] <gregaf1> looks like it went in for v0.58
[2:59] <dmick> I think this *might* be the fifteenth time I've missed something in ceph_common.sh while looking in init-ceph.in. /me smacks self in temple
[2:59] <gregaf1> at least that piece of it
[2:59] <gregaf1> yeah, I'm not a fan of our sysvinit
[2:59] <gregaf1> Upstart is my friend, though as with Juju it's entirely possible that's because I haven't had to manage it anywhere for real
[3:05] * DarkAceZ (~BillyMays@ Quit (Quit: Run away!)
[3:12] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) has joined #ceph
[3:15] * diegows (~diegows@ has joined #ceph
[3:20] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[3:20] * DarkAceZ (~BillyMays@ has joined #ceph
[3:21] * DarkAceZ (~BillyMays@ Quit (Max SendQ exceeded)
[3:27] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[3:29] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[3:35] * DarkAceZ (~BillyMays@ has joined #ceph
[3:40] * rustam (~rustam@ has joined #ceph
[3:43] * rustam (~rustam@ Quit (Remote host closed the connection)
[3:48] <Psi-jack> Hmmm
[3:48] * nhm__ (~nh@65-128-150-185.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[3:49] <Psi-jack> Now, I'm starting to see some of the flaws in my ceph infrastructure logic, now that I want to actually use it more and more and more.. I kept it all in a closed loop on its own dedicated network, and now I want to start using it more because of how awesome it is.. LOL
[3:50] <Psi-jack> So, now I need to figure out how to route between the two, just so I can get to those 3 mons. heh
[3:51] <Psi-jack> That's all the clients need, yes? The ability to communicate with the mons?
[3:51] <dmick> no. they have to talk to the OSDs as well
[3:52] <Psi-jack> hmmm.
[3:52] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:52] <dmick> that's one of the big advantages, is that clients *can* talk to the OSDs directly, so they avoid the "master server" bottleneck
[3:52] <Psi-jack> Right.
[3:52] <dmick> now, they'll talk on the public net, so if you have a public/cluster setup, only the OSDs need to be on the cluster net
[3:52] <dmick> but they also need to be on the public net
[3:53] <Psi-jack> Hmmm, yeah, I set them up to be totally on the cluster net only. :/
[3:54] <lurbs> It's pretty easy to split out. I have the cluster net as a direct-cabled 10 Gb ring, and the public's 1 Gb switched.
[3:54] <lurbs> Hoping to get 10 Gb for production, eventually.
[3:54] <Psi-jack> Heh, is that possible to change in the configuration out to add in the public network?
[3:55] <dmick> error: syntax error parsing question
[3:55] <Psi-jack> heh, funny thing is, I /did/ set it up in the ceph.conf, to have both cluster network and public network
[3:56] <dmick> those look remarkably similar to me
[3:57] <Psi-jack> dmick: I'm asking if I could change the public network setting in each of the ceph servers' configuration files without adverse effects. Like, do it on host ceph1, restart each osd, mon, mds, wait for it to get back healthy, then continue to ceph2's configuration changes and restarting, etc.
[3:58] <dmick> I dunno. makes my head hurt. probably easier if the two nets can route to each other at least while you're transitioning, but you probably guessed that.
[3:58] <Psi-jack> heh
[3:58] <dmick> but: I'm heading out; gl whatever you do
[3:59] <Psi-jack> Hmm yeah, I noticed I also specifically set my mon addr's on all the mon servers, too. :/
[3:59] * portante|lt (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[4:03] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[4:08] * vanham (~vanham@ has joined #ceph
[4:23] <Psi-jack> Hmm, doing mount.ceph /mnt/ceph/ -o name=client.admin,secretfile=/etc/ceph/keyring.admin, I'm getting secret is not valid base64: Invalid argument.
[4:24] <Psi-jack> yet, the keyring.admin is the exact same keyring from the ceph servers themselves. Why would it be failing this?
[4:25] <Psi-jack> If I just directly provide it as secret, instead, I just get "mount error 1 = Operation not permitted"
[4:26] <vanham> Psi, not a ceph developer but
[4:26] <vanham> The only way I know of to get an invalid base64 string is to have something whose length is not a multiple of 4 bytes
[4:26] <vanham> There is no other formatting on base64 stuff
[4:27] <vanham> Nor on the secret
[4:27] <vanham> For the operation not permitted, do you see ceph listed at /proc/filesystems?
[4:28] <lurbs> Psi-jack: Does it contain the "[client.admin]" and "key = " or just the base64 part?
[4:29] <Psi-jack> Yes. nodev ceph
[4:30] <Psi-jack> lurbs: The [client.admin] and key
[4:31] <lurbs> Have you tried stripping that out, and using a file containing just the base64 part?
[4:32] <Psi-jack> Just did, and that works. hehe
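Editor's sketch of what lurbs's suggestion amounts to: a secretfile has to contain only the base64 key, so a full keyring (section header plus "key = ...") fails base64 validation. The helper below extracts the key for one entity from keyring text; the keyring content is generated on the spot and is not a real Ceph key, and the parser handles only the simple layout described in the log.

```python
import base64
import binascii


def extract_key(keyring_text, entity="client.admin"):
    """Return the value of 'key' from the [entity] section of keyring text."""
    section = None
    for raw in keyring_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == entity and "=" in line:
            name, value = line.split("=", 1)   # split once; the key may end in '='
            if name.strip() == "key":
                return value.strip()
    raise KeyError(entity)


def is_valid_base64(text):
    """True if text is strictly valid base64 (no stray characters)."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except binascii.Error:
        return False


# Placeholder secret, clearly not a real key.
FAKE_KEY = base64.b64encode(b"not-a-real-ceph-key!").decode()
FAKE_KEYRING = "[client.admin]\n\tkey = %s\n" % FAKE_KEY

if __name__ == "__main__":
    print(is_valid_base64(FAKE_KEYRING))               # whole keyring: rejected
    print(is_valid_base64(extract_key(FAKE_KEYRING)))  # bare key: accepted
```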
[4:32] <Psi-jack> I'm just still getting Operation not permitted.
[4:32] <vanham> what does dmesg|tail say?
[4:32] <vanham> any tip?
[4:33] <xiaoxi> dmick: are you around?
[4:33] <Psi-jack> hmmm... Ohhh. libceph: auth method 'x' error -1
[4:33] <vanham> rssss
[4:34] <vanham> any logs at the mds daemon?
[4:34] <vanham> it is a problem with your key, the secret is not being accepted and I don't know anything about cephx but I would try there
[4:36] <Psi-jack> The ceph-mds itself? Hmmm. Not seeing it showing anything.
[4:36] <Psi-jack> Ahh, wait a sec. :)
[4:37] <vanham> oh good, ceph.log shows too much stuff
[4:37] <vanham> it would be easier if the problem were shown at ceph-mds.0.log
[4:38] <Psi-jack> yeah., I'm doing a journalctl to filter to just ceph-mds
[4:38] <Psi-jack> So, yeah, nothing.
[4:38] <vanham> oh...
[4:39] <vanham> 1 sec
[4:41] <vanham> what is your ceph auth list showing?
[4:42] <vanham> send on pastebin plz
[4:42] <lurbs> Psi-jack: Tried "name=admin", not "name=client.admin"?
[4:42] <Psi-jack> heh.
[4:42] <Psi-jack> Ahhh.. That made a difference lurbs
[4:43] <Psi-jack> mount: error writing /etc/mtab: Invalid argument
[4:43] <Psi-jack> Cept that. LOL
[4:43] <lurbs> A new and different error!
[4:43] <vanham> ignore that, it is working already
[4:43] <Psi-jack> Yeah.
[4:43] <tserong> Psi-jack, you can also do something like mount -t ceph -o name=admin,secret=$(ceph-authtool --name client.admin /etc/ceph/ceph.keyring --print-key) ......
[4:43] <Psi-jack> mtab's a link to /proc/self/mounts
[4:43] <tserong> which will pull the secret out of the keyring file
[4:43] <tserong> it's a bit verbose though :)
[4:43] <Psi-jack> tserong: Hmmm, interesting approach. :)
[4:44] <tserong> dunno if it's an especially good idea or not :)
[4:44] <vanham> psi, run mount, it is probably mounted already
[4:44] <Psi-jack> haha
[4:44] <Psi-jack> vanham: Yep, it is. :)
[4:44] <vanham> yey! good!
[4:44] <tserong> i had the "mount: error writing /etc/mtab: Invalid argument" error too, JFYI, but the FS did still mount regardless
[4:44] <vanham> I have to learn a little bit more of cephx
[4:44] <Psi-jack> pretty cool. Seems I was able to get my static route to work as expected and very easily.
[4:45] <lurbs> Psi-jack: I recall hitting the same sort of problem when setting up cephx auth in libvirt.
[4:45] <vanham> actually I have to learn cephx
[4:45] <Psi-jack> lurbs: Heh.
[4:45] <lurbs> With the username needing 's/client\.//', that is.
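Editor's sketch of the naming gotcha that fixed the mount: the keyring entity is "client.admin", but the mount option wants the name without the "client." prefix. The helper name is hypothetical and only assembles the option string quoted in the log.

```python
import re


def cephfs_mount_options(entity, secret):
    """Build 'name=...,secret=...' with the 'client.' prefix stripped."""
    name = re.sub(r"^client\.", "", entity)
    return "name=%s,secret=%s" % (name, secret)


if __name__ == "__main__":
    # Placeholder secret; a real one would come from the keyring.
    print(cephfs_mount_options("client.admin", "PLACEHOLDER=="))
```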
[4:45] <Psi-jack> lurbs: I'm actually using Proxmox VE 2.3, which has some great support for Ceph in it now. :D
[4:46] <Psi-jack> tserong: BTW, Good to see you around these parts. :)
[4:46] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:46] <tserong> thanks Psi-jack :)
[4:47] <Psi-jack> So, at least with just as simple as a route addition, and ip_forward enabled on the endpoint, I can get ceph working from LAN network to Cluster Network no issue.
[4:47] <Psi-jack> That's a bonus. :)
[4:48] <Psi-jack> Heh, I just wonder if I could set it up with three static routes similarly, 172.18.0.{5-7}/32 to each of their 172.17.6.{1,2,4} IP's. :)
[4:50] <Psi-jack> tserong: Yeah, I borrowed yours and Florian's LCA2013 presentation to use in a LUG session to demonstrate Ceph to others. :)
[4:50] <Psi-jack> Mine ended up becoming a 3 hour thing, though! Hahaha
[4:52] <Psi-jack> Sweet. :)
[4:52] <lurbs> Heh, I've been co-opted into giving a talk about Ceph too. Good thing their LCA talk has such a permissive license. :-)
[4:52] <Psi-jack> My 3-way static route approach worked as well. So no 1 SPOF.
[4:52] <Psi-jack> lurbs: Indeed.
[4:52] <Psi-jack> I had to do the presentation 100% myself, which was mostly okay, I had a nice dual-laptop approach and a presenter remote to help out with it all. :)
[4:53] <lurbs> We had Florian over a few months back to do a course on Ceph/Openstack etc. Been looking into it since then.
[4:53] <Psi-jack> People just had so many questions, and thankfully I was able to answer 90% of them. Only failing on the S3/SWIFT/REST area, since I'm not big on that sector, yet.
[4:54] * loicd (~loic@magenta.dachary.org) has joined #ceph
[4:54] <lurbs> Hopefully we'll be able to get him and the Inktank guys more (paid) work by advocating.
[4:55] <Psi-jack> Well, at least, for me, it's great to know I didn't flub up TOO badly on my ceph setup, I /can/ still use single NIC hosts to access the ceph cluster directly via simple routes, since everything has access to the ceph's main LAN IP's, just not all of them have direct access to their cluster IP's.
[4:55] * loicd (~loic@magenta.dachary.org) Quit ()
[4:55] * Psi-jack grins.
[4:56] <tserong> Psi-jack, IIRC florian had done at least one longer version of that tutorial :)
[4:56] <Psi-jack> Heh, wow.
[4:56] * acu (~acu07@24-159-215-150.static.roch.mn.charter.com) has joined #ceph
[4:57] <tserong> anyway i'm sure he's happy you've found it useful/borrow-worth. i know i am :)
[4:57] <tserong> *borrow-worthy
[4:57] <Psi-jack> Yep. :)
[4:58] * vanham (~vanham@ Quit (Read error: Operation timed out)
[4:58] <tserong> hrm .. gotta run, bbl
[4:58] <Psi-jack> I got lots of comments from everyone that it's a beautiful presentation.
[4:58] <Psi-jack> Especially the "hotel" scenario, helped iron out to someone not a sys-admin, what's basically happening. :)
[4:59] <acu> elder ping
[5:00] <acu> Hello everyone, just parachuted here by the grace of "elder". he was so awesome, in 10 minutes he cleared up what I had been trying to understand about ceph for two weeks
[5:01] <elder> Awww, shucks.
[5:01] <Psi-jack> hehehe
[5:01] <acu> elder: you are here, I could not see the nick
[5:01] <elder> Top of the list :)
[5:02] <acu> oh gosh, to see takes more than visual acuity :)
[5:02] <Psi-jack> heh.
[5:02] <Psi-jack> It's going to be such a pain, but a well-worth pain, converting my starting of using Arch linux over to Gentoo, including the actual servers that run my ceph cluster. :/
[5:05] <acu> OK, so far I got that there are three entities I need to be aware of: monitors, osd and mds servers :)
[5:05] <Psi-jack> My setup today actually involves putting my portage tree on cephfs, so I can use that ever wonderful snapshot ability cephfs has. :D
[5:06] <acu> I need to go ahead and start the most baby brain setup, I run debian wheezy, I hope I can do that, so I need at least three servers with OS installed and each with an empty hard disk
[5:07] <acu> wait, so ceph has snapshot similar to lvm ?
[5:09] <acu> so I wonder if I install a linux kvm host and want to add virtual machines - what kind of storage pool would I be able to do if I have ceph file system ?
[5:10] <acu> can anyone point me to some "real life" simple scenarios on setting up ceph ? written or video tutorials ?
[5:12] <acu> So it seems that I scared or numbed everybody :) that is OK, I got enough of a jump start to go on my own :)
[5:22] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[5:28] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[5:30] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit ()
[5:36] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[5:51] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[5:52] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[5:54] * xiaoxi (~xiaoxi@shzdmzpr01-ext.sh.intel.com) Quit (Ping timeout: 480 seconds)
[6:03] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[6:05] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[6:06] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[6:09] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) Quit (Ping timeout: 480 seconds)
[6:09] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) has joined #ceph
[6:20] <Psi-jack> Hmm.
[6:21] <Psi-jack> When I change my crush map, say my goal right now is to change the weights of utilization a few OSD's have because I want them to be used less often, because they're smaller and I don't want them to run out of space before the larger disks do... Will ceph auto-re-allocate data appropriately when the crush map is re-imported?
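Editor's sketch of the weighting idea Psi-jack describes: give each OSD a CRUSH weight proportional to its capacity so smaller disks receive proportionally less data. Using 1.0 per terabyte is a commonly suggested convention and is assumed here; the OSD names and sizes are made up, and nobody in the log confirms this approach.

```python
def capacity_weights(sizes_bytes):
    """Map each OSD name to a weight of 1.0 per 10**12 bytes of capacity."""
    return {osd: round(size / 10 ** 12, 3)
            for osd, size in sorted(sizes_bytes.items())}


if __name__ == "__main__":
    # Hypothetical disks: two 2 TB drives and one 500 GB drive.
    sizes = {"osd.0": 2 * 10 ** 12, "osd.1": 2 * 10 ** 12, "osd.2": 5 * 10 ** 11}
    for osd, weight in capacity_weights(sizes).items():
        print(osd, weight)
```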
[6:22] <Psi-jack> Probably lagged but hopefully reliable enough at the moment. ISP's having issue tonight. :/
[6:42] * rustam (~rustam@ has joined #ceph
[6:43] * rustam (~rustam@ Quit (Remote host closed the connection)
[6:50] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) Quit (Remote host closed the connection)
[6:55] * calebamiles1 (~caleb@c-50-138-218-203.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[6:56] * rustam (~rustam@ has joined #ceph
[7:01] * rustam (~rustam@ Quit (Remote host closed the connection)
[7:02] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[7:11] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[7:18] * rustam (~rustam@ has joined #ceph
[7:19] * rustam (~rustam@ Quit (Remote host closed the connection)
[7:23] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:34] * rustam (~rustam@ has joined #ceph
[7:35] * rustam (~rustam@ Quit (Remote host closed the connection)
[7:35] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 19.0.2/20130307023931])
[7:41] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[7:48] * norbi (~nonline@buerogw01.ispgateway.de) has joined #ceph
[7:50] * tnt (~tnt@228.204-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:54] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[7:59] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:03] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[8:04] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[8:05] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[8:09] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) has joined #ceph
[8:35] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[8:39] * tnt (~tnt@228.204-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[8:39] * rustam (~rustam@ has joined #ceph
[8:40] * rustam (~rustam@ Quit (Remote host closed the connection)
[8:57] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[8:58] * zippity (559eb342@ircip4.mibbit.com) has joined #ceph
[9:03] * verwilst (~verwilst@dD576962F.access.telenet.be) has joined #ceph
[9:11] * BManojlovic (~steki@ has joined #ceph
[9:12] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[9:12] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[9:15] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[9:16] * eschnou (~eschnou@ has joined #ceph
[9:17] * rustam (~rustam@ has joined #ceph
[9:18] * rustam (~rustam@ Quit (Remote host closed the connection)
[9:33] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[9:34] * nigwil (~idontknow@ has joined #ceph
[9:35] * leseb (~Adium@ has joined #ceph
[9:39] <Psi-jack> Hmmm.
[9:39] <Psi-jack> Okay, so I have a lovely failing drive which is one of my OSD's.
[9:41] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:43] * lmh_ (~lmh@ Quit (Read error: Connection reset by peer)
[9:43] <Psi-jack> http://pastebin.ca/2356685
[9:43] <Psi-jack> Err.
[9:44] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[9:45] <nigwil> Psi-jack: are you using this page to handle your dying drive? http://ceph.com/docs/master/rados/operations/troubleshooting-osd/
[9:46] <nigwil> Psi-jack: I've not administered a Ceph cluster yet so I am curious as to how this situation is best handled
[9:46] <Psi-jack> nigwil: Well, no, not yet. I'm letting it, for now, remain failed while it re-allocates things to other OSD's as it's been doing.
[9:47] <Psi-jack> I'm about to set that OSD's weight to 0 for the time being.
[9:47] <nigwil> what is the effect of that? it won't be used at all?
[9:47] <Psi-jack> Basically.
[9:48] <nigwil> is it then safe to remove the drive and put a fresh working drive in its place?
[9:48] <Psi-jack> I already had like 2 to 3 replicas of everything, so all the data itself is fine on another disk
[9:48] <Psi-jack> Yep. :)
[9:48] <Psi-jack> new drive will be unfortunately purchased with pennies tomorrow. :)
[9:48] * tnt (~tnt@212-166-48-236.win.be) Quit (Quit: leaving)
[9:49] <Psi-jack> Right now my cluster is 32% degraded, but it's recovering. That OSD is pretty much kicked out of the cluster now, though, still available.
[9:51] * ScOut3R (~ScOut3R@ has joined #ceph
[9:54] <nigwil> do you have a separate cluster LAN for the recovery?
[9:54] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[9:55] <nigwil> and when the new drive is added, does Ceph "automatically" take the drive over and give it to the OSD?
[9:56] <nigwil> I'm comparing with the typical RAID storage array where swapping a drive triggers the automatic rebuild
[9:57] <Psi-jack> nigwil: I have a dedicated physical SAN network, yes.
[9:57] <Psi-jack> Automatically? Well, I still have to partition it, set it up, mount it to place, and start the ceph-osd daemon back up for it, but otherwise, yes, once everything is done, it'll see it and see it has nothing on it and start the rebuild process on it.
[9:57] <nigwil> so each OSD has two network interfaces? one for the Ceph clients and one for the OSD traffic (just confirming my understanding)
[9:58] <Psi-jack> I have OSD+MON+MDS on 3 servers.
[9:58] <Psi-jack> Each server has 3 OSD's each in fact.
[9:58] <nigwil> so 9 OSDs in all
[9:59] <Psi-jack> Yes
[9:59] <Psi-jack> My osd.7 is the one failing.
[10:00] <Psi-jack> And if you saw the pastebin earlier, not even SMART can report it anymore.
[10:00] <Psi-jack> http://pastebin.ca/2356687
[10:00] <Psi-jack> yet ANOTHER Seagate died...
[10:00] <nigwil> yes so it is quite brain-damaged (as in the drive is)
[10:00] <Psi-jack> Course, I have to hand it to this one. It at least survived more than 1 year. :)
[10:00] <nigwil> ahh
[10:00] <Psi-jack> Going in tomorrow to replace it with a WD Black SATA-3 1TB. :)
[10:01] <nigwil> when you say "partition it" I assume you would assign the entire drive to the OSD, so you're setting up a single partition?
[10:01] <Psi-jack> I was /hoping/ to replace the 3 320GB OSD's with 1TB first, but oh well.
[10:01] <Psi-jack> nigwil: For the OSD's, yes. Full disk, just mkfs it with logdev to the SSD for journaling.
[10:02] <nigwil> so you're using SSDs for the logdev?
[10:02] * l0nk (~alex@ has joined #ceph
[10:02] <Psi-jack> I just need to rummage through my documentation on the mkfs.xfs commandline I did for these to set them up so perfectly with agcounts and all.
[10:02] <Psi-jack> XFS journal (logdev), and Ceph journal, yes.
[10:02] <Psi-jack> 3 OSD's backed by 1 SSD, per server. :)
[10:03] * rustam (~rustam@ has joined #ceph
[10:04] <nigwil> is that the ideal ratio? 3:1 Rust:SSD?
[10:04] <Psi-jack> 3~4:1 Spinner:SSD, yes.
[10:04] <nigwil> how fast are your networks?
[10:04] <Psi-jack> I have two physical networks, each on 1Gbit.
[10:05] <nigwil> above you said you have 2 or 3 copies, is it possible to have different numbers of replicas at some level?
[10:05] <Psi-jack> Not the best in the world, but with Ceph, I can manage to punch out about 80~90 MB/s from Ceph RBD and CephFS storage resources in the SAN network alone. :)
[10:05] <Psi-jack> Yeah, actually. :)
[10:05] <Psi-jack> You can setup rules per pool and everything.
[10:06] <Psi-jack> Per pool, per node, per rack, even.
[10:06] <nigwil> and waht constitutes a pool? (something I've missed understanding it seems)
[10:06] <nigwil> what
[10:07] <Psi-jack> Hmmm, pools in ceph are basically like buckets.. A common "pool" for example, provided during the basic setup, is the rbd pool. the generic RBD disk pool. You can create a new pool and assign that pool different rules, like number of replicas and such.
[10:07] <nigwil> is it worth having dedicated servers for the MON?
[10:07] <Psi-jack> And you have your clients use that pool instead of rbd.
[10:08] <Psi-jack> That depends, I guess.. I have my mon's on the same servers, because this is just my home server.
[10:08] <nigwil> I'm planning a home server setup too, but I am wondering if we can use Ceph at work also (some of my questions cross over somewhat)
[10:08] * Psi-jack nods.
[10:09] <Psi-jack> Absolutely! :)
[10:09] <nigwil> brb
[10:09] <Psi-jack> i actually have utilized many infrastructures, and out of ALL of them, Ceph is the only one that matters.
[10:10] <Psi-jack> iSCSI over pacemaker provided HA infrastructure. Not so HA unless you put a mediator box in front of it to LVS direct to the actual active iSCSI device. Same for NFS in fact. Upon failover, if you didn't have a director, all my VM's would lose their disks entirely and have to be forced off and restarted.
[10:10] <Psi-jack> Not very cool. :)
[10:16] <Psi-jack> nigwil: For example. For this 500 GB drive that's failing, the mkfs stuff I did to it was: mkfs.xfs -f -L hdd-osd-7 -d agcount=125 -l logdev=/dev/disk/by-partlabel/osd-log-7,version=2 /dev/sdb
[10:17] <Psi-jack> agcount being the important aspect, and of course the logdev. :)
[10:17] <andreask> Psi-jack: you keep default inode-size?
[10:17] * rustam (~rustam@ Quit (Remote host closed the connection)
[10:18] <Psi-jack> andreask: I do, yes. I just change the agcounts to better suit the specifics of the drive.
[10:18] * rustam (~rustam@ has joined #ceph
[10:18] <nigwil> Psi-jack: that is what I most like about Ceph, the clients see all the storage so they can be moved and get "re-attached" to their storage
[10:19] <Psi-jack> nigwil: No, they never detach. That's the beauty of it.
[10:19] * rustam (~rustam@ Quit (Remote host closed the connection)
[10:19] <nigwil> how does that work?
[10:20] <Psi-jack> Did my infrastructure take a hit when osd.7 went down? yes, because it's faulting and moving data around to account for the fault. My VM's actually got ATA timeout hits, and moved the disks to PIO3 mode. but still are up and running.
[10:20] * LeaChim (~LeaChim@ has joined #ceph
[10:21] <nigwil> so jsut slowed a little
[10:21] <nigwil> just
[10:21] <Psi-jack> nigwil: In my infrastructure, I have 3 ceph servers as mentioned, and 4 hypervisor servers. All 7 connect to the SAN network directly, and all 7 of them also connect to the LAN network as well, but utilize the SAN network for data.
[10:21] <Psi-jack> Correct. :)
[10:21] <andreask> Psi-jack: hmm ... what was the default agcount chosen for 500GB? ... and an inode-size of 1k or 2k is typically recommended when using ceph
[10:21] <Psi-jack> Because the SAN network's being saturated a bit from data moving around.
[10:21] * Vjarjadian (~IceChat77@ Quit (Quit: He who laughs last, thinks slowest)
[10:22] <Psi-jack> andreask: agcount is usually way low. Like 4.
[10:23] <nigwil> what is agcount controlling?
[10:23] <Psi-jack> Allocation Groups
[10:23] <Psi-jack> Allows for better parallelism.
[10:23] <nigwil> so a CPU assignment?
[10:23] <Psi-jack> Not necessarily.
[10:24] <andreask> Psi-jack: never saw that ... I have a 35GB xfs filesystem with agcount 16 ... no special tuning
[10:24] <Psi-jack> Some people associate it with that though, and they don't get as good of performance with it doing so. :)
[10:24] <nigwil> I see google-hits with agcount=4 (for example)
[10:25] <Psi-jack> The agcount settings I use are based on total allocation size of the disk/partition.
[10:26] <nigwil> further up the page you used agcount=125 for a 500GB drive, why the odd number?
[10:27] <Psi-jack> Based entirely on the size of the actual drive, having X number of allocation groups for the total size of the drive (in my case).
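Editor's note: the agcount=125 figure for a 500 GB drive works out to 4 GB per allocation group. The sketch below only reproduces that arithmetic. The 4 GB target and the function name `agcount_for` are assumptions inferred from the one example in the chat, not a documented XFS or Ceph recommendation.

```python
import math


def agcount_for(size_gb: float, ag_size_gb: float = 4.0) -> int:
    """Return an allocation-group count for a device of size_gb.

    ag_size_gb is the target size of one allocation group. The 4 GB
    default is an assumption derived from 500 GB / 125 groups in the log.
    """
    if size_gb <= 0 or ag_size_gb <= 0:
        raise ValueError("sizes must be positive")
    # Round up so the last, partially filled group is still counted.
    return max(1, math.ceil(size_gb / ag_size_gb))


if __name__ == "__main__":
    for size in (320, 500, 1000):
        print(f"{size} GB -> agcount={agcount_for(size)}")
```

Under that assumed rule, the 320 GB and 1 TB drives mentioned earlier would get 80 and 250 groups respectively.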
[10:27] <Psi-jack> Trying to find the references I got this all from. :/
[10:28] <nigwil> it (agcount) is not googling for me :-)
[10:31] <Psi-jack> yeah, often times you would actually use agcount for RAID, but with Ceph, there's no underlying infrastructure itself, so accounting for RAID grouping for stripes becomes a bit silly in that regard.
[10:31] <Psi-jack> I wish i could find the reference page...
[10:32] <nigwil> no worries, we can experiment with some values
[10:32] <Psi-jack> I really need to make my own documentation site for myself, since my home infrastructure is so detailed and precise now. haha
[10:32] <nigwil> :-)
[10:33] <nigwil> which version are you running? do you track latest?
[10:35] <Psi-jack> of Ceph?
[10:35] <nigwil> yes
[10:35] <Psi-jack> Then generally, yes, I do. Currently I'm still on 0.56.3 though. I currently maintain Ceph in Arch Linux's AUR, but dropping that as soon as I finish dropping Arch.
[10:35] <nigwil> leaving Arch for ?
[10:36] <Psi-jack> Until then I'll maintain it. :)
[10:36] <Psi-jack> Gentoo.
[10:36] <nigwil> ahh ok
[10:37] <Psi-jack> Main issue is Arch's main repos and AUR don't have /anything/ in common, except that AUR stuff will depend on main repo stuff, and when Arch continuously upgrades those, things in AUR will continuously break. Like Ceph, for example, due to libboost being constantly updates.
[10:37] <Psi-jack> s/updates/updated/
[10:37] * nigwil nods
[10:37] <Psi-jack> This... To me.. Is unacceptable, Gentoo solves this problem with 100% all source building, and you can easily ensure full integrity. :)
[10:38] <nigwil> understandable
[10:38] <joao> then again, Gentoo builds everything from source :p
[10:38] <Psi-jack> They also have Ceph 0.56.3 in stable.
[10:38] <Psi-jack> joao: So does Arch.. Just... ABS and AUR have nothing connecting them, or maintaining stuff in AUR. :/
[10:39] <joao> ah, had no idea
[10:39] <Psi-jack> Neither did I... Until I maintained for a few months. :)
[10:39] <absynth> gentoo still exists?
[10:39] <joao> thought the guys from Gentoo were the only ones crazy enough to do it
[10:39] <Psi-jack> absynth: Of course. :)
[10:39] <joao> I moved on to binary repos a long time ago and have no regrets
[10:40] <Psi-jack> joao: Arch's main repos you can get everything in binary bundles. Optionally you could use ABS yourself and build everything that way, too.
[10:40] <joao> then again, the only thing I really need to build is ceph, so...
[10:40] <Psi-jack> My gentoo setup has a build server I use to build server and desktop binpkgs. :)
[10:41] <Psi-jack> I just have that gbs VM build, and dump the bin packages to a storage node which other clients can access appropriately. :)
[10:41] <Psi-jack> I've started to put the portage tree itself, in a cephfs filestore, so i can use it to do dated snapshots of it. :)
[10:41] <joao> I actually used to love gentoo, until ebuilds started to be unmaintained and things started to go weird; and I really just wanted a functional workstation instead of spending every other week fixing ebuilds
[10:42] <joao> but right now that feels like it was a lifetime ago
[10:42] <Psi-jack> heh
[10:42] <Psi-jack> Most of those issues are resolved now, I'd say.
[10:43] <Psi-jack> I know.. I had somewhat the same issue, except the fact, I enjoyed fixing ebuilds, and was at that time, a Gentoo contributor/developer. :)
[10:43] <joao> eh, that's nice :)
[10:43] <Psi-jack> I'm about to be a full fledged developer, part of the team. :)
[10:44] <joao> to me, moving to gentoo was the obvious choice for someone coming from slackware: no more building stuff manually
[10:44] <Psi-jack> So, I'll also be involved with maintaining ceph in Gentoo soon. :D
[10:44] <joao> but then I started having even more trouble than I used to have with slackware, and let's face it: that makes no sense
[10:44] <Psi-jack> hehe
[10:45] <Psi-jack> Yeaaah. Put like that. I agree..
[10:45] <joao> nowadays, my only concern is how much RAM ubuntu really needs, but that's easily fixable :p
[10:45] <Psi-jack> I stopped working with Gentoo because of resolver issues, the way they basically ousted Daniel Robbins, and a few other reasons.
[10:46] <Psi-jack> heh
[10:46] <Psi-jack> I just moved a lot of stuff, at home, away from Ubuntu. :)
[10:46] <Psi-jack> My reason for THAT is pretty simple. They slapped Wayland in the face with Mir, and I will never use a Canonical product again until they get their head out of their arse.
[10:47] * rahmu (~rahmu@ has joined #ceph
[10:48] * v0id (~v0@212-183-96-57.adsl.highway.telekom.at) has joined #ceph
[10:50] <absynth> having a political/religious opinion about distros is kinda 90s, innit? ;)
[10:50] <Psi-jack> As for other distros.. Well, Debian's okay, I would use it for servers, depending on need.. But I like the newer stuff that Debian doesn't tend to have a lot. :)
[10:50] <joao> absynth, that's how we keep things interesting!
[10:50] <Psi-jack> absynth: Hmmm. 21 years. All I can say about that. :)
[10:50] <nigwil> could I re-ask an earlier question about MON, I think I missed the reply, is it best to have a dedicated server for the MONs? or can they coexist with an OSD?
[10:50] <Psi-jack> I started Linux with SLS. heh
[10:50] <joao> nigwil, they can coexist
[10:50] <Psi-jack> nigwil: Mine co-exist.
[10:51] <Psi-jack> They have their own dedicated storage on the SSD. :)
[10:51] <joao> just make sure you spread your monitors across servers
[10:51] <joao> don't bundle them all in the same server, for obvious reasons
[10:51] <nigwil> if they coexist should the host server have a little more memory than the usual OSD?
[10:51] <nigwil> or more CPU
[10:52] <joao> well, I would certainly advise more mem, but that's because we've seen some monitors OOM-killed when using large PG numbers
[10:52] <nigwil> ok
[10:52] <Psi-jack> hehe
[10:53] <Psi-jack> hence why I run ceph-osd/mon/mds under systemd, currently.
[10:53] <joao> I don't really know what are the official specs for servers, nor what Inktank usually recommends
[10:53] <Psi-jack> in my gentoo replacement methods, supervisord will manage each daemon process.
[10:54] <Psi-jack> but, supervisord will run as root, ceph daemons will be users, so they will get OOM'd before supervisord does.
[10:55] <Psi-jack> Alternative to that, of course, is running ... Oh.. What's that other one.. DJB made...
[10:55] <Psi-jack> hehe
[10:55] * vo1d (~v0@91-115-225-50.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[10:56] <Psi-jack> daemontools. heh
[10:59] <alexxy> gregaf1: joao: i find another problem
[10:59] <alexxy> gregaf1: joao: kernel client shows bogus df values
[10:59] <alexxy> gregaf1: joao: 99G 42G 57G 43% /home
[10:59] * rickstok (d4b24efa@ircip1.mibbit.com) has joined #ceph
[11:00] <alexxy> gregaf1: joao: while ceph report 2013-04-12 12:59:41.809720 mon.0 [INF] pgmap v1922510: 3648 pgs: 3645 active+clean, 3 active+clean+scrubbing+deep; 5309 GB data, 10669 GB used, 14464 GB / 25
[11:00] <alexxy> 149 GB avail
[11:01] * rickstok (d4b24efa@ircip1.mibbit.com) Quit ()
[11:01] * rickstok (d4b24efa@ircip2.mibbit.com) has joined #ceph
[11:02] <rickstok> http://mibpaste.com/R2HDMo
[11:02] <rickstok> What could be the problem of this?
[11:03] <joao> alexxy, the value shown in the pgmap info is the aggregated value across all osds
[11:03] <rickstok> i just deployed it with the standard cookbooks of ceph but having always this issue
[11:03] <alexxy> joao: yep
[11:03] <alexxy> but 10T and 10G is a huge difference
[11:03] <alexxy> fuse client shows 'real' values
[11:04] <Psi-jack> Hmmm
[11:04] <joao> alexxy, what's the actual size of it?
[11:04] <alexxy> 25T
[11:04] <Psi-jack> Interesting..
[11:04] <joao> alexxy, if you're running a recent version, could you try 'ceph df' ?
[11:04] <alexxy> 18 osd x 1.5T
[11:05] <alexxy> http://bpaste.net/show/90901/
[11:05] <Psi-jack> ceph df?
[11:06] <joao> Psi-jack, I think it went in for 0.60 iirc
[11:06] <Psi-jack> Ahhh.
[11:06] <Psi-jack> Well, dernit! Backport that into 0.56.5! :D
[11:06] <joao> should be right; been about a month I've worked on it
[11:07] * ScOut3R_ (~ScOut3R@ has joined #ceph
[11:07] <soren> I'm having trouble wrapping my head around placement groups. I'm hoping you guys can clear it up for me:
[11:07] <Psi-jack> it is kind of weird, actually. df reports my cephfs mount as being 17GB total. Kind of small for a 4TB ceph cluster. :)
[11:08] <soren> First of all, as I understand it, PG's aren't partitions. They don't split up my storage into chunks in any way. Is that correct?
[11:08] <joao> oh, look, there's a unit screw up on 'ceph df'
[11:08] <joao> showing total global size at 25GB
[11:08] <alexxy> joao: ^^^
[11:09] <joao> alexxy, 'ceph df' should be different from the system's df
[11:09] <joao> I'm not sure where cephfs gets its stats, but I doubt it's on the same place ceph df does
[11:10] <joao> the problem on 'ceph df' looks to me like a conversion issue
[11:11] <joao> yeah
[11:11] <alexxy> joao: http://bpaste.net/show/90901/
[11:12] <joao> 'ceph df' issue is that I'm passing a kb count to a formatting class that is expecting a byte count
[11:13] <Psi-jack> Hmm
[11:13] <nigwil> X *= 1024; // :-)
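Editor's note: the unit mix-up joao describes (a KB count handed to a formatter that expects bytes) can be reproduced in a few lines. This is an illustrative sketch, not the Ceph source. `format_bytes` is a hypothetical formatter, and the 25149 GB total is taken from alexxy's pgmap line above, assuming GB there means 2^30 bytes.

```python
def format_bytes(n: int) -> str:
    """Format a byte count with binary (1024-based) unit steps."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


# Cluster total from the pgmap line, expressed as a KB count.
total_kb = 25149 * 1024 * 1024

if __name__ == "__main__":
    # Bug: the KB count is treated as bytes, so the result is 1024x too small.
    print("wrong:", format_bytes(total_kb))
    # Fix: convert KB to bytes before formatting.
    print("right:", format_bytes(total_kb * 1024))
```

The first call reports gigabytes where the second reports terabytes, which matches the symptom of a 25 TB cluster showing up as roughly 25 GB.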
[11:13] <Psi-jack> Are the cephfs xattr's documented somewhere? Like what you can getfattr?
[11:14] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[11:14] <joao> alexxy, I don't know what the system's df is supposed to report, or whether it is precise; maybe someone else knows :\
[11:14] <soren> joao: Looking at the kernel code for ceph it seems to send a CEPH_MSG_STATFS request to a monitor.
[11:14] <nigwil> what does Ceph use xattr's for?
[11:14] <soren> joao: I don't know how that compares to what "ceph df" does.
[11:14] <joao> nigwil, I *think* there's a field on that same struct with the byte count :)
[11:15] <Psi-jack> nigwil: For example, on cephfs itself, you can getfattr something, and get total usage stats of that directory tree. Much faster than using du on the filesystem. :)
[11:15] <joao> soren, it basically reads infos from the osdmap
[11:15] <joao> not sure what is sent to cephfs in that message, but I'll check it in a sec
[11:16] <nigwil> Psi-jack: right, I understand CephFS tracks almost real-time space usage
[11:16] <Psi-jack> nigwil: Imagine doing a du on a 100-osd cluster, or even a 100-node cluster. :)
[11:16] <Psi-jack> Where-as with xattr's you can get that information basically instantly, and then some.
[11:17] <nigwil> ha :-) yes it won't scale without some mechanism
[11:17] <Psi-jack> hehe
[11:17] <Psi-jack> I just forget what those xattrs are you can query against. :/
[11:18] <Psi-jack> And it doesn't /appear/ to be documented, either...
[11:18] <nigwil> "use the source" :-)
[11:18] <joao> soren, well, fwiw, the MStatfs reply contains exactly the same info 'ceph df' uses -- only 'ceph df' outputs the values with the wrong units, but that's a presentation bug, not an issue with the info itself
[11:18] <Psi-jack> Would you like the appropriate death threats that come with that retort? ;)
[11:18] <joao> alexxy, ^
[11:18] <nigwil> haha :-)
[11:21] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Read error: No route to host)
[11:21] <Psi-jack> And getfattr -d doesn't show /anything/. :/
[11:22] <Psi-jack> It's like that mysterious .snap directory. invisible, till you actually specifically look for it. :)
[11:22] <nigwil> IBM GPFS has a very similar mechanism, except true to style it is called .snapshot
[11:22] <Psi-jack> heh
[11:23] <Psi-jack> ZFS has similarish as well.
[11:24] <Psi-jack> Ahhh nice. :0
[11:24] <Psi-jack> CephFS itself is now considered production ready, with a single MDS?
[11:24] <nigwil> is that officially official?
[11:25] <Psi-jack> Well, the FAQ's say they provide commercial support for it as well. ;)
[11:25] <Psi-jack> On a single MDS.
[11:25] <nigwil> so they have some sync/race-conditions to eliminate to get it to multiple MDS
[11:26] <Psi-jack> Not sure. I've been running it with 3 MDS's without issue since December, before 0.56's release. :)
[11:27] <nigwil> might have to try some edge-cases, like fill it to 100% and then remove a couple of OSDs, then start a big transfer
[11:27] <Psi-jack> heh
[11:28] <Psi-jack> I'm still odded out by the 17G total size.. Out of 4TB. :)
[11:39] <Psi-jack> There we go!
[11:39] <Psi-jack> getfattr -d -m ceph.* <dir>
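Editor's note: the same lookup can be done programmatically. The sketch below reads a few CephFS directory statistics with Python's `os.getxattr` (Linux only). The attribute names listed are an assumption about what the mounted client exposes. They are virtual attributes and may differ or be absent depending on the Ceph and kernel versions, so every lookup is allowed to fail.

```python
import os

# Assumed CephFS virtual xattr names for recursive directory statistics.
CEPH_DIR_ATTRS = (
    "ceph.dir.rbytes",    # total bytes under the directory tree
    "ceph.dir.rfiles",    # total files under the directory tree
    "ceph.dir.rsubdirs",  # total subdirectories under the directory tree
    "ceph.dir.entries",   # direct entries in this directory
)


def ceph_dir_stats(path: str) -> dict:
    """Return {attribute: int or None}; None means the attribute was unavailable."""
    getxattr = getattr(os, "getxattr", None)  # missing on non-Linux platforms
    stats = {}
    for name in CEPH_DIR_ATTRS:
        try:
            raw = getxattr(path, name)
            stats[name] = int(raw.decode().strip("\x00 \n"))
        except (OSError, TypeError, ValueError):
            # Not a CephFS mount, unsupported attribute, or unparsable value.
            stats[name] = None
    return stats


if __name__ == "__main__":
    for attr, value in ceph_dir_stats(".").items():
        print(attr, value)
```

On a directory that is not on CephFS, every value comes back as None rather than raising.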
[11:47] <nigwil> what does that show?
[11:52] * zippity (559eb342@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[12:02] <dosaboy> hello cepehrs!
[12:03] <dosaboy> cephers even
[12:03] <dosaboy> I am having an issue with assigning fsid in ceph.conf
[12:03] <dosaboy> if I set fsid = <uuid> in [global]
[12:04] <dosaboy> my osds are not inheriting the fsid
[12:04] <dosaboy> they seem to still get assigned a random one
[12:04] <dosaboy> I'm sure this is a stupid mistake on my part
[12:04] <dosaboy> can anyone help?
[12:13] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) Quit (Quit: Leaving.)
[12:14] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[12:23] * diegows (~diegows@ has joined #ceph
[12:29] * ljonsson1 (~ljonsson@ext.cscinfo.com) Quit (Quit: Leaving.)
[12:41] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:44] * rustam (~rustam@ has joined #ceph
[12:44] <mattch> dosaboy: It might be that fsid isn't a [global] variable, but rather an [osd] one?
[12:44] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:45] * rustam (~rustam@ Quit (Remote host closed the connection)
[12:48] <dosaboy> mattch: the docs appear to indicate otherwise
[12:48] <mattch> dosaboy: Ahh, ok, just a guess. Will see if someone more knowledgeable can answer
[12:50] <dosaboy> i.e. http://ceph.com/docs/master/rados/configuration/mon-config-ref/
[12:54] * sleinen (~Adium@2001:620:0:25:9094:35ee:61ab:a0b3) has joined #ceph
[12:54] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:56] * rahmu (~rahmu@ Quit (Remote host closed the connection)
[13:01] * ljonsson1 (~ljonsson@ext.cscinfo.com) has joined #ceph
[13:17] * ScOut3R_ (~ScOut3R@ Quit (Remote host closed the connection)
[13:18] * ScOut3R (~ScOut3R@ has joined #ceph
[13:29] * rustam (~rustam@ has joined #ceph
[13:30] * rustam (~rustam@ Quit (Remote host closed the connection)
[13:37] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[13:38] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[13:41] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:43] <nigwil> can an OSD share its partition with the OS?
[13:47] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[13:56] <Gugge-47527> nigwil: yes
[13:56] <nigwil> thanks Gugge-47527
[14:02] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:06] * mrjack_ (mrjack@office.smart-weblications.net) has joined #ceph
[14:12] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Read error: Operation timed out)
[14:22] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Ping timeout: 480 seconds)
[14:25] * jgallard_ (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[14:25] * vanham (~vanham@ has joined #ceph
[14:30] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Ping timeout: 480 seconds)
[14:31] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Read error: Operation timed out)
[14:35] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[14:35] * leseb (~Adium@ Quit (Quit: Leaving.)
[14:37] * NightDog (~Karl@ has joined #ceph
[14:37] * noahmehl (~noahmehl@cpe-75-186-45-161.cinci.res.rr.com) Quit (Quit: noahmehl)
[14:42] <ljonsson1> Hello everyone, I'm trying to trouble shoot a HEALTH_WARN state.
[14:42] <ljonsson1> HEALTH_WARN 15 pgs peering; 15 pgs stuck inactive; 54 pgs stuck unclean
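Editor's note: when tracking a stuck cluster over time, it can help to turn a summary line like the one above into numbers. The following is a minimal sketch that parses only the "N pgs state" form seen in this log. `parse_health` is a hypothetical helper and does not cover every message `ceph health` can print.

```python
import re


def parse_health(line: str):
    """Split a health summary into (status, {pg state: count})."""
    status, _, rest = line.strip().partition(" ")
    counts = {}
    for item in rest.split(";"):
        # Match items such as "15 pgs stuck inactive".
        m = re.match(r"\s*(\d+)\s+pgs\s+(.+?)\s*$", item)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return status, counts


if __name__ == "__main__":
    sample = "HEALTH_WARN 15 pgs peering; 15 pgs stuck inactive; 54 pgs stuck unclean"
    print(parse_health(sample))
```

Items that do not follow the "N pgs state" pattern, such as "mds a is laggy", are simply skipped.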
[14:42] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[14:42] <ljonsson1> All my osd are up and in
[14:43] * mtk0 (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[14:43] <ljonsson1> mons look good as far as I can tell
[14:43] <ljonsson1> It's been stuck in this state for a day now
[14:43] <vanham> how long has it been like that?
[14:43] <vanham> ok
[14:44] * mtk0 (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[14:44] <joao> osds may be unable to talk to each other, but in that case they would be marking each other down; or it can be a crush map issue; or maybe you have a small amount of osds per host and that kind of issue may happen then
[14:44] <ljonsson1> I was performing a rolling restart, rebooting one server at a time, waiting for ceph to come back healthy before moving on to the next node
[14:44] <ljonsson1> the last node I restarted also had a monitor on it
[14:44] <ljonsson1> after restarting it got stuck with this HEALTH_WARN
[14:45] <ljonsson1> I was planning to upgrade from 56.2 to 56.4 but wanted to make sure all was clean and healthy first
[14:46] <ljonsson1> I have 11 osd's per host
[14:47] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[14:47] * portante|lt (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Read error: Operation timed out)
[14:49] <ljonsson1> full connectivity between all nodes on both interfaces
[14:50] <joao> yeah, the connectivity issues would be noticeable in other forms
[14:51] <joao> my best advice: stick around, and maybe sjust can help you with that
[14:51] <joao> I for one always have to resort to him whenever that happens to me on my test clusters
[14:51] <absynth> you, and everyone else :P
[14:52] <absynth> Sam, Reviver of OSDs and Reanimator of broken Clusters.
[14:52] <joao> that is incredibly comforting :p
[14:54] <nigwil> oops doc has an error, this is throwing a 404: https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc%27
[14:54] <ljonsson1> sjust, ok I'll keep an eye out for him
[14:55] <joao> nigwil, where in the docs is that link?
[14:55] <joao> btw, fyi, drop the %27 and it should work
[14:56] <nigwil> http://ceph.com/docs/master/install/debian/#install-release-key
[14:56] <joao> thanks
[14:56] <joao> uh, the link is correct on the docs
[14:57] <nigwil> oops sorry about that, my mistake
[14:57] <nigwil> my URL detector is over zealous
[14:57] <nigwil> it works now...
[14:58] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) Quit (Remote host closed the connection)
[14:59] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) has joined #ceph
[15:09] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[15:11] * leseb (~Adium@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[15:11] * sleinen (~Adium@2001:620:0:25:9094:35ee:61ab:a0b3) Quit (Quit: Leaving.)
[15:11] * sleinen (~Adium@ has joined #ceph
[15:13] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[15:15] * dwm37 (~dwm@northrend.tastycake.net) Quit (Quit: *poof*)
[15:19] * jskinner (~jskinner@ has joined #ceph
[15:21] * sleinen1 (~Adium@2001:620:0:25:91ae:bc35:394d:1d3a) has joined #ceph
[15:21] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[15:24] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[15:27] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[15:28] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[15:42] * drokita (~drokita@ has joined #ceph
[15:49] * janos (janos@static-71-176-211-4.rcmdva.fios.verizon.net) Quit (Read error: Operation timed out)
[15:50] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:51] <nigwil> root@ceph0:/home/nw# ceph health
[15:51] <nigwil> HEALTH_WARN mds a is laggy
[15:51] <nigwil> is "laggy" a problem?
[15:52] * drokita1 (~drokita@ has joined #ceph
[15:52] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[15:58] * PerlStalker (~PerlStalk@ has joined #ceph
[16:00] * sileht (~sileht@sileht.net) Quit (Remote host closed the connection)
[16:05] <vanham> nigwil, are your clocks synchronized?
[16:05] <vanham> and how is the load on the mds?
[16:06] <nigwil> vanham: about 4 seconds different
[16:06] <nigwil> mds is very quiet
[16:06] * rustam (~rustam@ has joined #ceph
[16:06] <vanham> ok then
[16:07] * rustam (~rustam@ Quit (Remote host closed the connection)
[16:10] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[16:18] <vanham> nigwil, it seems that your mds is not communicating well with mon
[16:18] <vanham> (or at all)
[16:19] <vanham> after 15 seconds without communication between them the mds becomes lagy
[16:19] <vanham> laggy
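Editor's note: the rule vanham describes is a simple grace-period check. The sketch below shows the idea only and is not Ceph's implementation. The 15 second figure is the one quoted in the chat, and `is_laggy` is a hypothetical name.

```python
import time

BEACON_GRACE_SECONDS = 15.0  # figure quoted in the chat above


def is_laggy(last_beacon: float, now=None, grace: float = BEACON_GRACE_SECONDS) -> bool:
    """Return True if more than `grace` seconds have passed since the last beacon.

    last_beacon and now are timestamps from the same clock (for example
    time.monotonic()), which avoids jumps when the wall clock is adjusted.
    """
    if now is None:
        now = time.monotonic()
    return (now - last_beacon) > grace


if __name__ == "__main__":
    start = time.monotonic()
    print(is_laggy(start))            # a beacon seen just now is not laggy
    print(is_laggy(start - 20.0))     # a beacon 20 seconds old is laggy
```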
[16:19] <nigwil> ok, they are on the same host though
[16:20] <drokita1> Anyone using a live boot environment in their Ceph cluster?
[16:21] <vanham> damn, I'm so out of it
[16:24] <nigwil> vanham: I rebooted and now it is happy
[16:24] <vanham> can you pastebin your "ceph status detail" and your "ceph mds dump"
[16:24] <vanham> nvm
[16:24] <vanham> hum
[16:25] <vanham> nigwil, I'm trying to learn the next problems I'll have with managing ceph here first... Shame we couldn't find out what was wrong
[16:25] <vanham> good it is fixed
[16:26] <vanham> what version are you using?
[16:26] <nigwil> 0.60
[16:26] <nigwil> maybe a bit bleeding edge :-)
[16:26] <vanham> cool!
[16:27] <Psi-jack> heh
[16:27] <nigwil> this is my first attempt
[16:27] <Psi-jack> Replacement HDD is in hand. :)
[16:27] <Psi-jack> makes it hard to type. :/
[16:27] <vanham> Hi Psi :)
[16:36] * leseb (~Adium@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[16:40] <Psi-jack> Moin
[16:47] * t0rn (~ssullivan@2607:fad0:32:a02:d227:88ff:fe02:9896) has joined #ceph
[16:51] * tjikkun_ (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Quit: Ex-Chat)
[16:52] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[16:52] <Psi-jack> Hmm, alright.
[16:52] * sileht (~sileht@sileht.net) has joined #ceph
[16:53] <Psi-jack> So, I am about to replace this HDD. Basically my plan is to pull out the old one, put in the new one, format full disk (no partitions), updating the UUID into my fstab appropriately since I use timestamp based UUIDs to mark age and such..
[16:53] <Psi-jack> And once it's in place and mounted and ready, albeit blank, I should be able to start up the OSD again, and it'll "repair" it as expected..
[16:54] <Psi-jack> Am I wrong at all? :)
[16:55] * mrjack (mrjack@office.smart-weblications.net) Quit (Ping timeout: 480 seconds)
[17:09] * diegows (~diegows@ has joined #ceph
[17:10] * norbi (~nonline@buerogw01.ispgateway.de) Quit (Quit: Miranda IM! Smaller, Faster, Easier. http://miranda-im.org)
[17:12] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[17:14] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[17:14] <mrjack_> hm
[17:14] <mrjack_> what to do when ceph-osd is in D state?
[17:14] <Psi-jack> Heh. Not much you can do.
[17:15] <mrjack_> load average: 116.14, 65.19, 30.53
[17:15] <mrjack_> i can do something now.. but maybe not in a few minutes
[17:15] <slang1> mrjack_: that might be an indication of a drive going bad. Does strace show which system call its blocked on?
[17:15] <slang1> mrjack_: anything in dmesg?
[17:16] <mrjack_> strace -f -p 4271
[17:16] <mrjack_> Process 4271 attached with 113 threads - interrupt to quit
[17:16] <Psi-jack> Yep
[17:16] * leseb (~Adium@ has joined #ceph
[17:17] <mrjack_> INFO: task ceph-osd:4570 blocked for more than 300 seconds.
[17:17] <Psi-jack> I JUST replaced a drive in the same scenario-ish. osd was Dsl and SMART showed it was definitely failing.
[17:17] <mrjack_> hm
[17:17] <mrjack_> smart looks good
[17:18] <Psi-jack> Hmmm.
[17:18] <Psi-jack> Okay, so yeah, mounting a new replacement HDD in place of the old failed one that's now gone and replaced, didn't work as I expected.
[17:18] <Psi-jack> it's unable to open the OSD superblock on the mount point of the OSD (it's empty!)
[17:22] <Psi-jack> There we go.
[17:22] <Psi-jack> ceph-osd -i 7 -f --mkfs
[17:22] <Psi-jack> Now it's in and up, and balancing into it. :D
[17:22] <mrjack_> hm
[17:22] <mrjack_> i see a simple "sync" does not sync
[17:22] <mrjack_> maybe time to reboot the node?
[17:23] <Psi-jack> mrjack_: In the D state, yeah, if you can't stop it, reboot's the only way.
[17:23] <Psi-jack> Anything else would leave the system less stable than it already is now.
[17:24] <Psi-jack> D is pretty much as bad as Z. :)
[17:25] <mrjack_> now it's gone Z
[17:25] <Psi-jack> Yeah. Now definitely reboot time. :)
[17:26] <mrjack_> can't reboot now
[17:26] <Psi-jack> What do you mean you can't reboot now? It's ceph!
[17:26] <Psi-jack> It's highly available, if you did it correctly.
[17:26] <mrjack_> there are other jobs on that node
[17:26] <mrjack_> which i don't want to kill now
[17:26] <Psi-jack> Well, you shouldn't be running stuff on your storage servers besides your storage daemons.
[17:27] <mrjack_> yes you shouldn't
[17:27] <Psi-jack> Sounds like why your CPU load is off the charts at 116.14.
[17:27] <mrjack_> that is because of the ceph-osd hanging
[17:27] <mrjack_> fsync() does not work anymore somehow
[17:27] <absynth> colocating OSDs and qemu-kvm processes works just fine
[17:27] <absynth> there is no "you shouldn't", actually.
[17:28] <mrjack_> :)
[17:28] <mrjack_> hi absynth!
[17:28] <Psi-jack> My ceph-osd has been hanging since Apr 8th, and my CPU load's not done that. And it was because the HDD itself was failing, hard.
[17:28] <absynth> hey there
[17:28] <Psi-jack> absynth: Hmm? Since when? Using ceph RBD on the same servers as the ceph servers has been considered bad since.. Since I remember.
[17:29] <Psi-jack> Has this changed?
[17:29] * Vjarjadian (~IceChat77@ has joined #ceph
[17:29] <mrjack_> i also read this a while ago
[17:29] <absynth> i think this is about cephfs or the kernel client
[17:29] <mrjack_> but this is for the ceph-fs client
[17:29] <Psi-jack> It's about /everything/
[17:29] <absynth> no
[17:29] <mrjack_> no
[17:29] <absynth> it`s not
[17:29] <absynth> you are very sadly mistaken
[17:30] <Psi-jack> or at least the rbd kernel client and cephfs kernel client.. Yeah, maybe.
[17:30] <absynth> i am personally running several hundred vms on a colocated setup and it works _just fine_
[17:31] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) has joined #ceph
[17:31] <absynth> cephfs is not considered stable enough for anything useful, anyway
[17:31] <Psi-jack> I guess I never thought about the client side clients, like qemu-rbd
[17:31] <imjustmatthew> absynth: are you mounting the RBDs/cephfs inside the VMs or in the dom0 where the OSD is running?
[17:31] <Psi-jack> absynth: Eh? FAQ says it is now..
[17:31] <absynth> the latter
[17:31] <Psi-jack> For a single MDS, anyway.
[17:32] <imjustmatthew> Psi-jack: CephFS is "stable" but I'm not sure I'd use it in production yet, I definitely still encounter bugs with it
[17:32] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:33] <Psi-jack> I've seen minor bugs at best, so far. and I'm running it with 3 MDS's.. The bugs I've encountered haven't yet been an issue. Mostly just reporting bugs, like having 17GB total space on the cephfs volume, while there's actually technically 4TB.
[17:33] <imjustmatthew> absynth: Huh, I might have to try that, I was under the impression it was still going to deadlock under memory pressure. It would certainly be nice to use those spare CPU cycles for something productive.
[17:33] <Psi-jack> That's, however, df reporting, not ceph tools reporting that. :)
[17:34] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[17:35] <Psi-jack> So, with ceph-fuse, you could locally use the ceph-osd on the same system as the client?
[17:37] * sleinen1 (~Adium@2001:620:0:25:91ae:bc35:394d:1d3a) Quit (Quit: Leaving.)
[17:37] * sleinen (~Adium@ has joined #ceph
[17:38] <Psi-jack> I ask, because someone wants to use cephfs, but doesn't need rbd, because they really like cephfs's snapshot capability.
[17:39] <mrjack_> could it be possible that the osd died because of full / and no space for writing logs?
[17:40] <Psi-jack> Hmmm... Possibility... Maybe...
[17:40] <absynth> not sure about that fuse thing
[17:40] <mrjack_> i do not use fuse
[17:40] <Psi-jack> Hmmm
[17:40] <absynth> that was directed at Psi-jack
[17:40] <mrjack_> ah sry! ok
[17:40] <Psi-jack> I know. :)
[17:41] <absynth> mrjack_: that might well happen
[17:41] <absynth> and of course you will NEVER want the osd partitions to fill up
[17:41] <Psi-jack> heh. So, unsure...
[17:41] <dosaboy> hi guys can someone tell me how I can manually stripe data to an object in ceph?
[17:41] <dosaboy> is it possible using the reados command?
[17:41] <mrjack_> absynth: that's why i expanded our cluster from monday 18:00 pm - tuesday 11:30am
[17:42] <dosaboy> reados/rados/
[17:43] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Read error: Connection reset by peer)
[17:45] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[17:45] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[17:51] <mrjack_> there we go
[17:51] <mrjack_> 2013-04-12 17:51:18.204870 mon.0 [INF] pgmap v14522185: 768 pgs: 666 active+clean, 48 active+degraded+wait_backfill, 5 active+recovery_wait, 45 active+degraded+backfilling, 4 active+recovering; 808 GB data, 2158 GB used, 3144 GB / 5442 GB avail; 11925KB/s wr, 1560op/s; 70596/641631 degraded (11.003%); recovering 623 o/s, 2422MB/s
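A status line like the one mrjack_ pasted is easier to follow during a long recovery if it is split into per-state counts. The following is only a rough sketch that assumes the line keeps the shape shown in this log ("N pgs: count state, ...; ... X/Y degraded"); the function name `parse_pgmap` is made up for illustration and is not part of any Ceph tool.

```python
import re

def parse_pgmap(line):
    """Split a pgmap status line into (total PGs, per-state counts, degraded %)."""
    # The state list sits between "<N> pgs:" and the first ';'.
    m = re.search(r"(\d+) pgs: ([^;]+);", line)
    total = int(m.group(1))
    states = {}
    for item in m.group(2).split(","):
        count, state = item.split()
        states[state] = int(count)
    # Optional "X/Y degraded" pair, as printed in the log line.
    d = re.search(r"(\d+)/(\d+) degraded", line)
    degraded_pct = 100.0 * int(d.group(1)) / int(d.group(2)) if d else 0.0
    return total, states, degraded_pct

# Sample taken from the log line above (trailing rate fields trimmed).
line = ("pgmap v14522185: 768 pgs: 666 active+clean, "
        "48 active+degraded+wait_backfill, 5 active+recovery_wait, "
        "45 active+degraded+backfilling, 4 active+recovering; "
        "808 GB data, 2158 GB used, 3144 GB / 5442 GB avail; "
        "70596/641631 degraded (11.003%)")
total, states, pct = parse_pgmap(line)
print(total, sum(states.values()), round(pct, 3))
```

The per-state counts add up to the PG total, and the recomputed ratio matches the percentage the monitor printed.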
[17:57] <absynth> don't make any weekend plans ;)
[17:58] <mrjack_> why? ;)
[17:58] <absynth> anything can happen during a recovery - just as with alcoholics
[17:59] * eschnou (~eschnou@ Quit (Ping timeout: 480 seconds)
[17:59] <mrjack_> 2013-04-12 17:59:26.803126 mon.0 [INF] pgmap v14522241: 768 pgs: 768 active+clean; 808 GB data, 2424 GB used, 3748 GB / 6359 GB avail; 269KB/s wr, 42op/s
[17:59] <mrjack_> it is sync
[17:59] <mrjack_> so no worry
[18:08] * BillK (~BillK@58-7-53-224.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:08] * sleinen (~Adium@2001:620:0:26:c8ab:ae00:25a:afbd) has joined #ceph
[18:11] * smeven (~diffuse@ Quit (Ping timeout: 480 seconds)
[18:19] * mattch (~mattch@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:20] <absynth> awww, there goes your weekend entertainment!
[18:20] * rustam (~rustam@ has joined #ceph
[18:20] <absynth> now you'll have to watch a movie or go skydiving or whatever you do on weekends
[18:21] * rustam (~rustam@ Quit (Remote host closed the connection)
[18:24] * BillK (~BillK@124-169-69-25.dyn.iinet.net.au) has joined #ceph
[18:25] <mrjack_> :)
[18:25] <mrjack_> absynth: i'll go for some karate training
[18:28] * dignus (~dignus@bastion.jkit.nl) Quit (Ping timeout: 480 seconds)
[18:29] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[18:29] * dignus (~dignus@bastion.jkit.nl) has joined #ceph
[18:35] <absynth> that sounds reasonable ;)
[18:36] * leseb (~Adium@ Quit (Quit: Leaving.)
[18:36] * xmltok_ (~xmltok@pool101.bizrate.com) Quit (Ping timeout: 480 seconds)
[18:46] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[18:46] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[18:47] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[18:48] <Psi-jack> heh, wow..
[18:49] <Psi-jack> Looking at why one of my 3 ceph servers is chewing up memory, and I find the reason is that the ceph-mds daemon is consuming 2.8GB RAM
[18:51] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:52] <gregaf1> grumble grumble
[18:52] <gregaf1> can you get it to dump its cache and see what the size is, Psi-jack?
[18:52] <Psi-jack> gregaf1: Sure. How do I get it to dump its cache? ;)
[18:53] <gregaf1> "ceph mds tell 0 dumpcache /tmp/dump.txt"
[18:53] <gregaf1> where 0 is the id (it should be 0) and "/tmp/dump.txt" is the location you want it to dump to
[18:54] <Psi-jack> Hmmm. Well, I do run 3 MDS's, and they're named a, b, and c. This is a.
[18:54] <gregaf1> just one MDS active, and "a" is the active one?
[18:55] <Psi-jack> e180: 1/1/1 up {0=a=up:active}, 2 up:standby
[18:55] <gregaf1> yeah, that "0" right there :)
[18:55] <Psi-jack> Yep. :0
[18:55] <Psi-jack> So, yeah. dumped the cache file. And... What am I doing?
[18:56] <Psi-jack> Definitely has a lot of stuff in the file. hehe, a lot of information about portage stuff, since I recently put my portage tree for Gentoo on cephfs. :)
[18:58] <gregaf1> I actually don't remember the format; let me see if I have one
[18:58] <gregaf1> does it start out the beginning by saying how many items are in-cache or something?
[18:59] <Psi-jack> nope...
[18:59] <Psi-jack> it starts off with inode details.
[18:59] <Psi-jack> [inode 100000817ac [2,2] /portage/desktop/x11-themes/vanilla-dmz-aa-xcursors/Manifest auth v53 s=1677 n(v0 b1677 1=1+0) (iversion lock) cr={8031155=0-4194304@1} | dirty 0x9b65c6a0]
[18:59] <gregaf1> okay, well wc -l on it will start out fine
[19:00] <Psi-jack> dump is 212,292 lines.
[19:00] <gregaf1> oh, that's about the right size I think
[19:00] <gregaf1> huh, I was expecting it to be much larger
[19:00] <gregaf1> oh, you build your own packages; is it using tcmalloc?
[19:01] <Psi-jack> Hmmm, one sec. I will verify.
[19:01] <Psi-jack> Yes
[19:02] <gregaf1> well now I'm out of ideas
[19:02] <Psi-jack> Hmm, tcmalloc I take it is the recommended? :)
[19:02] <gregaf1> *ponders*
[19:02] <gregaf1> very much so
[19:02] <gregaf1> the default memory allocator does horrible things under our workload
[19:03] <Psi-jack> Gotcha. *adjusts gentoo builds accordingly*
[19:03] <Psi-jack> Heh, what about libatomic? :D
[19:03] <gregaf1> like use up 3GB of heap space even though the daemon has all of 200MB allocated, due to cycling
[19:03] <Psi-jack> Wow..
[19:03] <gregaf1> which induces memory fragmentation, which is the real problem of course
[19:04] <Psi-jack> yeah, I noticed this because 100% of vmem is used.. 100%! That's all physical memory, and all swap. Thankfully the swap is on a high speed SSD.
[19:04] <gregaf1> libatomic too, although that's less crucial as best I know
[19:04] <gregaf1> but you are using tcmalloc, so that shouldn't be the problem
[19:04] <gregaf1> and your cache is in the realm of correct
[19:04] <gregaf1> and I don't know where else the memory could be going :/
[19:04] <Psi-jack> This is ceph 0.56.3 BTW
[19:05] <gregaf1> well, I guess open up two terminals
[19:05] <Psi-jack> I know .4 is out., but I haven't updated my packages yet.
[19:05] <gregaf1> in one run ceph -w
[19:05] <Psi-jack> Oi.. that'll be rather loud at the moment.. :)
[19:05] <Psi-jack> My OSD's are still rebalancing from the replacement HDD.
[19:06] <Psi-jack> Currently at 11.3% degraded and slowly declining.
[19:06] <gregaf1> in the other run "ceph mds tell 0 heap stats"
[19:06] <Psi-jack> Got it.
[19:06] * mrP (~pablomelo@static-71-187-25-130.nwrknj.fios.verizon.net) has joined #ceph
[19:07] <gregaf1> well, you care about what the mds is going to dump to the central log, so I suppose it ends up in the MDS log too if you'd rather go look at the file, but ceph -w seems simplest ;)
[19:07] <Psi-jack> http://pastebin.ca/2357216
[19:07] <gregaf1> ….well, I've sure never seen tcmalloc return results like that
[19:07] * noob2 (~cjh@ has joined #ceph
[19:07] <Psi-jack> heh
[19:08] <gregaf1> indeed it says the MDS is only using 800MB, and it has 1700MB in the free list
[19:08] <Psi-jack> Correct.
[19:08] <gregaf1> try "ceph mds tell 0 heap release" and see if it goes away
[19:09] <Psi-jack> It sure did.
[19:09] <gregaf1> well, I dunno how that happened
[19:10] <gregaf1> I've never seen numbers like that before
[19:10] <Psi-jack> heh, well, ps doesn't say it did. but..
[19:10] <gregaf1> perhaps there's some tuning of tcmalloc options that is done for us in the binary distros but which gentoo doesn't have
[19:10] <Psi-jack> 2,861,296
[19:10] <gregaf1> oh, ps is still angry at it?
[19:10] <Psi-jack> But, swap went down to 930 MB, of 2047 MB. :)
[19:11] <gregaf1> ah
[19:11] <gregaf1> well, you can dump the stats again and see what they say now
[19:12] <gregaf1> but what I can tell you based on those is that something underneath us is misbehaving
[19:12] <gregaf1> welcome to Gentoo, I guess? :p
[19:12] <joao> I just hope that's what's happening with the monitor (wishful thinking)
[19:12] <gregaf1> joao: hmmm?
[19:12] <Psi-jack> MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
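For comparing heap stats before and after a "heap release", the MALLOC lines can be pulled apart mechanically. This is a minimal sketch that assumes every line of interest has the same shape as the one Psi-jack pasted (optional sign, byte count, MiB figure in parentheses, then a label); `parse_heap_line` is a hypothetical helper, not something shipped with Ceph or tcmalloc.

```python
import re

# Optional leading sign (+, -, =), byte count, "(x MiB)", then the label.
MALLOC_RE = re.compile(r"MALLOC:\s*[=+-]?\s*(\d+)\s*\(\s*([\d.]+)\s*MiB\)\s*(.+)")

def parse_heap_line(line):
    """Return (label, bytes, MiB) for a MALLOC line, or None if it does not match."""
    m = MALLOC_RE.search(line)
    if m is None:
        return None
    return m.group(3).strip(), int(m.group(1)), float(m.group(2))

# The one line quoted in the log above.
label, nbytes, mib = parse_heap_line(
    "MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist")
print(label, nbytes, mib)
```

Running the same parse over the full stats dump before and after the release would show directly how much of the free list went back to the OS.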
[19:13] <Psi-jack> gregaf1: Actually, this is Arch Linux running my Ceph servers, currently.
[19:13] <gregaf1> (there's also the actual memory used, virtual address space used, and bytes released to os lines that are interesting)
[19:13] <joao> gregaf1, not so long ago, there were some folks on ceph-devel showing us huge mem consumption on the monitors
[19:13] <gregaf1> is arch a binary or a source distro?
[19:13] <joao> we've never been able to find out what or why
[19:13] <Psi-jack> mix of both. :)
[19:13] <gregaf1> joao: ah, right; we never tracked that down or heard much more about it though
[19:14] <Psi-jack> joao: Well, no issue on the mon.. this was with the mds..
[19:14] <gregaf1> joao: I don't recall, was it before or after the great leveldb merge?
[19:14] <joao> before
[19:14] * mrP (~pablomelo@static-71-187-25-130.nwrknj.fios.verizon.net) has left #ceph
[19:14] <joao> but I bet the problem hasn't gone away
[19:14] <gregaf1> yeah, I dunno
[19:15] * vata (~vata@2607:fad8:4:6:ece8:540e:4482:b1e2) has joined #ceph
[19:15] <Psi-jack> Hmm. :)
[19:15] <gregaf1> our monitor memory use has actually gone up a good bit I bet
[19:15] <gregaf1> or at least creating a 50k PG pool now takes up a ton of RAM
[19:15] <gregaf1> (must….shard…pgmap…)
[19:15] <Psi-jack> Well, while I have you both here.. hehe. is ceph-fuse "safe" to use on the same systems running osd's, unlike ceph's kernel client layer?
[19:15] <gregaf1> yes
[19:16] <Psi-jack> Cool. A local LUG member will be happy to hear that. :)
[19:16] <joao> gregaf1, I'm just glad to see the heap dump is working out-of-the-box again
[19:16] <gregaf1> forgot it was ever broken — but now that you mention it, wasn't that just on one version of Ubuntu or something?
[19:16] * BillK (~BillK@124-169-69-25.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[19:16] <Psi-jack> hehe
[19:17] <joao> gregaf1, it was on one version of gperftools iirc
[19:17] * WithoutSleep (~pablomelo@static-71-187-25-130.nwrknj.fios.verizon.net) has joined #ceph
[19:17] <joao> which ubuntu hadn't updated for god knows how long
[19:17] <Psi-jack> Now, if only the rebalance would hurry the frack up. LOL
[19:17] <joao> installing a trunk version fixed the issues for me back then
[19:17] <Psi-jack> Now at 10.9%. :)
[19:18] <gregaf1> you could turn up the concurrency if you can afford it
[19:19] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:20] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[19:22] * jgallard_ (~jgallard@gw-aql-129.aql.fr) Quit (Quit: Leaving)
[19:23] <Psi-jack> Well, probably couldn't.. I've had my front-facing firewalls stall a bit during this because it was starving for disk access.
[19:23] <Elbandi_> is something in the mds monitor code broken, because i have 4 oneshot-replay laggy mds?
[19:23] <Psi-jack> That, however.. Was mostly when the other HDD was in failed state and it was using slower disks.
[19:23] <Elbandi_> mds.-1.0 up:oneshot-replay seq 19 laggy since 2013-04-08 18:01:42.236257 (standby for rank 0)
[19:26] * rustam (~rustam@ has joined #ceph
[19:26] <Psi-jack> heh, at one point, my firewalls (which are VM's attached to ceph by qemu-rbd for disk), even struggled so much the kernel changed the IO mode to PIO3. :)
[19:27] * vanham (~vanham@ has left #ceph
[19:27] * vanham (~vanham@ has joined #ceph
[19:27] * rustam (~rustam@ Quit (Remote host closed the connection)
[19:28] <Psi-jack> hmm
[19:28] <Psi-jack> okay, so how do I check what the concurrency is, and change it?
[19:29] <gregaf1> I think there's a tuning doc somewhere; look into that because if I wing it I'll forget something
[19:29] <gregaf1> Elbandi_: how many MDS daemons do you actually have?
[19:31] <Psi-jack> gregaf1: Does it involve CRUSH at all?
[19:31] <gregaf1> Psi-jack: the important one though is probably osd_max_backfills
[19:31] <gregaf1> no, not crush at all
[19:31] <Psi-jack> Gotcha. Just looking for keywords to help find the specific documentation.
[19:32] <sjust> ljonsson1: what's up?
[19:33] <Elbandi_> gregaf1: still 2, but these are journal-check mds
[19:33] * rturk-away is now known as rturk
[19:33] <Elbandi_> i mean, there are 2 active
[19:34] <Elbandi_> and there are 5 veryold oneshot-replay
[19:34] <ljonsson1> my cluster has been stuck in HEALTH_WARN for a day now even though all osd's are up and in
[19:35] <ljonsson1> network is up on all nodes
[19:35] <ljonsson1> HEALTH_WARN 15 pgs peering; 15 pgs stuck inactive; 54 pgs stuck unclean
[19:35] <gregaf1> Elbandi_: oh, oneshot-replay; right
[19:35] <gregaf1> there was a bug, I believe resolved now, where they didn't get cleaned out
[19:36] <Elbandi_> hmm
[19:36] <gregaf1> that'll be in cuttlefish
[19:36] <gregaf1> you can clean them out manually if you dump the mdsmap and then do rmfailed on them
[19:37] <ljonsson1> This happened after restarting one node that happened to also have a monitor (cluster was in HEALTH_OK at this point)
[19:40] <Psi-jack> heh, hmmm.
[19:40] <Elbandi_> do you remember maybe the ticket id, or something? :)
[19:41] <Psi-jack> Network related loads are at least reasonable. heck, I'm not even CLOSE to maxing it out, at peak moments (brief moments), of 80 Mbps.
[19:41] <Psi-jack> On a 1Gbps dedicated network..
[19:42] <Psi-jack> Actually wondering why that's not higher. heh
[19:43] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) Quit (Quit: Leaving.)
[19:43] <mikedawson> joshd: are you around?
[19:49] * mcclurmc_laptop (~mcclurmc@ has joined #ceph
[19:51] * alram (~alram@ has joined #ceph
[19:59] * chutzpah (~chutz@ has joined #ceph
[20:00] <ljonsson1> sjust: problem solved after restarting ceph on one of the nodes that had some network problems before I fixed it
[20:02] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) has joined #ceph
[20:06] * Cube (~Cube@ has joined #ceph
[20:08] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[20:09] * verwilst (~verwilst@dD576962F.access.telenet.be) Quit (Ping timeout: 480 seconds)
[20:10] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[20:11] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit ()
[20:11] * rustam (~rustam@ has joined #ceph
[20:11] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[20:12] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[20:13] * BManojlovic (~steki@fo-d- has joined #ceph
[20:13] * rustam (~rustam@ Quit (Remote host closed the connection)
[20:20] * verwilst (~verwilst@dD576962F.access.telenet.be) has joined #ceph
[20:21] * LeaChim (~LeaChim@ has joined #ceph
[20:30] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[20:42] * rustam (~rustam@ has joined #ceph
[20:43] * rustam (~rustam@ Quit (Remote host closed the connection)
[20:43] * WithoutSleep (~pablomelo@static-71-187-25-130.nwrknj.fios.verizon.net) has left #ceph
[20:45] <drokita1> IS anyone out there using a live boot environment with their ceph cluster?
[20:49] <Psi-jack> Define "live boot environment"
[20:50] <Psi-jack> I mean, my ceph servers are online, booted up live, directly off internal SSD drives.
[20:50] <drokita1> I mean via PXE
[20:50] <drokita1> Sorry for not being clear
[20:51] <Psi-jack> Oh.. Hmmm. Not me. :)
[20:51] <drokita1> I am mulling over creating an environment with non-persistent OSD servers. If I have a chassis failure, I just jack in a new server, swap disks and fire it up
[20:52] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[20:52] * rahmu_ (~rahmu@ip-147.net-81-220-131.standre.rev.numericable.fr) has joined #ceph
[20:52] <drokita1> Probably with a smattering of post-boot Puppet magic
[20:53] <Psi-jack> Hmmm
[20:53] <Psi-jack> Seems a bit.. Odd..
[20:54] <drokita1> Why is that... beats an os reinstall and reconfigure when a chassis fails
[20:54] <Psi-jack> Define "Chassis fails." A chassis is a case surrounding something. :)
[20:54] <drokita1> disk chassis.... so say a 16 disk supermicro server using the disks as 16 OSDs
[20:55] <Psi-jack> So, do you mean the equivalent of a PowerVault, for example?
[20:56] <drokita1> Yeah, essentially, but generic. I wouldn't use a power vault. My existing ceph machines are exactly what I said. 16 disk chassis 3U SuperMicro boxes.
[20:56] * ljonsson1 (~ljonsson@ext.cscinfo.com) Quit (Quit: Leaving.)
[20:57] <Psi-jack> drokita1: But a PowerVault, all it does is power and provide the disk array to the attached computer it's plugged into. It doesn't really contain an O/S. this is the main area I'm focusing on.
[20:57] <drokita1> I see... sorry. No this is an actual server with an OS and running OSD processes
[20:58] <Psi-jack> Okay.
[20:58] <drokita1> Trying to get to a place where when I add another of these nodes, I add the osd config to ceph.conf and plug the new server in. No additional steps
[20:58] <Psi-jack> Well, IMHO. Reinstalling an OS after repairing the basic issue, if it's just an HDD failure that is, is nothing. Especially if you pre-image your servers so all you do is put the image back on and boot.
[20:59] <drokita1> How about a motherboard issue?
[20:59] <Psi-jack> That's a simple issue. Replace motherboard. Boot as normal.
[21:00] <Psi-jack> if you're running Ceph as you should be, in a cluster of multiples of these units, everything is safe and will rebalance back and forth. That's part of the beauty of ceph. :)
[21:01] <drokita1> What I want is if a chassis fails, I call a rack monkey to install new server... he boots it, the cluster fixes itself
[21:01] <Psi-jack> If there's not a TOTAL FAILURE, as in, all OSD disks die at the same time, you'd be wasting more time than not, because the new system would have to rebuild 100% of the time. Whereas with a single HDD failure you could pull, replace, rebuild.
[21:02] <drokita1> I would pull the disks for sure
[21:02] <Psi-jack> if a chassis fails, get a new chassis. Keep in mind, I'm seeing "chassis" as a case, not the guts within it.
[21:02] <Psi-jack> businesses don't survive on mere monkeys. :)
[21:04] <drokita1> What about monkey business ;)
[21:05] <Psi-jack> drokita1: Well, then the only time they get restless is around something new... Like a new server unit. :)
[21:05] <Psi-jack> taken, in part, from The Monkees theme song. :)
[21:06] * rahmu_ (~rahmu@ip-147.net-81-220-131.standre.rev.numericable.fr) Quit (Remote host closed the connection)
[21:06] * yasu` (~yasu`@dhcp-59-149.cse.ucsc.edu) has joined #ceph
[21:11] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[21:11] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:15] <imjustmatthew> joao: Around?
[21:21] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[21:21] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:21] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit ()
[21:22] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:28] * verwilst (~verwilst@dD576962F.access.telenet.be) Quit (Quit: Ex-Chat)
[21:29] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[21:29] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit ()
[21:30] * dosaboy (~dosaboy@host86-161-164-218.range86-161.btcentralplus.com) Quit (Quit: leaving)
[21:31] * diegows (~diegows@ Quit (Read error: Operation timed out)
[21:34] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[21:42] * ljonsson (~ljonsson@pool-74-103-170-242.phlapa.fios.verizon.net) has joined #ceph
[21:43] * rustam (~rustam@ has joined #ceph
[21:44] * rustam (~rustam@ Quit (Remote host closed the connection)
[21:46] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[21:46] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:50] * ljonsson (~ljonsson@pool-74-103-170-242.phlapa.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[21:50] * john_barbee_ (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[21:50] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 20.0.1/20130409194949])
[21:58] * calebamiles (~caleb@c-50-138-218-203.hsd1.vt.comcast.net) has joined #ceph
[22:01] * imjustmatthew (~imjustmat@c-24-127-107-51.hsd1.va.comcast.net) Quit (Remote host closed the connection)
[22:02] * vanham (~vanham@ Quit (Quit: Leaving)
[22:09] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[22:18] * KevinPerks1 (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[22:23] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[22:30] * mrjack (mrjack@office.smart-weblications.net) has joined #ceph
[22:35] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:40] * jakku (~jakku@ad046161.dynamic.ppp.asahi-net.or.jp) Quit (Remote host closed the connection)
[22:40] <mrjack> which scheduler do you recommend for ceph? deadline?
[22:40] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[22:40] * jakku (~jakku@ad046161.dynamic.ppp.asahi-net.or.jp) has joined #ceph
[22:46] <davidz> mrjack: I haven't heard discussion of schedulers, but you might want to send to ceph-users mailing list.
[22:48] * jakku (~jakku@ad046161.dynamic.ppp.asahi-net.or.jp) Quit (Ping timeout: 480 seconds)
[22:49] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[22:50] <gregaf1> nhm studied this a bit in one of his performance blog posts
[23:04] * noob21 (~cjh@ has joined #ceph
[23:05] * noob21 (~cjh@ Quit ()
[23:05] * noob21 (~cjh@ has joined #ceph
[23:06] * noob21 (~cjh@ Quit ()
[23:06] <mrjack> i placed the journal on a too big partition on software raid1... now the raid1 resync is killing my IO :(
[23:08] * noob2 (~cjh@ Quit (Read error: Operation timed out)
[23:08] <darkfaded> mrjack: you can ionice the md_resync
[23:09] <darkfaded> or throttle the rate, but that often doesn't work
[23:09] <darkfaded> ionice -c3 -p <thread pid>
[23:11] * noob2 (~cjh@ has joined #ceph
[23:15] * snnw (snnw@0001b418.user.oftc.net) has joined #ceph
[23:15] * snnw (snnw@0001b418.user.oftc.net) has left #ceph
[23:17] * rustam (~rustam@ has joined #ceph
[23:18] * rustam (~rustam@ Quit (Remote host closed the connection)
[23:19] * john_barbee_ (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:24] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[23:27] * humbolt (~elias@91-113-98-240.adsl.highway.telekom.at) has joined #ceph
[23:27] <humbolt> I have a cluster of 3 hosts with 1 osd each. Each OSD consists of one 3TB drive. The hosts have 32GB RAM each and are connected via Gbit switch. But I am still only getting 10MB/s write speed. That seems way too low.
[23:33] * nhm (~nh@65-128-150-185.mpls.qwest.net) has joined #ceph
[23:33] <davidz> mrjack: check this out: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/
[23:34] * LeaChim (~LeaChim@ has joined #ceph
[23:36] <humbolt> with which mount options should I mount ifs partitions for my OSDs?
[23:36] <davidz> humbolt: Someone is going to see if they can help you.
[23:37] * rustam (~rustam@ has joined #ceph
[23:39] * rustam (~rustam@ Quit (Remote host closed the connection)
[23:40] <davidz> humbolt: What filesystem type are you using?
[23:42] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[23:43] <davidz> humbolt: According to the documentation, had you deployed with mkcephfs the default would be xfs filesystem with options rw,noatime. see http://ceph.com/docs/master/rados/deployment/mkcephfs/
[23:43] <mrjack> thanks davidz
[23:43] <dspano> Sorry if this is a stupid question. Can you change schedulers on the fly?
[23:43] <humbolt> xfs
[23:45] <davidz> dspano: yes, you do something like "echo noop > /sys/block/sda/queue/scheduler". Current I/Os need to finish before it will switch.
[23:46] <dspano> davidz: Thanks.
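Before switching schedulers on the fly it helps to see which one is active. The sysfs file davidz mentions lists the available schedulers on one line with the active one in square brackets. A small sketch follows; the helper names are made up, and the sample string is an illustrative value, not output captured from anyone's machine in this log.

```python
def parse_scheduler(text):
    """Return (active, available) from the contents of a queue/scheduler file."""
    active = None
    available = []
    for name in text.split():
        # The active scheduler is wrapped in square brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available

def read_scheduler(device):
    """Read the scheduler file for a block device such as 'sda' (Linux only)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler(f.read())

# Illustrative input in the bracketed format.
print(parse_scheduler("noop deadline [cfq]"))
```

Calling `read_scheduler("sda")` on a real host reads the same file that the echo command writes to.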
[23:47] <humbolt> interestingly rados bench write test shows 35 MB/s on average
[23:49] * BManojlovic (~steki@fo-d- Quit (Quit: Ja odoh a vi sta 'ocete...)
[23:49] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Read error: Connection reset by peer)
[23:50] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[23:50] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[23:50] * vata (~vata@2607:fad8:4:6:ece8:540e:4482:b1e2) Quit (Quit: Leaving.)
[23:51] <davidz> humbolt: Are you using a 2 replica pool size?
[23:51] <humbolt> 3
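A back-of-envelope ceiling for a setup like humbolt's (3 OSDs, 3 replicas) can be sketched as below. Everything numeric here is an assumption for illustration: the 100 MB/s per-disk figure is hypothetical, and whether the journals share the data disks was not stated for this cluster. The model also ignores seeks, filesystem overhead and network limits, so real numbers will be lower.

```python
def write_ceiling_mb_s(osds, disk_mb_s, replicas, journal_on_data_disk):
    """Crude upper bound on aggregate client write throughput in MB/s.

    Every client byte is stored `replicas` times, and each copy hits the
    disk twice when the journal lives on the same disk as the data.
    """
    writes_per_copy = 2 if journal_on_data_disk else 1
    return osds * disk_mb_s / (replicas * writes_per_copy)

# Hypothetical inputs: 3 OSDs, 100 MB/s per disk, 3 replicas, shared journal.
ceiling = write_ceiling_mb_s(osds=3, disk_mb_s=100.0, replicas=3,
                             journal_on_data_disk=True)
print(ceiling)
```

Under those assumptions the disks alone cap client writes at roughly half of one disk's streaming speed, which puts the 35 MB/s rados bench figure in a plausible range and makes the 10 MB/s figure the one worth investigating.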
[23:52] * noob2 (~cjh@ Quit (Quit: Leaving.)
[23:53] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:53] <nhm> humbolt: hello, sorry, I'm late to this conversation.
[23:55] * noob2 (~cjh@ has joined #ceph
[23:57] <dspano> Is 30MB/s with rados bench normal with 1gbe and 7200 sata drives?
[23:58] <nhm> dspano: how many servers, how many clients, how much replication, and how many disks per server? (and are journals on the same disks as the OSDs?)
[23:58] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.