#ceph IRC Log

IRC Log for 2013-04-24

Timestamps are in GMT/BST.

[0:09] * aliguori (~anthony@20616e33.test.dnsbl.oftc.net) Quit (Remote host closed the connection)
[0:10] * danieagle (~Daniel@177.133.173.20) Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[0:10] <Kdecherf> gregaf: I will export new logs tomorrow
[0:13] <pioto> hi, i know i may sound like a broken record but... i'm trying to use cephfs, and here are my demands, err, uh, requirements:
[0:13] <pioto> 1) i need to limit each qemu guest to only access certain parts of the filesystem
[0:14] <pioto> 2) i need full posix semantics (chown, chmod). i think this rules out the samba patches
[0:14] <pioto> 3) i want it to not suck, performance-wise. this seems to rule out 9p (virtfs), which is otherwise my top choice, because of how it lets me restrict access
[0:14] * BillK (~BillK@58-7-124-118.dyn.iinet.net.au) has joined #ceph
[0:15] <pioto> does anyone have any other suggestions for things that'll scale reasonably well, and still allow live migrations with libvirt, and not have a single point of failure, and ...
[0:15] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[0:16] <PerlStalker> pioto: You could use rbd rather than cephfs.
[0:16] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[0:16] <pioto> well, those serve different purposes
[0:16] <pioto> i still need a shared filesystem, shared between 2 or more guests potentially
[0:16] <PerlStalker> pioto: True. It depends on what you need.
[0:16] <pioto> rbd is very good for the "boot drive", and i'm using it there with success
[0:16] <PerlStalker> pioto: IIRC, cephfs will not meet requirement 1.
[0:17] <pioto> yeah. so i've been looking at "cephfs plus virtfs, or samba, or ..."
[0:17] <PerlStalker> pioto: You can use rbd+ocfs2 or gfs, etc.
[0:18] <pioto> hm
[0:18] <PerlStalker> I'm no expert but from what I've seen during my own research, you can use rbd as a shared block device.
[0:18] <pioto> hm?
[0:18] <PerlStalker> You just need a cluster aware file system to sit on top of it.
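
A minimal sketch of what PerlStalker suggests, with illustrative names and assuming the kernel rbd client is available on every node (the OCFS2 cluster stack still has to be configured separately):

    # on any node: create the image to be shared (size in MB)
    rbd create shared-disk --size 102400
    # on every node that will mount it: map it through the kernel client
    rbd map shared-disk
    # once, from a single node: format with a cluster-aware filesystem
    mkfs.ocfs2 -L shared /dev/rbd/rbd/shared-disk
    # then mount on each node
    mount /dev/rbd/rbd/shared-disk /mnt/shared
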
[0:18] <pioto> hm
[0:19] <paravoid> that would suck a lot
[0:19] <paravoid> it also wouldn't help with limiting parts of the fs
[0:19] <pioto> i think gfs needed special hardware, when i looked at it before?
[0:19] <pioto> hm
[0:19] <pioto> well... another angle...
[0:20] <pioto> how much effort would be involved in adding #1 to cephfs in a complete way?
[0:20] <PerlStalker> paravoid: Sure it would. You just share the block device with the file system you want to share.
[0:20] <pioto> basically, building on the current "give some part of the fs a pool", but, limiting metadata access too
[0:20] <pioto> so that a given client key can only access a given area
[0:20] <pioto> is this something that the architecture could support? or would it require a major redesign?
[0:21] <PerlStalker> pioto: At the moment, cephfs only supports a single file system with all that implies.
[0:21] <pioto> yes, i know about now
[0:21] <pioto> i'm wondering about now+$TIME
[0:21] <PerlStalker> I don't know if there's anything in the works to change it.
[0:21] <pioto> ok, now+$MY_TIME or something
[0:22] <dmick> not sure about rbd with a shared fs; it might need to support persistent reservations or something to really pull that off
[0:22] <pioto> as in, is this something patches could be made for, or is it fundamentally not gonna work?
[0:22] <PerlStalker> There are devs that hang out here from time to time who might be able to answer that question.
[0:22] <pioto> yeah. lemme check the "office hours..."
[0:22] <paravoid> I think you could have a separate pool per guest
[0:22] <PerlStalker> AFAIK, it would take a significant change to do it.
[0:22] <paravoid> but that would be a maintenance nightmare
[0:23] <PerlStalker> paravoid: You would need a completely separate cluster.
[0:23] <paravoid> why?
[0:23] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[0:23] <PerlStalker> All of the cephfs metadata is kept in the same place.
[0:23] <pioto> paravoid: yes, it'd be a nightmare
[0:23] <pioto> plus the metadata, yes
[0:23] <gregaf> the way to go about it would probably be to add client security caps to the MDS, which specify what part of the tree they're allowed to access
[0:23] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[0:23] <pioto> you can't read any other pool's data
[0:23] <pioto> but you can delete their files, etc
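
For context, this is the cap model pioto is describing: a key can be pinned to a pool on the OSD side, but the MDS side is all-or-nothing. A sketch with illustrative names:

    # data in the guest's own pool is protected...
    ceph auth get-or-create client.guest1 \
        mon 'allow r' \
        mds 'allow' \
        osd 'allow rw pool=guest1-data'
    # ...but the blanket mds 'allow' still lets the client unlink or
    # rename entries anywhere in the directory tree
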
[0:23] <gregaf> and then on incoming client requests compare those allowances to the dentries that the request is touching
[0:23] <pioto> gregaf: yes, that!
[0:24] <PerlStalker> See? Go to the experts. :-)
[0:24] <pioto> so... is that even a plausible project, or, something that'd require a fundamental architecture change?
[0:24] <pioto> or, do you not know yet for sure?
[0:24] <gregaf> I don't think it would be impossible, and it might even be feasible for somebody to do without a lot of prior experience, but I haven't scoped it at all
[0:24] <pioto> but yes, that's the kinda thing i am looking for
[0:24] <gregaf> not a fundamental architecture change, no
[0:24] <pioto> excellent
[0:24] <gregaf> I'd just expect there to be a lot of edge cases to deal with (though perhaps not?)
[0:25] <paravoid> I smell a bug report
[0:25] <paravoid> fwiw we'd be very interested in that too
[0:25] <pioto> yep, i'm gonna try to draft one up
[0:25] <pioto> [feature request]
[0:25] <sage> gregaf: probably yes if you consider the DoS stuff, but we could probably add a pretty simple generic check right after the locking step that verifies inodes are in the right subtrees
[0:26] <dmick> pioto: this might even be worth a blueprint?
[0:26] <paravoid> would nfs/ganesha integration help here?
[0:26] <sage> great idea for a blueprint, tho, yeah!
[0:26] <paravoid> implement this at that level perhaps?
[0:26] <paravoid> and then mount via nfs
[0:26] <pioto> dmick: hm. i saw that "somewhere"
[0:26] <gregaf> sage: I was actually thinking it should be able to do a dentry comparison before it even locks, right?
[0:26] <pioto> which would be preferred? feature request, or blueprint? or what?
[0:27] <pioto> i can only provide a very high level sketch of what i'd like to see
[0:27] <sage> that too. may want both to avoid issues with hard links across subtree boundaries and such
[0:27] <pioto> i'm not at all familiar with internals
[0:27] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[0:27] <sage> but even a rough check will cover most use cases (no hard links etc)
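
A hypothetical cap syntax for the path-restricted model gregaf and sage are sketching; nothing like this existed at the time, so the form below is purely illustrative:

    # hypothetical: the MDS cap carries an allowed subtree, checked per request
    ceph auth get-or-create client.guest1 \
        mon 'allow r' \
        mds 'allow rw path=/guests/guest1' \
        osd 'allow rw pool=guest1-data'
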
[0:27] <gregaf> well, I don't know that we can budget time for implementing it, so you can do a feature request but if you can devote dev time to it then a blueprint to be discussed at the design summit would be the nicest ;)
[0:27] <dmick> http://wiki.ceph.com/01Planning/02Blueprints
[0:28] <pioto> gregaf: i can't say for sure how much time i'd be able to devote to implementing it
[0:28] <pioto> i'll have to ask
[0:28] <pioto> i guess i'll feature request first, so the idea isn't totally lost
[0:28] <pioto> and then try to work out a blueprint
[0:36] * gmason (~gmason@hpcc-fw.net.msu.edu) Quit (Ping timeout: 480 seconds)
[0:49] * PerlStalker (~PerlStalk@72.166.192.70) Quit (Quit: ...)
[0:54] <pioto> k, I opened http://tracker.ceph.com/issues/4799... any feedback would be welcome. but, i'm leaving right now. thanks for the help!
[0:54] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:55] * loicd (~loic@magenta.dachary.org) has joined #ceph
[0:58] * tnt (~tnt@91.176.19.114) Quit (Ping timeout: 480 seconds)
[1:01] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[1:09] * rustam (~rustam@94.15.91.30) has joined #ceph
[1:19] * The_Bishop (~bishop@2001:470:50b6:0:f01a:7b78:ec10:b15a) has joined #ceph
[1:22] * LeaChim (~LeaChim@176.250.159.86) Quit (Ping timeout: 480 seconds)
[1:25] * coyo|2 (~unf@71.21.193.106) has joined #ceph
[1:25] * coyo (~unf@00017955.user.oftc.net) Quit (Ping timeout: 480 seconds)
[1:25] * coyo|2 is now known as coyo
[1:36] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:37] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:48] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:50] * TMM (~hp@535240C7.cm-6-3b.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[1:51] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[1:53] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:55] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[1:58] * TMM (~hp@535240C7.cm-6-3b.dynamic.ziggo.nl) has joined #ceph
[2:04] * noob2 (~cjh@66.220.144.81) has joined #ceph
[2:05] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[2:14] * diegows (~diegows@190.190.2.126) Quit (Ping timeout: 480 seconds)
[2:16] * alram (~alram@267a14e2.test.dnsbl.oftc.net) Quit (Read error: Connection reset by peer)
[2:16] * alram (~alram@38.122.20.226) has joined #ceph
[2:23] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[2:45] * portante|ltp (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[3:16] * JohansGlock (~quassel@kantoor.transip.nl) Quit (Read error: Connection reset by peer)
[3:20] * alram (~alram@38.122.20.226) Quit (Ping timeout: 480 seconds)
[3:43] * noob2 (~cjh@66.220.144.81) Quit (Quit: Leaving.)
[3:50] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[3:54] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[3:55] * Cube (~Cube@12.248.40.138) Quit (Quit: Leaving.)
[4:02] * treaki_ (0ad1dc3bc4@p4FDF7BEF.dip0.t-ipconnect.de) has joined #ceph
[4:02] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[4:06] * treaki__ (afe4ff0994@p4FDF78D1.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[4:08] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) Quit (Remote host closed the connection)
[4:17] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[4:19] * mega_au_ (~chatzilla@94.137.213.1) has joined #ceph
[4:23] * mega_au (~chatzilla@94.137.213.1) Quit (Ping timeout: 480 seconds)
[4:23] * mega_au_ is now known as mega_au
[4:45] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[4:47] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[5:00] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:08] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) Quit (Quit: ZNC - http://znc.in)
[5:09] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) has joined #ceph
[5:14] * shardul_man (~shardul@174-17-80-182.phnx.qwest.net) has joined #ceph
[5:19] * lx0 is now known as lxo
[6:07] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[7:08] * shardul_man (~shardul@174-17-80-182.phnx.qwest.net) Quit (Remote host closed the connection)
[7:20] * ScOut3R (~ScOut3R@51B7B054.dsl.pool.telekom.hu) has joined #ceph
[7:29] * ScOut3R (~ScOut3R@51B7B054.dsl.pool.telekom.hu) Quit (Ping timeout: 480 seconds)
[7:29] * bstansell (~bryan@c-98-248-230-102.hsd1.ca.comcast.net) has joined #ceph
[7:38] * trond (~trond@trh.betradar.com) Quit (Remote host closed the connection)
[7:38] * trond (~trond@trh.betradar.com) has joined #ceph
[7:40] * BillK (~BillK@58-7-124-118.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[7:42] * bstansell (~bryan@c-98-248-230-102.hsd1.ca.comcast.net) Quit (Quit: bstansell)
[7:45] * tnt (~tnt@91.176.19.114) has joined #ceph
[7:49] * BillK (~BillK@124-169-227-185.dyn.iinet.net.au) has joined #ceph
[7:57] * The_Bishop (~bishop@2001:470:50b6:0:f01a:7b78:ec10:b15a) Quit (Ping timeout: 480 seconds)
[8:25] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: Never put off till tomorrow, what you can do the day after tomorrow)
[8:55] * shardul_man (~shardul@174-17-80-182.phnx.qwest.net) has joined #ceph
[9:03] * verwilst (~verwilst@109.130.227.28) has joined #ceph
[9:04] * ScOut3R (~ScOut3R@212.96.47.215) has joined #ceph
[9:07] * LeaChim (~LeaChim@176.250.159.86) has joined #ceph
[9:12] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:18] * mib_2or3qc (559eb342@ircip2.mibbit.com) has joined #ceph
[9:20] <mib_2or3qc> Hi! Does anybody have an idea how to solve disk I/O problems while recovering? All my I/O gets stalled, even though I set osd recovery max active = 1 and osd max backfills = 1
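
For reference, those throttles can be set persistently or injected at runtime; a sketch for a bobtail-era cluster (the exact injectargs form varied between versions):

    # persistent, in ceph.conf on each OSD host
    [osd]
        osd recovery max active = 1
        osd max backfills = 1

    # or live, without restarting daemons
    ceph osd tell \* injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1'
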
[9:21] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:23] <Zethrok> mib_2or3qc: How long has it stalled? I've found that, depending on hw, it can stall for anything from a few sec to several mins or hours even
[9:24] * tnt (~tnt@91.176.19.114) Quit (Read error: Operation timed out)
[9:24] <mib_2or3qc> yes for around 15 min
[9:24] <mib_2or3qc> but i thought it shouldn't with bobtail
[9:25] <Kioob`Taff> (I still see that behaviour with bobtail)
[9:26] <Zethrok> Same - even just 1 disk going down gives a small window of stalled IO - usually only a few sec. A complete node is usually a few min
[9:27] <mib_2or3qc> Yes, in my case the node came up after a hw failure...
[9:27] <mib_2or3qc> i'm not talking about a stall when a node fails
[9:28] <Zethrok> ahh, so it's doing the redistribution phase to the newly started node?
[9:28] <mib_2or3qc> yes
[9:36] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:36] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) has joined #ceph
[9:37] * leseb (~Adium@83.167.43.235) has joined #ceph
[9:41] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Read error: Connection reset by peer)
[9:41] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:42] * JohansGlock (~quassel@kantoor.transip.nl) has joined #ceph
[9:42] * DarkAceZ (~BillyMays@50.107.54.92) Quit (Read error: Operation timed out)
[9:44] * DarkAceZ (~BillyMays@50.107.54.92) has joined #ceph
[9:47] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[9:50] * lofejndif (~lsqavnbok@93.114.43.156) has joined #ceph
[9:50] * l0nk (~alex@83.167.43.235) has joined #ceph
[9:52] * lofejndif (~lsqavnbok@9YYAACGNE.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[10:00] * thorus (~jonas@pf01.intranet.centron.de) has left #ceph
[10:07] * SubOracle (~quassel@00019f1e.user.oftc.net) Quit (Remote host closed the connection)
[10:07] * SubOracle (~quassel@coda-6.gbr.ln.cloud.data-mesh.net) has joined #ceph
[10:23] * mib_2or3qc (559eb342@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[10:24] * Sargun (~sargun@208-106-98-2.static.sonic.net) has joined #ceph
[10:24] <Sargun> Hey
[10:25] <Sargun> How does Ceph choose where to fetch data from
[10:27] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[10:28] <fghaas> Sargun: I hope http://bit.ly/17R5zSq is helpful :)
[10:30] <Sargun> So, I'm kinda familiar with how CRUSH decides to place data
[10:31] <Sargun> Can you have a different read path than write path
[10:32] <fghaas> the write path is always to the Primary OSD, from whence data replicates out to the other replicas
[10:33] * loicd (~loic@magenta.dachary.org) has joined #ceph
[10:33] <fghaas> what OSD is being hit for reads is, I think, random, but it might also go through the primary OSD -- at any rate, there is definitely no notion of locality, i.e. Ceph won't pick the "closest" OSD to read from
[10:37] <Sargun> oh
[10:39] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) Quit (Remote host closed the connection)
[10:46] * bergerx_ (~bekir@78.188.204.182) has joined #ceph
[10:49] <topro> fghaas: interesting, i was just about to ask concerning that topic, having CRUSH... wouldn't that be relatively straight-forward to implement (binding clients to osds within closest data center as preferred IO-source)?
[10:50] * v0id (~v0@212-183-101-130.adsl.highway.telekom.at) has joined #ceph
[10:50] <fghaas> topro: if it's straightforward, why don't you send a patch? :)
[10:51] <topro> good question, I have to admit ;)
[10:53] <topro> anyway I didn't want to bother anyone to actually do it but primarily start a discussion about how hard it would be to do so
[10:57] * vo1d (~v0@194-118-211-45.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[10:57] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[11:00] * jtangwk (~Adium@2001:770:10:500:8418:aa9e:6507:57cd) has joined #ceph
[11:02] <Anticimex> i'm making excel sheets on ceph again..
[11:02] <Anticimex> what is the recommended approach to add caching (RAM or SSD) onto ceph?
[11:04] <Anticimex> im considering ceph in openstack, and it may be useful to have a dm-cache with ssd's somewhere..
[11:05] <Anticimex> but i can only see that a box with dm-cache and ssds would be a middleman between the ceph cluster and the VMs mounting, using something other than ceph, i suppose
[11:06] <Anticimex> perhaps it's ok to have dm-cache colocated with the individual physical vm hosts (compute nodes)
[11:07] <Anticimex> no clue how that can tie into openstack tho :)
[11:09] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[11:11] <wido> Anticimex: Are you looking into read or writes?
[11:11] <wido> Probably reads? I'd say, put more RAM into the OSDs
[11:11] <wido> Since the RBD images will be striped over 4MB objects on the OSDs they will land in the page cache of the kernel
[11:12] <Anticimex> will osds manage cache themselves
[11:12] <wido> So more RAM == more caching
[11:12] <Anticimex> or are we simply talking using file-level caches here?
[11:12] <wido> Anticimex: No, not the OSDs, the kernel will do that for them
[11:12] <Anticimex> ie ext4, xfs, btrfs
[11:12] <Anticimex> check
[11:12] <wido> simple file level caches, since objects are regular files
[11:12] <wido> at the OSD level
[11:12] <Anticimex> i recall one early argument for userspace osds ~2006 being "then we can handle caching ourselves"
[11:12] <Anticimex> wido: right, KISS. i like
[11:13] <Anticimex> good, for reads
[11:13] <Anticimex> what about writes?
[11:13] <wido> Anticimex: They will go into the journal prior to being committed to the disk
[11:14] <wido> Use a fast SSD for the OSD journal
[11:14] <wido> And underprovision them. So buy a 180GB SSD and only use 16GB on that SSD
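
A sketch of the underprovisioning wido describes, with an illustrative device name: give the journal a small partition and deliberately leave the rest of the SSD unallocated so the controller can use it for wear levelling:

    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart journal 1MiB 16GiB
    # the remaining ~164GB is left unpartitioned on purpose
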
[11:14] <Anticimex> is there good info on the ceph webpage for journaling vs storage vs io benching?
[11:14] <Anticimex> aha
[11:14] <Anticimex> how big does a journal for 24x4TB have to be? is it primarily IO-load-dependent?
[11:14] <wido> yes, mainly I/O
[11:15] <Anticimex> and mainly writes?
[11:15] <wido> rough estimate is that the journal should hold about 30 seconds of IO
[11:15] <Anticimex> or only?
[11:15] <wido> no, only writes
[11:15] <Anticimex> ok
[11:15] <wido> so if you write with 100MB/sec, the journal should be about 3GB
[11:15] <wido> since it has to hit the disk at some point
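
wido's rule of thumb as a ceph.conf sketch (the value is in MB; 30 seconds at 100MB/s is roughly 3GB):

    [osd]
        ; journal sized for ~30 seconds of sustained writes
        osd journal size = 3072
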
[11:15] <Anticimex> right
[11:16] <Anticimex> 100% of all written data (payload in my world) goes through the journal?
[11:16] <wido> Yes, it all goes through the journal
[11:16] <jerker> fghaas: as far as I understood from the mailinglist the primary OSD is used for reads to get better caching. Read more here: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/14370/
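
To see which OSD is primary for a given object, and therefore where reads for it will land, something like the following works (pool and object names are illustrative, and the output shown is approximate):

    ceph osd map rbd myimage.rbd
    # -> osdmap e123 pool 'rbd' (2) object 'myimage.rbd' -> pg 2.51a -> up [4,9] acting [4,9]
    #    the first OSD in the acting set is the primary
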
[11:16] <Anticimex> wido: check, thanks for datapoints
[11:17] <wido> But, please don't build OSD systems with 24 disks
[11:17] <wido> rather go for 6 systems with 4 disks
[11:17] <wido> The impact of losing a machine will be a lot smaller than losing ~100TB at once
[11:17] <wido> It all depends on your cluster size, but don't go for 3 big machines, rather go for 9 smaller machines
[11:18] <Anticimex> well, in my example i have 4x 24-OSD systems for OSD
[11:18] <Anticimex> check
[11:18] <Anticimex> so 8x12 is better then
[11:19] <wido> Anticimex: Yes, but 16x6 would be even better
[11:19] <wido> the smaller the nodes, the less impact you will have when you lose a node
[11:20] <jerker> Anticimex: there's a Swedish supplier for the Asus 12-disk 1U box, I asked them, http://www.tritech.se .. They haven't returned any mails yet though.
[11:20] <Anticimex> 12-disk in 1U seems neat
[11:20] <Anticimex> wido: well, 16x12?
[11:21] <wido> Anticimex: sure, when you have 16 nodes, you lose 6.25% when a node fails
[11:21] <wido> So that's not a big problem
[11:21] <Anticimex> wido: i had exemplified 2x10Gbps for resync-network and 2x10Gbps for front-end access, per osd-box
[11:21] <wido> That's more than sufficient I think
[11:21] * hybrid5121 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[11:21] <wido> but still, recovery is heavy on the cluster, so you should make the impact as small as possible
[11:21] <Anticimex> 2x per link for redundancy
[11:22] <wido> And losing a node could also be maintenance, upgrades, etc
[11:22] <Anticimex> yeah, i understand your point :)
[11:22] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Read error: Operation timed out)
[11:22] * shardul_man (~shardul@174-17-80-182.phnx.qwest.net) Quit (Remote host closed the connection)
[11:22] <Anticimex> so having each node carry <10% of the data is a good metric?
[11:22] <Anticimex> bigger is better, i guess
[11:23] <Anticimex> is there any real-world experience in terms of recovery-time on a single node-loss?
[11:24] <Anticimex> i see recovery time relating to #volume / (node & its throughput) and #nodes somehow
[11:26] <Anticimex> by "bigger is better" above i meant of course more nodes is better.. :)
[11:28] * andreask (~andreas@212.101.205.2) has joined #ceph
[11:28] * andreask (~andreas@212.101.205.2) has left #ceph
[11:30] <Anticimex> wido: MDS scaling? RAM, SSD, lots of multi-core CPU? what's important? :)
[11:30] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[11:30] <wido> Anticimex: I haven't worked with the MDS that much, but
[11:31] <wido> the MDS does everything in memory, so LOT's of memory
[11:31] <wido> But CephFS isn't stable yet
[11:31] <wido> MDS is mainly CPU and memory bound. You want low latency memory and fast CPU's
[11:31] <Anticimex> check, i thought MDS was necessary for managing the cluster including RADOS?
[11:31] <Anticimex> the whole crush and stuff.
[11:32] * Anticimex has to revisit the basic architecture presentation :)
[11:32] <wido> Anticimex: no, the monitors are
[11:32] <Anticimex> check. what about their requirements then? :)
[11:35] <Zethrok> Anticimex: http://ceph.com/docs/master/install/hardware-recommendations/ is a good place to look also :)
[11:35] <Anticimex> yeah that's what i started reading today :)
[11:35] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[11:38] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[11:40] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[11:41] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[11:42] * lofejndif (~lsqavnbok@82VAABUTI.tor-irc.dnsbl.oftc.net) has joined #ceph
[11:44] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[11:49] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[11:50] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[11:55] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[12:08] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[12:11] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:16] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[12:16] * jlogan2 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Quit: jlogan2)
[12:18] * ShaunR (~ShaunR@staff.ndchost.com) Quit ()
[12:23] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:25] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[12:25] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:31] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[12:32] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:32] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[12:33] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[12:35] * Havre (~Havre@2a01:e35:8a2c:b230:e8a8:e15:1197:808c) Quit (Ping timeout: 480 seconds)
[12:35] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[12:35] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:42] * Havre (~Havre@2a01:e35:8a2c:b230:e8a8:e15:1197:808c) has joined #ceph
[12:55] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[12:57] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:59] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[12:59] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[13:04] * rawsik (~kvirc@31.7.230.12) has joined #ceph
[13:05] * Rocky_ (~r.nap@188.205.52.204) has left #ceph
[13:19] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[13:23] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:30] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[13:32] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[13:34] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[13:37] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[13:49] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[13:51] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[13:51] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[14:08] * noahmehl_ (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[14:13] * noahmehl (~noahmehl@47437310.test.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[14:13] * noahmehl_ is now known as noahmehl
[14:17] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Quit: noahmehl)
[14:17] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[14:32] * diegows (~diegows@190.190.2.126) has joined #ceph
[14:41] * rahmu (~rahmu@83.167.43.235) has joined #ceph
[14:51] * lofejndif (~lsqavnbok@82VAABUTI.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[15:04] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:08] * BillK (~BillK@124-169-227-185.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[15:14] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[15:16] * gmason (~gmason@hpcc-fw.net.msu.edu) has joined #ceph
[15:17] * BillK (~BillK@124-148-103-23.dyn.iinet.net.au) has joined #ceph
[15:19] * juuva (~juuva@dsl-hkibrasgw5-58c05e-231.dhcp.inet.fi) Quit (Remote host closed the connection)
[15:19] * juuva (~juuva@dsl-hkibrasgw5-58c05e-231.dhcp.inet.fi) has joined #ceph
[15:23] * hybrid5121 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[15:23] <jerker> Zethrok: hmm the hardware recommendations should replace "1 GB" with "1 Gbit/s" if that is what is meant, to be readable by boring nitpicking technicians like me
[15:24] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[15:27] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit ()
[15:28] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[15:31] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit ()
[15:32] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[15:40] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[15:45] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[15:46] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[15:51] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:51] * tnt (~tnt@91.176.19.114) has joined #ceph
[15:58] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[16:01] <infernix> if I were to put the ceph journal on a battery-backed pcie DRAM storage card, and use pcie SSD OSDs, and measure performance in the box (so no network involved)
[16:01] <infernix> where would the performance bottleneck most likely occur?
[16:02] <nhm> infernix: What kind of test?
[16:03] <nhm> infernix: I've hit 2.7GB/s with rados bench on localhost with 24 spinning disks and 8 SSD journals for large objects.
[16:03] <infernix> i'm seriously considering this
[16:03] <nhm> That was using 4 LSI SAS9207-8i controllers with ext4.
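
The sort of run nhm describes can be reproduced with rados bench against a throwaway pool (pool name is illustrative; flag availability varies by version):

    rados -p bench-test bench 60 write --no-cleanup   # 4MB objects by default
    rados -p bench-test bench 60 seq                  # then sequential reads of the same objects
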
[16:04] <infernix> 4 or 5 PCIe SSDs, 1 DRAM battery backed journal device
[16:04] <infernix> but that's just bandwidth tests, I'm also wondering what kind of IO i would get through rbd
[16:04] <nhm> infernix: how fast is the battery-backed DRAM device?
[16:05] <infernix> well maybe forego that and just put the journal in actual RAM
[16:05] <Vjarjadian> sounds horrible
[16:05] <nhm> yes, IO is the bigger issue. For reads directly from pagecache, I've been able to do about 20,000 IOPs in 1 box.
[16:05] <Vjarjadian> and risks data loss iirc
[16:05] <Anticimex> journal in ram could work for test, i guess
[16:05] <nhm> IOPS rather
[16:05] <Anticimex> but wont have same latency as pcie-dram
[16:05] <Anticimex> (ram is better)
[16:05] <Anticimex> and obviously ram is volatile
[16:06] <infernix> or I could have a DRAM based storage device outside the ceph nodes, but that adds to the network bus bandwidth
[16:06] <infernix> in any case, 20k isn't nearly enough
[16:06] <infernix> where's the bottleneck?
[16:07] <infernix> i'm talking about pcie cards that do 80k a piece
[16:07] <Anticimex> how large writes in rados bench?
[16:07] <infernix> let's say 4k
[16:07] <infernix> common block device write size
[16:07] <Anticimex> are you doing 2.7GB/s with 4k writes?
[16:07] <Anticimex> yeah well, 4k random writes is quite different from sequential writing, right
[16:07] <infernix> hence ssd
[16:08] <Anticimex> sure
[16:08] <Anticimex> oh, it was nhm that hit 2.7GB/s with 24 spinning + 8 ssd journals
[16:08] <Anticimex> with large objects
[16:08] <nhm> infernix: I didn't map out the op latencies in that test, but I suspect that the combination of TCP/IP overhead (it's still doing socket communication via localhost), crush, crc32c, and various other bits of overhead are restricting how many IOPs can be pushed.
[16:09] <infernix> ah
[16:09] <Anticimex> 112MB / drive is quite ok throughput i guess
[16:09] <infernix> tcp/ip could be eliminated with rsockets
[16:09] <Anticimex> ~20Gbps. neat
[16:10] <infernix> i don't know how rsockets would work on localhost though
[16:10] <nhm> Anticimex: yes, I can max out at bonded 10GbE link with RADOS bench which is fun. :)
[16:11] <nhm> infernix: No idea. We are looking at rsockets closely, but there is no in-kernel implementation yet. We would probably have to write one.
[16:11] <infernix> you mean for rbd, right?
[16:12] <infernix> hm, and cephfs
[16:12] <Anticimex> osd's are userspace though
[16:12] <Anticimex> for specifically localhost-implementation, that should work fine :)
[16:12] <Anticimex> but i don't think that's what ceph is about
[16:12] <nhm> infernix: yes, though cephfs is probably the bigger question.
[16:12] <matt_> I've had a lot of success using the PCIe SSD's as a journal for 24 disks
[16:12] <nhm> infernix: a lot of the people that want infiniband are the same people that want working cephfs. :)
[16:13] <Anticimex> nhm: do the journals resolve the replica ACK'ing?
[16:13] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) has joined #ceph
[16:13] <nhm> matt_: nice. I'm using a pile of intel 520s.
[16:13] <matt_> Anticimex, the faster it can get the data to a journal the faster it ACK's a replica copy
[16:13] <infernix> yeah i guess those targets pair up
[16:13] <todin> Hi, is in the openstack grizzly release anything diffrent for ceph/cinder than in the folsom release?
[16:13] <Anticimex> matt_: check. neat neat
[16:13] <nhm> matt_: only thing I don't like about the PCIE card solution is that if you lose a card you potentially lose a bunch of OSDs at the same time.
[16:13] <infernix> HPC doesn't have a lot of use for rbd
[16:14] <Anticimex> matt_: then drives can be left to simply write final data down
[16:14] <Anticimex> yet, 112 MB/s per drive doesn't leave that much room for writing 2 copies?
[16:14] <Anticimex> nhm: is 2.7GB/s counting all replicas too? or was it with replication=1 ?
[16:14] <nhm> Anticimex: replication=1. Just trying to see how fast I could push data at the OSDs.
[16:14] <Anticimex> gotcha
[16:15] * mnash (~chatzilla@vpn.expressionanalysis.com) has joined #ceph
[16:15] <matt_> nhm, I run all of my sata drives off an expander so I'm limited to around 1 GB/s throughput because of that. The PCIe ssd performs around the same and I have a single slot free for a dual IB card
[16:15] <nhm> matt_: I've been suspecting for a while that Ceph seems to hate expanders. Some do better than others.
[16:17] <matt_> nhm, I think it's just a problem with expanders in general. Things get a bit problematic when you have 24 drives trying to cram as much data as they can down a few 6Gb/s SAS lanes
[16:17] <nhm> matt_: The test node I've got is a SC847A, so no expanders in the backplanes.
[16:17] <Vjarjadian> you running those expanders with high end boards or consumer level ones?
[16:17] <matt_> nhm, I think I have the same Supermicro system but with a 6Gb/s single port expander
[16:18] <nhm> matt_: Yeah. I've seen them push lots of large sequential IO, but I wonder whether with Ceph, since it's doing things like dentry lookups and metadata writes sporadically mixed in, the expanders are causing high latencies or other weird issues.
[16:18] <matt_> Vjarjadian, I'm using a Supermicro expander/backplane which uses an LSI chipset. The LSI expanders are pretty common
[16:19] <nhm> matt_: with 4 LSI9207-8i cards and SSD journals that thing just flies.
[16:19] <matt_> nhm, I use the same cards, they're brilliant
[16:20] <nhm> matt_: amazingly the highpoint rocket 2720SGL performs about the same, at least in a 1-card setup.
[16:20] <nhm> matt_: no idea if it will survive under sustained load, but those things are dirt cheap.
[16:20] <matt_> nhm, what was your 4kb write performance like in rados bench?
[16:20] <nhm> The old SAS2008 based cards do pretty well too.
[16:23] <nhm> matt_: depends a lot on what FS is used for the OSDs, and what version of ceph is used. For 0.58, I could do about 40MB/s writes and around 80MB/s reads with enough concurrency.
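
For the small-write numbers being compared here, rados bench lets you pin the object size and the number of concurrent ops (a sketch; pool name is illustrative):

    rados -p bench-test bench 60 write -b 4096 -t 64   # 4KB objects, 64 ops in flight
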
[16:23] <matt_> That's impressive, I was only able to get up to around 25MB/s writes
[16:23] <nhm> matt_: what version of ceph were you testing?
[16:24] <matt_> 0.60
[16:24] <nhm> what FS?
[16:26] * lofejndif (~lsqavnbok@95.170.88.81) has joined #ceph
[16:26] * dosaboy (~dosaboy@host86-161-164-218.range86-161.btcentralplus.com) has joined #ceph
[16:26] <matt_> XFS
[16:26] <nhm> Ok. with 0.60 XFS performance should be pretty good for small writes.
[16:27] <nhm> how many drives?
[16:27] <matt_> for testing it was just 24
[16:27] <nhm> ok. And just 1 controller?
[16:27] <Kioob`Taff> and for production, should we upgrade from 0.56.4 to 0.60 right now?
[16:27] <matt_> I've tested 2 servers, 48 drives and 3 replicas and it gets down to 5 MB/s
[16:27] <nhm> Kioob`Taff: cuttlefish should be out soonish
[16:28] <Kioob`Taff> thanks nhm, I will wait for cuttlefish then ;)
[16:28] <matt_> Kioob`Taff, don't upgrade just yet... I made this mistake and it could be quite painful
[16:28] <nhm> matt_: ouch. :/ Latency kills
[16:28] <nhm> matt_: how many concurrent ops with that test?
[16:28] <janos> how is the 0.56.3 --> 0.56.4 upgrade? any issues?
[16:28] * portante|ltp (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Ping timeout: 480 seconds)
[16:29] <infernix> nhm: are there any ways or available tests to figure out exactly where the performance bottlenecks are?
[16:29] <nhm> infernix: you can get a ton of information from the OSD admin sockets
[16:29] <nhm> infernix: And the logs too with debug = 20 if you are willing to figure out how to parse them. :)
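
The admin socket nhm mentions can be queried directly on the OSD host; the path below follows the default naming for cluster 'ceph' and osd.0:

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump     # op latency counters
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show   # effective configuration
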
[16:30] <matt_> nhm, 64 I think. I was trying to max it out
[16:30] <Zethrok> janos: I didn't have any issues going from .3 -> .4 on a few clusters; YMMV
[16:30] <nhm> matt_: might be worth trying even more, and possibly from multiple clients.
[16:30] <Kioob`Taff> and generally speaking, with RBD, for a lot of random small writes, which FS do you recommend?
[16:30] <janos> Zethrok: sounds good. i may wait until AFTER my mini-vacation coming up to do it ;)
[16:31] <nhm> matt_: one thing you'll want to watch out for is whether or not operations are backing up on any specific OSD.
[16:31] <matt_> nhm, it should increase once I get my new infiniband switch and I can go back to connected mode which reduces latency a fair bit
[16:31] <nhm> Kioob`Taff: it keeps changing. With 0.58+ XFS small write performance improved dramatically.
[16:31] <matt_> nhm, the cluster was in use at the time also just not heavily loaded since it was off peak times
[16:31] <Zethrok> janos: Hehe, I would prob. as well ^^ - just in case
[16:32] <Kioob`Taff> ok nhm, very good news !
[16:32] <nhm> Kioob`Taff: you might want to look at some of the old tuning articles and io scheduler articles I put up on the blog over the winter.
[16:32] <Kioob`Taff> infernix: I started a (really) small script to have a sort of "top" tool through the admin socket of OSD
[16:33] <Kioob`Taff> nhm: I think that I read all of them, but I will check again
[16:33] <infernix> Kioob`Taff: nice
[16:33] <nhm> Kioob`Taff: ok, no worries. There are just so many things to tune with small IO that it's sometimes hard to keep track of it all.
[16:34] <Kioob`Taff> I confirm that !
[16:34] <Kioob`Taff> and I'm looking to replace my old "postville" SSD with new "Intel S3700"
[16:34] <nhm> The S3700s look fantastic.
[16:34] <Kioob`Taff> yep
[16:35] <nhm> Almost as fast as the 520 with a ton more durability.
[16:35] <nhm> I've been hounding Dell to start selling them for Ceph.
[16:35] <nhm> We'll see if I can convince them. :)
[16:37] * drokita (~drokita@199.255.228.128) has joined #ceph
[16:39] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[16:39] <Kioob`Taff> infernix: http://pastebin.com/jLynvU59 it's just a base, there is a lot more to do
[16:39] <Kioob`Taff> (and tested only on Debian)
[16:40] <nhm> Kioob`Taff: that looks handy, got it up on github or anything?
[16:40] <jmlowe> wasn't there somebody using bcache on their osds?
[16:40] <nhm> jmlowe: a couple of people have tried bcache, flashcache, etc.
[16:41] <Kioob`Taff> nhm: if it were in Python, it could be useful, but in PHP I'm really not sure
[16:41] <jmlowe> nhm: know how well it worked for them?
[16:41] <Kioob`Taff> I tried bcache and stop that
[16:41] <Kioob`Taff> but in my case I have very slow SSD
[16:41] <nhm> jmlowe: Supposedly with good results, though I heard from Xiaoxi at Intel that with 0.58 the changes to pg_info made bcache/flashcache have less of an impact.
[16:42] <jmlowe> Kioob`Taff: PHP is really only useful to the people compromising your machine
[16:42] <Kioob`Taff> It's really useful for me ;)
[16:44] <darkfaded> flashcache is also quite latency-sensitive since if something is viable for caching it'll be read in-band through the ssd (so read from disk, copy to flashcache ssd, return to application from flashcache)
[16:44] <darkfaded> basically it's no problem for facebook because they got fancy fusionio everywhere
[16:48] * Volture (~Volture@office.meganet.ru) has joined #ceph
[16:48] <nhm> heh, must be nice. :)
[16:49] <Anticimex> i wonder if dm-cache from a SSD partitioned to handle all $X drives in a box will help with writing, in addition to journals, for small writes
[16:50] <Anticimex> depends on where the perf dies i guess
[16:50] <nhm> Anticimex: 1 SSD probably isn't enough for more than about 5 journals unless it's extremely fast.
[16:50] <Anticimex> and i guess it's not between journal <-> magnetic drive
[16:50] <darkfaded> heh i totally missed dm-cache so far
[16:50] <darkfaded> Anticimex: thanks
[16:50] <Anticimex> i mean, small io perf of magnetic disk setup can be benched with bonnie++ etc, right
[16:50] <Volture> hi
[16:50] <Anticimex> to get baseline
[16:51] <Volture> are there any Russian speakers here?
[16:51] <jmlowe> I'm getting a couple of more machines for osd's with these in them http://h30094.www3.hp.com/product.aspx?sku=10389149&mfg_part=631671-B21&pagemode=ca
[16:51] <Kioob`Taff> "bcache" is good (in theory) since it detects sequential and random writes, to use SSD only for random writes.
[16:51] <darkfaded> yeah theory and bcache go hand in hand ;p
[16:52] <jmlowe> hp dl 380 g8's with 12x3TB drives
[16:52] <Kioob`Taff> Volture: no, sorry
[16:52] <Volture> Ok
[16:53] <Volture> Kioob`Taff: Can I ask you a few questions on the configuration of ceph?
[16:53] <jmlowe> should bring me up to 24 osd's with about 137TB of raw storage
[16:54] <Volture> Kioob`Taff: Maybe I did not read what is in the documentation
[16:57] <Volture> please tell me what this error means: http://pastebin.com/xaYzW3ks
[16:57] <Volture> cluster is configured with 3 mon daemons & 7 osd daemons
[16:58] <Volture> cluster consists of 5 servers
[17:00] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[17:01] * noahmehl_ (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[17:04] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Ping timeout: 480 seconds)
[17:04] * noahmehl_ is now known as noahmehl
[17:05] * tserong (~tserong@124-171-116-238.dyn.iinet.net.au) Quit (Quit: Leaving)
[17:27] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[17:33] * ScOut3R (~ScOut3R@212.96.47.215) Quit (Ping timeout: 480 seconds)
[17:35] * tserong (~tserong@124-171-116-238.dyn.iinet.net.au) has joined #ceph
[17:39] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[17:45] * calebamiles (~caleb@c-50-138-218-203.hsd1.vt.comcast.net) has joined #ceph
[17:46] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[17:46] * verwilst (~verwilst@109.130.227.28) Quit (Quit: Ex-Chat)
[17:48] * darkfaded (~floh@88.79.251.60) Quit (Read error: Connection reset by peer)
[17:49] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[17:50] * darkfader (~floh@88.79.251.60) has joined #ceph
[17:51] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:54] * bmjason (~bmjason@74.121.199.170) has joined #ceph
[17:55] * stacker666 (~stacker66@33.pool85-58-181.dynamic.orange.es) has joined #ceph
[17:56] <stacker666> hi all
[17:56] <bmjason> heya
[17:57] <stacker666> can somebody answer an easy question for me?
[17:57] <bmjason> not until you ask it :)
[17:57] <stacker666> hehe
[17:58] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) Quit (Quit: Leaving)
[17:58] <stacker666> what pool do i use when i mount with mount -t ceph [ip]:/ ?
[17:59] * alram (~alram@38.122.20.226) has joined #ceph
[18:00] * Rocky (~r.nap@188.205.52.204) has joined #ceph
[18:00] <fghaas> data and metadata
[18:00] <stacker666> thanks a lot
[18:00] <fghaas> I mean, the cephfs data lives in data, and the MDS metadata in metadata, in case that's not obvious :)
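
For completeness, a typical kernel-client mount with cephx enabled looks something like this (monitor address and secret file are placeholders):

    mount -t ceph 192.168.0.1:6789:/ /mnt/ceph \
        -o name=admin,secretfile=/etc/ceph/admin.secret
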
[18:01] <stacker666> haha
[18:01] <stacker666> i want to make sure of this
[18:01] <stacker666> thanks again fghaas
[18:01] <sagewk> joao: morning!
[18:02] <bmjason> when using ceph to boot from volume for openstack instances.. if a compute node fails, is there a way to make the instance on the failed compute node boot on another compute node? a simple restart or migration doesn't seem to work
[18:03] <bmjason> i know that is more of an openstack question than ceph.. but i don't know if i have to tell ceph that the volume is now on another compute node or not
[18:03] <joao> hey sagewk
[18:03] <joao> mornign
[18:04] <sagewk> where are we at with mon bugs?
[18:04] <gregaf> bmjason: you don't need to tell Ceph anything about that; it's all OpenStack stuff; you'll have to check with them
[18:04] <joao> sagewk, see 4521
[18:04] <bmjason> gregaf: thanks just trying to eliminate pieces from the puzzle
[18:04] <joao> damned bug took me all morning to figure out
[18:05] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[18:06] <sagewk> joao: buggy leveldb?
[18:06] * tnt (~tnt@91.176.19.114) Quit (Ping timeout: 480 seconds)
[18:06] <matt_> stacker666, I think it defaults to the data pool
[18:07] <joao> sagewk, seems like something is wrong with it; status always report an error
[18:07] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[18:07] <joao> can't say if it's our end that is buggy, if leveldb itself, or if the store
[18:08] * BillK (~BillK@124-148-103-23.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:08] <joao> tried googling for similar issues but no joy
[18:08] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[18:08] <sagewk> joao: you have a copy of the store that has this behavior?
[18:08] <gregaf> joao: did you check with him to see if he'd had any monitor issues prior to this?
[18:08] <joao> sagewk, attached to the bug
[18:08] <gregaf> corrupted disk state makes me think previous hardware failures...
[18:08] <joao> gregaf, he suffered from 4521
[18:09] <sagewk> or maybe we're not syncing properly
[18:09] <gregaf> although that shouldn't have gotten through leveldb's shield, hrm
[18:09] <joao> gregaf, sagewk, the version is there and, afaict, it's okay
[18:09] <gregaf> sagewk: nah, the only sync options are "make sure this write goes to disk" or "put it to disk whenever we next ask for a consistency point"
[18:09] <gregaf> it'll fall back to the last consistency point though on a failure, unless I'm very confused
[18:09] <joao> repairing leveldb makes everything okay aside from a missing pgmap version
[19:09] <Volture> please tell me what this error means: http://pastebin.com/xaYzW3ks
[18:10] <gregaf> joao: have you correlated that missing version with his history to figure out what happened to the cluster around then?
[18:10] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[18:10] <sagewk> volture: probably means you're running 0.56.3 instead of .4 ?
[18:11] <Volture> sagewk: downgrade ceph ?
[18:11] <matt_> joao, have you had a chance to check if my store uploaded correctly?
[18:11] <sagewk> volture: no, i mean it looks like a bug that was fixed in 0.56.4
[18:11] <sagewk> and 0.58 or thereabouts
[18:12] <Volture> sagewk: 0.56.3
[18:12] <joao> matt_, no
[18:12] <sagewk> volture: upgrading to .4 should fix it then
[18:12] <matt_> joao, I think it uploaded correctly but just wanted to make sure
[18:12] <Volture> sagewk: ok try now
[18:13] <Volture> sagewk: thank you
[18:13] <sagewk> volture: np
[18:17] <sagewk> joao: hrm. well, get() will certainly work around the issue, but then i worry the same underlying issue will crop up with other iterator users (of which i assume there are many)
[18:18] <sagewk> does leveldb say anything interesting when it repairs the store?
[18:18] <Volture> sagewk: does the ceph cluster need to be reconfigured?
[18:19] <sagewk> volture: just apt-get install the updated packages
[18:19] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[18:19] <sagewk> volture: and restart daemons
[18:19] <Volture> sagewk: i use gentoo
[18:19] <sagewk> heh then whatever the equivalent of make install and service ceph restart is :)
[18:20] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[18:21] <joao> sagewk, yes, there are
[18:22] <joao> sagewk, maybe sjust has some other idea on why this could be happening?
[18:22] <joao> I have to run for a bit to finish packing some stuff but will be around for standup
[18:31] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[18:31] * rawsik (~kvirc@31.7.230.12) Quit (Read error: Connection reset by peer)
[18:32] * bergerx_ (~bekir@78.188.204.182) Quit (Quit: Leaving.)
[18:32] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[18:32] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[18:32] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[18:34] * l0nk (~alex@83.167.43.235) Quit (Quit: Leaving.)
[18:34] <Gugge-47527> should http://tracker.ceph.com/issues/4282 be fixed in 0.56.4, or only in newer releases?
[18:36] * rahmu (~rahmu@83.167.43.235) Quit (Remote host closed the connection)
[18:37] <Volture> sagewk: 2013-04-24 20:36:28.484301 7fd4a7719700 0 mon.0@1(peon) e1 handle_command mon_command(status v 0) v1 -- and what is this?
[18:37] <sagewk> 'ceph status' or 'ceph -s'
[18:38] <Volture> sagewk: mon-log
[18:38] <sagewk> it's just logging that someone did 'ceph -s'
[18:38] * leseb (~Adium@83.167.43.235) Quit (Quit: Leaving.)
[18:40] <Volture> sagewk: 7fe503fff700 0 -- 172.16.0.3:0/24903 >> 172.16.0.1:6802/19503 pipe(0x7fe440004520 sd=43 :0 s=1 pgs=0 cs=0 l=1).fault -- sometimes I get such an error from "ceph -s"
[18:40] <sagewk> it's a warning, you can ignore it
[18:41] <Volture> sagewk: what warning?
[18:42] <sagewk> the fault message is a warning and can be ignored
[18:42] <gregaf> Gugge-47527: that's a question for sagewk
[18:43] <Volture> sagewk: is there a command to check the consistency of the ceph cluster?
[18:43] <gregaf> (who has been getting much better about mentioning where it's resolved lately…*glares*) ;)
[18:43] <Gugge-47527> sagewk: do you know? :)
[18:44] <sagewk> gugge-47527: it was mostly a kernel side issue, will go upstream in the next window. it's a non-critical warning (nothing bad happens, just annoying)
[18:44] <sagewk> there were some userspace fixes too, but nothing critical.
[18:44] <Gugge-47527> ahh, super then :)
[18:46] <pioto> hi, i got around to writing up a blueprint on the 'client security' stuff discussed here briefly last night. any feedback would be welcome: http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Client_Security_for_CephFS
[18:46] <pioto> (also made this issue last night: http://tracker.ceph.com/issues/4799 )
[18:46] <Kdecherf> There is a dead link here: http://ceph.com/docs/master/rados/configuration/ceph-conf/#ceph-logging-and-debugging
[18:46] <Kdecherf> (Hi all)
[18:48] <pioto> though i guess maybe i'd be better off bringing this up on the ceph-devel mailing list?
[18:48] <mattch> pioto: interesting... is this a first step on the way to a full user-authenticated fs, or would that be as far as it goes?
[18:50] <Anticimex> matt_, nhm, when you do perf tests
[18:50] <Anticimex> have you also done oprofile or some other profiling?
[18:50] <nhm> Anticimex: yes, usually with perf or sysprof
[18:51] <Anticimex> what does it tell?
[18:51] <pioto> mattch: well. i think that a good useful point to get to would be "each client can only touch the files it's supposed to"
[18:51] <Anticimex> where are cycles spent?
[18:51] <nhm> Anticimex: sadly I keep having problems getting some symbols to resolve.
[18:51] <pioto> this lets you not care too much about uid clashes, etc
[18:51] <pioto> basically, i see this as being like nfs, only better
[18:51] <nhm> Anticimex: If we are pushing a lot of throughput a non-trivial amount of time is spent in crc32c calculations.
[18:51] <pioto> in terms of the use cases
[18:51] <mattch> pioto: Indeed. I guess I was thinking how we currently use samba/cifs and wondering if it was ever going to be on the ceph roadmap to make it handle user-authentication
[18:51] * Anticimex has a pretty good understanding of the gap between linux's network stacks' performance and what optimal code in a cpu will do
[18:52] <nhm> Anticimex: MD5 is even worse if you are pushing a lot of data through RGW.
[18:52] <Anticimex> crc32, okay. are they optimized with data pipelining etc? :)
[18:52] <mattch> pioto: but yes, something other than host-based nfs auth would be nice too!
[18:52] <Anticimex> ah, that crc32 needs to verify the entire data being transferred?
[18:52] <pioto> mattch: yes.
[18:52] <Anticimex> (i guess)
[18:52] <nhm> Anticimex: In the past we had some thread contention issues but Sam fixed the worst of those up 5-6 months ago.
[18:53] <pioto> ideally, i'd like to have something like the 'krb5p' level of security, too (encrypted messages, instead of just signed messages)
[18:53] <pioto> but that's for another time
[18:53] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[18:53] * vata (~vata@2607:fad8:4:6:2936:71bf:9c47:80e8) has joined #ceph
[18:53] <Anticimex> nhm: accelerated wide-register crc32 instructions exist in the AVX instruction set. are those used when available?
[18:54] <Anticimex> ugh, md5 is slow, yes
[18:54] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit ()
[18:54] <nhm> Anticimex: not yet. Right now we use a slice-by-8 algorithm, but I'm hoping we can switch to crcutil at some point.
[18:55] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[18:55] <nhm> Anticimex: the CRC stuff though is really only an issue if pushing a lot of data or running on low powered CPUs.
[18:56] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[18:56] <Anticimex> okay
[18:56] <Anticimex> what about small-IO perf?
[18:56] <nhm> probably more interesting is doing an evaluation of latencies and what things are slowing ops down. I suspect that things like how the VM layer handles dentry and metadata caching, leveldb, and how data gets stored in the OSDs will be interesting.
[18:57] <Anticimex> i mean, put differently, ceph OSD performance should match bonnie++ perf on the magnetic drives
[18:57] <Anticimex> if it doesn't, something's "wrong".. my 2 cents :)
[18:57] <nhm> Anticimex: like I said before, I've seen us push reads at about 20,000 IOPs for 1 node. Need to do some investigation to see if there are any obvious bottlenecks there.
[18:58] <Anticimex> that was a 24-disk node?
[18:58] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[18:58] <nhm> Anticimex: that's probably not realistic given the huge pile of software and network you are adding on top of the disks with ceph.
[18:58] <nhm> Anticimex: that was just reading from pagecache
[18:59] <Anticimex> nhm: well, magnetic drives are very slow, compared to a CPU :)
[18:59] <Anticimex> for my own education i should start putting some of these things together
[19:05] * davidzlap (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:05] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:06] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[19:06] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[19:07] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) Quit (Quit: Leaving.)
[19:07] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[19:10] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[19:16] <Anticimex> nhm: i'm not talking about doing benchmarks in a KVM VM somewhere remotely over the network, and that performance matching bonnie++. but localhost-test with just 1 copy stored
[19:17] <joao> Karcaw, around?
[19:18] <darkfader> Anticimex: it still *goes* through the network stack etc., just keep it in mind.
[19:19] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[19:19] <Anticimex> darkfader: yeah, i'm aware
[19:19] <Anticimex> the network stack is my domain a little ;)
[19:19] <darkfader> hehe
[19:19] <Anticimex> and more than that, it goes through the kernel, too
[19:20] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: Hard work pays off in the future, laziness pays off now)
[19:20] <Anticimex> so write path is $client -> network -> OSD -> journal -> magnetic drive?
[19:20] <Anticimex> and well, modulo replication
[19:21] <joao> Karcaw, around?
[19:21] <joao> oops
[19:21] <joao> had asked already
[19:21] <joao> totally slipped my mind
[19:22] * sagewk (~sage@2607:f298:a:607:3044:ec9d:bf7c:3ca1) Quit (Ping timeout: 480 seconds)
[19:25] <Kdecherf> gregaf: well, I ran another test for file access latency and I have a 16G logfile :)
[19:26] <gregaf> well, upload it and let me know when it's there
[19:27] <gregaf> dunno how soon I can get to it — we're pushing for Cuttlefish right now — but I'll try and take a look
[19:28] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[19:28] * noob2 (~cjh@66.220.144.81) has joined #ceph
[19:29] <imjustmatthew> If I have a mon with a resident memory size of 16G and nothing jumping out as bad in the logs, is there anything you need to debug the memory leak before I restart the process?
[19:29] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[19:29] <gregaf> imjustmatthew: run "ceph -w" in one window and "ceph tell mon x heap stats" in another
[19:30] <gregaf> if it says there's a lot of free space, try running "ceph tell mon x heap release" and see if that frees it up
[19:30] <imjustmatthew> gregaf: when you say free space, you mean in memory or on disk?
[19:30] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[19:30] <gregaf> memory
[19:31] <gregaf> you'll get a dump of how much memory is allocated for the process, how much of that it's actually using, etc
[19:31] <gregaf> lately we've started seeing the processes have a bunch of free-but-reserved-from-the-OS memory for some reason
[19:31] <Kdecherf> gregaf: yep np, thanks
[19:32] <imjustmatthew> k, I'll try that
[19:32] <Kdecherf> gregaf: current status: grep [...] log | gzip > mds.gz :)
[19:33] <gregaf> why are you grepping it?
[19:33] * sagewk (~sage@2607:f298:a:607:7c26:18f5:2afb:3067) has joined #ceph
[19:33] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[19:33] <Kdecherf> gregaf: grep -e "^2013-04-24 1\(8\|9\)"
[19:33] <gregaf> ah
[19:33] <Kdecherf> just to save some gigs
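Pieced together from the two fragments above, the filtering pipeline is roughly the following; the log file name is an assumption:

    # keep only entries from 18:xx and 19:xx on 2013-04-24, then compress
    grep -e '^2013-04-24 1\(8\|9\)' mds.log | gzip > mds.gz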
[19:34] * gaveen (~gaveen@175.157.231.40) has joined #ceph
[19:37] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[19:38] <Kdecherf> well, the compression is done, I will upload it at home
[19:38] <imjustmatthew> gregaf: "ceph tell mon.a heap stats" doesn't generate any output?
[19:38] <gregaf> I think you might need to use the id — 0, 1, 2
[19:39] <imjustmatthew> it still doesn't produce output in ceph -w
[19:39] * sagewk (~sage@2607:f298:a:607:7c26:18f5:2afb:3067) Quit (Remote host closed the connection)
[19:40] <dmick> may only be at a particular debug level..
[19:40] <gregaf> oh, sorry, "ceph mon tell 0 heap stats"
[19:41] <gregaf> unless we haven't rigged that up on the monitors, but I'm sure joao used this last week or something
[19:41] <Kdecherf> hm, has anyone found any memory-related (OOM) issues with the monitor on 0.60?
[19:41] <joao> imjustmatthew, you need to either start the tracer or specify CEPH_HEAP_PROFILER_INIT=1 as an env var to ceph-mon
[19:42] <joao> I think that's the correct env var name
[19:42] <gregaf> joao: no, it's just the stats I'm after, don't need to turn anything on for that
[19:42] <gregaf> I just can't remember what the stupid command format is
[19:42] <joao> ah right
[19:42] <joao> we need to dump
[19:42] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[19:42] <joao> imjustmatthew, try targeting with '-m ip:port' instead of 'tell mon.foo'
[19:42] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[19:43] <joao> as such 'ceph -m IP:port heap stats'
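For reference, the command variants tried over the course of this exchange, collected in one place; the monitor id and address are placeholders:

    # watch cluster log output in one window
    ceph -w

    # ask a monitor for tcmalloc heap statistics, by id or by address
    ceph mon tell 0 heap stats
    ceph -m 10.0.0.1:6789 heap stats

    # if stats show lots of free-but-unreleased memory, hand it back to the OS
    ceph mon tell 0 heap release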
[19:43] <imjustmatthew> joao: k, one sec
[19:43] <joao> fwiw, the monitor must be in the quorum
[19:44] <imjustmatthew> it is in the quorum; response is "tcmalloc not enabled, can't use heap profiler commands"
[19:44] <joao> have you compiled it manually?
[19:44] <joao> I'm not sure if we have tcmalloc enabled by default on our packages
[19:44] <gregaf> we definitely do
[19:44] <imjustmatthew> No, this is "version":"0.60-472-g327002e"
[19:44] <imjustmatthew> from one of the wip branches
[19:45] <gregaf> anyway, not being built with tcmalloc would do it for the memory use
[19:45] <gregaf> but I'm surprised it was built without, we only have one gitbuilder that does that I think
[19:45] <joao> well, I really don't know if default config builds with tcmalloc or if it only tries to discover it; I always compile it with --with-tcmalloc just in case
[19:46] <imjustmatthew> it's from wip-3495 on gitbuilder, but isn't the tip of that branch, it's a few days old
[19:46] <gregaf> it defaults to on unless you pass the without-tcmalloc flag to configure
[19:46] <gregaf> or at least it used to and it damn well better still be doing so
[19:46] <imjustmatthew> :)
[19:47] <gregaf> which OS?
[19:47] <imjustmatthew> anyways, it sounds like there's nothing more to learn from the process and it can be restarted?
[19:47] <imjustmatthew> Ubuntu 12.04
[19:48] <gregaf> joao: sagewk: well there was some trouble with using the profiler on Precise, right? did we do something foolish like disable tcmalloc entirely in our builds?
[19:49] <joao> there was something wrong with a specific google-perf lib version on either precise or oneiric, can't recall which
[19:49] <joao> that would hang the monitor
[19:50] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[19:50] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[19:52] <dmick> gregaf: this should be easy to discover
[19:53] <dmick> ldd <binary> should show it if it's linked in
[19:53] <gregaf> yeah, but I've never been into a gitbuilder and I don't know how to reach them :p
[19:53] <gregaf> well it clearly isn't for that particular build; what I want to know is *why*
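A concrete form of dmick's ldd check, assuming the standard package install path; any libtcmalloc line in the output means it was linked in:

    ldd /usr/bin/ceph-mon | grep tcmalloc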
[19:55] * loicd (~loic@magenta.dachary.org) has joined #ceph
[19:59] <pioto> gregaf: if you have a free moment, feedback on http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Client_Security_for_CephFS would be appreciated. but, if you think it'd be better to take that to, say, a mailing list, let me know
[20:01] <gregaf> it's fine as far as it goes; I'd want more detail on the coding tasks in there
[20:01] <gregaf> and I would personally push strongly for a first implementation just having separate pools rather than the object prefix caps (for the OSDs)
[20:04] * noob2 (~cjh@66.220.144.81) Quit (Quit: Leaving.)
[20:04] <pioto> gregaf: ok, i figured the prefix stuff was "bonus"
[20:05] <pioto> i'm not sure i can add enough detail on the coding tasks, as i don't know the code base at all
[20:05] <dmick> yeah, fleshing this out will be a group effort, likely
[20:06] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[20:06] <pioto> ok. well, however i can force, i mean, "encourage" people to start that group effort, let me know :)
[20:06] <gregaf> hehe
[20:06] <gregaf> well if you want to transcribe what I said yesterday and ask clarifying questions ;)
[20:06] <pioto> otherwise, i guess i can start digging, uh... here? https://github.com/ceph/ceph/tree/master/src/mds
[20:06] <dmick> pushing for a release this week, so we're all a little more distracted than usual, but, yeah.
[20:06] <pioto> ah, yes. cuttlefish.
[20:07] <pioto> k, i'll dig up your stuff from yesterday. thanks
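gregaf's pools-first suggestion can be sketched with the auth caps that already exist on the OSD side; the names here are hypothetical, and this only fences off data access, while the metadata side is exactly the open MDS work discussed next:

    # one data pool per tenant
    ceph osd pool create guest1data 128

    # a client key whose OSD cap is limited to that pool
    ceph auth get-or-create client.guest1 \
        mon 'allow r' mds 'allow' osd 'allow rw pool=guest1data'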
[20:07] <gregaf> I think basically you'll want to add a "can_access_path(dentry_path, client_caps)" function to the MDS somewhere
[20:07] <gregaf> and then all the handle_client_* methods will run that for each path the client message is dealing with
[20:07] <gregaf> and yes, you probably want to look in that folder, starting from...
[20:08] <gregaf> MDS::_dispatch in MDS.cc
[20:09] <gregaf> or perhaps handle_deferrable_message would be easier, which passes stuff on mostly to Server::dispatch, which splits them out into
[20:09] <gregaf> handle_client_lookup, handle_client_getattr, etc etc etc
[20:09] <pioto> ok
[20:09] <pioto> those sound familiar (lookup, getattr)
[20:10] <pioto> from when i hacked sshfs a while ago.
[20:10] <pioto> so hopefully i can work it out from there. thanks.
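One way to start the dig pioto describes, assuming a local clone of the repository linked above:

    git clone https://github.com/ceph/ceph.git
    cd ceph

    # the dispatch entry points gregaf mentions
    grep -n '_dispatch' src/mds/MDS.cc
    grep -n 'handle_deferrable_message' src/mds/MDS.cc

    # the per-operation handlers that would each need the path check
    grep -n 'handle_client_' src/mds/Server.cc | head -n 20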
[20:10] <mikedawson> gregaf: I like your recent note on #4793. I have felt like 0.59/0.60 have at least one major paxos/quorum issue. Sounds like you may be on the right track.
[20:12] <gregaf> heh
[20:12] <gregaf> *sigh*
[20:12] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[20:12] <gregaf> mikedawson: is it feasible for you to upgrade your monitors to next branch?
[20:13] <gregaf> I haven't checked back into the logs from #4784 again to make sure I still liked my diagnosis if the other two weren't forming a quorum
[20:13] <gregaf> but that is my best guess still
[20:14] <mikedawson> gregaf: sure thing. will upgrade in a few
[20:14] <gregaf> the next branch has resolved a lot of those causes; in particular the leveldb tuning stuff that Jim Schutt brought up
[20:15] <gregaf> not the monitor syncing one, though
[20:15] <mikedawson> gregaf: I've frustratingly had 3 in quorum for the past 36 hours though...
[20:16] <gregaf> mikedawson: oh, if it's been working then don't bother for now
[20:17] <mikedawson> gregaf: yeah, when it goes bad, it really goes bad (total outage, tough to get back in sync), but somehow I managed to wrestle it into compliance Monday night and it's held together since
[20:18] <mikedawson> gregaf: that being said, this isn't yet production and I'd rather see Cuttlefish succeed, so I'm going to upgrade to see if I can break next
[20:27] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[20:28] <mikedawson> gregaf: do you expect to have gitbuilder and cuttlefish packages for Raring in the near term? Raring's release date should be tomorrow
[20:30] * sagewk (~sage@2607:f298:a:607:7c26:18f5:2afb:3067) has joined #ceph
[20:32] <sagewk> joao: still there?
[20:33] <joao> yeah, was about to leave when my ride told me it's late (yet again)
[20:33] <joao> 15 more minutes or so
[20:33] <sagewk> joao: http://tracker.ceph.com/issues/4748
[20:33] <sagewk> shouldn't we be ignoring everyone when we're syncing, just as when we're out of quorum?
[20:34] <joao> uh, btw, it looks as if matt_'s store has only some 858MB worth of versions, and it's mostly pgmaps
[20:34] <sagewk> so nothing crazy?
[20:34] <joao> sagewk, the store still has 8GB
[20:36] <gregaf> mikedawson: not sure
[20:36] <gregaf> sagewk, do you know when we'll be building for raring?
[20:36] <joao> sagewk, we shouldn't ignore probes nor election messages while we're out of quorum
[20:37] <joao> but we should ignore pretty much everything besides sync messages if we're sync'ing
[20:37] <sagewk> but we should ignore everything else..
[20:37] <sagewk> in this case, a subscribe
[20:37] <joao> sagewk, I submitted a patch to ignore everything during a store sync, but I'm not sure if that's the right approach
[20:38] <sagewk> where is that patch?
[20:38] <joao> well, it should be, but given the monitor has been running just fine without these kinds of hard constraints, I thought we must be doing something wrong now
[20:38] <joao> wip-4748 I suppose
[20:39] <joao> sagewk, https://github.com/ceph/ceph/commit/6607ae9462b800417d6b0e5e5d9f75e71b6a4f08
[20:39] <sagewk> got it thanks
[20:40] * yehuda_hm (~yehuda@2602:306:330b:1410:9dbe:9b5c:6236:13d2) Quit (Ping timeout: 480 seconds)
[20:45] * tnt (~tnt@109.130.96.140) has joined #ceph
[20:46] <joao> sagewk, okay, now I'm really off
[20:46] <joao> later
[20:46] <sagewk> joao: have fun!
[20:46] <sagewk> thanks
[20:46] <joao> :qa
[20:46] <joao> oops
[20:46] * joao (~JL@89.181.147.69) Quit (Quit: Leaving)
[20:49] * Cube (~Cube@12.248.40.138) has joined #ceph
[20:52] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Quit: noahmehl)
[20:52] <mikedawson> gregaf: Moving to next. I just did apt-get update, then the upgrade returned 404s... new versions of packages were pushed in that window. how often does gitbuilder update next?
[20:59] * jskinner (~jskinner@69.170.148.179) has joined #ceph
[20:59] <pioto> gregaf: i think i picked out all the meaty bits from yesterday and today onto that blueprint. i'll start digging through the code now to find all the things you referred to, and come back later with questions. thanks.
[21:03] <wido> pioto: I checked on the libvirt vol clone
[21:03] <wido> that won't work with libvirt and RBD
[21:04] <wido> some stupid assumption libvirt makes. It wants to do a fopen() and doesn't let the storage driver handle it
[21:04] <pioto> hm. that's lame.
[21:04] <wido> since RBD images via librbd are neither files nor block devices
[21:04] <pioto> yeah
[21:04] <wido> you get it
[21:04] <pioto> so it's basically always doing a 'deep' clone anyways
[21:04] * stacker666 (~stacker66@33.pool85-58-181.dynamic.orange.es) Quit (Ping timeout: 480 seconds)
[21:05] <pioto> so it probably doesn't even handle qcow2 clones well. hm
[21:06] <wido> Well, it does something with qcow2, but I haven't checked
[21:06] <wido> it calls qemu-img at some point
[21:06] <pioto> ok
[21:06] <wido> anyway, for librbd it won't work right now
[21:06] <pioto> ok, thanks for confirming
[21:06] <wido> I started a ml discussion regarding this some time ago
[21:06] <wido> The driver should just give an input or output stream to libvirt
[21:07] * bmjason (~bmjason@74.121.199.170) has left #ceph
[21:07] <pioto> oh, so it may be doing something like... passing it a stream from 'qemu-img clone oldimg -'?
[21:08] <wido> can't tell for sure. It does something magic with qcow2, but it was some time ago when I checked
[21:09] <pioto> ok. "oh well"
[21:09] <pioto> there are plenty of other ways to do a clone instead
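For completeness, the librbd-native cloning pioto alludes to is snapshot-based layering, which requires format 2 images; the pool and image names here are made up:

    # snapshot a parent image, protect the snap, then take a copy-on-write clone
    rbd snap create rbd/baseimage@gold
    rbd snap protect rbd/baseimage@gold
    rbd clone rbd/baseimage@gold rbd/guest1-disk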
[21:12] * dwt (~dwt@128-107-239-234.cisco.com) has joined #ceph
[21:14] * amichel (~amichel@saint.uits.arizona.edu) has joined #ceph
[21:15] * eschnou (~eschnou@131.165-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:16] * t0rn (~ssullivan@2607:fad0:32:a02:d227:88ff:fe02:9896) Quit (Remote host closed the connection)
[21:24] * jmlowe1 (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[21:27] * paravoid_ (~paravoid@scrooge.tty.gr) has joined #ceph
[21:27] * thelan_ (~thelan@paris.servme.fr) has joined #ceph
[21:27] * __jt___ (~james@rhyolite.bx.mathcs.emory.edu) has joined #ceph
[21:28] * cclien_ (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) has joined #ceph
[21:28] * Hau_MI (~HauM1@login.univie.ac.at) has joined #ceph
[21:28] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * paravoid (~paravoid@scrooge.tty.gr) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * HauM1 (~HauM1@login.univie.ac.at) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * __jt__ (~james@rhyolite.bx.mathcs.emory.edu) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * chutz (~chutz@2600:3c01::f03c:91ff:feae:3253) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * thelan (~thelan@paris.servme.fr) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) Quit (synthon.oftc.net graviton.oftc.net)
[21:28] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (synthon.oftc.net graviton.oftc.net)
[21:34] * chutz (~chutz@li567-214.members.linode.com) has joined #ceph
[21:35] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[21:35] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[21:36] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) has joined #ceph
[21:37] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) has joined #ceph
[21:38] * ctrl (~ctrl@83.149.8.246) has joined #ceph
[21:38] <sagewk> sjust: need me to look at 4805?
[21:39] <ctrl> hi everyone!
[21:42] <sjusthm> sagewk: wip_4805
[21:42] <sjusthm> I think it'll be fine
[21:42] <ctrl> is anyone using ceph for sql servers?
[21:42] <sjusthm> want to do a quick run to confirm that check_..._sources is safe to call on a non-primary
[21:43] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has left #ceph
[21:45] <sagewk> sjusthm: remove_down_peer_info() is also called from PG::RecoveryState::Started::react(const AdvMap& advmap)
[21:45] <sagewk> with your patch the other 2 call sites are the same
[21:45] <sagewk> maybe they all should be?
[21:46] <sjusthm> if it gets to Started, we know it's not primary
[21:46] <sjusthm> on the other hand, couldn't hurt
[21:46] <sjusthm> one sec
[21:47] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[21:48] <sjusthm> sagewk: how about that?
[21:48] <sagewk> sjusthm: that makes me sleep easier :)
[21:48] <sjusthm> indeed
[21:49] <sjusthm> ok, I'll let it run a bit and then merge it
[21:51] <ctrl> anybody use ceph as storage for sql servers?
[21:51] <sjusthm> ctrl: you mean cephfs, or rbd?
[21:52] <ctrl> i mean rbd, sorry)
[21:52] * imjustmatthew (~imjustmat@c-24-127-107-51.hsd1.va.comcast.net) Quit (Remote host closed the connection)
[21:52] <sjusthm> I think someone did a small amount of work on that, nhm?
[21:58] <ctrl> i'm trying to find information about running sql servers on a ceph cluster: configuration, experiences, etc.
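Nothing specific comes back in the channel, but the generic pattern for putting a database on RBD is a mapped image with a local filesystem on top; the sizes, names, and paths below are assumptions:

    # create a 100 GB image, map it through the kernel client, format and mount
    rbd create sqldata --size 102400
    rbd map sqldata
    mkfs.xfs /dev/rbd/rbd/sqldata
    mount /dev/rbd/rbd/sqldata /var/lib/mysql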
[22:00] * imjustmatthew (~imjustmat@c-24-127-107-51.hsd1.va.comcast.net) has joined #ceph
[22:07] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[22:10] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:10] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) has left #ceph
[22:14] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[22:14] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:15] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[22:19] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[22:20] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[22:21] * ctrl (~ctrl@83.149.8.246) has left #ceph
[22:24] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[22:30] <benner> how to debug this issue: "2013-04-24 23:26:17.612157 mon.0 [INF] pgmap v11952: 584 pgs: 544 active+clean, 31 active+remapped, 9 active+degraded; 75290 MB data, 224 GB used, 10947 GB / 11172 GB avail; 509/57318 degraded (0.888%)"? i'm stuck at 0.888% and nothing changes over time.
[22:31] <dmick> benner: have you read http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ ?
[22:31] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[22:34] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[22:36] <benner> i read it in the past and forgot it exists. rereading
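For a stuck-degraded state like the one quoted above, the page dmick links comes down to a handful of commands; the pg id is a placeholder:

    ceph health detail           # which PGs are degraded/remapped, and why
    ceph pg dump_stuck unclean   # list PGs stuck in a non-clean state
    ceph pg 2.5f query           # peering/recovery detail for a single PG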
[22:42] <sagewk> sjusthm: can you take a quick look at wip-4785-b?
[22:42] <sjusthm> yep
[22:43] <sagewk> problematic object was
[22:43] <sagewk> cloneid snaps size overlap
[22:43] <sagewk> 1137 983 4194304 [0~1200128,1232896~5632,1239040~2955264]
[22:43] <sagewk> head - 4194304
[22:46] <sjusthm> sagewk: looks right
[22:51] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[22:57] <Kdecherf> gregaf: email sent
[23:04] <gregaf> cool
[23:06] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[23:07] <Kdecherf> gregaf: let me know if you need other logs
[23:07] <gregaf> what's up with the mount on the ftp server and the backup server?
[23:07] <gregaf> are you sure you don't have processes going through and looking at the tree (du, find, etc)?
[23:08] <gregaf> maybe inotify too, not sure about that kind of thing, though, hrm
[23:08] <Kdecherf> the backup server only opens files once a week (and not during the test)
[23:10] * tnt (~tnt@109.130.96.140) Quit (Ping timeout: 480 seconds)
[23:10] <Kdecherf> and nobody was accessing these files using ftp (I am the only one with credentials)
[23:11] <Kdecherf> no processes like du or find are used on these servers
[23:11] <Kdecherf> And as far as I know, inotify is not used on this storage
[23:11] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[23:12] <gregaf> that's probably enough, then
[23:14] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[23:14] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[23:15] * dontalton (~dwt@128-107-239-234.cisco.com) has joined #ceph
[23:17] * eschnou (~eschnou@131.165-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:19] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[23:20] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[23:21] * dwt (~dwt@128-107-239-234.cisco.com) Quit (Ping timeout: 480 seconds)
[23:22] * vata (~vata@2607:fad8:4:6:2936:71bf:9c47:80e8) Quit (Quit: Leaving.)
[23:27] * rustam (~rustam@94.15.91.30) has joined #ceph
[23:27] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[23:28] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[23:35] * sjusthm1 (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[23:35] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Read error: Connection reset by peer)
[23:36] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)
[23:40] * gaveen (~gaveen@175.157.231.40) Quit (Ping timeout: 480 seconds)
[23:42] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:42] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[23:42] * gmason (~gmason@hpcc-fw.net.msu.edu) Quit (Ping timeout: 480 seconds)
[23:43] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[23:44] * PerlStalker (~PerlStalk@72.166.192.70) Quit (Quit: ...)
[23:48] * rustam (~rustam@94.15.91.30) has joined #ceph
[23:50] * jskinner (~jskinner@69.170.148.179) Quit (Remote host closed the connection)
[23:52] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Connection reset by peer)
[23:52] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[23:57] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Remote host closed the connection)
[23:58] * rustam (~rustam@94.15.91.30) Quit (Remote host closed the connection)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.