#ceph IRC Log


IRC Log for 2013-07-18

Timestamps are in GMT/BST.

[0:00] <sjusthm> sagewk: yeah, we can
[0:00] <sjusthm> one sec
[0:00] <sjusthm> the inconvenient bit is canceling the first callback
[0:00] <sjusthm> or adding a second one
[0:01] <sagewk> are you sure the pull request can get in front of the activate?
[0:01] <sjusthm> yes, you can pull from a stray
[0:01] <sjusthm> which is not in the acting set
[0:01] <sagewk> oh right
[0:02] <sjusthm> on the other hand, we will activate() iff we are in the acting set
[0:02] <sagewk> i see.
[0:02] <sagewk> yeah, i'd make it conditional on the acting set
[0:02] <sjusthm> so we can just start_flush() in one place or the other conditionally
[0:02] <sjusthm> yeah
[0:02] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[0:03] * dosaboy_ (~dosaboy@host86-164-80-171.range86-164.btcentralplus.com) Quit (Quit: leaving)
[0:04] * dosaboy (~dosaboy@host86-164-80-171.range86-164.btcentralplus.com) has joined #ceph
[0:07] * Blackknight (~abcde@ has joined #ceph
[0:08] * drokita1 (~drokita@ Quit (Ping timeout: 480 seconds)
[0:11] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[0:11] <gregaf> nwat: have you made any arrangements with anybody (including yourself :p) about the os x patches stuff?
[0:12] * indeed_ (~indeed@ has joined #ceph
[0:12] * indeed (~indeed@ Quit (Read error: Connection reset by peer)
[0:13] * indeed_ (~indeed@ Quit (Remote host closed the connection)
[0:14] <nwat> gregaf: i have not made arrangements with others, but do have a plan in mind :) -- the total change set is large, but a majority are relatively simple, stand-alone changes. My plan was to try and push those upstream, and formulate a blueprint for the invasive changes
[0:14] * indeed (~indeed@ has joined #ceph
[0:14] <Blackknight> is it possible to use rbd on centos 6 to connect to a remote ceph cluster?
[0:15] <Blackknight> or do I need to use iscsi?
[0:15] * BillK (~BillK-OFT@203-59-173-44.dyn.iinet.net.au) has joined #ceph
[0:15] <nwat> invasive might be a little harsh tone :) more difficult changes. on the bright side, a large number of the tests are passing...
[0:16] <gregaf> nwat: I have several things I'm interested in doing that I haven't been, but this is one of them ;)
[0:16] <gregaf> so nobody else has promised reviews or anything?
[0:16] <gregaf> and I'm curious how much is involved in merging
[0:17] <Blackknight> I cannot find anything on google regarding this
[0:17] <gregaf> (the full set, not just these #defines)
[0:17] <gregaf> Blackknight: CentOS 6 does not have kernel support for RBD, which means you can't mount it natively
[0:17] <dmick> Blackknight: there are lots of access options. What do you intend to use the block device for?
[0:17] <gregaf> there are packages available…somewhere that let you install new enough QEMU stuff to use RBD with QEMU (all in userspace)
[0:18] <gregaf> and some people have installed newer kernels under CentOS 6
[0:18] <Blackknight> just as a backend for a lustre OST
[0:18] <gregaf> that's been discussed on the mailing list very recently
[0:18] <Blackknight> the problem is I don't know how to get my server attached to the volume
[0:18] * zhyan_ (~zhyan@ has joined #ceph
[0:18] <Blackknight> it's not using qemu
[0:18] <Blackknight> well, it is but I don't want to hack on the parent level
[0:19] <Blackknight> rbd-fuse might work
[0:19] <nwat> gregaf: no, i haven't asked for reviews beyond just putting up a pull request for some trivial stuff. 90% of the total changes I have should be able to go forward as simple #defines for OSX alternatives.
[0:20] <nwat> gregaf: wanna shepherd those upstream????
[0:20] <jakes> nwat: can you help me on this
[0:20] <dmick> rbd-fuse or stgt are options
[0:20] <gregaf> nwat: that's why I'm asking about them :)
[0:20] <gregaf> not sure how much time I can devote, but I'd love to see it happen
[0:20] <Blackknight> /dev/rbd doesn't seem to exist on the mon node
[0:20] <gregaf> how far along is your "unclean" stuff? building, or does it pass runtime tests too?
[0:21] <dmick> Blackknight: no, it wouldn't
[0:22] <dmick> /dev/rbd devices show up where you run the kernel client to connect to the cluster (that's how the client machine sees the cluster). Ceph isn't built on top of rbd; rbd is one of the things built on top of Ceph
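(Concretely, on a client machine whose kernel has the rbd module, the device only appears after mapping; the pool/image names below are placeholders, not anything from this channel:)

```
# run on the client, not on a mon/osd node
rbd create mypool/myimage --size 10240   # 10 GB image in pool "mypool"
rbd map mypool/myimage                   # this is what creates /dev/rbd0
rbd showmapped                           # list mapped devices on this host
```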
[0:23] <Blackknight> ah, ok
[0:23] * infinitytrapdoor (~infinityt@ip-109-41-61-88.web.vodafone.de) Quit (Ping timeout: 480 seconds)
[0:23] <nwat> gregaf: i've run libcephfs tests, and some rados tests. those work. all of core builds. i've been running libcephfs and rados tests, and they are all passing so far
[0:23] <Blackknight> I guess I'm hosed without using libvirt/kvm
[0:23] <gregaf> nwat: sweet
[0:24] <dmick> Blackknight: (03:20:18 PM) dmick: rbd-fuse or stgt are options
[0:24] <gregaf> don't suppose you kept good track of the dependencies that need to be installed? That was one of the stumbling blocks I ran into when I was doing this — spent a weekend or two of idle time installing packages and then realized I hadn't kept track of them and how they interacted with the compile-time patches I'd already generated
[0:24] <dmick> or a kernel from sometime in the last, oh, year
[0:24] <Blackknight> lol
[0:24] <Blackknight> it's a VM, not sure it will even boot a new kernel
[0:24] <Blackknight> worth a shot I guess
[0:25] <dmick> that's the beauty of opensource, that we can adopt new technology more rapidly :)
[0:26] <Blackknight> is the kernel module part of the ceph package?
[0:26] <nwat> gregaf: it's pretty straightforward. a few things can come from osx system libraries, some from homebrew (or macports or whatever), and a few things that just don't work at all like tcmalloc. things like pthread_spin_lock and libatomic-ops don't exist, but i have fallback implementations until I have time to setup OSX optimized versions (e.g. OSX has atomic ops)
[0:26] <dmick> nope, in the kernel packages
[0:26] * zhyan_ (~zhyan@ Quit (Ping timeout: 480 seconds)
[0:26] <gregaf> cool
[0:27] <Blackknight> I wish I wasn't stuck with centos
[0:27] <Blackknight> you can't do ANYTHING cool with it
[0:27] <gregaf> nwat: fallbacks should be pretty simple as long as we define a replacement for pthread_spin_lock as I believe we still have a non-libatomic-ops based atomic_t implementation
[0:27] <nwat> gregaf: the fallback for non-libatomic-ops is pthread_spin_lock :)
[0:27] <dmick> I've been experimenting with stgt lately with good success, and that will certainly work and be fairly simple. None of the paths are obviously right for that use case IMO
[0:28] <gregaf> nwat: yeah, but we need that elsewhere, don't we? (maybe not)
[0:28] <Blackknight> meh, if I change kernels lustre won't work :(
[0:29] <nwat> gregaf: the spin_lock version is, so it all shakes out with some macro magic to turn spinlocks into mutexes.
[0:29] <gregaf> yeah, that's what I was expecting
[0:29] <nwat> ahh cool
[0:29] <Blackknight> I don't want to have a MITM stgt server
[0:29] <gregaf> pretty sure os x mutexes are a futex-alike anyway so it shouldn't be too expensive
[0:30] <nwat> gregaf: osx has spinlock primitives, too
[0:30] * jakes (~oftc-webi@128-107-239-233.cisco.com) Quit (Remote host closed the connection)
[0:30] <gregaf> it does? I thought I'd looked and couldn't find them
[0:31] <gregaf> in that case it shouldn't be too hard to map them then, right?
[0:31] <nwat> https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/spinlock.3.html
[0:31] <nwat> haven't tried, i've been using the mutex fallback. mapping them will work, but might not be an easy macro fix.
[0:31] <gregaf> maybe replace the direct pthread_spin_lock call with a macro-ized function to make it easier
[0:32] <gregaf> hmm, do linux userspace spinlocks allow pre-emption?
[0:32] <gregaf> I guess they must, but the kernel ones disable it...?
[0:32] <nwat> kernel spinlocks disable preemption for sure, unless RT_PREEMPT
[0:33] <dmick> Blackknight: tgtd can run on one of the Ceph machines, but, yeah
[0:33] <nwat> i dunno how pthread_spinlock works. probably has some sort of futex fall back or something.. never looked into it
[0:33] <dmick> you could make a Lustre backend that uses librados. :) or librbd+librados
[0:34] <Blackknight> lol
[0:34] <Blackknight> I have one of my OSTs running on a 250G loop device
[0:41] * Vjarjadian (~IceChat77@ has joined #ceph
[0:42] * yanzheng (~zhyan@ has joined #ceph
[0:42] <nwat> gregaf: lemme know if anything in particular would make reviews easier. i was planning on continuing to push simple stuff to wip-osx-upstream for a while
[0:43] <gregaf> nwat: I'll have to see what the pull request actually looks like, then maybe I'll have some ideas
[0:44] <nwat> nwat: you want everything all at once?
[0:44] <nwat> haha
[0:44] <nwat> gregaf
[0:44] <gregaf> if you've got it cleaned up then making it available wouldn't hurt
[0:45] * jakes (~oftc-webi@128-107-239-233.cisco.com) has joined #ceph
[0:46] <jakes> is there any way that cinder volume that is been created with rbd, be used by cephfs client?
[0:48] <dmick> jakes: no, and why would you even want to do that?
[0:49] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Remote host closed the connection)
[0:50] * janisg (~troll@ Quit (Ping timeout: 480 seconds)
[0:50] <dmick> why does OSD::heartbeat() ask the question "refresh stats?" but then go ahead and do just that? Was it once conditional on something?
[0:51] * janisg (~troll@ has joined #ceph
[0:53] <jakes> dmick, as I once discussed, I was planning to use cephfs client in each of the guest Vm's of the openstack which connect to the common storage cluster in the host. This way, single storage cluster alone can be maintained for data(say hadoop ) as well as VM's . But, the problem is that, data storage used by hadoop will be not encountered in the volumes which are already created using rbd
[0:53] * indeed (~indeed@ Quit (Remote host closed the connection)
[0:53] <gregaf> dmick: hmm, I suspect this once included the logic for sending the pg_stat_t to the monitor, but now it's done async in another thread
[0:53] * zhyan_ (~zhyan@ has joined #ceph
[0:53] <gregaf> though I could be making that up
[0:54] <gregaf> wait, no, that comment was introduced with that block
[0:55] * yanzheng (~zhyan@ Quit (Ping timeout: 480 seconds)
[1:01] * BManojlovic (~steki@237-231.197-178.cust.bluewin.ch) Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:02] <jakes> dmick: is my understanding right?
[1:04] <dmick> jakes: I'm sorry; I just don't understand how rbd is involved
[1:08] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[1:08] * mozg (~andrei@host217-44-214-64.range217-44.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[1:09] <jakes> dmick: i am planning to use rbd to create/manage volumes in openstack as described in http://ceph.com/docs/master/rbd/rbd-openstack/ , . Also, I was planning to use cephfs client in each of the guest VM's for applications like hadoop for getting file-like-access. This also connects to same object store in the host. But, if use in this way, my storage needed for applications in VM's like hadoop, will be unaccounted as they will directly write to t
[1:10] * loicd (~loicd@bouncer.dachary.org) Quit (Ping timeout: 480 seconds)
[1:10] <jakes> So, Is there any way that volumes which are already created for a VM instance ,also be used for the applications in VM like hadoop.
[1:11] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) Quit (Ping timeout: 480 seconds)
[1:11] <dmick> I mean, you can run hdfs on the VMs, right?....if you choose two different abstractions into the Ceph cluster, you're choosing two different storage systems; like choosing a disk and a NAS box; of course one doesn't see the other
[1:12] * zynzel (zynzel@spof.pl) Quit (Ping timeout: 480 seconds)
[1:12] <dmick> OTOH, why does one *need* to see the other, since presumably the VMs have access to both? Why would you want the rbd image holding the VM boot image to have any crosstalk with the Hadoop filesystem holding the data (other than being accessed from the same machine)?
[1:12] * raso (~raso@deb-multimedia.org) Quit (Ping timeout: 480 seconds)
[1:12] * coredumb (~coredumb@xxx.coredumb.net) Quit (Ping timeout: 480 seconds)
[1:13] * coredumb (~coredumb@xxx.coredumb.net) has joined #ceph
[1:14] * loicd (~loicd@bouncer.dachary.org) has joined #ceph
[1:14] <iii8> Guys, do you plan to implement keystone authentication for S3 API also?
[1:15] <iii8> I'm asking only because currenty you're having one only for swift api.
[1:15] * zynzel (zynzel@spof.pl) has joined #ceph
[1:16] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) has joined #ceph
[1:16] <jakes> dmick: I was planning to get rid of hdfs somehow as it is not needed if we have a ceph cluster. So, I thought of removing hdfs and use cephfs clients which can talk directly to object store. Is there any way which can help, to keep only one cluster in whole setup which can hold data from applications inside VM's ?
[1:17] <gregaf> iii8: you can follow the ticket http://tracker.ceph.com/issues/5506
[1:17] * markit (~marco@88-149-177-66.v4.ngi.it) Quit ()
[1:17] * DarkAce-Z (~BillyMays@ has joined #ceph
[1:18] <gregaf> yehudasa tells me that ccourtaut might be working on it
[1:18] * mozg (~andrei@host109-151-35-94.range109-151.btcentralplus.com) has joined #ceph
[1:19] * paravoid is back
[1:20] * Blackknight (~abcde@ has left #ceph
[1:21] <dmick> jakes: it's clear that one Ceph cluster can hold *both* VM rbds *and* cephfs/hadoop filesystems, right? They're disjoint data, but they're in the same cluster?
[1:21] * DarkAceZ (~BillyMays@ Quit (Ping timeout: 480 seconds)
[1:26] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[1:28] <jakes> yup.I understand that one ceph cluster can hold both VM rbd's and cephfs data. But, my point is , since In hadoop , I need distributed view of data, I am forced to use the cephfs client in each VM. In that case, hadoop writes data to object store independently which will not be accounted in rbd volume. eg. If I create a volume of 10 gb out of 1TB and attaches to an instance, hadoop which is running in that instance can still use the entire sto
[1:28] <dmick> *accounting*.
[1:28] <dmick> your question is about *accounting*. Now I think I get it.
[1:29] <jakes> yes
[1:29] <jakes> :)
[1:29] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) Quit (Ping timeout: 480 seconds)
[1:29] <dmick> yes, cephfs cluster usage is not handled by OpenStack accounting mechanisms.
[1:30] <jakes> Is there any way that I can still write to same rbd volume from guest VM?
[1:31] <dmick> the rbd volume *is* the VM's disk
[1:31] <dmick> so however you write to the disk is however you write to it.
[1:31] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[1:31] <jakes> but, its not accessible inside VM right?
[1:32] <dmick> but nothing useful I can think of to leverage the local disk for Hadoop other than the obvious "partition and use something like hdfs". I mean, it's a simple equation here.
[1:32] <dmick> jakes: no, of course it's accessible inside the VM, because it is the VM's disk
[1:32] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) has joined #ceph
[1:32] <dmick> but that's not what you're asking
[1:33] * zhyan_ (~zhyan@ Quit (Ping timeout: 480 seconds)
[1:34] <jakes> dmick, it is accessible, but I won't get distributed view of files in the cluster which is needed for hadoop
[1:36] <jakes> if i am writing it locally*
[1:36] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[1:37] <dmick> there aren't any distributed files in the cluster involved. there is your disk. your disk is stored in the cluster, but that doesn't change the fact that it's a disk.
[1:37] <dmick> I'm gonna have to let someone else explain this to you, I guess; I'm not seeing something, or not saying it right.
[1:42] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[1:44] <jakes> Dmick: what I meant to say is, Say I have two VMs in openstack. file1 and file2 in VM A and file3 and file4 in VM B. If I have a cephfs client in both VMs, all files are accessible in both VMs. The first requirement is satisfied but, storage is not accounted as explained earlier. Second choice, is store files locally in each VM but the problem is, VMs will not get to see files of other VMs which is needed for Hadoop.
[1:47] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Read error: Operation timed out)
[1:48] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[1:50] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[1:54] <sagewk> gregaf, dmick, sjusthm: thoughts on wip-osd-latency? it's late to be adding but i'm thinking the it's low-risk and worth it
[1:54] <sagewk> there is a second patch that reports slow requests in 'health [detail]'
[1:57] <dmick> I was puzzling over why we expand/contract the vector when we're only tracking up to 1 << 30 anyway; is it worth the extra operations? does it do any more than save a few 32-bit words?
[2:02] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[2:11] * jeff-YF (~jeffyf@pool-173-66-21-43.washdc.fios.verizon.net) has joined #ceph
[2:15] * jakes (~oftc-webi@128-107-239-233.cisco.com) Quit (Remote host closed the connection)
[2:17] * mozg (~andrei@host109-151-35-94.range109-151.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[2:20] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[2:22] * yehudasa__ (~yehudasa@2607:f298:a:607:ea03:9aff:fe98:e8ff) Quit (Ping timeout: 480 seconds)
[2:31] * rturk is now known as rturk-away
[2:33] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[2:37] * huangjun (~huangjun@ has joined #ceph
[2:38] * The_Bishop (~bishop@2001:470:50b6:0:497e:a554:edd:9a9f) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[2:40] * sagelap (~sage@2600:1012:b02f:7ac7:6c8c:28d1:a574:a5cc) has joined #ceph
[2:41] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) has joined #ceph
[2:41] * stacker666 (~stacker66@ Quit (Ping timeout: 480 seconds)
[2:45] * MK_FG (~MK_FG@00018720.user.oftc.net) Quit (Ping timeout: 480 seconds)
[2:47] * MK_FG (~MK_FG@00018720.user.oftc.net) has joined #ceph
[2:48] * [1]huangjun (~huangjun@ has joined #ceph
[2:52] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[2:54] * Tamil (~tamil@ Quit (Quit: Leaving.)
[2:54] * jeff-YF (~jeffyf@pool-173-66-21-43.washdc.fios.verizon.net) Quit (Quit: jeff-YF)
[2:54] * huangjun (~huangjun@ Quit (Ping timeout: 480 seconds)
[2:54] * [1]huangjun is now known as huangjun
[2:55] * yy-nm (~chatzilla@ has joined #ceph
[2:58] * yy (~michealyx@ has joined #ceph
[2:59] * yy (~michealyx@ has left #ceph
[3:00] * yy (~michealyx@ has joined #ceph
[3:01] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Bye!)
[3:02] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[3:05] <sjustlaptop> sagelap: wip-osd-latency seems like a good thing to pull into dumpling
[3:06] <sagelap> k, i'll retest with the encoder cleanups first
[3:06] <sagelap> thanks!
[3:06] <sjustlaptop> sagelap: wip-cleanup-local has the start_flushes() in the new places if you'd like to take a look
[3:09] <sagelap> sjustlaptop: looks good
[3:09] <sjustlaptop> k, merging
[3:09] <sjustlaptop> teuthology git is where I add the config for nightly runs?
[3:09] <sagelap> dmick: ping
[3:09] <dmick> pong
[3:10] <sagelap> sjustlaptop: yeah, ceph.conf.template
[3:10] <sjustlaptop> k
[3:10] <sagelap> wip-cli?
[3:10] <sagelap> trivial but i tend to get my bash wrong on these things
[3:10] * yy (~michealyx@ has left #ceph
[3:11] <dmick> last 1, or last 2?
[3:11] <sagelap> last 1
[3:12] <dmick> [ ] is enough; || false doesn't add anything
[3:12] <dmick> and I would use $() over ``, but that's style
[3:12] <sagelap> i wasn't sure given the ! thing we saw before :)
[3:13] <dmick> oh well "enough for -e"...hm. hang on
[3:13] * raso (~raso@deb-multimedia.org) has joined #ceph
[3:14] <dmick> yeah, no, that works
[3:14] <sagelap> k cool
[3:15] <dmick> did you wanna do ceph osd rm $id after?
[3:17] <sagelap> oh, good call
[3:17] <sjustlaptop> paravoid: I've added something to next which may or may not help dramatically
[3:17] <sjustlaptop> once next builds
[3:19] <sagelap> dmick: thanks
[3:19] <dmick> totally
[3:30] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[3:31] * nwat (~oftc-webi@eduroam-251-132.ucsc.edu) Quit (Remote host closed the connection)
[3:41] * jeff-YF (~jeffyf@pool-173-66-21-43.washdc.fios.verizon.net) has joined #ceph
[3:46] * yehuda_hm (~yehuda@2602:306:330b:1410:baac:6fff:fec5:2aad) has joined #ceph
[3:49] * drokita (~drokita@97-92-254-72.dhcp.stls.mo.charter.com) has joined #ceph
[4:01] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[4:07] * jeff-YF (~jeffyf@pool-173-66-21-43.washdc.fios.verizon.net) Quit (Quit: jeff-YF)
[4:09] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[4:09] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[4:10] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[4:10] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Remote host closed the connection)
[4:12] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[4:31] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[4:44] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:00] * fireD (~fireD@93-142-204-59.adsl.net.t-com.hr) has joined #ceph
[5:07] * fireD1 (~fireD@93-139-147-104.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:11] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[5:16] * dosaboy_ (~dosaboy@host86-161-162-69.range86-161.btcentralplus.com) has joined #ceph
[5:22] * dosaboy (~dosaboy@host86-164-80-171.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[5:24] * drokita1 (~drokita@97-92-254-72.dhcp.stls.mo.charter.com) has joined #ceph
[5:25] <iii8> gregaf: thanks ;) Hope it will be implemented !
[5:29] * drokita (~drokita@97-92-254-72.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[5:29] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[5:38] * sagelap1 (~sage@ has joined #ceph
[5:42] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:45] * sagelap (~sage@2600:1012:b02f:7ac7:6c8c:28d1:a574:a5cc) Quit (Ping timeout: 480 seconds)
[6:02] <huangjun> the slow request tips from the osd occur when the client writes a big file (10GB) to the cluster, and then the client write rate goes down
[6:03] <huangjun> so is that a normal or no-worry output?
[6:13] * AfC (~andrew@jim1020952.lnk.telstra.net) has joined #ceph
[6:16] * yy (~michealyx@ has joined #ceph
[6:16] * yy (~michealyx@ has left #ceph
[6:18] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[6:21] * dosaboy (~dosaboy@host86-145-219-250.range86-145.btcentralplus.com) has joined #ceph
[6:27] * dosaboy_ (~dosaboy@host86-161-162-69.range86-161.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[6:28] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Ping timeout: 480 seconds)
[6:33] * jeandanielbussy (~jeandanie@124x35x46x15.ap124.ftth.ucom.ne.jp) has joined #ceph
[6:40] * silversurfer (~jeandanie@124x35x46x12.ap124.ftth.ucom.ne.jp) Quit (Ping timeout: 480 seconds)
[6:43] * tserong_ is now known as tserong
[6:45] * dosaboy_ (~dosaboy@host86-164-85-49.range86-164.btcentralplus.com) has joined #ceph
[6:48] * yehudasa__ (~yehudasa@2602:306:330b:1410:ed9b:1527:92bf:3254) has joined #ceph
[6:50] * dosaboy (~dosaboy@host86-145-219-250.range86-145.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[6:57] * AfC (~andrew@jim1020952.lnk.telstra.net) Quit (Quit: Leaving.)
[7:03] * AfC (~andrew@jim1020952.lnk.telstra.net) has joined #ceph
[7:12] * sig_wall (~adjkru@ Quit (Ping timeout: 480 seconds)
[7:13] * sig_wall (~adjkru@ has joined #ceph
[7:15] * Macheske (~Bram@d5152D87C.static.telenet.be) has joined #ceph
[7:19] * Machske (~Bram@d5152D87C.static.telenet.be) Quit (Ping timeout: 480 seconds)
[7:25] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[7:27] * AfC (~andrew@jim1020952.lnk.telstra.net) Quit (Quit: Leaving.)
[7:39] * gotenk (~gotenk@ has joined #ceph
[7:41] * jeandanielbussy_ (~jeandanie@124x35x46x12.ap124.ftth.ucom.ne.jp) has joined #ceph
[7:45] * jeandanielbussy (~jeandanie@124x35x46x15.ap124.ftth.ucom.ne.jp) Quit (Ping timeout: 480 seconds)
[7:56] * yy-nm (~chatzilla@ Quit (Read error: Connection reset by peer)
[7:57] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:57] * yy-nm (~chatzilla@ has joined #ceph
[8:16] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[8:17] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[8:18] * Vjarjadian (~IceChat77@ Quit (Quit: Friends help you move. Real friends help you move bodies.)
[8:21] <huangjun> hope someone can give me some tips of the slow request
[8:22] <yy-nm> keep showing the slow request?
[8:26] * odyssey4me (~odyssey4m@ has joined #ceph
[8:28] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[8:31] * alexxy[home] (~alexxy@ Quit (Ping timeout: 480 seconds)
[8:37] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) Quit (Remote host closed the connection)
[8:48] * AfC (~andrew@2001:44b8:31cb:d400:4c2f:6d75:850c:c196) has joined #ceph
[8:51] * yanzheng (~zhyan@ has joined #ceph
[8:59] * _Tass4dar (~tassadar@tassadar.xs4all.nl) has joined #ceph
[9:01] * _Tassadar (~tassadar@tassadar.xs4all.nl) Quit (Ping timeout: 480 seconds)
[9:01] * yanzheng (~zhyan@ Quit (Remote host closed the connection)
[9:13] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:15] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[9:20] * jcfischer_ (~fischer@user-23-14.vpn.switch.ch) has joined #ceph
[9:21] * AfC (~andrew@2001:44b8:31cb:d400:4c2f:6d75:850c:c196) Quit (Ping timeout: 480 seconds)
[9:22] * jcfischer (~fischer@peta-dhcp-13.switch.ch) Quit (Read error: Operation timed out)
[9:22] * jcfischer_ is now known as jcfischer
[9:24] * infinitytrapdoor (~infinityt@ has joined #ceph
[9:25] * yehudasa (~yehudasa@2607:f298:a:607:74e5:5f1:db38:4beb) Quit (Ping timeout: 480 seconds)
[9:29] <Kdecherf> Hi guys
[9:29] <Kdecherf> I have a lot of 'handle_client_session client_session(request_renewcaps seq 2617) from client.56352' on the mds with the last release
[9:29] <Kdecherf> any idea?
[9:30] * haomaiwang (~haomaiwan@ Quit (Remote host closed the connection)
[9:30] * haomaiwang (~haomaiwan@ has joined #ceph
[9:31] * AfC (~andrew@2001:44b8:31cb:d400:104f:1ffe:4dcf:cf5f) has joined #ceph
[9:34] <Kdecherf> (I believe it causes hangs on clients)
[9:34] * yehudasa (~yehudasa@2607:f298:a:607:e1d9:eaeb:4726:2fb6) has joined #ceph
[9:37] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[9:37] * jjgalvez (~jjgalvez@ip72-193-215-88.lv.lv.cox.net) Quit (Quit: Leaving.)
[9:37] * jjgalvez (~jjgalvez@ip72-193-215-88.lv.lv.cox.net) has joined #ceph
[9:38] * AfC (~andrew@2001:44b8:31cb:d400:104f:1ffe:4dcf:cf5f) Quit (Quit: Leaving.)
[9:41] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:41] * ChanServ sets mode +v andreask
[9:43] * s2r2 (~s2r2@ has joined #ceph
[9:45] * jjgalvez (~jjgalvez@ip72-193-215-88.lv.lv.cox.net) Quit (Ping timeout: 480 seconds)
[9:50] * infinitytrapdoor (~infinityt@ has joined #ceph
[9:50] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[9:55] * mschiff (~mschiff@port-57062.pppoe.wtnet.de) has joined #ceph
[9:58] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[9:58] * allsystemsarego (~allsystem@ has joined #ceph
[10:05] * LeaChim (~LeaChim@ has joined #ceph
[10:15] * gotenk (~gotenk@ has left #ceph
[10:31] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[10:32] * X3NQ (~X3NQ@ has joined #ceph
[10:33] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[10:33] <odyssey4me> I find this very odd. in my 'ceph -s' output I see the cluster processing a large number of operations/second. My VM (which has a block device on the RBD cluster) is doing a read test with iozone. However, none of my cluster disks are actually doing anything. What gives?
[10:34] <ofu_> all reads are served from RAM?
[10:38] <odyssey4me> I'm using iozone which is meant to focus on disk IO performance testing, so doing reads from RAM would appear to defeat the purpose.
[10:38] <odyssey4me> Also, reading from RAM should be a lot quicker than I'm seeing. My write speed is faster than my read speed.
[10:39] <ofu_> what is the size of the image of your VM? How many nodes and how much RAM do they have?
[10:39] <ofu_> well, strange
[10:39] <odyssey4me> Although it may be reading from the Ceph Cache.
[10:40] <odyssey4me> The VM's image size is 50GB. There are 3 nodes with 128GB RAM each. Each has 6 disks used as OSD's.
[10:42] <ofu_> 128GB >> 50GB... well
[10:43] <odyssey4me> So if it's reading from Ceph's cache, held in RAM, why does the performance suck so badly?
[10:44] <odyssey4me> The cluster's read of the performance is something like this: 3575KB/s rd, 893op/s
[10:44] <odyssey4me> which is pretty good
[10:44] <ofu_> ouch
[10:44] <odyssey4me> however iozone's report from inside the vm is more along the lines of 87 MB/s
[10:46] <odyssey4me> hmm, iozone's report is 87MB/s read, 284MB/s write
[10:47] <odyssey4me> I've seen 18658KB/s wr, 4050op/s reported by ceph -s
[10:49] <jcfischer> odyssey4me: I am re-running my iozone tests on a new VM (with 512 MB RAM instead of 8GB) and unmounting the volume between writes & reads and the random read/write performance has gone totally down the drain
[10:50] <jcfischer> the drama unfolds here: https://docs.google.com/spreadsheet/ccc?key=0AsjockBApInDdC10SWw4Y09HbUpsQWdZQ2RlTlhibEE#gid=1
[10:50] <jcfischer> (on sheet 2)
[10:50] <odyssey4me> jcfischer - my vm's have been with 2 vcpu's and 2gb ram... I chose that low spec to make sure that it wasn't the vm's cache that was doing a great job
[10:50] * bergerx_ (~bekir@ has joined #ceph
[10:50] <jcfischer> I am down to 1VCPU and 512 MB RAM
[10:50] <jcfischer> maybe that's too little?
[10:51] <odyssey4me> that said, I gave a go at a vm with 12cpu's and 64gb ram and it performed very well... then I added the -I flag to iozone to ensure that directio was happening and performance is down the toilet
[10:51] <jcfischer> I had a run with -I during the night and it didn't even finish the benchmark in 12 hours - I did a mercy killing
[10:52] <odyssey4me> I can comfortably say that if you're doing directio, the amount of ram doesn't matter - it ignores RAM.
[10:52] <odyssey4me> I was previously catering for RAM not being a factor by not using the -I flag and ensuring that the test file size (5GB) was larger than the amount of RAM (2GB)
[10:53] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[10:53] <odyssey4me> right now, with the -I flag on a 12cpu 64gb ram machine, I'm getting ~20MB/s write and ~2.5MB/s read
[10:54] <odyssey4me> this is shocking performance for read - I'm desperately trying to figure out where the problem could be
[10:59] <odyssey4me> hmm, I see that quantal and raring have later versions of kvm... and that there's a known async issue which was fixed in v1.4
[10:59] <jcfischer> We are running on a mix of quantal / raring hosts
[11:03] <odyssey4me> what's your kvm version?
[11:03] <odyssey4me> for your vm host
[11:08] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Read error: Operation timed out)
[11:09] <jcfischer> not sure this is the right one:
[11:09] <jcfischer> root@h0:~# kvm --version
[11:09] <jcfischer> QEMU emulator version 1.2.0 (qemu-kvm-1.2.0+noroms-0ubuntu2.12.10.3, Debian), Copyright (c) 2003-2008 Fabrice Bellard
[11:09] <jcfischer> root@h0:~# virsh --version
[11:09] <jcfischer> 0.9.13
[11:10] <jcfischer> (I have inherited the OpenStack/Ceph Cluster and am slowly getting up to speed with it)
[11:14] <odyssey4me> so you're running your vm on quantal
[11:14] <odyssey4me> if you can run some tests on raring (fully patched up) that'd be interesting
[11:15] <odyssey4me> have you tried cache=none (with no ceph client caching) and cache=writeback (with ceph client caching) ?
[11:15] <jcfischer> I'll let the current run complete and then move the VM to a raring host
[11:15] <jcfischer> no - not at all
[11:15] <jcfischer> where is that configured?
[11:16] <odyssey4me> cache=none or cache=writeback is in the virsh config for the vm: virsh edit <vmname>
[11:16] <odyssey4me> rbd client caching is set in the ceph.conf
[11:16] <jcfischer> root@h3:~# kvm --version
[11:16] <jcfischer> W: kvm binary is deprecated, please use qemu-system-x86_64 instead
[11:16] <jcfischer> QEMU emulator version 1.4.0 (Debian 1.4.0+dfsg-1expubuntu4), Copyright (c) 2003-2008 Fabrice Bellard
[11:16] <jcfischer> root@h3:~# virsh --version
[11:16] <jcfischer> 1.0.2
[11:17] <jcfischer> I guess I then have to undefine/define the VMs and reboot them?
[11:18] <odyssey4me> http://ceph.com/docs/next/rbd/libvirt/
[11:19] <odyssey4me> nope, if you virsh edit, change the settings, then you just need to shut the vm down and start it up afterwards
[11:19] <odyssey4me> easiest is to shut it down, virsh edit it, then start it up
[11:20] <odyssey4me> sorry, the right doc that references caching is http://ceph.com/docs/next/rbd/qemu-rbd/
[11:20] <jcfischer> ok - we have cache=none on that vm
[11:22] <odyssey4me> another option is 'io=native' - http://serverfault.com/questions/425607/kvm-guest-io-is-much-slower-than-host-io-is-that-normal
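The cache= and io= settings being discussed live in the VM's libvirt disk definition (what `virsh edit <vmname>` opens). A rough sketch of the relevant stanza, with placeholder pool/image/device names and cephx auth omitted:

```xml
<!-- Hypothetical rbd disk stanza; "rbd/myimage" and "vda" are placeholders,
     and a real cephx setup would also need <auth>/<host> elements. -->
<disk type='network' device='disk'>
  <source protocol='rbd' name='rbd/myimage'/>
  <target dev='vda' bus='virtio'/>
  <driver name='qemu' type='raw' cache='writeback' io='native'/>
</disk>
```

Changing `cache='writeback'` to `cache='none'` here is what disables host-side caching for that disk.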
[11:22] <jcfischer> what do you want first? tests with cache on quantal? tests without cache on raring? with cache on raring?
[11:23] <odyssey4me> so my thinking is that without cache is how openstack deploys the vm anyway, so doing the test with cache is academic
[11:24] <odyssey4me> so do a test on raring without cache please
[11:24] <jcfischer> k
[11:25] * stacker666 (~stacker66@104.pool85-58-195.dynamic.orange.es) has joined #ceph
[11:25] <odyssey4me> I see that you also have paltry read speed vs write speed
[11:26] <jcfischer> indeed
[11:27] <yy-nm> you have too many local OSDs in that server
[11:27] <odyssey4me> yy-nm - do you think so? I have 6 OSD's on the client server
[11:28] <jcfischer> we have 64 OSD (split across 10 servers)
[11:28] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[11:28] * ChanServ sets mode +v andreask
[11:28] <yy-nm> the rbd read rate is easily limited
[11:29] <odyssey4me> ok, I had planned on removing the osd's on my client server and only leaving the mon on it as one of my next tests
[11:31] <odyssey4me> so the thinking is therefore that the contention causing the slow-down is on the vm host? is that contention specifically cpu or ram?
[11:32] <odyssey4me> looks like cpu, if anything
[11:32] <odyssey4me> although my cpu isn't exactly hurting
[11:34] <odyssey4me> yeah, in fact this host is a bit bored
[11:34] <huangjun> cyberduck can't access the radosgw
[11:34] <huangjun> ?
[11:35] <jcfischer> lunchtime - will do the raring tests laster
[11:35] <jcfischer> s/laster/later/
[11:36] <huangjun> give me some tips?
[11:36] <Kdecherf> has anyone had a ceph cluster with the mds frozen in an infinite replay state?
[11:37] <odyssey4me> sorry huangjun, I don't have experience with either radosgw or cyberduck
[11:40] <Kdecherf> or has anyone played with ceph-mds --reset-journal?
[11:40] <huangjun> oh, i think the ceph docs about radosgw are messy, and it's difficult to deploy by following them
[11:49] * BillK (~BillK-OFT@203-59-173-44.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[11:54] * BillK (~BillK-OFT@203-59-173-44.dyn.iinet.net.au) has joined #ceph
[12:03] * Meyer is now known as Meyer^
[12:03] * Meyer^ is now known as Meyer
[12:03] * Meyer is now known as Meyer^
[12:09] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[12:11] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[12:16] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[12:26] * yy-nm (~chatzilla@ Quit (Quit: ChatZilla [Firefox 22.0/20130618035212])
[12:29] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:31] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:37] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[12:41] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:48] <Kdecherf> hm, interesting, I can't rescue a ceph cluster with 2 monitors down (3 total)
[12:53] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[12:59] * infinitytrapdoor (~infinityt@ has joined #ceph
[13:09] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[13:12] <odyssey4me> very interesting - increasing /sys/block/sd?/queue/nr_requests from the default of 128 to 1024 or above is doubling my read speed
[13:12] <odyssey4me> (random read and random write, actually)
[13:12] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[13:15] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Read error: No route to host)
[13:15] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[13:22] * infinitytrapdoor (~infinityt@ has joined #ceph
[13:25] <odyssey4me> scratch that - 4x improvement!
[13:26] <janos> odyssey4me: what is nr_requests ?
[13:26] <janos> my morning brain isn't decoding "nr"
[13:27] * s2r2 (~s2r2@ has joined #ceph
[13:27] <jcfischer> Kdecherf: we had that two days ago and we turned up logging to insane values and found some OSDs that were laggy. Restarting the OSDs solved the mds up:replay state
[13:28] <jcfischer> we had to set debug objecter = 20 and debug filer = 20 to see the offending OSD
[13:28] <Kdecherf> jcfischer: hm thx, I will try
[13:32] <jcfischer> odyssey4me: you did this on all OSD drives?
[13:33] <odyssey4me> jcfischer - yes, all osd's
[13:34] <jcfischer> ahrg - will have to think about how to automate that… did you have to restart the OSD processes?
[13:34] <Kdecherf> jcfischer: your tip was for my failed cluster with down mon?
[13:34] * huangjun (~huangjun@ Quit (Quit: I love my HydraIRC -> http://www.hydrairc.com <-)
[13:34] <jcfischer> for mds stuck in up:replay
[13:35] <odyssey4me> janos - nr_requests is the number of reads/writes allowed to sit in the disk's queue... the more there are, the more the scheduler can reorder them to be more efficient... however, the bigger the number, the larger the likelihood that your iops will drop
[13:35] <Kdecherf> jcfischer: ok, cool
[13:35] <janos> ah ok. good explanation
[13:35] <odyssey4me> jcfischer - nope, no osd restart... this is a kernel value and takes immediate effect
[13:35] <janos> thanks, odyssey4me
[13:35] <Kdecherf> jcfischer: I see that one osd (up according to the mon) doesn't open a session with the mds
[13:35] <Kdecherf> jcfischer: but the restart of the osd doesn't help :(
[13:36] <odyssey4me> queue_depth=1024; for i in {b..g}; do echo ${queue_depth} > /sys/block/sd${i}/queue/nr_requests; done
[13:36] <odyssey4me> np janos :)
[13:36] <jcfischer> nice, thanks!
[13:36] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[13:36] <odyssey4me> I'm trying with the value of 4096 now too :)
[13:37] <jcfischer> I will try that before migrating
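One caveat with the sysfs loop above: the value does not survive a reboot. A udev rule along these lines would persist it (the filename and device match are hypothetical; adjust `sd[b-g]` to your own OSD data disks):

```
# /etc/udev/rules.d/60-nr-requests.rules  (hypothetical path)
# Re-apply the queue depth whenever a matching block device appears.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[b-g]", ATTR{queue/nr_requests}="1024"
```

After dropping the file in place, `udevadm trigger` (or a reboot) applies it.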
[13:37] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[13:38] <mgalkiewicz> can one help me with kvm and rbd cache?
[13:38] <odyssey4me> mgalkiewicz - yeah, several of us are busy performance testing exactly that :)
[13:38] <niklas> How does ceph do dns-resolving?
[13:39] <odyssey4me> nr_requests, by the way, seems to help the random reads/writes... the sequentials aren't really affected
[13:40] <niklas> Or in other words: why does rados put connect via 10.* ip, instead of 192.* ip which is how ping resolves my hostname, and how the hostname is put into /etc/hosts ?
[13:42] <jcfischer> re-running tests on same vm, same physical chost
[13:42] <jcfischer> s/chost/host/
[13:43] <niklas> I have a ceph-independent host, from where I put a file into ceph using "rados put". My OSDs are on a host called "store", which has two NICs each in a different net (10.x.x.x and 192.x.x.x), from my test-host, "store" resolves to 192.x.x.x (hard coded in /etc/hosts), but rados still connects via 10.x.x.x.
[13:43] <niklas> Why?
[13:44] <odyssey4me> niklas - I think it determines the ip when you add it to the cluster and saves that address into the config/map somewhere
[13:45] <niklas> ceph osd dump lists two ips for each osd
[13:45] <niklas> one for the public network(10.x.x.x), and one for the cluster network(192.x.x.x)
[13:45] * _Tass4dar (~tassadar@tassadar.xs4all.nl) Quit (Read error: Operation timed out)
[13:45] <niklas> I want to connect via the cluster network, but it won't let me
[13:46] <niklas> the crushmap does not contain any ips, only hostnames
[13:46] <odyssey4me> niklas - where is your mon running?
[13:46] <niklas> on 10.x.x.x
[13:46] <odyssey4me> and have you set the 'public network' and 'cluster network' ?
[13:47] <niklas> yep, put that into ceph.conf
[13:47] <odyssey4me> ok, so my understanding is that clients will always connect on the public network
[13:47] <niklas> I'm still testing, and normal usecase would be, to connect via public network
[13:47] <niklas> but for special access I need to connect via cluster network
[13:47] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:47] * ChanServ sets mode +v andreask
[13:48] <niklas> Because I have 10G on the cluster network, and only 1G on the public network
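For reference, the split niklas describes is declared roughly like this in ceph.conf (the subnets are illustrative); as noted above, clients still always connect via the public network, so this alone won't route client traffic over the 10G side:

```
[global]
    ; example subnets matching the 10.x (public) / 192.x (cluster) layout
    public network  =
    cluster network =
```

The cluster network is only used for OSD-to-OSD traffic (replication, recovery, heartbeats).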
[13:48] <odyssey4me> niklas - a way to do that might be to have a separate mon with the 'public network' set to your 192.x.x.x address, then to connect the client via that specific mon
[13:49] <odyssey4me> I don't know if that'd work
[13:49] <niklas> Can I have multiple mons on one host?
[13:49] <niklas> and: how would those mons communicate?
[13:51] <odyssey4me> niklas - no idea... as I mentioned, haven't done it - just thought I'd share an idea
[13:51] <mgalkiewicz> odyssey4me: joshd already helped me a little bit but I need more info https://gist.github.com/maciejgalkiewicz/5ee4f695cb5494a1e7c4
[13:52] <Kdecherf> jcfischer: well, it does not work for me :(
[13:52] <mgalkiewicz> I am trying to verify rbd cache settings but I am a bit confused with the results on kvm 1.1.2
[13:53] <jcfischer> did you edit the ceph.conf and restart the mds?
[13:53] <niklas> odyssey4me: thank you, I'll try to figure out a way
[13:54] <odyssey4me> mgalkiewicz - what's confusing you?
[13:54] <Kdecherf> jcfischer: for changing what?
[13:54] <jcfischer> the log levels - to see if you also suffer from a laggy osd
[13:55] <Kdecherf> jcfischer: yep, I found that one osd did not respond to the mds but was healthy for mons
[13:55] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[13:55] <odyssey4me> rbd cache is enabled, but libvirt's config has set cache=none so it won't use rbd cache
[13:55] <mgalkiewicz> odyssey4me: exactly
[13:55] <jcfischer> Kdecherf: no other ideas from me, unfortunately
[13:55] <Kdecherf> jcfischer: I out-ed this osd but the mds is still in an infinite replay state
[13:55] <jcfischer> try to restart the osd
[13:56] <odyssey4me> cache=none is set by nova's code as I recall...
[13:56] <mgalkiewicz> odyssey4me: but config from admin socket shows that rbd is enabled
[13:56] <Kdecherf> jcfischer: already done
[13:56] <mgalkiewicz> it is
[13:56] <mgalkiewicz> what is more I have found that since kvm 1.2 it is possible to provide cache settings like nova does
[13:56] <odyssey4me> mgalkiewicz - sure, so my understanding would be that the rbd library is set to be able to use the cache, but kvm is set not to use it
[13:57] <mgalkiewicz> hmm not sure:)
[13:58] <mgalkiewicz> I don't know how kvm decides whether to use the cache or not. Is ceph.conf sufficient?
[13:58] <mgalkiewicz> I recall that docs state it is not
[13:58] <odyssey4me> mgalkiewicz - kvm doesn't decide, the libvirt config tells it whether to or not
[13:58] <odyssey4me> and your libvirt config is being set by openstack nova
[14:00] <mgalkiewicz> yes, so libvirt says not to use it, but this cache=none option is only honoured since kvm 1.2
[14:00] <odyssey4me> cache=none has been around for ages
[14:01] <odyssey4me> well, at least since kvm v1
[14:01] <odyssey4me> v.0
[14:01] <odyssey4me> v1.0
[14:01] <mgalkiewicz> http://wiki.qemu.org/ChangeLog/1.2#rbd
[14:02] <odyssey4me> ah - but that's just for rbd...
[14:02] <odyssey4me> so perhaps I misunderstood your original question
[14:02] <odyssey4me> are you wondering why cache is not enabled? or are you wondering why cache is enabled in ceph when it's not enabled in libvirt?
[14:02] <Kdecherf> hm, ceph-mds [...] --reset-journal seems to not work here
[14:03] * markbby (~Adium@ has joined #ceph
[14:03] <mgalkiewicz> I am wondering why cache is enabled on 1.1.2 when all the docs claim it should not be, given my settings
[14:04] <mgalkiewicz> I don't know whether to upgrade to 1.5.0, where I am quite sure that the cache works, or to stay with 1.1.2 because it works too (probably, maybe the info from the admin socket is wrong)
[14:05] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[14:05] <mgalkiewicz> http://ceph.com/docs/master/rbd/qemu-rbd/
[14:07] <mgalkiewicz> my original question is: do I have to upgrade kvm in order to have cache set by cache=writeback option which I can easily provide through openstack
[14:08] <odyssey4me> ah
[14:09] <odyssey4me> I don't think so. That said, I'm not certain and have no experience in that regard.
[14:09] <mgalkiewicz> so maybe the other way
[14:10] <matt__> mgalkiewicz, you will want to upgrade to 1.4.2 or later anyway because it includes async IO support
[14:10] * infinitytrapdoor (~infinityt@ has joined #ceph
[14:10] <matt__> versions before 1.4.2 blocked on IO calls and it makes everything run slow/inconsistent
[14:11] <odyssey4me> matt__ - there I would agree... in fact, jcfischer will be trying a performance test on v1.4 later in the day
[14:11] <mgalkiewicz> should I believe in config options received from admin socket?
[14:11] <mgalkiewicz> if cache is set to true does it really mean it is working?
[14:12] <odyssey4me> jcfischer - note that it needs to be 1.4.2, so just check whether the version on raring is up to that version?
[14:12] <jcfischer> the nr_requests change hasn't so far made a big difference - the random rw numbers for 4k are still missing, which means they are really bad
[14:13] <odyssey4me> mgalkiewicz - if it's received from the admin socket I would believe that rbd is ready for it, but since cache=none in the libvirt config I would think that the vm isn't taking advantage of rbd's cache on purpose
[14:13] * fridudad (~oftc-webi@fw-office.allied-internet.ag) has joined #ceph
[14:14] <odyssey4me> jcfischer - 4k record lengths continue to suck
[14:14] <odyssey4me> I did see improvements, but in bigger record lengths
[14:15] <odyssey4me> I'm looking into partition alignment, raid optimisation, etc at the moment.
[14:15] <mgalkiewicz> matt__: does it require additional config or sth?
[14:16] * yanzheng (~zhyan@ has joined #ceph
[14:16] <matt__> mgalkiewicz, it 'just works'. As long as your version of ceph supports it so will qemu
[14:16] <mgalkiewicz> ok I will upgrade anyway
[14:17] <jcfischer> odyssey4me: are we talking about the qemu-kvm package? I only see 1.4.0 in the raring repos
[14:17] <mgalkiewicz> ok thx for help
[14:18] <odyssey4me> yeah, in that case they haven't produced the update yet... that said, it'd be interesting to see if there are performance differences between 1.2 and 1.4
[14:18] <jcfischer> k
[14:23] * zhyan_ (~zhyan@ has joined #ceph
[14:25] <jcfischer> still waiting for the 4k test run to finish
[14:26] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) Quit (Ping timeout: 480 seconds)
[14:27] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[14:28] <jcfischer> ah -they finally finished - about as bad as the 128 nr_requests
[14:29] * yanzheng (~zhyan@ Quit (Ping timeout: 480 seconds)
[14:38] <nhm> jcfischer: what kind of tests are you running?
[14:39] <jcfischer> iozone - see https://docs.google.com/spreadsheet/ccc?key=0AsjockBApInDdC10SWw4Y09HbUpsQWdZQ2RlTlhibEE#gid=2 for details
[14:40] <nhm> oh, that's right. Did you figure out your crazy numbers from yesterday?
[14:40] * joshd (~joshd@2607:f298:a:607:44a7:1bbc:ef54:56cd) Quit (Ping timeout: 480 seconds)
[14:48] <Kdecherf> what does 'rank' mean with --reset-journal and other commands of ceph-mds?
[14:50] * joshd (~joshd@2607:f298:a:607:e4ca:af6f:1bf7:6830) has joined #ceph
[14:55] <jcfischer> nhm: My VM had 8 GB of RAM - I guess the 5G test file was completely in memory… The current VM only has 512 MB RAM, and iozone unmounts the volume after each pass of the test which kills the cache
[14:55] <nhm> jcfischer: good deal
[14:55] <jcfischer> odyssey4me: 8k block size - same (bad) performance
[14:56] <nhm> jcfischer: what kind of tests are giving you bad performance?
[14:56] <jcfischer> random read/writes
[14:56] <nhm> jcfischer: how many iops for how many disks?
[14:57] <jcfischer> mon.0 [INF] pgmap v5835138: 20856 pgs: 20856 active+clean; 2957 GB data, 6448 GB used, 161 TB / 174 TB avail; 14505KB/s rd, 560op/s
[14:57] <jcfischer> 64 disks
[14:57] <nhm> wow
[14:57] <jcfischer> good? bad?
[14:57] <mgalkiewicz> odyssey4me: I have done some tests and it looks like setting rbd_cache in ceph.conf is sufficient to enable it, no matter how libvirt is configured and what kvm version is used (I have kvm with the rbd module compiled against the latest cuttlefish librbd)
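The client-side setting mgalkiewicz is referring to lives in the `[client]` section of ceph.conf. A sketch, using what I believe were the cuttlefish-era defaults for the sizes (treat the exact values as assumptions and verify via the admin socket's `config show`):

```
[client]
    rbd cache = true
    rbd cache size = 33554432          ; 32 MB cache per image
    rbd cache max dirty = 25165824     ; 24 MB; 0 forces write-through
    rbd cache target dirty = 16777216  ; 16 MB; when flushing begins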
[14:58] <nhm> jcfischer: how much concurrency?
[14:58] <jcfischer> concurrency where?
[14:58] <nhm> jcfischer: like, how many IOs is iozone keeping in flight?
[14:59] <jcfischer> good question - here's the command I'm using: sudo ./iozone -a -y 16k -q 512k -s 5g -i 0 -i 1 -i 2 -R -U /mnt/vol -f /mnt/vol/iozone.dat
[14:59] * jcfischer has seen iozone for the first time only yesterday
[14:59] * waxzce (~waxzce@2a01:e34:ee97:c5c0:ccdb:7898:3dc8:8bd0) has joined #ceph
[15:00] <nhm> ok, I don't know IOzone very well. Let me see if I can find any options for it.
[15:01] <nhm> you will almost certainly see better performance with more operations in flight at once due to the way ceph spreads load around.
[15:01] <jcfischer> http://www.iozone.org/docs/IOzone_msword_98.pdf
[15:02] <odyssey4me> jcfischer - ok, so that didn't help much... I saw a note on a website where the person created their xfs filesystem for the osd's with 64k blocks instead of the standard 4k blocks... I wonder if that will make a diff
[15:02] <odyssey4me> mgalkiewicz - aha, good to know - thanks
[15:02] <jcfischer> we are using btrfs (but will move to xfs in our next (production) cluster)
[15:02] <nhm> odyssey4me: what are you seeing?
[15:02] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[15:02] <jcfischer> we've had too many servers die due to btrfs
[15:03] <odyssey4me> nhm - my understanding with the '-a' flag we're using is that iozone pushes as many threads as it can until the fs starts blocking it... so it adapts a little
[15:03] * jeff-YF (~jeffyf@ has joined #ceph
[15:05] <nhm> brb
[15:05] <odyssey4me> jcfischer - yeah, I got excited for no reason... the performance is pretty much the same... I've done all sorts of stuff and nothing's making a difference
[15:06] <odyssey4me> my next step is to remove the client server from the cluster and test with only the mon service running on it
[15:06] <jcfischer> I will wait for the 16k numbers to come in and then try to move the vm to a raring host
[15:08] <odyssey4me> here's something interesting to try out: http://www.spinics.net/lists/ceph-devel/msg07128.html
[15:09] * oliver1 (~oliver@p4FD06BAF.dip0.t-ipconnect.de) has joined #ceph
[15:09] * infinitytrapdoor (~infinityt@ has joined #ceph
[15:14] <nhm> you guys might want to consider also trying fio
[15:20] <odyssey4me> nhm - fair enough, but I've done the same test in a variety of configurations, with and without ceph
[15:21] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Read error: Operation timed out)
[15:21] <odyssey4me> in every setup without ceph, read performance is better than write... this is to be expected... with ceph write performance is better than read
[15:21] <odyssey4me> this is what I'm trying to identify the root cause of
[15:25] <nhm> jcfischer: odyssey4me: if you want, try this: fio --rw=randread --ioengine=sync --direct=1 --numjobs=1 --bs=4k --runtime=60 --size=5G --runtime=60 --name=<path_to_file>
[15:25] <nhm> jcfischer: you can also specify multiple --name values to write to multiple files/volumes at once.
[15:27] <odyssey4me> trying it now
[15:28] <nhm> oops, put runtime in there twice
[15:29] <nhm> jcfischer: that's with iodepth=1. after that, you can try adding iodepth=16 or something and see how it affects throughput.
[15:29] <odyssey4me> result: http://pastebin.com/MaX8c4Lf
[15:32] <nhm> sorry, I'm slightly distracted. Trying to manage a 4 day old puppy too. :)
[15:32] <nhm> well, not 4 day old, 4 days for us
[15:35] <nhm> ok, looking at that first test, with just 1 IO in flight using direct IO you are getting about 561 IOPs.
[15:35] <nhm> I'm curious now if you include --iodepth=16 how much difference you'll see.
[15:35] <odyssey4me> surely fallocate should be used too?
[15:36] * zhyan_ (~zhyan@ Quit (Ping timeout: 480 seconds)
[15:37] <odyssey4me> it appears to be the same performance
[15:37] <odyssey4me> http://pastebin.com/jbKhd9GL
[15:38] <nhm> doh, I had you use the sync engine by mistake
[15:38] <nhm> copy-paste-puppy error
[15:38] <nhm> please switch to --ioengine=libaio
[15:39] <Kdecherf> Is osd/client 0.65 compatible with mon/mds 0.66?
[15:40] <nhm> odyssey4me: btw, regarding fallocate, see: http://manpages.ubuntu.com/manpages/natty/man1/fio.1.html
[15:41] <odyssey4me> http://pastebin.com/Yn7iBAfY
[15:44] <odyssey4me> trying out the four threads test from here: https://www.linux.com/learn/tutorials/442451-inspecting-disk-io-performance-with-fio/
[15:45] <odyssey4me> I quite like the stats coming out of fio
[15:45] <odyssey4me> showing iops, latency, throughput, etc
[15:46] <nhm> odyssey4me: btw, with 5GB we very likely could be seeing some caching effects (especially dentry cache), so take these results with at least a little grain of salt.
[15:46] * amatter (~oftc-webi@ has joined #ceph
[15:47] <nhm> odyssey4me: yes, fio is quite nice.
[15:47] <nhm> very configurable too, though you can shoot yourself in the foot with it if you aren't careful. :)
[15:47] <nhm> (I've done so plenty of times)
[15:47] <odyssey4me> http://pastebin.com/cnmif6py
[15:48] <odyssey4me> my vm only has 2GB RAM
[15:48] <odyssey4me> so generally the OS caching doesn't have a big effect
[15:48] <nhm> odyssey4me: be very careful doing writes before reads if you specify a timelimit on the write.
[15:49] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:49] <odyssey4me> nhm - why?
[15:50] <nhm> odyssey4me: If the file doesn't get written out completely, then reads may start reading unwritten blocks, which will happen nearly instantaneously and make the results look hopelessly amazing. ;)
[15:51] <odyssey4me> lol, fair enough
[15:51] <odyssey4me> so best is to specify a file size with no time limit
[15:51] <nhm> odyssey4me: what I do is pre-write the file with no timelimit, then do the actual tests with a timelimit.
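nhm's "pre-write untimed, then test with a time limit" recipe can also be expressed as a fio job file instead of a long command line. A hypothetical sketch, reusing the /mnt/vol path from the commands above:

```
; prewrite-then-read.fio (hypothetical job file)
[global]
ioengine=libaio
direct=1
bs=4k
size=5g
filename=/mnt/vol/fio.dat

[prewrite]
rw=write
; no runtime limit here, so the whole 5g file is fully written first

[randread]
stonewall        ; wait for the prewrite job to finish before starting
rw=randread
iodepth=16
runtime=60
```

Run with `fio prewrite-then-read.fio`; the `stonewall` option is what guarantees the read phase never touches unwritten blocks.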
[15:52] <jcfischer> starting fio tests now
[15:54] <jcfischer> iodepth=1: http://pastebin.com/7k2a2sFv
[15:55] * BillK (~BillK-OFT@203-59-173-44.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[15:56] <odyssey4me> it's still clear that writing is faster than reading
[15:57] <nhm> odyssey4me: For small random IO or in general?
[15:57] <jcfischer> iodepth=16 http://pastebin.com/yXcHPUVx
[15:57] <odyssey4me> in general
[15:57] <odyssey4me> for sequential and random
[15:59] <nhm> odyssey4me: I'm not sure I'd go so far as that, but definitely for QEMU/KVM reads have been less consistently fast.
[15:59] <odyssey4me> nhm - kvm reads for me have been consistently slower than writes, over many, many tests
[15:59] <nhm> odyssey4me: jcfischer in your case, it looks like iops increased pretty dramatically with a higher io depth.
[15:59] <Kdecherf> can "200 object 200.002a300e should be 4194304, actual is 0" cause a mds to hang in replay state?
[16:00] <odyssey4me> it hasn't mattered if I enable/disable caching, or whether I tweak various things on the host OSes of the OSD's
[16:00] <nhm> odyssey4me: With Kernel RBD I see reads scaling higher in some cases.
[16:00] <odyssey4me> so something is screwy in my system and I'd love to know
[16:01] <odyssey4me> nhm - yes, with kernel RBD I get the expected high performance reads and decent performance writes
[16:01] <odyssey4me> (on bare metal)
[16:01] <nhm> odyssey4me: one question, have you tried concurrent client guests and even client hosts at the same time?
[16:01] <odyssey4me> no I haven't
[16:01] <nhm> odyssey4me: that would be an interesting test
[16:02] <odyssey4me> sure, but I'd like to see an uncontended test working well before going there
[16:02] * yanzheng (~zhyan@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[16:02] <nhm> odyssey4me: I'm wondering if there could be a client side limitation etiher in the RBD code or in the QEMU/KVM stuff.
[16:03] <nhm> jcfischer: how many disks do you have?
[16:03] <jcfischer> 64
[16:03] <jcfischer> in 10 servers
[16:03] <nhm> jcfischer: oh, I asked that already. :) Ok, so 4900 IOPS over 64 disks
[16:04] <jcfischer> I have no feeling if that is a good or bad number
[16:04] <nhm> With a 5GB volume you are probably hitting the same block fairly often so it may not be quite as random as over a very large file.
[16:04] <jcfischer> it's a 30gb volume
[16:05] <jcfischer> but I can make a bigger one :)
[16:05] <odyssey4me> mine's 50GB ;)
[16:06] <nhm> jcfischer: figure each disk can do like 150 IOPs. You are getting something like 76 IOPs per disk, but you are also doing journal writes to the disk. If you have a smart controller, those are being aggregated so it will be some, but not 2x the IOPs overhead.
[16:06] * huangjun (~huangjun@ has joined #ceph
[16:06] <jcfischer> the journals are on a separate ssd
[16:06] <nhm> jcfischer: And you are probably not 100% totally random due to the small file size. In essence: It doesn't look totally wrong. ;)
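nhm's back-of-envelope arithmetic above, written out as a quick sanity check (the 4900 and 64 come straight from the conversation; the 150 IOPS ceiling is nhm's rule of thumb for a spinning disk):

```python
# Sanity check: ~4900 aggregate random-read IOPS over 64 OSD data disks.
aggregate_iops = 4900          # reported by fio across the cluster
disks = 64                     # 10 servers, 64 disks total
per_disk = aggregate_iops / disks

print(f"~{aggregate_iops // disks} IOPS per disk")  # ~76

# A 7200rpm disk manages very roughly 150 random IOPS, so ~76 per disk
# is plausible once object metadata overhead is accounted for.
assert per_disk < 150
```

The takeaway matches nhm's reading: not amazing, but not obviously broken either.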
[16:06] <odyssey4me> jcfischer - how many disks per server? and how many ssd's?
[16:07] <jcfischer> between 4 and 9 disks, 1 ssd with a partition for each journal
[16:07] <jcfischer> I'll see your 50GB volume and raise to 100
[16:07] <odyssey4me> hmm, you could be limiting performance by putting the journal for too many osd's on a single ssd
[16:08] <nhm> jcfischer: ok, that makes it easier. So basically you are seeing ok, but not amazing IOPS which is right about what I'd expect given the overhead we have to deal with object metadata and such.
[16:09] <jcfischer> and we have el-cheapo hardware
[16:09] <nhm> jcfischer: that will definitely hurt large sequential write performance (unless the SSD is super fast), but may not hurt small random write performance.
[16:09] <nhm> btw, I was silly before. journals will only matter for writes and you were doing reads.
[16:09] <jcfischer> running fio with 4 jobs in 4 datafiles of 5GB each on a 100GB volume now
[16:09] <nhm> I blame the puppy again. :P
[16:09] <jcfischer> 'ts ok
[16:10] <huangjun> i want to build the radosgw, but it failed, anyone build it on centos?
[16:10] <jcfischer> I'm learning a bunch
[16:10] * KindTwo (KindOne@h189.48.186.173.dynamic.ip.windstream.net) has joined #ceph
[16:10] <odyssey4me> jcfischer - share your command please?
[16:10] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[16:10] <jcfischer> fio --rw=randread --ioengine=libaio --direct=1 --numjobs=4 --bs=4k --runtime=60 --size=5G --runtime=60 --iodepth=16 --name=/mnt/vol100/fio1.dat --name=/mnt/vol100/fio2.dat --name=/mnt/vol100/fio3.dat --name=/mnt/vol100/fio4.dat
[16:11] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:11] <odyssey4me> how do I check the active settings for an osd, including the defaults that aren't set in the conf file?
[16:11] * KindTwo is now known as KindOne
[16:11] <jcfischer> oh damn - forgot to actually mount the 100gb volume :)
[16:11] <nhm> huangjun: we've got packages if you are ok with doing pre-built: http://ceph.com/docs/master/install/rpm/
[16:11] <nhm> odyssey4me: you can get that via the admin socket
[16:12] <nhm> odyssey4me: ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show
[16:12] <nhm> odyssey4me: for osd.0 for example
[16:12] <huangjun> thanks,nhm
[16:13] <huangjun> another question: if i'm not using the auth module, should i set it in radosgw?
[16:14] <odyssey4me> nhm, thanks :)
[16:16] * yanzheng (~zhyan@jfdmzpr05-ext.jf.intel.com) Quit (Remote host closed the connection)
[16:16] <nhm> huangjun: don't remember. I think if you set it in global you shouldn't need to do anything special.
[16:19] <huangjun> ok
[16:20] * infinitytrapdoor (~infinityt@ has joined #ceph
[16:20] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[16:23] <Kdecherf> How can I tell the mds to skip a metadata object on startup?
[16:26] * deadsimple (~infinityt@ has joined #ceph
[16:26] <jcfischer> nhm: now I'm seeing this on ceph -w: 2013-07-18 16:25:52.413084 mon.0 [INF] pgmap v5837470: 20856 pgs: 20856 active+clean; 3041 GB data, 6622 GB used, 161 TB / 174 TB avail; 55423KB/s rd, 13855op/s
[16:27] <jcfischer> http://pastebin.com/bkTSYzc7
[16:27] <nhm> jcfischer: that's more than your drives can do. Looks like some of the data must be coming from cache.
[16:28] <jcfischer> let me unmount
[16:28] <nhm> jcfischer: It may be due to using numjobs=4 across the 4 files.
[16:28] <jcfischer> damn - too good to be true?
[16:28] <nhm> jcfischer: probably. :)
[16:29] <jcfischer> retrying with numjobs=1
[16:29] <nhm> jcfischer: though not necessarily totally unreasonable. Not all random workloads are totally 100% random.
[16:30] <jcfischer> what about the 13855 op/s from ceph? due to the fact that the volume is bigger and split across more osds?
[16:30] <nhm> jcfischer: IE if it's only random within a small set of blocks, that will behave differently than random over blocks spread across all portions of all disks in the cluster.
[16:31] <jcfischer> so I probably should try to build a really huge volume (20 TB) and have huge files in them and then run the benchmark - but - that's not a day to day scenario...
[16:32] <nhm> jcfischer: yeah, depends how much you care too.
[16:32] <jcfischer> or rather - have many volumes on many vas doing IO and have ceph handle the load
[16:32] <jcfischer> s/vas/vms/
[16:32] <jcfischer> damn autocorrect
[16:32] * Volture (~quassel@office.meganet.ru) Quit (Ping timeout: 480 seconds)
[16:32] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[16:34] <nhm> jcfischer: so it looks like fio is reporting roughly 11.2K IOPS while ceph is reporting 13.9ish, but that might just be due to fluctuations.
[16:34] <jcfischer> I just picked on line from the ceph -w output
[16:35] <Kdecherf> jcfischer: did you already reset a mds journal?
[16:35] <jcfischer> Kdecherf: no
[16:35] <jcfischer> our WD 3TB RED disks have around 45 IOPS wr / 110 IOPS rd
[16:35] <Kdecherf> hm
[16:37] <jcfischer> and one more: http://pastebin.com/cjBZ5pZa
[16:37] <jcfischer> and where do you read the 11.2K IOPS in the fio output?
[16:37] <huangjun> can i use the ceph-deploy tool to create the osd which use raid0 as the osd data dir?
[16:38] <nhm> jcfischer: I just computed it from the aggregate throughput.
[16:38] <nhm> jcfischer: you could probably add up the iops from each of the jobs too.
[16:39] <Kdecherf> has anyone ever reset an mds journal directly by removing objects with rados?
[16:39] <jcfischer> ah ok - so with numjobs=1 on 4 files, I still get around 12k IOPS reading
[16:41] <nhm> jcfischer: interesting. That'd be like 175 IOPs per disk which for very fast disks could possibly be in the realm of possibility, but more likely there is some kind of caching or slightly sequential behavior going on.
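For reference, a guess at the shape of the fio job under discussion. The real job files are in jcfischer's pastebins above; every value here (paths, sizes, ioengine) is inferred from the conversation, so treat it as illustrative only:

```ini
; Hypothetical reconstruction: 4k random reads against 4 files with
; numjobs=1, as discussed above. Directory and sizes are assumptions.
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
size=4g
numjobs=1
directory=/mnt/rbdtest

[file1]
[file2]
[file3]
[file4]
```

Each `[fileN]` section inherits the globals and gets its own file, so this issues 4k random reads across four files concurrently; raising `numjobs` multiplies the workers per file, which is one reason the earlier numjobs=4 run could overstate what the drives can do.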
[16:41] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[16:43] <nhm> jcfischer: I'd love that to be the real throughput though. ;)
[16:43] <jcfischer> just did a random write test: ca 1600 IOPS if I interpret the results correctly: http://pastebin.com/cUGuLtBX
[16:45] * Pauline (~middelink@2001:838:3c1:1:be5f:f4ff:fe58:e04) has joined #ceph
[16:45] <nhm> jcfischer: looks like a little over 1800 IOPS.
[16:46] <nhm> jcfischer: journals on the same disks as data?
[16:46] <jcfischer> no - on SSD
[16:46] <nhm> jcfischer: well that's annoying.
[16:46] <jcfischer> indeed
[16:47] <nhm> jcfischer: replication?
[16:47] <jcfischer> 2
[16:48] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[16:48] <nhm> ok, so every single write has to hit the primary OSD journal, then hit the secondary OSD journal before it can be acknowledged.
[16:48] <odyssey4me> so if I read that correctly, you're getting an aggregate of 46MB/s on random read and 7MB/s on random write - is that correct?
[16:49] <jcfischer> I'm still trying to make heads or tails from the fio output (I feel like a monkey poking at things atm)
[16:49] <nhm> In reality you are getting a little under 60 IOPS per disk for writes. From what I've seen around 75-90 is more typical.
[16:50] <jcfischer> that sounds in line with what the disks actually are capable of
[16:50] <nhm> You may want to see if any specific OSD has operations backing up on it.
[16:50] <jcfischer> how so?
[16:50] <nhm> jcfischer: have you used the admin socket at all?
[16:50] <jcfischer> 2 days ago :)
[16:51] <jcfischer> (for the first time)
[16:51] <nhm> jcfischer: I do: sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok dump_ops_in_flight | grep num_ops
[16:52] <nhm> in a loop on every OSD server for all of the osd admin sockets.
[16:52] <jcfischer> oh my
[16:53] <nhm> There's a tool called pdsh that will let you run the same command on a bunch of servers at once. I then watch the output to see if any particular OSD is backing up.
[16:53] <jcfischer> I have a command as well to do that, but it seems I need to provide osd numbers for each command, no?
[16:53] <nhm> jcfischer: oh, do you have RBD cache enabled?
[16:53] <jcfischer> where?
[16:54] <jcfischer> ceph.conf?
[16:54] <nhm> jcfischer: ceph.conf and in the guest XML?
[16:54] <odyssey4me> for osdsock in `ls -1 /run/ceph/ceph-osd*`; do ceph --admin-daemon ${osdsock} dump_ops_in_flight | grep num_ops; done
[16:54] <odyssey4me> or perhaps watch 'for osdsock in `ls -1 /run/ceph/ceph-osd*`; do ceph --admin-daemon ${osdsock} dump_ops_in_flight | grep num_ops; done'
[16:54] <nhm> odyssey4me: yeah, something like that. find works well too.
[16:55] <nhm> odyssey4me: with the exec flag
[16:55] <odyssey4me> nhm - I had caching enabled, made no difference for me
[16:56] <nhm> "find /var/run/ceph/ceph-osd* -exec sudo ceph --admin-daemon {} dump_ops_in_flight \; | grep num_ops" might do it.
[16:56] <nhm> odyssey4me: for writes?
[16:56] <jcfischer> no in fliight ops atm
[16:56] <odyssey4me> nhm - yeah, read & write stats are much the same
[16:56] <nhm> jcfischer: test going? :)
[16:57] <jcfischer> no - still trying to get this to work across all servers atm
[16:57] <jcfischer> too many shells
[16:57] <nhm> odyssey4me: hrm, got it set in the XML too?
[16:57] <jcfischer> anyway - cache: I guess this needs to be in the ceph.conf of every server
[16:58] <nhm> jcfischer: yes
[16:58] <jcfischer> and of course it isn't set
[16:58] <odyssey4me> nhm - yeah, had it set in ceph.conf in the [client] section, and in the vm's xml... also made sure to use writeback in the vm's xml file and shut it down and started it again to ensure that it stuck
[16:58] <odyssey4me> cache only affects writes, doesn't it?
[16:58] <jcfischer> rbd_cache?
[16:59] <nhm> odyssey4me: hrm, ok. I'm lazy and stick everything in [global] so I never remember where stuff is supposed to go. ;)
[16:59] <loicd> leseb and I are looking for ways to experiment with "m->get_flags() & CEPH_OSD_FLAG_LOCALIZE_READS" but not sure if it's actually possible ;-)
[17:00] <nhm> odyssey4me: rbd cache shouldn't change read performance...
[17:00] <odyssey4me> jcfischer - yes rbd_cache
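For reference, a minimal sketch of the two places being discussed (ceph.conf on the hypervisor and the guest's libvirt XML). The option names are the standard RBD cache settings; the sizes shown are illustrative defaults, not odyssey4me's actual values:

```ini
; ceph.conf on each hypervisor (illustrative sizes)
[client]
rbd cache = true
rbd cache size = 33554432        ; bytes; 32 MB is the default
rbd cache max dirty = 25165824
```

The guest's libvirt XML also needs the matching disk attribute, e.g. `<driver name='qemu' type='raw' cache='writeback'/>`, and as odyssey4me notes the VM must be fully shut down and started again for the change to stick.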
[17:00] <loicd> leseb: get_flags tries a bitfield and there probably are other bits set by options found in src/common/config_opts.h ... although I did not look ;-)
[17:00] <joelio> interesting, I only seem to have one .asok on a host with 6 OSD's - all active
[17:00] <jcfischer> odyssey4me: any other params to set?
[17:00] <joelio> identical nodes have all OSDs
[17:01] <odyssey4me> I've done this as a try too, but it doesn't seem to be making a difference: http://pastebin.com/R4QHxxms
[17:01] <nhm> joelio: that's strange. :)
[17:01] <jcfischer> and then? Restart the osd or mons?
[17:01] <odyssey4me> restart everything
[17:02] * deadsimple (~infinityt@ Quit (Ping timeout: 480 seconds)
[17:02] <joelio> nhm: yea, I saw your command and thought it looked interesting.. ran in parallel-ssh across hosts and a couple of them don't have full .asok lists?!??
[17:02] <odyssey4me> joelio - you may have them in /var/run/ceph or in /run/ceph
[17:02] <joelio> odyssey4me: but surely they should be in one consistent place
[17:02] <nhm> odyssey4me: Ah, that looks like Jim's settings. I haven't seen dramatic differences with his settings.
[17:02] <odyssey4me> nhm - yeah
[17:02] <odyssey4me> joelio - yes, they should be consistent
[17:03] <nhm> odyssey4me: here's the testing I did: http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/
[17:03] <nhm> odyssey4me: just rados bench
[17:04] * jeff-YF (~jeffyf@ Quit (Quit: jeff-YF)
[17:04] <odyssey4me> yup, I didn't find that rados bench gives believable results
[17:05] <odyssey4me> besides, I need to compare results to non ceph filesystems
[17:05] <nhm> odyssey4me: You have to take it for what it is. How fast objects of various sizes can be read or written.
[17:05] * joelio finds issues in ceph log - even though 'HEALTH OK' 2013-07-18 15:59:37.819051 7f40085de7c0 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-34: (16) Device or resource busy
[17:05] <nhm> odyssey4me: IE it's comparable to itself, but not to RBD or CephFS directly.
[17:05] * waxzce_ (~waxzce@office.clever-cloud.com) has joined #ceph
[17:06] * diegows (~diegows@ has joined #ceph
[17:07] * alfredod_ (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[17:08] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[17:09] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[17:09] <odyssey4me> why is it that cephfs is not considered ok for production? is it only due to the lack of redundancy for the metadata server?
[17:09] * jksM (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[17:09] * jks (~jks@3e6b5724.rev.stofanet.dk) Quit (Read error: Connection reset by peer)
[17:10] <nhm> odyssey4me: There are some known bugs and afaik we aren't doing any kind of real QA on it.
[17:11] <odyssey4me> ok, that crosses it off my list
[17:12] <nhm> odyssey4me: Some brave folks are using it anyway, but we can't really guarantee that it will work well and afaik we aren't supporting it as an Inktank product yet.
[17:12] <nhm> odyssey4me: Having said that, bug our business folks if you want us to work on it. ;)
[17:12] * waxzce (~waxzce@2a01:e34:ee97:c5c0:ccdb:7898:3dc8:8bd0) Quit (Ping timeout: 480 seconds)
[17:13] <odyssey4me> I'm going to try using the kernel module to mount an image, then test using a standard preallocated raw disk via qemu to see what happens. Before this I need to remove the OSD's from the client machine though as it croaks when I do that.
[17:13] <odyssey4me> The kernel module gave me awesome performance when I tested on the host.
[17:13] <jcfischer> we've had some *interesting* issues with cephfs as the shared storage for our VM images
[17:13] * joelio finds stop ceph-all not working
[17:14] <jcfischer> I have set up a watch on all osds on the servers, ran the test and see 0 ops in flight
[17:14] <odyssey4me> ultimately that's what I'm wanting - shared storage for qemu vm images.
[17:14] <jcfischer> our images got corrupted… we rolled back rather fast
[17:14] <odyssey4me> ouch
[17:14] <odyssey4me> jcfischer - what're you using currently?
[17:15] <jcfischer> a big ssd in each server
[17:15] <jcfischer> and moving towards grizzly where we plan to use "boot from volume"
[17:15] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[17:16] <odyssey4me> jcfischer - ok, so you're wanting to use rbd as the instance storage directly?
[17:16] <odyssey4me> (same as I was thinking)
[17:16] <jcfischer> yes
[17:16] <jcfischer> now running the randwrite test and seeing some iops in flight (1-5)
[17:17] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Read error: Operation timed out)
[17:17] <Kdecherf> gregaf: are you there?
[17:17] <nhm> jcfischer: One thing you might want to do as well, is try the test on two VMs concurrently.
[17:18] <jcfischer> let me spin one up
[17:18] <jcfischer> why stop with one? I'll have 4 running :)
[17:18] <nhm> jcfischer: sure. :)
[17:20] <jcfischer> plugging things together
[17:20] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[17:22] <odyssey4me> ok, it would appear that having the cache enabled improved the random write somewhat
[17:22] <odyssey4me> it's about half the write speed without it
[17:22] <nhm> odyssey4me: I've seen that too.
[17:22] <nhm> odyssey4me: It's a bit strange that it helps so much.
[17:23] <jcfischer> which cache is that?
[17:24] <odyssey4me> my random write for 4k records is 13MB/s with the cache on and 6MB/s with it off... neither are great speeds in the first place
[17:26] * nwat (~oftc-webi@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[17:27] <jcfischer> oh fantastic - none of my new instances actually boot…
[17:29] <jcfischer> I guess it is time to call it a day
[17:29] <nhm> odyssey4me: how many disks?
[17:29] <nhm> odyssey4me: and how much replication?
[17:30] <nhm> oh, and journal config?
[17:32] * jjgalvez (~jjgalvez@ip72-193-215-88.lv.lv.cox.net) has joined #ceph
[17:32] <odyssey4me> 6 disks per host, separate journal partition in the first few sectors of each disk, 3 hosts, 2 x bonded 10gb networking with separate vlans on the nics for the public and cluster networks
[17:33] <nhm> ok, so 18 disks total with journals on the same disks? what replication?
[17:33] <odyssey4me> 2 copies
[17:33] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Bye!)
[17:33] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) Quit (Quit: Bye)
[17:34] <odyssey4me> I'm removing the client server from the cluster (but will leave the mon server running on it) as we speak. I want to try the kernel module as an option.
[17:35] <nhm> odyssey4me: Ok, so in reality you are getting ~13MB/s with 2x replication = 26MB/s. That's 6,500 IOPS, divided by 18 drives is 361 IOPS per disk. That doesn't even count journal writes...
[17:35] <odyssey4me> Then I have two ssd's (they arrived today) to plug into the servers - one in each of the two cluster servers.
[17:36] <odyssey4me> hmm, that sounds high
[17:37] <nhm> odyssey4me: yes, though I suspect if RBD cache is helping that much it means some of the IOs are being aggregated.
[17:37] * oliver1 (~oliver@p4FD06BAF.dip0.t-ipconnect.de) has left #ceph
[17:38] * sagelap (~sage@2600:1012:b026:e6bd:6c8c:28d1:a574:a5cc) has joined #ceph
[17:38] <jcfischer> thanks for the interesting time guys - more fun tomorrow
[17:38] * jeff-YF (~jeffyf@ has joined #ceph
[17:38] <odyssey4me> cheers jcfischer
[17:38] <nhm> odyssey4me: If we look at the results without RBD cache it's more like 166 IOPS
[17:38] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[17:38] <nhm> jcfischer: good night!
[17:38] <jcfischer> evening: I see a glass of red wine in my near future - night comes later
[17:39] * nwat (~oftc-webi@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[17:39] <nhm> jcfischer: Too bad it's morning for me. Maybe a bit of spiked coffee. ;)
[17:39] <jcfischer> your evening will come too...
[17:40] * sagelap1 (~sage@ Quit (Read error: Operation timed out)
[17:40] <Pauline> 6MB (without cache) * 2x = 12MB/s => /4k (iops) = 3072 iops per 18 drives => 170 iops per drive....
[17:40] <Pauline> oh. i see you did the math already ^^
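nhm's and Pauline's arithmetic can be condensed into a quick back-of-envelope check (numbers taken from the conversation; journals on the same spindles would roughly double the per-disk load again, which this deliberately ignores, so it's a floor):

```shell
# Per-disk write IOPS implied by the client-side numbers above (1000-based MB).
client_mb=13   # MB/s seen by the client with RBD cache on
repl=2         # replica count
block_kb=4     # record size
disks=18       # spinners in the cluster
total_iops=$(( client_mb * repl * 1000 / block_kb ))
per_disk=$(( total_iops / disks ))
echo "${total_iops} cluster write IOPS, ~${per_disk} per disk"
# -> 6500 cluster write IOPS, ~361 per disk
```

Swapping in the no-cache figure (6 MB/s) gives Pauline's ~170 IOPS per disk, which is why the cached number looks suspiciously high for 7200 rpm drives.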
[17:41] * sagelap1 (~sage@2600:1012:b007:abc2:6474:c6a5:6ffa:636d) has joined #ceph
[17:43] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[17:43] <odyssey4me> hmm, so what's the best way to utilise the ssd's?
[17:44] <odyssey4me> if I put one in each of two nodes then map all the journals to partitions on them won't it restrict performance?
[17:44] <odyssey4me> ie 5 spinning disks per ssd
[17:44] <Pauline> 5? not 6?
[17:44] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) has joined #ceph
[17:45] <odyssey4me> I'll have to remove one of the SAS disks to put the SSD into the chassis.
[17:45] <Pauline> ah. pity.
[17:45] <nhm> odyssey4me: how fast are the SSDs?
[17:45] <odyssey4me> although, I suppose I could remove one of the OS disks... that's not a bad idea :)
[17:45] <odyssey4me> it'll degrade the OS mirror, but that's hardly a problem
[17:46] <nhm> And do your controllers do 6Gb/s lanes?
[17:46] <odyssey4me> 256GB 6Gb/s SATA SSD
[17:46] * sagelap (~sage@2600:1012:b026:e6bd:6c8c:28d1:a574:a5cc) Quit (Ping timeout: 480 seconds)
[17:46] <odyssey4me> yup
[17:46] <nhm> odyssey4me: what model?
[17:46] <odyssey4me> model of?
[17:46] <nhm> SSD that is
[17:47] <odyssey4me> RealSSD P400e 2.5
[17:47] <joelio> odyssey4me: 6Gbit is just the SATA b/w - doubt the drive is that fast
[17:48] <odyssey4me> looks like this is the spec: Sequential 64K Read: 350 MB/s
[17:48] <odyssey4me> Sequential 64K Write: 140 MB/s
[17:48] <odyssey4me> Random 4K Read: 50K IOPs
[17:48] <odyssey4me> Random 4K Write: 7.5K IOPs
[17:49] <nhm> odyssey4me: write speed is pretty slow. :/
[17:49] <Kdecherf> has anyone ever reset an mds journal?
[17:49] <nhm> odyssey4me: so if you put all 5 journals on that disk you'd be hampering your sequential write throughput.
[17:50] <nhm> odyssey4me: though probably helping small random write throughput.
[17:50] <odyssey4me> well, I thought that perhaps a workaround for that could be to configure the 6 disks into two stripes
[17:50] <huangjun> hello, i built radosgw on centos6.4 and found something interesting, almost 50 radosgw processes for one radosgw instance
[17:50] <odyssey4me> (or 3 for that matter)
[17:50] <nhm> odyssey4me: one thing you could try is using the SSD for flashcache or bcache.
[17:51] <nhm> odyssey4me: doesn't really matter, if you are using a single SSD for journaling you'll never get faster than 140MB/s writes no matter how you split things up.
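In other words, a sketch using the vendor spec quoted above (real sustained numbers may differ, as the next messages show):

```shell
# One journal SSD caps aggregate writes at its own sequential write speed,
# however the journal partitions are laid out on it.
ssd_seq_write=140   # MB/s, vendor sequential write spec for the P400e
journals=5          # spinners journaling to the one SSD
echo "ceiling ${ssd_seq_write} MB/s total, ~$(( ssd_seq_write / journals )) MB/s per OSD"
# -> ceiling 140 MB/s total, ~28 MB/s per OSD
```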
[17:52] <nhm> odyssey4me: hrm, looking at some IO meter results. It looks like for large block sequential writes it can do more like 260MB/s, so at least a bit better.
[17:53] <Pauline> and its prob worse than 140MB/s, as I never saw a prod sheet without some lies on it. Though in this case it might be the read speed :P
[17:53] <nhm> http://www.storagereview.com/micron_realssd_p400e_enterprise_ssd_review
[17:53] <Pauline> but won't journal writes be mostly small?
[17:54] <nhm> Pauline: journal writes will be the same size as the IO coming in.
[17:54] <nhm> hrm, probably plus a bit of header data.
[17:55] <odyssey4me> lol, this ssd doesn't look like it's worth much
[17:55] <odyssey4me> haha
[17:55] <nhm> low write endurance on that drive is a bit of a concern too.
[17:56] <nhm> yeah, looks like it's pretty read oriented.
[17:56] <nhm> Not a particularly great drive for journal use.
[17:57] * sagelap1 (~sage@2600:1012:b007:abc2:6474:c6a5:6ffa:636d) Quit (Read error: Connection reset by peer)
[17:59] <huangjun> FastCGI: failed to connect to server "/var/www/s3gw.fcgi": connect() failed ? anyone see this before?
[18:01] * gregphone (~gregphone@66-87-131-133.pools.spcsdns.net) has joined #ceph
[18:03] <gregphone> joao: pretty sure that skew interval you're fretting over was changed for teuthology by sage a few weeks ago
[18:03] <gregphone> were you checking the assembled run yaml?
[18:04] <joao> gregphone, would that be the 'config.yaml' on the run's archive dir?
[18:04] <joao> if so, yes
[18:05] <joao> both *.yaml files on the dir actually
[18:05] <joao> none of them set that option
[18:05] <joao> maybe somewhere else?
[18:05] <gregphone> hmm
[18:05] * wijet (~wijet@toya.hederanetworks.net) has joined #ceph
[18:06] <gregphone> that's certainly where it should be set, but maybe he used a runtime switch in the task code, I maybe I'm making it up
[18:06] <gregphone> grep through teuthology for it? ;)
[18:06] <gregphone> *or maybe I'm
[18:07] <wijet> I have an issue with one of the mons after server restart, it complains about the keyring, despite the fact that the keyring is there and is correct
[18:07] <wijet> 2013-07-18 18:06:57.378673 7f5ee1106700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption
[18:08] <wijet> 2013-07-18 18:06:57.378681 7f5ee1106700 0 -- >> pipe(0x16c0a00 sd=22 :58878 s=1 pgs=13116 cs=1 l=0).failed verifying authorize reply
[18:08] <joao> gregphone, doh
[18:08] <wijet> I tried to remove the mon and re-add it, but same error message
[18:08] <joao> joao@tardis:~/inktank/src/teuthology$ git grep -nFI 'mon clock'
[18:08] <joao> teuthology/ceph.conf.template:8: mon clock drift allowed = .250
[18:08] <joao> and now I recall seeing this patch going in
[18:08] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[18:08] <joao> gg
[18:09] <gregphone> ah, so part of the non-configurable base template
[18:09] <gregphone> :)
[18:09] <joao> yeah
[18:10] * smiley (~smiley@c-71-200-71-128.hsd1.md.comcast.net) has joined #ceph
[18:10] * joao goes shaving as a punishment for overlooking this
[18:11] * sagelap (~sage@2600:1012:b007:abc2:215:ffff:fe36:60) has joined #ceph
[18:12] <gregphone> wijet: I don't remember exactly what causes that, but I'd check the keyring again, plus all the other auth settings
[18:13] <wijet> the keyring in mon directory is exactly the same, other two mons don't have issues with it
[18:13] <gregphone> same config file on each of them?
[18:14] * sagelap (~sage@2600:1012:b007:abc2:215:ffff:fe36:60) Quit ()
[18:15] * aliguori (~anthony@ has joined #ceph
[18:15] <gregphone> presumably that incoming connection is another monitor, best I can recall (not that strong right now, tbh) the only way that'll happen here is if the keys are wrong or the authentication mechanism they're configured for is different
[18:15] <wijet> yes
[18:16] <wijet> I'm running ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
[18:16] <wijet> When i removed the broken mon and tried to re-add it, it said that the mon already exists, which is weird
[18:16] * huangjun (~huangjun@ Quit (Quit: HydraIRC -> http://www.hydrairc.com <- It'll be on slashdot one day...)
[18:17] <gregphone> sounds like you find actually remove it, then ;)
[18:17] <gregphone> *didn't
[18:19] <wijet> that's weird i got
[18:19] <wijet> ceph mon remove c
[18:19] <wijet> 2013-07-18 17:32:03.271166 7f4593a4e700 0 -- :/17351 >> pipe(0x14983e0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
[18:19] <wijet> removed mon.c at, there are now 2 monitors
[18:19] <wijet> is the part about removing mon from ceph.conf http://ceph.com/docs/next/rados/operations/add-or-rm-mons/
[18:19] <wijet> still valid?
[18:20] * sagelap (~sage@2607:f298:a:607:6c8c:28d1:a574:a5cc) has joined #ceph
[18:20] <wijet> I don't have there separate sections for each mon as it used to be
[18:22] <gregphone> hmm, should be good
[18:22] <gregphone> gotta run now though, maybe joao or somebody can help you out
[18:22] * gregphone (~gregphone@66-87-131-133.pools.spcsdns.net) Quit (Quit: Rooms • iPhone IRC Client • http://www.roomsapp.mobi)
[18:23] <sagewk> yehudasa: on 5663: you need to update ceph-common
[18:26] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[18:26] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[18:26] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[18:27] * hybrid5121 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[18:31] <joao> sagelap, wip-5652 on teuthology?
[18:32] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[18:34] <Kdecherf> hm, I have a segfault on ceph-mds 0.66
[18:34] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Ping timeout: 480 seconds)
[18:43] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[18:44] * stacker666 (~stacker66@104.pool85-58-195.dynamic.orange.es) Quit (Ping timeout: 480 seconds)
[18:51] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[18:53] * _Tassadar (~tassadar@tassadar.xs4all.nl) has joined #ceph
[19:00] * bergerx_ (~bekir@ Quit (Quit: Leaving.)
[19:07] * jeff-YF (~jeffyf@ Quit (Quit: jeff-YF)
[19:08] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[19:08] <infernix> so i have some python scripts talking rbd
[19:08] <infernix> whenever i have an issue with osds going down, that stuff just stalls indefinitely
[19:09] <infernix> they remain stuck even after the ceph cluster state goes back to HEALTH_OK
[19:09] <infernix> is that correct behaviour for bobtail?
[19:09] <infernix> or am i just a bad programmer :>
[19:10] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[19:13] <gregaf> it ought to be handled transparently by the rbd client without your interface needing to worry about it, maybe your scripts are getting stuck if rbd takes too long to get back to them or something?
[19:17] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:17] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit ()
[19:22] * jmlowe (~Adium@2601:d:a800:97:3c26:9e9d:695a:cea0) has joined #ceph
[19:22] <infernix> perhaps
[19:22] <infernix> i need to look into it
[19:22] <jmlowe> everything going ok with 0.61.5 upgrades?
[19:24] * markbby1 (~Adium@ has joined #ceph
[19:24] * markbby (~Adium@ Quit (Remote host closed the connection)
[19:25] <paravoid> sjusthm: hey
[19:25] * jeff-YF (~jeffyf@ has joined #ceph
[19:25] <sjusthm> hi
[19:25] <sjusthm> next may or may not be an improvement
[19:25] <paravoid> so, I haven't tested extensively due to #5655
[19:26] <paravoid> but the indications so far is that the peering bug is gone
[19:26] <sjusthm> as in, peering is uniformly fast now?
[19:26] <paravoid> 5 seconds
[19:26] <sjusthm> we may be able to improve that further, but you have a lot of pgs
[19:26] <paravoid> but I have tested a tiny amount of OSDs under no i/o load
[19:27] <paravoid> as apparently the new ones can't recover from the old ones :)
[19:28] <sjusthm> the fix for 5655 should be in next as well, if it's what I think it is
[19:28] <sjusthm> sagewk: the stuck peering bug fix went into next, right?
[19:28] <paravoid> what is it that you think it is?
[19:28] <sjusthm> dropped messages
[19:28] <sagewk> yeah
[19:29] <sjusthm> paravoid: but you are still seeing it on current next?
[19:30] <sjusthm> ok, caught up on the bug, wasn't that then
[19:30] * rturk-away is now known as rturk
[19:30] <paravoid> nod
[19:30] <paravoid> I got to go now, sorry
[19:30] <paravoid> see you folks later
[19:31] <paravoid> I just wanted to report some good news first :)
[19:31] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:31] * ChanServ sets mode +v andreask
[19:31] <sjusthm> k
[19:31] <paravoid> thanks!
[19:32] * fridudad (~oftc-webi@fw-office.allied-internet.ag) Quit (Remote host closed the connection)
[19:33] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Read error: Connection reset by peer)
[19:34] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[19:38] * xmltok (~xmltok@pool101.bizrate.com) Quit (Remote host closed the connection)
[19:38] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[19:38] * s2r2 (~s2r2@g227158028.adsl.alicedsl.de) has joined #ceph
[19:38] * infinitytrapdoor (~infinityt@ip-109-41-169-89.web.vodafone.de) has joined #ceph
[19:39] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[19:39] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[19:39] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[19:40] * oddomatik is now known as bandrus
[19:40] <sstan> has anyone tried to make a machine boot from cephfs?
[19:41] <sjusthm> sagewk: 4255b5c2fb54ae40c53284b3ab700fdfc7e61748 needs to get backported to cuttlefish
[19:41] <sjusthm> cuttlefish I think is reporting ~0 as its feature set
[19:41] * xmltok_ (~xmltok@pool101.bizrate.com) has joined #ceph
[19:41] <sagewk> hmm i thought i did, checking
[19:41] * xmltok_ (~xmltok@pool101.bizrate.com) Quit (Remote host closed the connection)
[19:41] * xmltok_ (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[19:42] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Read error: Connection reset by peer)
[19:42] <sagewk> yeah c2b38291e706c9d1d4d337cee3a944f34bf66525
[19:42] <sjusthm> ah, forgot to reset
[19:43] <sjusthm> ok, that explains 5655 though, 0.65 had that bug
[19:50] * infinitytrapdoor (~infinityt@ip-109-41-169-89.web.vodafone.de) Quit (Ping timeout: 480 seconds)
[19:51] * sjusthm1 (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[19:51] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Read error: Connection reset by peer)
[19:52] <grepory> it would be nice if the startup script allowed you to set options via /etc/sysconfig/ceph or something like that.
[19:52] <grepory> or /etc/default/ceph, one of those directories.
[19:55] <grepory> because otherwise, i'm not entirely sure how i would past --hostname to the init script
[19:55] <grepory> s/past/pass
[19:55] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[19:56] * bandrus1 (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[19:56] <sagewk> sjusthm: one new failure: http://tracker.ceph.com/issues/5667
[19:57] <sjusthm1> ok
[19:57] * bandrus1 (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[19:57] * bandrus1 (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[20:02] * xmltok_ (~xmltok@relay.els4.ticketmaster.com) Quit (Ping timeout: 480 seconds)
[20:05] * stacker666 (~stacker66@ has joined #ceph
[20:05] * bandrus1 (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[20:13] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[20:17] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[20:21] * s2r2 (~s2r2@g227158028.adsl.alicedsl.de) Quit (Quit: s2r2)
[20:24] <grepory> are there ntp recommendations documented somewhere? i tried searching, but wasn't able to find anything particularly useful.
[20:25] <gregaf> keep it tight on your monitors, and adjust their warning and (if necessary, but it really shouldn't be) lease intervals accordingly
[20:26] <grepory> sweet. thanks. i didn't even realize those were tunable.
[20:26] <grepory> i'll re-read the config reference
[20:31] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[20:32] * Svedrin (svedrin@ketos.funzt-halt.net) Quit (Ping timeout: 480 seconds)
[20:35] * stacker666 (~stacker66@ Quit (Ping timeout: 480 seconds)
[20:40] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) Quit (Ping timeout: 480 seconds)
[20:41] * Svedrin (svedrin@ketos.funzt-halt.net) has joined #ceph
[20:41] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) has joined #ceph
[20:45] * jmlowe (~Adium@2601:d:a800:97:3c26:9e9d:695a:cea0) Quit (Quit: Leaving.)
[20:46] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[20:50] * Svedrin (svedrin@ketos.funzt-halt.net) Quit (Quit: Starved on the internet)
[20:53] * markbby1 (~Adium@ Quit (Quit: Leaving.)
[20:57] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[20:57] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[20:59] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[21:03] * markbby (~Adium@ has joined #ceph
[21:04] * jmlowe (~Adium@ has joined #ceph
[21:08] * mozg (~andrei@host109-151-35-94.range109-151.btcentralplus.com) has joined #ceph
[21:08] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[21:11] * sh_t (~sht@ has joined #ceph
[21:12] * s2r2 (~s2r2@g227158028.adsl.alicedsl.de) has joined #ceph
[21:13] <sh_t> hi guys. i'm new to ceph and was wondering if someone could point me to documentation describing what the maximum usable disk storage is on a ceph cluster. to me it seems like ceph treads all the osd's as one big raid0 and i'm wondering how this works with regards to redundancy if all of that space is made available for data only and not redundancy. for example I have 3 nodes (3 osds) at 20GB
[21:13] <sh_t> each and ceph reports a total of 60GB storage when I run ceph -s. thanks for the help!
[21:13] <sh_t> treats*
[21:14] <sjusthm1> sh_t: different pools can have different replication levels, so we just report raw space
[21:14] <sjusthm1> you'd want to divide that by your replication level to get actual usable space
[21:15] * odyssey4me (~odyssey4m@ Quit (Ping timeout: 480 seconds)
[21:19] <sh_t> sjusthm1: when you say replication level do you mean the "osd pool default size" setting? with a default of 2, I'd take my total raw space and divide by 2, yes?
[21:20] <sjusthm1> yes
[21:20] * alfredod_ is now known as alfredodeza
[21:24] <sh_t> thanks :)
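sjusthm1's rule of thumb in numbers, using sh_t's example cluster (3 OSDs x 20 GB, default pool size 2):

```shell
raw_gb=60      # what ceph -s reports: the sum of all raw OSD space
pool_size=2    # "osd pool default size", i.e. the replica count
echo "usable ~ $(( raw_gb / pool_size )) GB"
# -> usable ~ 30 GB
```

Since each pool can have a different replication level, ceph only reports raw space; the division has to be done per pool.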
[21:28] <mozg> hello guys
[21:29] <mozg> does anyone have suggestions on how to improve performance of small block reads and writes?
[21:29] * waxzce_ (~waxzce@office.clever-cloud.com) Quit (Remote host closed the connection)
[21:29] <mozg> i am using kvm + rbd and seeing very poor performance with small block sizes
[21:30] <mozg> like 4k reads/writes
[21:32] <sstan> replica size influences the speed ..
[21:32] * Svedrin (svedrin@ketos.funzt-halt.net) has joined #ceph
[21:32] <sstan> mozg: the speed is a function of the quality of the journal and the number of OSDs
[21:33] <mozg> sstan: i do have an ssd device for journals
[21:33] <sjustlaptop> sagewk, davidz: wip-5154 review?
[21:33] <joshd> if the writes are sequential, rbd caching will help significantly
[21:33] <mozg> but i get very poor performance
[21:33] <sstan> mozg: do you use rbd caching?
[21:34] <gregaf> what kind of performance on what kind of cluster under what kind of test?
[21:34] <mozg> sstan: yes I do
[21:34] <mozg> 256mb size
[21:34] <mozg> gregaf: i've got 16 osds in total
[21:34] <mozg> 2 osd servers
[21:34] <mozg> 3 mons
[21:34] <sstan> how many OSD daemons ?
[21:35] <sagewk> sjustlaptop: looking
[21:35] <mozg> when I run a 4k read test I am seeing around 1-1.5k iops with ceph
[21:36] <mozg> whereas I was getting around 10 times as much using nfs
[21:36] <mozg> 16 osds in total
[21:36] * scuttlemonkey changes topic to 'Latest stable (v0.61.5 "Cuttlefish) -- http://ceph.com/get || Ceph Developer Summit: Emperor - http://goo.gl/yy2Jh || Ceph Day NYC 01AUG2013 - http://goo.gl/TMIrZ'
[21:36] * fireD (~fireD@93-142-204-59.adsl.net.t-com.hr) Quit (Quit: Leaving.)
[21:37] <gregaf> what's generating the read test?
[21:37] * scuttlemonkey changes topic to 'Latest stable (v0.61.5 "Cuttlefish") -- http://ceph.com/get || Ceph Developer Summit: Emperor - http://goo.gl/yy2Jh || Ceph Day NYC 01AUG2013 - http://goo.gl/TMIrZ'
[21:38] <sstan> what do you mean? you were running kvm on nfs for that 10x comparison ?
[21:38] <gregaf> (also: you're seeing between 62 and 90 random IOPS per drive, sounds like, which is a bit low but not outrageous. If you were getting 625-900 via NFS it was cheating one way or another)
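gregaf's per-drive estimate is simple division (the 16 drives and 1-1.5k aggregate IOPS figures come from the conversation above):

```python
# Back-of-the-envelope: aggregate client IOPS spread evenly over the drives.
osds = 16
low, high = 1000, 1500            # mozg's observed 4k read IOPS range
per_drive = (low / osds, high / osds)
print(per_drive)  # roughly the "between 62 and 90" per-drive range quoted
```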
[21:39] <sagewk> sjustlaptop: looks right to me.
[21:39] * wijet (~wijet@toya.hederanetworks.net) Quit (Quit: wijet)
[21:39] <sstan> if your data is re-read often , you could consider using some ssd cache solution
[21:41] <mozg> i've used many benchmarks
[21:41] <mozg> like iozone, fio and simple dd
[21:41] <mozg> get pretty similar results
[21:42] <mozg> gregaf: the trouble is i get similar results (i think and i need to verify this) even when the ceph osd servers are sending data from cache and not from disks
[21:43] * markbby (~Adium@ Quit (Quit: Leaving.)
[21:43] <mozg> so if i run the same benchmark several times
[21:43] <mozg> i get pretty similar results
[21:43] <mozg> and i can see from the osd server side there is no activity on the disks
[21:43] <mozg> server side, there is not much load from what i could see
[21:43] <mozg> i need to verify this with more testing
[21:44] <mozg> but it was my first impression
[21:44] * fireD (~fireD@93-142-204-59.adsl.net.t-com.hr) has joined #ceph
[21:44] <mozg> so, i thought that there must be some performance tuning for small block size reads/writes
[21:44] <mozg> coz performance is really good for large blocks
[21:44] <mozg> like 1M+
[21:44] <mozg> it outperforms nfs with concurrency
[21:44] <gregaf> mozg: hrm, that is odd, what versions of everything involved?
[21:44] <mozg> however, small block sizes suffer a lot with my tests
[21:45] <mozg> ubuntu 12.04
[21:45] <mozg> 0.61.4 ceph
[21:45] <mozg> kernel from backports
[21:45] <mozg> 3.8 i believe
[21:45] <gregaf> I seem to recall some op throughput issues under older versions that joshd or nhm would be able to discuss more coherently
[21:45] <sstan> maybe it can be improved by a few percent by tuning IP packet caching, network stacks, etc.
[21:46] <sstan> I think there have been some improvements in that regard lately with newer kernels ?
[21:46] * SvenPHX (~scarter@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[21:46] <joshd> which version of qemu? 1.4.2 and 1.5 have an async flush that will help
[21:47] <sjustlaptop> rbd caching doesn't help reads much, right?
[21:49] <sstan> mozg : if you can, maybe build a pool over some ramdisks, and assign the journal to the ramdisk too. Then you could find the bottlenecks easily
[21:49] <mozg> i've got 1.5.0
[21:50] <mozg> sstan: how would the ramdisks help me ?
[21:51] <sstan> by dedicating a section of your ram for CEPH, one is sure that the bottleneck isn't related to the disk/journal. If you get similar results, you'd at least know that it isn't because of the hard drives
[21:54] <sstan> since small blocks travel through the network, I think that every amount of overhead has a huge impact on the rate of transfer. ex: 1ms delay for each 4k block is very bad compared to 1ms for 4M.
[21:56] <sstan> idk if it's possible, I'm not an expert, but it would be great to make sure that ceph's network transactions are immediately processed (not buffered, or in a wait state)
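sstan's overhead point a few lines up can be made concrete (a sketch; the 1 ms per-operation figure is the hypothetical from that message, not a measured value):

```python
# Per-op overhead dominates small transfers: a fixed 1 ms cost per request
# caps a 4 KiB-per-op stream near 4 MB/s, while 4 MiB ops amortize the same
# overhead a thousand times better.
def capped_mb_per_s(block_bytes, per_op_overhead_s):
    return block_bytes / per_op_overhead_s / 1e6

print(capped_mb_per_s(4 * 1024, 0.001))     # ~4.1 MB/s
print(capped_mb_per_s(4 * 1024**2, 0.001))  # ~4194 MB/s
```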
[21:56] <nhm> sjustlaptop: it shouldn't...
[21:56] <sjustlaptop> k
[21:56] <sjustlaptop> if nfs was doing read ahead, that would explain a lot
[21:56] <nhm> sjustlaptop: at least I don't think.
[21:57] <nhm> sjustlaptop: better ask josh, I'm starting to second guess myself.
[21:58] <nhm> sjustlaptop: tell me what he says when you find out. ;)
[21:58] <sjustlaptop> I'm just saying that the reads might be sequential without read ahead
[22:00] <joshd> the cache won't help reads much unless they're reads of data recently read or written
[22:01] * markbby (~Adium@ has joined #ceph
[22:03] <nhm> joshd: it's only like 32MB by default right?
[22:03] <sstan> I wonder if there are performance gains to be found in the way the ceph client's process is scheduled
[22:03] <sstan> (for small blocks )
[22:03] <joshd> nhm: yeah
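For reference, the cache settings being discussed live in the client section of ceph.conf; a hedged example (the 256 MiB value is mozg's setting from earlier, the 32 MiB default is joshd's figure):

```
[client]
    rbd cache = true
    rbd cache size = 268435456   ; 256 MiB; the default is 32 MiB
```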
[22:03] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:04] <nhm> joshd: is there any way it could hurt in a case where you have lots of VMs doing lots of concurrent reads?
[22:04] * markbby (~Adium@ Quit ()
[22:06] <joshd> nhm: not any obvious way - it's not doing anything much different for reads compared to not using the cache
[22:07] * markbby (~Adium@ has joined #ceph
[22:07] * markbby1 (~Adium@ has joined #ceph
[22:07] * markbby (~Adium@ Quit (Remote host closed the connection)
[22:08] * jmlowe (~Adium@ Quit (Quit: Leaving.)
[22:19] <SvenPHX> Have a working cluster, now trying to get a rados gateway working but I'm having some problems. Anyone got a few mins to help me?
[22:25] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[22:25] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Read error: Connection reset by peer)
[22:27] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) has left #ceph
[22:30] <sstan> what's the best value for net.ipv4.tcp_congestion_control ?
[22:30] <sstan> does it affect Ceph's performance for transfers of small blocks?
[22:32] <SvenPHX> I've found some performance improvements by enabling jumbo frames for the cluster network and the members of the cluster
[22:32] <sstan> SvenPHX: did you find something that improves small writes?
[22:39] * markbby (~Adium@ has joined #ceph
[22:40] * markbby1 (~Adium@ Quit (Remote host closed the connection)
[22:41] * markbby1 (~Adium@ has joined #ceph
[22:41] * fireD (~fireD@93-142-204-59.adsl.net.t-com.hr) Quit (Quit: leaving)
[22:41] * fireD (~fireD@93-142-204-59.adsl.net.t-com.hr) has joined #ceph
[22:43] * markbby (~Adium@ Quit (Remote host closed the connection)
[22:44] <sstan> I'm curious, what do you get with cat /proc/sys/net/ipv4/tcp_low_latency
[22:45] * fireD (~fireD@93-142-204-59.adsl.net.t-com.hr) Quit ()
[22:47] * Tamil (~Adium@cpe-108-184-66-69.socal.res.rr.com) has joined #ceph
[22:48] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has left #ceph
[22:50] * sagelap (~sage@2607:f298:a:607:6c8c:28d1:a574:a5cc) Quit (Quit: Leaving.)
[22:53] * fireD (~fireD@93-142-204-59.adsl.net.t-com.hr) has joined #ceph
[22:55] <sstan> mozg are you there?
[22:55] <mozg> yeah, i am here
[22:55] <sstan> could you please check what's in /proc/sys/net/ipv4/tcp_low_latency
[22:56] <sstan> and I might have a suggestion to try: decreasing the niceness of the processes
[22:56] <sstan> if any of the two suggestions make a difference, we might be onto something : )
[22:57] <sstan> I'd try that right now if my ceph cluster was operational
[22:59] * aliguori (~anthony@ Quit (Remote host closed the connection)
[23:05] <mozg> one sec
[23:06] <mozg> cat /proc/sys/net/ipv4/tcp_low_latency
[23:06] <mozg> 1
[23:06] <mozg> that's for both of my osd servers
[23:07] <sstan> Ceph disables TCP buffering by default.. good stuff. http://ceph.com/docs/master/rados/configuration/network-config-ref/
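The network-config-ref page sstan links documents this behavior as a messenger option; a hedged snippet showing it explicitly (it is already the default, shown only for illustration):

```
[global]
    ms tcp nodelay = true   ; default; disables Nagle buffering on Ceph sockets
```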
[23:07] <sstan> I wasn't sure.
[23:08] <mozg> sstan: regarding the niceness
[23:08] <mozg> what value should i set
[23:08] <mozg> like -19?
[23:08] * jmlowe (~Adium@c-50-165-167-63.hsd1.in.comcast.net) has joined #ceph
[23:08] <sstan> I don't know really, the developers have probably thought of that already, but maybe it's worth trying to see ...
[23:08] <mozg> i have nice of 0
[23:10] <sstan> oh it's possible to set priority on existing processes with renice
[23:11] <sstan> maybe decrease by 1 a couple of times to see if it makes a difference : /
[23:11] * markbby1 (~Adium@ Quit (Quit: Leaving.)
[23:11] * amatter (~oftc-webi@ Quit (Remote host closed the connection)
[23:13] * s2r2 (~s2r2@g227158028.adsl.alicedsl.de) Quit (Quit: s2r2)
[23:16] * jmlowe (~Adium@c-50-165-167-63.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[23:17] <sagewk> gregaf: want to take another look at wip-osd-leaks?
[23:18] * jmlowe (~Adium@c-50-165-167-63.hsd1.in.comcast.net) has joined #ceph
[23:18] <SvenPHX> sstan: I found it improved the backfill and whatnot
[23:22] <SvenPHX> between that and setting 'max open files' to 65536 significantly improved the backfill and convergence
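The fd-limit change SvenPHX mentions maps to a ceph.conf option; a hedged sketch (the 65536 value is his, not a recommendation from upstream):

```
[global]
    max open files = 65536   ; file-descriptor limit the daemons set at startup
```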
[23:25] <gregaf> sagewk: did you see my jabber about the MDS use of mark_down?
[23:25] <gregaf> and what auditing did you do on them?
[23:26] * jmlowe (~Adium@c-50-165-167-63.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[23:26] <gregaf> also, the docs should mention that the reset notification is conditional on there being a connection attached
[23:26] <gregaf> and I'm kind of uncomfortable with the parallel addr and connection-based interfaces having different callbacks like that :/
[23:29] * PerlStalker (~PerlStalk@ Quit (Read error: Operation timed out)
[23:30] * DarkAce-Z is now known as DarkAceZ
[23:33] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[23:33] <sjusthm1> sagewk: wip-peering-perf-counter
[23:34] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[23:35] <sagewk> sjusthm1: couple trivial osd patches in wip-osd-leaks, want to look?
[23:35] <sjusthm1> yup
[23:35] <sjusthm1> looking
[23:35] <mozg> sstan: I will give it a go and get back to you guys. Will try to find time this weekend to work on it
[23:36] <sstan> cool good luck!
[23:36] <sstan> I hope small writes/reads will improve with newer versions of ceph
[23:37] <sagewk> gregaf: yeah, i audited the mds and osd ones
[23:37] <sagewk> gregaf: i think addr vs con is an unfortunate consequence of the addr interface not yet being deprecated
[23:37] <sagewk> getting there, but not done yet
[23:37] <gregaf> hmm, maybe
[23:38] <sagewk> i prefer not doing it in mark_down.. i don't really want to do it there *just* to be consistent wrt the interface we're trying to phase out
[23:38] <gregaf> do we have any reasonable expectation that the functions the mds calls are going to remain idempotent, then? we should certainly make a note of it
[23:38] <sagewk> i can add a comment
[23:39] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Remote host closed the connection)
[23:39] <sjusthm1> sagewk: looks reasonable
[23:40] * lyncos (~chatzilla@ has joined #ceph
[23:40] <sagewk> i think we're actually a small patch away from eliminating it entirely from the osd, but that needs to wait until after dumpling.
[23:42] <gregaf> oh, exciting
[23:42] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[23:43] * LeaChim (~LeaChim@97e00a48.skybroadband.com) has joined #ceph
[23:43] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[23:45] <paravoid> hey folks
[23:45] <paravoid> so, 0.66 -> 0.67 isn't going to be possible?
[23:45] <lyncos> pgmap v81287: 2000 pgs: 1999 active+clean, 1 active+clean+inconsistent; what to do in that case ? and what caused this ?
[23:45] <paravoid> I mean, I can upgrade 0.66 -> git 4255b5 -> git next or 0.67
[23:45] <paravoid> but I'm guessing others will hit this too
[23:45] <paravoid> (re: #5655)
[23:47] * waxzce (~waxzce@glo44-2-82-225-224-38.fbx.proxad.net) has joined #ceph
[23:47] <sjusthm1> paravoid: one sec
[23:48] <sjusthm1> looks like 66 is fine
[23:48] <sjusthm1> 0.65 is the problem
[23:49] <paravoid> I have 0.66 everywhere though
[23:49] <paravoid> (and that isn't what greg said on the bug report)
[23:49] <sjusthm1> and it is still causing a problem?
[23:49] <paravoid> yes
[23:49] <paravoid> 0.65 is long gone
[23:49] <paravoid> well, let me recheck that via asok to be sure
[23:50] <gregaf> sjusthm1: that patch isn't in v0.66 that I could tell (just ran git log v0.66 and looked for the commit msg)
[23:50] <paravoid> {"version":"0.66"}
[23:50] <sjusthm1> oh, so it isn't
[23:51] <sjusthm1> ok, what greg said then, you just need to upgrade past 4255b5
[23:51] <sjusthm1> or just to next
[23:51] <sagewk> lyncos: look in your /var/log/ceph/ceph.log on the mons to see what the scrub error was
[23:51] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[23:51] <sagewk> gregaf: mark_down audit turned up a couple cleanups, see wip-refs
[23:51] <paravoid> just to next is impossible, they can't recover
[23:52] <paravoid> but yeah, my point is that if you have a user base tracking 0.66 they'll have a hard time upgrading to 0.67
[23:52] <sjusthm1> k, it'll work if you stop at 4255b5 first
[23:52] <paravoid> I can do git, I'm worrying about the rest of your users :)
[23:53] <gregaf> I'm confused, how is upgrading impossible?
[23:53] <sjusthm1> the problem has actually existed since prior to cuttlefish
[23:53] <sjusthm1> gregaf: it's possible, it will just be disruptive
[23:53] <sjusthm1> you won't be able to do a rolling upgrade
[23:53] <gregaf> it's not *clean* in that the rolling upgrades are likely to not communicate if they hit this
[23:53] <gregaf> yeah
[23:54] <paravoid> how else would you do the upgrade?
[23:54] <paravoid> just restart everything at the same time?
[23:54] <lyncos> sagewk thanks im looking
[23:54] <sjusthm1> 61.5 fixes this for cuttlefish, so 61.5-> 67 will work
[23:55] <sjusthm1> we could do a 66.1 I suppose if there is a lot of demand
[23:55] <sagewk> sjusthm1, gregaf: is it 32 or 64 bits of ~0?
[23:55] <sjusthm1> 64, annoyingly
[23:56] <gregaf> sagewk: can you clean it up? there's lots of debug patches without signoffs and wip-refs and wip-osd-leaks have diverged on their commits
[23:56] <sagewk> just did, wip-osd-leaks
[23:56] <gregaf> paravoid: there aren't any users tracking the dev releases for whom disruptive upgrades are a problem, at least that we're aware of
[23:57] <sagewk> sjusthm1: can we modify new code so that if it sees features ~0 it knows that *really* means (1<<32)-1 or whatever?
[23:57] <gregaf> which is good given the upgrade guarantees we adhere to ;)
[23:57] <lyncos> sagewk : http://pastebin.com/x9aM2USr humm I see nothing wrong...
[23:57] <sagewk> or even reserve 1<<63 to mean exactly that
[23:57] <sjusthm1> sagewk: I suppose, but that seems cumbersome
[23:57] <gregaf> sagewk: would need to be more careful than that since old enough code wouldn't have all lower-32-bit features
[23:57] <sjusthm1> all future checks will need to do that
[23:58] <sagewk> yeah. in the msgr we would silently translate ~0 to whatever it is.. whatever the real feature set is for the version that started doing this (when the 31st feature was introduced, right?)
[23:58] <sagewk> and never use the final bit so we don't confuse ourselves 2 years down the road
[23:59] <sjusthm1> sagewk: ok, one sec
[23:59] <gregaf> oh, was it only the last feature bit that did that, because of the sign extension?
[23:59] <lyncos> sagewk: Something special I should look for ?
[23:59] <sjusthm1> gregaf: correct
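The sign-extension effect gregaf and sjusthm1 are describing can be modeled in a few lines (a Python sketch of a 32-bit signed feature word widened to 64 bits; the constant name is illustrative, not a real Ceph identifier):

```python
# Model of the bug discussed above: a features word built as a 32-bit signed
# int gets sign-extended when widened to 64 bits, so setting the 31st (top)
# bit smears ones across the whole upper half -- the peer then appears to
# advertise ~0 (all features) instead of its real feature set.
def widen_signed_32_to_64(word):
    word &= 0xFFFFFFFF
    if word & 0x80000000:            # top bit set -> sign extension
        word |= 0xFFFFFFFF00000000
    return word

FEATURE_BIT_31 = 1 << 31             # illustrative constant, not Ceph's
print(hex(widen_signed_32_to_64(FEATURE_BIT_31)))  # 0xffffffff80000000
print(hex(widen_signed_32_to_64(1 << 30)))         # 0x40000000, unaffected
```

Only the last feature bit triggers the smear, which matches gregaf's question and sjusthm1's "correct" above.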

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.