#ceph IRC Log


IRC Log for 2012-12-05

Timestamps are in GMT/BST.

[0:04] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[0:06] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[0:10] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[0:12] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:12] * loicd (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) has joined #ceph
[0:16] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[0:16] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[0:17] * mooperd (~andrew@dslb-178-012-145-248.pools.arcor-ip.net) has joined #ceph
[0:23] * mooperd (~andrew@dslb-178-012-145-248.pools.arcor-ip.net) Quit (Quit: mooperd)
[0:24] * tnt (~tnt@207.171-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:33] * Psi-Jack_ (~psi-jack@psi-jack.user.oftc.net) has joined #ceph
[0:41] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) Quit (Ping timeout: 480 seconds)
[0:41] * Psi-Jack_ is now known as Psi-jack
[0:51] * jtang1 (~jtang@ Quit (Quit: Leaving.)
[0:54] * maxiz (~pfliu@ has joined #ceph
[0:55] <glowell> v0.55 update. I've rebuilt the repo and pushed it out to the debian-testing directory. The problem was that we built a package with the same name on amd64 and i386. This caused the reprepro command that we use to build repository to terminate before it had completed the index.
[0:56] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:57] * cdblack (c0373727@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[0:57] * loicd (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) Quit (Quit: Leaving.)
[0:57] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:02] <via> yehudasa: ok
[1:05] <benpol> glowell: thanks, I just finished updating my little test cluster from 54 to 55, went relatively well (although it seemed that perhaps the two versions speak incompatible versions of cephx)
[1:06] <benpol> none of my cluster nodes wanted to talk to each other until they were both at 0.55.
[1:07] <glowell> I know there were some changes in cephx, but I don't know the details. I thought it was just a change to the default settings.
[1:08] <via> benpol: fwiw, i have the same problem
[1:08] <benpol> fortunately my test cluster is tiny (three nodes)
[1:11] <benpol> via: what's up with the init script error? I saw you were talking about that earlier.
[1:11] <via> a few things
[1:11] <via> the first is just a faulty assignment in bash
[1:12] <via> fs_type = something -- there shouldn't be a space after fs_type for proper variable assignment
[1:12] <via> but also the btrfs logic calls btrfsctl with, at least for me, invalid options
[1:13] * benpol nods
[1:13] <via> which right now means whenever i start ceph it displays the btrfsctl help listing
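The assignment bug via describes is easy to reproduce; this is a hypothetical sketch (the variable name `fs_type` comes from the conversation, the rest is illustrative):

```shell
# With spaces around '=', bash parses this as the command `fs_type`
# with the arguments '=' and 'btrfs' -- not as a variable assignment.
fs_type = btrfs 2>/dev/null || echo "parsed as a command, not an assignment"

# Correct form: no spaces around '='.
fs_type=btrfs
echo "fs_type is $fs_type"
```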
[1:13] * vata (~vata@ Quit (Ping timeout: 480 seconds)
[1:13] <benpol> via: what version of btrfs-tools do you have?
[1:13] <via> probably ancient
[1:13] <via> Btrfs Btrfs v0.19
[1:14] <via> btrfs-progs-0.19-12.el6.x86_64
[1:16] <via> on that note, if anyone cares/is lazy, i am maintaining ceph rpms as well as rpms for libvirt/qemu with rbd support here: http://mirror.ece.vt.edu/pub/ceph-dev/
[1:17] <via> for el6
[1:20] * densone (~densone@74-92-51-22-NewEngland.hfc.comcastbusiness.net) Quit (Quit: densone)
[1:21] <wer> shouldn't mkcephfs have created a keyring file in /etc/ceph/keyring ? Because I don't have one, so I can't add my radosgw key... and it is driving me batty. All my OSd's are up but this isn't right and I don't know how to fix it.
[1:23] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: ircolle)
[1:24] * Cube (~Cube@64-60-46-82.static-ip.telepacific.net) has joined #ceph
[1:25] <wer> Or perhaps I didn't create the client.admin correctly.
[1:26] <wer> sudo ceph auth get-or-create client.admin mds 'allow' osd 'allow *' mon 'allow *' > /etc/ceph/keyring never exits. It just sits there forever......
[1:27] * vata (~vata@ has joined #ceph
[1:31] * Cube (~Cube@64-60-46-82.static-ip.telepacific.net) Quit (Quit: Leaving.)
[1:32] * jlogan (~Thunderbi@2600:c00:3010:1:7db5:bf2b:27d1:c794) Quit (Ping timeout: 480 seconds)
[1:36] <yehudasa> wer: is your mon up?
[1:37] * nwatkins (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[1:37] <wer> yehudasa: yeah it was..... I started blowing everything away trying to figure out what step I missed.
[1:37] <yehudasa> well, the ceph utility hanging means it can't connect to the mon
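A bounded version of that check avoids the indefinite hang wer is seeing; a sketch, assuming coreutils `timeout` is available (the 10-second bound is arbitrary):

```shell
# Bound the mon connection attempt instead of letting `ceph` block forever.
timeout 10 ceph -s || echo "could not reach a monitor within 10s"

# On the mon host, confirm the daemon is actually running.
pgrep ceph-mon >/dev/null || echo "no ceph-mon process found"
```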
[1:39] <wer> well, hmm. I didn't have an admin key from what I could tell. I thought that mkcephfs made all that but maybe it is only if I give it the -k flag. The order of events for the key stuff is confounding me this time round. And I thought I understood what I needed. :(
[1:40] <via> actually... how *do* you get unscrewed if you lose the client.admin key? rather how do you add new keys to the keyring if you can't run ceph auth?
[1:41] <sjust> gregaf1, slang, nwatkins: greg indicated that you might be interested in the first patch of wip_recovery
[1:42] <sjust> it adds support in the admin socket for creating a stream for consumption by the ceph tool
[1:42] <sjust> useful for streaming internal events for testing
[1:43] * densone (~densone@74-92-51-22-NewEngland.hfc.comcastbusiness.net) has joined #ceph
[1:45] <wer> right via. That is a concern of mine too. I need a better understanding of the keys in ceph; the docs seem logical but then don't work for me. I feel the need to have established along each step if things were done correctly, and I am not confident where step one begins. I thought it was when mkcephfs was run, but in my case it generated a 0 length file.... but built the osds. So I am trying again because I am going crazy realizing I missed a step after I am already screwed.
[1:47] <wer> and since they are all mount points you can really make a mess :P
[1:49] <wer> I have written scripts to create all my filesystems... and translate them into the osd numbers... which feels not good, and then wrote scripts to mount everything... now clean and remount and remove whatever is in mon, because bootstrapping again from a mistake is a pain. But I don't know what my mistake was :)
[1:56] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:56] * plut0 (~cory@pool-96-236-43-69.albyny.fios.verizon.net) has joined #ceph
[1:56] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:56] <plut0> hi
[1:57] * maxiz (~pfliu@ Quit (Quit: Ex-Chat)
[1:59] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[2:01] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[2:08] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:12] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[2:12] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[2:15] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[2:15] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[2:15] * Leseb_ is now known as Leseb
[2:17] * densone (~densone@74-92-51-22-NewEngland.hfc.comcastbusiness.net) Quit (Quit: densone)
[2:18] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[2:26] * calebamiles1 (~caleb@c-98-197-128-251.hsd1.tx.comcast.net) Quit (Read error: Connection reset by peer)
[2:27] * calebamiles (~caleb@c-98-197-128-251.hsd1.tx.comcast.net) has joined #ceph
[2:29] <via> yehudasa: http://pastebin.com/xtpFvQNt
[2:36] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[2:41] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Quit: Ex-Chat)
[3:00] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) has joined #ceph
[3:01] <infernix> hello gentlemen
[3:01] <infernix> what is the highest read throughput people have successfully achieved with ceph?
[3:02] <infernix> I am looking at getting around 40 to 100TB of spindle storage to achieve sustained sequential read speeds north of 2.5GByte/s
[3:03] <infernix> we have a 40gbit QDR infiniband network in place and I'm wondering if I can achieve this by building a ceph cluster instead of building a traditional raid system
[3:04] <infernix> it'll serve as a backup store for block devices in the range of 20GB to 10TB. write speeds are fine at 400MByte/s but more is always better obviously
[3:04] <nhm> infernix: I don't have aggregate numbers, but I'm actually working on a argonaut vs bobtail comparison right now. For large sequential reads I can do about 800GB/s reads with 8 disks in 1 server with the client on localhost and no replication.
[3:05] <infernix> you mean MB/sec, right?
[3:05] <nhm> infernix: doh, yes. It's been a long day. :)
[3:06] <nhm> infernix: that's with btrfs which tends to perform well on fresh hard drives but may degrade over time (it used to) pretty quickly.
[3:06] <infernix> I will need every object replicated twice, so I suppose i need at least 4 OSD servers of about 10TB each
[3:06] <nhm> infernix: ext4 and xfs performance was more in the 600-700MB/s range.
[3:07] <infernix> what i'm not entirely clear about is how data security is guaranteed through replication. with 4 boxes of 10TB, do I get an effective 20TB sized cluster?
[3:08] <nhm> infernix: yes, with 2x replication you'll store twice as much data. You'll also carve off a relatively small portion for journals. Probably ~1-10GB per OSD.
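The capacity arithmetic nhm describes can be sketched numerically: 2x replication halves raw capacity, and the ~10GB-per-OSD journal carve-out is comparatively tiny (the OSDs-per-box count here is an illustrative assumption, not from the conversation):

```shell
# Usable capacity = (raw - journals) / replicas.
awk 'BEGIN {
  boxes = 4; tb_per_box = 10; replicas = 2
  osds_per_box = 10; journal_gb = 10            # ~10GB journal per OSD (assumed)
  raw = boxes * tb_per_box
  journals = boxes * osds_per_box * journal_gb / 1000
  printf "%.1f TB usable\n", (raw - journals) / replicas
}'
```

With 4 boxes of 10TB at 2x replication this lands just under the ~20TB effective figure discussed above.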
[3:10] * maxiz (~pfliu@ has joined #ceph
[3:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[3:13] <nhm> infernix: I should add that this is a supermicro server with drives directly connected to the cards. Systems that use expanders/expander backplanes may not perform as well.
[3:13] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[3:14] <infernix> hm. so the highest density is like 24x3.5" in 3U. 2 disks in a raid 1 for OS, then 22 3TB disks for OSD = 66TB per 3U
[3:14] <nhm> infernix: hrm, I'm not familiar with any 3U chassis that can do 24x3.5" drives. Which one are you looking at?
[3:14] <infernix> 24 port backplane on x4 SAS is about 1gbit per disk, e.g. ~110mbyte/s per disk
[3:15] <nhm> infernix: yes, assuming you get the full 6gbit through the expander.
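The per-disk figure above comes from dividing the x4 uplink across the drives; a sketch of that arithmetic (the ~100MB/s data rate assumes 8b/10b line encoding, which is why it lands near the rough ~110MByte/s estimate):

```shell
# 24 drives behind a x4 6Gb/s SAS uplink share 24 Gbit/s of line rate.
awk 'BEGIN {
  lanes = 4; gbps_per_lane = 6; disks = 24
  per_disk = lanes * gbps_per_lane / disks      # Gbit/s of line rate per disk
  mb_s = per_disk * 1000 / 8 * 0.8              # 8b/10b: 80% of line rate is data
  printf "%.0f Gbit/s line rate, ~%.0f MB/s data per disk\n", per_disk, mb_s
}'
```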
[3:16] <infernix> no you're right, it's 4U
[3:16] <nhm> infernix: it's entirely possible that some configurations are just fine. I've just seen a couple of setups that seem to have strange performance impacts so thought I'd mention it.
[3:16] <nhm> Ah, the SC846?
[3:17] <infernix> and at 4U you can also get the 36 disk box
[3:17] <nhm> The test chassis I use is the SC847a. 36 drives in 4U.
[3:17] <infernix> yeah that's the one i'm eyeing for the raid setup
[3:17] * dmick (~dmick@2607:f298:a:607:88ef:4fcc:fc5a:8b23) Quit (Quit: Leaving.)
[3:18] <infernix> but don't I need another box for redundancy?
[3:18] <nhm> infernix: my personal recommendation is to stick with smaller chassis unless you want to do a really big deployment.
[3:18] <infernix> i'm expecting that i need 200TB by the end of next year
[3:19] <infernix> but it's essential i find a way to read sequentially with 2.5GB/s, preferably directio
[3:19] <infernix> e.g. dd if=/rados/block/device of=/superfast/ssd/device iflag=direct oflag=direct
[3:19] <infernix> need at least the oflag
[3:20] <nhm> is this cephfs or block (or S3 object?)
[3:20] <infernix> looking to build a scalable backup method to spindles
[3:20] <infernix> i actually don't really know what the best way is to handle it with ceph, but i'm reading from lvm snapshots and will be writing to them in cases of a disaster recovery
[3:21] <infernix> i can write to the ssd with 4.5gbyte/s, 2 threads; 1 thread does about 2.5gbyte/s
[3:21] <nhm> what kind of SSDs?
[3:21] <infernix> now i've been doing some math with the linux md guys and I can make it work with raid 10 and 33GB
[3:21] <infernix> *33 disks
[3:22] <infernix> because 33 disks at ~80mb/sec = 2.5gbyte/s
[3:22] <infernix> but that will never scale beyond one box
[3:22] <infernix> it's a storage array
[3:22] <infernix> crazy expensive
[3:22] <infernix> hence the need for a fast backup solution
[3:22] <infernix> fast and affordable
[3:22] <nhm> yeah, being able to grow your cluster is a nice benefit ceph provides.
[3:24] <nhm> One thing to keep in mind too is that once you get up to ~2-3GB/s in one box you can start hitting issues with processor/irq affinity, memory affinity, strange PCIE annoyances, etc.
[3:24] <nhm> you may be fine, but it's getting close to the kind of performance where those things crop up.
[3:25] <infernix> i have no problem buying multiple boxes for this
[3:25] <infernix> have a budget of around $10k per 10TB
[3:26] <infernix> and i need to accommodate 20TB initially
[3:26] <nhm> One thing I've really wanted us to do but no one has had time to think about how to do it yet is make a test cluster for potential customers to test this kind of thing out.
[3:27] <infernix> well i'll be buying the 36 disk box regardless probably
[3:28] <infernix> because it's either going to be linux md raid 10, or something that scales beyond a single box
[3:28] <nhm> infernix: given everything you've said, I think I'd go with 2U 12 drive boxes. You can start out with 1 box and just test it out and play with it. If ceph looks promising you can buy another 1 or 2. If it doesn't look good for you, can always take the drives out and repurpose it as something else.
[3:29] <infernix> so no hardware raid underneath eh?
[3:29] <nhm> nope, just straight up jbod.
[3:29] <infernix> so how does it handle disk failure, do i need to monitor this myself?
[3:29] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[3:31] <nhm> infernix: that's what ceph is good at. A drive fails and you've got another copy (or two) of the data on other nodes and healing happens across the whole cluster.
[3:32] <tore_> yeah if you lose an OSD Ceph automatically starts creating additional replicas on other OSDs to replace the ones that were lost
[3:32] <infernix> so the replication is not parity based
[3:32] <jmlowe> ideally the osd will choke or die with the disk and the cluster will mark it as down after it times out, eventually it will remap the data to the remaining osd's
[3:32] <infernix> can I configure objects themselves for 0, 1 or 2 replicas?
[3:32] <nhm> infernix: nope. it's just straight up replication. Lots of people have been asking about erasure coding, but we don't have it yet.
[3:33] <jmlowe> better yet you set the replica policy with a hierarchy
[3:33] <tore_> the most simplistic explanation is that ceph chops files up into 4mb chunks. Then it maintains a preconfigured number of duplicate chunks on different OSDs in the cluster
[3:34] <jmlowe> I want one copy in different racks, or two copies on different servers
[3:34] <tore_> if in your cluster every disk is an OSD, and you lose a disk -> then ceph just stores a replica somewhere else. if the disk comes back or is replaced the pool size adjusts. There is no need for a "rebuild" or "resilvering" type operation in ceph
[3:35] <infernix> but this is per cluster? or per object?
[3:35] <jmlowe> per pool
[3:35] <tore_> you can use a custom crush map to influence where replicas are stored
[3:35] <jmlowe> pools have osd's assigned to them
[3:36] <tore_> like if you want replicas balanced between racks or datacenters or nodes on different power rails
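Per jmlowe's point above, the replica count is a pool-level setting while the CRUSH map decides placement; a hedged sketch of the usual CLI calls (the pool name `backups` and the PG count are made up for illustration):

```shell
# Create a pool and set its replica count; the CRUSH map decides *where*
# the copies land (racks, datacenters, power rails, etc.).
ceph osd pool create backups 256        # 256 placement groups, illustrative
ceph osd pool set backups size 2        # keep two copies of every object
ceph osd pool get backups size          # verify the setting
```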
[3:36] <nhm> infernix: I gotta run to put my kids to bed, but it looks like jmlowe and tore_ are doing a good job of explaining things. :)
[3:36] <jmlowe> you may also weight the osd's arbitrarily, let's say 3:1 new faster 6G disks to older slower 3G disks
[3:36] <tore_> yep
[3:37] <nhm> infernix: good luck, definitely let the channel know what you decide to do. :)
[3:37] <infernix> i need to study this in the morning
[3:38] <infernix> the good thing is that i have the stupid fast interconnect already
[3:38] <infernix> 40gbit IB
[3:39] <tore_> i'm trying to catch up, but will DAS work for you?
[3:39] <tore_> it sounds like you need some serious IO there
[3:39] <infernix> i have stupid fast but stupid expensive SSD storage that i need to back up with ~400MB/s, but restore in emergencies with 2.5gbyte/s
[3:39] <infernix> right now only 20TB but soon probably 100TB
[3:40] <infernix> faster backups would be nice but not essential
[3:40] <infernix> most likely only ever going to restore 10TB units at any given time
[3:41] <infernix> but if i'm taking 12x3TB ceph OSDs and use 2 for OS in raid1 and 10 for ceph, that's 30TB per 2U. I can probably buy 5 right off the bat
[3:42] <tore_> We've done some stuff here for virtualization with lsi sas switches where we assign disks balanced across multiple JBOD to host servers. The server is then configured for mirroring etc. In the event of a failure, we just modify the isolation zone and reboot the server
[3:42] <jmlowe> Using older sas 7.2k 3Gbs drives I could do around 800MB/s with one machine (hp p800 controller 512M cache)
[3:42] <infernix> 50 disks at 80mb/sec = 4gbyte/s in theory
[3:42] <jmlowe> oh, that was against a 12 disk MSA60
[3:43] <tore_> restoration time is basically a few minutes in our setup, but we don't have a requirement for an offsite backup
[3:43] <infernix> so ceph runs entirely over tcp?
[3:44] <infernix> i mean i have infiniband. rdma is amazing
[3:44] <tore_> an LSI warpdrive and ZFS is also a really great way to achieve some ridiculous IOPS
[3:44] <jmlowe> I've had to go jbod with btrfs raid 10, slot 6 keeps returning different data than what I write to it, I suspect I have a faulty backplane but can't prove it to the vendor
[3:44] <jmlowe> no rdma, … yet
[3:45] * deepsa (~deepsa@ has joined #ceph
[3:45] <jmlowe> ip over ib is what you would have to use
[3:45] <infernix> hrm. might run into issues there
[3:45] <jmlowe> relative to the disk latency you probably wouldn't see any benefit to the ib goodies other than high bandwidth
[3:46] <jmlowe> I use 10GigE
[3:46] <tore_> by any chance, are you using supermicro blade chassis, etc?
[3:47] <jmlowe> tore_: who are you asking?
[3:47] <infernix> we have a twinx
[3:47] <infernix> 4 in 2u
[3:47] <infernix> might get more, might not; depends on RAM pricing
[3:47] <tore_> is this your first venture with them?
[3:47] <infernix> supermicro? no
[3:47] <tore_> I have a few here, and the supermicro switch modules are complete shit
[3:48] <infernix> switch modules?
[3:48] <tore_> been working with their taiwan support for over a year now to fix a switch lockup issue
[3:48] <jmlowe> I swore off dell a few years ago, now my recent experiences with hp have me thinking I'll buy IBM next time
[3:48] <infernix> i've had all of them
[3:48] <infernix> ibm in 2000-2003
[3:48] <infernix> dell since forever
[3:49] <infernix> hp the past 2 years
[3:49] <infernix> i'm in the supermicro camp
[3:49] <infernix> especially now that they have builtin mellanox infiniband boards
[3:49] * glowell (~glowell@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:49] <infernix> incidentally, mellanox uses supermicro for everything
[3:49] <jmlowe> I'm pushing a 75% replacement rate for my hp drives after 3.5 years
[3:49] <infernix> but i haven't used their superblades
[3:50] <infernix> just the twinx chassis. i can't use blades, need 2 IB ports per server
[3:50] <tore_> what we found was that if you connect to ssh on the switch module enough times, the switch module's ssh service and web gui both crash. Then approximately 60 - 90 min later, the switch completely dies
[3:51] <tore_> unfortunately it doesn't completely die
[3:51] <infernix> is there a limit to object sizes?
[3:52] <tore_> the software on supermicro's switch modules is from 2003. They don't maintain it and they have memory leak issues that they do not seem capable of resolving
[3:53] <infernix> and can I selectively replicate to another datacenter? can i for example create two pools using the same disks and have one pool only on servers in datacenter A, and another pool in datacenter A and B?
[3:53] <tore_> we even had our dev group supply them with an exploit, so they could replicate it in their lab. after a year they still haven't been able to fix it
[3:55] <tore_> I think pools span the entire cluster
[3:55] <tore_> however you can create a crush map for each pool that determines how the data is placed
[3:57] <tore_> as for object size, ceph uses a 4mb object size by default. out of the box files regardless of size will be chopped into 4mb pieces
[3:57] <jmlowe> yes, you can do that, I have two datacenters and one cluster, just be sure your latency is low enough
[3:58] <tore_> I'm not sure what the maximum object size would be. Maybe someone with better knowledge of the internals could answer that for you
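The 4MB default tore_ mentions surfaces in rbd as the "order" (object size = 2^order bytes); a small sketch of that relationship, with the arithmetic checked numerically (the `--order` flag mention is illustrative of how it can be set explicitly):

```shell
# rbd's default order is 22, and 2^22 bytes is the 4MB object size
# discussed above (e.g. `rbd create --order 22 ...` makes it explicit).
awk 'BEGIN { order = 22; printf "order %d -> %d MB objects\n", order, 2^order / (1024*1024) }'
```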
[3:59] <infernix> jmlowe: 70ms?
[3:59] <infernix> i can easily saturate the gbit link with multiple tcp streams, a single stream not quite though
[4:00] <infernix> i need to store at least 4TB but eventually probably 10TB objects
[4:02] * nwatkins (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Ex-Chat)
[4:10] <infernix> so if I have, say, three 2U boxes, 10 3TB disks for OSD e.g. 90TB raw, 45TB with 2 replicas, and i stripe across them
[4:10] <infernix> if i add another 2U box in a month, and another in 2 months
[4:11] <jmlowe> don't know what the timeout period is, my latency is <2ms
[4:11] <infernix> are those going to be used for striping when creating new objects?
[4:11] <infernix> in other words, does that increase the performance?
[4:11] <infernix> or should I buy another set of 3?
[4:13] <jmlowe> ceph -s
[4:13] <jmlowe> health HEALTH_OK
[4:13] <jmlowe> monmap e2: 3 mons at {alpha=,beta=,gamma=}, election epoch 6, quorum 0,1,2 alpha,beta,gamma
[4:13] <jmlowe> osdmap e1266: 9 osds: 9 up, 9 in
[4:13] <jmlowe> pgmap v86670: 1360 pgs: 1360 active+clean; 87403 MB data, 374 GB used, 53270 GB / 55889 GB avail
[4:13] <jmlowe> mdsmap e265: 1/1/1 up {0=alpha=up:active}
[4:14] <jmlowe> right now I have 2 nodes with 3 and 6 osd's
[4:15] <jmlowe> with 9/9 "in" a copy is being placed on each node and assigned an osd on that node
[4:15] <jmlowe> my limiting factor is the aggregate write speed of the machine with 3 osd's
[4:16] <jmlowe> I use rbd to back vm's and inside of one of my vm's I unpack the 2.6.39 kernel from kernel.org in 33 to 35 seconds
[4:17] <jmlowe> make that 2.6.32
[4:17] <jmlowe> linux- to be specific
[4:19] <jmlowe> if you are doing rbd and dd as you indicated then the mds won't factor in, but if you are using cephfs then the size and number of files will be significant
[4:20] <infernix> echo 3 > /proc/sys/vm/drop_caches; time tar xjf linux- - real 0m18.171s
[4:20] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[4:21] <infernix> but you're doing this on cephfs?
[4:21] <jmlowe> inside a vm backed by rbd
[4:21] <jmlowe> vm running ubuntu 12.10
[4:21] <infernix> ah yes; this is inside a ubuntu 12.04 vm backed by ssd
[4:22] <nhm> infernix: for infiniband, IPoIB isn't too bad if you use the irq affiniity stuff.
[4:22] <infernix> nhm: irq affinity stuff?
[4:22] <nhm> infernix: otherwise rsocket might be worth looking into.
[4:22] <jmlowe> I figure best I can do is about 500MB/s with my hardware
[4:23] <nhm> infernix: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10368
[4:23] <infernix> we're planning to do 10Gb ethernet over IB soon in any case
[4:25] <nhm> heh, 3 year old calling out. bbiab
[4:30] <infernix> wait, cephfs is shared?
[4:30] <infernix> o_O
[4:32] <jmlowe> shared?
[4:32] <infernix> i can mount it on multiple servers concurrently?
[4:33] <infernix> for some reason i thought it wasn't
[4:33] <infernix> i thought, ok, cephfs is a filesystem only one server can mount at a given time
[4:34] * infernix needs sleep
[4:38] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: ircolle)
[4:38] <nhm> infernix: that's the rbd (ie block device) layer.
[4:38] <nhm> infernix: cephfs is the distributed filesystem layer.
[4:44] <infernix> i'm going to break my dev setup tomorrow and install it
[4:44] <infernix> but first some shuteye. thanks so far :)
[4:46] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[4:46] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[4:47] * Cube (~Cube@64-60-46-82.static-ip.telepacific.net) has joined #ceph
[4:47] * Cube (~Cube@64-60-46-82.static-ip.telepacific.net) Quit ()
[4:48] <jmlowe> infernix: multiple clients can mount and read/write concurrently a la gpfs, lustre, pnfs, gfs, ocfs
[4:48] <jmlowe> nhm: did I miss any?
[4:48] <jmlowe> gluster
[4:50] <nhm> orangefs/pvfs, tahoe-LAFS, moosefs, FhGFS
[4:50] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[4:50] <nhm> might be a couple of others I'm not remembering. :)
[4:50] * loicd (~loic@magenta.dachary.org) has joined #ceph
[4:50] <jmlowe> afs
[4:51] <nhm> Ah yes, afs. Never played with it, but a project at MSI used it.
[4:52] <jmlowe> http://kb.iu.edu/data/aroz.html
[4:53] <nhm> ah, interesting
[4:53] <nhm> jmlowe: btw, were you at SC12?
[4:53] <jmlowe> I passed this year, new baby and all
[4:54] <jmlowe> wasn't an easy decision
[4:54] <nhm> jmlowe: ah, that's totally understandable.
[4:54] <nhm> jmlowe: I think I talked to someone from Indiana University who had some FS corruption with Ceph. :(
[4:55] <jmlowe> that would be my boss
[4:55] <nhm> jmlowe: Ah, too bad. Sorry you guys got hit so hard. :(
[4:55] <jmlowe> turns out on one of my arrays it doesn't matter what disk you shove into slot 6, the bits you put in aren't the ones you get out
[4:56] <jmlowe> only sometimes
[4:56] * plut0 (~cory@pool-96-236-43-69.albyny.fios.verizon.net) has left #ceph
[4:56] <nhm> jmlowe: I don't know much about the backstory, was the corruption due to a bug in ceph?
[4:56] <jmlowe> finally figured it out after I spent 3 weeks unsuccessfully trying to scrub zfs clean
[4:57] <jmlowe> I'm sure we were hit by the fiemap bug
[4:57] <jmlowe> but we also have hardware problems
[4:58] <jmlowe> only determined in the past month that we have a possibly bad backplane in the chassis
[4:58] <nhm> jmlowe: ok, well, hopefully you guys can test it again some day and the experience will be better next time!
[4:59] <jmlowe> I'm hoping we can go back to ceph with paid support
[4:59] <jmlowe> really is the best solution for what we want to do
[5:00] <nhm> jmlowe: That would definitely make our business guys happy. ;)
[5:02] <jmlowe> you guys going to be able to make a go of things? I've had my eye on the system engineer posting
[5:04] <nhm> jmlowe: hrm, I'm probably not supposed to talk about those kinds of things. I'll say I'm not particularly worried. :)
[5:06] <nhm> jmlowe: We definitely could use more people helping out on the professional services side.
[5:07] <jmlowe> I can't really leave Indiana, but I think Sage said inktank was a "post geographic company"
[5:08] <nhm> jmlowe: we have people all over the country and one of the developers is in Portugal.
[5:08] <jmlowe> maybe I'll have another bad day and finish my cover letter
[5:08] <jmlowe> :_
[5:08] <jmlowe> :)
[5:08] <nhm> jmlowe: that was the same story for me. I couldn't leave MN. It took me a long time to find a place like Inktank. :)
[5:09] <nhm> jmlowe: there are things I miss about academia. It's hard being so much farther away from the science. There are a lot of benefits too though.
[5:10] <nhm> jmlowe: It's both amazing and slightly alarming being surrounded by so many incredibly smart people.
[5:11] <jmlowe> I've been at the university for 7 years, straight out of Purdue with my computer engineering degree
[5:12] <jmlowe> it would be a major shock I think
[5:12] <nhm> jmlowe: I graduated and went off to General Dynamics AIS for a little over a year (do not recommend) and then came back and worked for the Supercomputing Institute for 6 years.
[5:13] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[5:13] * loicd (~loic@magenta.dachary.org) has joined #ceph
[5:14] <nhm> jmlowe: it is a bit of a shock. If you are one of the guys at the University that has a perverse sense of needing to do things, you can work yourself to death at a place like inktank. There are so many responsibilities and things you *could* be doing that you have to eventually learn to pace yourself.
[5:15] <nhm> on the plus side, inktank is about as technical and researchy of a place as any I've found.
[5:15] <nhm> People read and discuss computing research papers for fun. :)
[5:16] <jmlowe> We are getting a petaflop cray xk5 in the new year so things may change quite a bit, but with a 6 year old supercomputer and no real national commitments other than the vm's I provide it's not exactly what I signed up for
[5:17] <nhm> jmlowe: I pushed for us to go after national funding a number of times but there was never the will at the top.
[5:18] <jmlowe> did I say xk5 I meant xk6
[5:18] <nhm> jmlowe: we finally got some NIH money for a shared memory machine, but somehow it worked out that we ended up with an Altix UV1000 with a very oversubscribed 2 torus network.
[5:19] <jmlowe> there was will at the top here but we couldn't seem to win anything
[5:19] <nhm> s/2/2D
[5:19] <nhm> jmlowe: Did you guys partner with anyone?
[5:19] <jmlowe> I got 2 fte's to provide vm's for xsede
[5:19] <nhm> jmlowe: well, that's better than nothing I suppose.
[5:20] <jmlowe> uptime
[5:20] <jmlowe> 23:20:24 up 322 days, 13:11, 0 users, load average: 0.08, 0.02, 0.01
[5:21] <jmlowe> that's how I got it, nothing else at IU other than portal development was funded
[5:22] <nhm> the pot is smaller for xsede. It'll be tougher to get the big procurements, even for the guys that got them last time.
[5:22] <jmlowe> We won this http://ncgas.org/
[5:22] <jmlowe> I don't see another blue waters anytime soon
[5:23] <nhm> jmlowe: yeah, we got something kind of like that for proteomics through a mayo clinic / umn grant.
[5:23] <jmlowe> maybe doe will keep up but I don't see nsf funding any big machines in the foreseeable future
[5:25] <jmlowe> we also won futuregrid but the pi is a maniac and fired us after we built it for him
[5:26] <nhm> jmlowe: sounds about right. Guess we'll see how badly the DOE needs to get the simulations finished before all of the old guys who worked on the real weapons retire/die off.
[5:26] <nhm> lol, maniac pis! That is one thing I won't miss. ;)
[5:27] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[5:28] * chutzpah (~chutz@ Quit (Quit: Leaving)
[5:28] <jmlowe> ok, I'm turning in, the milk-powered alarm clock doesn't have a snooze button
[5:31] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[5:32] * yasu` (~yasu`@dhcp-59-168.cse.ucsc.edu) has joined #ceph
[5:34] <nhm> lol, have a good night. :)
[5:40] * deepsa_ (~deepsa@ has joined #ceph
[5:42] * AaronSchulz_ (~chatzilla@ has joined #ceph
[5:43] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[5:43] * deepsa_ is now known as deepsa
[5:47] * AaronSchulz (~chatzilla@ Quit (Ping timeout: 480 seconds)
[5:47] * AaronSchulz_ is now known as AaronSchulz
[5:47] * tore_ (~tore@ Quit (Remote host closed the connection)
[5:52] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[6:10] * deepsa_ (~deepsa@ has joined #ceph
[6:15] * Cube (~Cube@ has joined #ceph
[6:15] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[6:15] * deepsa_ is now known as deepsa
[6:19] * Cube1 (~Cube@ has joined #ceph
[6:23] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[6:32] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[6:38] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[6:49] * The_Bishop (~bishop@2001:470:50b6:0:714b:9a8d:ae17:c726) Quit (Read error: No route to host)
[6:49] * MK_FG (~MK_FG@00018720.user.oftc.net) Quit (Quit: o//)
[6:50] * MK_FG (~MK_FG@00018720.user.oftc.net) has joined #ceph
[6:51] * The_Bishop (~bishop@2001:470:50b6:0:714b:9a8d:ae17:c726) has joined #ceph
[6:53] * yasu` (~yasu`@dhcp-59-168.cse.ucsc.edu) Quit (Remote host closed the connection)
[6:58] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[6:58] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[7:06] * maxiz (~pfliu@ Quit (Quit: Ex-Chat)
[7:19] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[7:19] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[7:36] * Cube1 (~Cube@ Quit (Ping timeout: 480 seconds)
[7:42] * joey__ (~terje@75-166-98-10.hlrn.qwest.net) has joined #ceph
[7:42] * terje_ (~joey@75-166-98-10.hlrn.qwest.net) has joined #ceph
[7:44] * boll (~boll@00012a62.user.oftc.net) Quit (Quit: boll)
[7:44] * joey_ (~terje@71-218-31-90.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[7:44] * terje (~joey@71-218-31-90.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[8:00] * tnt (~tnt@207.171-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:03] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[8:06] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:13] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[8:21] * wer (~wer@wer.youfarted.net) Quit (Remote host closed the connection)
[8:21] * wer (~wer@wer.youfarted.net) has joined #ceph
[8:29] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[8:29] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[8:35] * xiaoxi (~xiaoxiche@jfdmzpr06-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[8:35] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[8:36] * xiaoxi (~xiaoxiche@ has joined #ceph
[8:59] * nosebleedkt (~kostas@ has joined #ceph
[8:59] <nosebleedkt> hi joao and everybody
[9:00] <nosebleedkt> joao, I keep getting this at the debug console: osd.2 [INF] 1.bf scrub ok
[9:00] <nosebleedkt> multiple same messages
[9:00] <nosebleedkt> what is this about ?
[9:00] * boll (~boll@00012a62.user.oftc.net) has joined #ceph
[9:00] * low (~low@ has joined #ceph
[9:14] <jtang> interesting
[9:14] <jtang> just saw the chat between nhm and jmlowe about corrupt systems
[9:14] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:21] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[9:23] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[9:24] * tnt (~tnt@207.171-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:24] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[9:24] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[9:25] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:27] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:33] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:35] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[9:35] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[9:38] <nosebleedkt> I upgraded to ceph 0.55 from 0.48. I deleted all previous ceph stuff so I run a clean setup.
[9:38] <nosebleedkt> I have all daemons working normally and ceph reporting healthy
[9:39] <nosebleedkt> Now I want to add at realtime a new osd
[9:39] <nosebleedkt> I do : root@masterceph:~# ceph osd create 3
[9:39] <nosebleedkt> (22) Invalid argument
[9:39] <nosebleedkt> wtf is 22 ? :(
[9:41] <tnt> the error code value for EINVAL = "Invalid argument" ...
[9:44] * pmjdebruijn (~pmjdebrui@overlord.pcode.nl) has joined #ceph
[9:44] <pmjdebruijn> lo
[9:45] <pmjdebruijn> I'm looking around in the ceph-client tree, I'm wondering what the current recommended kernel branch is?
[9:47] <nosebleedkt> tnt, yes, but what does this mean?
[9:47] <nosebleedkt> ceph osd create [<uuid>]
[9:47] <nosebleedkt> uuid?
[9:48] * sileht (~sileht@sileht.net) Quit (Quit: WeeChat
[9:49] <pmjdebruijn> I noticed there was a big merge 3.4.20
[9:49] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[9:49] <tnt> http://en.wikipedia.org/wiki/Universally_unique_identifier
[9:50] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[9:51] * sileht (~sileht@sileht.net) has joined #ceph
[9:52] * Leseb (~Leseb@ has joined #ceph
[9:53] * ScOut3R (~ScOut3R@ has joined #ceph
[9:53] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[9:53] * ScOut3R (~ScOut3R@ has joined #ceph
[9:54] * ScOut3R_ (~ScOut3R@ has joined #ceph
[9:55] <pmjdebruijn> and 3.6.6 apparently
[9:57] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[9:58] <nosebleedkt> tnt, how do i create uuids for new osds ?
[9:59] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[10:01] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[10:16] * loicd (~loic@ has joined #ceph
[10:22] <pmjdebruijn> any advice on using 3.6.9 vs 3.4.21 ?
[10:33] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[10:41] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[10:41] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[11:07] <joao> nosebleedkt, it's the uuid of the osd that you want to add
[11:07] <joao> I think the docs got that wrong
[11:07] <nosebleedkt> joao, yeah i figured it out
[11:07] <joao> it's not the number
[11:08] <nosebleedkt> i had to do: xfs_admin -U generate /dev/sdb
[11:08] <joao> the number of the osd is returned upon a successful create
[11:08] <nosebleedkt> to get the UUID
[11:08] <joao> not that uuid
[11:08] <joao> it's the uuid generated during ceph-osd --mkfs, if I'm not mistaken
[11:09] <nosebleedkt> ceph-osd -i ID --mkfs --mkkey runs after you create
[11:09] <joao> I was going to point you to the docs, but then I saw that the docs state 'ceph osd create {osd-num}', which should not be right
[11:09] <joao> nosebleedkt, that's actually not relevant
[11:09] <joao> you can ceph-osd --mkfs *before*
[11:09] <nosebleedkt> joao, i updated ceph from 0.48 to 0.55
[11:10] <nosebleedkt> so i start to have some compatibility issues
[11:10] <joao> you just shouldn't bring it up before it's configured on the cluster
[11:10] <nosebleedkt> ok let me try do it again
[11:10] <joao> nosebleedkt, you can ignore that parameter
[11:10] <joao> just 'ceph osd create', obtain the value that is returned on success
[11:10] <nosebleedkt> which one
[11:11] <joao> and that will be your osd number
[11:11] <joao> s/number/id
[11:12] <nosebleedkt> ok wait
[11:13] <nosebleedkt> root@osdprovider:~# ceph-osd --mkfs --mkkey /dev/sdb
[11:13] <nosebleedkt> thats wrong
[11:13] <nosebleedkt> i need to provide an ID
[11:14] <joao> does it ask for an --uuid ?
[11:14] <nosebleedkt> root@osdprovider:~# ceph-osd -i 3 --mkfs --mkkey
[11:14] <nosebleedkt> works ok
[11:14] <nosebleedkt> then i
[11:14] <nosebleedkt> root@osdprovider:~# ceph osd create 3
[11:14] <nosebleedkt> (22) Invalid argument
[11:14] <joao> oh, yeah, that
[11:14] <joao> no
[11:14] <joao> the '3' is not the osd id
[11:14] <nosebleedkt> create 3 is wrong. should be create UUID
[11:14] <joao> yes
[11:15] <joao> I mean, '3' is your osd id, not your uuid
[11:15] <joao> sorry, late night, slept in, kinda slow
[11:15] <nosebleedkt> created object store /var/lib/ceph/osd/ceph-3 journal /var/lib/ceph/osd/ceph-3/journal for osd.3 fsid f613334f-8341-43f3-ae86-c4c220525c0b
[11:15] <joao> that's you uuid
[11:15] <joao> :)
[11:15] <nosebleedkt> AAAAAAAAAAAAAA
[11:15] <joao> *your
[11:15] <nosebleedkt> KK
[11:15] <nosebleedkt> let me try again then
[11:16] <nosebleedkt> ok worked
[11:19] <nosebleedkt> now i do
[11:19] <nosebleedkt> root@osdprovider:~# ceph osd crush set 3 osd.3 2.0 pool=default
[11:19] <nosebleedkt> (22) Invalid argument
[11:19] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[11:20] <nosebleedkt> joao,
[11:20] <nosebleedkt> help shows: ceph osd crush set <osd-id> <weight>
[11:20] <joao> I'm not sure how your crushmap is setup, but try it with 'root=' instead of pool
[11:21] <nosebleedkt> root@osdprovider:~# ceph osd crush set 3 2.0 root=default
[11:21] <nosebleedkt> updated item id 3 name 'osd.3' weight 2 at location {root=default} to crush map
[11:21] <joao> and you can also define a pool, sure, but a known point of reference should be present anyway
[11:21] <agh> hello to all.
[11:22] * AaronSchulz_ (~chatzilla@ has joined #ceph
[11:22] <agh> one question: how does a ceph update go in a production env ?
[11:22] <agh> i mean, when a new version of Ceph comes out, what do we have to do on a prod env ?
[11:22] <agh> is it a dangerous task for the data ?
[11:22] <joao> as far as I know, we're trying to keep it a rolling upgrade
[11:23] <joao> in production you might want to take your time with osds as you see fit though
[11:23] <joao> but maybe someone with production clusters around here could shed better light?
[11:23] <nosebleedkt> joao, ok done. I added an OSD at runtime with 0.55 version
[11:24] <joao> great
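The full sequence nosebleedkt ended up with can be recapped as a sketch (the id 3, the UUID, and the weight come from the conversation above; key registration with `ceph auth add` and starting the daemon are left out, and this requires a live cluster):

```shell
# Create the on-disk object store and key for the new OSD; the fsid
# printed at the end is the UUID the next step expects.
ceph-osd -i 3 --mkfs --mkkey

# Register the OSD in the cluster map. As of 0.55 the optional
# argument is a UUID, not an OSD id; the id is printed on success.
ceph osd create f613334f-8341-43f3-ae86-c4c220525c0b

# Weight it into the CRUSH map under a known root ('root=' rather
# than the old 'pool=' syntax).
ceph osd crush set 3 2.0 root=default
```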
[11:24] <agh> mm… i will have to test that in my lab then
[11:24] <agh> thanks
[11:24] <joao> going to make some coffee; bbiab
[11:25] <nosebleedkt> :)
[11:27] * AaronSchulz (~chatzilla@ Quit (Ping timeout: 480 seconds)
[11:27] * AaronSchulz_ is now known as AaronSchulz
[11:29] <ScOut3R_> is there any best practice to create the xfs filesystem under the OSDs? i'm not very familiar with XFS and i wonder whether there are any performance tricks to apply
[11:29] * ScOut3R_ is now known as ScOut3R
[11:34] <match> Does anyone have any experience of handling network partitioning with ceph on multi-site clusters? Do you just rely on an odd number of mons for quorum, or do you do something else fancier?
[11:35] * The_Bishop (~bishop@2001:470:50b6:0:714b:9a8d:ae17:c726) Quit (Ping timeout: 480 seconds)
[11:37] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[11:38] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[11:42] * mooperd (~andrew@ has joined #ceph
[11:43] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[11:44] * The_Bishop (~bishop@2001:470:50b6:0:8d5d:14da:c16f:c382) has joined #ceph
[11:44] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[11:51] * ScOut3R (~ScOut3R@ has joined #ceph
[11:53] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[11:59] <nosebleedkt> im getting some bugs i think
[11:59] <nosebleedkt> Removing image: 99% complete...2012-12-05 12:58:43.232507 b2154b70 0 -- >> pipe(0x96ba8e8 sd =6 :0 pgs=0 cs=0 l=1).fault
[11:59] <nosebleedkt> 2012-12-05 12:59:07.005334 b2154b70 0 -- >> pipe(0x96dbe90 sd=4 :0 pgs=0 cs=0 l=1).fault
[12:03] * asdfasfas (asdfasfas@ has joined #ceph
[12:03] <asdfasfas> hello
[12:09] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[12:10] * BManojlovic (~steki@ has joined #ceph
[12:17] * loicd (~loic@ has joined #ceph
[12:21] * fc (~fc@ has joined #ceph
[12:22] * fc is now known as fc___
[12:23] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[12:24] * ScOut3R (~ScOut3R@ has joined #ceph
[12:25] * loicd (~loic@ Quit (Quit: Leaving.)
[12:30] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[12:32] * gaveen (~gaveen@ has joined #ceph
[12:33] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:39] * jks (~jks@3e6b7199.rev.stofanet.dk) Quit (Read error: Connection reset by peer)
[12:41] * guigouz (~guigouz@201-87-100-166.static-corp.ajato.com.br) has joined #ceph
[12:43] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[12:51] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[13:10] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Remote host closed the connection)
[13:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:26] * guigouz1 (~guigouz@201-87-100-166.static-corp.ajato.com.br) has joined #ceph
[13:29] * guigouz (~guigouz@201-87-100-166.static-corp.ajato.com.br) Quit (Ping timeout: 480 seconds)
[13:29] * guigouz1 is now known as guigouz
[13:39] * asdfasfas (asdfasfas@ Quit ()
[13:41] * deepsa (~deepsa@ has joined #ceph
[13:42] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[13:46] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[13:47] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[13:51] * ScOut3R_ (~ScOut3R@ has joined #ceph
[13:53] * tryggvil_ (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[13:53] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Read error: Connection reset by peer)
[13:53] * tryggvil_ is now known as tryggvil
[13:54] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[13:55] * loicd (~loic@ has joined #ceph
[13:58] * firaxis (~vortex@unnum-91-196-193-107.domashka.kiev.ua) Quit (Remote host closed the connection)
[14:11] * guigouz1 (~guigouz@ has joined #ceph
[14:14] * guigouz (~guigouz@201-87-100-166.static-corp.ajato.com.br) Quit (Ping timeout: 480 seconds)
[14:14] * guigouz1 is now known as guigouz
[14:15] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[14:24] * jrisch (~jrisch@4505ds2-hi.0.fullrate.dk) has joined #ceph
[14:26] * mooperd (~andrew@ Quit (Remote host closed the connection)
[14:48] * SkyEye (~gaveen@ has joined #ceph
[14:53] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[14:57] * synfin (~rahl@cv-nat-A-128.cv.nrao.edu) has joined #ceph
[14:58] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) has joined #ceph
[14:59] <synfin> I just started up ceph with fuse and tried exporting the filesystem over NFS. On an NFS client I created a new file with vi, but vi says there is an existing lock file. Sure enough, there was. It appears there may be a delay on file creation? Has anyone else experienced this before?
[14:59] <synfin> This problem does not manifest itself on the fuse mounted ceph client, only nfs
[15:05] * boll is now known as Guest560
[15:05] * boll (~boll@00012a62.user.oftc.net) has joined #ceph
[15:05] * boll (~boll@00012a62.user.oftc.net) Quit ()
[15:12] * Guest560 (~boll@00012a62.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:13] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:24] * fmarchand (~fmarchand@ has joined #ceph
[15:24] <fmarchand> Hi !
[15:25] <fmarchand> I updated ceph to the 0.55 version and ... It looks like you don't start the cluster the same way as before ? sudo service ceph start ?
[15:26] <low> fmarchand: see http://ceph.com/releases/v0-55-released/
[15:28] <fmarchand> I read it ..
[15:29] <fmarchand> but it says nothing about a new way to start the cluster ... does it ?
[15:29] <low> thought you hit the authx thing, misread your sentence sorry
[15:30] <fmarchand> my authx is disabled
[15:31] <fmarchand> but when I do sudo service ceph start ... there is a ceph process running and taking a lot of cpu but it doesn't look like before
[15:33] <fmarchand> ceph command does not work anymore in fact
[15:33] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) Quit (Quit: Ex-Chat)
[15:36] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[15:37] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[15:38] <fmarchand> someone had the same weird behavior ?
[15:40] <mikedawson> fmarchand: I'm seeing the same behavior
[15:41] <fmarchand> oh ..
[15:41] <fmarchand> if you do a ceph -w it never gives you any result ?
[15:42] <mikedawson> fmarchand: ceph -w works for me on the one node I have upgraded so far
[15:43] <mikedawson> authx is enabled for me (and had been prior to the upgrade)
[15:45] <fmarchand> ceph -s or ceph -w on my cluster never returns control
[15:48] <mikedawson> init script changed quite a bit with the upgrade to 0.55 http://pastebin.com/kCUFCjfG
[15:50] <via> line 52 of that diff is a syntax error i ran into
[15:50] <via> and fwiw, my auth is still broken from yesterday
[15:52] * synfin (~rahl@cv-nat-A-128.cv.nrao.edu) Quit (Ping timeout: 480 seconds)
[15:52] * drokita (~drokita@ has joined #ceph
[15:55] * boll (~boll@00012a62.user.oftc.net) has joined #ceph
[15:56] <mikedawson> fmarchand: /etc/init.d/ceph start works for me on Ubuntu 12.10
[15:57] <fmarchand> I rebooted the machine ... ceph restarted but I have only 1 osd
[15:57] <fmarchand> and when I do "ceph -w" => unable to authenticate as client.admin
[15:58] * aliguori (~anthony@ has joined #ceph
[15:59] * PerlStalker (~PerlStalk@ has joined #ceph
[16:03] * drokita1 (~drokita@ has joined #ceph
[16:06] * nosebleedkt (~kostas@ Quit (Quit: Leaving)
[16:07] * drokita2 (~drokita@ has joined #ceph
[16:08] * drokita2 (~drokita@ Quit ()
[16:08] * drokita2 (~drokita@ has joined #ceph
[16:10] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[16:10] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[16:14] * drokita1 (~drokita@ Quit (Ping timeout: 480 seconds)
[16:15] * BManojlovic (~steki@ has joined #ceph
[16:16] * drokita2 (~drokita@ Quit (Ping timeout: 480 seconds)
[16:17] * drokita (~drokita@ has joined #ceph
[16:17] <fmarchand> Oki 1 osd doesn't want to start : ERROR: osd init failed: (1) Operation not permitted
[16:18] <infernix> when is it recommended to use an OSD on top of a hardware raid?
[16:19] <infernix> I watched a presentation on youtube at inktank and it was mentioned
[16:19] <mikedawson> I have completed the upgrade to 0.55 on Ubuntu 12.10 and am back to HEALTH_OK. cephx is enabled (and was before the upgrade).
[16:19] <fmarchand> mikedawson: lucky you !
[16:19] <mikedawson> "service ceph start" is broken, but "/etc/init.d/ceph start" works as expected.
[16:21] <fmarchand> mikedawson: yes I tried and it worked thx ... it's just that I can't use "ceph -w" and my osd.0 does not want to start ...
[16:21] <mikedawson> infernix: I'm not using any RAID. I use one OSD process per SATA disk, plus a single SSD with 10GB partitions (with no formatted filesystem) for the OSD journals
[16:22] <jmlowe> I use btrfs software raid, but that's because I'm stuck with a $15k array that silently corrupts data but I can't prove it to the vendor
[16:23] <mikedawson> fmarchand: sounds like an auth issue
[16:25] <fmarchand> oki ...
[16:26] <infernix> ok, so next question. I can either go with a 24 disk or 36 disk box; one OSD per disk. assuming there will be no issue with network bandwidth (infiniband, e.g. ~3gbyte/s with IPoIB), what kind of cpu do the OSDs use?
[16:26] <fmarchand> "auth supported = none " in the conf file is deprecated
[16:26] <infernix> say about 22 OSD daemons vs 34 OSD daemons
[16:26] <fmarchand> must use :
[16:26] <fmarchand> auth cluster required = none
[16:26] <fmarchand> auth service required = none
[16:26] <fmarchand> auth client required = none
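Put together, the replacement for the deprecated `auth supported` line looks like this in ceph.conf (a sketch of the new-style options fmarchand lists; set the values to `cephx` instead of `none` to keep authentication enabled):

```ini
[global]
    ; old, deprecated form:
    ; auth supported = none

    ; new-style options as of 0.55:
    auth cluster required = none
    auth service required = none
    auth client required = none
```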
[16:26] <infernix> does that saturate 4 cores? 6? 8? 12?
[16:26] <infernix> or 16 even?
[16:26] * ScOut3R (~ScOut3R@ has joined #ceph
[16:29] <via> so, to attempt to recover from my failed ceph cluster due to all the auth problems, i've tried recreating my monitor
[16:29] <via> and i did so as per instructions, providing the old monmap, etc
[16:29] <mikedawson> infernix: If you are using JBOD and SSDs for journals, divide the continuous write throughput of the SSDs by the continuous write throughput of your spinning disks to get a ratio. In my case I use one SSD to every three SATA drives.
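mikedawson's sizing rule reduces to one division; the throughput numbers below are illustrative assumptions, not measurements:

```shell
# One SSD can absorb the combined journal writes of roughly
# (ssd_MBps / hdd_MBps) spinning disks.
SSD_MBPS=500   # assumed sustained SSD write throughput
HDD_MBPS=80    # assumed sustained 7200rpm SATA write throughput
echo $(( SSD_MBPS / HDD_MBPS ))   # integer floor: 6 disks per SSD
```

A more conservative ratio (such as mikedawson's 1 SSD per 3 SATA drives) leaves headroom and limits how many OSDs a single SSD failure takes down.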
[16:29] <via> but now it thinks it wants 192 pgs vs previously 960 pgs, and it can't find any of my OSD's even after i add the keys for them
[16:30] <mikedawson> infernix: at that point your network throughput limits the number of OSDs per server. In general, it is preferred to use more servers with less OSDs/server over less servers with more OSDs/server.
[16:31] * ScOut3R_ (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[16:31] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[16:32] <infernix> mikedawson: i'm on infiniband
[16:33] <mikedawson> infernix: if you have dual 1Gbps NICs for instance, use one for public and the other for cluster. You get 125MB/s of throughput to replicate. That seriously bottlenecks the throughput capabilities of your SSDs (maybe ~500MB/s) / OSDs (maybe ~200MB/s).
[16:33] <infernix> they are 40gbit, but effectively 3gbyte/s with ipoib
[16:33] * fmarchand (~fmarchand@ Quit (Remote host closed the connection)
[16:34] <infernix> i hadn't thought about the journal yet though.
[16:34] <mikedawson> infernix: do the math, and I'd say 3GB/s might support 6 SSDs and 18 OSDs nicely
[16:35] <infernix> 18 osds? what 7200rpm drive does 200mb/s :)
[16:35] <infernix> i'm assuming 80mb/sec for the sata stuff
[16:36] <jtang> you reckon the sata controller has enough bandwidth to handle all the drives?
[16:36] <infernix> 2 backplanes, 1 24port and 1 12 port; 2 pcie 3.0 LSI SAS HBAs
[16:36] <infernix> 2gbit per port on the 12-port backplane, 1gbit on the 24port
[16:37] <infernix> so yeah
[16:37] <infernix> i'd think so
[16:37] <jtang> 200mbyte/s isn't unreasonable if you are only getting about 3gbit/s from the ipoib connection
[16:38] <jtang> you are limited by the network if that is the case
[16:38] <infernix> gByte
[16:38] <infernix> not bit
[16:38] <jtang> ah okay, i didnt know that ipoib has gotten that good these days
[16:38] <via> can someone explain the procedure for replacing a monitor when its the only monitor?
[16:38] <jtang> the last time i tried, we only got about 5-6gbit/s from a ipoib connection
[16:39] <via> running ceph-mon with the monmap from the previous one does not appear to work
[16:39] * jtang shrugs, we only have a 2node ceph cluster
[16:39] <jtang> 45disks in each pod
[16:39] <jtang> and 2x1gbit/s connections on each node
[16:39] <jtang> we get only about 90mbyte/s tops from the system if i remember right
[16:40] <jtang> for write performance that is
[16:40] * jtang has cheap hardware
[16:40] <infernix> i'm trying to get the quote together for 3 boxes at 36x3TB each
[16:40] <infernix> but i hadn't thought about using ssd for journal at all
[16:40] <jtang> 36disks is kinda pushing it
[16:41] <infernix> they are about $10k per box
[16:41] <jtang> i think in hindsight 45disks in a machine was over kill
[16:41] <mikedawson> infernix: Dreamhost supposedly uses a separate partition on their spinning disks for the journal without any SSDs
[16:42] <drokita> Can somebody explain the relevance of the journal to overall performance of the Ceph cluster both in a healthy state and a rebuild state?
[16:42] <jtang> infernix: just remember if you have one failed box and your cluster is full you will need to shunt ~100tb between machines
[16:42] <jtang> when you recover or fail a machine it could be painful
[16:43] <jtang> from our experiences at our site, we'd prefer to have more than 2-3 machines with probably about 30-50tbs per node
[16:43] <infernix> [ 3] 0.0-10.0 sec 16.7 GBytes 14.3 Gbits/sec
[16:43] <infernix> IPoIB
[16:43] <mikedawson> jtang: that's why I'm using 1SSD and 3 SATA per server, with lots of dual 1Gbps NIC servers
[16:43] <infernix> so that's on centos 5, out of the box
[16:43] <jtang> we're at ~65tbs per node right now, and its just plain slow to recover from a broken osd
[16:44] <infernix> at 1gbit, i'd assume so. but my network is a lot faster
[16:44] <jtang> mikedawson: heh, we're probably gonna go down the route of ~12-24disks per node depending on what we can get from the likes of dell
[16:44] <infernix> so i don't know how that changes things
[16:44] <jmlowe> drokita: a journal is a blocker, nothing can happen before it synchronously hits the journal, I don't think healthy vs recovery has much to do with it
[16:45] <jtang> infernix: yea i suppose that might help, have you benchmarked the aggregate read and write performance of each node with 36disks ?
[16:46] <drokita> jmlowe: So then what I am hearing, journals implemented on SSD drives help make that blocking activity faster and therefore, the overall performance could improve?
[16:47] <jmlowe> drokita: as I understand it, yes
[16:47] <drokita> Are the journals for the monitors or OSDs
[16:47] <infernix> jtang: i'm putting together a quote
[16:47] <infernix> still in the process of figuring out what to buy
[16:48] * boll (~boll@00012a62.user.oftc.net) Quit (Quit: boll)
[16:48] <infernix> my goal is 2.5gbyte/s sequential reads
[16:48] <via> does anyone know how to recreate a monitor when you only have one? i'd like to think i didn't just wipe out the whole cluster
[16:48] <jmlowe> drokita: only osd's, to be confusing the underlying filesystems have journals which you could also sepparate and ceph has a journal
[16:49] <infernix> i need to basically do a dd if=ceph of=/stupid/fast/ssd/storage bs=4M
[16:49] <jtang> infernix: ah okay, well i can tell you now not to buy backblaze pods for ceph :)
[16:49] <jtang> they just suck
[16:49] <infernix> and that needs to write with 2.5gb/s
[16:49] * boll (~boll@00012a62.user.oftc.net) has joined #ceph
[16:50] <drokita> So, our OSDs are implemented on XFS. I was aware that XFS had a journal, but I thought it was inherent in the FS itself and not a separate entity.
[16:50] <jmlowe> drokita: mon's keep a small amount of data on disk and don't do much io, osd's do most of the io and could really benefit from separate journals, mds keeps data in osd's and will benefit from faster osd's
[16:50] <drokita> interesting
[16:50] <jtang> right i have a meeting
[16:50] <jtang> bye all
[16:51] <jmlowe> drokita: I believe there is a way with some filesystems to keep the journal on a separate device
[16:51] <drokita> Yeah, I have done that before with the likes of ZFS.
[16:52] <drokita> Not sure if it could be done on say XFS or EXT4. Either way, I should look into it.
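For the record, both XFS and ext4 do support an external journal/log device; a sketch (device names and sizes are placeholders, and these commands are destructive to the named devices):

```shell
# XFS: put the filesystem log on a separate (e.g. SSD) partition.
mkfs.xfs -l logdev=/dev/sdb1,size=128m /dev/sdc1
mount -o logdev=/dev/sdb1 /dev/sdc1 /var/lib/ceph/osd/ceph-0

# ext4: create an external journal device, then a filesystem that
# uses it.
mke2fs -O journal_dev /dev/sdb2
mkfs.ext4 -J device=/dev/sdb2 /dev/sdc2
```

Note this is the *filesystem* journal; the Ceph OSD journal discussed above is a separate thing on top of it.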
[16:52] * synfin (~rahl@cv-nat-A-128.cv.nrao.edu) has joined #ceph
[16:53] <infernix> oh, look what i found. http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[16:54] * low (~low@ Quit (Quit: bbl)
[16:55] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[16:56] <jmlowe> drokita: as I understand it with ceph performance, the separate journal referred to in the docs is just pointing the journal directory to a faster mounted filesystem
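In ceph.conf terms, the OSD journal location is just a path, so pointing it at a faster device or partition is a one-line change (a sketch; the path and size are examples):

```ini
[osd.0]
    ; journal on a raw SSD partition (no filesystem needed),
    ; or a file on a faster mounted filesystem
    osd journal = /dev/ssd0p1
    ; journal size in MB, used when the journal is a file
    osd journal size = 10240
```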
[16:57] <mikedawson> infernix: Second article focuses on write throughput without SSDs http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[16:57] <jmlowe> drokita: nhm is the resident benchmarker, you should verify with him
[16:57] * joshd1 (~jdurgin@2602:306:c5db:310:44b1:30ad:f0a1:af97) has joined #ceph
[16:58] <mikedawson> joshd1: after upgrading to 0.55, "service ceph start" fails "/etc/init.d/ceph start" works. Is this a known issue?
[17:00] <joshd1> not that I'm aware of
[17:01] <mikedawson> joshd1: thanks. fmarchand and I have the same issue.
[17:02] * nhorman_ (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[17:02] * nhorman_ (~nhorman@hmsreliant.think-freely.org) Quit ()
[17:03] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) has joined #ceph
[17:04] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[17:13] <drokita> jmlowe: THanks for the input!
[17:14] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:16] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:20] * jlogan1 (~Thunderbi@2600:c00:3010:1:7db5:bf2b:27d1:c794) has joined #ceph
[17:21] * boll (~boll@00012a62.user.oftc.net) Quit (Quit: boll)
[17:29] * roald (~Roald@ has joined #ceph
[17:40] * LeaChim (~LeaChim@b0fa9cd5.bb.sky.com) has joined #ceph
[17:42] <LeaChim> cephfs error: Linux 3.2.0-27-generic #43-Ubuntu SMP, http://pastebin.com/zy7CWMb4
[17:46] * absynth (~absynth@irc.absynth.de) has joined #ceph
[17:46] <absynth> hey
[17:46] <absynth> any inktank guy online?
[17:46] <absynth> we have "-6000 degraded" on our cluster
[17:46] <absynth> what the hell is that?
[17:49] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) has joined #ceph
[17:51] * boll (~boll@00012a62.user.oftc.net) has joined #ceph
[17:51] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[17:53] * Leseb (~Leseb@ Quit (Quit: Leseb)
[17:54] <synfin> The provided ceph RPM's for RH6 (http://ceph.com/rpms/el6/x86_64/) do not provide the 'ceph' module for mounting the filesystem. Any idea where I can get the module for this? I attempted building ceph-0.55 but I did not see any ceph.ko file in the resulting RPMs.
[17:56] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:00] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[18:04] * boll (~boll@00012a62.user.oftc.net) Quit (Quit: boll)
[18:05] <LeaChim> And now the monitor's crashed.. (ceph0.55) http://pastebin.com/nfP9JnBR
[18:06] * tnt (~tnt@207.171-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:07] * drokita (~drokita@ Quit (Quit: Leaving.)
[18:07] <noob2> congrats on the .55 release guys :)
[18:07] <noob2> synfin: redhat 6's kernel is too old to support rbd
[18:08] * drokita (~drokita@ has joined #ceph
[18:08] <synfin> noob2: Any supported distribution that does?
[18:09] <synfin> noob2: eg, ubuntu or whatnot?
[18:10] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:10] <noob2> ubuntu 12.04+
[18:16] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:22] <janos> is there any sort of timeline for when .55 will make it to the fedora rpm's?
[18:22] <janos> well, fedora repo
[18:22] <rweeks> glowell is probably the person to ask
[18:23] <janos> cool. i'm just now investigating using ceph and i see that .55 just came out, so i'd prefer to at least start there
[18:23] <rweeks> I know someone else in here was maintaining RPMs also but I don't have the scrollback to it
[18:24] <janos> no problem
[18:24] <janos> i'm not ina huge rush
[18:24] <janos> i'll be dogfooding it at home for quite a while before i even bother with work
[18:24] <janos> looking for general flexible file storage in additon to backing devices for kvm
[18:25] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:25] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[18:26] * LeaChim (~LeaChim@b0fa9cd5.bb.sky.com) Quit (Ping timeout: 480 seconds)
[18:29] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:37] <noob2> anyone know which package includes the python rbd library for ubuntu?
[18:39] * xiaoxi (~xiaoxiche@ Quit (Ping timeout: 480 seconds)
[18:41] * BManojlovic (~steki@242-174-222-85.adsl.verat.net) has joined #ceph
[18:41] * jbarbee (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[18:46] * drokita (~drokita@ Quit (Quit: Leaving.)
[18:46] * drokita (~drokita@ has joined #ceph
[18:47] <joshd1> python-ceph
[18:48] <glowell> rweeks: The topic came up a while back about working more closely with the fedora maintainers. Currently they have their own timeline for ceph integration.
[18:48] <rweeks> gotcha
[18:48] <rweeks> someone in here yesterday had posted a link to RPMs he maintains, though
[18:49] <noob2> thanks joshd1
[18:50] * loicd (~loic@jem75-2-82-233-234-24.fbx.proxad.net) has joined #ceph
[18:50] <infernix> "When the journal is enabled, all writes are written both to the journal and to the file system". so as i understand this, if I have 10 OSD disks in a box, and I write 10x4MB, there will be 10x4MB writes to the disk and 10x4MB writes to the journal?
[18:50] <infernix> or is only metadata written out to the journal?
[18:51] <wer> Twice now I have made an osd get 100% full :) The first time the journal was on there, so I could stop it and move the journal, and manage to recover the osd. But the second time it happened, I didn't have that canary in the coal mine (so to speak) and could not get the osd to start, even though I added an additional empty one. So I am curious what to do in the case of a completely full osd that will not start. Any suggestions?
[18:51] <infernix> also, what happens if the journal disk dies?
[18:52] <jtang> synfin: i know its not a great solution, we've been testing with the elrepo's releases of the mainline kernel which does have a recent rbd and cephfs modules for EL6
[18:52] <jtang> elrepo aren't as "reliable" or "trusted" as the upstream packaged kernels
[18:52] * drokita1 (~drokita@ has joined #ceph
[18:53] <wer> infernix: I think I am going to keep my journals on the osd disk.... I don't require fast access times though. But yeah... the journals.
[18:53] <jtang> i'd still prefer if proper backports were made for EL6 so our other binary blobs on our systems can still function
[18:54] <jmlowe> infernix: I believe the osd is trashed if the journal dies
[18:54] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[18:55] <jmlowe> infernix: replace journal device, save osd keyring/monmap, format with saved files, restart osd, wait for stuff to come back up to full replication
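[Editor's note: jmlowe's recovery outline, expanded as hedged pseudocode. The OSD id and the exact commands/flags are illustrative, not a verified procedure for any particular Ceph release.]

```
# osd.N's journal device died; the journal contents are gone, so the
# OSD's data store can no longer be trusted and must be rebuilt.
1. stop osd.N and replace the failed journal device
2. save osd.N's keyring and the current monmap somewhere safe
3. re-initialize osd.N (mkfs) against the new journal, feeding back
   the saved keyring/monmap
4. restart osd.N
5. wait for backfill until all PGs are back to full replication
```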
[18:56] <rweeks> jtang, I don't think that will happen due to the oldness of the EL6 supported kernel
[18:56] <rweeks> too many things are missing there for ceph to work properly
[18:57] <wer> I think the only reason I got full was because in .48 I have to manually remove deleted files with radosgw. Is radosgw-admin temp remove still needed in 0.54?
[19:00] * chutzpah (~chutz@ has joined #ceph
[19:03] * flakrat (~flakrat@eng-bec264la.eng.uab.edu) has joined #ceph
[19:05] <synfin> jtang: Thanks for the update
[19:05] <infernix> jmlowe: umm, but then 1 ssd journal for X disks is a big risk increase
[19:05] <infernix> e.g. 1:6 = 1 journal disk dead, 6 osds lost. right?
[19:05] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[19:06] <synfin> jtang: I know the feeling with Red Hat being "old", I use it everyday. But I have little choice in the matter :-/
[19:06] <rweeks> yes infernix, that is definitely a risk increase for single point of failure for those OSDs
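[Editor's note: the arithmetic behind that risk is easy to make concrete. A small sketch; the 1:6 ratio and disk counts are just the numbers from the discussion above.]

```python
# One shared journal SSD serving several OSDs couples their failure:
# when the journal device dies, every OSD journaling to it is lost.

def osds_lost(total_osds, osds_per_journal, failed_journals=1):
    """OSDs taken down by losing `failed_journals` journal devices."""
    lost = failed_journals * osds_per_journal
    return min(lost, total_osds)

# infernix's 1:6 example: one dead SSD takes out 6 of 12 OSDs in a box.
print(osds_lost(total_osds=12, osds_per_journal=6))  # -> 6
```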
[19:07] * joshd1 (~jdurgin@2602:306:c5db:310:44b1:30ad:f0a1:af97) Quit (Quit: Leaving.)
[19:08] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[19:09] <absynth> jeez, i am getting too old for ceph
[19:09] <absynth> and i just missed josh by a minute, damnit
[19:10] * deepsa (~deepsa@ Quit (Remote host closed the connection)
[19:11] * mikedawson_ (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[19:11] <jmlowe> infernix: ask anybody who has run lustre about safety/performance tradeoffs
[19:11] * drokita1 (~drokita@ Quit (Quit: Leaving.)
[19:11] * deepsa (~deepsa@ has joined #ceph
[19:12] <jmlowe> infernix: lustre devs often choose performance over safety, ceph leaves it up to you
[19:16] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[19:16] * mikedawson_ is now known as mikedawson
[19:17] <synfin> infernix: I am using Lustre, but it is for high performance data reduction. I am here now in the Ceph channel because I'm trying to find something more reliable, and redundant for serving data for, say, VM's, webservers and the like. I like Lustre for performance, but I don't trust it for reliability.
[19:18] <jmlowe> all goes back to consistent, available, distributed: pick any two
[19:19] <via> sorry to sound like a broken record, but surely there must be a procedure for replacing a monitor that has failed if it is the only monitor. does anyone know?
[19:20] <absynth> err, why would you want only one monitor?
[19:20] <darkfaded> maybe so he can test how to sort this out
[19:20] <darkfaded> it would be the same if 3 mons failed
[19:20] <via> maybe because i don't have three physical machines
[19:20] <via> darkfaded: i don't believe it is
[19:20] <via> because i treated it as such, and nothing works
[19:21] <via> furthermore, what if you lose the monmap? are you just screwed?
[19:21] <via> in this case i did not, but it doesn't seem to do any good anyway
[19:21] <gregaf1> yep, you're just screwed
[19:21] <roald> via, without monmap you´re done
[19:21] <gregaf1> we could reconstruct it manually but there are no automated tools for doing so right now, and it'd take a while
[19:21] <via> seems pretty shitty i don't mind saying
[19:21] <via> that means its even more failure prone than the osd's
[19:22] <via> regardless, i do have the monmap because i backed it up
[19:22] <absynth> hence my earlier question.
[19:22] <roald> via, if you only implement one monitor, you´re basically creating a single point of failure
[19:22] <via> i'd like to think the answer is the same whether i want or don't want one mon
[19:22] <roald> if that mon crashes, you´re done... that´s why ceph advises you to at least implement 3
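[Editor's note: for reference, a minimal three-monitor layout in ceph.conf looks roughly like this; the hostnames and addresses are made up.]

```ini
[mon.a]
    host = mon-host-1
    mon addr =
[mon.b]
    host = mon-host-2
    mon addr =
[mon.c]
    host = mon-host-3
    mon addr =
```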
[19:22] <absynth> gregaf1: you are who i think you are, aren't you?
[19:23] <gregaf1> I guess?
[19:23] <via> roald: fwiw, no ceph documentation makes this clear that i have found
[19:23] <via> but anyway, i *do* have the monmap
[19:24] <via> and i'm pretty sure that the reason i'm having to replace the monitor would have happened with 3 anyway, since it was a bug in migrating from .54 to .55
[19:24] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[19:24] <gregaf1> the monmap is easy to rebuild; it's the OSDMap history that's important
[19:24] <roald> via, http://ceph.com/docs/master/rados/configuration/ceph-conf/#monitors
[19:25] <absynth> gregaf1: we saw -1234 pgs degraded on our 0.48.2 prod setup today
[19:25] <absynth> what the hell is that supposed to mean?
[19:25] <gregaf1> if you've got it failing asserts, that's different from having lost the monitor...
[19:25] <via> roald: "You may deploy Ceph with a single monitor, but if the instance fails, the lack of a monitor may interrupt data service availability."
[19:25] <gregaf1> absynth: I am not aware of any bugs that would cause that, how interesting!
[19:26] <via> "interrupt data service availability" != "lose all data permenantly"
[19:26] * gucki (~smuxi@80-218-32-162.dclient.hispeed.ch) has joined #ceph
[19:26] <gregaf1> what's the pg dump look like?
[19:26] <janos> i have an architectural/setup question (note, i'm researching, have not implemented ceph yet) = if i have some fairly robust host machines (24+cores) would it be stupid/insane/reasonable to put a monitor on each host machine - either directly on or in a small resource-dedicated VM?
[19:26] <gucki> good evening :
[19:26] <absynth> gregaf1: s/interesting/massive amount of expletives, uttered in a very loud voice/
[19:26] <roald> via, you´ve got a point there :-)
[19:26] <absynth> gregaf1: no idea, we didnt debug that deep, because we were busy saving our production vms
[19:26] * yasu` (~yasu`@dhcp-59-168.cse.ucsc.edu) has joined #ceph
[19:26] <via> but anyway, i tried mkfs'ing a new monitor, referencing the old monmap
[19:27] <absynth> basically, packetloss caused OSDs to die
[19:27] <absynth> we restarted them and one of the OSDs was stuck in some weird state
[19:27] <via> the new monitor's pgmap thinks there's only 192 pg's instead of the original 960, and recognizes no OSD's even with cephx turned off
[19:27] <absynth> after we found the osd in question and restarted it manually, it recovered
[19:27] <gucki> i saw 0.55 has just been released. would it be ok to have 1/3 monitor and 1/2 osd per host running on 0.55 and keep the others on argonaut? just in case 0.55 is not stable and crashes, the stable argonaut instances would keep the cluster alive?
[19:27] <jmlowe> I would say common sense would tell you that without functioning monitors you don't know what placement groups belong to what osd's so you don't know where your objects are http://ceph.com/docs/master/architecture/
[19:27] <gregaf1> janos: depends how many hosts, but yes, it's pretty unreasonable to create a monitor per host in a large cluster
[19:27] <gucki> is 0.56 (next stable) expected to be released this year? :)
[19:28] <janos> gregaf1: i was thinking 3 hosts to play with
[19:28] <via> jmlowe: was that directed at me?
[19:28] <absynth> gucki: 0.56 or a stable version? :D
[19:28] <gregaf1> janos: 3 is the default number of monitors we recommend :)
[19:28] <janos> is the unreasonable part - having a potentially large number of monitors - or putting those monitors on the hosts?
[19:28] <gregaf1> large number of monitors
[19:29] <janos> cool, easy enough to avoid!
[19:29] <gucki> absynth: shouldn't 0.56 be the next stable version? :)
[19:29] <janos> thank you
[19:29] <jmlowe> via: yes, I'm not sure where it explicitly says what a mon does, but I'm sure it's in there somewhere
[19:29] <absynth> gucki: sorry, i was being sarcastic
[19:29] <via> jmlowe: in fact, many things state that mon's store no data at all
[19:29] <gucki> absynth: i hope it doesn't get real. besides memory leaks, argonaut runs really fine here :)
[19:29] <gregaf1> yes, we expect .56 to be Bobtail; I'm not sure when it's coming out though — ordinarily the next month would be a good guess, but with Christmas in the way....
[19:30] <via> i don't know how further i can reduce this question to make it simpler. is there a way to replace a monitor?
[19:30] <via> when you only have one
[19:30] <absynth> gucki: in what kind of usecase?
[19:30] <jmlowe> via: there it is " Ceph clients contact a Ceph monitor and retrieve a copy of the cluster map. The CRUSH algorithm allows a client to compute where objects should be stored, and enables the client to contact the primary OSD to store or retrieve the objects."
[19:30] <via> i still have the old monitor's directory
[19:30] <gucki> absynth: i only use qemu rbd...
[19:30] <jmlowe> via: no mon no objects
[19:30] <via> jmlowe: so, yes, i have the old monmap and crush map
[19:30] <via> i set both
[19:30] <via> it does not work
[19:30] <gregaf1> via: no, there is not a way to replace it; they are supposed to be a permanent unbreakable store, which is why for a real cluster you should run three
[19:30] <gregaf1> if you still have all the data from it, it can probably be fixed
[19:30] <absynth> gucki: how many vms?
[19:31] <gregaf1> if your problem is that it's crashing on startup, you should ask about that, not about replacing the monitor
[19:31] <jmlowe> via: you are beyond my abilities to help you as I am only a simple user
[19:31] <gucki> absynth: mh, around 100...
[19:31] <via> gregaf1: i did, i spent all day yesterday in here to no avail
[19:31] <via> so since i still have the old monitor's crap, i figured it would be possible to just replace it
[19:31] <via> you all are telling me that even with everything the old monitor had, i am screwed?
[19:31] <gregaf1> did anybody talk to you about the monitor crashing?
[19:31] <gucki> gregaf1: what do you think of my idea of running half of a cluster on 0.55 and the other half on argonaut? upgrade is just restarting the daemons with the new version, right?
[19:32] <via> the monitor wasn't crashing, it was an authentication issue
[19:32] <gregaf1> gucki: we haven't tested it, but it *ought* to work
[19:32] <absynth> gucki: we talked about this in the company today. we think it is not a good idea
[19:32] <gucki> absynth: why not?
[19:32] <gucki> gregaf1: ok, then i'll better leave it as it is and wait for bobtail ;-)
[19:32] <absynth> we have had mixed-version environments and it always failed in some regard. and it makes debugging a lot harder
[19:33] <gregaf1> yeah, I know that's on QA's list of things to test on bobtail before release ;)
[19:33] <absynth> and if an inktank guy says "we haven't tested it", you can safely discard the rest of his statement. *ducks*
[19:33] <gucki> absynth: ah ok, so you mean the mixed setup..but you are not against using qemu rbd ... :)
[19:33] <absynth> nah, we are using that too
[19:33] <gucki> absynth: which version are you running?
[19:34] <absynth> argonaut
[19:34] <via> so there is no way to back up a monitor's state? so that if all three died due to some ceph bug, the entire cluster would irrecoverably lose all data? i cannot believe there's not a way to recover this with something i have backed up
[19:34] <gregaf1> via: you have a bug in auth I guess; maybe Yehuda can continue that conversation with you, but replacing the monitor is not the solution
[19:34] <gucki> absynth: do you also have problems with memory leaks?
[19:34] <gregaf1> and he says he can't help you any more if you can't produce logs for him when he asks for them
[19:34] <absynth> not so much. we have problems with unresponsiveness and crashing OSDs
[19:34] * yasu`_ (~yasu`@soenat3.cse.ucsc.edu) has joined #ceph
[19:35] <via> gregaf1: i was under the impression that i had produced all logs asked for, possibly evidenced by the 10ish pastebin's i posted yesterday
[19:35] <gregaf1> that's if you still have the original monitor's data store; if not, no, you're just screwed, like I said
[19:35] <via> gregaf1: i do
[19:35] <via> sorry, i thought i'd made that clear
[19:36] <via> i have a complete copy of the original monitor's data store
[19:36] <gucki> absynth: ah ok. i only had 1-2 osd crashes and 1 mon crash...but the memory leaks really suck. after a few days, the "main" monitor uses several hundred mb ram...some osds use 1-2 gig ram after a few days...and even some kvm processes use 2.5 gb ram, even though memory assigned is only 1 gb ;)
[19:36] <gregaf1> well, the two of you will need to hash that out; I've got a meeting coming up and some work to do
[19:36] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[19:36] <absynth> no, i cannot say we have seen that
[19:36] <gucki> absynth: but i'm happy that restarting the mons and osds works very well .. :)
[19:36] <absynth> but everything else. :D
[19:36] <yehudasa> via: so if you have the old mon data, just copy it back to the original location and restart the mon
[19:37] <absynth> try having a couple OSDs crash (like, say, 4 of 20) and then see what happens. its a nightmare
[19:37] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:38] <via> yehudasa: while i can do that, is there seriously not just like two or three things i can draw from that folder to rebuild it? i see now that the crushmap and monmap is not enough, but what would i have to back up if i wanted to be safe against losing a monitor? the whole data store?
[19:38] <gucki> absynth: once i had two failing servers at once..then everything blocked. but once one of the two servers was back, everything continued just fine :)
[19:38] <yehudasa> the whole mon data directory
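[Editor's note: a minimal sketch of that backup. The monitor data path shown is the usual default but may differ per install, and the mon must be stopped while copying so the on-disk store is quiescent.]

```python
import os
import shutil

def backup_mon_store(mon_data_dir, backup_base):
    """Archive the whole mon data directory (run only while ceph-mon
    is stopped, so the on-disk store is consistent). Returns the
    path of the created .tar.gz archive."""
    os.makedirs(backup_base, exist_ok=True)
    archive = os.path.join(backup_base, "mon-backup")
    # produces mon-backup.tar.gz containing everything under mon_data_dir
    return shutil.make_archive(archive, "gztar", root_dir=mon_data_dir)

# e.g. backup_mon_store("/var/lib/ceph/mon/ceph-a", "/root/mon-backups")
```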
[19:38] <via> ok
[19:39] <gucki> absynth: what i'm not really happy with is the read performance...but i hope that it'll be much better with 0.56, as i read about many optimizations etc..
[19:40] <absynth> what network do you have between the nodes?
[19:40] <via> yehudasa: okay, its rebuilding
[19:40] <gucki> absynth: gbit
[19:40] <via> but i'm still where i was yesterday where i cannot do anything without disabling cephx completely
[19:40] <gucki> absynth: network is not the bottleneck, nor the disks nor the cpu...
[19:41] <gucki> absynth: it just seems like someone forgot some sleep calls in the code ;()
[19:41] <via> yehudasa: i pasted you a link last night with more mon log output, did you see that or should i find it?
[19:41] <yehudasa> via: so start with disabling cephx completely, make sure everything works and then we can try again figuring out what's going on
[19:41] * yasu` (~yasu`@dhcp-59-168.cse.ucsc.edu) Quit (Ping timeout: 480 seconds)
[19:42] <via> right now cephx is disabled, and the cluster is more or less working. there are some stale+active+clean pg's due to an osd failure a few days ago, but thats about it. all osd's are in
[19:42] <via> some are still peering
[19:44] <yehudasa> via: does ceph auth list work?
[19:45] <yehudasa> via: specifically, are there entries for the osds there?
[19:45] <via> yes
[19:45] <via> and they match the keyring files for each one
[19:46] <yehudasa> via: so now you can shut off your osds, mon, enable cephx in your ceph.conf, start mon
[19:47] <yehudasa> and try to run ceph -s
[19:47] <via> ok
[19:49] <via> yehudasa: https://pastee.org/eg3vf
[19:50] * via double checks keyring for admin
[19:50] <via> yeah, they match
[19:52] <yehudasa> via: mon and client don't agree on the protocol
[19:52] <via> anything i can get that would help explain why?
[19:52] <yehudasa> via: one of them doesn't use cephx
[19:53] <yehudasa> mon log?
[19:53] <yehudasa> can you paste your ceph.conf again?
[19:54] <via> yehudasa: https://pastee.org/62ksm
[19:56] <via> https://pastee.org/7mtyt
[19:57] <gucki> i just installed 0.55, but now the init scripts seem to be broken... ?
[19:57] <gucki> "service ceph stop mon" -> "stop: Env must be KEY=VALUE pairs"
[19:57] <gucki> "service ceph stop" does nothing..
[19:57] <gucki> the old ceph instances keep running...
[19:59] <via> tonight parts for my 3rd and 4th boxes will arrive so i will be able to run 3 monitors
[19:59] <gucki> "/etc/init.d/ceph start" works...
[19:59] * roald (~Roald@ Quit (Read error: Connection reset by peer)
[20:00] <gucki> shall i file a bug report?
[20:00] * roald (~Roald@ has joined #ceph
[20:06] <yehudasa> via: can you reproduce it with also 'debug ms = 20'?
[20:06] <via> mon log?
[20:10] <via> https://pastee.org/qsm3c
[20:10] <gregaf1> gucki: I think that bug's on the mailing list, but a bug in the tracker would be welcome!
[20:10] <gregaf1> something to do with upstart jobs taking over from init
[20:12] <via> yehudasa: and ceph -s outputs: https://pastee.org/k3qvb
[20:12] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: ircolle)
[20:13] <flakrat> When are CentOS rpms planned for Ceph?
[20:13] * drokita (~drokita@ has joined #ceph
[20:14] <via> flakrat: the repo they have is lagging behind a bit, but i've been rolling my own into this repo: http://mirror.ece.vt.edu/pub/ceph-dev/
[20:15] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[20:16] <sjustlaptop> gregaf1, joshd: anyone want to do a quick review for wip_query_deleted_pool?
[20:16] <yehudasa> via: sorry about that, can you add 'debug mon = 20'?
[20:16] <via> sure
[20:16] <sjustlaptop> handle_pg_query needs to ignore requests about deleted pools
[20:16] <via> its a lot easier now that i'm using pastee's api instead of trying to copy/paste everything
[20:17] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) Quit ()
[20:17] <via> yehudasa: https://pastee.org/vtvj9
[20:19] * gregaf1 (~Adium@2607:f298:a:607:54b4:3d88:b69b:826e) Quit (Quit: Leaving.)
[20:20] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[20:21] * gregaf (~Adium@ has joined #ceph
[20:22] <joshd> sjustlaptop: are any other handle_* methods missing that check?
[20:22] <sjustlaptop> no, the others go through get_or_create_pg
[20:24] <joshd> seems fine to me then
[20:24] <sjustlaptop> k
[20:24] <sjustlaptop> pushing to next
[20:25] <sjustlaptop> ah... one sec
[20:25] <gregaf> sjustlaptop: this is a problem now because of the per-PG epochs?
[20:27] <sjustlaptop> no, it's always been a bug, but pg_to_up_acting_etc just returns instead of crashing when asked about a non-existent pg so project_pg_history accidentally returns the right thing
[20:27] <sjustlaptop> it becomes a problem if you ask about the pg_num
[20:27] <sjustlaptop> in that case it crashes
[20:28] <sjustlaptop> ok, I repushed a version that actually builds
[20:29] <sjustlaptop> it becomes a bug with the new split code since project_pg_history has to check for splits since they cause a new interval
[20:30] * dmick (~dmick@2607:f298:a:607:c8bf:e8a1:2154:e15f) has joined #ceph
[20:30] <sjustlaptop> ugh, and that too is wrong
[20:30] <sjustlaptop> one sec
[20:31] <sjustlaptop> ok, *that* one should be ok
[20:31] <yehudasa> via: not sure what's going on, can you try specifying 'auth service required = cephx' explicitly and restart mon?
[20:34] <joshd> sjustlaptop: it'd be clearer if it's above the pgmap lookup and queue
[20:35] <sjustlaptop> it's logically part of the handling of a query on a pg not in the pg_map
[20:35] <sjustlaptop> if the pg is still in the pg_map, this check is unnecessary
[20:36] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: ircolle)
[20:36] <joshd> if the pool's already deleted, it just seems like it causes a bit of unnecessary work
[20:36] <gucki> gregaf: ok it's here now: http://tracker.newdream.net/issues/3576
[20:37] <gucki> gregaf: i mistyped the title but cannot change it (or i just can't find the link)
[20:37] <sjustlaptop> well, to put it another way, I could assert(osdmap->have_pg_pool(pgid.pool())) in the if (pg_map) branch
[20:37] <gucki> gregaf: would you recommend staying with argonaut or upgrading to 0.55?
[20:37] <gregaf> stick with argonaut for now
[20:37] <sjustlaptop> handle_osd_map will atomically remove the pg from the pg_map along with updating osdmap
[20:38] <yehudasa> via: also set 'auth client required = cephx' explicitly
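[Editor's note: the explicit cephx settings yehudasa is suggesting live in the [global] section of ceph.conf; the cluster-side option is included here for completeness.]

```ini
[global]
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx
```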
[20:38] <sjustlaptop> to put it yet another way, this sort of check should be left to the pg except for when the pg doesn't exist and the OSD needs to synthesize an answer
[20:40] <joshd> ok, that makes sense
[20:40] <sjustlaptop> ok for me to push to next?
[20:40] <joshd> yeah
[20:40] <sjustlaptop> cool
[20:48] * ircolle (~ian@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[20:49] <via> yehudasa: okay, one minute
[20:50] <via> yehudasa: i think ceph -s worked now, its hard to tell with all the log output
[20:51] <via> https://pastee.org/kun3p
[20:52] <infernix> hmm ok, so if journal ssds are critical for performance and for reliability, raid 1 ssd journals seem to make sense.
[20:53] <infernix> a 10GB journal per OSD works for any size? or is it 10GB per 1TB of OSD?
[20:54] <infernix> also if i'm planning to write 10 to 40TB to it daily, those SSDs are going to get a beating
[20:54] <infernix> mlc is out of the question
[20:54] <infernix> maybe 15k SAS for journals then
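[Editor's note: on infernix's sizing question, the journal scales with expected throughput rather than OSD capacity. The guideline in the Ceph docs is journal size = 2 * (expected throughput * filestore max sync interval); a quick sketch of that arithmetic:]

```python
def journal_size_mb(throughput_mb_s, sync_interval_s=5):
    """Rule-of-thumb journal size: twice the data that can arrive
    between filestore syncs (filestore max sync interval defaults
    to 5 seconds)."""
    return 2 * throughput_mb_s * sync_interval_s

# A spinning disk doing ~100 MB/s only needs ~1 GB of journal,
# so a fixed 10 GB journal comfortably covers any OSD size.
print(journal_size_mb(100))  # -> 1000
```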
[20:57] <tontsa> raid-1 of different vendor SSDs also works if you have a good deal so you can get replacements if they die under 3 years or so :)
[20:58] <tontsa> same vendor SSDs tend to die exactly or very close to the same time so avoid for that reason
[20:58] <via> yehudasa: yeah, looks like auth is all working, thank you so much for helping get this working
[20:58] <infernix> well SLC is an option if i only need 10GB per OSD, regardless of OSD size
[20:58] <infernix> 64GB SLC is affordable
[20:58] <infernix> and fast
[20:58] <via> i technically have a little issue with being unable to get these 48 pg's to be unstale, but i can work on that for a bit
[21:00] <nhm> infernix: SSD journals help with write performance if you get fast journals, but for read performance you'll actually loose out a bit since you'll have fewer data disks in the same chassis.
[21:00] <nhm> s/loose/lose
[21:01] <joao> is anyone able to access the tracker?
[21:01] <nhm> infernix: I'm very interested in testing the new Intel S3700 100GB drives for journals.
[21:01] <infernix> nhm: i did some math on building with just those
[21:02] <elder> joao, I can access the tracker right now.
[21:02] * jbarbee (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.89 [Firefox 16.0.2/20121024073032])
[21:02] <infernix> but it's not viable for our setup
[21:04] <nhm> infernix: I think the two strategies that make the most sense are either a controller with cache and all spinning disks, or cheap SAS controllers and SSDs for journals.
[21:04] <nhm> each has benefits and drawbacks.
[21:04] * jluis (~JL@ has joined #ceph
[21:05] <infernix> or pcie ssd
[21:05] <nhm> that too
[21:05] * yasu`_ (~yasu`@soenat3.cse.ucsc.edu) Quit (Remote host closed the connection)
[21:05] * LeaChim (~LeaChim@b0fac111.bb.sky.com) has joined #ceph
[21:06] <nhm> probably will be more interesting as it gets more common/cheaper.
[21:09] * yasu` (~yasu`@dhcp-59-168.cse.ucsc.edu) has joined #ceph
[21:10] <infernix> but the problem is that I already have stupid fast ssd storage
[21:10] * joao (~JL@ Quit (Ping timeout: 480 seconds)
[21:10] <infernix> so adding ssd to the system that's supposed to back that up is kind of meh
[21:10] <infernix> and i have no problem throwing many spindles at this problem
[21:10] <nhm> infernix: yeah, it sounds like you want read performance anyway. I'd just stick with spinning disks.
[21:10] <infernix> say 100+
[21:11] <infernix> well read is paramount, write shouldn't suck but can be fine at 500mbyte/s
[21:12] * loicd (~loic@jem75-2-82-233-234-24.fbx.proxad.net) Quit (Quit: Leaving.)
[21:12] <infernix> i can't really find any benchmarks with that many spindles
[21:13] <nhm> infernix: yeah, not really any out there that I know of. I was going to do a setup like that on our Dell cluster, but we've had a lot of performance issues on the nodes.
[21:13] <via> yehudasa: my mds is stuck in replay, along with my 48 stale+active+clean, and producing a neverending log listing like this: https://pastee.org/xg22j
[21:14] <infernix> i have a set of 3 hps with a bunch of sas drives that i could reinstall with cobbler
[21:14] * infernix checks
[21:15] <nhm> infernix: so far the best results I've gotten have been from our supermicro box.
[21:16] <nhm> Wish I had another 10 of them or so. ;)
[21:16] * Cube (~Cube@ has joined #ceph
[21:17] <infernix> i will order a bunch as soon as i figured this all out
[21:17] <nhm> And more power running into my basement. ;)
[21:17] <elder> Don't you mean the performance lab?
[21:17] <infernix> so if i'm not using ssd, why bother with journals?
[21:17] <nhm> elder: er yeah!
[21:17] <infernix> writing the same data twice to the same disk seems pointless
[21:19] <infernix> nhm: you wrote up those benchmarks, right?
[21:20] * guigouz (~guigouz@ Quit (Quit: Computer has gone to sleep.)
[21:20] <nhm> infernix: I did
[21:22] * madkiss (~madkiss@ has joined #ceph
[21:22] <madkiss> cheers!
[21:22] <madkiss> has anyone ever tested doing Ceph with four raspberry pis?
[21:23] <elder> Love to, but I'm not aware of anyone who has.
[21:23] <nhm> madkiss: joao was just talking about wanting to do that. :D
[21:23] <madkiss> i've just seen http://blogs.linbit.com/p/406/raspberry-tau-cluster/
[21:23] <elder> We really need to port it to android and distribute it across phones.
[21:23] <madkiss> and I think we can beat that with something incredibly cooler
[21:23] <nhm> madkiss: memory might be an issue. CPU will get bogged down doing crc32c.
[21:24] <lurbs> elder: Write a worm to deploy it. Those things never get patched.
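[Editor's note: the crc32c nhm mentions is the Castagnoli CRC used for Ceph's data checksums. A bit-at-a-time implementation (deliberately slow and illustrative) shows the per-byte work a CPU without a hardware crc32 instruction has to do:]

```python
def crc32c(data, crc=0):
    """Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial
    0x82F63B78. Real implementations use lookup tables or the SSE4.2
    crc32 instruction; per-bit loops like this are what bog down a
    small ARM core."""
    crc = ~crc & 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return ~crc & 0xFFFFFFFF

# Standard CRC-32C check value:
print(hex(crc32c(b"123456789")))  # -> 0xe3069283
```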
[21:27] <elder> madkiss, it should be called the Raspberry Omega, since the capital Omega looks a bit like our ceph logo.
[21:27] <madkiss> hehe
[21:28] <dmick> ...."the kind you find in a secondhand store..."
[21:31] <elder> Are you suggesting our octopus should wear a beret of some kind?
[21:31] <jluis> that would totally make it more hip
[21:31] * jluis is now known as joao
[21:33] <joao> <elder> We really need to port it to android and distribute it across phones. <- I bet it would run better in most phones than on the pi :p
[21:33] <elder> True.
[21:34] <joao> I'm aiming at cross-compiling ceph this weekend and try it out on a friend's raspberry-pi just to see if it runs
[21:35] <rturk> ha! synchronous replication might be a problem when one of your OSDs is driving in a tunnel
[21:36] * jrisch (~jrisch@4505ds2-hi.0.fullrate.dk) Quit (Quit: jrisch)
[21:38] <joao> rturk, if done in large scale, one osd in a tunnel won't be much of a concern :)
[21:39] <rturk> ya, but what if every OSD loses network 25 times a day?
[21:40] <joao> then you must have bad mobile service and you should change your service provider :p
[21:42] <joao> I'd be more concerned on how to create a crushmap to reflect the osd's mobility
[21:42] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:00] * Lea (~LeaChim@b0fac111.bb.sky.com) has joined #ceph
[22:00] * LeaChim (~LeaChim@b0fac111.bb.sky.com) Quit (Read error: Connection reset by peer)
[22:05] <elder> joao, that's a very interesting observation. We'd need dynamically changing maps somehow to reflect the cell tower(s) that can access each phone.
[22:07] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:07] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:07] * Leseb_ is now known as Leseb
[22:10] <rweeks> I was hoping to have 3 ARM baby boards and 3 USB drives at LISA
[22:10] <rweeks> to show off the world's smallest Ceph cluster
[22:10] <rweeks> but I ran out of time
[22:10] <rturk> I have a pi at home, maybe I'll start building ceph on it
[22:11] <rweeks> I think the current models will have memory problems
[22:11] <rweeks> I was looking at other ARM boards that have 1gb of ram
[22:11] <joao> rturk, I wonder how long that would take
[22:11] <rturk> what, to build it?
[22:11] <rturk> hm
[22:12] <joao> yeah
[22:12] <rweeks> these guys are 49 bucks
[22:12] <rweeks> http://www.viaembedded.com/en/products/boards/1930/1/VAB-800.html
[22:12] <rweeks> we had a partner already compile ceph for ARM
[22:12] <rweeks> at the SC12 show
[22:12] <joao> if you're going to build it on the pi, it should take a long time
[22:12] <rweeks> on the Calxeda architecture
[22:12] <rturk> I am running debian on it
[22:12] <rweeks> System Fabric Works
[22:12] <rturk> do we build packages for debian-arm?
[22:12] <joao> building ceph on my dual core is already slow as hell
[22:12] <elder> Who do I ask about osd map updates?
[22:13] <joao> elder, o/
[22:13] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:13] <elder> OK.
[22:13] <elder> Will an update include an incremental map *or* a complete map?
[22:14] * dmick wonders if there's a qemu-arm
[22:14] <dmick> indeed there is
[22:15] <joao> elder, pretty sure that most times it will be incrementals
[22:15] <joao> let me check
[22:16] <elder> Looks like we'll get: 16-byte fsid; 32-bit nr_maps; then nr_maps * (incremental map), then 32-bit nr_maps, then nr_maps * (full map)
[22:16] <joao> elder, I think you might get a full map as well
[22:19] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[22:19] <elder> And it looks like if I successfully decode the incrementals I can skip the full ones.
[22:21] <joao> elder, as far as I can see, the monitor will always send either one or the other, solely based on whether the other side already has a map or not
[22:22] <joao> but on the osd side, it first attempts to decode the latest map available on the MOSDMap message, and then apply any incrementals (with epoch > than the full map's) on top of it
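[Editor's note: a toy parser for the envelope elder describes. Each map entry is modeled here as a u32 epoch plus a length-prefixed blob, and Ceph encodes little-endian; the real payload encoding of OSDMaps is far richer, so treat this purely as a shape sketch.]

```python
import struct

def parse_osdmap_msg(buf):
    """Toy parse of the MOSDMap envelope described above: a 16-byte
    fsid, then a u32-counted set of incremental maps, then a
    u32-counted set of full maps. Each map is modeled as
    (u32 epoch, u32 length, payload bytes)."""
    off = 0
    fsid = buf[off:off + 16]; off += 16
    sections = []
    for _ in range(2):  # incrementals first, then full maps
        (nr_maps,) = struct.unpack_from("<I", buf, off); off += 4
        maps = {}
        for _ in range(nr_maps):
            epoch, length = struct.unpack_from("<II", buf, off); off += 8
            maps[epoch] = buf[off:off + length]; off += length
        sections.append(maps)
    incremental, full = sections
    return fsid, incremental, full

# Round-trip a fabricated message: one incremental (epoch 7), no full maps.
msg = (b"\x00" * 16 + struct.pack("<I", 1)
       + struct.pack("<II", 7, 3) + b"abc" + struct.pack("<I", 0))
fsid, inc, full = parse_osdmap_msg(msg)
print(list(inc), full)  # -> [7] {}
```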
[22:23] <jmlowe> I'm getting some ceph auth errors I don't think I should have
[22:24] <jmlowe> cephx: verify_authorizer could not get service secret for service osd secret_id=0
[22:24] <jmlowe> auth: could not find secret_id=0
[22:24] <elder> OK. Thanks joao
[22:26] <jmlowe> libceph: osd4 xxx.xxx.xxx.xxx:6821 connect authorization failure
[22:26] <jmlowe> any idea what's going on there?
[22:28] * fmarchand (~fmarchand@ has joined #ceph
[22:28] <fmarchand> hi !
[22:29] <yehudasa> jmlowe: does that happen mid-run, or is it just when you start the client?
[22:29] <yehudasa> fmarchand: hi!
[22:29] <fmarchand> my mds process crashes ... can someone help me?
[22:29] <fmarchand> yehuda : hi !
[22:29] <gregaf> fmarchand: do you have a backtrace from the log?
[22:30] <fmarchand> yes
[22:30] <jmlowe> kernel rbd client, seems to have mounted ok and I can do i/o but my logs have that periodically
[22:30] <gregaf> can you pastebin it?
[22:31] <fmarchand> yehuda : it crashes after many socket is
[22:31] <jmlowe> maybe about every 15 minutes
[22:32] <fmarchand> gregaf : hi :)
[22:32] <jmlowe> only seems to affect one osd
[22:32] <fmarchand> gregaf : when it does this ... what does it mean ?
[22:33] <gregaf> I have no idea
[22:33] <gregaf> if you can pastebin the backtrace I can tell you if there's any hope of a quick resolution, and if we need a new bug entry for it or not
[22:33] <gregaf> but I'm afraid that's about all we can do today
[22:33] <fmarchand> gregaf : oki I'm gonna try to pastebin it
[22:34] <yehudasa> jmlowe: besides that message, do you see anything happens?
[22:34] <gregaf> brb, gotta restart my box
[22:34] * gregaf (~Adium@ Quit (Quit: Leaving.)
[22:34] <jmlowe> I have no symptoms other than messages in the logs afaik
[22:36] <elder> jmlowe, I have seen those too, and I too seem to have no adverse symptoms, but I don't have any more info for you.
[22:36] <yehudasa> jmlowe: basically it looks like a client uses an old ticket
[22:36] <jmlowe> huh
[22:36] <yehudasa> jmlowe: then when it fails it acquires a new one and that fixes it
[22:37] <jmlowe> so there is an old ticket stuck in there somewhere?
[22:37] <dmick> yeah, just shove a butter knife into the slot and dislodge it :)
[22:37] <yehudasa> jmlowe: clients hold the tickets until they expire
[22:37] <jmlowe> would rmmod get rid of it?
[22:37] <yehudasa> dmick: that's my job definition
[22:38] <noob2> for the default crush rules, if i tell it about hosts/osd's will it know to not store an object twice on the same host? i couldn't really verify in the docs
[22:38] <yehudasa> jmlowe: probably
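The workaround yehudasa and jmlowe settle on above, as a hedged sketch: unloading the kernel client modules discards client state, including any cached cephx tickets, which are then re-acquired on the next map. The pool/image name is illustrative, and this assumes no RBD device is still in use.

```shell
# Check what is mapped; rmmod will fail while a device is in use.
rbd showmapped
rbd unmap /dev/rbd0        # repeat for each mapped device

# Removing the modules drops cached client state (cephx tickets included).
modprobe -r rbd
modprobe -r libceph        # rbd depends on libceph

# Remapping acquires a fresh ticket.
rbd map rbd/myimage        # hypothetical image name
```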
[22:38] <jmlowe> noob2: i believe as long as you have all the hosts defined when you initially set things up it will do the right thing
[22:38] * gregaf (~Adium@2607:f298:a:607:4c22:6146:25c9:b624) has joined #ceph
[22:38] <fmarchand> gregaf : sorry if the log is a bit incomplete ... http://pastebin.com/A2Btuj8G
[22:39] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:39] <noob2> that's what i figured but i wasn't sure :) thanks
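On noob2's question, the host separation comes from the CRUSH rule rather than the pool itself. A hedged sketch of how to inspect it; the rule shown is the typical default of this era, and bucket/rule names vary per cluster.

```shell
# Dump and decompile the CRUSH map to see the placement rules:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# A typical default rule looks like this; "chooseleaf ... type host"
# is what places each replica on a distinct host:
#
#   rule data {
#       ruleset 0
#       type replicated
#       min_size 1
#       max_size 10
#       step take default
#       step chooseleaf firstn 0 type host
#       step emit
#   }
```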
[22:40] <fmarchand> gregaf : maybe sending you an email would be easier ... tell me if you'd prefer an email
[22:40] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:41] <sjustlaptop> joshd: did you get a chance to look at wip_split2?
[22:41] <gregaf> fmarchand: that looks like a new one
[22:41] <fmarchand> I had this after I upgraded to 0.55
[22:41] <gregaf> can you 1) put an entry in the bug tracker, 2) install the ceph debug packages and reproduce (so as to get line numbers and things)?
[22:42] <joshd> sjustlaptop: partially, but it'll take a while to fully review it
[22:42] <fmarchand> so I downgraded to 0.54 ... and I still have the same problem
[22:42] <sjustlaptop> k
[22:43] <joshd> sjustlaptop: I didn't realize you'd be at home today and tomorrow, or I'd have asked for an overview of it yesterday
[22:43] <fmarchand> gregaf : as soon as I have access to a pc I do it
[22:43] <gregaf> cool
[22:44] <fmarchand> gregaf : do I need to raise the debug level?
[22:44] <sjustlaptop> oh, sorry about that
[22:44] <sjustlaptop> want an overview via vidyo?
[22:44] <gregaf> it wouldn't hurt us
[22:44] <joshd> sjustlaptop: sure
[22:45] <sjustlaptop> I'll go into the inktank room
[22:45] <sjustlaptop> oops, in use
[22:45] <sjustlaptop> I'll create a room
[22:45] <gregaf> you can just connect to each other; you don't need a room for person-to-person
[22:46] <rweeks> each person has their own room anyway
[22:46] <rweeks> just search for the name
[22:47] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[22:47] <fmarchand> gregaf : what will the ceph debug packages do? Do I need to configure something special?
[22:47] <gregaf> no config necessary; you just install them and it can resolve symbols and line numbers better
[22:47] <fmarchand> oki
[22:48] <gregaf> I don't remember which ones are available, but it'll be ceph-debug-mds or something like that if there are multiple ones
[22:48] <dmick> they tend to be named <pkg>-dbg
[22:49] <fmarchand> oki I will find it I think :)
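What gregaf and dmick describe above, sketched for a Debian/Ubuntu system; the package name follows dmick's `<pkg>-dbg` convention but may differ by release.

```shell
# Install the debug symbols alongside the regular packages:
apt-get install ceph-dbg

# No configuration change is needed; just reproduce the crash.
# The backtrace in the mds log should then resolve symbols and
# line numbers instead of raw addresses.
```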
[22:51] <fmarchand> in fact I want to reconfigure a cluster using cephfs into a rados cluster, but I have data in the OSDs that I would like to copy before re-installing a new cluster. The question is ... can we access the OSDs' data without the MDS?
[22:51] <fmarchand> I think I know the answer
[22:51] <rweeks> I'm going to say no
[22:51] <rweeks> but I am not a dev
[22:52] <fmarchand> I would say that too ...
[22:52] <rweeks> I think you'd have to copy the data out using a client
[22:54] <fmarchand> a client ?
[22:56] <rweeks> something that has the filesystem mounted, yes
[22:57] <fmarchand> but it's a cephfs ... and I can't mount it without mds apparently
[22:57] <rweeks> you have a cephfs without an mds?
[22:59] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[23:01] <gregaf> it's crashing on startup
[23:02] <rweeks> ah
[23:02] <rweeks> hm
[23:05] <fmarchand> it says in the log that there are "sessions" with other IPs .... how can that be? those machines are not used anymore
[23:05] <fmarchand> could I clean these "sessions" ?
[23:06] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[23:07] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:07] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:12] <gregaf> fmarchand: last time your MDS was running, it had connections to those machines (as clients in this case, I believe)
[23:12] <gregaf> it's giving them a chance to reconnect and describe any changes they made
[23:18] * SkyEye (~gaveen@ Quit (Remote host closed the connection)
[23:26] <drokita> Has anyone out there looked into multi-pathing or otherwise remote mounting ceph RBDs in an HA or fault tolerant setup? Like for instance if I front ended a couple of block devices with a CIFS/Samba server and wanted to account for failure in one of the heads?
[23:33] <gregaf> drokita: there are a couple people doing that with iSCSI for use with VMWare, and I think also some more using it for a more HA Samba setup
[23:33] <gregaf> it's feasible as long as the software stack is aware of the caching requirements
[23:34] <gregaf> but I don't remember who's doing it
[23:36] <lurbs> Could you map the same RBD volume onto multiple different hosts, export it from both via iSCSI and then have the iSCSI client use multipath for access to both?
[23:37] <drokita> I am sure you could, just wondering if it could be done without introducing iSCSI in between.
[23:37] <drokita> What is the connection type of a network RBD mount?
[23:39] <gregaf> not sure what you mean
[23:39] <gregaf> it's using the (custom) RBD protocol on TCP, if that's what you're asking
[23:39] <lurbs> I saw a good howto around for HA NFS via pacemaker and DRBD/LVM. That might be able to be adapted to swap out DRBD for the same RBD device attached to multiple hosts, but I'd be rather scared of split brain and filesystem corruption.
[23:39] <drokita> That is what I meant
[23:40] <drokita> The DRBD path means that you are essentially replicating the already replicated block device though
[23:40] <drokita> I might have to look into pacemakers capabilities a bit more
[23:41] <lurbs> You could remove that part and instead pass the same RBD device to multiple VMs. And then pray the HA stack will never allow the filesystem sitting on top of it to be mounted in multiple places at once.
[23:42] <drokita> 'pray' :)
[23:42] <drokita> I have worked with DRBD, the split brain thing scares me too
[23:42] <lurbs> CephFS is supposed to be what you'd use in that sort of situation. I've never touched it though, just the RBD stuff.
[23:43] <drokita> Yeah. just not ready for prod use yet
[23:44] <lurbs> Right now, I'd probably just suffer the loss from replicas on replicas and use OCFS2/Gluster/whatever on top of multiple RBD backed volumes.
[23:45] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:45] <drokita> At this point, I can probably wait for CephFS
[23:46] <drokita> then do the pacemaker failover. That would limit the split brain to the service, and not the data/block device
[23:47] <drokita> thanks for helping me think out loud
[23:50] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has left #ceph
[23:52] * cblack101 (86868b4c@ircip1.mibbit.com) has joined #ceph
[23:52] <cblack101> Got .55 running on a new test cluster, pounding begins tomorrow!
[23:53] <rweeks> what you going to use for pounding? :)
[23:53] <cblack101> Quick question: when I run ceph osd dump -o - | grep pool, I see rep size 2 for my rbd pool, does that mean 2 replicas and the original or 2 copies total?
[23:54] <cblack101> @rweeks - will be using multiple VMs (Ubuntu 12.04 w3/6/9 kernel) to multiple rbds
[23:54] <cephalobot`> cblack101: Error: "rweeks" is not a valid command.
[23:54] <joao> cblack101, 2 copies total
[23:55] <rweeks> cool!
[23:55] <cblack101> cool, thanks for the clarification joao, I'll need to set that to 3 for my testing then, anyone know the syntax OTTH?
[23:56] * fc__ (~fc@home.ploup.net) Quit (Quit: leaving)
[23:56] <joao> cblack101, I *think* it's something like 'ceph osd pool set <pool-name> size <rep-level>'
[23:57] <dmick> http://ceph.com/docs/master/rados/operations/pools/#set-the-number-of-object-replicas
[23:57] <joao> it should fail if something is wrong; should change the rep level if it's right
[23:57] <cblack101> thanks will poke about and find it
[23:57] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[23:57] <joao> cblack101, dmick's link should help you with that :)
[23:58] <cblack101> ceph osd pool set nova size 3 - worked like a champ thanks!
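The exchange above in one sketch, for the record: `size` counts total copies, the original included, so `size 3` means the object plus two replicas. Pool names here follow the conversation.

```shell
# Set the replica count (total copies) on a pool:
ceph osd pool set rbd size 3

# Verify: the pool's line in the osd dump should now show "rep size 3".
ceph osd dump | grep pool
```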

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.