#ceph IRC Log


IRC Log for 2012-01-13

Timestamps are in GMT/BST.

[0:01] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[0:05] * adjohn (~adjohn@ Quit (Quit: adjohn)
[0:09] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[0:10] * gregaf1 (~Adium@aon.hq.newdream.net) has joined #ceph
[0:15] * gregaf (~Adium@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[0:35] * thafreak (~thafreak@dynamic-acs-24-144-210-108.zoominternet.net) has left #ceph
[1:01] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) has joined #ceph
[1:06] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[1:07] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:10] * edwardw is now known as edwardw`away
[1:25] <josh> \quit
[1:25] * josh (~josh@50-46-198-131.evrt.wa.frontiernet.net) Quit (Quit: leaving)
[1:41] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:59] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[2:13] * jojy (~jvarghese@ Quit (Quit: jojy)
[2:15] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:48] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:49] * spadaccio (~spadaccio@213-155-151-233.customer.teliacarrier.com) Quit (Quit: WeeChat 0.3.7-dev)
[3:23] * MarkDude (~MT@wsip-98-191-4-50.rn.hr.cox.net) has joined #ceph
[3:24] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has left #ceph
[3:26] * MarkDude (~MT@wsip-98-191-4-50.rn.hr.cox.net) Quit (Read error: Connection reset by peer)
[3:35] * jojy (~jvarghese@75-54-228-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[3:35] * jojy (~jvarghese@75-54-228-176.lightspeed.sntcca.sbcglobal.net) Quit ()
[3:49] * lxo (~aoliva@lxo.user.oftc.net) Quit (autokilled: Spammer - Contact support@oftc.net for help. (2012-01-13 02:49:39))
[4:00] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:15] * lxo (~aoliva@lxo.user.oftc.net) Quit (autokilled: Spammer - Contact support@oftc.net for help. (2012-01-13 03:15:19))
[4:15] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:21] * voidah (~voidah@pwel.org) Quit (Quit: leaving)
[4:22] * lxo (~aoliva@lxo.user.oftc.net) Quit (autokilled: Spammer - Contact support@oftc.net for help. (2012-01-13 03:22:35))
[4:25] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:44] * votz (~votz@pool-108-52-121-248.phlapa.fios.verizon.net) Quit (Quit: Leaving)
[5:02] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[5:48] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[6:24] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[6:25] * joshd (~joshd@aon.hq.newdream.net) Quit ()
[6:27] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[6:32] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:06] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[7:55] * gohko (~gohko@natter.interq.or.jp) Quit (Quit: Leaving...)
[7:56] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Read error: Operation timed out)
[8:42] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) has joined #ceph
[8:52] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[8:53] * yoshi (~yoshi@u637024.xgsfmg2.imtp.tachikawa.mopera.net) has joined #ceph
[9:04] * BManojlovic (~steki@93-87-148-183.dynamic.isp.telekom.rs) has joined #ceph
[9:07] * yoshi (~yoshi@u637024.xgsfmg2.imtp.tachikawa.mopera.net) Quit (Remote host closed the connection)
[9:10] * yoshi (~yoshi@u637024.xgsfmg2.imtp.tachikawa.mopera.net) has joined #ceph
[9:21] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[9:32] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[9:39] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[10:23] * yoshi (~yoshi@u637024.xgsfmg2.imtp.tachikawa.mopera.net) Quit (Ping timeout: 480 seconds)
[10:40] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[10:41] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[10:42] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:47] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[10:51] * verwilst (~verwilst@dD576F293.access.telenet.be) has joined #ceph
[11:05] * _Tassadar (~tassadar@tassadar.xs4all.nl) has left #ceph
[11:06] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:06] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[11:06] * fronlius_ is now known as fronlius
[11:12] * spadaccio (~spadaccio@213-155-151-233.customer.teliacarrier.com) has joined #ceph
[11:21] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:21] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[11:21] * fronlius_ is now known as fronlius
[12:28] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[12:29] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[12:34] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:40] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[12:55] * gregorg_taf (~Greg@ has joined #ceph
[12:55] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[13:50] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[13:59] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[14:00] * lollercaust (~paper@41.Red-88-15-116.dynamicIP.rima-tde.net) has joined #ceph
[14:37] * jmlowe (~Adium@173-161-9-146-Illinois.hfc.comcastbusiness.net) has joined #ceph
[14:44] * lollercaust (~paper@41.Red-88-15-116.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[14:52] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:05] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[16:10] * jmlowe (~Adium@173-161-9-146-Illinois.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[16:12] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:16] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[16:24] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[16:33] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[16:34] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:54] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[16:55] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:57] * lxo (~aoliva@lxo.user.oftc.net) Quit (autokilled: Spammer - Contact support@oftc.net for help. (2012-01-13 15:57:38))
[17:01] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) has joined #ceph
[17:04] * BManojlovic (~steki@93-87-148-183.dynamic.isp.telekom.rs) Quit (Remote host closed the connection)
[17:37] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[17:44] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[18:11] * lollercaust (~paper@41.Red-88-15-116.dynamicIP.rima-tde.net) has joined #ceph
[18:24] * lxo (~aoliva@lxo.user.oftc.net) Quit (autokilled: Spammer - Contact support@oftc.net for help. (2012-01-13 17:24:04))
[18:27] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[18:37] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:57] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:07] <Tv> sagewk1: yeah my bad i was reading an old checkout of teuthology, well revise
[19:07] <Tv> *will
[19:07] <sagewk1> tv: np
[19:08] <sagewk1> tv: what's confusing me is that the identical task when i run teuthology as user sage@metropolis gets different results than when teuthworker@teuthology runs it. there must be some dependency on the environment i'm missing?
[19:08] <Tv> digging
[19:08] <sagewk1> tnx
[19:10] <Tv> sagewk1: is a.run:Running: 'cat -- /tmp/cephtest/archive/s3readwrite.client.0.config.yaml' a local edit in just your tree?
[19:10] <Tv> i'm confused why that is getting no output, and can't find it
[19:11] <sagewk1> my tree is clean
[19:11] <Tv> oh it's in the run part.. why can't i find a match for "cat"...
[19:11] <Tv> brain not on yet ...
[19:11] <sagewk1> that output is from the teuthworker run
[19:11] <Tv> ohhh teuthology.get_file
[19:11] <Tv> that's funky
[19:11] <Tv> so it writes a file remotely, then reads it back in to the controller and writes it out again
[19:12] <Tv> that's a bit messed up ;)
[19:13] <Tv> okay now the log is starting to make sense..
[19:13] <Tv> aaand i know where the difference is
[19:13] <Tv> something puts a unicode string in as the hostname
[19:24] * otherone (518f110a@ircip4.mibbit.com) has joined #ceph
[19:25] <otherone> hi, we were able to successfully install ceph with btrfs on debian, all worked fine, but when doing stress testing and failing nodes in order to check the redundancy we had a number of stability issues
[19:25] <otherone> including osd going into a state where we found no other way than to rebuild the cluster
[19:25] <otherone> or problems with btrfs which made it unusable
[19:26] <otherone> so, we are looking for a ceph developer to provide us with support services and installation services for such a cluster which will allow production use.
[19:27] <otherone> we are willing to pay reasonable money :-)
[19:27] <otherone> I believe newdream is using ceph in production, we like the technology and believe that it can be used by us in production
[19:28] <otherone> just need to acquire the knowledge of how to get it started in a stable way, quickly :-)
[19:28] <otherone> anyone ?
[19:28] <nhm> otherone: a couple of the newdream guys were around about 10 minutes ago. I'm sure they'll be able to answer your questions.
[19:29] <otherone> nhm: hopefully we are going to be back soon :-)
[19:30] <nhm> otherone: If you've still got any debugging information from your stress testing I bet they'd love to see it too. :D
[19:32] <jmlowe> otherone, I couldn't make a go of it with the stock 3.0 ubuntu kernel, I had to have the 3.2 from the upcoming ubuntu 12.04, it became too difficult to make that work so I went with ext4 this week
[19:33] <jmlowe> I would probably suggest that route for the time being, at least until btrfs has a working fsck
[19:34] <jmlowe> I should mention I had the same problems under load, I could keep things going under light load and btrfs but under stress the backing filesystems fell apart and eventually overwhelmed ceph replication
[19:35] <nhm> jmlowe: how is it going with ext4?
[19:37] <jmlowe> so far so good, don't notice much of a difference, other than my osd's haven't fallen over yet
[19:43] <nhm> I wonder if this 4kb limit on xattrs with ext4 is still an issue.
[19:43] <Tv> nhm: i think yes...
[19:43] <gregaf1> otherone: Dreamhost is rolling out products built on RADOS, but the filesystem isn't production-ready yet (well, depending on what you're trying to do)
[19:44] <jmlowe> I'm only using rbd, it's the posix layer that runs into trouble right?
[19:44] <gregaf1> but you can email one of our business people to get started: dona.holmberg@dreamhost.com
[19:44] <otherone> jmlowe: is dreamhost using ext4 or btrfs for their deployment ? what's the replication level they have ? 2 ? 3? 4 ?
[19:44] <gregaf1> the 4KB limit is still a problem on ext4, yes, but it's not something people have really run into outside of rgw users
[19:45] <gregaf1> jmlowe: as long as you don't take a ridiculous number of snapshots it won't be a problem :)
[19:46] <otherone> gregaf1: thank you for the contact email address. I will get in touch.
[19:46] <gregaf1> welcome, thanks for looking at Ceph :)
[19:46] <otherone> gregaf1: are you using ceph for your hosting business ? is there any public information about your own use of ceph/rados ?
[19:47] <otherone> gregaf1: thank you for such great technology as rados/ceph :-)
[19:47] <nhm> oh nice, someone on the mailing list was reporting single client throughput of 700MB/s over 10G from a VM with btrfs.
[19:47] <jmlowe> otherone: I don't know, I'm not a dreamhost person, I work for IU
[19:48] <otherone> jmlowe: IU ? never heard about them, but I'm in europe :-)
[19:48] <otherone> jmlowe: the problems we mainly had were not with posix layer (ceph) but with osd and btrfs
[19:49] <jmlowe> Indiana University
[19:49] <gregaf1> otherone: I'm not sure what exactly is public knowledge; Dona could tell you more :)
[19:49] <nhm> otherone: A couple of us here are from Universities. :)
[19:49] <otherone> jmlowe: only once, while mounting ceph, one of the hosts ran out of memory (including swap) and rebooted itself.
[19:50] <otherone> gregaf1: oh come on spill the beans ;-)
[19:50] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (Quit: Ex-Chat)
[19:50] * jojy (~jvarghese@ has joined #ceph
[19:52] <otherone> what's the general assumption on the safe and economically justified number of replicas ? we are considering 2 or 3.
[19:54] <otherone> nhm: using kingston hyper-x ssd or ocz vertex 3 ssd disks you can get 700MB/s easily ;-)
[19:54] <jmlowe> probably depends on the underlying hardware
[19:55] <gregaf1> 2 or 3 are good choices; that's what we expect most people to use
[19:55] <otherone> jmlowe: simple hardware, no raid controllers, sata drives of different sizes, we are planning to connect to the storage cluster everything that has a lot of disks and is not used any more.
[19:55] <nhm> otherone: Yeah, modern SSDs can do 500MB/s under ideal conditions.
[19:56] <otherone> jmlowe: the storage cluster is going to be used for backup storage, hosting and vm disks for openstack.
[19:56] <gregaf1> depends on your own assessment of your hardware reliability, mostly, and of course the more replicas the less likely you are to temporarily lose access to things
[19:56] <nhm> otherone: adding in all of the complications of a distributed file system over a network and it becomes more interesting. :)
[19:57] <jmlowe> otherone: let me know how openstack goes, I have been doing things by hand so far but I'm interested in a bit more automation
[19:57] <otherone> gregaf1: we assume that most hardware is going to be less than three years old, sata drives and server grade (no desktop/workstation type), therefore hardware failure probably is going to be small, except for disks which always fail.
[19:58] <nhm> jmlowe: I'm actually building out an openstack cloud right now...
[19:58] <otherone> gregaf1: each node connected using GE and the switches cross conected using 10G
[19:58] <otherone> jmlowe: openstack.... hmmm been working with it for over a year now...
[19:59] <gregaf1> otherone: you sound pretty confident in your hardware…of course as a dev I'm still a little nervous so I like larger numbers :) (and even with 3x replication you should be talking pennies/gig)
[19:59] <nhm> otherone: yeah, I've got a test cloud I've had deployed since march that's still running cactus
[19:59] <otherone> jmlowe: how should I put it mildly ....
[20:00] <jmlowe> otherone: well there is a reason I haven't played with it yet
[20:00] <dwm_> I'm thinking about using Ceph for live user-data (once it's stabilised); I think I'd be distinctly nervous about having less than 3x replication.
[20:00] <otherone> jmlowe: nice idea, nice people, no architect, no coordination, everybody doing whatever they want, mainly rackspace pulling their weight and doing their bit regardless of the needs of the whole project
[20:01] <otherone> dwm_: the best bet is to know how the creators use it, as probably they have the biggest deployment yet :-)
[20:02] <jmlowe> otherone: should get interesting with ATT now involved
[20:02] <nhm> dwm_: we use lustre with no replication for live user data. ;) Granted it's scratch, but still...
[20:02] <jmlowe> nhm: use ddn gear with that lustre?
[20:02] <otherone> gregaf1: I'm pretty sure that hardware will fail, but the rate of failure I can anticipate to be low, as for the storage cloud we are going to use servers from ex-projects with extra drives. as we have hundreds of servers I can statistically judge what the failure rate is going to be
[20:03] <nhm> jmlowe: actually we are using Dell/Terascala. DDN bid high.
[20:03] <dwm_> nhm: Yeah, scratch is different to $HOME. :-)
[20:03] <gregaf1> otherone: so run those numbers, figure out the odds of a disk failing while the system is still recovering from a failure (that shares data with it)…. :)
[20:03] <jmlowe> nhm: it's been a few years but dell bid low with ddn for us
[20:04] <otherone> jmlowe: rackspace did get all the big names on the bandwagon however I cannot see much coordination and standardisation.
[20:04] * The_Bishop_ (~bishop@f052097027.adsl.alicedsl.de) has joined #ceph
[20:04] <gregaf1> the odds will be happier than with RAID but then I'm a lot more confident in a RAID card's not breaking anything than I am in OSD recovery never failing due to undiscovered bugs
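The "run those numbers" suggestion above can be sketched as a small calculation. This is only a rough illustration, not anything the channel participants ran: it assumes disks fail independently at a constant rate, and the annualized failure rate, recovery window, and peer counts are hypothetical placeholders to be replaced with figures from your own fleet.

```python
import math


def p_fail_within(hours, afr):
    """Probability that one disk fails within `hours`.

    Assumes a constant (exponential) failure rate derived from the
    annualized failure rate `afr`, e.g. 0.05 for 5% per year.
    """
    rate_per_hour = -math.log(1.0 - afr) / (365 * 24)
    return 1.0 - math.exp(-rate_per_hour * hours)


def p_overlap_failure(peers, recovery_hours, afr):
    """Probability that at least one of `peers` disks sharing data with
    an already-failed disk also fails before recovery completes.

    Assumes failures are independent, which real hardware may violate
    (same batch, same rack, same power feed).
    """
    p = p_fail_within(recovery_hours, afr)
    return 1.0 - (1.0 - p) ** peers


if __name__ == "__main__":
    # Hypothetical inputs: 5% AFR, 6 hour recovery window.
    for peers in (10, 50, 100):
        odds = p_overlap_failure(peers, recovery_hours=6, afr=0.05)
        print(f"{peers:4d} peers: {odds:.6f}")
```

Under these assumptions, with 2x replication a single overlapping failure among the peers is enough to make some data unavailable, while 3x replication needs two further overlapping failures, which is one way to compare the two options.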
[20:04] <jmlowe> nhm: it was the last time we bought from dell, we now know all you have to do to knock dell out of the race is to require integration
[20:05] <gregaf1> as I hear it the guys actually running OpenStack are starting to kick back against the drive-by patches and work on architecting it better
[20:05] <otherone> jmlowe: I would suggest looking at the code, there are big chunks of code which are duplicated, communication between different parts is not standardised, documentation is non-existent (your best bet is to read the code).
[20:05] <nhm> gregaf1: I wouldn't be terribly confident about raid cards not breaking. ;P
[20:05] <gregaf1> and I suspect that we're going to start pushing for some of it due to rbd and the terrible way those subsystems are structured right now
[20:05] <jmlowe> 3.2 broke my raid card
[20:05] <otherone> gregaf1: is it that bad or you are just one of the developers ? ;-)
[20:06] <gregaf1> otherone: I'm a dev and I get nervous ;)
[20:06] <gregaf1> depends on the failure conditions you subject it to I suppose
[20:06] <gregaf1> I'm a lot more confident in its failure handling than Lustre's though! ;)
[20:06] <nhm> jmlowe: terascala has been pretty good. Dell supplies the hardware, terascala does the OS/Lustre/Management/etc.
[20:08] <nhm> gregaf1: :D
[20:08] <otherone> gregaf1: if I had time I would help you develop this technology further. this is the first technology in a long time which has made me go "wow.... I see the future differently now"
[20:09] <otherone> gregaf1: if you want to increase the adoption rate then develop part of openstack nova-compute which will allow for provisioning and starting a VM from the RBD.
[20:10] <gregaf1> yeah, some of that's in already and more of it will be going in Real Soon Now
[20:10] <otherone> gregaf1: I did look into it and currently there is no code which allows you to do it, you have to attach an additional volume via nova-volume which is on RBD, but not the starting/root volume.
[20:10] <joshd> otherone: there's a way to boot from volumes with openstack too
[20:10] <gregaf1> hrm, really? I thought joshd had gotten that hacked in for Essex
[20:10] <otherone> gregaf1: I believe Duncan from your company is into the openstack project.
[20:11] <gregaf1> joshd: essex, right? or am I misremembering the timeline
[20:11] <joshd> otherone: last I checked there was some clunkiness with using it, but the functionality was there (in an openstack api extension)
[20:11] * The_Bishop (~bishop@e179024117.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[20:11] <otherone> gregaf1: I will check, however I was doing research/discovery about this right before christmas time and it wasn't there
[20:12] <gregaf1> otherone: it's not pretty, unfortunately, because OpenStack designers for some reason never considered that you might want a distributed block device as your root drive…lame given how long Sheepdog's been around
[20:12] <joshd> otherone: it's not well documented or anything, I think I looked at the patches to tell how to use it
[20:12] <otherone> gregaf1,joshd: our idea was to use SSD drive for nova-compute nodes and ALL storage on RBD
[20:12] <otherone> right now we are using ceph on /var/..../instances/
[20:13] <otherone> joshd: I haven't been looking at openstack documentation for a long time now, the only documentation this project has is the code itself :-)
[20:15] <otherone> gregaf1: there was a patch which allowed starting from the root on the network, however there was no provisioning done.
[20:16] <joshd> otherone: ah, yeah, provisioning isn't there yet
[20:16] <otherone> gregaf1: you have to manually first create the volume using nova-volume (which has support for rbd), then make sure your libvirt supports rbd, and then you were able to start a VM using RBD as root
[20:16] <otherone> gregaf1: however there was no way to simply start an instance which by default would provision the rbd volume, use glance to image it and then start vm using it as root
[20:17] <gregaf1> ah
[20:17] <otherone> I will recheck it once more, but that is my understanding as of pre-Christmas time.
[20:18] <otherone> I'm surprised that the openstack team didn't invite you into the project in order to replace swift with rados/ceph ;-)
[20:18] <gregaf1> no, that sounds right
[20:19] <Tv> otherone: yeah we've been making noises about that
[20:19] <Tv> otherone: what is there currently would have to download an image from glance via http and write it out to rbd
[20:20] <Tv> otherone: and if you store the glance images in rbd already, we can do a faster copy-on-write
[20:20] <otherone> tv: that would be perfect :-)
[20:20] <Tv> otherone: but right now, we pushed back the cow feature to spend more time in QA
[20:21] <Tv> otherone: so the nova/glance api changes sort of got shelved for a while.. the relevant people know the use case, but we're not currently putting dev hours into it
[20:21] <otherone> tv: I can sponsor it in small bit if you have the developers :-)
[20:22] <Tv> otherone: not sure what you mean by that, honestly
[20:22] <otherone> tv: my idea was to have all on rbd, glance image store, root and volumes for VMs
[20:22] <Tv> yeah
[20:22] <Tv> otherone: and as far as i remember, all of that exists.. but right now, you need to do steps manually to "instantiate" the images for new vms
[20:23] <Tv> otherone: but you can have glance images in rbd already, etc
[20:23] <otherone> tv: I mean that I can put money towards the development of such functionality if you have developers who have time to do it and know both technologies. my main problem is lack of human resources.
[20:23] <Tv> otherone: there is no cow yet, and there is no automation on getting it going yet
[20:23] <Tv> otherone: oh nice.. umm, i should patch you through to the right person
[20:24] <Tv> otherone: you should email dona.holmberg@dreamhost.com and say you are interested in using RBD for OpenStack and are willing to pay for development
[20:25] <otherone> tv: my main reason for being here is that we want to use RBD and CEPH, however we tried it on debian and had numerous problems when stress testing.
[20:26] <otherone> tv: mainly with btrfs and osd, which resulted on a few occasions in the need to rebuild the whole cluster
[20:27] <otherone> tv: so I came here to ask for support in order to install a stable storage cluster and acquire the blackmagic (knowledge) needed to support it in a production environment
[20:27] <Tv> otherone: we're definitely in the QA phase, bugs are still being found
[20:27] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:27] <otherone> tv: as I said before we lack human resources but have need and a small bit of money
[20:28] <otherone> tv: I don't expect a rock-stable, no-bugs, etc. installation.
[20:28] <Tv> otherone: our support organization is still.. being recruited.. Dona will help you get maximum value you can right now
[20:28] <gregaf1> if you run into specific issues the mailing list or here is the best way to ask for help (depending on time of day)
[20:28] <otherone> tv: what I expect is an installation with people to rely on when things go wrong :-) so they know quickly what's wrong and how to fix it :-)
[20:29] <gregaf1> I don't think we've encountered data loss bugs in a very long time, although poor jmlowe managed to kill his OSDs with a bad crush map yesterday, and I know that btrfs bugs (and occasional OSD bugs/interface oddities) can make revival a challenge
[20:29] <otherone> tv: hehehe I don't want the newly recruited ones :-) please..... I beg you.
[20:29] <otherone> tv: I will contact dona on monday.
[20:29] <Tv> otherone: oh we're definitely not hiring anyone without a clue
[20:30] <Tv> otherone: but it's more.. going through that channel will make us pay more attention to you
[20:30] <otherone> gregaf1: however if you have the blackmagic and know how it works you can mitigate the revival challenge :-)
[20:30] <Tv> this channel and the mailing list we drift in & out of based on what we as individuals are developing at the moment
[20:31] <gregaf1> somebody will pay attention to you if you're on the mailing list (or here and patient); depending on the bug it might be "yep, that's a bug, here's the tracker for it, sorry"
[20:31] <gregaf1> but if it's an OSD revival problem we'll be very interested in it so you should ask for help :)
[20:32] <otherone> tv: I'm sure you want the best from the market, however I've been programming myself for 20+ years and I know that it takes time to find out how things work and then it takes years to gain the experience through being in a number of tough situations :-)
[20:34] <otherone> gregaf1: regretfully I'm not involved in day-to-day testing of these solutions (day has only 24h) however I will get a nice lady who is in charge to connect in here and ask tough questions :-)
[20:35] <Tv> otherone: yes, but we get a lot of "oh it's that category of failures, extract this info and follow these steps..."
[20:35] <otherone> gregaf1: the main problem for us is people, we are a very small team with a lot to do. working 12-14h a day 6-7 days a week. probably you started in a similar way at Dreamhost.
[20:36] <gregaf1> luckily for me I came in to the company a little later than that phase!
[20:37] <otherone> tv: I will try to find the latest reason why we had to rebuild the cluster. it had something to do with btrfs.
[20:39] * lollercaust (~paper@41.Red-88-15-116.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[20:43] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[20:51] <otherone> tv: did you get my private message ? I have sent you a url to the description of the problem which was posted on the btrfs mailing list, no solution until now (pre-christmas time it was)
[20:51] <Tv> yes
[20:52] <Tv> http://comments.gmane.org/gmane.comp.file-systems.btrfs/14964
[20:54] <otherone> tv: so when using ext4 each time you add a drive to the system you need to resize the fs, btrfs is that much easier in that you just take a drive and add it to the "mountpoint"
[20:55] <otherone> tv: or am I wrong ?
[20:55] <gregaf1> that doesn't look familiar to me, but sjust might have an idea when he gets back
[20:55] <gregaf1> but it's lunch time now, bbiab!
[20:56] <Tv> otherone: are you raiding the drives together?
[20:56] <otherone> lunchtime for you, evening for me :-)
[20:56] <Tv> otherone: most of our usage doesn't do raid
[20:56] <Tv> otherone: so for us, new drive = new osd
[20:56] <otherone> tv: no just plain drives, no hardware raid or software raid.
[20:56] <Tv> so why the resize?
[20:56] <jmlowe> otherone: did you say you worked for a university?
[20:56] <otherone> oh... that way.
[20:57] <otherone> jmlowe: I helped one uni on some project work as friends worked there (now in google)
[20:58] <otherone> tv: plain, easy solution. we were thinking along the lines: one node = one osd = one mountpoint = many drives in file system on lvm
[20:58] <otherone> tv: thank you for changing the way of thinking :-)
[20:58] <jmlowe> otherone: damn, so much for an employer-paid trip over there to "collaborate"
[20:59] <otherone> jmlowe: I mean friends work for google
[20:59] <nhm> jmlowe: wow, how do you pull that off?
[20:59] <otherone> jmlowe: do you want to come over and need to be invited by a uni, or do you want me to come over ? ;-) both can be arranged ;-)
[21:00] <otherone> tv: is there performance or architecure related limit on number of osd's on one node or in the whole "cluster" ?
[21:02] <jmlowe> cheapest way to fill our booth at supercomputing, let other people do the hard stuff then "collaborate" and have them give booth talks come supercomputing
[21:03] <nhm> jmlowe: Heh, we canceled our SC booth. I was one of the few people that got to go this year.
[21:04] <jmlowe> our current president was our vpit, he started us going about 10 years ago
[21:04] <nhm> jmlowe: I took a picture of your booth though. ;) http://www.clusterfaq.org/wp-content/uploads/2011/11/IMG_20111115_105734.jpg
[21:05] <nhm> sadly my cellphone photography isn't exactly prize-worthy.
[21:05] <jmlowe> https://picasaweb.google.com/104314253101021700373/SC11
[21:06] <jmlowe> the lighting there was deceptively bad
[21:06] <nhm> yeah, it was hard to get good shots. I wish I would have remembered to bring a camera.
[21:08] <nhm> man I wish we had a place like the taphouse locally.
[21:10] <otherone> is there performance or architecture related limit on number of osd's on one node or in the whole "cluster" ?
[21:11] <jmlowe> the thing that blew my mind was the tap house wasn't the best place to go
[21:11] <jmlowe> there were places better
[21:11] <nhm> jmlowe: I made it out to a couple of other places but liked the taphouse best. Where did you like better?
[21:12] <nhm> I'm a sucker for belgians.
[21:12] <jmlowe> black raven brewery had the best beer I've ever had, period
[21:12] <nhm> wow
[21:13] <nhm> otherone: I don't know the answer, though I imagine there could be variables that could affect performance depending on the hardware setup.
[21:18] <otherone> nhm: thanks
[21:20] <nhm> otherone: there was some chatter on irc from september about large OSD setups. It sounds like by default every OSD talks to every other OSD so you end up with a ton of threads and need to customize the crushmap.
[21:20] <nhm> s/large/many
[21:20] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) has joined #ceph
[21:21] <jmlowe> oh, you need to go to Brouwer's Cafe, North 35th Street, Seattle, WA if you are into Belgian beer
[21:21] <otherone> nhm: how do you define a large osd setup ? low or high hundreds of osd instances or hardware nodes ?
[21:22] <jmlowe> nhm: that bottle in the album was aged 5 years in oak then rested for another 10 in cold storage in that bottle
[21:22] <nhm> jmlowe: wow
[21:22] <nhm> jmlowe: I emailed myself that address so whenever I make it back to Seattle I can reference it. :)
[21:24] <nhm> jmlowe: I really like piraat on tap. My favorite bottled belgian (that I can get locally) is La Trappe quad.
[21:25] <jmlowe> nhm: the guy I mentioned at ncsa who was working on statistical log analysis, Joshi Fullop, is an international beer judge, he judged the 2010 world cup
[21:25] <nhm> otherone: someone on IRC was having some problems with a 320 OSD setup spawning something like 30000 threads.
[21:27] <otherone> nhm: ufff... my plan for the beginning is less than 100 physical nodes, probably 30-50 for a start.
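The "every OSD talks to every other OSD" remark can be turned into a back-of-the-envelope estimate. This is a sketch under loud assumptions: it supposes a full mesh between OSD daemons and a fixed number of threads per peer connection. The `threads_per_peer` value and the OSDs-per-host figure are hypothetical parameters, not numbers taken from this channel or from the Ceph source.

```python
def estimated_threads_per_osd(total_osds, threads_per_peer=2):
    """Messenger threads one OSD would need if it kept a connection
    open to every other OSD (full mesh, an assumption)."""
    return (total_osds - 1) * threads_per_peer


def estimated_threads_per_host(total_osds, osds_per_host, threads_per_peer=2):
    """Same estimate summed over all OSD daemons on one host."""
    return osds_per_host * estimated_threads_per_osd(total_osds, threads_per_peer)


if __name__ == "__main__":
    # 320 OSDs is the cluster size mentioned above; 4 OSDs per host
    # is a made-up example value.
    print(estimated_threads_per_osd(320))
    print(estimated_threads_per_host(320, osds_per_host=4))
```

The point of the sketch is only that the count grows with the product of local OSDs and total OSDs, so limiting which OSDs peer with each other (for example via the crushmap, as suggested above) reduces it.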
[21:27] <nhm> jmlowe: !
[21:28] <nhm> jmlowe: clearly we need to collaborate on something.
[21:30] <nhm> otherone: looks like at least as of a couple of months ago you want to keep it to about 100 PGs per host.
[21:31] <jmlowe> nhm: I think Salt Lake is going to be a bit of a disappointment in terms of the beer, food was pretty good, I was there this summer for a conference
[21:33] <nhm> jmlowe: it will be interesting to see if attendance is down because of it. :)
[21:35] <nhorman> joshd, ping
[21:36] <nhm> otherone: btw, here are the docs on placement groups. Note that it should be 100PGs/host, not 100PGs/OSD. http://ceph.newdream.net/docs/latest/dev/placement-group/
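The per-host placement group guideline quoted above is simple arithmetic, sketched here for the 30-50 node range mentioned earlier. The figure of 100 PGs per host comes from the chat itself; the OSDs-per-host value is a hypothetical example, and the linked placement group docs remain the authority on how to size a real cluster.

```python
def total_pgs(hosts, pgs_per_host=100):
    """Cluster-wide PG budget under the per-host guideline."""
    return hosts * pgs_per_host


def pgs_per_osd(osds_per_host, pgs_per_host=100):
    """Average PGs each OSD carries when a host runs several OSDs."""
    return pgs_per_host / osds_per_host


if __name__ == "__main__":
    for hosts in (30, 50, 100):
        print(hosts, "hosts:", total_pgs(hosts), "PGs")
    # Hypothetical host with one OSD per drive and 4 drives.
    print("PGs per OSD with 4 OSDs per host:", pgs_per_osd(4))
```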
[21:36] <joshd> nhorman: pong
[21:36] <nhorman> joshd, hey, since you helped me out yesterday, I wanted to let you know I've made a (small) bit of progress on that cephx thing we were talking about
[21:37] <nhorman> joshd, so, I have no idea why this is the case, but it seems that the KEY_SPEC_USER_KEYRING that mount.ceph adds a key to doesn't by default have any permissions on it
[21:38] <nhorman> i.e. I tried calling keyctl_setperm on it, and I was unable to modify any permissions on that keyring or any attached keys that I added
[21:38] <nhorman> no idea what use that is, but it seemed to hold true both in Fedora 16 and on an upstream kernel
[21:39] <nhorman> if I modified mount.ceph to add the key to KEY_SPEC_PROCESS_KEYRING however, the kernel was able to find it
[21:40] <nhorman> joshd, that make any sense to you at all?
[21:40] <joshd> that's odd, I wonder if it's a packaging difference in libkeyutils or the kernel
[21:41] <nhorman> joshd, I was thinking that too, particularly the kernel, but I tried with linus' latest and its the same deal
[21:41] <nhorman> maybe libkeyutils. Is libkeyutils responsible for creating the default keyrings?
[21:42] <nhorman> I would have thought the kernel did that on clone (given that there are per-process/thread keyrings)
[21:42] <joshd> nhorman: I'm not sure who does that, but I would guess so
[21:43] <nhorman> joshd, well, I'll have to keep looking at that.
[21:43] <joshd> nhorman: Tv may know
[21:43] <nhorman> unfortunately, it's just got me to another problem as well. I can see I'm trying to get key data from ceph now, but the mds log when I try to mount keeps spewing this:
[21:43] <nhorman> 2012-01-13 15:37:16.179850 b3f36b40 auth: could not find secret_id=2
[21:43] <nhorman> 2012-01-13 15:37:16.179874 b3f36b40 cephx: verify_authorizer could not get service secret for service mds secret_id=2
[21:43] <nhorman> 2012-01-13 15:37:16.179906 b3f36b40 -- >> pipe(0x86f0380 sd=9 pgs=0 cs=0 l=0).accept bad authorizer
[21:44] <nhorman> any clue?
[21:45] <joshd> does 'ceph auth list' show the right users/secrets?
[21:45] <nhorman> joshd, yup, its a cut'n paste from the output of ceph auth list to my mount command, secrets match
[21:47] <joshd> nhorman: check the monitor logs - there should be a reason the mds couldn't get the service secret there
[21:47] <nhorman> joshd, anything special I should grep for?
[21:48] <nhorman> only other log entry that seems to relate is: 2012-01-13 15:41:09.137494 b3f36b40 auth: could not find secret_id=2
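One way to answer the "what should I grep for" question is to pull the missing secret_id values out of the log lines pasted above. The following is only a sketch: the regular expression is written against the exact message wording shown in this log, and the helper name is hypothetical. Other Ceph versions may word these messages differently.

```python
import re
from collections import Counter

# Matches both message shapes seen in the pasted mds log:
#   "auth: could not find secret_id=2"
#   "cephx: verify_authorizer could not get service secret for service mds secret_id=2"
MISSING_SECRET = re.compile(
    r"could not (?:find|get service secret for service (?P<service>\w+)) "
    r"secret_id=(?P<secret_id>\d+)"
)

# Sample lines copied from the conversation above.
SAMPLE = [
    "2012-01-13 15:37:16.179850 b3f36b40 auth: could not find secret_id=2",
    "2012-01-13 15:37:16.179874 b3f36b40 cephx: verify_authorizer could not "
    "get service secret for service mds secret_id=2",
    "2012-01-13 15:37:16.179906 b3f36b40 -- >> pipe(0x86f0380 sd=9 pgs=0 cs=0 "
    "l=0).accept bad authorizer",
    "2012-01-13 15:41:09.137494 b3f36b40 auth: could not find secret_id=2",
]


def missing_secret_ids(lines):
    """Count how often each secret_id is reported as missing."""
    counts = Counter()
    for line in lines:
        match = MISSING_SECRET.search(line)
        if match:
            counts[int(match.group("secret_id"))] += 1
    return counts


if __name__ == "__main__":
    for secret_id, hits in sorted(missing_secret_ids(SAMPLE).items()):
        print(f"secret_id={secret_id} reported missing {hits} time(s)")
```

A secret_id that keeps recurring, as it does here, suggests the daemon never received that rotating service secret, which fits joshd's guess that the mds is not talking to a monitor in the quorum.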
[21:49] <joshd> oh, you know, you may be hitting an issue that was recently fixed where an mds could be connected to a monitor that's out of the quorum
[21:49] <nhorman> would ceph mon status show that?
[21:50] <joshd> ceph quorum status
[21:50] <joshd> err 'ceph quorum_status'
[21:50] <nhorman> hmm, unrecognized command....1 sec
[21:51] <joshd> ah, that might have been added too recently
[21:51] <nhorman> ceph health does show 'HEALTH_WARN There are lagging MDSes: 0(rank 0)'
[21:52] <joshd> yeah, it sounds like it's not connecting to the monitors (or not to ones in the quorum)
[21:53] <nhorman> well I'm currently running in the trivial case (single node - one mon/mds/osd daemon)
[21:53] <nhorman> so if its not connecting I imagine thats the problem - you say a packaging of the git head should fix that?
[21:53] <joshd> yeah
[21:53] <nhorman> ok, cool, I'll harass jbacik about that then
[21:53] <nhorman> thanks!
[21:53] <joshd> bug is http://tracker.newdream.net/issues/1912
[21:54] <nhorman> ah, perfect, thank you!
[21:54] <joshd> no problem
[22:00] <Tv> nhorman: i don't know much about your keyring problems; i would ask how is fedora different
[22:00] <Tv> nhorman: e.g. selinux?
[22:01] <nhorman> Tv, ATM I've got selinux disabled, so thats not a factor
[22:01] <nhorman> Tv, and the thing is, Fedora shouldn't really do anything odd with libkeyutils at all
[22:01] <Tv> nhorman: i'm thinking more kernel than libkeyutils
[22:01] <nhorman> Tv, if you want the executive recap, I was trying to setup an F16 ceph single node cluster with cephx authentication
[22:02] <nhorman> but the mount command kept failing with EPERM
[22:02] <nhorman> I tracked it down to the fact that the kernel could neither search, nor read keys in the KEY_SPEC_USER_KEYRING
[22:02] <nhorman> and I was unable to change permissions there
[22:03] <nhorman> But if I add the key in mount.ceph to the KEY_SPEC_PROCESS_KEYRING, things go much more smoothly (setting aside the bug that joshd and I were just discussing)
[22:04] <nhorman> joshd, btw, I was just looking, and libkeyutils doesnt 'create' those special keyrings. They're just defines that just represent keyrings that are allocated automatically on clones and other control points in the kernel
[22:05] <nhorman> and since I hit the same problem with the upstream kernel, I'm kind of left wondering if debian (I think that's the ceph development distro, right?) isn't carrying a patch to make the USER keyring work that upstream doesn't yet have
[22:05] <Tv> nhorman: we run upstream kernels
[22:06] <nhorman> Tv, then I'm really confused, because I spent all yesterday running stap scripts watching mount.ceph fail in key_validate
[22:07] <Tv> nhorman: might be something about actual mount binary
[22:07] <Tv> nhorman: did you call mount.ceph directly, or through mount?
[22:08] <nhorman> Tv, did it both ways
[22:08] <nhorman> thats what I was talking to joshd about above, I was playing with set_kernel_secret in mount.ceph, and found that I was unable to change permissions for the USER keyring
[22:09] <nhorman> despite having appropriate permission bits set on it (as dumped out by keyctl_describe)
[22:09] <nhorman> every call to keyctl_setperm failed
[22:09] <nhorman> but If I changed the keyring used in add_key to KEY_SPEC_PROCESS_KEYRING, everything started working
[22:11] <nhorman> sorry, I misspoke before, it wasn't key_validate that was failing, it was keyring_search_aux, which failed because key_task_permission indicated that the USER keyring didn't have search permissions
[22:12] <nhorman> but I wasn't able to add search permissions because the calling process didn't have setattr permissions
[22:14] <nhorman> Tv, yeah, actually, looking at this I think I see how its failing.
[22:14] <nhorman> Tv, the KEY_SPEC_USER_KEYRING is the special define used to reference the uid_keyring that is pointed at by a creds struct in the kernel.
[22:17] <nhorman> wait, nm, it sets KEY_USR_ALL, so I'm back to being confused
[22:18] <nhorman> oh well, I'll figure it out. Thanks for listening :)
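The permission discussion above turns on which bits are set in a key's permission mask (search, setattr, KEY_USR_ALL). As a hedged aid for reading the hex mask that a key description prints, here is a small decoder. The bit layout (six permission bits per class, with possessor, user, group and other packed into successive bytes) follows the constants in the kernel's include/linux/key.h as understood at the time of writing; treat the values as assumptions to check against your own kernel headers. The function name is hypothetical.

```python
from typing import Dict, List

# Per-class permission bits (assumed to match KEY_*_VIEW ... KEY_*_SETATTR).
PERM_BITS = (
    ("view", 0x01),
    ("read", 0x02),
    ("write", 0x04),
    ("search", 0x08),
    ("link", 0x10),
    ("setattr", 0x20),
)

# Each class occupies one byte of the 32-bit mask, possessor highest.
CLASSES = (("possessor", 24), ("user", 16), ("group", 8), ("other", 0))

KEY_POS_ALL = 0x3F000000
KEY_USR_ALL = 0x003F0000


def decode_key_perm(mask: int) -> Dict[str, List[str]]:
    """Split a key permission mask into named permissions per class."""
    decoded = {}
    for cls, shift in CLASSES:
        bits = (mask >> shift) & 0x3F
        decoded[cls] = [name for name, bit in PERM_BITS if bits & bit]
    return decoded


if __name__ == "__main__":
    # A mask with only KEY_USR_ALL set grants everything to the owning
    # user but nothing to the possessor class.
    for cls, perms in decode_key_perm(KEY_USR_ALL).items():
        print(f"{cls:9s} {','.join(perms) or '-'}")
```

Note that the possessor and user classes are checked separately, so a keyring can carry KEY_USR_ALL and still grant nothing through the possessor bits. Whether that distinction explains the failure described here is not established by this log.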
[22:35] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:36] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:41] * votz (~votz@pool-108-52-121-248.phlapa.fios.verizon.net) has joined #ceph
[22:47] * verwilst (~verwilst@dD576F293.access.telenet.be) Quit (Quit: Ex-Chat)
[23:18] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[23:46] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.