#ceph IRC Log


IRC Log for 2012-10-12

Timestamps are in GMT/BST.

[0:08] <elder> nhm_, do you ever use Vidyo on Linux any more?
[0:08] <elder> I was just trying it. I seem to remember having to start from the command line or something.
[0:12] <amatter> howdy guys. is there an easy way to change the ip addresses of all the members in a cluster?
[0:12] * steki-BLAH (~steki@ Quit (Quit: I'm off, you do whatever you want...)
[0:14] <joshd> amatter: the monitor ips are the only ones that are well-known (and fixed). for those you'll need to add new monitors and remove old ones while keeping a majority running
[0:15] <amatter> joshd- ok, thanks
[0:15] <joshd> amatter: then update the monitor addresses in the rest of your cluster's configuration, you might be good to go after that (restarting the osds will have them use whatever network is available/specified)
[0:16] * synapsr (~synapsr@50-0-17-201.dsl.static.sonic.net) has joined #ceph
[0:16] <joshd> I think clients and osds that are already connected will continue working
[0:17] * The_Bishop (~bishop@2001:470:50b6:0:719e:9c41:8a74:8843) Quit (Remote host closed the connection)
[0:18] <amatter> joshf-
[0:18] <amatter> oops
[0:19] * synapsr (~synapsr@50-0-17-201.dsl.static.sonic.net) Quit (Remote host closed the connection)
[0:19] * aliguori (~anthony@cpe-70-123-130-163.austin.res.rr.com) has joined #ceph
[0:20] <amatter> joshd: I'm trying to move the whole cluster to another ip subnet. Can I just create a new monmap reflecting the new location of the monitors?
[0:22] <joshd> I don't think that will work, but joao or gregaf may have a better idea
[0:22] * synapsr (~synapsr@50-0-17-201.dsl.static.sonic.net) has joined #ceph
[0:24] <atrius> okay... am i just blind (probably?) or am i not seeing where one ties a ceph volume to real disk?
[0:24] <dmick> atrius: are you trying to get a kernel-usable rados block device?
[0:25] <dmick> ("tie a ceph volume to a real disk" can be interpreted a lot of ways)
[0:25] <atrius> dmick: that's the end goal, yeah.. specifically i'm setting it up for openstack
[0:25] <gregaf> joshd: amatter: you might be able to do that kind of manipulation but it isn't easy or recommended
[0:25] <dmick> oh for openstack you probably don't want a kernel-usable driver
[0:25] <atrius> dmick: ah, true... well.. the actual data must be stored "somewhere"... where is that?
[0:25] <gregaf> you're better off doing an incremental transfer by adding and removing monitors
[0:25] <benpol> I'm scheming to put together a small ceph cluster at work. I'd like to put the OSD journals on an SSD (as that seems to be the most commonly recommended approach). Are there any rules of thumb regarding how many OSD journals an individual SSD can handle?
[0:25] <dmick> so openstack can use qemu, which knows how to talk to librbd in userland and open a RADOS cluster, and access images there
[0:26] <amatter> there's no data on this cluster. maybe best to just recreate it at the new location
[0:26] <dmick> (i.e. no kernel path)
[0:26] <atrius> dmick: okay.. but the data needs to be stored somewhere on physical disk.. where is that? i've got a pair of 2TB drives to use for the purpose, but not sure how to do so
[0:27] <dmick> atrius: the rados cluster is a set of machines and daemons that run on top of Linux filesystems
[0:27] <joshd> benpol: divide your desired measure (IOPS or throughput) of the SSD by that same measure for a single OSD disk; the result is the number of OSD disks
[0:27] <benpol> joshd: fair enough, thanks!
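Note: joshd's rule of thumb for sizing journals per SSD can be written down as a small calculation. This is a minimal sketch, not a Ceph tool; the function name `journals_per_ssd` and the throughput figures are made up for illustration, and both inputs must be measured in the same unit.

```python
import math


def journals_per_ssd(ssd_measure: float, osd_disk_measure: float) -> int:
    """Estimate how many OSD journals one SSD can host.

    Both arguments use the same unit, for example MB/s of sustained
    writes or write IOPS. The result is rounded down so the SSD does
    not become the bottleneck.
    """
    if ssd_measure <= 0 or osd_disk_measure <= 0:
        raise ValueError("measures must be positive")
    return math.floor(ssd_measure / osd_disk_measure)


# Hypothetical figures: an SSD sustaining 400 MB/s of writes in front of
# spinning disks that each sustain 100 MB/s.
print(journals_per_ssd(400, 100))
```

Rounding down is a deliberate choice here: a fractional journal would mean the SSD is slower than the disks it fronts.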
[0:27] <dmick> ah. so you haven't set up a cluster yet
[0:27] <joshd> amatter: yeah, that's probably easier
[0:27] <atrius> dmick: right.. i was making sure my network was squared away.. now it is time to setup the cluster :D
[0:28] <dmick> ok. setting up a cluster is clearly your first step. http://ceph.com/docs/master/start/ for an intro
[0:28] <atrius> oh.. i think i found it elsewhere.. it is defined at the [osd] level?
[0:28] <dmick> um...
[0:29] <dmick> what do you have set up so far?
[0:29] <atrius> dmick: nothing.. i was reading the quick-start and not seeing where anything that looked like a path was defined... something then seemed "off" to me
[0:29] <dmick> there are many layers here
[0:32] <dmick> when I say "a cluster", I mean something like "1-3 monitor daemons, each with their own chunk of native filesystem, plus N OSD daemons, each with their own chunk of native filesystem, talking on the net, and listening for requests" (you can also optionally have other things but for rbd that's sorta the minimum)
[0:32] <dmick> the bottom two boxes on http://ceph.com/docs/master/
[0:32] <atrius> ah, okay... found the ceph-conf docs.. now it makes sense... i was thinking it might end up in /var/lib/ceph/osd/$cluster-$id :D
[0:32] <dmick> so that's one of the places one of the daemons stores information
[0:33] <atrius> are monitors particularly resource hungry?
[0:33] <dmick> no
[0:33] <atrius> could be a VM?
[0:33] <dmick> don't need to be separate hosts, even
[0:34] <dmick> you can run multiple daemons per host
[0:34] <dmick> just depends on what you're trying to do.
[0:34] <atrius> is it correct to say that the OSD path is where most of the disk is consumed?
[0:34] <dmick> the OSDs do most of the object storage, yes
[0:35] <atrius> okay.. so when openstack goes to create a VM or volume, that's where the actual data ends up?
[0:35] <dmick> yes, in encoded form, of course
[0:36] <atrius> right
[0:36] <benpol> do rbd mappings persist between reboots of a ceph client?
[0:37] <benpol> (and if so where is the mapping stored?)
[0:37] <atrius> currently i've got a pair of 2TB drives in a single host.. is there any real advantage to doing anything special with them? or would it just be better to just format them and stick them each on a separate OSD path?
[0:37] <dmick> pretty sure "no", benpol
[0:38] <dmick> you could add rc scripts to do the add (it's just writing to a magic path), but nothing does that now that I'm aware of
[0:38] <benpol> dmick: ok yeah, just part of the yak shaving, so to speak
[0:38] <dmick> atrius: we tend to recommend an OSD per drive
[0:38] <atrius> dmick: okay, that's what i was thinking too :)
[0:39] <dmick> the OSDs/cluster manage redundancy, so raiding them is just sorta wasted
[0:39] <atrius> okay.. off to build a cluster :D
[0:39] <atrius> dmick: that's what i thought
[0:39] <dmick> gl!
[0:39] <dmick> come back if you have questions
[0:39] <atrius> thanks :)
[0:39] <dmick> joshd: that's true, right, rbd mappings aren't persisted anywhere?
[0:40] <atrius> well, one just popped in my head... btrfs or xfs? is btrfs "stable" enough at this point?
[0:40] <joshd> yeah, that's right. I think there's a feature request for that in the tracker from a while back
[0:40] <dmick> atrius: it'll work, but we've seen aging performance degradation that's worse on btrfs
[0:40] <dmick> for a toy cluster it probably won't matter much
[0:40] <atrius> okay, i'll skip it for now
[0:40] <dmick> but we're tending to recommend xfs for now
[0:54] * PerlStalker (~PerlStalk@ Quit (Quit: home)
[0:58] * synapsr (~synapsr@50-0-17-201.dsl.static.sonic.net) Quit (Remote host closed the connection)
[0:58] <atrius> somewhat randomly... is there a known speed difference between ext4 and xfs?
[0:59] <joshd> I don't think we have recent numbers from ext4
[1:00] <atrius> was there a speed difference in the past?
[1:00] <joshd> I don't remember the details, but nhm_ could tell you
[1:00] <dmick> performance benchmarking even simple I/O loads is much harder than you think; doing it with ceph in the way is a lot harder
[1:00] <dmick> we're studying, but the answer is never simply "x is better than y"
[1:01] <gregaf> http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[1:01] <gregaf> you can look at some benchmarks here if you like
[1:01] <gregaf> sometimes xfs is faster, sometimes ext4 is
[1:01] <gregaf> but for now we recommend xfs as the filesystem of choice
[1:01] <atrius> i wasn't thinking of ceph in that case.. i was just poking at my SSD equipped desktop and the spinning disk server... results were interesting.. that's what made me think of it
[1:01] <atrius> gregaf: okay :)
[1:02] <dmick> aw I didn't realize nhm had finally published that
[1:02] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:04] * synapsr (~synapsr@50-0-17-201.dsl.static.sonic.net) has joined #ceph
[1:04] <scalability-junk> does ceph currently support deduplication within their storage cluster?
[1:04] <scalability-junk> or is it on the roadmap?
[1:05] <gregaf> no, and not really
[1:06] <scalability-junk> damn would be great
[1:06] <scalability-junk> saying two users put up the same object just have 3 replicas instead of 6
[1:07] <atrius> w00t... cluster is alive
[1:07] <scalability-junk> or even better deduplicating object chunks (saving for image files would be big I assume)
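Note: the whole-object deduplication scalability-junk describes (two users uploading the same object, 3 replicas stored instead of 6) can be illustrated with a toy content-addressed store. This is only a sketch of the idea and not anything Ceph implements; `DedupStore`, its methods, and the replication factor of 3 are assumptions for illustration. It ignores deletion, reference counting, and distribution across nodes, which is where the real complexity lies.

```python
import hashlib

REPLICAS = 3  # assumed replication factor, matching the "3 instead of 6" example


class DedupStore:
    """Toy store that keeps identical object contents only once."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes, stored once
        self.names = {}  # (user, object name) -> content hash

    def put(self, user: str, name: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is not stored twice.
        self.blobs.setdefault(digest, data)
        self.names[(user, name)] = digest

    def get(self, user: str, name: str) -> bytes:
        return self.blobs[self.names[(user, name)]]

    def physical_copies(self) -> int:
        # Each distinct blob is replicated REPLICAS times.
        return len(self.blobs) * REPLICAS


store = DedupStore()
store.put("alice", "image.iso", b"same bytes")
store.put("bob", "copy.iso", b"same bytes")
print(store.physical_copies())
```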
[1:07] * elder (~elder@2607:f298:a:607:1059:f4c1:babe:9559) Quit (Quit: Leaving)
[1:07] <scalability-junk> gregaf, any reason why it's not on the roadmap?
[1:08] <gregaf> it's hilariously complicated and we have a lot of other things to work on first
[1:08] <gregaf> I mean, we're interested and it'd be a really fun research problem for us
[1:09] <gregaf> but there's a lot of other functionality related to administration and management that we'd like to get down first
[1:09] <gregaf> and it's not super-important for most of our use cases right now
[1:10] <joshd> scalability-junk: with the block device, you can do copy-on-write clones from images if you store images in rbd as well (with openstack folsom)
[1:11] <scalability-junk> gregaf, ah alright thanks
[1:11] <gregaf> right, I should have mentioned what joshd did, which is that there's already a solution for copy-on-write block device images
[1:11] * elder (~elder@2607:f298:a:607:38d5:c5b3:89cb:71e1) has joined #ceph
[1:11] <gregaf> that goes in the other direction and is a lot easier ;)
[1:12] <scalability-junk> ah alright thanks joshd and gregaf
[1:37] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[1:41] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) has left #ceph
[1:43] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[1:46] * synapsr (~synapsr@50-0-17-201.dsl.static.sonic.net) Quit (Remote host closed the connection)
[1:48] <wido> Nice! Just found a Atom board which supports 8GB of memory :)
[1:49] <dmick> hah. can Atom even see 8GB? :-P
[1:49] <wido> Intel says the boards do 4GB max, but you can put 8GB on them and it will work with 64-bit Linux
[1:49] <wido> dmick: It can, but that is nice
[1:49] <dmick> (that's actually cool)
[1:50] <wido> as you might know I'm still on a personal conquest to have OSDs run on Atoms
[1:50] <atrius> anyone know off hand why nova-volume would insist on looking for nova-volumes even after i've set the configuration to use rbd?
[1:51] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[1:51] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[1:51] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:52] <joshd> atrius: what do you mean by 'looking for nova-volumes'?
[1:52] <atrius> joshd: nova-volume is refusing to start because "volume group nova-volumes" doesn't exist... as if it were still configured to use the default volume_driver
[1:55] <joshd> atrius: definitely sounds like it's not picking up the configuration
[1:56] * tren (~Adium@2001:470:b:2e8:d9c7:b3:1ac:9ddf) has joined #ceph
[1:56] <joshd> atrius: you can strace it to see what config file it's trying to read
[1:56] <tren> Gregaf: Don't suppose you're around are ya? Have a question
[1:57] <atrius> joshd: i'll try that... though, if it isn't reading the config file at all, nova-compute would fail as well and it's happy as a clam
[1:57] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:07] * tren (~Adium@2001:470:b:2e8:d9c7:b3:1ac:9ddf) Quit (Quit: Leaving.)
[2:08] <dmick> tren: no, Greg left for the day
[2:14] <atrius> well... this is fun
[2:14] * atrius kicks nova-volume
[2:17] <joshd> atrius: if nova does revert back to the default when the setup step for the driver doesn't work, and that's causing your problem, 'sudo -u nova rados lspools | grep volumes' will fail
[2:19] <atrius> joshd: trying that
[2:20] <atrius> joshd: works
[2:25] <atrius> it really is acting like it isn't even bothering to read the config
[2:25] <atrius> that said.. the strace implies that it is
[2:27] <joshd> does the log say anything about volume_driver?
[2:30] <atrius> joshd: lots of things... grepping for volume_driver gets a bunch of lines referring to nova.volume.driver.ISCSIDriver
[2:30] <atrius> which isn't configured
[2:37] <atrius> grrrr... it's up... syntax error in the documentation...
[2:39] <atrius> okay... with that all said and done... :D
[2:40] <joshd> glad you figured it out
[2:40] <joshd> what was the error?
[2:40] <atrius> it should be listed as --volume_driver (as page B said) and not "volume_driver" (as page A said)
[2:41] <atrius> http://docs.openstack.org/trunk/openstack-compute/admin/content/rados.html <-- that page is wrong and will never work
[2:43] <joshd> oh, I think there's two formats for the config file, and the one used depends on how it's passed to the service
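Note: the two config formats joshd mentions differ only in the leading `--` on each option. A minimal sketch of how one might check which style a file uses follows. The helpers `parse_option` and `detect_style` are hypothetical and not part of nova, and the value `nova.volume.driver.RBDDriver` is an assumed example, not taken from the log.

```python
def parse_option(line: str):
    """Return (key, value) from '--key=value' or 'key=value', else None."""
    text = line.strip()
    # Skip blanks, comments, and ini section headers such as [DEFAULT].
    if not text or text.startswith("#") or text.startswith("["):
        return None
    if text.startswith("--"):
        text = text[2:]
    key, sep, value = text.partition("=")
    if not sep:
        return None  # bare flags without a value are ignored in this sketch
    return key.strip(), value.strip()


def detect_style(lines):
    """Return 'flagfile' if any key=value option starts with '--', else 'ini'."""
    for line in lines:
        if line.strip().startswith("--") and parse_option(line) is not None:
            return "flagfile"
    return "ini"


flagfile = ["--volume_driver=nova.volume.driver.RBDDriver"]
ini = ["[DEFAULT]", "volume_driver = nova.volume.driver.RBDDriver"]
print(detect_style(flagfile), detect_style(ini))
```

Both spellings carry the same key and value; what matters is that the file matches the style the service was started with.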
[2:43] <atrius> possibly... either way.. annoying
[2:43] <atrius> oh well.. on to the next thing which is broke
[2:46] <atrius> so, is it expected that new instances will be created "in" ceph?
[2:47] <joshd> not yet
[2:47] <joshd> you'll need to create a volume, and boot from that
[2:47] <atrius> ah, okay
[2:48] <atrius> so images and volumes can go in there, but not just "spawn a new random instance"
[2:48] <joshd> yeah, probably in the next release
[2:48] <joshd> are you using Folsom?
[2:48] <atrius> no, Essex
[2:49] <joshd> ah, that's trickier then
[2:49] <atrius> lol... figures :d
[2:49] <joshd> Essex doesn't have a way to put data from an image onto a volume
[2:49] <atrius> :D
[2:49] <joshd> you have to do it manually
[2:49] <joshd> like, attach to an instance, dd an image over the device, and then you can boot from it
[2:50] <atrius> lets say then i moved to folsom... what then? my end goal here is to get live-migration working along with survivable volumes for DB access
[2:51] <atrius> for the second one it would seem things would work as they are right now.. it's the first i'm trying to get working at this point
[2:52] <joshd> with folsom, you can create volumes from images, and if you store the images in rbd too, volumes can be copy-on-write clones of them
[2:53] <atrius> that would be pretty cool
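Note: the copy-on-write cloning joshd describes can be illustrated with a toy block device. This is a sketch of the general technique only, not rbd's implementation; the classes `Image` and `CowClone` are hypothetical, and the parent image is treated as read-only.

```python
class Image:
    """A read-only base image made of fixed-size blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def read(self, index: int) -> bytes:
        return self.blocks[index]


class CowClone:
    """A clone that stores only the blocks written after cloning."""

    def __init__(self, parent: Image):
        self.parent = parent
        self.own = {}  # block index -> data written to this clone

    def read(self, index: int) -> bytes:
        # Blocks the clone has written shadow the parent's blocks.
        if index in self.own:
            return self.own[index]
        return self.parent.read(index)

    def write(self, index: int, data: bytes) -> None:
        # Writes land in the clone; the parent is never modified.
        self.own[index] = data


base = Image([b"boot", b"root", b"data"])
volume = CowClone(base)
volume.write(2, b"mine")
print(volume.read(0), volume.read(2), base.read(2))
```

Creating the clone copies nothing up front, which is why a volume made this way is cheap in both time and space until it diverges from the image.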
[2:53] <joshd> live migration I'm not sure about, I suspect there are still issues with it when using any kind of volume, but I haven't tried
[2:53] <joshd> it works fine at the libvirt level, but OpenStack has its own restrictions about live migration
[2:54] <atrius> how so?
[2:55] <joshd> it has its own way of determining what's live-migratable, and I think it may try to detach/reattach volumes (since that's needed for some other backends)
[2:55] <atrius> ah, okay
[2:56] <atrius> well.. lets say for a moment we tossed openstack out the window... ceph alone would be enough for live migration of VMs?
[2:56] <joshd> yeah, you can manage it directly with libvirt
[2:57] <joshd> the networking side might require some extra scripting, but I'm not sure how well that works in OpenStack right now either
[2:58] <atrius> yeah, that's one thing that completely blew up the other night... i did a live-migration of a VM using the previous storage method... it finally... finally... got across... and the networking was completely jacked
[2:59] <joshd> libvirt by itself is pretty capable - you can hook into various vm events, and use ebtables for firewall rules
[2:59] <atrius> it has, more than once, made me think we'd be better off just giving openstack a pass and go direct to the source, as it were
[3:02] <joshd> cloudstack also has rbd support in the almost-released 4.0 version
[3:02] <atrius> i could give them another look... do you know if that version is less.... complicated... than the previous one?
[3:03] <joshd> I didn't use earlier versions, but... probably not
[3:03] <atrius> nuts... that's what made me drop it like a rock... it was certainly prettier than openstack (dashboardwise)... but holy crap it was complicated with all the management VMs and such
[3:04] <joshd> yeah... lots of layers of abstraction and indirection there too
[3:04] <atrius> yeah
[3:05] <atrius> hell.. sometimes i wonder if those who say we should just go with vmware and be done with it might not be wrong... lol
[3:10] * scalability-junk (~stp@188-193-208-44-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[3:15] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[3:24] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[3:30] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[3:30] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[3:30] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[3:32] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[3:37] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[3:41] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) has joined #ceph
[3:42] <scalability-junk> mh just a theory question while I'm working through the docs and papers of ceph
[3:42] <scalability-junk> when I have a one node setup with 2 disks what would be the best solution?
[3:44] <scalability-junk> raid1 and having only one replica or having one partition on each disk with one replica each? the other partition on each disk should probably be used for os related stuff with a raid1?
[3:44] <scalability-junk> any suggestion
[3:45] <scalability-junk> when running a nova compute on top of such a node I probably would need 3 partitions on each disk. one for os (raid1), one for nova (raid?) and one for ceph...
[3:52] <dmick> generally I suggest one OSD per drive
[3:52] <dmick> no other raid
[3:53] <dmick> you'll want someplace to keep journals too
[3:53] <dmick> but those are smaller
[3:55] <atrius> can i use rbd as regular block devices in a host OS? if so, what stops me from just formatting and mounting them like any other block device? anything?
[3:56] <dmick> you can create an access path to an RBD image through librbd or the rbd kernel driver
[3:56] <dmick> with the kernel driver, yeah, it's just a block device
[3:56] <dmick> tjat
[3:56] <dmick> that's the most-compatible view
[3:56] <atrius> could it be accessed from multiple machines at the same time? or would it end up with massive corruption at that rate?
[3:57] <dmick> think of it as a drive
[3:57] <dmick> nothing stops you from shooting yourself in the foot with a drive or an rbd 'virtual drive'
[3:57] <atrius> lol... fair enough :D
[3:57] <dmick> rbd doesn't support any kind of device reservation or any multi-initiator protocols
[3:57] <atrius> okay
[4:01] <scalability-junk> dmick, so you would suggest having 4 disks? 2 for os raid1 and nova compute and 2 for ceph?
[4:01] <scalability-junk> dmick, and journals can be held inside the osd I thought or is that bad?
[4:07] <joao> scalability-junk, you probably want a disk for journal, to avoid competing for bandwidth with the osd
[4:07] <scalability-junk> seems like I should have a lot of disks :P
[4:07] <joao> but I'm not an expert on that matter, so don't take my word for it
[4:07] <joao> s/for/on
[4:08] <scalability-junk> yeah read in the docs it's best to use an ssd for journal
[4:08] <joao> furthermore, it's late and my judgment may not be in the best shape ever
[4:08] <scalability-junk> yeah 4am :D
[4:08] <joao> 3am here
[4:09] <joao> and I'll take this moment to excuse myself and go to sleep :p
[4:10] <joao> nighty night #ceph
[4:10] <scalability-junk> night
[4:11] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[4:19] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[4:51] * miroslavk (~miroslavk@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[5:17] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[5:21] * maelfius (~mdrnstm@ Quit (Quit: Leaving.)
[5:30] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[5:30] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:32] * Tobarja (~athompson@cpe-071-075-064-255.carolina.res.rr.com) Quit (Read error: Connection reset by peer)
[5:58] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[6:03] * dmick (~dmick@2607:f298:a:607:84b4:cd46:ebad:f64e) Quit (Quit: Leaving.)
[6:08] * chutzpah (~chutz@ Quit (Quit: Leaving)
[6:33] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[6:51] * miroslavk (~miroslavk@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[7:53] * deepsa (~deepsa@ has joined #ceph
[8:07] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:17] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:27] * tryggvil (~tryggvil@s529d22d5.adsl.online.nl) Quit (Quit: tryggvil)
[8:30] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[8:31] * deepsa (~deepsa@ has joined #ceph
[8:43] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[9:26] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:34] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[9:46] * deepsa (~deepsa@ Quit (resistance.oftc.net larich.oftc.net)
[9:46] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) Quit (resistance.oftc.net larich.oftc.net)
[9:46] * dok (~dok@static-50-53-68-158.bvtn.or.frontiernet.net) Quit (resistance.oftc.net larich.oftc.net)
[9:46] * gohko (~gohko@natter.interq.or.jp) Quit (resistance.oftc.net larich.oftc.net)
[9:46] * sage1 (~sage@ Quit (resistance.oftc.net larich.oftc.net)
[9:46] * rosco (~r.nap@ Quit (resistance.oftc.net larich.oftc.net)
[9:46] * jeffhung_ (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (resistance.oftc.net larich.oftc.net)
[9:46] * iggy (~iggy@theiggy.com) Quit (resistance.oftc.net larich.oftc.net)
[9:46] * f4m8_ (f4m8@kudu.in-berlin.de) Quit (resistance.oftc.net larich.oftc.net)
[9:46] * asadpanda (~asadpanda@ Quit (resistance.oftc.net larich.oftc.net)
[9:46] * SpamapS (~clint@xencbyrum2.srihosting.com) Quit (resistance.oftc.net larich.oftc.net)
[9:48] * deepsa (~deepsa@ has joined #ceph
[9:48] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) has joined #ceph
[9:48] * dok (~dok@static-50-53-68-158.bvtn.or.frontiernet.net) has joined #ceph
[9:48] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[9:48] * sage1 (~sage@ has joined #ceph
[9:48] * rosco (~r.nap@ has joined #ceph
[9:48] * SpamapS (~clint@xencbyrum2.srihosting.com) has joined #ceph
[9:48] * iggy (~iggy@theiggy.com) has joined #ceph
[9:48] * asadpanda (~asadpanda@ has joined #ceph
[9:48] * f4m8_ (f4m8@kudu.in-berlin.de) has joined #ceph
[9:48] * jeffhung_ (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[10:07] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[10:07] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[10:09] * loicd (~loic@ has joined #ceph
[10:21] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[10:26] * atrius (~atrius@24-179-64-97.dhcp.jcsn.tn.charter.com) Quit (Ping timeout: 480 seconds)
[10:44] * tryggvil (~tryggvil@s529d22d5.adsl.online.nl) has joined #ceph
[10:45] * tryggvil (~tryggvil@s529d22d5.adsl.online.nl) Quit ()
[10:54] <Fruit> crushtool: crush/builder.c:142: crush_add_bucket: Assertion `map->buckets[pos] == 0' failed.
[10:55] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[10:55] * loicd (~loic@ has joined #ceph
[10:58] <Fruit> ah, non-unique ids
[11:09] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:12] * gaveen (~gaveen@ has joined #ceph
[11:29] * tryggvil (~tryggvil@2001:610:188:431:1c33:ea19:9cbb:d284) has joined #ceph
[11:29] <Fruit> buffer::malformed_input: __PRETTY_FUNCTION__ decode past end of struct encoding
[11:29] <Fruit> after fgetxattr
[11:40] * tziOm (~bjornar@ has joined #ceph
[11:59] * tryggvil_ (~tryggvil@host-200-1.eduroamers.nl) has joined #ceph
[12:01] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[12:05] * tryggvil_ (~tryggvil@host-200-1.eduroamers.nl) Quit (Quit: tryggvil_)
[12:05] * tryggvil (~tryggvil@2001:610:188:431:1c33:ea19:9cbb:d284) Quit (Ping timeout: 480 seconds)
[12:06] * psomas_ (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[12:10] * tryggvil (~tryggvil@2001:610:188:431:ac32:fd4d:70f7:3ce2) has joined #ceph
[12:10] * gaveen (~gaveen@ has joined #ceph
[12:13] <masterpe> Good morning
[12:13] <masterpe> I need to reinstall two ceph nodes, what is the best procedure I can follow?
[12:14] * loicd (~loic@ Quit (Quit: Leaving.)
[12:14] <masterpe> (the nodes have the function mon, osd)
[12:15] <masterpe> 1. Remove the node from the mon and osd cluster?
[12:16] <masterpe> or 2. make a backup of the complete node and restore the node on the new hardware?
[12:16] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[12:21] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) Quit (Quit: Leaving.)
[12:41] * tryggvil_ (~tryggvil@host-243-1.eduroamers.nl) has joined #ceph
[12:46] * tryggvil (~tryggvil@2001:610:188:431:ac32:fd4d:70f7:3ce2) Quit (Ping timeout: 480 seconds)
[12:46] * tryggvil_ is now known as tryggvil
[13:02] <wido> masterpe: If it's an OSD, you can leave it in the cluster configuration
[13:03] <wido> just shut it down, format the node, bring it back and Ceph will do the rest
[13:03] <wido> For the monitor, you can shut it down, tarball the data directory if you want, but you can also wipe it
[13:03] <wido> backup the cephx key
[13:07] * MikeMcClurg (~mike@ has joined #ceph
[13:08] <joao> if it's a monitor, you should pay attention to the number of monitors you have in the cluster
[13:09] <joao> and how many monitors are up and down, and how many will be down after you bring that monitor node down
[13:09] <joao> if you happen to break quorum, your cluster will be unresponsive until the monitor is brought back up
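Note: joao's warning about quorum comes down to a majority check. A minimal sketch, assuming that quorum requires a strict majority of the configured monitors; the function names are hypothetical and this is not a Ceph command.

```python
def has_quorum(total_monitors: int, monitors_up: int) -> bool:
    """True if strictly more than half of the configured monitors are up."""
    return monitors_up > total_monitors // 2


def can_take_down(total_monitors: int, currently_down: int, taking_down: int = 1) -> bool:
    """Check whether quorum survives taking more monitors down."""
    up_after = total_monitors - currently_down - taking_down
    return has_quorum(total_monitors, up_after)


# With 3 monitors, one can be reinstalled at a time, but not while
# another one is already down.
print(can_take_down(3, 0), can_take_down(3, 1))
```

For masterpe's case this means reinstalling the two mon nodes one after the other, waiting for the first to rejoin before stopping the second.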
[13:10] <Fruit> so, with zfs as a backend, something goes wrong with the extended attribute parsing
[13:10] <Fruit> not sure how to debug this
[13:14] * deepsa_ (~deepsa@ has joined #ceph
[13:14] <joao> I heard that some other people tried using ceph with zfs and ended up running into issues
[13:14] <joao> not sure how I can help though, but I don't mind giving it a shot :)
[13:14] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[13:14] * deepsa_ is now known as deepsa
[13:28] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[13:28] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[13:29] <wido> Fruit: zfsonlinux.org?
[13:29] <Fruit> yeah
[13:29] <wido> Nice! That was still on my todo
[13:29] <wido> seems pretty stable
[14:11] * tryggvil (~tryggvil@host-243-1.eduroamers.nl) Quit (Quit: tryggvil)
[14:14] * loicd (~loic@ has joined #ceph
[14:15] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) has joined #ceph
[14:24] * tryggvil (~tryggvil@2001:610:188:431:78d8:e937:9c1b:b9c1) has joined #ceph
[14:52] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[14:55] * tryggvil_ (~tryggvil@host-127-2.eduroamers.nl) has joined #ceph
[15:00] * tryggvil (~tryggvil@2001:610:188:431:78d8:e937:9c1b:b9c1) Quit (Ping timeout: 480 seconds)
[15:00] * tryggvil_ is now known as tryggvil
[15:04] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:11] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[15:44] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[15:49] * tryggvil_ (~tryggvil@host-127-2.eduroamers.nl) has joined #ceph
[15:53] * tryggvil (~tryggvil@host-127-2.eduroamers.nl) Quit (Read error: Connection reset by peer)
[15:53] * tryggvil_ is now known as tryggvil
[16:02] * scalability-junk (~stp@188-193-208-44-dynip.superkabel.de) has joined #ceph
[16:04] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[16:10] * scalability-junk (~stp@188-193-208-44-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[16:11] * tryggvil (~tryggvil@host-127-2.eduroamers.nl) Quit (Quit: tryggvil)
[16:19] * loicd (~loic@ Quit (Quit: Leaving.)
[16:19] * loicd (~loic@magenta.dachary.org) has joined #ceph
[16:21] * PerlStalker (~PerlStalk@ has joined #ceph
[16:22] * scalability-junk (~stp@188-193-208-44-dynip.superkabel.de) has joined #ceph
[16:50] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[17:02] * Tv_ (~tv@ has joined #ceph
[17:04] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[17:06] * BManojlovic (~steki@ Quit (Quit: I'm off, you do whatever you want...)
[17:18] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[17:19] * deepsa (~deepsa@ has joined #ceph
[17:19] <nhm_> Fruit: ah, you gave it a try!
[17:20] <nhm_> Fruit: any luck if you use "filestore xattr use omap = true"?
[17:28] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[17:38] * cattelan (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) Quit (Ping timeout: 480 seconds)
[17:43] * jeffp (~jplaisanc@net66-219-41-161.static-customer.corenap.com) has left #ceph
[17:43] * tren (~Adium@ has joined #ceph
[17:49] * cattelan (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) has joined #ceph
[17:56] * jlogan1 (~Thunderbi@2600:c00:3010:1:5858:566a:6064:ca29) has joined #ceph
[18:12] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[18:13] <Tv_> awesome email exchange.. "RBD snapshots are immutable by design." "1.I want to create writable snapshot."
[18:14] * tryggvil (~tryggvil@host-127-2.eduroamers.nl) has joined #ceph
[18:14] <jamespage> Tv_, are you the guy who wrote the upstart configuration for ceph?
[18:14] <Tv_> jamespage: yup
[18:15] <jamespage> Tv_, cool - I've been using them in some work I've been doing deploying ceph with Ubuntu Juju
[18:15] * jlogan1 (~Thunderbi@2600:c00:3010:1:5858:566a:6064:ca29) Quit (Quit: jlogan1)
[18:15] <Tv_> jamespage: sweet -- do ask if there's anything funky
[18:16] <jamespage> Tv_, love the way OSD device initialization works - makes life really easy
[18:16] <Tv_> jamespage: ceph-disk-prepare et al? yeah
[18:16] <Tv_> jamespage: i *really* wanted that stuff to be less painful
[18:16] <jamespage> Tv_: +1000
[18:16] <jamespage> rocks
[18:16] * tryggvil (~tryggvil@host-127-2.eduroamers.nl) Quit ()
[18:16] <Tv_> jamespage: also see latest about external journal devices
[18:17] <Tv_> jamespage: upstart makes a lot of things a whole lot simpler, too
[18:17] <jamespage> Tv_, I need to have a catchup
[18:17] <nhm_> Tv_: ooh, nice
[18:17] * jlogan1 (~Thunderbi@2600:c00:3010:1:5858:566a:6064:ca29) has joined #ceph
[18:18] <Tv_> still in branch wip-hotplug-journal
[18:18] <Tv_> will get it merged today
[18:18] <Tv_> https://github.com/ceph/ceph/commit/9f84209f6789cf9f733b515a51da444c6c6b1315
[18:19] <Tv_> heh i see a typo ;)
[18:19] <jamespage> Tv_, nice!
[18:19] <Tv_> https://github.com/ceph/ceph/commit/d9b0c630997cbefb5734ab39bcce216bfbaccc93
[18:20] * jamespage (~jamespage@tobermory.gromper.net) has left #ceph
[18:20] * jamespage (~jamespage@tobermory.gromper.net) has joined #ceph
[18:21] <Tv_> one unresolved issue with 100% automating the setup there is that the /dev/sdc in the second example needs to be gpt partitioned by *something*
[18:21] <jamespage> Tv_, the only niggle I've noticed in my use has been around multiple invocations of udevadm trigger
[18:21] <Tv_> jamespage: what i've seen is that i have no way of preparing a disk without udev->upstart trying to also activate it
[18:21] * stxShadow (~Jens@ip-178-203-169-190.unitymediagroup.de) has joined #ceph
[18:21] <jamespage> Tv_: I've been testing on an openstack cloud and I'm able to add additional devices post mon and initial OSD bootstrap
[18:22] <Tv_> jamespage: if you have some other weird behavior you're seeing, please let me know
[18:22] <jamespage> however the devices get mounted multiple times
[18:22] <Tv_> oh that
[18:22] <jamespage> looks odd but still works OK
[18:22] <Tv_> jamespage: i think that's a missing error handling path
[18:22] <Tv_> jamespage: that should be covered by some of the latest commits to master by me
[18:23] <jamespage> Tv_, I found myself re-implementing a lot of the code you have in ceph-disk-activate|prepare in my code
[18:23] <Tv_> jamespage: but yeah, it does no harm, is a little bit ugly; there's plenty of cleanup to be done, but the big challenge was getting the full chain going
[18:23] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[18:23] <jamespage> maybe we can find some way to make that re-usable?
[18:24] <Tv_> jamespage: ceph-disk-* are essentially extracted out of the chef cookbook, so they can be shared across chef/juju/puppet/ceph-deploy/manual
[18:24] <jamespage> Tv_, yeah - got that
[18:24] <Tv_> jamespage: so yeah please talk about what your needs were, maybe there's a better way of doing things
[18:25] <jamespage> Tv_, I bootstrap the mons in a slightly different way to the chef cookbook; I don't know the identity of any of the mon nodes pre-bootstrap
[18:25] <jamespage> So wait for X number of mon server nodes to join a peer relationship and then I bootstrap the cluster
[18:26] <jamespage> so I have to do things like 'wait_for_quorum' and 'get_named_key' to pass stuff between the servers participating in the deployment
[18:27] <Tv_> jamespage: sorry, what's the timeline? when do you ceph-mon --mkfs, when do you start the daemon, relative to the juju hooks?
[18:27] <jamespage> Tv_, context is : http://bazaar.launchpad.net/~james-page/charms/quantal/ceph/trunk/files/head:/hooks/
[18:28] <jamespage> Tv_, I have a peer relation defined; it runs ceph-mon --mkfs on each node after 3 nodes have joined the relation
[18:29] <Tv_> jamespage: and then you want to wait for those to reach a quorum, ok
[18:29] <jamespage> Tv_, I then wait for the mons to establish quorum and the keys to be created before I start OSD'izing disks
[18:30] <jamespage> yeah
[18:30] <Tv_> jamespage: btw you can prepare the osds before bootstrap-osd keys, they just won't activate
[18:30] * loicd (~loic@magenta.dachary.org) has joined #ceph
[18:30] <Tv_> jamespage: if you're just waiting for those both, you could just wait for the keys to be created
[18:31] <jamespage> Tv_, yeah - I preferred to know everything is set before I do the OSD initialization
[18:31] <jamespage> I check for the keyring to appear
[18:31] <Tv_> jamespage: yeah the biggest benefit is really a little bit of parallelization.. sometimes the mkfs takes a while
[18:32] <jamespage> Tv_, yeah - I guess so
[18:32] <Tv_> jamespage: i'm saying this as someone who runs the whole thing in a loop a lot ;)
[18:32] <jamespage> Tv_, I have been for the last week or so
[18:33] <Tv_> jamespage: do you use import_osd_bootstrap_key for anything? that looks unnecessary
[18:33] <Tv_> and get_osd_bootstrap_key etc
[18:33] <Tv_> i don't think you need to pierce that abstraction
[18:33] <jamespage> Tv_, that code gets re-used in another charm; ceph-osd
[18:33] <Tv_> (it'd be better if the caps needed aren't copy-pasted all over the place)
[18:34] <jamespage> ah - no I see what you are saying - that is a little bit legacy now
[18:34] <Tv_> jamespage: you could just essentially ship the /var/lib/ceph/bootstrap-osd/$cluster.keyring file without peeking inside it, or dealing with monitors
[18:35] <Tv_> jamespage: it just needs to be shuffled from a machine that runs ceph-mon to machines that want to run ceph-osd
[18:35] <jamespage> Tv_, meh - yeah I guess so
[18:35] <jamespage> but I prefer to pass 'data' rather than full config for some reason I can't quite explain
[18:35] <Tv_> jamespage: juju's config-changed sounds perfect for that sort of stuff ;)
[18:35] <Tv_> jamespage: (much better than chef!)
[18:36] <Tv_> jamespage: well if nothing else please read the file and don't talk to the monitor; that saves you a bunch of trouble around wait_for_quorum etc that ceph-create-keys does for you anyway
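A minimal sketch of the "check for the keyring to appear" approach discussed above, assuming the charm simply polls the filesystem rather than talking to the monitor. The function name, timeout, and polling interval are hypothetical, and the keyring path is an assumption based on the path Tv_ mentions, with "ceph" standing in for $cluster.

```python
import os
import time


def wait_for_file(path, timeout=60.0, interval=1.0):
    """Poll until `path` exists; return True if it appeared, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Assumed location of the bootstrap-osd keyring; "ceph" stands in for $cluster.
KEYRING = "/var/lib/ceph/bootstrap-osd/ceph.keyring"
```

A hook could call wait_for_file(KEYRING) before activating OSDs and then ship the file as-is, without parsing it.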
[18:37] <jamespage> Tv_: ack - agreed
[18:37] <jamespage> I've been looking at the code since last week...
[18:37] <Tv_> jamespage: and thanks for your efforts there!
[18:37] <jamespage> Tv_, I'll blog about it soon... just need to get the ubuntu release out of the way
[18:37] <Tv_> jamespage: i love how different juju is from chef.. it's like.. there's square holes and chef is a round peg
[18:38] <jamespage> Tv_, yeah - it's pretty cool
[18:38] <Tv_> (there are also round holes, and juju is a square peg there...)
[18:38] <jamespage> I've been able to deploy the ceph charms alongside the existing openstack charms and integrate nova and glance really easily
[18:39] <jamespage> Tv_, absolutely - horses for courses et al.
[18:40] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[18:42] * jlogan1 (~Thunderbi@2600:c00:3010:1:5858:566a:6064:ca29) Quit (Ping timeout: 480 seconds)
[18:43] * tryggvil (~tryggvil@host-127-2.eduroamers.nl) has joined #ceph
[18:43] <jamespage> Tv_, anyway - nice work - thanks!
[18:44] * Tv_ bows
[18:51] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:58] * elder (~elder@2607:f298:a:607:38d5:c5b3:89cb:71e1) Quit (Quit: Leaving)
[19:04] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[19:07] * tziOm (~bjornar@ti0099a340-dhcp0358.bb.online.no) has joined #ceph
[19:07] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[19:10] * BManojlovic (~steki@ has joined #ceph
[19:10] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[19:14] * aaron (~chatzilla@ has joined #ceph
[19:15] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:18] * dmick (~dmick@2607:f298:a:607:45fc:a2e5:8679:dbd3) has joined #ceph
[19:19] * tryggvil (~tryggvil@host-127-2.eduroamers.nl) Quit (Quit: tryggvil)
[19:21] <aaron> Is the rewrite rule at http://ceph.com/docs/master/radosgw/config/ supposed to be avoided for swift-like rados gateways?
[19:22] <aaron> I just get 404's without it
[19:22] * nwatkins (~nwatkins@soenat3.cse.ucsc.edu) has joined #ceph
[19:24] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[19:31] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[19:34] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[19:34] * jlogan1 (~Thunderbi@2600:c00:3010:1:3d22:f87e:dd79:4d12) has joined #ceph
[19:37] * Ryan_Lane (~Adium@ has joined #ceph
[19:39] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[19:46] * chutzpah (~chutz@ has joined #ceph
[19:47] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[19:50] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[19:58] * stxShadow (~Jens@ip-178-203-169-190.unitymediagroup.de) has left #ceph
[20:00] * ChanServ sets mode +o dmick
[20:04] * MikeMcClurg (~mike@ Quit (Quit: Leaving.)
[20:04] * jlogan1 (~Thunderbi@2600:c00:3010:1:3d22:f87e:dd79:4d12) Quit (Quit: jlogan1)
[20:13] * jlogan1 (~Thunderbi@2600:c00:3010:1:24b1:1c4c:fafe:8318) has joined #ceph
[20:30] * deepsa (~deepsa@ Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[20:33] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[20:42] * lakie1 (~lakie@9KCAACA3O.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:43] * jlogan1 (~Thunderbi@2600:c00:3010:1:24b1:1c4c:fafe:8318) Quit (Quit: jlogan1)
[20:46] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[20:48] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[20:59] * jlogan1 (~Thunderbi@2600:c00:3010:1:f101:e522:57b7:af61) has joined #ceph
[21:05] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[21:09] * Cube1 (~Adium@ has joined #ceph
[21:09] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[21:20] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[21:22] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:22] * loicd (~loic@magenta.dachary.org) Quit ()
[21:23] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Quit: Leaving)
[21:26] * cblack101 (86868b48@ircip2.mibbit.com) has joined #ceph
[21:40] * steki-BLAH (~steki@ has joined #ceph
[21:41] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[21:42] * aaron resolves the gateway problem
[21:43] <aaron> So does ceph have problems with pools with millions of files (like how swift has to be sharded due to large sqlite btrees)?
[21:45] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[21:49] <joshd> aaron: no, there's not nearly as much examination of all objects, and no metadata bottleneck like sqlite
[21:50] <aaron> right now we have SSDs for relatively modest container dbs
[21:50] <aaron> otherwise, listings would just timeout...
[21:50] <aaron> joshd: are container listings sequentially consistent?
[21:51] <aaron> I mean "pool" listings
[21:53] <joshd> if you're talking about bucket/container listings through the rados gateway, yes
[21:53] <aaron> yes, that would be another improvement over swift
[21:53] <joshd> pools are a separate level, and I think they're consistent too, but it's more expensive to list them - they're not indexed
[21:54] <aaron> hmm, maybe I'm not clear on the relation between gateway containers and pools.
[21:54] <joshd> containers/buckets are an abstraction the gateway provides on top of pools
[21:54] <aaron> are gateway containers all under the "data" pool or something?
[21:55] * steki-BLAH (~steki@ Quit (Ping timeout: 480 seconds)
[21:55] * aaron apologizes in advance for any newb questions
[21:55] <phantomcircuit> can i setup a system where a write returns complete after it's in memory on n osd systems?
[21:56] <aaron> we just started seriously looking at ceph for wikimedia/wikipedia
[21:56] <rweeks> that's awesome to hear, aaron
[21:56] <aaron> swift has no decent multi-DC replication
[21:56] <aaron> we were hoping container sync would work but it's total garbage
[21:56] <rweeks> and no apologies. we're here to talk about ceph, newbs or not.
[21:56] <joshd> I think they're all under a .rgw pool or something like that by default, but the gateway stores its metadata in another pool
[21:57] <joshd> phantomcircuit: at the rados level yes, at other levels that configuration isn't really exposed, but would be easy to add
[21:57] <aaron> joshd: are there any major uses of ceph that have replicas in different geographic places?
[21:58] <joshd> aaron: not with large latencies
[21:58] <aaron> 32ms is ok?
[21:59] <joshd> aaron: for radosgw, it might be. for running a vm on, it would probably hurt performance too much
[21:59] * paravoid (~paravoid@scrooge.tty.gr) has joined #ceph
[22:00] <aaron> joshd: so actual container listings are not that fast then. How does it deal with paging (e.g. ?marker and ?limit)?
[22:01] <rweeks> so 32ms, you're talking across a bit of geography but not across oceans?
[22:01] <joshd> aaron: container listings are fast - these are indexed by radosgw - it's pool listings that are slow (which you probably don't care about)
[22:02] <aaron> ohhh, ok I had that backwards then, sorry
[22:02] <aaron> nice
[22:02] <nhm_> aaron: one thing to keep in mind too regarding performance is that by default a bucket's index is stored in ceph, and usually ends up on 1 OSD. That can be a performance bottleneck.
[22:03] <aaron> "usually"? When would it be on multiple osds?
[22:03] <rweeks> we should really decide on container or bucket, and use that one term. :)
[22:03] <aaron> heh
[22:04] <nhm_> aaron: sorry, that was mangled. I meant that you can disable that so the bucket doesn't end up on an OSD.
[22:04] <nhm_> er bucket index rather.
[22:04] * jlogan2 (~Thunderbi@ has joined #ceph
[22:05] * jlogan3 (~Thunderbi@ has joined #ceph
[22:06] <aaron> nhm_: you mean disable storing the index in an osd? where would it go then?
[22:09] <nhm_> aaron: hrm, perhaps I'm misremembering what this does. I was thinking of "rgw_enable_ops_log".
[22:09] <aaron> joshd: actually, how are bucket listings stored anyway?
[22:10] * jlogan1 (~Thunderbi@2600:c00:3010:1:f101:e522:57b7:af61) Quit (Ping timeout: 480 seconds)
[22:10] <nhm_> aaron: logging of the RGW operations.
[22:10] <aaron> other than "not sqlite" hopefully ;)
[22:10] * jlogan3 (~Thunderbi@ Quit (Quit: jlogan3)
[22:10] * steki-BLAH (~steki@ has joined #ceph
[22:12] * jlogan2 (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[22:13] <aaron> <rweeks >so 32ms, you're talking across a bit of geography but not across oceans?
[22:14] <aaron> tampa, FA and ashburn, VA
[22:14] <rweeks> Gotcha
[22:14] <aaron> *FL
[22:14] <rweeks> I was just assuming that based on the latency, since most oceans are +40ms at least depending on the route.
[22:14] <joshd> aaron: they're stored in rados objects, yehudasa could tell you the exact details
[22:14] <Tv_> nhm_: ops log is very different from index
[22:15] <nhm_> Tv_: yes, I didn't actually remember what we were doing. :)
[22:15] <Tv_> aaron: imagine your fastest (low-level, one block) write operation always being >32ms -- is that acceptable or not; that's up to you
[22:15] <nhm_> Tv_: Now I'm actually going back and reading about what that actually does.
[22:15] * lakie1 (~lakie@9KCAACA3O.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[22:16] <Tv_> aaron: maybe even 2x because it's round-trip
[22:16] <nhm_> It looks like we are at least logging deleted objects for later removal.
[22:16] <aaron> Tv_: the clients would just use the gateway, so we would not be sending many block writes to do something to one file
[22:16] <nhm_> Not sure what else we are logging.
[22:16] <aaron> just PUT/GET/HEAD sort of stuff
[22:17] <Tv_> yeah, it might be tolerable with just s3-style PUTs
[22:17] <Tv_> not actual file IO
[22:17] <aaron> of course, I don't know how those are working under the hood internally
[22:17] <nhm_> aaron: anyway, forget what I said about indexing.
[22:17] <aaron> heh, ok
[22:17] <Tv_> aaron: so the other bad case in spreading across DCs is tromboning
[22:17] <nhm_> aaron: but do keep in mind that there is operating logging (to some extent) happening against a single OSD.
[22:17] <aaron> yeah, I thought using a log as an index sounded a bit strange ;)
[22:17] <nhm_> operation
[22:18] <Tv_> aaron: say you have A and B; client is near A; primary for some object happens to be in B; 3x replication
[22:18] <Tv_> aaron: client(A) -> B for primary write, B->A & B->A for replication
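The tromboning case Tv_ describes can be sketched as a count of messages that cross between data centers. This is a simplified model, assuming only that the client writes to the primary and the primary forwards to each replica; acknowledgements are not modelled, and the function name and DC labels are hypothetical.

```python
def cross_dc_hops(client_dc, primary_dc, replica_dcs):
    """Return the list of (source_dc, dest_dc) messages that cross data centers."""
    hops = []
    # The client always writes to the primary, wherever it lives.
    if client_dc != primary_dc:
        hops.append((client_dc, primary_dc))
    # The primary forwards the write to every replica.
    for dc in replica_dcs:
        if dc != primary_dc:
            hops.append((primary_dc, dc))
    return hops


# Tv_'s example: client near A, primary in B, both replicas back in A.
print(cross_dc_hops("A", "B", ["A", "A"]))
```

With the primary in the client's own data center and the replicas local as well, the same function returns an empty list, which is the case aaron and paravoid aim for by keeping primaries in one DC.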
[22:18] <nhm_> aaron: somehow I got it in my head that we were indexing, not sure why. Wires seem to randomly get crossed.
[22:19] <Tv_> nhm_: there is an index of bucket contents
[22:20] <nhm_> Tv_: Does the user have any control where that gets placed?
[22:21] <nhm_> Tv_: or just in the rgw metadata pool or something?
[22:21] <aaron> paravoid: I'd imagine we'd want the primaries all in one data center, where the apaches are too, right?
[22:21] <Tv_> nhm_: i'm thinking it's in some .rgw.foo pool, but i don't know off hand and yehuda's afk
[22:22] <joshd> aaron: it looks like the bucket index is stored in leveldb (each osd has a leveldb instance where key/value pairs associated with a rados object can be stored)
[22:22] <paravoid> yeah, there are no active/active plans yet.
[22:22] <aaron> I thought level only did key/value lookups? How does it do "listings starting from X" queries?
[22:22] <aaron> *leveldb
[22:22] <Tv_> aaron: leveldb has range queries just fine
[22:23] <aaron> so it's sorted-key-value?
[22:23] <Tv_> aaron: log-structured-merge-tree
[22:23] <Tv_> but yeah each of the stable (non-journal) files is a sorted key-value list with index
[22:24] <Tv_> you ask for a range fetch, it does a merge sort of all the possible source files (& RAM)
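A rough illustration of the range fetch Tv_ describes, combined with the ?marker and ?limit style paging aaron asked about earlier. This is not leveldb's implementation: plain sorted Python lists stand in for the sorted files and the in-memory table, and the function name is hypothetical.

```python
import heapq
from bisect import bisect_right
from itertools import islice


def range_fetch(sources, marker, limit):
    """List up to `limit` keys strictly after `marker`.

    `sources` is a list of sorted key lists, each standing in for one
    sorted file (or the in-memory table).
    """
    if limit <= 0:
        return []
    # Binary-search each sorted source to skip everything up to the marker.
    tails = [islice(src, bisect_right(src, marker), None) for src in sources]
    out = []
    # heapq.merge lazily merges the already-sorted tails into one sorted stream.
    for key in heapq.merge(*tails):
        if out and out[-1] == key:
            continue  # the same key can appear in more than one source
        out.append(key)
        if len(out) == limit:
            break
    return out


sources = [["b", "d"], ["a", "b", "c", "e"]]
print(range_fetch(sources, "a", 3))
```

Because each source is already sorted, the merge only reads as many entries as the limit requires, which is why a "listing starting from X" does not need a full scan.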
[22:28] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[22:29] <aaron> Tv_, joshd: it would be nice to meet some ceph people in person
[22:29] <Tv_> aaron: you're in Sweden?
[22:29] <aaron> San Francisco
[22:29] * BManojlovic (~steki@ has joined #ceph
[22:29] <Tv_> aaron: oh. your irc server is in sweden ;)
[22:29] <aaron> heh
[22:29] <joshd> aaron: I'll be at the openstack conference next week
[22:30] <aaron> paravoid: I think Ryan will be there, right?
[22:30] <Tv_> aaron: the original group is in LA, we have an office in Sunnyvale, etc; i'm sure that can be arranged
[22:31] * lofejndif (~lsqavnbok@04ZAAANX0.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:31] <Tv_> aaron: so SF still counts as home turf for us; shoot an email to dona@inktank.com and ask for a meeting to have an overview, or something
[22:31] * aliguori (~anthony@cpe-70-123-130-163.austin.res.rr.com) Quit (Remote host closed the connection)
[22:32] <paravoid> yes, Ryan and Andrew
[22:32] <rweeks> aaron: Inktank has offices in Sunnyvale. Dreamhost (parent company for now) has offices in SF
[22:32] <rweeks> there are definitely people here to meet.
[22:32] <aaron> ok, nice
[22:33] <nhm_> aaron: I think we are at like 11 conferences between now and the end of november...
[22:33] * steki-BLAH (~steki@ Quit (Ping timeout: 480 seconds)
[22:33] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:34] <rweeks> yeah a lot of people are travelling
[22:34] <rweeks> if you want to meet up with any of the developers, that might require more wrangling.
[22:35] <nhm_> we'll have a couple at the ubuntu summit.
[22:36] <Tv_> nhm_: that's copenhagen, quite far from SF ;)
[22:36] <nhm_> Tv_: hey, I'm trying to give him an excuse. ;)
[22:36] <Tv_> that's why i asked if aaron is in Sweden, as his irc server put him within travel distance
[22:36] <Tv_> hah
[22:36] <rweeks> sage1 or sagewk, when is the next time you're up in the bay area?
[22:36] <Tv_> rweeks: don't expect him on irc today
[22:36] <rweeks> travelliung?
[22:37] <rweeks> but without the u
[22:37] <Tv_> he was in a car earlier
[22:37] <Tv_> travellung -- what you need for driving in the LA smog
[22:37] <rweeks> heh
[22:40] * steki-BLAH (~steki@ has joined #ceph
[22:46] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[22:49] <aaron> so does ceph ignore the x-newest header for the gateway, since ceph is already strongly consistent?
[22:57] <joshd> I don't see that string in the codebase, so I'm guessing yes :)
[22:58] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: This computer has gone to sleep)
[22:59] <rweeks> that was one thing that confused me about swift, the way it acted with x-newest depending on which node you asked and how you asked it
[22:59] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[23:07] * loicd (~loic@ has joined #ceph
[23:08] <rweeks> aaron: what does wikipedia use today for storage?
[23:08] * tryggvil (~tryggvil@s529d22d5.adsl.online.nl) has joined #ceph
[23:10] <aaron> rweeks: we use swift with nfs over netapps for a backup
[23:10] <rweeks> gotcha
[23:10] <aaron> there are two working netapp setups in both clusters, with one replicating to the other
[23:10] <aaron> if things went as planned though, we would just have had swift and replicated from one cluster to the other
[23:11] <aaron> we also had lots of hardware failures on the swift boxes with cp2100s, which didn't help
[23:11] <rweeks> oh fun
[23:12] <aaron> paravoid could tell you more about that horror
[23:13] <rweeks> so you're using SnapMirror to replicate between FL and VA today?
[23:13] <aaron> drbd
[23:13] <paravoid> no, we use snapmirror.
[23:14] <aaron> hmm, I must have misheard asher
[23:14] <paravoid> there's no drbd involved anywhere
[23:14] <rweeks> (for spectators: SnapMirror is NetApp replication)
[23:14] <paravoid> the netapps are a temporary solution though
[23:14] <rweeks> (full disclosure: I left NetApp to join Inktank 2 weeks ago)
[23:14] <dmick> http://cdn-images.hollywood.com/site/MichaelJacksonEatingPopcorn.gif
[23:15] <paravoid> to have DC resiliency until our primary storage gains replication support
[23:15] <aaron> he must have been talking about non-netapp replication or something in the future (possible sharded nfs)
[23:15] <rweeks> Thriller, nice, dmick
[23:15] <paravoid> just to be clear, we're still using swift and haven't decided to replace that (with ceph or otherwise) yet
[23:16] <paravoid> we also have some other needs at the WMF besides our media storage
[23:16] <rweeks> understood. Just trying to understand how things are working for you today
[23:16] <paravoid> for our Wikimedia Labs platform
[23:17] <paravoid> which is basically openstack nova giving VMs to staff & volunteers for a development/staging
[23:17] <rweeks> ahh
[23:17] <rweeks> so then the Ceph RBD is of interest
[23:17] <paravoid> we currently use glusterfs for the home directories; we used to use gluster for instance storage but that failed us pretty quickly and we've fallen back to local storage
[23:17] * miroslavk (~miroslavk@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[23:17] <paravoid> so ceph is also worth looking from that PoV
[23:18] <paravoid> I have the luck to be involved in both media storage & labs
[23:18] <aaron> "luck"
[23:18] <paravoid> heh
[23:18] <rweeks> sounds like a fun job.
[23:19] <paravoid> Ryan & Andrew are colleagues of ours that are involved in labs and are attending the summit next week
[23:19] <Ryan_Lane> who now?
[23:19] <Ryan_Lane> :)
[23:20] <paravoid> heh :)
[23:20] <aaron> Ryan_Lane: I didn't notice you in here
[23:20] <Ryan_Lane> I lurk
[23:20] <aaron> so I see
[23:22] <rweeks> well then, Ryan_Lane, please come look up the Inktank folks at the summit, a number of us will be there.
[23:22] <Ryan_Lane> yep. I'll come find you guys
[23:23] <rweeks> it sounds like Ceph has a lot of potential for wikimedia.
[23:24] <Ryan_Lane> well, this will be one of the only distributed file systems we haven't tried ;)
[23:25] <rweeks> distributed object store *cough cough* :)
[23:25] <paravoid> there are lots more you know
[23:25] <Ryan_Lane> tomato tomato :)
[23:25] <paravoid> be careful what you wish for
[23:25] <Ryan_Lane> paravoid: oh, I know. we weeded out a ton early on
[23:25] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: This computer has gone to sleep)
[23:26] <rweeks> as we note on the ceph.com site, the CephFS posix filesystem is not fully baked, but the object store and the block device are.
[23:27] <paravoid> yeah, cephfs would perhaps be useful for labs /home but I don't think we have an urgent need about that
[23:27] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[23:27] <paravoid> radosgw and rbd should be fine
[23:27] <aaron> hmm, "Transfer-Encoding: chunked" might not work with the gateway it seems
[23:28] <aaron> gives 411 status...though the fact that we are hitting that case is also a bug in CF somewhere ;)
[23:28] <aaron> (cloudfiles)
[23:28] <aaron> "if ( !$obj->content_length ) {"
[23:29] * aaron sighs at bad PHP code and the number/string 0
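The class of bug aaron is pointing at can be sketched in Python, where the integer 0 is also falsy. This is an analogue, not the cloudfiles PHP code: both function names are hypothetical, and it assumes the client falls back to chunked encoding whenever it believes the length is unknown.

```python
def upload_headers_buggy(content_length):
    # Truthiness test: a zero-length object is treated like an unknown length.
    if not content_length:
        return {"Transfer-Encoding": "chunked"}
    return {"Content-Length": str(content_length)}


def upload_headers_fixed(content_length):
    # Only a genuinely missing length (None) falls back to chunked encoding.
    if content_length is None:
        return {"Transfer-Encoding": "chunked"}
    return {"Content-Length": str(content_length)}


print(upload_headers_buggy(0))
print(upload_headers_fixed(0))
```

Under that assumption, an empty object sent with the buggy check would go out chunked, which is the request shape that drew the 411 from the gateway.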
[23:29] <rweeks> isn't "bad PHP code" sort of redundant?
[23:29] * rweeks ducks
[23:33] * tren (~Adium@ Quit (Quit: Leaving.)
[23:33] * tziOm (~bjornar@ti0099a340-dhcp0358.bb.online.no) Quit (Remote host closed the connection)
[23:34] * tren (~Adium@ has joined #ceph
[23:35] * tren (~Adium@ Quit ()
[23:35] * dmick (~dmick@2607:f298:a:607:45fc:a2e5:8679:dbd3) Quit (Ping timeout: 480 seconds)
[23:39] * danieagle (~Daniel@ has joined #ceph
[23:45] * vata (~vata@2607:fad8:4:0:8d6a:59a3:71b9:93f4) Quit (Quit: Leaving.)
[23:46] * nwatkins (~nwatkins@soenat3.cse.ucsc.edu) Quit (Remote host closed the connection)
[23:55] * vata (~vata@2607:fad8:4:0:3537:8284:b905:1e36) has joined #ceph
[23:57] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[23:59] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.