#ceph IRC Log


IRC Log for 2012-11-06

Timestamps are in GMT/BST.

[0:00] <dmick> ccb94785007d33365d49dd566e194eb0a022148d in libvirt, which went into v10.0, which is not in Ubuntu quantal
[0:00] <dmick> is required if you want to run without auth; otherwise, a workaround is to add "auth_supported=none" to the domain XML
[0:00] <dmick> I'm going to write this down somewhere more permanent
[0:01] <rweeks> I suggest a parchment on the door of the libvirt developers
[0:01] <dmick> well, but it's for Ceph rbd users, is the thing
[0:01] <dmick> http://ceph.com/docs/master/rbd/libvirt/, probably
[0:01] <dmick> (which also needs to get the wiki instructions added to it)
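For the record, the workaround dmick describes goes in the disk element of the libvirt domain XML; a sketch (the pool/image name and monitor host are hypothetical, the key detail is the `auth_supported=none` option appended to the rbd source name):

```xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <!-- "rbd/myimage" and the host are placeholders; the workaround is the
       auth_supported=none option tacked onto the source name -->
  <source protocol='rbd' name='rbd/myimage:auth_supported=none'>
    <host name='mon-host.example.com' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```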
[0:02] * lurbs (user@uber.geek.nz) has joined #ceph
[0:02] <jefferai> hrm
[0:02] <jefferai> 3.5 kernel recommended for using btrfs with ceph eh
[0:02] <jefferai> wheezy has 3.2 :-(
[0:04] * PerlStalker (~PerlStalk@perlstalker-1-pt.tunnel.tserv8.dal1.ipv6.he.net) Quit (Remote host closed the connection)
[0:04] <dmick> jefferai: yeah, btrfs is seeing lots of development, and 3.2 is pretty old by now
[0:05] * maxim (~pfliu@ has joined #ceph
[0:05] <jefferai> yeah, I know it is
[0:06] <jefferai> I do wish wheezy had a newer kernel
[0:06] <jefferai> I could grab the experimental kernel
[0:06] <jefferai> but that's obviously not as well tested
[0:06] * jefferai notes that the latest Ubuntu LTS is 3.2 as well
[0:07] * dmick is currently running quantal :)
[0:07] <jefferai> yeah
[0:07] <dmick> I know, tension
[0:07] <jefferai> my team didn't want to use Ubuntu
[0:07] <jefferai> for various reasons, preferred to stay with Debian wheezy
[0:07] <lurbs> Anyone seen issues where a cloned rbd image shows its parent correctly, but a list of the children of the parent snapshot doesn't list it?
[0:07] <jefferai> so yeah, tension between using not-well-tested kernel and using quite old btrfs code
[0:08] <lurbs> http://paste.nothing.net.nz/a75f43
[0:08] <lurbs> And so it can't be removed or flattened.
[0:09] <joshd> lurbs: no, I'd like to figure out why that happened
[0:09] <lurbs> There are other cloned children of the same parent image that seem to be working fine.
[0:09] <lurbs> This is 0.53 from the debian-testing repository on 12.04 LTS BTW.
[0:09] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[0:10] <lurbs> Any other debug info that would be useful?
[0:11] <joshd> lurbs: did that children list always omit that image, or only after you tried to remove it?
[0:11] <lurbs> Not entirely sure, sorry.
[0:12] <joshd> well, cloning would have failed if it wasn't able to add it to the child list
[0:13] <lurbs> Fair enough. Is there any way to simply force the rm to succeed without it deleting the parent reference?
[0:14] <joshd> could you do 'rados -p rbd listomapkeys rbd_children'
[0:15] <lurbs> http://paste.nothing.net.nz/1bea67
[0:15] * PerlStalker (~PerlStalk@perlstalker-1-pt.tunnel.tserv8.dal1.ipv6.he.net) has joined #ceph
[0:16] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:16] <lurbs> I should probably mention that there are 750 other (working) children of that snapshot.
[0:17] <jefferai> dmick: so is quantal basically the "ideal" distribution to run on right now?
[0:17] <joshd> lurbs: well, you can delete the objects for the image if you just want to free up space
[0:18] <lurbs> joshd: Will removing the objects also implicitly remove the RBD volume?
[0:18] <lurbs> In that things like 'rbd ls' won't show it anymore.
[0:18] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[0:18] <dmick> jefferai: I wouldn't say that. I'm just illustrating what a bleeding-edge freak I am :)
[0:18] <joshd> lurbs: removing the header object will make it inaccessible, although removing it from 'rbd ls' is harder
[0:19] <jefferai> well
[0:19] <dmick> precise is what's getting the most testing overall here at Inktank anyway
[0:19] * loicd (~loic@ Quit (Quit: Leaving.)
[0:19] <joshd> lurbs: 'rbd rm' should handle this case and continue
[0:19] <jefferai> btrfs is the recommended filesystem
[0:19] <jefferai> but > kernel 3.5
[0:19] <jefferai> which isn't in precise
[0:19] <jefferai> or do you guys do most testing on xfs?
[0:19] <lurbs> btrfs still scares the bejeezus out of me.
[0:19] <dmick> we tend to recommend xfs more than btrfs right now partly because of the velocity of change in btrfs
[0:20] <jefferai> hm
[0:20] <dmick> but we're testing with both and people are using both. we just tend to use later kernels for btrfs
[0:20] <jefferai> how's stability been on quantal with btrfs?
[0:21] <dmick> lurbs: can you provide rados -p rbd listomapvals rbd_children? Afraid that listomapkeys might be hiding info
[0:21] <jefferai> I guess also it's not necessarily terribly difficult to change later, right, since you can just take a node offline, switch its filesystem, and bring it back?
[0:21] <jefferai> I'm just trying to get things set up from the beginning without doing a whole bunch of changing around later
[0:21] <dmick> jefferai: couldn't say. I'm running quantal on my desktop, but not running a cluster there
[0:21] <jefferai> ah
[0:22] <jefferai> so the most testing gets done on precise with xfs?
[0:22] <dmick> I think the story on later-kernels-on-precise is still that "fast when the filesystem is new, slower as it ages" is worse for btrfs than for xfs
[0:22] <joshd> lurbs: 'rbd ls' reads the rbd_directory object, which you can modify (for format 2 images) with 'rados -p rbd rmomapkey <key_for_image>' and 'rados -p rbd rmomapkey <key_for_image_name>'
[0:22] <dmick> and we're actively studying/diagnosing/working with btrfs folks about it
[0:23] <jefferai> hm, okay
[0:23] <jefferai> so xfs on 3.2 is really a decent way to go for now, then?
[0:23] <dmick> but yes, it's also not hard to wipe an OSD and change it
[0:23] <lurbs> dmick: http://paste.nothing.net.nz/70d829
[0:23] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[0:23] <dmick> jefferai: I would say so
[0:23] <jefferai> the Ceph docs do a good job of making you really want to use btrfs since it's oh-so-suitable for Ceph, but then dropping warning bombs
[0:23] <jefferai> :-)
[0:23] <lurbs> That, er, might be a little verbose. As I said there are 750 active children off that snapshot.
[0:24] <jefferai> dmick: thanks -- will do xfs
[0:24] <denken> when using an ssd partition as an osd journal, does ceph simply use the raw device, or does it put some kind of file system on it when the cluster is initialized? i want to do something like osd journal = /dev/disk/by-partlabel/${name}-journal but that requires GPT which isn't an option in this particular case... so i was thinking mkfs.btrfs and writing a label to it, then using /dev/disk/by-label/${name}-journal instead
[0:24] <denken> but that only works, obviously, if there is a filesystem on the partition
[0:24] <joshd> lurbs: so you'd want to remove keys 'id_17d51f1fab05' and 'name_precise-rbd257' from the rbd_directory
[0:26] <joshd> denken: it uses the raw device for the journal... you could stick a filesystem on it and make the journal the only file in that fs, but it'd add some overhead
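A minimal ceph.conf sketch of the raw-device journal joshd describes (device path and osd id are hypothetical); the journal setting points straight at the block device, no filesystem needed:

```ini
[osd.0]
    osd data = /var/lib/ceph/osd/ceph-0
    ; raw SSD partition used directly as the journal, no mkfs required
    osd journal = /dev/sdb1
```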
[0:27] <denken> ty joshd
[0:28] <lurbs> joshd: Not sure what you mean. Basically all files *17d51f1fab05* and *precise-rbd257* from the osd data directories?
[0:29] <joshd> lurbs: no, I mean 'rados -p rbd rmomapkey id_17d51f1fab05' and 'rados -p rbd rmomapkey name_precise-rbd257'
[0:30] <joshd> err, that's not quite right. add rbd_directory after rmomapkey
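Putting joshd's correction together, the full cleanup for the stuck image looks like this (the id and name come from lurbs's pastes; the commands are printed rather than executed here, since they modify live cluster state):

```shell
# Build the rbd_directory cleanup commands for a stuck format-2 image.
# Note the corrected argument order: the object name (rbd_directory)
# comes right after rmomapkey, then the key to remove.
rm_dir_entry_cmds() {
    image_id="$1"
    image_name="$2"
    echo "rados -p rbd rmomapkey rbd_directory id_${image_id}"
    echo "rados -p rbd rmomapkey rbd_directory name_${image_name}"
}

# Printed, not executed: run them by hand once you're sure.
rm_dir_entry_cmds 17d51f1fab05 precise-rbd257
```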
[0:31] <lurbs> Still claims it exists: http://paste.nothing.net.nz/6d2bf4
[0:31] <lurbs> Thanks for the help, BTW.
[0:32] <lurbs> ...and now it doesn't.
[0:33] <lurbs> I guess I checked too soon.
[0:33] <joshd> 'rbd info' is reading the header object (rbd_header.17d51f1fab05)
[0:34] <dmick> lurbs: do you remember the failure you got the very first time you tried to remove the image, by chance?
[0:34] <joshd> you can 'rados -p rbd rm rbd_header.17d51f1fab05' to get rid of it
[0:34] <dmick> (or is it in your scrollback?)
[0:34] <lurbs> It would have been in the middle of some daft 'for each in `seq -w 000 750`; do ; stuff; done' command, so I'd have missed it.
[0:35] <dmick> mm
[0:35] <dmick> yeah
[0:35] <dmick> a theory: someone else still had it open? is that possible?
[0:35] <joshd> or just crashed with it open
[0:35] <lurbs> Very. It would have been backing a KVM machine.
[0:35] <dmick> (i mean we have a bug here, just trying to understand how it played out)
[0:35] <dmick> ok
[0:36] <lurbs> I'll see if I can recreate it in isolation. It's happened a few times so far. I ended up just trashing the parent snapshot and starting over the other times.
[0:38] <elder_> joshd, I'd like your opinion on something. I've changed the "snap_exists" flag to just "exists."
[0:39] <elder_> That value will always be true, unless and until it gets changed, exactly once, to false, if and only if an image refresh finds that the mapped snapshot has disappeared from the snapshot context.
[0:40] <lurbs> Is there much of a performance penalty to having RBD volumes as a copy on write clone of a snapshot, BTW? As I understand it that's how OpenStack deployment of RBD backed VMs is likely to work.
[0:40] <elder_> Because of that, if I want to do conditional processing based on whether exists is false, there's no real need to protect that test with the header semaphore.
[0:41] <joshd> lurbs: there is a little (some more optimizations are coming soon). you can always 'rbd flatten' them later
[0:41] <dmick> less with caching
[0:41] <elder_> Would you agree? At least if I either set it atomically, or insert an appropriate memory barrier before reading it?
[0:42] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Quit: Leaving.)
[0:42] <elder_> Gotta go, but I'll be back online again in about two hours.
[0:42] <joshd> elder_: yes, if it's set/read atomically, there's no reason to guard it with the header semaphore
[0:42] <joshd> lurbs: yeah, you'll definitely want librbd caching enabled
[0:49] <lurbs> It's enabled, but with the standard 32 MB (I think) size.
[0:50] <lurbs> Haven't done much performance tuning, other than that.
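For reference, librbd caching of that era is controlled on the client side of ceph.conf; a sketch (the size shown is the 32 MB default lurbs mentions, in bytes):

```ini
[client]
    rbd cache = true
    ; default cache size: 32 MB, in bytes
    rbd cache size = 33554432
```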
[0:59] * PerlStalker (~PerlStalk@perlstalker-1-pt.tunnel.tserv8.dal1.ipv6.he.net) Quit (Quit: ...)
[1:15] * vata (~vata@ Quit (Quit: Leaving.)
[1:16] <dmick> lurbs: branch wip-rbd-rm will be the fix; sorry for the trouble. joshd: when you get a chance, review plz?
[1:19] * andreask1 (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[1:20] * vata (~vata@ has joined #ceph
[1:20] <lurbs> Nice, thanks. I'll try not to break it any more. :)
[1:22] <dmick> lol, no no no, our bad. mine specifically :)
[1:23] <lurbs> Blame Florian for introducing me to Ceph in the first place.
[1:23] <dmick> the more the merrier.
[1:25] * vata (~vata@ Quit (Quit: Leaving.)
[1:25] <lurbs> Also, 'rbd ls' only shows the first 64 volumes. Known limitation, I take it?
[1:26] * tnt (~tnt@34.23-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[1:26] <dmick> not known to me
[1:27] <dmick> hm, and I don't see the problem immediately. Same with ls -l?
[1:30] <lurbs> ls -l where?
[1:30] <dmick> and in fact not true for me with format 1 images; I'm guessing this is format 2?
[1:30] <dmick> (rbd ls -l)
[1:30] <lurbs> It's all format 2, and 'rbd ls -l' just tells me there's no such pool.
[1:30] * jlogan1 (~Thunderbi@2600:c00:3010:1:9b2:ed42:a1f6:a6ec) Quit (Ping timeout: 480 seconds)
[1:30] <lurbs> error opening pool -l: (2) No such file or directory
[1:30] <dmick> oh, you must be running older ceph
[1:31] <dmick> when did that go in...
[1:31] <lurbs> 0.53 from debian-testing repository.
[1:31] <lurbs> 0.55 is slated to be bobtail?
[1:32] * Cube (~Cube@ Quit (Quit: Leaving.)
[1:33] <nwl> lurbs: yes
[1:33] <lurbs> Is there a way to force format 2 to be the default, or is it changing in a later release?
[1:37] <dmick> lurbs: no way to force now. it'll change eventually, but we're working on the kernel side support before we do that
[1:37] <dmick> I'm surprised that fix isn't in 0.53, but I'm reviewing my git-fu
[1:38] <dmick> in any event, i don't remember a limitation like that, so that's weird.
[1:38] <dmick> let me try 100 f2 images
[1:38] * kYann (~Yann@did75-15-88-160-187-237.fbx.proxad.net) has joined #ceph
[1:38] <lurbs> Most of those images are clones of a snapshot, if that makes a difference.
[1:39] * Guest4341 (~Yann@did75-15-88-160-187-237.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[1:46] <dmick> it shouldn't, but might
[1:46] <joshd> there was a bug with listing >1024 format 2 images
[1:53] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[1:57] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[1:58] <lurbs> I duplicated the clone removal issue, and it was because it was being held open - as suspected.
[1:59] <dmick> ok, thanks
[1:59] * kYann (~Yann@did75-15-88-160-187-237.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[1:59] <dmick> I will push for this fix to be included in the next release; if you feel up to building on your own it's very isolated and would be easy to cherrypick
[2:00] * kYann (~Yann@did75-15-88-160-187-237.fbx.proxad.net) has joined #ceph
[2:00] <lurbs> I'm not going live until bobtail, so I can live with a workaround until then, no problem.
[2:02] <lurbs> Thanks very much for the help.
[2:06] <dmick> lurbs: we do spot a 64-image problem here, and in fact seems worse on later code (endless loop?) thanks again, we'll ferret that one out
[2:08] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[2:09] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[2:09] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:14] * yoshi (~yoshi@p20198-ipngn3002marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:23] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[2:35] * silversu_ (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[2:35] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[3:34] * adjohn (~adjohn@ Quit (Quit: adjohn)
[3:38] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[3:39] * trhoden (~trhoden@pool-108-28-184-124.washdc.fios.verizon.net) has joined #ceph
[3:44] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[3:44] <joshd> sagewk: sagelap: wip-rbd-read seems ok, other than memory leaks (#3445). have you run a really long fsx on it? (like 100s of k of operations)
[4:06] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:41] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[4:41] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[4:44] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[4:44] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[4:47] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:06] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[5:15] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Quit: Leaving.)
[5:32] * stp (~stp@dslb-084-056-023-188.pools.arcor-ip.net) has joined #ceph
[5:40] * stp__ (~stp@dslb-084-056-002-013.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[5:46] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[6:04] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) has joined #ceph
[6:24] * dmick (~dmick@2607:f298:a:607:6d30:a089:4f65:a21d) Quit (Quit: Leaving.)
[6:27] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[6:36] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[6:39] * KindOne is now known as Romney
[7:01] * sagelap (~sage@bzq-218-183-205.red.bezeqint.net) Quit (Read error: No route to host)
[7:18] * bchrisman1 (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[7:19] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[7:19] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[7:24] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[7:40] * Romney is now known as KindOne
[7:43] * kyannbis (~yann.robi@tui75-3-88-168-236-26.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[7:43] * zynzel_ (zynzel@spof.pl) has joined #ceph
[7:44] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[7:47] * zynzel (zynzel@spof.pl) Quit (Ping timeout: 480 seconds)
[7:51] * tnt (~tnt@34.23-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:05] * sagelap (~sage@bzq-10-168-31-220.red.bezeqint.net) has joined #ceph
[8:10] * sagelap1 (~sage@bzq-16-168-31-55.red.bezeqint.net) has joined #ceph
[8:17] * sagelap (~sage@bzq-10-168-31-220.red.bezeqint.net) Quit (Ping timeout: 480 seconds)
[8:22] * loicd (~loic@ has joined #ceph
[8:22] * kYann (~Yann@did75-15-88-160-187-237.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[8:23] * kyann (~yann.robi@tui75-3-88-168-236-26.fbx.proxad.net) has joined #ceph
[8:23] * hijacker (~hijacker@ Quit (Remote host closed the connection)
[8:24] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Server closed connection)
[8:25] * Yann__ (~Yann@did75-15-88-160-187-237.fbx.proxad.net) has joined #ceph
[8:25] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[8:35] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Quit: Leaving.)
[8:35] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[8:40] * BManojlovic (~steki@gprswap.mts.telekom.rs) has joined #ceph
[8:40] * stp (~stp@dslb-084-056-023-188.pools.arcor-ip.net) Quit (Quit: Leaving)
[8:46] * hijacker (~hijacker@ has joined #ceph
[8:50] * hijacker (~hijacker@ Quit (Remote host closed the connection)
[8:52] * s_parlane (~scott@121-74-235-205.telstraclear.net) has joined #ceph
[8:52] <s_parlane> Can I safely run ceph OSDs without a journal on btrfs? The docs seem to indicate this is possible, but nothing I've found explicitly states it, and I haven't figured out how to turn the journal off
[9:03] * brambles (xymox@grip.espace-win.org) Quit (Server closed connection)
[9:03] * brambles (xymox@grip.espace-win.org) has joined #ceph
[9:03] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[9:09] * jjgalvez1 (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[9:09] * MarkN (~nathan@ has joined #ceph
[9:12] * hijacker (~hijacker@ has joined #ceph
[9:13] * loicd (~loic@ Quit (Quit: Leaving.)
[9:14] * verwilst (~verwilst@d5152D6B9.static.telenet.be) has joined #ceph
[9:17] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:18] * jantje_ (~jan@paranoid.nl) Quit (Server closed connection)
[9:18] * jantje (~jan@paranoid.nl) has joined #ceph
[9:24] * sagelap1 (~sage@bzq-16-168-31-55.red.bezeqint.net) Quit (Ping timeout: 480 seconds)
[9:28] * tnt (~tnt@34.23-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:30] * zynzel_ is now known as zynzel
[9:31] * gucki (~smuxi@80-218-125-247.dclient.hispeed.ch) has joined #ceph
[9:31] <gucki> good morning :)
[9:40] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[9:40] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[9:41] <todin> morning #ceph
[9:44] * sagelap (~sage@bzq-16-168-31-55.red.bezeqint.net) has joined #ceph
[9:47] * iltisanni (d4d3c928@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[9:48] * iltisanni (d4d3c928@ircip1.mibbit.com) has joined #ceph
[9:51] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:52] <tnt> Is there a programmatic way to retrieve the same output as 'radosgw-admin usage show' ?
[10:05] * rektide (~rektide@deneb.eldergods.com) Quit (Server closed connection)
[10:05] * rektide (~rektide@deneb.eldergods.com) has joined #ceph
[10:06] * steki (~steki@ has joined #ceph
[10:13] * BManojlovic (~steki@gprswap.mts.telekom.rs) Quit (Ping timeout: 480 seconds)
[10:16] * elder_ (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[10:18] * Leseb (~Leseb@ has joined #ceph
[10:21] <tnt> I also see radosgw-admin temp remove needs a date ... what is that date exactly ? remove temp older than XXX ?
[10:25] * elder_ (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[10:26] * loicd (~loic@ has joined #ceph
[10:37] * steki (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[10:39] * BManojlovic (~steki@ has joined #ceph
[10:49] * tryggvil (~tryggvil@16-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[10:55] * masterpe (~masterpe@ has joined #ceph
[10:58] * kavonr (~rnovak@ns.indyramp.com) Quit (Server closed connection)
[10:59] * kavonr (~rnovak@ns.indyramp.com) has joined #ceph
[11:00] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[11:05] * dweazle (~dweazle@tilaa.krul.nu) Quit (Server closed connection)
[11:05] * dweazle (~dweazle@tilaa.krul.nu) has joined #ceph
[11:08] * eternaleye (~eternaley@tchaikovsky.exherbo.org) Quit (Server closed connection)
[11:08] * eternaleye (~eternaley@tchaikovsky.exherbo.org) has joined #ceph
[11:12] * sagelap (~sage@bzq-16-168-31-55.red.bezeqint.net) Quit (Ping timeout: 480 seconds)
[11:15] <ctrl> Hi all!!
[11:15] <ctrl> Maybe someone knows how to compile libvirt with rbd support?
[11:17] * yeled (~yeled@spodder.com) Quit (Server closed connection)
[11:17] * yeled (~yeled@spodder.com) has joined #ceph
[11:23] * joao (~JL@ has joined #ceph
[11:32] * yoshi (~yoshi@p20198-ipngn3002marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:35] * maxim (~pfliu@ Quit (Ping timeout: 480 seconds)
[11:38] * morse (~morse@supercomputing.univpm.it) Quit (Server closed connection)
[11:45] * sagelap (~sage@bzq-16-168-31-55.red.bezeqint.net) has joined #ceph
[11:48] * mistur (~yoann@kewl.mistur.org) Quit (Server closed connection)
[11:48] * mistur (~yoann@kewl.mistur.org) has joined #ceph
[11:48] <NaioN> ctrl: there are distros that have libvirt + qemu-kvm + librbd integrated
[11:49] <NaioN> or isn't that an option for you?
[12:06] * yehudasa (~yehudasa@2607:f298:a:607:5df2:7084:cf61:812) Quit (Server closed connection)
[12:07] * yehudasa (~yehudasa@2607:f298:a:607:81d4:c364:fa9e:b0ec) has joined #ceph
[12:09] <ctrl> What distributions?
[12:10] <NaioN> I saw Ubuntu 12.10
[12:12] <ctrl> yeah, i know, but i like rhel distro )
[12:13] <NaioN> aha ok, well I'm not familiar with those
[12:13] <ctrl> :)
[12:14] <ctrl> if i can`t compile, i will try use ubuntu :)
[12:15] <NaioN> hehe it's overall a better choice ;)
[12:37] * isomorphic (~isomorphi@659AABQ4O.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[12:38] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[12:39] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[13:09] * maxim (~pfliu@ has joined #ceph
[13:17] * sagelap1 (~sage@bzq-25-168-31-231.red.bezeqint.net) has joined #ceph
[13:19] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[13:20] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[13:23] * sagelap (~sage@bzq-16-168-31-55.red.bezeqint.net) Quit (Ping timeout: 480 seconds)
[13:23] * kees_ (~kees@devvers.tweaknet.net) has joined #ceph
[13:24] * gregorg (~Greg@ has joined #ceph
[13:27] * maxim (~pfliu@ Quit (Ping timeout: 480 seconds)
[13:29] * sagelap1 (~sage@bzq-25-168-31-231.red.bezeqint.net) Quit (Ping timeout: 480 seconds)
[13:37] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Server closed connection)
[13:37] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[13:41] * isomorphic (~isomorphi@659AABSIX.tor-irc.dnsbl.oftc.net) has joined #ceph
[13:47] <jefferai> if I have multiple OSDs per node, I don't need to put each daemon on its own IP address, do I?
[13:48] <tnt> no
[13:48] <iltisanni> you can have multiple daemons on one node
[13:48] <jefferai> I know
[13:49] <tnt> only mons need to explicitly set ip/port; for the others it's just automatic
[13:49] <jefferai> even if you're using separate cluster/public IPs?
[13:50] <tnt> huh, that I don't know sorry, I never used that config.
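A ceph.conf sketch of the addressing tnt describes, plus the cluster/public split jefferai asks about (hosts and subnets are hypothetical): monitors get fixed addresses, while OSDs only need the networks declared and bind automatically, so multiple OSDs on one node can share its IPs:

```ini
[global]
    ; OSDs pick their own ports on these networks automatically
    public network =
    cluster network =

[mon.a]
    host = mon1
    mon addr =

[osd.0]
    host = osd-node1
    ; no per-daemon addresses needed; several OSDs can share the host's IPs
```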
[13:54] <kees_> i have a fairly simple setup with 3 nodes as cluster and 2 clients. If i put 1000 files on client1 and do an ls, it takes like 0.001s to return, however, if i do the same ls on client2 it takes 16s.. what am i doing wrong?
[13:55] <kees_> to clarify: on the second client i just do the ls, im not writing any files
[13:57] <tnt> you mean using cephfs ?
[13:57] <kees_> yes
[13:57] <tnt> well my guess is that all the file metadata are already cached on client1 so it doesn't have to fetch them ...
[13:57] <gucki> kees_: it might already be in the kernel page cache on client1 as you created the files there?
[13:58] <gucki> kees_: is a second run of ls on client2 faster?
[13:58] <kees_> no, it still takes a while
[13:58] <kees_> even when i reboot client1 the ls is fast there
[13:59] <tnt> and if you create 1000 files on client2, will it be fast on client2 after a reboot and slow on client1 ?
[14:00] <kees_> yep
[14:00] * trhoden (~trhoden@pool-108-28-184-124.washdc.fios.verizon.net) Quit (Server closed connection)
[14:00] <tnt> huh ... that doesn't make any sense
[14:01] <kees_> just tested it, client2 created 1000 files, ls returns instantly, and it took 28 seconds on client1
[14:01] <tnt> yes, but reboot client2 and check if it's still fast
[14:01] <kees_> rebooting.. :)
[14:01] <tnt> because just after creating the file, it's normal to be fast, it'll be cached.
[14:01] <kees_> true, that was my first guess
[14:01] <tnt> it actually might not even be written to the cluster yet ...
[14:02] <kees_> hm, and as long as it isn't written to the entire cluster, it should be locked on other clients i guess
[14:03] <kees_> is there any way to check if a file has been written to the cluster?
[14:03] <tnt> do a sync
[14:03] <tnt> it should force it then
[14:06] <kees_> ok, rebooted the client... on it's "own" files after the reboot: 0.023s, on the other files: 17.6s
[14:14] <kees_> i guess it is only for a while after writing, cause the first directory of files is fast now..
[14:16] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[14:16] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Read error: Connection reset by peer)
[14:38] <elder_> back in a bit, Comcast is here to bury my new cable so service will be flaky...
[14:38] * elder_ (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[14:39] <tnt> kees_: strange ... i don't know how cephfs works well enough to have a theory about this ...
[14:40] <kees_> seeing as it gets faster after some time, i guess it has to do with writing/locking
[14:41] <kees_> and i don't know how ceph works either, but building a PoC now, and ceph might replace our current storage soon :)
[14:41] <tnt> yes but I don't see why the original client would be faster even after a reboot ... I'm not even sure _how_ it would know it's the original client ...
[14:42] <kees_> i guess it has the files locked on the mds, if i tail the logfile on a mds and i do an ls on the first client i get almost nothing in the logs, but if i do the ls on the second client i get several thousand messages in the log
[14:43] <tnt> but after a reboot nothing should be left ...
[14:43] <kees_> true, but i didn't reboot the mds
[14:43] <kees_> and i guess that one is keeping some state as well
[14:57] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[15:12] <nhm> good morning #ceph
[15:12] <jefferai> morn
[15:14] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[15:14] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:18] * iltisanni (d4d3c928@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[15:18] * iltisanni (d4d3c928@ircip2.mibbit.com) has joined #ceph
[15:30] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[15:36] * sagelap (~sage@bzq-25-168-31-231.red.bezeqint.net) has joined #ceph
[15:43] * hijacker (~hijacker@ Quit (Remote host closed the connection)
[15:45] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[15:52] * hijacker (~hijacker@ has joined #ceph
[15:58] * PerlStalker (~PerlStalk@perlstalker-1-pt.tunnel.tserv8.dal1.ipv6.he.net) has joined #ceph
[15:59] * hijacker (~hijacker@ Quit (Read error: Connection reset by peer)
[16:00] * hijacker (~hijacker@ has joined #ceph
[16:02] * hijacker (~hijacker@ Quit (Remote host closed the connection)
[16:02] <joao> sagelap, there?
[16:03] <wido> hey joao ;)
[16:03] <joao> hey wido :)
[16:03] <joao> how's it going?
[16:03] <wido> Great, at ApacheCon now
[16:03] <wido> Just gave a talk about CloudStack and RBD, more people interested in Ceph! :)
[16:03] <joao> how did the talk go?
[16:06] * hhoover (~hhoover@of2-nat1.sat6.rackspace.com) has joined #ceph
[16:08] * hijacker (~hijacker@ has joined #ceph
[16:12] * hhoover (~hhoover@of2-nat1.sat6.rackspace.com) Quit (Quit: ["Bye"])
[16:15] <scuttlemonkey> nice Wido
[16:15] <scuttlemonkey> did you post the slides up?
[16:15] <scuttlemonkey> (also, are you going to Apachecon NA?)
[16:15] <wido> joao: scuttlemonkey I
[16:15] <wido> Uh, yes, I'll post the slides
[16:15] <scuttlemonkey> sweet
[16:15] <wido> they will come online at ApacheCon's website
[16:16] <wido> Not going to ApacheCon NA though, doesn't fit in my schedule :(
[16:16] <joao> ApacheCon NA?
[16:16] <scuttlemonkey> north america
[16:16] <joao> oh
[16:16] <wido> Also not going to make the CloudStack summit in November
[16:16] <joao> makes sense :p
[16:16] <scuttlemonkey> the one in Portland in Feb
[16:16] <wido> In Vegas
[16:16] <scuttlemonkey> ahh
[16:17] <wido> But I'll be in Vegas and CA in the beginning of February
[16:17] <scuttlemonkey> cool
[16:17] <scuttlemonkey> think I'm gonna try to hit apachecon in Feb
[16:17] <scuttlemonkey> and maybe FAST earlier the same month
[16:30] * kees_ (~kees@devvers.tweaknet.net) Quit (Remote host closed the connection)
[16:43] * Yann__ (~Yann@did75-15-88-160-187-237.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[16:44] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[16:49] * oliver2 (~oliver@jump.filoo.de) has joined #ceph
[16:51] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[16:52] * kyann (~yann.robi@tui75-3-88-168-236-26.fbx.proxad.net) Quit (Quit: Quitte)
[16:52] <oliver2> Gd'day... anybody out there on Ubuntu 12.04.1 with xfs + syncfs? We bit the bullet and reinstalled two new nodes from Debian hoping for a performance increase?!
[16:53] <oliver2> still getting:
[16:53] <oliver2> 2012-11-06 16:30:32.798016 7f036457e780 0 filestore(/data/osd6-3) mount syncfs(2) syscall not support by glibc
[16:53] <oliver2> 2012-11-06 16:30:32.798019 7f036457e780 0 filestore(/data/osd6-3) mount no syncfs(2), must use sync(2).
[16:54] * kYann (~Yann@did75-15-88-160-187-237.fbx.proxad.net) has joined #ceph
[16:57] * stxShadow (~jens@ has joined #ceph
[16:57] <stxShadow> Hi all
[17:03] * kYann (~Yann@did75-15-88-160-187-237.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[17:06] * PerlStalker (~PerlStalk@perlstalker-1-pt.tunnel.tserv8.dal1.ipv6.he.net) Quit (Remote host closed the connection)
[17:06] * PerlStalker (~PerlStalk@perlstalker-1-pt.tunnel.tserv8.dal1.ipv6.he.net) has joined #ceph
[17:14] <stxShadow> is there anything i need to do so that ceph uses syncfs on ubuntu 12.04 ?
[17:14] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[17:16] <oliver2> ceph is 0.48.2 ATM.
[17:16] <jefferai> oliver2: you probably have libc 2.13
[17:19] * verwilst (~verwilst@d5152D6B9.static.telenet.be) Quit (Quit: Ex-Chat)
[17:20] * jlogan1 (~Thunderbi@2600:c00:3010:1:9b2:ed42:a1f6:a6ec) has joined #ceph
[17:21] <stxShadow> jefferai -> no ... we use precise with libc 2.15
[17:23] <stxShadow> hmmm ... you are right ....
[17:23] * MikeMcClurg (~mike@ has joined #ceph
[17:23] <oliver2> Sh.t.
[17:24] <jefferai> hm
[17:24] <jefferai> I know that precise does have a 2.15 package
[17:24] <stxShadow> libc6/precise-updates uptodate 2.15-0ubuntu10.3
[17:24] <stxShadow> but :
[17:24] <stxShadow> Package: libc6
[17:24] <stxShadow> Version: 2.15-0ubuntu10
[17:24] <stxShadow> Provides: glibc-2.13-1
[17:24] <jefferai> well that's odd
[17:24] <jefferai> :-)
[17:25] <oliver2> true... true...
[17:25] <stxShadow> i don't have to understand that
[17:25] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[17:25] <stxShadow> question: where may i get the right libc ?
[17:25] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[17:25] <jefferai> well, you probably do have to understand that
[17:26] <jefferai> because it should be giving you 2.15 :-)
[17:26] <jefferai> seems like the latest is 2.15-0ubuntu10.3?
[17:26] <jefferai> elder: sage: you guys do your testing mostly on precise, right? Can you comment on that? ^
[17:32] * bchrisman1 (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:33] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[17:34] <elder> jefferai, I can tell you my testing is on precise. But my work is almost entirely focused on the kernel so I don't know a lot about user space dependencies.
[17:35] <oliver2> root@fcmsnode6:/lib/x86_64-linux-gnu# strings libc-2.15.so | grep syncfs
[17:35] <oliver2> syncfs
[17:35] <oliver2> this is what the shared-lib tells me.
[17:37] <oliver2> I'm in a loop, ldd on ceph-osd points to the same libc.so.6.
[17:39] <jefferai> elder: I figured you might know if precise is able to take advantage of syncfs :-)
[17:39] <jefferai> just to confirm what's expected
[17:39] <jefferai> since it's a syscall
[17:42] <jefferai> oliver2: did you switch to precise *just* for the syscall?
[17:42] * jefferai saw your mail
[17:44] <oliver2> jefferai: one of many reasons... LTS, preferred by many ( learned on ceph-workshop ;) ) and general up-to-date policy in comparison to debian.
[17:45] * tnt uses 12.04 as well, works like a charm so far
[17:45] <gucki> jefferai: did you end up with using master or stable?
[17:45] <jefferai> gucki: ?
[17:46] <gucki> jefferai: oh, did you ask yesterday if you should go with debian-testing or debian? if not, i'm mixing something up now ;)
[17:48] <gucki> anybody here who knows how to get better performance when using ceph as storage for kvm guests (qemu-rbd)?
[17:49] <gucki> i already turned on the cache (writeback), but sometimes reads hang awfully long... so logging in via ssh can hang for e.g. 30 seconds :(
[17:49] <gucki> when inside the vm and doing a "find /" for example it looks fast...it's really strange
[17:50] * glowell2 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has left #ceph
[17:50] <tnt> gucki: strange ... I don't use qemu-rbd, I use the kernel driver and although it's not blazingly fast, it doesn't just 'pause'.
[17:51] <gucki> tnt: how many osds and clients do you have? how many page cache on the osds?
[17:52] <oliver2> gucki: I'm now on qemu-1.2.0 with :rbd_cache=true, did u try that already?
[17:52] <tnt> I have 4 OSDs. 2 have 2 disks and 2 have 4 disks. page cache ? huh ...you mean memory ?
[17:53] <gucki> tnt: i got 6 osds and around 62 clients (kvm guests). do you think there's simply too much random io? i mean the disks of the osds aren't fully loaded....but probably i need more threads to handle async io better?
[17:53] <gucki> tnt: with page cache i meant how much memory is used by the host for caching..so eg 20gb total, 16gb used by vms so 4gb free for caching of disk data
[17:54] <tnt> I have 1 gb free per disk on the OSD
[17:54] <tnt> They're also all on 10k disks with HW jbod cards with battery backed write cache.
[17:54] <gucki> tnt: are you using raid? otherwise i think you should have an extra osd for each disk?
[17:54] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[17:54] <gucki> tnt: ok, so much better than mine. mine are normal 7.2k disks..
[17:55] <tnt> I use the raw disks, so 1osd process per disk.
[17:55] <gucki> ok, so you have 2*2 + 2*4 = 12 osds?
[17:55] <tnt> yes
[17:56] <tnt> sorry ... I noticed my mistake above, I mean 4 machines running OSDs :p
[18:02] <gucki> tnt: no problem. where did you put the osd journals?
[18:04] * stxShadow (~jens@ Quit (Remote host closed the connection)
[18:05] <tnt> gucki: on the same disk as the data but in a dedicated partition
[18:06] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:11] * gravatite (~nick@monkeybar.nickstoys.com) has joined #ceph
[18:12] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[18:15] <oliver2> Be back l8r with my libc-problem... beware ;)
[18:15] * oliver2 (~oliver@jump.filoo.de) has left #ceph
[18:21] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[18:23] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:31] * johnl (~johnl@2a02:1348:14c:1720:3960:bd45:f20f:59ae) Quit (Remote host closed the connection)
[18:31] * johnl (~johnl@2a02:1348:14c:1720:f499:57b9:54fe:1992) has joined #ceph
[18:31] * slang (~slang@ace.ops.newdream.net) Quit (Ping timeout: 480 seconds)
[18:37] * Tv (~tv@cpe-76-170-224-21.socal.res.rr.com) has joined #ceph
[18:38] <Tv> nhm: https://www.computerworld.com/s/article/9233217/Intel_releases_third_gen_data_center_SSD_slashes_price_by_40_
[18:38] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:38] <nhm> Tv: nice
[18:40] <jefferai> Tv: nice...still pricey, but nice
[18:40] <nhm> Tv: that's really impressive.
[18:40] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[18:40] <jefferai> those are some pretty beefy numbers
[18:42] <nhm> If the 100GB version can sustain the full read/write throughput that is going to be a fantastic journal disk for a very reasonable price.
[18:55] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[19:06] * vata (~vata@ has joined #ceph
[19:06] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:08] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) has joined #ceph
[19:09] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:12] * tnt (~tnt@34.23-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[19:17] * Tv (~tv@cpe-76-170-224-21.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[19:33] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[19:34] * nwatkins (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[19:34] * chutzpah (~chutz@ has joined #ceph
[19:36] * sagelap (~sage@bzq-25-168-31-231.red.bezeqint.net) Quit (Ping timeout: 480 seconds)
[19:41] * psandin (psandin@staff.linode.com) has left #ceph
[19:50] * gregaf (~Adium@2607:f298:a:607:54ad:7d4e:4ed:af03) has joined #ceph
[19:51] * adjohn (~adjohn@ has joined #ceph
[20:06] * styx-tdo (~styx@chello084113243057.3.14.vie.surfer.at) has joined #ceph
[20:07] <styx-tdo> hi, maybe this is a silly question.. but is ceph a concurrent write cluster filesystem like gfs?
[20:07] <sjust> in the sense that it is clustered and supports concurrent writes efficiently
[20:07] * stxShadow (~jens@ip-178-203-169-190.unitymediagroup.de) has joined #ceph
[20:07] * stxShadow (~jens@ip-178-203-169-190.unitymediagroup.de) Quit ()
[20:08] <styx-tdo> can i e.g. mount it on 2 servers and the locks etc are intelligently managed? e.g. as backend for a webserver?
[20:08] <styx-tdo> *webserver farm
[20:08] * stxShadow (~jens@ip-178-203-169-190.unitymediagroup.de) has joined #ceph
[20:08] * glowell1 (~glowell@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[20:09] <gregaf> styx-tdo: it parcels out "capabilities" to write on a file-by-file basis
[20:09] <gregaf> if your webservers are all writing to the same file then the MDS will coordinate them into taking turns, but it's not going to let them each independently write into a separate part of the file simultaneously, if that's what you're asking
[20:10] <gregaf> if they are writing to different files, then yes, they will all write independently
[20:10] * sagelap (~sage@bzq-218-183-205.red.bezeqint.net) has joined #ceph
[20:10] <styx-tdo> ok. file-level locking, then.. ok.
[20:10] <rweeks> ceph is actually an object store that also can expose that object store as a POSIX filesystem.
[20:11] <rweeks> so comparing it to GFS is a bit tricky
[20:11] <gregaf> styx-tdo: the kernel client also supports standard posix and flock locks if your application wants to do those
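gregaf's point about flock locks can be shown with a plain flock(2) round trip. The file below is a local temp file standing in for a path on a CephFS kernel mount, where (per the discussion) the same calls would be coordinated across client machines:

```python
import fcntl
import os
import tempfile

# flock(2) contention between two descriptors on the same file; on a
# CephFS kernel mount the same calls would contend across machines.
path = os.path.join(tempfile.mkdtemp(), "web.lock")

holder = open(path, "w")
fcntl.flock(holder, fcntl.LOCK_EX)                     # take the exclusive lock

contender = open(path, "w")
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    blocked = False
except OSError:                                        # EWOULDBLOCK: lock is held
    blocked = True

fcntl.flock(holder, fcntl.LOCK_UN)                     # release ...
fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)  # ... now it succeeds

print(blocked)  # True
holder.close()
contender.close()
```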
[20:11] <gregaf> rweeks: that…doesn't make any sense
[20:12] <rweeks> why not?
[20:12] <styx-tdo> so, how does e.g. xfs come into the game? if i have ceph as a FS, why would i need xfs or btrfs at all?
[20:12] <gregaf> CephFS uses the object store; it's not exporting it
[20:12] <rweeks> ceph isn't just a filesystem like gfs, is my point.
[20:12] <gregaf> and we can compare it to GFS as well as we can to any other distributed FS
[20:12] <rweeks> I suppose
[20:12] <rweeks> my point is that the architecture is very different than other distributed filesystems.
[20:12] <gregaf> styx-tdo: Ceph OSDs (object storage devices/daemons) store their data in regular local filesystems
[20:13] <styx-tdo> OOH - That makes it clearer
[20:13] <styx-tdo> so, basically: Format with anything, config & start the ceph daemons, - boom, instant cluster
[20:14] <styx-tdo> that takes care of replication between storage nodes and, in addition, provides access to clients
[20:14] * Tamil (~Adium@ has joined #ceph
[20:14] <gregaf> styx-tdo: umm, I believe so, yes :)
[20:15] <styx-tdo> *g* that sounds too good to be true, somehow
[20:15] <gregaf> it's not like re-exporting a local filesystem the way NFS is doing
[20:15] <stxShadow> hmmm ..... why the heck is syncfs not working in ubuntu 12.04 .... what are we doing wrong .... ?
[20:15] <gregaf> as long as that isn't your mental model, then yes
[20:16] <styx-tdo> i have somehow very mixed experiences w/ gfs2 (nodes going mute&deaf, locking hell,...) and look for something that works
[20:17] <gregaf> you do need to be aware that Inktank doesn't consider CephFS to be stable at this time — it's "feature complete" but needs to go through some good quality QA
[20:18] <styx-tdo> hm.. are we talking minor issues like "ops, my node crashed" or things like "where the hell are all my files?"
[20:19] * Tv (~tv@cpe-76-170-224-21.socal.res.rr.com) has joined #ceph
[20:19] <gregaf> usually the first, although somebody recently reported the second :(
[20:20] <gregaf> I don't think anybody's run into new issues after testing their workload for a week or two, so you could try that if you have some spare hardware
[20:20] <styx-tdo> hm. sounds like a plan.
[20:21] <styx-tdo> :( hooray. back to the "buy hardware" stage, then
[20:21] <rweeks> styx-tdo: do you happen to be coming to the SC12 conference next week?
[20:22] <styx-tdo> nope :/
[20:22] <rweeks> ah ok
[20:22] <rweeks> a number of us from Inktank will be there
[20:23] <styx-tdo> utah is a bit far :)
[20:23] <stxShadow> are there any recommended packages i have to install to get syncfs working ?
[20:23] * MikeMcClurg (~mike@ Quit (Ping timeout: 480 seconds)
[20:25] <gregaf> stxShadow: I know people have been talking about this on the mailing list; it should just work but indeed something seems to have happened
[20:25] <gregaf> joshd or nhm might know more since I know we use Precise locally and do QA runs on it that I don't think have seen any issues related to that
[20:26] <Oliver2> Yes Greg, that's the "something" we're after ;)
[20:27] <sjust> Oliver2: where did you get the package? from us?
[20:27] <NaioN> stxShadow, gregaf for my information: is syncfs really only useful if you have multiple osds per node?
[20:27] <NaioN> or is it also better for 1 osd/node?
[20:27] <Oliver2> sjust: from the evil internet, harrr...
[20:28] <sjust> Oliver2: trying to work out why the package was built without syncfs
[20:29] <sjust> yeah, that's the message for when syncfs is ifdef'd out
[20:29] <Oliver2> sjust: it's from ceph.com squeeze
[20:29] <sjust> odd
[20:29] <gregaf> NaioN: syncfs means that you only need to sync the filesystem/disk that the OSD is on, instead of all filesystems in the system
[20:29] <gregaf> it's most important for multiple OSDs per node but might help a bit in a single-daemon configuration too
[20:29] <sjust> ah... try the precise package?
[20:30] <Oliver2> sjust: from where? Would always believe, that a propagated system includes all necessary files…? ;)
[20:30] <gregaf> I think he means the Ceph package, which certainly shouldn't be regressed to Precise's 0.41
[20:30] <NaioN> gregaf: k thx well in a single-osd config the performance gain will be minimal
[20:31] <Oliver2> precise install 0.48.2argonaut…?!
[20:32] <Oliver2> Uhm, means, with a /etc/apt/sources.list.d/ceph.list with ceph.com squeeze installs 0.48.2 ;)
[20:32] <gregaf> ah, I see
[20:33] <sjust> Oliver2: the problem might be that you installed the package built for squeeze (which may not include syncfs) rather than the one built for precise (which would)
[20:33] <sjust> can you post your sources.list.d/ceph.list?
[20:33] <Oliver2> sjust: well, so my q.: from where?
[20:34] <Oliver2> sjust: deb http://ceph.com/debian/ squeeze main
[20:34] <sjust> change sqeeze to precise
[20:34] <sjust> *squeeze
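sjust's fix — the suite in the apt source line must match the running distro, or you get the squeeze-built package without syncfs — is a one-word change; a trivial sketch mechanizing it (pure string surgery on the line from the conversation, nothing here touches apt):

```python
def fix_suite(line, wrong="squeeze", right="precise"):
    """Swap the suite field of a one-line apt source: deb <url> <suite> <components>."""
    parts = line.split()
    if len(parts) >= 3 and parts[0] == "deb" and parts[2] == wrong:
        parts[2] = right
    return " ".join(parts)

print(fix_suite("deb http://ceph.com/debian/ squeeze main"))
# deb http://ceph.com/debian/ precise main
```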
[20:34] * jjgalvez (~jjgalvez@ has joined #ceph
[20:35] <Oliver2> It's from the ceph.com pages as recommendation for debian/ubuntu?
[20:36] <Oliver2> From my understanding.. it's a shared library; I test for a syscall like "syncfs", if it's there, I utilize it. If not, fallback.
[20:37] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[20:38] <sagelap> joshd: there?
[20:38] <Oliver2> mount syncfs(2) syscall fully supported (by glibc and kernel)
[20:38] <joshd> sagelap: yeah
[20:38] <Oliver2> This makes my day.
[20:38] <sagelap> joshd: pushed a fix for the memory leak (well, the one i turned up with massif) in wip-rbd-read
[20:38] <sagelap> otherwise looked ok to you? want to merge it?
[20:39] <sagelap> oh, i ran for 10000 ops, haven't done longer yet
[20:39] <Oliver2> But...
[20:39] <joshd> sagelap: I'll want to test performance too, since that's what this change is for
[20:39] <sagelap> perfect
[20:51] <Oliver2> Thnx sjust. Sometimes one is blind B-)
[20:54] <nhm> joshd: what change is this?
[20:55] <joshd> nhm: moving layered read handling above the cache for rbd
[20:55] * AaronSchulz (~chatzilla@ has joined #ceph
[20:55] <joshd> nhm: it's in wip-rbd-read
[20:55] <nhm> joshd: cool. I'm so behind. :)
[20:56] <AaronSchulz> yehudasa: does X-Account-Meta-Temp-URL-Key work with rados (as in http://docs.openstack.org/trunk/openstack-object-storage/admin/content/swift-tempurl.html)?
[20:56] <joshd> nah, sage just did it over the weekend
[21:02] <jmlowe> joshd: off the top of your head do you know what arguments to nova boot to use to boot from a volume instead of a image?
[21:03] <joshd> jmlowe: you used to need to specify a block device mapping, I'm not sure currently though
[21:04] <jmlowe> so you have to use /dev/rbdN?
[21:04] <jmlowe> you can't use the qemu rbd driver?
[21:04] <joshd> jmlowe: no, you just say 'try to map /dev/vda in the guest to this openstack volume id'
[21:05] <jmlowe> ah, I get it now
[21:05] <joshd> this worked on diablo: https://github.com/ceph/ceph-openstack-tools/blob/master/boot-from-volume
[21:06] * vata (~vata@ Quit (Remote host closed the connection)
[21:10] <yehudasa> AaronSchlz: no it doesn't, for the gateway you can use the S3 presigned urls
[21:11] <yehudasa> AaronSchulz: .
[21:11] * vata (~vata@ has joined #ceph
[21:12] <NaioN> I had some more thought about the discussion on the mailing list about recovery/number of nodes/speed
[21:12] <AaronSchulz> yehudasa: oops, I meant rgw not rados
[21:13] <yehudasa> yeah, my response stands, you can use the S3 api for that
[21:13] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[21:13] * AaronSchulz has to resist the urge to abbreviate "rados gateway" as "rados"
[21:13] <yehudasa> radosgw, rgw
[21:13] <yehudasa> maybe we need to have a more catchy name, e.g., sparky
[21:13] <NaioN> would it be an option to have a config option for telling ceph to first recover back to a consistent state before moving pg's back to a replaced osd?
[21:14] <NaioN> Am I correct that when an osd fails and you replace it (before the recovery finishes), ceph begins filling the replaced osd again
[21:14] <AaronSchulz> yehudasa: I'm trying to use the swift api of rgw, is there anything that can be done to have this functionality at this time?
[21:14] <NaioN> and that would take more time than first recover to the other osds?
[21:14] <AaronSchulz> I guess one could use whichever api is convient
[21:15] <AaronSchulz> though that feels evil :)
[21:15] <AaronSchulz> *convenient
[21:15] <yehudasa> we can put that on the tracker, so that we can prioritize it
[21:16] <joshd> NaioN: there's already some smarts like that. if a chosen primary needs to be backfilled before going active, one with up-to-date data can temporarily override the crush-chosen one and act as the primary
[21:17] * gucki (~smuxi@80-218-125-247.dclient.hispeed.ch) Quit (Remote host closed the connection)
[21:17] <NaioN> joshd: but it could also be the secondary?
[21:18] <NaioN> joshd: but the point is that after failure of an osd (with rep 2) all the pgs residing on that osd have a rep level of 1
[21:18] <joshd> NaioN: yes, I think it has to be a secondary to get backfilled
[21:19] <NaioN> and when you bring the replaced osd back up it starts to fill the replaced osd
[21:19] <NaioN> but that will take a long time, so for a long time those pgs have a rep level of 1
[21:20] <NaioN> while if you continue to replicate to the other osds first you are faster back in a consistent state
[21:20] <NaioN> and then begin backfilling the replaced osd
[21:21] <NaioN> you could mimic the behavior by waiting to add the replaced osd till after the recovery is done
[21:21] <joshd> yeah, this is basically what happens now I believe
[21:22] <NaioN> joshd: which one? :)
[21:22] * slang (~slang@ace.ops.newdream.net) has joined #ceph
[21:22] <joshd> https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L1091
[21:23] <joshd> preferring osds with log continuity means they have less recovery to do
[21:24] <joshd> NaioN: the latter, continuing to use the existing replicas while backfilling the replaced one
[21:24] <NaioN> ok so basically it first moves the pgs to other osds and then moves them back to the replaced osd
[21:25] <NaioN> joshd: well i get it that it uses the existing replicas
[21:25] <NaioN> but that's not what i'm trying to say :)
[21:26] <NaioN> the point is that you want to minimize the time you have a rep level of 1 for pg's (if you set it on 2 for the pool)
[21:29] <joshd> I think I may be confused about what you mean, but keep in mind that the extra copies of data aren't deleted until recovery is complete, so 'moving pgs' is cheap
[21:30] <NaioN> joshd: yeah i know that
[21:30] <NaioN> let me try it another way
[21:30] <NaioN> you have a 1 osd per node, 10 node cluster
[21:30] <NaioN> 1 node fails
[21:31] <NaioN> the pgs that resided on that failed node have an actual rep level of 1
[21:31] <NaioN> assuming you had a rep level of 2 for the cluster/pool
[21:32] <NaioN> the cluster begins to remap the pgs to other nodes/osds and begins moving pgs from the existing replica
[21:32] <NaioN> but you replace the osd/node in the meanwhile and ceph sees the replaced osd/node (failed disk so a clean osd)
[21:33] <NaioN> what will ceph do? will it continue to first move the pgs as it was doing
[21:33] <NaioN> or now it sees the replaced osd and begins to fill that one with the pgs it had before?
[21:34] <NaioN> because that would take a longer time than recovering to the other osds
[21:34] <NaioN> so for the not-yet-recovered pgs it takes longer to get back to rep level 2
[21:34] * sagelap (~sage@bzq-218-183-205.red.bezeqint.net) Quit (Read error: Connection reset by peer)
[21:35] <NaioN> because all osds with existing pgs begin writing to the replaced osd, instead of reading/writing to each other
[21:36] <sjust> Oliver2: no problem, let me know if it works out
[21:36] <joshd> so initially it will move the pgs back to the replaced osd, but as they peer, it will use that choose_acting function to remap any pgs that have replicas that are more up to date
[21:37] * sagelap (~sage@bzq-218-183-205.red.bezeqint.net) has joined #ceph
[21:37] <NaioN> joshd: I get that... but it means it will take longer before all pgs have rep level 2 again
[21:39] <joshd> yeah, it could certainly be suboptimal
[21:39] <Oliver2> sjust: it's running… 21% degraded…
[21:40] <NaioN> so it would be better to first wait till the cluster recovers (this goes fast because all osds read/write to each other on average)
[21:40] <joshd> but if there aren't other relatively up to date replicas to be remapped to, everything will be read from the single replica that stayed up
[21:40] <NaioN> and then replace the failed osd
[21:41] <joshd> yes, I see what you mean now. that's right.
[21:41] <NaioN> hehe yeah it's a bit hard to explain what i mean
[21:42] <NaioN> i understand it will read/write to the other replica's for the time being but that wasn't the point i tried to make
[21:42] <joshd> if your goal is to minimize the time pgs are degraded, it might be best to wait for recovery before replacing an osd
[21:42] <NaioN> yeah i assume that would be the preferred goal
[21:43] <NaioN> because if you have a rep level of 2 and you have two osd failures you could lose pgs
[21:43] <NaioN> so you would like to minimize the recovery time
[21:44] <joshd> right, that's usually the goal, but sometimes you might want to e.g. recover objects that clients are trying to use before unused ones, even if it takes a little longer
[21:45] <NaioN> well i understand, but i think it's always better/faster to recover to the remaining osds instead of recovering to the replaced osd
[21:46] <joshd> only if you know a large amount of recovery will be needed (i.e. lots of writes happened to that osd's pgs while it was being replaced)
[21:47] <NaioN> because the more nodes/osd you have the less data you have to move between two osds on average
[21:47] <gregaf> so yes, that's an interesting idea; no, we don't support it now
[21:47] <NaioN> well with replacing i mean the disk crashed and you have to replace it, so you have to move all data back
[21:48] <gregaf> one thing you could do is give the new OSD a very low CRUSH weight so that most of the PGs continue replicating elsewhere
[21:48] <NaioN> gregaf: yeah that would be an option
[21:48] <NaioN> but i would like to have it automated
[21:48] <NaioN> with an option
[21:49] <NaioN> because i could get this behavior just by waiting for the recovery to complete and then adding the replaced osd back
[21:49] <NaioN> I see this behavior for example with our sans (HP EVA)
[21:50] <NaioN> if a disk fails it first rebuilds to the other disks in the pool even if you add a replacement
[21:50] <NaioN> after it finishes rebuilding it begins leveling = spreading the data evenly over all the disks again
[21:51] <NaioN> so the replaced disk gets filled again
[21:51] <joshd> yeah, probably the easiest way to accomplish that would be an automated way to add the osd and gradually increase its weight
[21:52] <joshd> that could be done from a script with no changes to the osd
[21:52] <NaioN> yes it could
[21:53] * s_parlane (~scott@121-74-235-205.telstraclear.net) Quit (Ping timeout: 480 seconds)
[21:53] <NaioN> set the weight to 0 till recovery finishes and begin increasing it after that
[21:53] <NaioN> so you don't overload the osd in the meanwhile
[21:53] <joshd> right
[21:54] <joshd> the only problem with that is if the script crashes, the osd is stuck at a lower weight than intended
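The script joshd and NaioN sketch out — hold the replaced OSD at weight 0 until recovery completes, then ramp its CRUSH weight up in steps — could look like the following. `ceph osd crush reweight` and `ceph health` are real CLI commands, but the osd name, polling interval, step schedule, and the HEALTH_OK string check are assumptions; this is an illustration of the idea, not a supported tool:

```python
import subprocess
import time

def ceph(*args):
    """Run a ceph CLI command and return its stdout (assumes `ceph` is on PATH)."""
    return subprocess.check_output(("ceph",) + args).decode()

def cluster_healthy():
    # `ceph health` prints HEALTH_OK once recovery has finished.
    return ceph("health").startswith("HEALTH_OK")

def weight_steps(target, steps):
    """Evenly spaced CRUSH weights ending at `target`."""
    return [round(target * i / steps, 2) for i in range(1, steps + 1)]

def ramp_in_osd(osd_name, target_weight=1.0, steps=4, poll=30):
    # Keep the replaced OSD at weight 0 so no PGs map to it ...
    ceph("osd", "crush", "reweight", osd_name, "0")
    while not cluster_healthy():       # ... until replication is restored
        time.sleep(poll)
    # then ramp the weight up gradually, letting each step settle.
    for w in weight_steps(target_weight, steps):
        ceph("osd", "crush", "reweight", osd_name, str(w))
        while not cluster_healthy():
            time.sleep(poll)

# usage (hypothetical osd name): ramp_in_osd("osd.12", target_weight=1.0)
```

As joshd notes, if such a script dies mid-ramp the OSD is left at a lower weight than intended, so it would need to be restartable/idempotent in practice.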
[21:56] <NaioN> well that would be a minor problem, for me it's more important to get back to a consistent state as soon as possible so i could take another osd failure
[21:57] <NaioN> with a lot of nodes/osds the chances of a double failure grow
[21:58] <NaioN> and with double failure i mean a second osd failure while the cluster is in recovery
[21:59] <joshd> yeah, it'd be good to add that to ceph-deploy or something
[21:59] <NaioN> and of course you could increase the rep level, but the same story holds if you have hundreds of osds
[21:59] <NaioN> should i add a feature request with this story?
[22:00] <joshd> go for it
[22:00] <NaioN> hehe
[22:01] <rweeks> NaioN: this is why in a large cluster you want to spread your CRUSH map across multiple rows in a datacenter
[22:02] <NaioN> rweeks: i know you can spread the risk with the CRUSH map, but it's all in the numbers
[22:02] * sagelap (~sage@bzq-218-183-205.red.bezeqint.net) Quit (Read error: No route to host)
[22:03] <gregaf> if it's random failures happening that doesn't eliminate the problem
[22:03] <NaioN> the more nodes/disks the more chance you have of a failure
[22:03] <rweeks> I mean, of course there is no perfect protection against disaster
[22:03] <NaioN> no certainly not
[22:03] <NaioN> this is also one of the weak points of a HP EVA
[22:03] <rweeks> yes, but if your nodes and disks are spread across multiple rows in a datacenter and your replicas are also spread, even a double failure won't kill you
[22:04] <rweeks> this is where the EVA architecture and CRUSH/RADOS differ
[22:04] <rweeks> EVA was _never_ designed to scale that way
[22:04] <NaioN> no that's true
[22:04] <gregaf> not if you have 2x replication and you lose two nodes that share data…which you can...
[22:04] <sjust> rweeks: if your replicas are on three rows and those osds die at the same time, you still lose data...
[22:05] <rweeks> I would argue the EVA was never designed to scale at all
[22:05] <NaioN> rweeks: hehe
[22:05] <rweeks> of course, sjust
[22:05] <NaioN> well as gregaf said you have enough chance of a failure across failure domains
[22:05] <rweeks> you cannot completely prevent a failure of anything
[22:06] <NaioN> and as your cluster grows the chances of a double failure across failure domains increase
[22:06] <rweeks> I'm not sure we have the statistics to prove that
[22:07] <NaioN> well it's very easy
[22:07] <sjust> rweeks: probability of a double failure is (1-n)**2 or so for large n (- a bit due to failure domains, etc)
[22:07] <sjust> wow
[22:07] <sjust> that was not correct
[22:07] <sjust> (1-(1/n))**2
[22:08] <rweeks> so if you have triple replicas, how much does that matter?
[22:08] <NaioN> a lot :)
[22:08] <sjust> wow, wrong again, this is a bad day for me
[22:08] <gregaf> 1 - p ^ n
[22:08] <gregaf> where p is the probability of a single-node failure
[22:09] <gregaf> and n is the number of nodes
[22:09] <gregaf> if you have three replicas then obviously you need three nodes to die
[22:09] <rweeks> ok, so what are the numbers behind the chances of 3 nodes dying simultaneously?
[22:09] <gregaf> wait, shoot, that's not it either, argh
[22:09] <NaioN> gregaf: no you don't count the failure domains in this
[22:10] <gregaf> wait again, yes it is, we've now got three confused engineers
[22:10] <gregaf> NaioN: yeah, that's true but failure domains makes it harder ;)
[22:10] <rweeks> (full disclosure: I came from NetApp and I know one of their HUGE arguments around their RAID-DP is that the likelihood of a dual failure occurring is super small)
[22:10] <NaioN> yes i know :)
[22:10] <rweeks> 3 confused engineers and one confused marketing guy who came from IT.
[22:10] <NaioN> rweeks: yeah but they have DP per a number of disks
[22:11] <rweeks> of course.
[22:11] <rweeks> usually 16-20
[22:11] <NaioN> and they spread the data over all failure domains
[22:11] <rweeks> no, the RAID-DP is just for that raid group of 16-20 disks
[22:11] <NaioN> so you can take 2 disk failure per failure domain
[22:11] <rweeks> you have two parity disks for that raid group.
[22:11] <rweeks> correct.
[22:12] <NaioN> rweeks: a failure domain = a raid group in this context
[22:12] <rweeks> yah I realized that
[22:12] <NaioN> the EVA works the same
[22:12] <rweeks> except that
[22:12] <NaioN> it has (on average) failure domains of 8 disks
[22:13] <darkfaded> a filer doesn't go on rebuilding and equalizing all weekend
[22:13] <rweeks> both the netapp and the EVA are restricted to x number of controllers that have disk shelves hanging off them
[22:13] <darkfaded> *giggle*
[22:13] * rweeks snorts at darkfaded
[22:13] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[22:13] <rweeks> in the NetApp edge case, you could have up to 24 of their heads in a cluster-mode system
[22:13] <NaioN> rweeks: what do the filers/controllers have to do with it?
[22:14] <rweeks> because more often than not, the filer head or the disk shelf is what fails
[22:14] <NaioN> oh ok
[22:14] <rweeks> with, let's say 300 disks hanging off it
[22:14] <NaioN> well that's just bad design :)
[22:14] <rweeks> my point exactly.
[22:14] <rweeks> the mathematics don't change
[22:14] <rweeks> but the design CAN
[22:15] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:15] <rweeks> the EVA has similar non-scalable design, really.
[22:15] <NaioN> so the chances of a head failure are greater than a multi disk failure in a failure domain
[22:15] <NaioN> yeah I know
[22:15] <NaioN> that's one of the reasons we are looking at ceph :)
[22:15] <rweeks> so here, let's say you designed your ceph cluster around 1U boxes with, eh 4 disks each
[22:15] <rweeks> and placed those around your datacenter in multiple stacks and rows
[22:16] <rweeks> your failure domains are much different than a traditional storage array, and can be sculpted to your needs
[22:16] <rweeks> which yeah, is probably one of the coolest things about Ceph there is.
[22:16] <NaioN> yes true
[22:17] <NaioN> well also that you have a pseudo-random placement algorithm, so you don't need a metadata server that keeps track of where your objects are
[22:17] <NaioN> that makes it real scalable
[22:17] <rweeks> yep.
[22:19] <NaioN> but the point I tried to make is that with a rep level of 2 in the event of a double osd failure you lose all pgs that those osds shared and weren't recovered (replicated) in the meantime to other osds
[22:20] <NaioN> so you want to minimize the recovery time
[22:20] <NaioN> and of course think good about your failure domains (the CRUSH map)
[22:21] <dweazle> also, with a rep level of 2, if you have silent data corruption ceph doesn't know which one is the good copy, so you might consider using a rep level of 3 regardless
[22:21] <NaioN> dweazle: well that's indeed another good point for rep level 3, although I don't know if deep scrubbing is in place at the moment
[22:21] <rweeks> right
[22:22] <rweeks> this is why we recommend a replica level of 3 by default
[22:22] <dweazle> NaioN: don't know that either
[22:23] * dmick (~dmick@2607:f298:a:607:dd74:bc35:af3f:43c) has joined #ceph
[22:23] <dweazle> then again, if you don't use jbod osd's but use raid that might not actually be an issue
[22:24] <rweeks> then you have a whole other set of failure domains to consider, too
[22:24] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Ping timeout: 480 seconds)
[22:25] <lurbs> And if a big RAID array does die, then the time and effort taken to recover and re-distribute data is that much worse.
[22:25] <NaioN> dweazle: well I was thinking about that
[22:25] <NaioN> you can handle more disk failures
[22:25] <rweeks> but what kind of raid?
[22:25] <rweeks> and how many disks?
[22:25] <dweazle> lurbs: that's true, however that's much less likely to happen
[22:25] <NaioN> well say you have 4 disk nodes
[22:26] <NaioN> and rep level of 2
[22:26] <NaioN> and raid 5 on those nodes
[22:26] <NaioN> you need at least 4 disk failures worst case
[22:26] <lurbs> dweazle: True. I did have a big RAID 10 fail the other day, though. Two paired disks in quick succession.
[22:26] <lurbs> I might still be a bit sore about it.
[22:26] <rweeks> right, but raid 5 also has performance implications depending on your workload
[22:27] <sjust> ok, probability of data loss with n nodes and m replicas seems to be something like (1-((1 - p)^(n-m-1)))
[22:27] <rweeks> and then you have to deal with "does this RAID controller suck or not" which in most cases today seems to be "yes"
[22:27] <NaioN> rweeks: dmraid :)
[22:27] <sjust> where p is the probability of any single node failing within the time to recover from a single node failure
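sjust's back-of-the-envelope estimate above can be plugged in directly; a quick sketch (the values of p, n, and m here are made up purely for illustration):

```shell
# rough data-loss probability per sjust's estimate: 1 - (1 - p)^(n - m - 1)
# p = chance a node fails within the recovery window, n = nodes, m = replicas
awk -v p=0.01 -v n=10 -v m=2 'BEGIN { printf "%.4f\n", 1 - (1 - p)^(n - m - 1) }'
```

With p = 1%, 10 nodes, and 2 replicas this prints roughly 0.0679, i.e. about a 7% chance of losing data during the recovery window under this rough model.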
[22:27] <NaioN> just software raid
[22:27] <rweeks> again, performance issues, right?
[22:27] <rweeks> software raid isn't free
[22:28] <rweeks> (performance wise, I mean)
[22:28] <NaioN> no, a recent cpu with sse handles it easily
[22:28] <rweeks> it probably does
[22:28] <rweeks> but
[22:28] <rweeks> you take software raid
[22:28] <rweeks> layer on btrfs (for example)
[22:28] <rweeks> layer on ceph
[22:28] <dweazle> lurbs: i've had that happen to me once, but it was caused by a firmware bug (disk died, hot spare kicked in, caused a controller timeout and then it just completely lost it), took us 6 weeks to get the array back online, but lsi did fix it and fixed the firmware as well
[22:28] <rweeks> what does the CPU look like under load then?
[22:28] <NaioN> well i would take software raid + xfs
[22:28] * houkouonchi-work (~linux@ Quit (Remote host closed the connection)
[22:29] <rweeks> that's fine too
[22:29] <rweeks> I'm just saying, nothing is free from a performance standpoint
[22:29] <NaioN> that's true but cpu power is cheap
[22:29] <rweeks> (and yes, I know I segued into performance from a resiliency talk)
[22:29] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[22:30] <NaioN> so with 4 disks per node in raid 5 and a dual core cpu you have 1 core for the raid and 1 for the osd
[22:30] <NaioN> that should be enough
[22:30] * slang (~slang@ace.ops.newdream.net) Quit (Remote host closed the connection)
[22:30] <dweazle> i would go with raid-10
[22:30] <dweazle> iops are more valuable than disk space
[22:31] <lurbs> For a larger setup I'd consider putting the OSDs on underlying RAID, but for something smaller I don't really see the point.
[22:31] <NaioN> if you load the raid modules the kernel tests the speed of the cpu for the different raid levels
[22:31] * s_parlane (~scott@ has joined #ceph
[22:32] <dweazle> i also wonder, if you're using hardware raid controllers with a BBU, wouldn't a ceph journal device be overkill?
[22:32] * slang (~slang@ace.ops.newdream.net) has joined #ceph
[22:32] * houkouonchi-work (~linux@ has joined #ceph
[22:33] <lurbs> dweazle: Journal devices can be substantially larger than the cache size you're likely to get on a hardware RAID controller, can't they? Seems worth it to me.
[22:33] <lurbs> Why not have both?
[22:33] <sjust> dweazle: the journal is necessary for transactions, the underlying controller won't give us that
[22:33] <joshd> dweazle: for non-btrfs data might still be in the page cache rather than disk cache anyway, so the journal is still necessary
[22:36] <dweazle> i'm not saying _not_ to use a journal, but putting the journal on the same raid device as the osd. the controller does writeback so it should commit fairly quickly (even if the controller's memory is significantly smaller than an ssd) .. i'm looking at it not from a performance but from a cost perspective
[22:36] <dweazle> i would like to deploy ceph on my current hardware and unless absolutely required i don't like to buy like 100 ssd's :)
[22:37] <dweazle> with all the endurance issues that go with it
[22:38] <dweazle> so what's your view on that? i haven't done any benchmarking so i have no clue what such a design choice would do for performance
[22:43] <joshd> the journal is pure streaming writes, so making it share a disk with the osd data will still be slower
[22:43] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:43] <Tv> nhm: more on that intel ssd: http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed/2 and the next page
[22:43] <joshd> but if your workload isn't write-intensive, it probably doesn't matter
[22:43] <Tv> nhm: they stopped updating the translation layer on disk, just rely on being able to write it out fully at power loss
[22:44] <joshd> of course, the best way to find out is benchmarking on your hardware :)
[22:44] <dweazle> well, i'm considering using ceph rbd for vm's, so it will be a mostly write workload
[22:44] <dweazle> true..
[22:45] <joshd> you can use a tmpfs journal to pretend you have a really fast, unreliable ssd
[22:47] <dweazle> that's a good test i guess ;)
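joshd's tmpfs trick can be sketched as a ceph.conf fragment (the path and size here are illustrative; a tmpfs journal is lost on reboot, so this is for benchmarking only, never production):

```
[osd]
    ; journal on tmpfs: fast but volatile -- benchmarking only
    osd journal = /dev/shm/osd.$id.journal
    osd journal size = 1024    ; in MB
```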
[22:48] * noob2 (a5a00214@ircip4.mibbit.com) has joined #ceph
[22:48] <noob2> hi everyone. i'm having some trouble with mkcephfs using btrfs drives
[22:49] <noob2> first time i ran it i get: http://fpaste.org/ccAI/
[22:49] * sagelap (~sage@bzq-218-183-205.red.bezeqint.net) has joined #ceph
[22:49] <noob2> when i run it again it fails immediately because some of the osd's are created
[22:49] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[22:50] * gucki (~smuxi@84-72-8-40.dclient.hispeed.ch) has joined #ceph
[22:51] <gregaf> noob2: yep, mkcephfs doesn't handle much so you can't run it again in case of partial failure; you'll need to wipe and redo
[22:52] <stxShadow> good night all
[22:53] <noob2> ok
[22:53] <noob2> have you run into issues with btrfs not letting you delete a directory?
[22:53] <noob2> rm: cannot remove `snap_1': Directory not empty
[22:53] <gregaf> that'll happen on old versions if the directory is a...
[22:53] <gregaf> yep, snapshots
[22:53] <gregaf> you're running on kernel 2.6.32 or something?
[22:53] <noob2> no way newer than that
[22:53] <noob2> ubuntu 12.10
[22:54] <noob2> 3.5.0-17
[22:54] <gregaf> umm
[22:54] <gregaf> no idea then
[22:54] <gregaf> sjust?
[22:54] <gregaf> (I assume you looked and it is in fact empty)
[22:54] <sjust> you need to use a btrfs tool to remove the subvol
[22:54] <sjust> it's not a normal directory
[22:54] <noob2> correct haha
[22:54] * stxShadow (~jens@ip-178-203-169-190.unitymediagroup.de) Quit (Quit: bye bye !! )
[22:54] <gregaf> oh, normal rm doesn't do it?
[22:54] <sjust> no
[22:55] <sjust> or it didn't last I checked
[22:55] <noob2> how about unlink?
[22:55] <sjust> not sure, but probably -ENOTEMPTY
[22:55] <sjust> isn't it btrfsctl?
[22:56] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[22:56] <dmick> btrfs subvolume delete?
[22:56] <sjust> ah, yes
[22:56] <dmick> (once they had less-regular tools; now they have sentence-command versions)
[22:56] <sjust> yeah
[22:58] <noob2> ok i'll give that a shot
[22:58] <noob2> that works :)
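The cleanup dmick and sjust arrived at looks like this as a shell sketch (the OSD data path is hypothetical; requires root and btrfs-progs, and a btrfs mount to run against):

```shell
# snap_1 is a btrfs snapshot (a subvolume), so plain rm fails with
# "Directory not empty"; it has to be removed with the btrfs tool instead
btrfs subvolume list /var/lib/ceph/osd/ceph-0            # see which subvolumes exist
btrfs subvolume delete /var/lib/ceph/osd/ceph-0/snap_1   # remove the snapshot
```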
[22:59] * miroslav1 (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[22:59] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[23:00] <noob2> woo! ceph is powering up :D
[23:01] <noob2> thanks
[23:04] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[23:07] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[23:09] <jefferai> Oliver2: ping
[23:10] <jefferai> Oliver2: if you don't mind, please reply to the mailing list with the answer so that anyone else finding your issue doesn't just see it ending with "got it answered off-list"
[23:11] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[23:11] <noob2> so rados pools, are those just logical groupings?
[23:11] <NaioN> yeah
[23:12] <noob2> ok cool
[23:12] <NaioN> but you can make certain settings per pool
[23:12] <noob2> so they're just to help me keep track of what's where
[23:12] <NaioN> like the replication level
[23:12] <dmick> and pg count
[23:12] <noob2> awesome :)
[23:12] <dmick> and authentication
[23:12] <noob2> can you limit the size of a pool?
[23:12] <NaioN> nope
[23:12] <noob2> ok
[23:12] <noob2> that would be the individual block devices
[23:12] <noob2> i gotcha
[23:12] <NaioN> not as far as i know :)
[23:13] <NaioN> you mean rbd
[23:13] <NaioN> well you make a rbd with a certain size
[23:13] <NaioN> so yeah if you mean that
[23:13] <noob2> right
[23:13] <noob2> yeah that's what i meant sorry
[23:13] * miroslav1 (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[23:13] <NaioN> but rbd is on top of rados
[23:14] <NaioN> a rbd is just a collection of objects in a pool
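The per-pool settings NaioN and dmick mention, and sizing an rbd within a pool, can be sketched with the CLI (the pool and image names are made up, and the sizes are just examples):

```shell
ceph osd pool create mypool 128              # create a pool with 128 placement groups
ceph osd pool set mypool size 3              # per-pool replication level
rbd create --pool mypool --size 10240 vol1   # a 10 GB rbd image (size is in MB)
```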
[23:14] <Oliver2> jefferai: yeah, just try to handle all the slow requests while rebuilding…
[23:14] <noob2> i see
[23:14] <noob2> thanks
[23:15] <jefferai> Oliver2: eh?
[23:17] <Oliver2> jefferai: just rebuilding right now… having many slow requests to handle
[23:17] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[23:17] <jefferai> What does that have to do with responding to your email with the solution to your question?
[23:20] * noob2 (a5a00214@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[23:20] <Oliver2> What's the prob? I now have the right combination of kernel + distro + lib + ceph. Facing many, many slow requests… Now what?
[23:22] <jefferai> Oliver2: I have no idea what the problem is, but your problem that you posted on the mailing list was about syncfs
[23:22] <jefferai> and you got an answer in here
[23:22] <jefferai> and left the thread hanging
[23:22] <jefferai> it's nice to respond and let people know what the solution was, not just that there was one
[23:22] <jefferai> for future people Googling around looking for an answer to the same question
[23:25] <Oliver2> I was thankful for answers. But still have problems ongoing. Sorry for having a 14h day.
[23:26] <Oliver2> 2012-11-06 23:26:24.223953 osd.33 [WRN] 108 slow requests, 2 included below; oldest blocked for > 1753.163022 secs
[23:27] <Oliver2> Ask the user, if he/she's feeling cool
[23:28] <jefferai> Oliver2: I really don't see how one thing and the other are correlated. They're different problems. Why does having a different problem mean you can't send a quick reply with the answer to your previous one?
[23:28] <jefferai> and what does that last statement even mean? :-)
[23:28] * tryggvil (~tryggvil@16-80-126-149.ftth.simafelagid.is) has joined #ceph
[23:28] * s_parlane (~scott@ Quit (Ping timeout: 480 seconds)
[23:30] * Cube1 (~Cube@ has joined #ceph
[23:31] <Oliver2> Story in short: re-install 2 nodes, having a couple of problems, not having syncfs working in first step. Dunno what's going on. Got syncfs working, including first of two new nodes. Now seeing trouble. "slow requests" massively.
[23:33] <jefferai> Yes. Exactly. "Got syncfs working" <--- that is what I'd like you to reply to your own post about and explain *how* you got it working
[23:33] <jefferai> So that other people with the same problem know how to fix it
[23:34] <Oliver2> Got your point… having trouble right now. Sorry.
[23:36] <Oliver2> Running a rather small cluster with 4 nodes and about 500 VMs with qemu. That is <=500 customers. *sigh*
[23:39] * Tamil (~Adium@ Quit (Quit: Leaving.)
[23:47] * houkouonchi-work (~linux@ Quit (Remote host closed the connection)
[23:49] * Tamil (~Adium@ has joined #ceph
[23:54] * Tamil (~Adium@ has left #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.