#ceph IRC Log


IRC Log for 2013-01-08

Timestamps are in GMT/BST.

[0:07] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[0:08] * allsystemsarego (~allsystem@5-12-241-245.residential.rdsnet.ro) Quit (Quit: Leaving)
[0:08] <sjust> paravoid: can you file a bug with a description of how you added the osds, what you put on cephdrop@ceph.com, and the old crush rule?
[0:09] <paravoid> I can
[0:09] * korgon (~Peto@isp-korex- has joined #ceph
[0:09] <paravoid> will you need anything more?
[0:09] <paravoid> or can I go ahead and add more OSDs?
[0:10] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[0:11] <tziOm> Why doesn't ceph output valid json when I run: ceph osd dump --format=json
[0:11] <dec> is there a description somewhere of master vs next vs stable vs testing ?
[0:11] <dec> in ceph terms
[0:11] <tziOm> it adds a line (in STDOUT) that breaks json, first line: dumped osdmap epoch 40
[0:11] <sjust> I don't think there's anything more I can get from your cluster
[0:12] <sjust> paravoid: to clear the error, you can do ceph osd out <osdid>; ceph osd in <osdid>
[0:12] <sjust> for involved osds until the remapped pgs go away
[0:12] <sjust> that won't actually restart the osds, so it should be a bit easier on the cluster
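(Editor's aside: sjust's out/in trick above is easy to script across several OSDs. A minimal sketch, assuming a working `ceph` CLI on the admin host; the OSD ids are hypothetical, and the default `dry_run=True` only prints the commands so nothing touches a real cluster by accident.)

```python
import subprocess

def out_in_commands(osd_ids):
    """Build the `ceph osd out` / `ceph osd in` pairs sjust describes."""
    cmds = []
    for osd in osd_ids:
        cmds.append(["ceph", "osd", "out", str(osd)])
        cmds.append(["ceph", "osd", "in", str(osd)])
    return cmds

def run_cmds(cmds, dry_run=True):
    # dry_run only prints the commands; pass dry_run=False to execute them
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)

run_cmds(out_in_commands([12, 13]))  # OSD ids here are made up
```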
[0:12] <dmick> tziOm: I just saw a fix from Josh for this I think
[0:12] <paravoid> sjust: oh, cool, thanks.
[0:13] <paravoid> sjust: someone suggested that I should alter the tunables to what the manual suggests
[0:13] <paravoid> sjust: but it has a big fat scary warning, so I chickened out
[0:13] <sjust> paravoid: that might be related, but if doing out/in fixes the problem, then the tunables are not related
[0:13] <paravoid> okay
[0:13] <paravoid> thanks a lot
[0:14] <tziOm> dmick, ok, so in git?
[0:14] <sjust> paravoid: no problem, a description of how the osd addition process happened should help explain how it happened
[0:14] <tziOm> dmick, I mean.. what do ppl think when they code like that?! should be tracked down and shot
[0:15] <paravoid> what do you mean?
[0:15] <paravoid> I basically did ceph-disk-prepare /dev/sda
[0:15] <paravoid> etc.
[0:15] <sjust> sorry, I guess I meant the crush rule change
[0:16] <dmick> tziOm: it's complicated by the fact that the output is put together in the OSD, and then piped back through to the invoking ceph command, so
[0:16] <dmick> it's not immediately obvious the stderr-vs-stdout distinction
[0:16] <dmick> but it's just a bug. Obviously you can easily work around it until it's really fixed
[0:16] <tziOm> dmick, but output=json should be expected to be json
[0:16] <tziOm> not "crap"
[0:17] <dmick> and most of it is. this is just a message.
[0:17] <dmick> that goes to the wrong pipe.
[0:17] <dmick> I don't think one misplaced message turns the whole output to "crap"
[0:17] <tziOm> yeah I know, but then it ends up being "crap" to a parser
[0:17] <tziOm> because its not valid json anymore
[0:17] <dmick> sigh. yes, I understand.
[0:17] <tziOm> unless | tail -n +2
[0:18] <tziOm> ..but that is scary when this command suddenly works.
[0:18] <dmick> grep -v 'dumped osdmap'
[0:19] <tziOm> that's my osd's name ;)
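(Editor's aside: until the bug dmick filed is fixed, a parser can simply drop any leading non-JSON status line before decoding, which is more robust than `tail -n +2` or `grep -v`. A sketch; the sample string mimics the polluted output described above and is not captured from a real cluster.)

```python
import json

def parse_osd_dump(raw):
    """Strip leading non-JSON status lines (e.g. 'dumped osdmap epoch 40')
    that the CLI prints on the wrong stream, then decode the remainder."""
    lines = raw.splitlines()
    while lines and not lines[0].lstrip().startswith(("{", "[")):
        lines.pop(0)
    return json.loads("\n".join(lines))

# Hypothetical sample of what `ceph osd dump --format=json` emits today:
sample = 'dumped osdmap epoch 40\n{"epoch": 40, "max_osd": 3}'
print(parse_osd_dump(sample)["epoch"])  # prints 40
```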
[0:27] * amichel (~amichel@salty.uits.arizona.edu) has joined #ceph
[0:29] <amichel> I saw some traffic about deploying Ceph osds on zfs mounts, but I'm getting an error 22 during mkcephfs
[0:30] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:30] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:30] <amichel> Is that a decipherable error code? I'm not really sure where to look or how to enable debugging on the mkcephfs command
[0:32] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:33] <iggy> amichel: don't think it'll work because zfs doesn't support odirect
[0:33] <iggy> or something similar
[0:33] <amichel> I just threw journal dio = false into the config file and it completed
[0:33] <amichel> but my expectation is that it will fail when I try to do actual writing to the disk if you're on to something there
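(Editor's aside: amichel's workaround as a ceph.conf fragment. This is a sketch of where the setting would live, not a recommendation: direct journal I/O is disabled only because ZFS at the time lacked O_DIRECT support, and iggy's caveat below still applies.)

```ini
; Hypothetical ceph.conf fragment for the ZFS workaround discussed above
[osd]
    journal dio = false   ; ZFS has no O_DIRECT, so direct journal I/O fails
```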
[0:34] * The_Bishop (~bishop@2001:470:50b6:0:31a9:74eb:eb95:cf60) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[0:34] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[0:34] <amichel> I'm gonna quick provision a client and see what happens :D
[0:35] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[0:35] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[0:35] <iggy> but then the question kind of becomes... why?
[0:35] <amichel> Well
[0:36] <amichel> I've got a storage server with 45 sata drives in it
[0:36] <amichel> I'd like them to be in something like a RAID6 for safety
[0:36] <amichel> I find md to be a bear to work with, especially with a large disk count
[0:36] <iggy> so you're going to have ceph doing raid10 on top of your raid6?
[0:37] <amichel> I'm gonna have multiple storage servers when I'm done, but yes. Each of these boxes will have multiple RAID6 sets as osds.
[0:37] <iggy> and I fully expect ceph to fail trying to rebuild that big of an osd
[0:38] <amichel> Well, it's broken up
[0:38] <amichel> I didn't realize there was a best practice to osd sizing
[0:38] <amichel> What's the right size?
[0:38] <gregaf1> yeah, I believe ZFS needs some (fairly trivial) dev work, but we don't have the bandwidth for it right now
[0:38] <iggy> right size isn't quite the right phrase... the most tested configuration is generally to have each disk be its own OSD
[0:39] <amichel> Each disk?
[0:39] <amichel> Holy smokes
[0:39] <dec> anyone using a fusionIO nand card as an OSD journal device?
[0:40] <dec> iggy: are you aware of a practical limit on how many OSDs a large cluster can support?
[0:40] <iggy> dec: I've only heard of people using SSDs, but some of the ceph guys may have more info than I know
[0:40] <iggy> dec: millions?
[0:40] <amichel> Is it not the majority opinion that managing 45 osds per node would be kind of over the line?
[0:41] <paravoid> why would it be?
[0:41] <iggy> amichel: it's all fairly trivial to automate, so it's not that bad (once you get the automation down that is)
[0:41] <benpol> amichel: I think most people are deploying ceph with fewer OSDs per node
[0:42] <amichel> Well, isn't an OSD supposed to be considered a unique and uniquely reliable storage unit?
[0:42] <iggy> uniquely reliable?
[0:42] <paravoid> sjust, wubo: filed the bug we were talking about before, http://tracker.newdream.net/issues/3747
[0:42] <amichel> What I mean to say is if I tell ceph I want three copies of an object, isn't that equivalent to telling it to store it on three unique osds?
[0:42] <dec> amichel: I manage ours with puppet; 1 OSD is not much different than 50...
[0:43] <amichel> Does it have any concept of the fact that these would all be in the same physical box?
[0:43] <dec> amichel: yes, the CRUSH map manages that
[0:43] <paravoid> dec: do you have proper puppet recipes for osds? :)
[0:43] <dec> paravoid: only really simple ones
[0:43] <dec> paravoid: they don't actually add the OSD into the cluster
[0:43] <amichel> I'm not concerned about the config files, that's certainly trivial
[0:43] <paravoid> do partition/format it though?
[0:43] <iggy> amichel: what dec said... your crush map takes care of making sure the data is in different failure domains
[0:44] <amichel> I just thought I was opening myself up to some serious data loss
[0:44] <paravoid> amichel: you can adjust crush to say that you want 3 copies in three different boxes, or three different racks etc.
[0:44] <amichel> oh, I see
[0:44] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[0:44] <paravoid> or arbitrary hierarchies
[0:44] <paravoid> with 45 disks per box you'll probably care about raid controllers too :)
[0:44] <paravoid> or shelves/sas expanders?
[0:44] <amichel> controllers :D
[0:45] <amichel> It's a backblaze pod
[0:45] <amichel> Well, it's essentially a backblaze pod
[0:47] <amichel> There are many layers, unfortunately.
[0:47] <dec> paravoid: yes, our puppet conf just manages the directories, filesystems and mountpoints
[0:48] <amichel> So can the CRUSH mappings be nested? I.e., this config has SATA port multipliers for five drives to a single controller port, then three controller ports in use per controller across three controllers on a single motherboard.
[0:48] * buck (~buck@bender.soe.ucsc.edu) Quit (Quit: Leaving.)
[0:48] <paravoid> dec: is this public?
[0:48] <gregaf1> 45 OSDs in a single box is going to be too many though; they do take some CPU and a bit of memory
[0:48] <amichel> Can I represent that intelligibly?
[0:48] <paravoid> amichel: yes
[0:48] <gregaf1> and backblaze pods really only work for Backblaze...
[0:48] <paravoid> have a look at the manual
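(Editor's aside: a minimal sketch of the nested hierarchy paravoid and amichel are discussing, in the decompiled crushtool format of this era. All names, ids and weights are made up; the point is that buckets nest arbitrarily and `chooseleaf ... type host` forces replicas onto distinct hosts.)

```
# Hypothetical decompiled CRUSH map fragment: two hosts under one rack
host pod1 {
        id -2
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
}
host pod2 {
        id -3
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
        item osd.3 weight 1.000
}
rack rack1 {
        id -4
        alg straw
        hash 0  # rjenkins1
        item pod1 weight 2.000
        item pod2 weight 2.000
}
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take rack1
        step chooseleaf firstn 0 type host  # each replica on a distinct host
        step emit
}
```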
[0:48] <amichel> Ok
[0:49] <amichel> I'm sorry, I didn't mean to be that guy
[0:49] <amichel> I thought I understood how this worked :D
[0:49] <paravoid> no, I'm not saying it because of that
[0:49] <paravoid> I'm saying it because I'm not feeling capable enough to explain crush in its entirety
[0:49] <dec> paravoid: no, not at the moment - I can probably put it up somewhere
[0:49] <iggy> very few people feel that confident with crush I'd guess
[0:50] <paravoid> dec: I'd love that
[0:50] <paravoid> there is a public puppet module but doesn't do OSDs at all
[0:50] <iggy> like say sage and the janitor at his school that fills in math problems when nobody is looking
[0:50] <amichel> @gregaf1 This pod has actually done well at a few various tasks, but I'd like to use it as the start of a scalable storage cluster and basically Ceph seems like the only good game in town for that
[0:50] <cephalobot`> amichel: Error: "gregaf1" is not a valid command.
[0:50] <paravoid> I was thinking of hooking ceph-disk-{prepare,activate} into it, but I'm not terribly excited with it
[0:51] <amichel> gregaf1: This thing has actually done well at a few various tasks, but I'd like to use it as the start of a scalable storage cluster and basically Ceph seems like the best way to take that on
[0:51] <amichel> Sorry if that sent twice
[0:52] <gregaf1> amichel: it's a question of ratios; backblaze pods have a lot of disk compared to how much network bandwidth and CPU power they have to control it
[0:52] <amichel> Oh yeah, I agree with that. I'm actually looking into the SuperMicro 36-bay units as an alternative
[0:52] <gregaf1> but yes, you want to look into CRUSH map design; there might even be enough in the manual for you to work it out on your own
[0:53] <amichel> dual-socket, integrated SAS, hotswap
[0:53] <paravoid> I have 24 osds per box and it's using about 25G of RAM on average
[0:53] <paravoid> that's pure process memory, not page cache
[0:53] <amichel> how's the CPU utilization?
[0:53] <iggy> 1G per OSD seems about the norm
[0:54] <amichel> That's good to know
[0:54] <amichel> Is that relative to osd size or bandwidth or anything like that?
[0:54] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[0:55] <iggy> I would think it would be more related to size, but most people just say they've got X # of OSDs, not what size they are
[0:56] <paravoid> my osds are one per 2TB disk fwiw
[1:00] <gregaf1> memory use is proportional to number of PGs the OSD hosts
[1:00] <gregaf1> CPU is going to be based on that and on how many requests it's handling
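(Editor's aside: the sizing numbers in this exchange reduce to simple arithmetic. A back-of-envelope sketch using iggy's ~1 GB/OSD rule of thumb; as gregaf1 notes, real usage scales with the PG count per OSD, so treat the result as a budget, not a measurement.)

```python
def osd_ram_estimate(num_osds, gb_per_osd=1.0):
    """RAM budget from the ~1 GB/OSD rule of thumb quoted above;
    actual usage is proportional to the number of PGs each OSD hosts."""
    return num_osds * gb_per_osd

# paravoid's datapoint: 24 OSDs using ~25 GB is ~1.04 GB per OSD, so a
# hypothetical 45-disk pod with one OSD per disk would budget roughly:
print(osd_ram_estimate(45))  # prints 45.0
```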
[1:01] <dmick> tziOm left, but for anyone else interested in command output pollution, I filed http://tracker.newdream.net/issues/3748
[1:01] <amichel> PGs? This is CRUSH map related?
[1:02] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:03] <sleinen1> Hi, I have a Ceph cluster with 55 OSDs across ten servers that I somehow managed to hose. In particular, one of the ceph-mons is running at 260% CPU.
[1:03] <sleinen1> Can anyone help?
[1:04] <sleinen1> I'm running 0.56-1 (Precise packages)
[1:04] <gregaf1> amichel: sort of; each pool is sharded into "Placement Groups" which are distributed over the OSDs and are used to ease history tracking (over placing by object) when doing recovery and things
[1:05] <gregaf1> sleinen1: sure
[1:05] <gregaf1> what's "ceph -s" output?
[1:05] <gregaf1> and what are your other symptoms? ("hosed")
[1:06] <sleinen1> Unfortunately ceph -s hangs.
[1:06] <sleinen1> I guess it's because the ceph-mon is unresponsive?
[1:06] <gregaf1> probably
[1:07] <sleinen1> A few log messages from this ceph-mon are here: http://pastebin.com/AjrGiB0Y
[1:07] <gregaf1> woah, that's a lot of monitors
[1:08] <sleinen1> Yes, we have one on each server - probably not such a good decision, but we wanted to have as homogeneous as possible a configuration.
[1:08] <gregaf1> that *should* work right now (we don't test with that many, though!) but is going to cause problems at some point
[1:09] <gregaf1> how did your cluster get into this situation?
[1:09] <gregaf1> in particular I notice that pretty much all of the PGs are remapped
[1:10] <sleinen1> I did a few modifications to the network attachment on some of the servers, then restarted these servers and eventually the whole cluster a few times.
[1:11] <sleinen1> In particular, I started to move the servers from a configuration with two separate IP addresses to one with a "bonded" interface.
[1:11] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[1:11] <sleinen1> It's possible that I did something wrong in the process, but all my network tests looked great.
[1:13] <sleinen1> Oh, the ceph-mon process also looks a little big
[1:13] <sleinen1> root 23511 130 6.9 9395968 9224664 ? Ssl 00:58 19:05 /usr/bin/ceph-mon -i s0 --pid-file /var/run/ceph/mon.s0.pid
[1:13] <sleinen1> (that's 9GB)
[1:13] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) Quit (Quit: Leaving.)
[1:15] <gregaf1> yeah, that's large but as long as it's still spinning there's a lot of innocuous things it could be
[1:15] <gregaf1> are you getting new log messages from the monitor?
[1:16] <sleinen1> Always the same sort as I posted on pastebin, as far as I see.
[1:16] <sleinen1> I killed the runaway ceph-mon (mon.0) now
[1:16] <gregaf1> can you get a ceph -s going at this point?
[1:16] <sleinen1> No - apparently the monitor system doesn't heal.
[1:17] <sleinen1> If I check the logs of another (the "next") ceph-mon, I see
[1:17] <sleinen1> 2013-01-08 01:15:11.667093 7f95ca10d700 0 -- >> pipe(0x2fdafc0 sd=31 :6789 pgs=1590 cs=1 l=0).fault with nothing to send, going to standby
[1:17] <sleinen1> 2013-01-08 01:15:16.837904 7f95c8af7700 0 -- >> pipe(0x2fdafc0 sd=18 :6789 pgs=1590 cs=2 l=0).fault
[1:17] <sleinen1> 2013-01-08 01:15:16.838617 7f95cc422700 0 log [INF] : mon.s1 calling new monitor election
[1:17] <sleinen1> 2013-01-08 01:15:16.838731 7f95cc422700 1 mon.s1@2(electing).elector(948) init, last seen epoch 948
[1:17] <sleinen1> 2013-01-08 01:15:22.708775 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 26 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:22.708823 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:22.708840 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:22.708855 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:22.708869 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:22.708890 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 26 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:22.708905 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:35.856885 7f95cc422700 0 log [INF] : mon.s1 calling new monitor election
[1:17] <sleinen1> 2013-01-08 01:15:35.857038 7f95cc422700 1 mon.s1@2(electing).elector(950) init, last seen epoch 950
[1:17] <sleinen1> 2013-01-08 01:15:42.709830 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 30 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:42.709885 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <dec> erk
[1:17] <sleinen1> 2013-01-08 01:15:42.709901 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:42.709916 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:42.709931 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:42.709946 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 27 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <sleinen1> 2013-01-08 01:15:55.020801 7f95cc422700 0 log [INF] : mon.s1 calling new monitor election
[1:17] <sleinen1> 2013-01-08 01:15:55.020940 7f95cc422700 1 mon.s1@2(electing).elector(952) init, last seen epoch 952
[1:17] <sleinen1> 2013-01-08 01:16:14.137058 7f95cc422700 0 log [INF] : mon.s1 calling new monitor election
[1:17] <sleinen1> 2013-01-08 01:16:14.137225 7f95cc422700 1 mon.s1@2(electing).elector(954) init, last seen epoch 954
[1:17] <sleinen1> 2013-01-08 01:16:17.719344 7f95ccc23700 1 mon.s1@2(electing) e1 discarding message auth(proto 0 26 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[1:17] <dmick> AAAAAAAAAA
[1:17] <sleinen1> Sorry for the last paste. The first two lines are OK - that's the ceph-mon that I killed.
[1:17] <dmick> sleinen1: pastebin large output
[1:18] * jlogan1 (~Thunderbi@2600:c00:3010:1:b121:611b:9c01:6f68) Quit (Ping timeout: 480 seconds)
[1:18] <sleinen1> Sorry - http://pastebin.com/YQ1dMRxm
[1:18] <gregaf1> sleinen1: okay, without being hands-on I'd recommend turning off everything else and letting the monitors run and seeing if they settle down
[1:19] <sleinen1> OK. Everything else would be the ceph-osd processes - ceph-msd are running on two separate servers.
[1:19] <gregaf1> no, I mean you want to get rid of the things talking to the monitors
[1:19] <sleinen1> ^msd^mds
[1:20] <gregaf1> oh, are the OSDs also on the monitor servers? do they share any disks at all?
[1:20] <sleinen1> Yes, they do.
[1:20] <sleinen1> But the OSD data disks are separate.
[1:20] * jlogan (~Thunderbi@ has joined #ceph
[1:20] * dxd828 (~dxd828@host-78-151-106-120.as13285.net) has joined #ceph
[1:20] * dxd828 (~dxd828@host-78-151-106-120.as13285.net) Quit (Remote host closed the connection)
[1:20] <sleinen1> (e.g. /var/lib/ceph/osd/* are on separate disks)
[1:20] <dec> How important are the disks for the Mons?
[1:21] <gregaf1> okay
[1:21] <dec> (i.e. how performant do the mon disks need to be)
[1:21] <gregaf1> usually not very, I'm just wondering if the OSDs are maybe spamming log output to the root disk and the monitor's on root as well sleinen1?
[1:22] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:24] <sleinen1> No, at least not now - all disks are very quiet.
[1:24] <sleinen1> - on the server that had the runaway ceph-mon (mon.0). It also has 4 ceph-osds.
[1:25] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[1:25] <gregaf1> okay, yeah, turn stuff off and see if the monitors settle down
[1:25] <gregaf1> if they don't, turn on the debug output and generate some logs
[1:26] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[1:26] <sleinen1> OK, thanks, I'll do that.
[1:27] * tnt (~tnt@112.169-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[1:27] <dec> gregaf1: are you able to explain the difference between 'next' and 'testing' in git?
[1:27] <dec> or is this doc'd somewhere?
[1:28] <gregaf1> oh, probably not
[1:28] <gregaf1> "next" is going to become the next release
[1:28] <gregaf1> "testing" is going to become the next stable point release
[1:28] <dec> ah
[1:29] <dec> and 'stable' ? :P
[1:29] <gregaf1> I don't think testing and stable both still exist, do they?
[1:29] <gregaf1> if they do sagewk will have to do the explaining
[1:30] <sagewk> yeah, i think we'll rename the branches though..
[1:30] <sagewk> master -- active dev
[1:30] <sagewk> next -- next dev release
[1:30] <sagewk> last -- last dev release
[1:30] <sagewk> bobtail -- bobtail stable series
[1:30] <sagewk> argonaut -- older stable series
[1:30] <sagewk> right now testing == bobtail == last
[1:31] <dec> ok, right;
[1:31] <dec> I built and upgraded to testing last night to fix some issues... was wondering whether I should've used 'next' instead
[1:32] <sagewk> testing was probably the best choice
[1:32] <dec> cool
[1:33] <dec> FWIW as a data point, testing @ a10950f working well in production for ~12 hours now
[1:33] <paravoid> is bobtail formally announced yet?
[1:33] <dmick> paravoid: not yet
[1:33] <sagewk> close!
[1:34] <paravoid> close as in waiting for 0.56.1 or is bobtail going to be 0.56? :-)
[1:34] <dec> with the OSD backwards-compat issue, I'd hope 0.56.1...
[1:35] * sagelap (~sage@2607:f298:a:607:c554:b663:176f:da4d) Quit (Read error: Operation timed out)
[1:35] <dmick> it's browning around the edges, but the middle is still a little soft. a few more minutes. Go play with your brother for a little while. :)
[1:40] <dec> are there any requirements to use inktank's support services?
[1:40] <dec> i.e. do you only support a certain platform, OS, etc?
[1:41] <gregaf1> nope, not right now
[1:41] * sagelap (~sage@36.sub-70-197-128.myvzw.com) has joined #ceph
[1:42] <ircolle> You can enter your system specifics here: http://www.inktank.com/pps-form/ and we would be happy to get in touch with you
[1:42] * markl (~mark@tpsit.com) Quit (Ping timeout: 480 seconds)
[1:42] <mikedawson> Does anyone know what type of IOPS performance boost RBD writeback caching is capable of giving?
[1:42] <iggy> mikedawson: I'm betting the answer is "it depends"
[1:43] <mikedawson> mikedawson: something like bcache or flashcache could give a few order of magnitude performance boost perhaps. Wondering if RBD writeback is capable of anything close
[1:44] <sleinen1> gregaf1 - I commented out five of the ten mons - including the original runaway one (mon.s0) - in my /etc/ceph/ceph.conf and restarted everything. The problem simply moved to the newly elected leader (mon.h0): it consumes >100% CPU and has grown to 30GB over a few minutes.
[1:44] <gregaf1> it's a writeback cache; it's not a random-to-sequential log-structured filesystem implementation ;)
[1:45] <dmick> it does pretty well on "write one block, read the same block 1000x" :)
[1:45] <gregaf1> sleinen1: commenting out the monitors isn't going to do anything except (depending on order) mean that you've got five down monitors
[1:45] <gregaf1> you need to tell the cluster to remove them
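(Editor's aside: gregaf1's point is that ceph.conf edits don't change the cluster's monmap; each extra monitor must be removed explicitly with `ceph mon remove`. A sketch; the monitor names are hypothetical and the default `dry_run=True` only prints the commands.)

```python
import subprocess

def remove_mons(mon_names, dry_run=True):
    """Tell the cluster to drop monitors from the monmap; commenting them
    out of ceph.conf alone just leaves them counted as down."""
    cmds = [["ceph", "mon", "remove", name] for name in mon_names]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)
    return cmds

remove_mons(["s0", "s5"])  # monitor names here are made up
```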
[1:45] <dmick> (I know, I'm not being very helpful mikedawson)
[1:46] <gregaf1> did you turn off all the daemons and clients trying to talk to it? or generate any more debugging?
[1:46] <dec> ircolle: if we're using ceph in production, does that void the pre-production support? :P
[1:46] <sleinen1> gregaf1, no, the ceph-osds have been restarted as well. For debugging, you're talking about something more than "-d", right?
[1:46] <mikedawson> gregaf1: dmick: right now I'm getting ~40IOPS per 7200 rpm SATA backed OSD with 3x replication when doing 100% random 16K writes, so I *think* RBD writeback is giving me some performance benefit
[1:47] <gregaf1> that's in the neighborhood of what you get out of a drive, but it might be helping a bit
[1:47] <dmick> mikedawson: you'd really like to see those stats
[1:47] <gregaf1> depends on your sizes though
[1:48] <mikedawson> gregaf1: dmick If I estimate those drives should max at ~75 iops, then 3x replication should knock it down to ~25 iops, right?
[1:49] <iggy> if you only have 3 OSDs
[1:49] <gregaf1> yeah, but they are often good for a bit more than that (~125)
[1:49] <gregaf1> and the effects of cache depend a lot on the specific test; it could work either way
[1:49] <dmick> well...but the two replicas are written simultaneously, right? but there may be bus contention
[1:49] <dmick> so it might not be a strict /3
[1:49] <gregaf1> dmick: doesn't matter; it's streaming random IO...
[1:51] <mikedawson> gregaf1: if the drives can do ~125, then rbd writeback may not be doing anything useful for my usecase testing
[1:51] <gregaf1> you'd have to test :)
[1:51] <dmick> gregaf1: yes, but I'm just saying two replicas is not automatically twice the time of one
[1:51] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:52] <mikedawson> I get similar results when benchmarking volumes I think have writeback on versus volumes I think have writeback disabled. And similar results using rados bench
[1:52] <gregaf1> sounds like maybe you're not getting it turned on then, do you remember the rules dmick?
[1:53] <dmick> AFAIK it's just config settings; don't remember any gating conditions
[1:53] <gregaf1> I thought you needed a couple switches on QEMU or something
[1:54] <dmick> mikedawson: if you can't examine caching stats you should be able to find evidence in the client logs with sufficient logging level. Where did you end up with on asok?
[1:54] <mikedawson> The holy grail for me is ingesting lots of 16K random writes then reordering with bcache or similar into 4MB sequential writes
[1:54] <sleinen1> (I'm going to bed now - guess it's time to rebuild the cluster. Thanks for your assistance anyway.)
[1:54] <dmick> gregaf1: yeah, you put the same params in the rbd config line as you do in the file AFAIK
[1:54] <mikedawson> dmick: asok still creating and deleting right away
[1:55] <dmick> did you determine which process was doing that?
[1:56] <mikedawson> gregaf1: in my setup (kvm, libvirt and openstack cinder), you do need to send some args to libvirt
[1:56] <dmick> mikedawson: gregaf1: yes, in the rbd device spec
[1:56] <mikedawson> dmick: it happens when I restart cinder-volume
[1:57] <dmick> mikedawson: I know, but I was asking before about which actual process it is that's opening and closing the socket
[1:57] <dmick> [23:55] <dmick> right, I'm just wondering which proc it is
[1:57] <dmick> [23:56] <mikedawson> 18227 last time, but now process #18227 stays around
[1:57] <dmick> [23:57] <dmick> and is that a kvm process?
[1:57] <dmick> [23:57] <mikedawson> dmick: cinder-volume is 18222 or 18215
[1:57] <dmick> [23:58] <dmick> but what is 18227?
[1:57] <mikedawson> gregaf1: for my environment, joshd had me do something like sed -i 's/conf.driver_cache = "none"/conf.driver_cache = "writeback"/g' /usr/share/pyshared/nova/virt/libvirt/volume.py
[1:58] <mikedawson> dmick: never figured that out
[1:58] <dmick> ok
[1:58] <dmick> at that point I was just wondering what, say, ps -fp 18227 said
[1:59] <mikedawson> root@node1:/var/log/ceph# service cinder-volume restart
[1:59] <mikedawson> cinder-volume stop/waiting
[1:59] <mikedawson> cinder-volume start/running, process 24367
[1:59] <mikedawson> [ 7/Jan/2013 19:59:15] IN_CREATE mkd/client.volumes.24379.asok
[1:59] <mikedawson> [ 7/Jan/2013 19:59:15] IN_DELETE mkd/client.volumes.24379.asok
[1:59] <mikedawson> [ 7/Jan/2013 19:59:15] * mkd/client.volumes.24379.asok is deleted
[2:00] <dmick> and is 24379 still running? and if so, what is it?
[2:00] <mikedawson> root@node1:/var/log/ceph# ps -fp 24379
[2:00] <mikedawson> UID PID PPID C STIME TTY TIME CMD
[2:00] <mikedawson> nothing
[2:01] <dmick> ok. that stands to reason since it got opened and removed. When you said "18288 stays around", I thought we were looking at something different
[2:01] <dmick> *18227, sorry
[2:01] <dmick> anyway
[2:01] <dmick> are there kvm processes running?
[2:03] <mikedawson> kvm processes are 26053 and 32039
[2:05] <dmick> is it possible to stop/start them, or start a new one, while watching /var/run/ceph?
[2:05] <dmick> s/them/one of them/
[2:10] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[2:18] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[2:20] * sagelap (~sage@36.sub-70-197-128.myvzw.com) Quit (Ping timeout: 480 seconds)
[2:23] * gregaf1 (~Adium@cpe-76-174-249-52.socal.res.rr.com) Quit (Quit: Leaving.)
[2:24] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[2:26] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[2:28] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Remote host closed the connection)
[2:29] * rturk (~rturk@ds2390.dreamservers.com) Quit (Quit: Coyote finally caught me)
[2:29] * rturk-away (~rturk@ds2390.dreamservers.com) has joined #ceph
[2:29] * rturk-away is now known as rturk
[2:30] * rdt (~rdt@ has joined #ceph
[2:30] * rturk is now known as rturk-away
[2:30] * rturk-away is now known as rturk
[2:30] * rdt (~rdt@ Quit ()
[2:31] * rturk is now known as rturk-away
[2:32] * jeffrey4l (~jeffrey@ has joined #ceph
[2:33] <dec> compiling ceph with tcmalloc has greatly reduced the OSD's memory usage
[2:33] <phantomcircuit> how does the placement algorithm deal with a full osd?
[2:35] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[2:37] <jeffrey4l> Has anybody seen the error "failed to chmod /mnt/osd0/current to 0755: (30) Read-only file system ERROR: error creating object store in /mnt/osd0: (30) Read-only file system"? I could not find any reason for it
[2:38] * jeffrey4l (~jeffrey@ Quit (Quit: Leaving)
[2:40] * jeffrey4l_ (~jeffrey@ has joined #ceph
[2:42] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[2:45] <mikedawson> dmick: when I restart an instance, nothing happens in /var/run/ceph or /var/log/ceph/mkd (where I am configured for the admin sockets right now)
[2:47] * LeaChim (~LeaChim@b0faeeb0.bb.sky.com) Quit (Ping timeout: 480 seconds)
[2:49] <dmick> mikedawson: something's wrong with my local libvirtd or I'd play a bit
[2:49] <dmick> sigh
[2:49] <dmick> do you have your kvm's configured to use a ceph.conf that specifies the log positions?
[2:50] <mikedawson> dmick: Not sure
[2:50] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[2:50] <mikedawson> dmick: /usr/bin/kvm -name instance-00000007 -S -M pc-1.2 -cpu Nehalem,+rdtscp,+dca,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 4096 -smp 2,sockets=2,cores=1,threads=1 -uuid caf81274-7fcf-4b16-a764-d21076beaf53 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000007.monitor,server,nowait...
[2:50] <mikedawson> ...-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:volumes/volume-1fefc724-418f-4bb6-a92b-f7a9f2fad9dc:id=volumes:key=AQDvZ+RQ6NENBhAAjokqIsN5jEr1TVaPCJL1FA==:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,cache=writeback -device...
[2:51] * agh (~agh@www.nowhere-else.org) has joined #ceph
[2:51] <mikedawson> ...virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=rbd:volumes/volume-de90e602-3f43-427f-b311-319c09a8b6e9:id=volumes:key=AQDvZ+RQ6NENBhAAjokqIsN5jEr1TVaPCJL1FA==:auth_supported=cephx\;none,if=none,id=drive-virtio-disk1,format=raw,serial=de90e602-3f43-427f-b311-319c09a8b6e9,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x
[2:51] <mikedawson> 6,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=21,id=hostnet0,vhost=on,vhostfd=23 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:b5:3c:5f,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-00000007/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -de
[2:51] <mikedawson> vice usb-tablet,id=input0 -vnc -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
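The kvm command above already embeds the rbd options inline in the drive string. qemu's rbd driver also accepts an explicit `conf=` option, which determines where log and admin-socket paths come from; a minimal sketch of one drive argument, with the pool and image names hypothetical:

```
-drive file=rbd:volumes/myimage:id=volumes:conf=/etc/ceph/ceph.conf,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
```

Absent `conf=`, the driver falls back to reading /etc/ceph/ceph.conf by default, so admin sockets only appear where that file (as seen by the qemu process) says they should.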
[2:52] * gregaf1 (~Adium@cpe-76-174-249-52.socal.res.rr.com) has joined #ceph
[2:54] <gregaf1> dec: yeah, you really want tcmalloc
[2:55] <gregaf1> jeffrey4l_: are you using btrfs, and if so what kernel version?
[2:56] <jeffrey4l_> gregaf, yes, I am using rhel 4.3, the kernel version is 2.6.32-279.el6.x86_64
[2:57] <gregaf1> err, you mean 6.3?
[2:57] <gregaf1> I wonder what btrfs that actually is
[2:57] <jeffrey4l_> gregaf, yes, my typo.
[2:57] <gregaf1> sjust, still around? or dmick do you know the btrfs snapshot issues that pop up?
[2:58] <jeffrey4l_> gregaf, Name : btrfs-progs Arch : x86_64 Version : 0.19
[2:58] <tore_> happy new year everybody
[2:59] <dmick> I don't, but do we know /mnt/osd0 is not a ROFS?
[2:59] <jeffrey4l_> gregaf, I found this issue is discussed past year. I found it at http://irclogs.ceph.widodh.nl/index.php?date=2012-01-12
[2:59] <jeffrey4l_> But I found no solution.
[2:59] <dmick> mikedawson: if you're using /etc/ceph/ceph.conf that should be read by default
[3:00] <mikedawson> dmick: yep
[3:00] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:01] <gregaf1> I was thinking that EROFS was one of the weird return codes you could get out of old btrfs with subvolumes
[3:01] <gregaf1> but perhaps not, jeffrey4l_ you should make sure the filesystem is actually useable
[3:01] <gregaf1> create a couple files and write to them
[3:01] <mikedawson> dmick: thanks for working through it all with me today. Should I ping joshd next time he's on?
[3:03] * andreask1 (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[3:03] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[3:04] <jeffrey4l_> The weird thing is that I can create new file in <btrfs>/ ( root folder) but not <btrfs>/current/ folder
[3:06] <gregaf1> hrm; I'm not sure what could cause that
[3:07] <dmick> mikedawson: joshd will certainly have a leg up on this, but he's gone all week. I can do some more here if I can get libvirtd to behave
[3:08] <gregaf1> jeffrey4l_: I think we've lost the people who might have a better idea for the evening (come back tomorrow), but I'm inclined to toss it at the btrfs people
[3:09] <mikedawson> dmick: Good stuff. I should be around.
[3:09] <dmick> (had a log setting in /etc/libvirt/libvirt.conf that might have been causing the issue)
[3:11] <jeffrey4l_> gregaf, 1. what the time now? It is my morning now. 2. after I create folder <btrfs>/osd0/current folder, It seems works find...
[3:11] <dmick> jeffrey4l_: Pacific Time. It's 18:11 here now
[3:12] <dmick> there, started
[3:14] <jeffrey4l_> dmick, thx
[3:16] * jeffrey4l (~jeffrey@ has joined #ceph
[3:17] * jeffrey4l_ (~jeffrey@ Quit (Quit: Leaving)
[3:19] <jeffrey4l> gregaf, Do you means the best discuss time is daytime? right?
[3:21] * gregaf1 (~Adium@cpe-76-174-249-52.socal.res.rr.com) Quit (Quit: Leaving.)
[3:23] * amichel (~amichel@salty.uits.arizona.edu) Quit ()
[3:27] * jlogan (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[3:28] <dmick> jeffrey4l: that's what he meant, yes
[3:33] <jeffrey4l> thx. I am a newbie.
[3:35] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[3:36] * ShaunR (~ShaunR@staff.ndchost.com) Quit (Read error: Connection reset by peer)
[3:36] * MooingLemur (~troy@phx-pnap.pinchaser.com) Quit (Read error: Operation timed out)
[3:36] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[3:36] * MooingLemur (~troy@phx-pnap.pinchaser.com) has joined #ceph
[3:37] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[3:46] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[4:24] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:26] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has left #ceph
[4:35] * wubo (80f40d05@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[4:36] * Cube (~Cube@ has joined #ceph
[4:37] * Cube (~Cube@ Quit ()
[4:42] * Cube1 (~Cube@ Quit (Ping timeout: 480 seconds)
[4:52] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[4:52] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit ()
[4:57] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[5:07] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[5:07] * korgon (~Peto@isp-korex- has joined #ceph
[5:09] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[5:10] * agh (~agh@www.nowhere-else.org) has joined #ceph
[5:13] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[5:14] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit ()
[5:25] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[5:36] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[5:40] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit ()
[6:02] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[6:02] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has left #ceph
[6:02] <mikedawson> Upgrade to 0.56.1 was smooth
[6:06] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:09] <phantomcircuit> op_queue_max_ops is changing
[6:14] <phantomcircuit> ah i see it's doubling when it's committing
[6:21] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[6:22] * agh (~agh@www.nowhere-else.org) has joined #ceph
[6:30] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[7:13] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[7:13] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit ()
[7:39] * jeffrey4l (~jeffrey@ Quit (Quit: Leaving)
[7:45] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[7:54] <phantomcircuit> hmm
[7:54] <phantomcircuit> ceph osd tell 0 bench seems to be writing into cache
[7:54] <phantomcircuit> it's reporting
[7:54] <phantomcircuit> 2013-01-08 06:50:32.016704 osd.0 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 7.822724 sec at 130 MB/sec
[7:55] <phantomcircuit> except after that's reported the dirty cache is still full of dirty blocks from it
[8:07] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:10] * jamespag` (~jamespage@tobermory.gromper.net) Quit (Quit: Coyote finally caught me)
[8:11] * jamespage (~jamespage@tobermory.gromper.net) has joined #ceph
[8:14] * jpieper_ (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[8:19] * jpieper_ (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[8:21] * loicd (~loic@magenta.dachary.org) Quit (Ping timeout: 480 seconds)
[8:31] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[8:36] * agh (~agh@www.nowhere-else.org) has joined #ceph
[8:37] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[8:40] * gluffis (~gluffis@castro.mean.net) has joined #ceph
[8:41] * KindOne (~KindOne@ Quit (Read error: Connection reset by peer)
[8:54] * tnt (~tnt@112.169-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:57] * KindOne (~KindOne@ has joined #ceph
[8:58] * psiekl (psiekl@wombat.eu.org) Quit (Read error: Operation timed out)
[8:59] * psiekl (psiekl@wombat.eu.org) has joined #ceph
[9:13] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:20] * jpieper_ (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[9:20] * fc (~fc@ has joined #ceph
[9:21] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:22] * ScOut3R (~ScOut3R@ has joined #ceph
[9:23] * sleinen1 (~Adium@2001:620:0:25:34f4:2b26:b63c:36cb) Quit (Quit: Leaving.)
[9:23] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[9:25] * jpieper_ (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[9:27] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: I used to think I was indecisive, but now I'm not too sure.)
[9:28] * tnt (~tnt@112.169-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:29] * Leseb (~Leseb@ has joined #ceph
[9:31] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[9:33] * loicd (~loic@ has joined #ceph
[9:35] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:41] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[9:48] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[9:55] * korgon (~Peto@isp-korex- has joined #ceph
[9:58] * jeffrey4l (~jeffrey@ has joined #ceph
[10:00] <jeffrey4l> hi all , could any one tell me how can I attach one rbd image to my physical machine. I found that "echo " name=admin rbd foo" > /sys/bus/rbd/add" this command may help , But on my OS(ubuntu 12.04.1 server), there is no /sys/bus/rbd/ files.
[10:00] * jeffrey4l (~jeffrey@ Quit ()
[10:00] * jeffrey4l (~jeffrey@ has joined #ceph
[10:04] * ninkotech (~duplo@ip-94-113-217-68.net.upcbroadband.cz) Quit (Remote host closed the connection)
[10:06] <ScOut3R> jefferai: http://ceph.com/docs/master/rbd/rbd-ko/
[10:06] <ScOut3R> jeffrey4l: http://ceph.com/docs/master/rbd/rbd-ko/
[10:12] <jeffrey4l> ScOut3R, Thx. I have map the images successfully.
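For reference, /sys/bus/rbd only exists once the kernel module is loaded, and the `rbd` CLI is the friendlier path to the same result. A sketch, assuming the default `rbd` pool, an image named `foo`, and the `admin` user:

```
sudo modprobe rbd                       # creates /sys/bus/rbd
sudo rbd map foo --pool rbd --id admin  # maps the image, e.g. to /dev/rbd0
rbd showmapped                          # list current mappings
sudo rbd unmap /dev/rbd0                # undo the mapping
```

These commands need a reachable cluster and a valid keyring, so they are a sketch of the workflow rather than something to paste blindly.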
[10:13] * sleinen (~Adium@ has joined #ceph
[10:17] * sleinen1 (~Adium@2001:620:0:26:fda7:ada2:46ac:81b6) has joined #ceph
[10:18] * LeaChim (~LeaChim@b0faeeb0.bb.sky.com) has joined #ceph
[10:19] * sleinen (~Adium@ Quit (Read error: Operation timed out)
[10:20] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[10:20] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[10:37] * ninkotech (~duplo@ip-94-113-217-68.net.upcbroadband.cz) has joined #ceph
[10:42] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[10:49] * loicd (~loic@ has joined #ceph
[10:52] * jeffrey4l (~jeffrey@ Quit (Read error: Connection reset by peer)
[10:52] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[10:54] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[10:56] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[11:01] * gamakichi (c32fff42@ircip4.mibbit.com) has joined #ceph
[11:01] <gamakichi> hello
[11:01] <gamakichi> looking for help with stuck ceph
[11:02] <gamakichi> ceph health detail HEALTH_ERR 1 full osd(s) osd.2 is full at 95%
[11:02] <gamakichi> ceph mon tell \* injectargs '--mon-osd-full-ratio 99'
[11:02] <gamakichi> seems doesn't work
[11:02] * loicd1 (~loic@ has joined #ceph
[11:02] <gamakichi> mds comlains 2013-01-08 15:59:35.000294 b4a73b70 0 mds.0.objecter FULL, paused modify 0x9ba62d0 tid 19724
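For anyone hitting the same wall: `mon-osd-full-ratio` is a fraction, not a percentage, which is likely why the injectargs above had no visible effect. A sketch of raising the ratio temporarily (freeing space or adding OSDs is the real fix; the flag may also need injecting on the OSDs):

```
ceph mon tell \* injectargs '--mon-osd-full-ratio 0.98'   # fraction, not "99"
ceph health detail                                        # confirm the full flag clears
```

This only buys headroom down to the last few percent of the disk; writes resume once the cluster drops back under the ratio.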
[11:02] * loicd (~loic@ Quit (Read error: No route to host)
[11:03] * agh (~agh@www.nowhere-else.org) has joined #ceph
[11:03] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[11:04] * agh (~agh@www.nowhere-else.org) has joined #ceph
[11:17] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[11:22] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[11:22] <gamakichi> rados -p data bench 10 write
[11:23] <gamakichi> 1 16 16 0 0 0 - 0
[11:38] * gamakichi (c32fff42@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[11:57] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has left #ceph
[12:04] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[12:04] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:12] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:18] <loicd1> Hi, for the record I documented what I did to create the code coverage report for ceph in http://dachary.org/?p=1781 .
[12:18] * loicd1 is now known as loicd
[12:19] <loicd> I'm now going to try with https://github.com/ceph/teuthology :-)
[12:23] <gluffis> I think I have missed something in the docs, how does the HA work for CephFS from the client perspective? As I understand the client mount any of the mon servers, but is there a way to tell the mon servers to have a shared Ip ?
[12:24] <gluffis> nativly...
[12:24] <ScOut3R> gluffis: you don't need a shared ip for the mons, the ceph.conf contains every mon, so the clients knows about each mon instance
[12:25] <gluffis> so mount -t ceph server:/share /path is the wrong way to do it ? or will it handle failover automagically ?
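A single monitor address in the mount command is only the initial contact point; the client learns the full monitor map from it and fails over on its own. Several monitors can also be listed up front. A sketch, with the hostnames hypothetical and `<key>` standing in for the client secret:

```
mount -t ceph mon1:6789,mon2:6789,mon3:6789:/ /mnt/ceph -o name=admin,secret=<key>
```

So no shared IP is needed for the mons; losing the one address used at mount time only matters before the first successful connection.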
[12:31] <ScOut3R> wait
[12:31] <ScOut3R> you are using cephfs
[12:31] <gluffis> yes :D
[12:31] <ScOut3R> i don't know about the mds part
[12:31] <ScOut3R> i'm not using cephfs
[12:32] <gluffis> i am looking to replacing the old nfs servers with something better suited... like ceph/gluster/xtreemfs or something ;)
[12:34] <gluffis> the rdb paired with kvm seems really nice ;)
[12:42] <agh> gluffis: I was like you... But unfortunatly, after testing it quite a lot, CephFS is really not stable yet :(
[12:43] <gluffis> agh: so what did you go with ? kept nfs ?
[12:43] <paravoid> dec: ping?
[12:43] <paravoid> dec: wondering about those puppet manifests :)
[12:46] <agh> gluffis: yes, kept NFS. In fact today my VM disks are on an NFS share
[12:46] <agh> gluffis: and i wanted to simply replace it with CephFS
[12:46] <agh> gluffis: Worked. But it's not stable. Lot of MDS crashes
[12:46] <gluffis> built on drbd+corosync ?
[12:46] <agh> gluffis: So, for the moment i keep my NFS, and i'm working with RBD
[12:47] <agh> gluffis: RBD, which seems to be really more stable
[12:47] <gluffis> that could probably work also. WE only use NFS for webroot and mail currently
[12:48] <gluffis> we really dont have the budget for proper shared storage, like an HDS rack ;)
[12:48] * jeffrey4l (~jeffrey@ has joined #ceph
[12:48] <agh> gluffis: ... for the moment, i'm not sure that CephFS will work for you
[12:48] * jeffrey4l (~jeffrey@ Quit ()
[12:49] <agh> gluffis: yes it works... but.. hum. hum
[12:49] <gluffis> probably not ;)
[12:49] <gluffis> can keep nfs for web, and use rdb for kvm
[12:49] <gluffis> kvm is more dependant on shared storage
[12:49] <agh> gluffis: but is your fs has to be shared ?
[12:49] <gluffis> for kvm yes, for live migration
[12:50] <gluffis> mutiple web frontends
[12:50] <agh> gluffis: yes sure, rbd is great for that
[12:50] <gluffis> webfront is varnish->ngnix+memecached->apache->django
[12:58] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[12:58] <match> gluffis: I'm doing kvm with rbd like this: www.woodwose.net/thatremindsme/2012/10/ha-virtualisation-with-pacemaker-and-ceph/
[12:58] <match> gluffis: Seems to work really well
[12:59] <gluffis> match: ok ;) I'll have a look
[13:05] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:05] * jeffrey4l (~jeffrey@ has joined #ceph
[13:06] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:11] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[13:12] * agh (~agh@www.nowhere-else.org) has joined #ceph
[13:14] * Joel (~chatzilla@2001:620:0:46:2c52:9ddd:30b5:b57e) has joined #ceph
[13:19] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:22] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[13:22] * jano (c358ba02@ircip4.mibbit.com) has joined #ceph
[13:25] <jano> hi all
[13:25] <jano> today i have upgraded my ceph to v0.56.1 and now i have a big problem
[13:26] <jano> all of my osds are slowly failing
[13:27] <jano> osdmap e2542: 56 osds: 28 up, 30 in
[13:27] * low (~low@ has joined #ceph
[13:38] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:41] * jano (c358ba02@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[13:47] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[13:51] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:54] * loicd (~loic@ has joined #ceph
[14:03] * Joel_ (~chatzilla@2001:620:0:2d:b5ba:9795:b5a2:47db) has joined #ceph
[14:04] <paravoid> I don't see a 0.56.1 tag on git
[14:04] <paravoid> but there are packages already
[14:04] <paravoid> sagewk: ^^^
[14:05] <nhm> paravoid: hrm... I see one on github...
[14:06] <nhm> paravoid: created 16 hours ago
[14:07] <tnt> git fetch -t
[14:07] <jluis> someone on the ml suggested 'git fetch -t'
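The missing-tag confusion comes from plain `git fetch` not always pulling new tags; fetching them explicitly makes a fresh release tag show up. A small sketch:

```shell
# Fetch tags explicitly from the remote, then list release tags by pattern.
git fetch -t origin
git tag -l 'v0.56*'
```

`git fetch --tags` is the long spelling of the same flag.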
[14:08] * Joel (~chatzilla@2001:620:0:46:2c52:9ddd:30b5:b57e) Quit (Ping timeout: 480 seconds)
[14:09] * Joel_ is now known as Joel
[14:11] * BManojlovic (~steki@ has joined #ceph
[14:17] * Joel (~chatzilla@2001:620:0:2d:b5ba:9795:b5a2:47db) Quit (Ping timeout: 480 seconds)
[14:19] * korgon (~Peto@isp-korex- has joined #ceph
[14:20] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[14:25] * Joel (~chatzilla@user-23-21.vpn.switch.ch) has joined #ceph
[14:28] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[14:37] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[14:44] * jano (c358ba02@ircip2.mibbit.com) has joined #ceph
[14:55] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[15:00] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[15:04] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:10] * allsystemsarego (~allsystem@5-12-241-245.residential.rdsnet.ro) has joined #ceph
[15:17] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[15:17] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:19] * XSBen (~XSBen@ has joined #ceph
[15:34] <paravoid> how should I upgrade from 0.56 to 0.56.1 considering the protocol change?
[15:34] <paravoid> upgrade OSDs first and clients (radosgw here) last?
[15:36] * Joel (~chatzilla@user-23-21.vpn.switch.ch) Quit (Remote host closed the connection)
[15:37] * sleinen1 (~Adium@2001:620:0:26:fda7:ada2:46ac:81b6) Quit (Quit: Leaving.)
[15:37] * sleinen (~Adium@ has joined #ceph
[15:40] * sleinen1 (~Adium@ has joined #ceph
[15:40] * sleinen (~Adium@ Quit (Read error: Connection reset by peer)
[15:42] * sleinen (~Adium@2001:620:0:25:a517:3d90:9cda:457) has joined #ceph
[15:48] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) has joined #ceph
[15:48] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) has left #ceph
[15:48] * sleinen1 (~Adium@ Quit (Ping timeout: 480 seconds)
[15:57] * sagelap1 (~sage@126.sub-70-197-135.myvzw.com) has joined #ceph
[15:59] * sagelap2 (~sage@6.sub-70-197-142.myvzw.com) has joined #ceph
[16:03] * PerlStalker (~PerlStalk@ has joined #ceph
[16:04] * The_Bishop (~bishop@2001:470:50b6:0:c9e4:f321:8361:eb82) has joined #ceph
[16:05] * sagelap1 (~sage@126.sub-70-197-135.myvzw.com) Quit (Ping timeout: 480 seconds)
[16:06] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[16:07] * sagelap2 (~sage@6.sub-70-197-142.myvzw.com) Quit (Ping timeout: 480 seconds)
[16:08] * jano (c358ba02@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[16:12] * absynth_47215 (~absynth@irc.absynth.de) has joined #ceph
[16:12] <absynth_47215> joshd: you aren't awake yet, are you?
[16:14] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) Quit (Remote host closed the connection)
[16:21] <wido> paravoid: Yes, with 0.56.1 old clients can still connect
[16:22] * wubo (80f40d05@ircip2.mibbit.com) has joined #ceph
[16:32] <paravoid> wido: what's old? <= 0.56 or < 0.56?
[16:32] <wido> paravoid: <0.56
[16:32] <wido> like 0.48.2
[16:32] <paravoid> right, so that's my question
[16:32] <paravoid> my entire cluster is 0.56 now
[16:33] <paravoid> I want to upgrade to 0.56.1
[16:33] <paravoid> but I think I can't do that gracefully
[16:43] * tokitoki (~juneym@ has joined #ceph
[16:43] <tokitoki> hello!
[16:43] <tokitoki> quick question..
[16:43] <tokitoki> can ceph be able to run under a VirtualBox instance?
[16:44] <absynth_47215> why would you want to do that?
[16:44] <tokitoki> oh .. i just want to try it out
[16:45] <tokitoki> without real hardware.
[16:46] * wubo (80f40d05@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[16:47] * sleinen (~Adium@2001:620:0:25:a517:3d90:9cda:457) Quit (Quit: Leaving.)
[16:50] <match> tokitoki: I've run test setups on vms inside vms -works fine, but the performance is a bit poor :-)
[16:51] <tokitoki> cool! thanks match. that's fine. just want to familiarize myself with the installation and tool chain
[16:51] <tokitoki> by the way, based from reading the docs.
[16:51] * The_Bishop (~bishop@2001:470:50b6:0:c9e4:f321:8361:eb82) Quit (Read error: Operation timed out)
[16:51] <tokitoki> the POSIX-based fs of CEPH is not being recommended as production FS due to the need for more testing.
[16:52] * The_Bishop (~bishop@2001:470:50b6:0:f5f7:e1fe:efbb:87f9) has joined #ceph
[16:52] <tokitoki> is this true? i mean, all i've been reading relates to RADOS>
[16:53] <match> tokitoki: Yep - RADOS/rbd is ok for use, but cephfs is still a bit 'beta'
[16:53] <match> (as I understand it)
[16:54] * jlogan1 (~Thunderbi@2600:c00:3010:1:52d:be18:aa69:de7) has joined #ceph
[16:54] <tokitoki> i see. thanks.
[16:56] * sleinen (~Adium@ has joined #ceph
[16:57] * sleinen1 (~Adium@2001:620:0:26:650e:6be8:3e49:8795) has joined #ceph
[16:58] * aliguori (~anthony@ has joined #ceph
[17:04] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[17:05] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) has joined #ceph
[17:05] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:07] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[17:07] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:12] * tokitoki (~juneym@ Quit (Quit: tokitoki)
[17:14] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[17:17] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[17:23] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[17:26] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[17:38] * sleinen1 (~Adium@2001:620:0:26:650e:6be8:3e49:8795) Quit (Quit: Leaving.)
[17:39] * sleinen (~Adium@ has joined #ceph
[17:39] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[17:41] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[17:42] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Remote host closed the connection)
[17:42] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[17:45] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[17:47] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[17:48] * low (~low@ Quit (Quit: Leaving)
[17:50] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:58] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[18:00] * gmi (~Miranda@124.43.220-216.q9.net) has joined #ceph
[18:02] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[18:03] * jjgalvez (~jjgalvez@ has joined #ceph
[18:04] * tnt (~tnt@86.188-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:13] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:15] * Cube (~Cube@ has joined #ceph
[18:17] * ScOut3R (~ScOut3R@dsl5401A397.pool.t-online.hu) has joined #ceph
[18:27] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) Quit (Quit: Leaving.)
[18:27] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) has joined #ceph
[18:27] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:33] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[18:35] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) Quit (Ping timeout: 480 seconds)
[18:36] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:37] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Read error: Operation timed out)
[18:37] * sleinen (~Adium@2001:620:0:25:c8da:6853:45dd:e85c) has joined #ceph
[18:40] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[18:47] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:49] * sleinen (~Adium@2001:620:0:25:c8da:6853:45dd:e85c) Quit (Quit: Leaving.)
[19:06] * brady (~brady@rrcs-64-183-4-86.west.biz.rr.com) has joined #ceph
[19:12] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[19:15] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[19:24] * mattbenjamin (~matt@ has joined #ceph
[19:25] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[19:31] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[19:39] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:53] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[20:10] * mattbenjamin (~matt@ Quit (Ping timeout: 480 seconds)
[20:11] * mattbenjamin (~matt@ has joined #ceph
[20:12] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) has joined #ceph
[20:29] <paravoid> anyone around?
[20:30] <paravoid> my whole cluster (mon/osd/radosgw) is 0.56 now, I'm wondering if it's possible (and in what order) to upgrade to 0.56.1 without being bitten by the protocol incompatibility
[20:30] <gregaf> what are you using your cluster for, paravoid?
[20:30] <paravoid> radosgw
[20:32] <dmick> paravoid: someone on the list reported success doing that, I believe. I don't *think* upgrading should involve any issues with the 0.56 protocol issue
[20:33] <gregaf> it's not going to break any disk data
[20:33] <paravoid> I'm not asking about that
[20:33] <gregaf> but the OSDs and RGW daemons will get fussy about talking to each other
[20:33] <paravoid> I'm asking about segfaults
[20:33] <paravoid> or well, sigabrts
[20:33] <paravoid> the 0.55->0.56 upgrade wasn't much fun :)
[20:33] * ScOut3R (~ScOut3R@dsl5401A397.pool.t-online.hu) Quit (Remote host closed the connection)
[20:34] <dmick> sounds like you might be trying to stay live during upgrade. Not sure about that.
[20:34] <gregaf> if you can I would turn off the RGW instances, then upgrade the OSDs as quickly as you can (but you don't need to turn them all of at once), then turn the RGW instances back on
[20:34] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[20:35] <gregaf> That's the best I can offer though
[20:35] <paravoid> ceph is not getting very happy when I'm restarting OSDs, it takes a while to recover
[20:35] <paravoid> and I currently have 48 osds, so at best this means downtime for quite a while
[20:35] <paravoid> but oh well
[20:35] <paravoid> it's not the best situation but I can live with that
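Summarizing the order suggested above as a sketch (the init invocations are Debian-style approximations and the daemon names hypothetical; exact syntax varies by distro and init script):

```
sudo service radosgw stop          # quiesce the clients first
# upgrade packages on each node, then restart daemons:
sudo service ceph restart mon.0    # monitors
sudo service ceph restart osd.0    # OSDs, a few nodes at a time
sudo service radosgw start         # bring clients back last
```

The point of the ordering is that 0.56 and 0.56.1 daemons get "fussy about talking to each other", so the client side stays down until the OSDs have all crossed over.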
[20:36] <dmick> when you had the long recovery process, was the cluster under load at the time? if so, was it a lot of load?
[20:38] <paravoid> there was a bit of a load yes
[20:40] <dmick> I don't have a lot of experience upgrading large clusters but I'm still surprised. maybe I shouldn't be.
[20:40] <paravoid> it's not a particuarly large cluster imho
[20:41] <paravoid> it's just 48 osds right now
[20:41] <paravoid> and it'll be at least 96 within the month, possibly 144
[20:43] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:44] <dmick> well it's larger than my test clusters with which I have the most experience :)
[20:44] <gregaf> paravoid: you mean it takes a while to recover, right? how's the performance while recovery is going on?
[20:45] <gregaf> I think we just need do find a way to do more auto-tuning of recovery bandwidth/IOPS/CPU
[20:45] <paravoid> very sucky
[20:45] <paravoid> and it took about 5-6 days when I moved from 24->48 OSDs
[20:46] <janos> paravoid: bless you for doing you part to keep hdd prices down
[20:46] <janos> ;)
[20:46] <paravoid> 2TB drive per OSD, less than 20T (x2 replicas = 40T) data
[20:46] <paravoid> janos: eh?
[20:46] <janos> buying many many hdd's
[20:48] <paravoid> many?
[20:48] <paravoid> you mean 144?
[20:48] <janos> you're killing the joke. nm
[20:48] <janos> ;(
[20:48] <paravoid> that's only one DC, we have another 144 in the other DC and 48 in the third DC ;)
[20:48] <janos> nice
[20:52] <gregaf> it's lunch time but if recovery is causing that much trouble I'd like to hear a bit more about it; maybe we need to put more QA time on seeing how it behaves in various scenarios...
[20:52] * BManojlovic (~steki@ has joined #ceph
[20:53] <paravoid> gregaf: even an OSD restart results in multiple "slow requests" warnings for 30-60s
[20:53] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:54] <paravoid> 30-60s slow and this lasts for a while
[20:59] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:02] * KOulampamias (~home@athedsl-194481.home.otenet.gr) has joined #ceph
[21:03] * KOulampamias (~home@athedsl-194481.home.otenet.gr) Quit (autokilled: This host triggered network flood protection. please mail support@oftc.net if you feel this is in error, quoting this message. (2013-01-08 20:03:36))
[21:03] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:04] <buck> I'm working on adding a workunit that has a dependency on a package being installed (ant). Should I just do an "apt-get ant" at the start of the workunit or is there a more formal way to have a package installed pre-workunit execution?
[21:08] * korgon (~Peto@isp-korex- has joined #ceph
[21:17] * doubleg (~doubleg@ has joined #ceph
[21:18] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Ping timeout: 480 seconds)
[21:18] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[21:21] <dmick> buck: ceph.py looks like the only one that does that now
[21:21] <dmick> the other option is to add it to the chef code. I guess I'm evenly split on which is best
[21:22] <dmick> it's certainly easier to add to the chef code
[21:23] * madkiss (~madkiss@ has joined #ceph
[21:23] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[21:26] <gregaf> dmick: buck: shouldn't be doing package management in the workunits
[21:26] <gregaf> if we need new packages, put them in the chef-qa or whatever it is
[21:29] * buck1 (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[21:32] <dmick> gregaf: except that it does, I tend to agree
[21:32] <gregaf> it does what?
[21:34] <dmick> ceph.py already does install filesystem management packages based on the filesystem type requested
[21:34] * markl (~mark@tpsit.com) has joined #ceph
[21:34] <dmick> but I still agree that it's better left to chef
[21:35] <gregaf> that's probably just history ;)
[21:35] <dmick> yeah
[21:46] <buck1> dmick: gregaf: thanks. will go the chef route
[21:47] * gmi (~Miranda@124.43.220-216.q9.net) Quit (Quit: Miranda IM! Smaller, Faster, Easier. http://miranda-im.org)
[21:48] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[21:54] <mikedawson> dmick: any luck getting your libvirt working?
[21:55] <dmick> mikedawson: I had to leave before I got to the endpoint last night, and it involved rebuilding my Ceph binaries. Let me try starting that VM again
[21:57] <dmick> foo. let me rebuild again first :)
[21:58] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[22:05] <fghaas> gregaf: in light of the recent massive changes in the osd code, is the information contained in this thread still valid? http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6258
[22:06] <fghaas> or has the osd journal become smarter, such that perhaps you can just recreate a journal and do an incremental sync of the filestore thereafter?
[22:06] <gregaf> fghaas: the journal stuff is still the same
[22:07] <fghaas> so suppose you have an ssd that hosts several journals, and that ssd dies a horrible death, you'd essentially have to nuke all the osds, recreate them, and then sync back the data when you put them back in?
[22:08] <gregaf> yeah, that's still how it stands right now
[22:09] <fghaas> gregaf: gotcha, thanks
[22:11] <fghaas> out of curiosity, any plans to change that?
[22:13] <gregaf> I don't think anybody's considering it right now
[22:15] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Remote host closed the connection)
[22:16] <fghaas> hm. so in case you would like to pass this on, this is something I've heard several customers/users say they're scared about, specifically the ones where they'd run a rather small number of nodes (less than 10), with say 8 disks/osds per node. assuming cheap 2TB SATA spinners, that's a whopping 16TB potentially being shuffled around if an SSD dies (in the extreme case of just 3 osd hosts)
[22:17] * rosco (~r.nap@ Quit (Quit: bubye)
[22:17] * Rocky (~r.nap@ has joined #ceph
[22:18] <Vjarjadian> fghaas, if their cluster wasnt configured properly they could even lose data with that one node going down with 8 OSDs...
[22:18] <nhm> fghaas: we've had a couple of users ask about that too. My personal instinct is to tell people not to put more than around 3-4 journals on each SSD.
[22:19] <fghaas> Vjarjadian: yeah but the default crushmap has that covered, so if users change things and that breaks the cluster, it's not ceph's fault
[22:19] <nhm> fghaas: though on 12 bay 2U nodes that have 2 2.5" hotspare bays in the back, 6 osd journals per SSD starts looking attractive.
[22:19] * rlr219 (43c87e04@ircip2.mibbit.com) has joined #ceph
[22:20] <nhm> tough call.
[22:20] <gregaf> fghaas: have you been in contact with Neil yet? he's probably the one to start that discussion with from a commercial perspective
[22:20] <fghaas> nhm: so that would be 8-12TB being shuffled around then. in users' view that's still a lot of data
[22:20] <nhm> fghaas: very true
[22:21] <fghaas> gregaf: I doubt that this would be something that actually makes a user shy away from using ceph, but it is considered a bit of a nuisance by some
[22:21] <dmick> mikedawson: ok, gregaf pointed out that I was forgetting which version of Ceph qemu was actually running (sigh) so I have a VM up now
[22:22] <rlr219> Hi I see there is a maintenance release for Argonaut. I am currently on 0.48.2 and want to upgrade to 0.48.3 because my OSDs are all using XFS. what is the recommended upgrade process?
[22:23] <fghaas> nhm: 8TB being synced off even at 100MB/s takes 23 hours. that would be plenty of time to pop in a new SSD, so being able to recover gracefully from a journal-less ssd might be quite a win
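(The resync-time figure above checks out; a quick sketch of the arithmetic, assuming the discussion's numbers of 8 TB of OSD data and 100 MB/s sustained throughput:)

```python
# Rough resync-time arithmetic for the journal-loss scenario above.
# Figures are assumptions from the conversation, not measurements.
data_bytes = 8 * 10**12          # 8 TB (decimal)
throughput = 100 * 10**6         # 100 MB/s sustained resync rate
hours = data_bytes / throughput / 3600
print(f"{hours:.1f} hours")      # ~22 hours, close to the "23 hours" quoted
```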
[22:24] <mikedawson> dmick: good stuff. Can you get client admin sockets?
[22:25] <fghaas> rlr219: http://ceph.com/docs/master/install/upgrading-ceph/
[22:26] <gregaf> fghaas: nhm: discussing it with sjust offline and it is at least something we can talk about now, unlike last time we had the conversation
[22:27] <rlr219> Thanks fghaas! seemed to have missed that in my perusing of the docs.
[22:27] <fghaas> rlr219: no sweat :)
[22:27] <phantomcircuit> 58 active+clean+scrubbing+repair
[22:27] <phantomcircuit> that number isn't changing and there doesn't seem to be any activity on the osd
[22:27] <phantomcircuit> s
[22:28] <phantomcircuit> that mean there is an unrecoverable error?
[22:28] <mikedawson> fghaas: Read some of your blogs. Are you still doing writeback caching with Ceph? Flashcache?
[22:29] <fghaas> mikedawson: erwhat? afair I only ever blogged about flashcache+drbd, with ceph you don't need to jump through hoops like that :)
[22:30] <gregaf> phantomcircuit: did you start up a repair scrub?
[22:30] <phantomcircuit> yeah i did
[22:30] <phantomcircuit> but it seems to have stalled
[22:30] <gregaf> how'd you start it?
[22:30] <phantomcircuit> ceph osd scrub 0
[22:30] <phantomcircuit> ceph osd repair 0
[22:30] <phantomcircuit> why?
[22:30] <fghaas> mikedawson: osd journal on ssd == similar effect but just a whee bit more elegant
[22:30] <fghaas> s/whee/wee/
[22:30] <gregaf> so you started a repair on all the PGs hosted by osd.0
[22:31] <gregaf> phantomcircuit: of which there should be more than 48
[22:31] <gregaf> so some of them are completing but when they do new ones get started up, probably :)
[22:31] <phantomcircuit> gregaf, there are > 58 most of them completed
[22:31] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:31] <mikedawson> fghaas: thought I had seen some slides teasing flashcache and ceph
[22:31] <phantomcircuit> there is zero disk activity at the moment
[22:31] <phantomcircuit> so im 100% sure it's stalled
[22:32] <gregaf> phantomcircuit: what version are you running?
[22:32] <phantomcircuit> 0.55.1
[22:32] <gregaf> sjustlaptop can run through the scenarios faster than I can
[22:32] <phantomcircuit> fghaas, well that isn't 100% true flashcache still gives you read caching on the ssd which ceph doesn't do
[22:32] <fghaas> mikedawson: no, like I said I talked about flashcache+drbd at linux.conf.au last january. and incidentally I'm talking about Ceph there this January. :)
[22:32] <mikedawson> fghaas: I have a use case for lots of ~16K random writes. So I throw spindles at the problem, or look at writeback cache and write reordering (like Bcache writeback mode)
[22:33] <sjustlaptop> phantomcircuit: why did you repair?
[22:33] <sjustlaptop> is osd.0 alive?
[22:33] * rlr219 (43c87e04@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[22:33] * nwat (~Adium@173-164-201-201-SFBA.hfc.comcastbusiness.net) has joined #ceph
[22:34] <phantomcircuit> sjustlaptop, the drive was replaced and i dont 100% trust the copy job that my dc did
[22:34] <sjustlaptop> is io going through?
[22:34] <dmick> mikedawson: working on it. doublechecking that this rbd device is actually accessible from VM...and yes.
[22:34] <fghaas> phantomcircuit: I don't know what your use case is, but mine at the time was virtualization with qemu/kvm, and rbd is just way nicer than the drbd/flashcache combo, and comes with caching built in
[22:34] <phantomcircuit> yes
[22:34] <phantomcircuit> fghaas, same situation actually
[22:35] <phantomcircuit> sjustlaptop, yes
[22:36] <sjustlaptop> phantomcircuit: right, 0.55.1 had a bug in repair, fixed in 0.56 (one sec)
[22:36] <phantomcircuit> ah
[22:36] <phantomcircuit> ok restarted osd.0 and it's fixed
[22:36] <sjustlaptop> yeah, that would hide it
[22:36] <phantomcircuit> what's the bug?
[22:37] <fghaas> phantomcircuit: then, like I said, you'll be very well served with kvm and rbd :)
[22:37] <sjustlaptop> actually, in your case, not exactly sure, I'm looking at the commits, and that bug shouldn't have caused your specific behavior
[22:37] <sjustlaptop> odd
[22:38] <phantomcircuit> :|
[22:38] <sjustlaptop> one sec
[22:38] <phantomcircuit> i seem to trigger all kinds of bizarre bugs
[22:38] <fghaas> such as?
[22:38] <mikedawson> fghaas: I am doing qemu/kvm and rbd with writeback enabled. RBD is good for small random reads, but not fast enough for my write volume
[22:38] <phantomcircuit> fghaas, such as this one :)
[22:39] <sjustlaptop> phantomcircuit: found the bug, fixed in 0.56
[22:40] <sjustlaptop> we just failed to clear the REPAIR flag in some cases
[22:40] <sjustlaptop> it's actually harmless in this case
[22:40] <phantomcircuit> ah
[22:40] <phantomcircuit> good :)
[22:40] <fghaas> mikedawson: I seem to recall people in here reporting write speeds well in excess of 200 MB/s _from within the VM_ around the time of the argonaut release... assuming we are talking about streaming writes here
[22:41] <mikedawson> fghaas: I can get that with large enough writes, but I can't get the IOPS with my ~16K writes.
[22:41] <phantomcircuit> journal on ssd?
[22:41] <mikedawson> yes
[22:42] <phantomcircuit> check that your filestore drives are good
[22:42] <fghaas> more importantly: how many OSDs?
[22:44] <mikedawson> fghaas: Any number of OSDs I've tried have left me maxing at about 40 IOPS per physical 7200rpm drive when using 3x replication.
[22:45] <fghaas> well how many do you have now?
[22:45] <fghaas> osds, that is
[22:45] <mikedawson> fghaas: my problem is an IOPS available per spindle issue (thus my desire to do writeback reordering to take in lots of 16K writes and write out less 4MB writes or something like that)
[22:46] <mikedawson> fghaas: 8 today. 36 last week.
[22:46] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:48] <mikedawson> mikedawson: the fallback is always throw more spindles at the problem
[22:48] <mikedawson> a lot more
[22:51] <fghaas> mikedawson: at the risk of telling you something you already know: I am not entirely sure how you're measuring that or what benchmark you're using, but in case your benchmark within the vm does an fsync or fdatasync at any time you are aware that that goes to disk immediately, all the way through the host and storage driver? I ask because I'm unsure how realistic your write-coalescing expectations are... not that I'm the expert
[22:52] <mikedawson> fghaas: I'm not sure there
[22:53] <fghaas> http://libvirt.org/formatdomain.html#elementsDisks ... the doc for the "driver" option has an explanation of what the various cache options mean
[22:54] * jjgalvez (~jjgalvez@ Quit (Quit: Leaving.)
[22:54] <fghaas> I think the behavior you're looking for is actually in "unsafe" but that option is -- tadaaa! -- unsafe
[22:54] * jjgalvez (~jjgalvez@ has joined #ceph
[22:54] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:55] <mikedawson> fghaas: I've used writeback and none. I see an indistinguishable performance difference for 16K random writes.
[22:56] <gregaf> fghaas: the reason he's talking about bcache is that there are safe options which basically interpose a log-structured filesystem between the writer and the disk
[22:56] <gregaf> the OSD journal does a teensy bit of this, but it doesn't last very long and it tries not to get too far ahead so it's not often worth that much
[22:56] * The_Bishop (~bishop@2001:470:50b6:0:f5f7:e1fe:efbb:87f9) Quit (Read error: Connection reset by peer)
[22:57] <fghaas> mikedawson: like I said, if your benchmark would frequently tell its device to write _to disk_, I would expect to see no major difference between the cache and writeback
[22:57] <fghaas> "between cache='none' and cache='writeback'", sorry
[22:57] <mikedawson> fghaas: gotcha. thanks
[22:59] <mikedawson> gregaf: fghaas: that's exactly what I'm hoping for. Something that can give a ~250x IOP performance gain by re-ordering 16K random writes -> 4MB sequential writes.
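(The "~250x" figure above is just the size ratio; quick check:)

```python
# Coalescing ratio implied by mikedawson's numbers: how many 16K random
# writes fit in one 4MB sequential write.
small_write = 16 * 1024
coalesced = 4 * 1024 * 1024
print(coalesced // small_write)   # 256, i.e. the "~250x" quoted above
```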
[23:00] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[23:00] <fghaas> mikedawson: would you like a pony with that? ;)
[23:00] <mikedawson> yessir
[23:00] <gregaf> well, Ceph isn't going to do that, at least not on its own (it's a big, hard problem), but feel free to play around with sticking other stuff in the stack and telling us how it comes out ;)
[23:01] <mikedawson> Bcache claims to do something similar http://bcache.evilpiepirate.org/
[23:01] <phantomcircuit> fghaas, i would actually expect cache=none to be faster in that event
[23:01] <fghaas> phantomcircuit: possibly
[23:02] <phantomcircuit> mikedawson, bcache requires you run a fully custom kernel
[23:02] * The_Bishop (~bishop@2001:470:50b6:0:6077:d570:bf06:22e4) has joined #ceph
[23:02] <fghaas> only until kent manages to get it upstream :)
[23:02] <phantomcircuit> also if you're using rbd im not sure it would even help
[23:02] <mikedawson> phantomcircuit: that's the current problem, but the maintainer rebased on 3.7 this week. Not currently holding my breath
[23:02] <phantomcircuit> fghaas, that'll be years at best
[23:05] * buck1 (~Adium@soenat3.cse.ucsc.edu) has left #ceph
[23:07] <phantomcircuit> gregaf, why does the osd journal try so hard to commit the journal rapidly?
[23:07] <sjustlaptop> phantomcircuit: what do you mean?
[23:07] <phantomcircuit> well it seems like the journal is always empty
[23:07] <phantomcircuit> but maybe im just reading the perf dump wrong
[23:08] <gregaf> no, by default it does commits pretty frequently
[23:08] <phantomcircuit> 0.1 seconds right?
[23:08] <mikedawson> phantomcircuit: the default settings have it flush between 0.01 and 5 seconds, iirc. I believe more frequent flushes give more consistent performance
[23:08] <gregaf> partly that's because we haven't tried turning up the min interval and seeing what happens; Sam keeps explaining to me why that's okay and I keep on going back to wanting to turn it up ;)
[23:08] <sjustlaptop> it's going to do commits no more often than filestore_min_sync_interval and no less often than filestore_max_sync_interval
[23:09] <sjustlaptop> and it starts a commit when the journal hits half-full
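(A minimal sketch, not actual Ceph code, of the commit-trigger policy sjustlaptop describes: sync no more often than filestore_min_sync_interval, no less often than filestore_max_sync_interval, and start early when the journal hits half-full. The default interval values are assumptions based on mikedawson's 0.01/5s figures above.)

```python
def should_commit(since_last_sync_s, journal_used, journal_size,
                  min_interval=0.01, max_interval=5.0):
    """Decide whether the filestore should start a commit now."""
    if since_last_sync_s < min_interval:
        return False                          # never sync more often than min
    if journal_used >= journal_size / 2:
        return True                           # journal half-full: commit early
    return since_last_sync_s >= max_interval  # otherwise wait for the max interval
```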
[23:09] <gregaf> but basically the journal is for smoothing out latencies and data safety, not for improving stable throughput
[23:09] <sjustlaptop> right, you can't actually go faster than the backing disk in the long run
[23:10] <mikedawson> when I move those up to something like 10 and 15, I get noticeable latency when the journal syncs
[23:10] <phantomcircuit> journal_queue_ops is that the number of ops waiting to be put into the journal or the number of ops in the journal waiting to be commited?
[23:12] <sjustlaptop> iirc, it's the number of ops waiting to be written to the journal
[23:13] <phantomcircuit> op_queue_ops is the number of ops waiting to go to disk on the filestore?
[23:14] <buck> who is a good person to get to review a 1-line chef commit?
[23:17] <gregaf> just pin it on somebody, buck
[23:17] <gregaf> where's it at?
[23:17] <buck> branch wip-buck in ceph-qa-chef
[23:17] <buck> it adds ant to the standard install recipe for qa nodes
[23:20] <gregaf> hurray
[23:20] <gregaf> sadly I don't see any typos there
[23:20] <buck> gregaf: I like to keep those for my multi-line commits. Thanks greg
[23:21] <gregaf> I just figure if I'm going to review a commit that short I should get some satisfaction out of it, but there's nothing for me ;)
[23:24] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[23:24] <phantomcircuit> what units is avgcount?
[23:25] <gregaf> it's a count
[23:25] * chutzpah (~chutz@ has joined #ceph
[23:25] <phantomcircuit> oh right
[23:25] <gregaf> ie: average_latency{avg_count: 10, sum: 100}
[23:26] <gregaf> actually let's do average_latency{avg_count: 10, sum: 200}
[23:26] <gregaf> would be 10 ops or whatever, with a total of 200 units (milliseconds, here, though more probably seconds) of latency for an average of 20 each
[23:26] * rlr219 (43c87e04@ircip3.mibbit.com) has joined #ceph
[23:27] <phantomcircuit> ah so to get averages i need to do some maths
[23:28] <phantomcircuit> so for example "journal_latency":{"avgcount":8983,"sum":63.746854000}
[23:28] <rlr219> I wasu upgrading from 0.48.2 to 0.48.3 and my OSDs and MONs went fine. However I have 3 MDSs and one is stuck in a resolve state and the other 2 keep crashing. How do i fix this?
[23:28] <phantomcircuit> that's 7.096388066 ms average per op
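(The sum/avgcount maths above can be done directly on the perf dump JSON; a sketch using the journal_latency figures pasted in the conversation, where sum is in seconds and avgcount is the op count:)

```python
import json

# Fragment of an admin-socket `perf dump`, matching the example above.
perf = json.loads('{"journal_latency":{"avgcount":8983,"sum":63.746854}}')

lat = perf["journal_latency"]
avg_ms = lat["sum"] / lat["avgcount"] * 1000   # average latency per op, in ms
print(f"{avg_ms:.3f} ms per op")               # ~7.096 ms, as computed above
```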
[23:29] <gregaf> rlr219: did you have more than one active MDS/
[23:29] <rlr219> all 3 were active
[23:30] <gregaf> well, the one stuck in resolve is waiting to get information from the others
[23:30] <gregaf> what are the backtraces on the dead ones?
[23:31] <rlr219> 2013-01-08 17:08:14.306557 7f94b23f8700 0 mds.-1.0 ms_handle_connect on 2013-01-08 17:08:14.307579 7f94b23f8700 0 mds.-1.0 handle_mds_map mdsmap compatset compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object} not writeable with daemon features compat={},roc
[23:32] <rlr219> http://mibpaste.com/7J4JWF
[23:32] <rlr219> gregaf: Easier to read
[23:34] <rlr219> gregaf: Would the whole log be better?
[23:35] <gregaf> rlr219: no, that's not necessary
[23:35] <rlr219> gregaf: that was from one mds, but they are both pretty much identical.
[23:36] <rlr219> gregaf: is it true that the MDSs are only used for the cephfs?
[23:37] <gregaf> yes, they're only for the filesystem
[23:37] <gregaf> I'm surprised one is still running without the others though
[23:37] <gregaf> and the feature stuff shouldn't have changed from v0.48.2 to v0.48.3
[23:37] <gregaf> hrm
[23:38] <gregaf> oh, wait, it did, I was looking at the wrong tag
[23:38] <gregaf> rlr219: did you actually upgrade and restart the MDS which is still running?
[23:38] <rlr219> if it makes it easier, i am not using the cephfs at this time.
[23:38] <rlr219> yes
[23:39] * jjgalvez (~jjgalvez@ Quit (Quit: Leaving.)
[23:39] <gregaf> rlr219: well, you could just turn them off then
[23:40] <rlr219> gregaf: comment them out in config and shut down?
[23:41] <gregaf> yeah
[23:41] <rlr219> then if I want to add back later, I would go through the adding-an-MDS process?
[23:41] <gregaf> you might continue getting health warnings though, so then you'll want to remove the second and third from the map, do the newfs command, and turn on one daemon to sit idle
[23:42] <rlr219> ok. I can do that. If I have an issue i will hollar. Thanks!
[23:42] <gregaf> yep!
[23:43] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:44] <fghaas> gregaf: mind sharing the outcome of the offline discussion with sjust?
[23:45] <gregaf> well, since we now have deep scrub and incremental backfill we could co-opt those mechanisms to do an intelligent data transfer
[23:46] <gregaf> it's bigger than I was initially thinking though and it's not currently on our roadmap, but it's possible
[23:46] <gregaf> anything more precise than that I'd have to leave to him, but that's about it
[23:46] <fghaas> thanks, much appreciated
[23:46] <fghaas> any one of you guys traveling with sage to australia, btw?
[23:47] <gregaf> I think it's just Sage
[23:47] <gregaf> speaking of which I should ask him if he's still wearing his war-like t-shirt from Jeff now that Jeff won't be there
[23:48] <fghaas> what was the war-like t-shirt again?
[23:49] <gregaf> "My distributed filesystem is better than yours"
[23:49] <gregaf> Jeff sent it over and Sage was excited to be in a socially-appropriate setting to wear it ;)
[23:50] <nhm> gregaf: :)
[23:53] * nwat (~Adium@173-164-201-201-SFBA.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[23:53] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) Quit (Ping timeout: 480 seconds)
[23:53] <fghaas> well he should still wear it and I'll try to get John Mark to wear one too
[23:54] <fghaas> I'll bring that up in our call this week, thanks for the reminder :)
[23:56] <gregaf> haha, don't get me in trouble :p
[23:57] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:57] * aliguori (~anthony@ Quit (Remote host closed the connection)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.