#ceph IRC Log

IRC Log for 2013-01-18

Timestamps are in GMT/BST.

[0:00] <xmltok> check it out v
[0:00] <xmltok> http://pastebin.com/93Cc2FYR
[0:01] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[0:02] <sstan> dmick: omg ... you take half of the cluster out ... and plug it to a new rack. It multiplies itself
[0:02] * vata (~vata@2607:fad8:4:6:e06d:b51e:cfa2:38bb) Quit (Quit: Leaving.)
[0:03] <sstan> so one could sell "cluster seeds"
[0:03] <dmick> xmltok: yeah, see, that's the chef log
[0:03] <dmick> not the monitor log
[0:03] <dosaboy> hi, I have a newbie/naive question. I am trying to understand the topology required to use RGW and want to clarify that I understand correctly. Is it correct to say that MDS (metadata servers) are not required when using RGW with RADOS?
[0:03] <dmick> oh at the bottom, sorry
[0:04] <xmltok> np
[0:04] <dmick> dosaboy: yes. mds is only for the Posix filesystem
[0:04] <dosaboy> dmick: thanks, glad I understood correctly
[0:05] <dmick> xmltok: hum. I'm not sure how you add more debug to a chef cluster but we need to know why the mon is dying
[0:06] <xmltok> 'ceph-mon-all-starter' is the thing that starts mon?
[0:06] <dmick> yeah, should be
[0:06] <dmick> I guess you can do something with node['ceph']['config-sections']
[0:06] <dmick> see ceph.conf.erb
[0:06] <dmick> we'd like to have debug mon = 20 or the like in [global] or [mon]
[0:06] <dmick> or you could just hack the .erb
[0:07] <xmltok> ill shove it in the erb
[0:07] <dmick> or possibly just hack the /etc/ceph/ceph.conf; maybe if one's already there it'll leave it
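
What dmick is suggesting would amount to something like this minimal sketch, assuming the generated config lives at the default /etc/ceph/ceph.conf and using the upstart job name mentioned above:

    # append monitor debug logging to the config the recipes generated
    cat >> /etc/ceph/ceph.conf <<'EOF'
    [mon]
        debug mon = 20
    EOF
    # then (re)start the monitor job so the new setting is picked up
    start ceph-mon-all-starter
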
[0:08] <xmltok> i threw debug mon = 20 in there but there isn't anything new in the log dir, ceph-mon on the command line complains about a missing store: problem opening monitor store in /var/lib/ceph/mon/ceph-admin: (2) No such file or directory
[0:09] * dosaboy (~gizmo@host86-164-229-186.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[0:10] * dosaboy1 (~user1@host86-164-229-186.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[0:12] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[0:12] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[0:13] <dmick> hmmm
[0:14] <sstan> .. but there actually IS such fole or directory, right ?
[0:14] <sstan> fole/file
[0:14] <dmick> https://github.com/ceph/ceph/blob/master/src/upstart/ceph-osd-all-starter.conf is what should be running
[0:14] <dmick> which will then run
[0:15] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[0:15] <dmick> https://github.com/ceph/ceph/blob/master/src/upstart/ceph-mon.conf because of that emit ceph-mon
[0:15] <xmltok> the directory didnt exist, i made it and ran again, i ended up with a lock file in there
[0:15] <xmltok> if i look in the recipe its making /var/lib/ceph/mon/ceph-<hostname>, but not /var/lib/ceph/mon/ceph-admin
[0:15] <xmltok> maybe the admin thing is just something that is normally overridden with chef to the -hostname variant
[0:15] <dmick> ceph-admin seems odd to me
[0:16] <dmick> did you configure a "cluster name" somewhere as admin?
[0:17] <dmick> or better yet
[0:17] <dmick> what's the ceph.conf that got installed
[0:17] <xmltok> yeah admin is the default, i only see that when i run ceph-mon manually, im thinking chef overrides it with --mon-data
[0:18] <xmltok> this is the ceph.conf generated http://pastebin.com/T1k2UkMg
[0:18] <xmltok> mon host = is blank .......
[0:18] <xmltok> ok, i know the problem, hold up
[0:19] <xmltok> i applied recipes and not roles, the library is looking up roles
[0:19] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Read error: Operation timed out)
[0:19] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[0:21] <xmltok> here it is now, it has mon hosts defined, still fails the same way and without logs, http://pastebin.com/6jmF4bpV
[0:23] <sstan> what does the cluster monmap look like
[0:23] <sstan> got to go; ttyl
[0:26] <xmltok> ok
[0:26] <xmltok> upstart isn't starting ceph-mon-all-starter because the ceph-mon-all-starter.conf file for upstart looks to see if /var/lib/ceph/mon/ceph-<hostname>/upstart exists
[0:27] <xmltok> i touched the upstart file and it started up, ceph-mon is running
[0:36] <dmick> well it's looking for /var/lib/ceph/mon/$cluster-$id really
[0:36] <dmick> (/upstart)
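
In shell terms, the workaround xmltok describes (with dmick's $cluster-$id correction) is roughly the following sketch; the cluster name "ceph" and a hostname-based mon id are assumptions here:

    # satisfy the marker file that ceph-mon-all-starter.conf checks for
    mkdir -p /var/lib/ceph/mon/ceph-$(hostname)
    touch /var/lib/ceph/mon/ceph-$(hostname)/upstart
    # then let upstart start the monitor
    start ceph-mon-all-starter
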
[0:36] * mattbenjamin (~matt@aa2.linuxbox.com) Quit (Quit: Leaving.)
[0:36] <dmick> I assume you're saying the recipes didn't create that properly
[0:37] <xmltok> right, i dont see anything in them to create them
[0:37] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has left #ceph
[0:38] <dmick> that should have been done by ceph-disk-activate
[0:39] <dmick> I think
[0:39] <dmick> hm, maybe not; that's OSDs only I think
[0:42] <gregaf> it's possible we left a hole when putting in that fix :/
[0:43] <gregaf> I think ceph-deploy will touch the file but other deployment systems which use upstart might not
[0:47] <dmick> just consulted with Sage; yeah, that's a hole because of the recent "don't assume upstart" change. Sorry xmltok
[0:47] <dmick> dealing with a multiplicity of service frameworks is annoying
[0:49] * drokita (~drokita@199.255.228.10) Quit (Quit: Leaving.)
[0:49] * aliguori_ (~anthony@32.97.110.59) Quit (Remote host closed the connection)
[0:49] <dmick> probably for now putting something in recipes/mon.rb to touch that file is the right answer for Chef
[0:50] <xmltok> thats my plan, im not real solid on chef so im not sure what the best way would be to do that only when upstart is used on a node, so i wont recommend a fix in github
[0:50] <gregaf> hmm, if the recipes assume Upstart, otherwise they'll need a guard
[0:50] <dmick> sage seemed to think they did; I'm auditing for that
[0:50] * PerlStalker (~PerlStalk@72.166.192.70) Quit (Quit: ...)
[0:51] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[0:51] <gregaf> not sure what state they're in but I think they pre-date ceph-disk-prepare, and outside of using that they'd have to work pretty hard to require upstart (Chef has its own primitives for starting and stopping daemons)
[0:51] <dmick> service "ceph-mon-all-starter" do
[0:51] <dmick> provider Chef::Provider::Service::Upstart
[0:51] <dmick> action [:enable]
[0:51] <dmick> end
[0:51] <dmick> <shrug>
[0:51] <xmltok> its the upstart configuration that came with my ceph package
[0:52] <xmltok> root@swiftpocdev001:/var/lib/ceph# grep done /etc/init/ceph-mon-all-starter.conf
[0:52] <xmltok> if [ -e "/var/lib/ceph/mon/$f/done" ] && [ -e "/var/lib/ceph/mon/$f/upstart" ]; then
[0:52] <dmick> yeah, but that's not the chef; gregaf means does the chef require upstart
[0:52] <xmltok> gotcha
[0:53] <xmltok> so now all i need to do is mkfs and mount my osd's and i should be ready to roll?
[0:53] <dmick> well chef should be able to do those as well, but maybe has similar problems
[0:54] * BManojlovic (~steki@gprswap.mts.telekom.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:54] <xmltok> the osd cookbook looks like it only works if you are using crowbar
[0:54] <xmltok> otherwise you need to manually do the ceph-disk-prepare
[0:56] <dmick> it does seem that way
[0:58] <dmick> apologies, I thought it was more complete
[0:58] <xmltok> it seemed too good to be true
[0:58] <xmltok> :)
[0:59] * xiaoxi (~xiaoxiche@134.134.139.72) has joined #ceph
[1:00] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[1:00] * ChanServ sets mode +o scuttlemonkey
[1:06] <dmick> xmltok: http://tracker.newdream.net/issues/3852
[1:07] <dmick> http://tracker.newdream.net/issues/3851
[1:07] * xiaoxi (~xiaoxiche@134.134.139.72) Quit (Ping timeout: 480 seconds)
[1:07] <dmick> in case you want to watch, contribute, etc.
[1:14] * mattbenjamin (~matt@adsl-75-45-226-110.dsl.sfldmi.sbcglobal.net) has joined #ceph
[1:16] <xmltok> thanks dmick
[1:16] <xmltok> i was able to get my OSDs going pretty easily with ceph-disk-prepare and then ceph-disk-activate
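
The flow xmltok used would look roughly like this for one disk; the device names are only examples:

    # partition the disk and lay down the filesystem and OSD metadata
    ceph-disk-prepare /dev/sdb
    # register the OSD with the cluster and start the daemon
    ceph-disk-activate /dev/sdb1
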
[1:17] <dmick> cool
[1:17] <xmltok> i do not have any mds processes though
[1:27] * jlogan1 (~Thunderbi@2600:c00:3010:1:49d6:5ead:ab1a:61ba) Quit (Ping timeout: 480 seconds)
[1:29] * xiaoxi (~xiaoxiche@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[1:29] * gaveen (~gaveen@112.135.133.191) Quit (Remote host closed the connection)
[1:30] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[1:30] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Read error: Connection reset by peer)
[1:31] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:37] * The_Bishop_ (~bishop@i59F6A14A.versanet.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[1:41] * mikedawson_ (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:45] * jmlowe1 (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[1:46] * dty (~derek@testproxy.umiacs.umd.edu) Quit (Quit: dty)
[1:48] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[1:50] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[1:51] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[1:53] * bstaz (~bstaz@ext-itdev.tech-corps.com) Quit (Quit: leaving)
[1:57] * jlogan1 (~Thunderbi@72.5.59.176) has joined #ceph
[2:01] <xmltok> root@swiftpocdev001:/etc/apt/sources.list.d# ceph osd stat
[2:01] <xmltok> e138: 39 osds: 39 up, 39 in
[2:01] <xmltok> im in business
[2:05] * benpol (~benp@garage.reed.edu) has left #ceph
[2:06] <dmick> xmltok: cool
[2:07] <xmltok> very
[2:07] <xmltok> it looks like the radosgw is going to need some love before its ready to roll but this is a good start, i can play with some things now and hopefully mount cephfs and see how that works out
[2:08] * Cube1 (~Cube@12.248.40.138) Quit (Ping timeout: 480 seconds)
[2:08] <dmick> rados and rbd CLIs are simpler fun
[2:08] <dmick> albeit not particularly useful, they'll let you explore the cluster a bit
[2:09] <xmltok> i am poking at making pools and setting sizes now
[2:09] <xmltok> my two main use cases are an s3/swift store and cephfs as a replacement for gpfs. the gpfs data is transient so it wouldnt be a major issue if cephfs became corrupted
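
Creating a pool and adjusting its replica count, as xmltok is poking at above, looks roughly like this (pool name, pg count and size are only examples):

    # create a pool with 128 placement groups
    ceph osd pool create mypool 128
    # keep 3 copies of each object in that pool
    ceph osd pool set mypool size 3
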
[2:15] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[2:19] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[2:20] * alram (~alram@38.122.20.226) Quit (Quit: leaving)
[2:21] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[2:23] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[2:26] * Ryan_Lane (~Adium@216.38.130.166) Quit (Quit: Leaving.)
[2:32] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: Leaving.)
[2:35] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has joined #ceph
[2:37] * ninkotech (~duplo@89.177.137.236) Quit (Ping timeout: 480 seconds)
[2:37] * LeaChim (~LeaChim@b0fadd12.bb.sky.com) Quit (Ping timeout: 480 seconds)
[2:38] * ninkotech_ (~duplo@89.177.137.236) Quit (Ping timeout: 480 seconds)
[2:39] * ninkotech (~duplo@89.177.137.236) has joined #ceph
[2:40] * ninkotech_ (~duplo@89.177.137.236) has joined #ceph
[2:40] * BillK (~billk@58-7-235-12.dyn.iinet.net.au) has joined #ceph
[2:43] * The_Bishop (~bishop@e179005167.adsl.alicedsl.de) has joined #ceph
[2:43] <xmltok> and now i have put an image file into rgw. sweet
[2:44] <xiaoxi> hi
[2:44] <xiaoxi> any idea about assert(_get_map_bl(epoch, bl)); failed?
[2:46] * dpippenger (~riven@216.103.134.250) Quit (Remote host closed the connection)
[2:48] <dmick> xiaoxi: there's http://tracker.newdream.net/issues/3770
[2:51] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has left #ceph
[2:53] <xiaoxi> dmick: Thanks, BTW, can I roll back ceph to 0.53 or something?
[2:53] <dmick> I doubt rollback is safe wrt preserving data
[2:54] <dmick> if you don't care about your data you can do anything you like :)
[2:56] <xiaoxi> Well, I have no data there, but how can I get a 0.53 rpm?
[2:59] <dmick> either it's on our yum repo or it isn't; if it's not, you can get the sources and build
[3:00] <dmick> glowell1: where is our official yum repo point, anyway?
[3:01] <dmick> http://ceph.com/docs/master/install/rpm/ may be useful I suppose :)
[3:01] <dmick> xiaoxi: http://ceph.com/rpm-testing/el6/x86_64/
[3:02] <dmick> not sure you want to go for 0.53 though
[3:02] <xiaoxi> so for rpm-testing there is no available rpm for ubuntu?
[3:03] <glowell1> There are repos for the major releases. The repo file that can be installed points to I think ceph.com/rpms
[3:03] <dmick> if nothing else, if you're up for installing rpms, 3770 is fixed in master
[3:03] <dmick> xiaoxi: ubuntu uses .debs
[3:03] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[3:03] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[3:07] <xiaoxi> dmick:yes, but i cannot find 0.53 from http://ceph.com/debian-testing/dists/quantal/main/binary-amd64/Packages
[3:09] <dmick> why do you think you want 0.53 specifically?
[3:11] * Ryan_Lane (~Adium@216.38.130.166) has joined #ceph
[3:13] * Ryan_Lane (~Adium@216.38.130.166) Quit ()
[3:14] * houkouonchi-work (~linux@12.248.40.138) Quit (Read error: Connection reset by peer)
[3:15] * houkouonchi-work (~linux@12.248.40.138) has joined #ceph
[3:24] <dmick> xiaoxi: ^
[3:30] * Ryan_Lane (~Adium@216.38.130.166) has joined #ceph
[3:30] * Ryan_Lane (~Adium@216.38.130.166) Quit ()
[3:31] <phantomcircuit> http://178.33.22.5/~phantomcircuit/charts/
[3:31] <phantomcircuit> for anybody interested i expanded my charts
[3:33] <xiaoxi> dmick: Well, we are facing serious instability with 0.56.1, so we are thinking of rolling ceph back to some earlier version
[3:34] <phantomcircuit> xiaoxi, you cant without completely rebuilding the cluster :(
[3:34] <xiaoxi> The two candidates are 0.48.3 and 0.53, since 0.48.X has been the stable release for a long time, and we have done some tests on 0.53 previously, but indeed I am not sure it's stable enough
[3:35] <xiaoxi> phantomcircuit: I don't care about the data :) it's just a test rack
[3:35] <phantomcircuit> ah
[3:37] <dmick> xiaoxi: I'll bet master is even more stable than those
[3:38] <xiaoxi> what do you mean by master?
[3:39] <xiaoxi> 0.56.1? or newest master source?
[3:41] <xiaoxi> dmick: last night I had a 6-node ceph cluster (120 disks with 120 osds) and used 8 clients (480 RBDs) to push a very high load to ceph; as a result, 4 out of 6 nodes got reset after ~6 hours
[3:42] <phantomcircuit> dmick, when FileStore::sync_entry runs there is a complete stop to all io
[3:42] <phantomcircuit> do you know if this is the intended behaviour?
[3:42] <phantomcircuit> cause i gotta say it doesn't make a whole lot of sense to me
[3:43] <dmick> xiaoxi: I mean latest build of the git master branch (i.e. built-on-demand)
[3:44] * rturk is now known as rturk-away
[3:44] <xiaoxi> dmick:so I need to clone the git master and build by myself ,right?
[3:44] <dmick> xiaoxi: no, there are packages built continuously for git branches
[3:44] * rturk-away is now known as rturk
[3:44] <dmick> http://ceph.com/docs/master/install/debian/#development-testing-packages
[3:45] <dmick> phantomcircuit: I'm not an expert on that code but I think that's sort of the point of sync_entry: to be a "stop and commit things" point
[3:47] <gregaf> dmick: phantomcircuit: it should be stopping the backing store, but the journal should keep rolling through, not halt
[3:47] <gregaf> I'm not sure what the issue is likely to be though; should poke sjust/sjustlaptop next time he's around
[3:47] <gregaf> or just file a bug report on it :)
[3:49] * rturk is now known as rturk-away
[3:49] <dmick> well ok. "complete stop to all io" is potentially ambiguous
[3:49] <phantomcircuit> gregaf, it definitely looks like a bug
[3:49] <phantomcircuit> dmick, yeah it is, i meant that journal writes stop also
[3:49] <dmick> ok
[3:50] <phantomcircuit> you'll see the 'cur' graph towards the bottom
[3:50] * mattbenjamin (~matt@adsl-75-45-226-110.dsl.sfldmi.sbcglobal.net) Quit (Quit: Leaving.)
[3:50] <phantomcircuit> that's current MB/s throughput from rados bench
[3:50] <phantomcircuit> but also you can see journal_ops/bytes plateau when committing=1
[3:51] <phantomcircuit> this is 0.49 btw but i saw identical behaviour in 0.56
[3:53] * winston-d (~zhiteng@pgdmzpr01-ext.png.intel.com) has joined #ceph
[3:58] <winston-d> dmick: regarding this bug: http://tracker.newdream.net/issues/3770 are you still suggesting using 0.56 over 0.48.3?
[4:01] <winston-d> xiaoxi and i have struggled with ceph for the last week: OSD nodes auto-rebooting, OSD crashes. what we need is just one stable environment which is able to provide reasonable performance for RBD
[4:01] <dmick> that bug is the one that started the conversation, and yes, it's fixed in master, which is why I've been recommending master
[4:03] <sjustlaptop> phantomcircuit: yes, that is the correct behavior sort of
[4:03] <dmick> there will soon be a 0.56.2 that includes that fix as well, I'm sure. I don't understand the fix well enough to know when it was introduced, but the comments imply anything past 0.50 may suffer from this problem
[4:03] <sjustlaptop> we need a stable commit point, so we pause writes to the file system before the sync
[4:03] <sjustlaptop> writes can still go through the journal in the mean time though
[4:03] <dmick> sjustlaptop knows better about that bug
[4:04] * jlogan1 (~Thunderbi@72.5.59.176) Quit (Ping timeout: 480 seconds)
[4:05] <sjustlaptop> dmick: it's also fixed in the "bobtail" branch
[4:05] <winston-d> dmick: in order to use master, we have to get the source and compile?
[4:05] <sjustlaptop> we'll need to do another 56 point release
[4:05] <sjustlaptop> at some point
[4:07] <dmick> (06:44:06 PM) xiaoxi: dmick:so I need to clone the git master and build by myself ,right?
[4:08] <dmick> (06:44:26 PM) dmick: xiaoxi: no, there are packages built continuously for git branches
[4:08] <dmick> (06:44:57 PM) dmick: http://ceph.com/docs/master/install/debian/#development-testing-packages
[4:08] <winston-d> dmick: good to know.
[4:08] <dmick> sjustlaptop: git is failing me
[4:09] <dmick> maybe it was cherrypicked
[4:09] <sjustlaptop> it was
[4:09] <sjustlaptop> 3293b31b44c9adad2b5e37da9d5342a6e4b72ade in the bobtail branch
[4:09] <dmick> yup
[4:10] <dmick> xiaoxi: winston-d: in which case, you can use the bobtail packages and be closer to 0.56.1 if you like (and somewhat more stable)
[4:11] <winston-d> dmick: k. thx
[4:12] <dmick> yw
[4:13] * Ryan_Lane (~Adium@216.38.130.166) has joined #ceph
[4:15] * chutzpah (~chutz@199.21.234.7) Quit (Quit: Leaving)
[4:15] <winston-d> dmick: 0.56 is preferable due to improved performance as well as stability?
[4:24] * Ryan_Lane (~Adium@216.38.130.166) Quit (Quit: Leaving.)
[4:25] <dmick> winston-d: I was just providing bobtail (which is 0.56.1++) as an alternative to master in case you didn't want to be absolutely-bleeding-edge
[4:25] <dmick> it has less change in it than master, and more thought given to upgrading from 0.56.1
[4:25] <dmick> since the plan is that it will become 0.56.2
[4:26] <dmick> so you can choose which route you prefer based on your own philosophy
[4:26] <dmick> you can compare the changes by examining the git tree to get a sense of how different they are if you like
[4:29] <winston-d> dmick: thx. so my question is, bobtail's improvement over argonaut is mainly performance or stability as well?
[4:30] <dmick> well that's a big question. There's a lot of new functionality
[4:30] <winston-d> since we also hit the auto-rebooting-OSD-node issue besides the OSD crash issue, we really need a stable version.
[4:30] <dmick> and we always strive for performance *and* stability, of course.
[4:30] <dmick> yes, there are bugs; 3770 was one
[4:31] <dmick> I'm probably not the best judge of how stable is stable and how stable is comfortable for you, but very soon now, argonaut will not get any but the most critical fixes applied
[4:34] <winston-d> dmick: well, our biggest problem is OSD nodes will randomly auto-reboot under certain load. we can't blame anyone yet since we haven't done enough tests to narrow down the root cause (it takes 1~5 hours to reproduce the issue), but the Ubuntu kernel/XFS/Ceph are on my list.
[4:35] <dmick> that's pretty catastrophic. is there any information about the reboot in syslog or console messages?
[4:36] <winston-d> dmick: sadly, nothing obvious.
[4:39] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[4:39] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[4:40] <winston-d> you know i really want to blame ubuntu kernel for this, :) but others in the team suggest we change ceph version.
[4:41] <dmick> it would surprise me very much if ceph were at the root of this. Are you using kernel rbd or cephfs?
[4:43] <winston-d> dmick: we use kernel rbd. and that's another story. using 12.04 default kernel 3.2, we've seen client kernel panic. and we moved to 3.6 kernel, now rbd client is fine.
[4:44] <dmick> right. lots of stuff has gone on in krbd. But yes, that wouldn't affect the OSD nodes unless they were also the krbd nodes (which is a really bad idea anyway...deadlock)
[4:46] <winston-d> dmick: suppose the user space rbd shouldn't have such problem, e.g. used with QEMU/KVM?
[4:46] <dmick> yes
[4:47] <winston-d> k.
[4:48] * dty (~derek@testproxy.umiacs.umd.edu) has joined #ceph
[4:49] * dty (~derek@testproxy.umiacs.umd.edu) Quit ()
[5:24] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[5:40] <phantomcircuit> sjustlaptop, that's what i expected; the thing is writes aren't going to the journal in the meantime, they're stopped completely (and the journal isn't full; notice i have journal_full as a graph)
[5:58] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:06] * yoshi (~yoshi@p6124-ipngn1401marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[6:08] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[6:08] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[6:13] <phantomcircuit> sjust, in case you're not on your laptop anymore :)
[6:15] <sjustlaptop> phantomcircuit: is your journal a seperate device?
[6:16] <phantomcircuit> yes it's a 10GB partition on an ssd
[6:16] <phantomcircuit> this is 1 KB writes so im 100% sure it's not full
[6:16] <sjustlaptop> is the fs on the same ssd?
[6:16] <phantomcircuit> no it's 4 KB writes but still
[6:16] <phantomcircuit> no the fs is on a conventional hard drive
[6:16] <sjustlaptop> hmm
[6:16] <phantomcircuit> which is why the sync takes so long :)
[6:17] <sjustlaptop> the journal doesn't have to be full, the journal can only get up to 1000 ops (I think) ahead of the backing disk during a sync
[6:17] <phantomcircuit> but still the committing flag shouldn't cause journal writes to block
[6:17] <phantomcircuit> im pretty sure i actually upped that number to 1 000 000
[6:17] <sjustlaptop> which settings did you change?
[6:18] <phantomcircuit> journal max write bytes/ops and journal queue max ops/bytes, all set to ridiculously high values
[6:18] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[6:19] <phantomcircuit> i can try increasing them to even more ludicrous values but i sort of doubt it'll change anything
[6:19] <phantomcircuit> worth a shot though i guess
[6:19] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:20] <sjustlaptop> hmm, I think you need filestore_queue_max_ops
[6:20] <sjustlaptop> be advised though, if your backing disk is 1/2 the speed at small random io as your backing ssd is at streaming small io, then the journal must be idle most of the time
[6:20] <xiaoxi> winston-d: are you still online?
[6:21] <sjustlaptop> the size of the various queues merely determines how long you can exceed that speed limit
[6:22] <sjustlaptop> actually, make filestore_queue_committing_max_ops and filestore_queue_max_ops large
[6:22] <sjustlaptop> I think if you make them large enough you will see the journal fill
[6:22] <sjustlaptop> and op latencies get obscenely bad
[6:23] <xiaoxi> sjustlaptop: but increasing filestore_queue_max_ops seems to lead to more fluctuation in performance from the client's view, even when the load on the ceph cluster remains stable for a long enough time
[6:23] <sjustlaptop> xiaoxi: yeah, it's a bad idea in general
[6:24] <sjustlaptop> well, tuning the values could probably help since I doubt the defaults are terribly likely to be optimal, but making them very large will just make the io extremely spiky
[6:25] <xiaoxi> will larger filestore_queue_max_bytes and ops mean a better chance of request merging?
[6:25] <dec> does librbd log warnings/errors anywhere? e.g. if I'm using librbd via qemu/kvm and it hits some error, does that get logged ?
[6:25] <sjustlaptop> xiaoxi: potentially, if you increase the sync interval as well
[6:26] <sjustlaptop> not hugely sure about that
[6:26] <winston-d> xiaoxi: yup
[6:26] <sjustlaptop> it would depend on how good the filesystem is about flushing out data between syncs
[6:29] <xiaoxi> winston-d: seems we hit the same instability issue; you can refer to my mail (on the mailing list) for what tests I have done
[6:31] <xiaoxi> sjustlaptop: When trying to make performance stable from the client's side, what's your suggestion for tuning?
[6:34] * winston-d (~zhiteng@pgdmzpr01-ext.png.intel.com) Quit (Quit: Leaving)
[6:36] <sjustlaptop> xiaoxi: it's really hard to say, nhm might have a better idea
[6:36] <sjustlaptop> smaller queues might mean smoother performance
[6:37] <xiaoxi> with the default configuration, it's likely for some clients to suffer 0 bandwidth for several minutes
[6:38] <xiaoxi> it's obviously not a good idea
[6:40] <phantomcircuit> btw i think it might be a good idea for rados rmpool to be an interactive command
[6:40] <phantomcircuit> heh
[6:40] <phantomcircuit> sort of like how rm -rf / doesn't work on most systems anymore
[6:40] * winston-d (~zhiteng@pgdmzpr01-ext.png.intel.com) has joined #ceph
[6:41] <phantomcircuit> sjustlaptop, and yeah i know in general it's a bad idea but the typical case is for there to be a burst of activity for maybe 30 seconds and then the entire setup to go back to completely idle
[6:41] <phantomcircuit> otherwise i would definitely not be trying to do this
[6:42] * winston-d (~zhiteng@pgdmzpr01-ext.png.intel.com) Quit ()
[6:43] <sjustlaptop> phantomcircuit: sure, there are certainly use cases, just not a good general purpose setup
[6:44] <sjustlaptop> anyway, if you want to be able to burst through your entire journal, you need filestore queue max ops and filestore queue max committing ops
[6:44] <phantomcircuit> also with filestore_queue_max_ops and filestore_queue_committing_max_ops both set to ludicrous numbers everything stalls at ~15k iops
[6:45] <sjustlaptop> how short a burst are you doing?
[6:45] <phantomcircuit> that's only 3 seconds
[6:45] <sjustlaptop> what did you set those to?
[6:46] <phantomcircuit> i have rados bench trying to run for 60 and it makes it about 3 seconds before writes stop completely
[6:46] <phantomcircuit> a billion each
[6:46] <phantomcircuit> like i said ludicrous values
[6:46] <sjustlaptop> you are hitting the messenger limits, I think
[6:46] <sjustlaptop> how many osds?
[6:46] <phantomcircuit> just 2 for now
[6:46] <phantomcircuit> this is a proof of sanity
[6:46] <sjustlaptop> ok, so that's 45000 outstanding ops at that point
[6:47] <phantomcircuit> yeah roughly
[6:47] <sjustlaptop> the spinning disks are probably only good for 200 iops (being massively optimistic)
[6:47] <sjustlaptop> one sec
[6:49] <phantomcircuit> yeah im partially hoping for massive queue reordering which i got with bcache but in that case it was bcache itself doing the reordering and not cfq
[6:49] <sjustlaptop> you also need to increase the corresponding _bytes limits as well
[6:49] <phantomcircuit> i do have about 70GB more space on the ssd i could use with bcache but then im limiting throughput on large io
[6:49] <sjustlaptop> also, the ms_dispatch_throttle
[6:49] <phantomcircuit> yeah those are all set to 10 GB
[6:50] <phantomcircuit> i didn't touch ms_dispatch_throttle
[6:50] <sjustlaptop> there is a client side throttle as well
[6:50] <sjustlaptop> objecter_inflight_ops
[6:50] <sjustlaptop> actually, that one you should leave alone
[6:51] <phantomcircuit> rados bench only has 16 io in flight at a time anyways
[6:51] <sjustlaptop> ms_dispatch_throttle defaults to 100MB and 4k*45000 is around that
[6:51] <phantomcircuit> or rather -t which defaults to 16
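
For context, the benchmark under discussion is along these lines; -b is the write size per op and -t the number of ops in flight (the pool name is illustrative):

    # 60-second write benchmark with 4 KB ops and the default 16 in flight
    rados -p testpool bench 60 write -b 4096 -t 16
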
[6:51] <sjustlaptop> phantomcircuit: yeah, that's what I realized when I retracted my statement
[6:53] <sjustlaptop> the OSD won't release the ms_dispatch_throttle until the op clears the filesystem
[6:55] <phantomcircuit> well then that 100% explains it i bet
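
Pulling the knobs from this exchange together, the kind of tuning being tried might be sketched as the ceph.conf fragment below; the values mirror the deliberately ludicrous ones in the test and are illustrative, not recommendations:

    cat >> /etc/ceph/ceph.conf <<'EOF'
    [osd]
        # filestore-side queue limits discussed above
        filestore queue max ops = 1000000
        filestore queue max bytes = 10737418240
        filestore queue committing max ops = 1000000
        filestore max sync interval = 30
        # journal-side limits
        journal max write bytes = 10737418240
        journal queue max ops = 1000000
        journal queue max bytes = 10737418240
        # messenger throttle that turned out to be the bottleneck (default ~100 MB),
        # raised here for the bursty-write experiment
        ms dispatch throttle bytes = 1073741824
    EOF
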
[6:56] * winston-d (~zhiteng@pgdmzpr01-ext.png.intel.com) has joined #ceph
[6:58] <sjustlaptop> one day, we should try to introduce a grand unified throttle
[6:58] <sjustlaptop> a throttle of everything
[7:00] <phantomcircuit> heh
[7:00] <phantomcircuit> that would probably be much easier to work with than this setup
[7:00] <sjustlaptop> indeed
[7:01] <phantomcircuit> rados bench write just writes out new objects of whatever size you specify to the pool right
[7:01] <sjustlaptop> yes
[7:02] <phantomcircuit> in that case this blktrace of btrfs doing this looks ridiculous
[7:02] <phantomcircuit> http://i.imgur.com/8znz7.png
[7:03] <phantomcircuit> :/
[7:03] <sjustlaptop> that's disappointing
[7:03] <sjustlaptop> have you tried increasing the sync interval?
[7:03] <phantomcircuit> it's set to 30 seconds
[7:03] <sjustlaptop> hmm
[7:04] <phantomcircuit> oh you know what i turned on the flusher actually
[7:04] <phantomcircuit> let me turn that off
[7:04] <sjustlaptop> yeah
[7:04] <phantomcircuit> it was helping because of the ms dispatch throttle
[7:04] <phantomcircuit> as a side effect since it forced those to clear faster than they would have otherwise
[7:05] <sjustlaptop> ah
[7:05] <phantomcircuit> it didn't make sense to me at the time but now all makes sense
[7:05] <phantomcircuit> performance tuning is a bizarre thing heh
[7:05] <sjustlaptop> yes
[7:05] <sjustlaptop> yes it is
[7:07] <phantomcircuit> HA
[7:07] <phantomcircuit> so it was ms throttle dispatch the entire time
[7:07] <sjustlaptop> so what are you seeing now?
[7:08] <phantomcircuit> 23 seconds of 9.26 MB/s with 4096 byte blocks
[7:08] <sjustlaptop> cool
[7:08] <phantomcircuit> 2370 write iops
[7:09] <phantomcircuit> still hitting a limit somewhere but that'll do
[7:09] <sjustlaptop> you have replication 2, right?
[7:09] <phantomcircuit> yeah
[7:09] <phantomcircuit> identical osds and i've tested the disks extensively to verify performance is identical (within 5%)
[7:09] <sjustlaptop> that's actually 9.26MB/s on each filestore actually
[7:10] <sjustlaptop> the limitation is probably osd op processing
[7:10] <phantomcircuit> hmm that was a nasty surprise
[7:10] <sjustlaptop> try bumping osd_op_threads to 10
[7:10] <phantomcircuit> it's at 8 already
[7:11] <sjustlaptop> try 20
[7:11] <phantomcircuit> probably the limit it -t
[7:11] <dec> phantomcircuit: you haven't fixed your 0.56.1 performance issue, have you? :)
[7:11] <sjustlaptop> also probably
[7:11] <sjustlaptop> *probable
[7:11] <sjustlaptop> you are at the default 16?
[7:11] <phantomcircuit> yeah i'll bump that up
[7:11] <sjustlaptop> yeah, try 256
[7:14] <phantomcircuit> the number rados bench reports is the aggregate for all 256 ops in flight?
[7:17] <sjustlaptop> yes
[7:17] <sjustlaptop> worse?
[7:18] <phantomcircuit> much
[7:18] <sjustlaptop> hmm, try two different rados bench clients with 16 each
[7:18] <sjustlaptop> we've seen that the rados bench tool itself has some issues with large numbers of concurrent ops
[7:18] <phantomcircuit> hmm
[7:18] <phantomcircuit> it's not the concurrency setting
[7:19] <phantomcircuit> there's some delayed performance issue
[7:19] <sjustlaptop> delayed performance issue?
[7:20] <phantomcircuit> filestore is busy
[7:20] <sjustlaptop> well yes, it would be :)
[7:20] <phantomcircuit> which is expected
[7:20] <phantomcircuit> but it was idle for about 10 seconds
[7:20] <phantomcircuit> which i didn't expect
[7:20] <sjustlaptop> idle?
[7:21] <phantomcircuit> iostat reports 0 activity
[7:21] <sjustlaptop> probably the 30 second sync interval
[7:23] <phantomcircuit> well on the bright side the blktrace looks much saner
[7:23] <phantomcircuit> http://i.imgur.com/OCoB4.png
[7:26] <phantomcircuit> hmm
[7:28] <phantomcircuit> 2013-01-18 07:27:38.289714 2aa22140700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x2aa1101d700' had timed out after 30
[7:28] <sjustlaptop> it's probably waiting to be able to push an op into the filestore
[7:28] <phantomcircuit> if they start losing heartbeats between each other i assume they block
[7:28] <sjustlaptop> in this case, it's not a case of losing heartbeats
[7:29] <sjustlaptop> each worker thread has to check in periodically with the heartbeat system
[7:29] <sjustlaptop> in this case, a thread in the OSD op thread pool has been awol for 30 seconds
[7:29] <sjustlaptop> probably because it's blocked waiting to submit io to the filestore
[7:32] <phantomcircuit> anyways i've got what i wanted
[7:32] <phantomcircuit> bursty random io writes
[7:33] <phantomcircuit> i could probably add bcache to rewrite io and improve more but at the cost of raw throughput
[7:33] <phantomcircuit> it might be worth it
[7:41] * jrisch (~Adium@4505ds2-hi.0.fullrate.dk) has joined #ceph
[7:43] * dmick is now known as dmick_away
[7:44] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:50] * wsmob_705215 (~wsmob_705@www.nowhere-else.org) has joined #ceph
[7:53] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[8:06] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Ping timeout: 480 seconds)
[8:07] * silversurfer (~silversur@122x212x156x18.ap122.ftth.ucom.ne.jp) has joined #ceph
[8:10] * jrisch (~Adium@4505ds2-hi.0.fullrate.dk) Quit (Ping timeout: 480 seconds)
[8:12] * KindOne (~KindOne@h87.44.28.71.dynamic.ip.windstream.net) Quit (Remote host closed the connection)
[8:13] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[8:16] * KindOne (KindOne@h87.44.28.71.dynamic.ip.windstream.net) has joined #ceph
[8:21] * wsmob_705215 (~wsmob_705@www.nowhere-else.org) Quit (Remote host closed the connection)
[8:32] * janisg (~troll@85.254.50.23) has joined #ceph
[8:32] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[8:38] <phantomcircuit> oopsie
[8:38] <phantomcircuit> 15 placement groups down
[8:38] <phantomcircuit> accidentally reinitialized an osd that was still backfilling :/
[8:40] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[8:51] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[8:54] * xdeller (~xdeller@broadband-77-37-224-84.nationalcablenetworks.ru) Quit (Quit: Leaving)
[8:59] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[9:10] <xiaoxi> Can I use the new mkcephfs (from v0.56.1) to initialize a cluster for 0.48.3?
[9:10] * loicd (~loic@178.20.50.225) has joined #ceph
[9:12] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Copywight 2007 Elmer Fudd. All wights wesewved.)
[9:22] * silversurfer (~silversur@122x212x156x18.ap122.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[9:25] * loicd (~loic@178.20.50.225) Quit (Quit: Leaving.)
[9:25] * loicd (~loic@178.20.50.225) has joined #ceph
[9:27] * ScOut3R (~ScOut3R@212.96.47.215) has joined #ceph
[9:28] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:35] * low (~low@188.165.111.2) has joined #ceph
[9:37] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[9:38] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:39] <absynth_47215> morning, anyone up?
[9:40] * tnt (~tnt@120.194-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:45] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[9:45] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[9:50] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:51] * leseb (~leseb@193.172.124.196) has joined #ceph
[9:53] * jrisch (~Adium@94.191.185.168.mobile.3.dk) has joined #ceph
[9:56] * winston-d (~zhiteng@pgdmzpr01-ext.png.intel.com) Quit (Quit: Leaving)
[9:57] <loicd> absynth_47215: good morning :-)
[9:58] <absynth_47215> hi
[10:05] <ScOut3R> morning
[10:08] * xiaoxi (~xiaoxiche@jfdmzpr03-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[10:12] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) has joined #ceph
[10:19] * LeaChim (~LeaChim@b0fadd12.bb.sky.com) has joined #ceph
[10:22] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[10:25] <absynth_47215> sigh
[10:25] * jrisch (~Adium@94.191.185.168.mobile.3.dk) Quit (Read error: Connection reset by peer)
[10:26] <absynth_47215> sometimes, i hate ceph
[10:28] <ScOut3R> absynth_47215: why? :)
[10:28] <absynth_47215> started reweighting. one osd crashes. cannot restart. repetitive crash while replaying journal
[10:29] <absynth_47215> awesome
[10:29] <absynth_47215> upside is: we wanted to get rid of that OSD anyway
[10:35] <topro> anyone here working on http://ceph.com/docs/master?
[10:35] * ScOut3R_ (~ScOut3R@212.96.47.215) has joined #ceph
[10:39] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) has joined #ceph
[10:42] * jtangwk (~Adium@2001:770:10:500:8d22:d635:d460:e02) Quit (Quit: Leaving.)
[10:43] * ScOut3R (~ScOut3R@212.96.47.215) Quit (Ping timeout: 480 seconds)
[10:46] <ScOut3R_> absynth_47215: which version were you running on that osd?
[10:57] * jtangwk (~Adium@2001:770:10:500:8d22:d635:d460:e02) has joined #ceph
[11:01] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[11:06] <topro> as the 5-minute quick-start-guide got updated with bobtail (cephx-auth) information, it would be helpful to update http://ceph.com/docs/master/start/quick-cephfs/ to include cephx auth info as well
[11:06] <topro> i.e. something like sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs -o name=admin,secretfile=/etc/ceph/admin.secret from http://ceph.com/docs/master/cephfs/kernel/
[11:07] <topro> as the 5-min quick start guide links directly to http://ceph.com/docs/master/start/quick-cephfs/
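
The secretfile in that mount line can be produced from the admin keyring; a small sketch, assuming cephx and the default client.admin key:

    # write the bare admin key where mount's secretfile option expects it
    ceph auth get-key client.admin | sudo tee /etc/ceph/admin.secret > /dev/null
    sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs -o name=admin,secretfile=/etc/ceph/admin.secret
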
[11:12] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[11:13] <absynth_47215> ScOut3R_: we have a custom argonaut version that includes the latest fixes from .48.3
[11:14] * dxd828 (~dxd828@195.191.107.205) Quit (Quit: Leaving)
[11:16] * jtangwk (~Adium@2001:770:10:500:8d22:d635:d460:e02) Quit (Quit: Leaving.)
[11:27] * xdeller (~xdeller@62.173.129.210) has joined #ceph
[11:29] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[11:31] * jtangwk (~Adium@2001:770:10:500:8d22:d635:d460:e02) has joined #ceph
[11:41] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[11:41] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[11:46] * verwilst (~verwilst@d528F423A.access.telenet.be) has joined #ceph
[11:48] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[11:59] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[12:05] <absynth_47215> nobody from inktank around, right?
[12:15] * leseb_ (~leseb@193.172.124.196) has joined #ceph
[12:15] * leseb (~leseb@193.172.124.196) Quit (Read error: Connection reset by peer)
[12:17] <jluis> absynth_47215, I'm here
[12:19] <jluis> nhm and elder might be around soon too
[12:24] <absynth_47215> hrrm, we have an issue with our cluster
[12:24] <absynth_47215> and i am afraid, its a serious one (or can become serious very soon)
[12:24] <absynth_47215> can you have a look at 157?
[12:27] * Oliver1 (~oliver1@p54839662.dip.t-dialin.net) has joined #ceph
[12:30] <jluis> I don't have access to zendesk, if that's what 157 is related to :\
[12:30] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:36] <absynth_47215> yeah
[12:39] <absynth_47215> do you have a way to reach someone who has access?
[12:40] <absynth_47215> -2> 2013-01-18 09:20:49.080735 7f5a62615780 0 filestore(/data/osd3-1) ENOTEMPTY suggests garbage data in osd data dir
[12:40] <absynth_47215> this is what kills our OSD processes
[12:40] <absynth_47215> http://tracker.newdream.net/issues/1949
[12:40] <absynth_47215> hmmm
[12:43] <jluis> absynth_47215, best bet would be elder or nhm right now; I'm not aware what the procedures are for customers, and wouldn't want to call someone in the middle of their night without being sure that's okay
[12:43] <jluis> sorry about that :\
[12:46] <absynth_47215> weren't you assigned as one of our firstline support engineers?
[12:46] * KindTwo (KindOne@h180.210.89.75.dynamic.ip.windstream.net) has joined #ceph
[12:46] * KindOne (KindOne@h87.44.28.71.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[12:46] <absynth_47215> wolfgang is east coast, so he is probably next to be awake... hopefully our cluster won't fuck up until then, pardon my french
[12:47] * KindTwo is now known as KindOne
[12:49] <absynth_47215> queried you
[12:49] <jluis> absynth_47215, I wasn't assigned to any support whatsoever
[12:51] <absynth_47215> ah, ok, then i misunderstood the initial documents
[12:52] <absynth_47215> a lot of shifting around happened during the weeks before and after x-mas
[12:52] * yoshi (~yoshi@p6124-ipngn1401marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:55] * tony (~tony@60-241-229-106.static.tpgi.com.au) has joined #ceph
[12:55] * stxShadow (~jens@p4FD0774B.dip.t-dialin.net) has joined #ceph
[12:55] <darkfaded> hehe, oncall/support shift planning is fun
[12:56] <absynth_47215> classical lose-lose situation, at the end everyone is frustrated
[12:57] <darkfaded> absynth_47215: idk, if you do it well they all got what they wanted, i.e. for xmas there's always someone hiding from family festivities who is happy to take the oncall
[12:57] <absynth_47215> i think that directly depends on your company size
[12:57] <darkfaded> okay, maybe
[12:58] <absynth_47215> with 5 people total... we have little wriggle room
[12:58] <darkfaded> yeah
[12:58] <darkfaded> well, i've done it for maybe 20 people, but you had to match pairs and all kinds of stuff
[12:58] <darkfaded> the idea was to get it all nicely clocked
[12:59] <darkfaded> and trading shifts without telling me also was really a nogo
[12:59] <darkfaded> i think it took 4 weeks to syncronise everything after unexpected stuff
[13:00] <darkfaded> one trick that might help you - if we had planned maintenance or such stuff where everyone worked 40hrs during the weekend, i made them take a day off in the week before or after it
[13:00] <darkfaded> they didnt want but it helped a lot
[13:01] <darkfaded> and also good is to sometimes give someone 2 weeks in a row because then it feels much less often that it's your turn
[13:01] <absynth_47215> the upside of such a small team is that we don't really look at our overtime a lot
[13:01] <liiwi> I once had to take a week off and have another in single days over longer time
[13:01] <darkfaded> absynth_47215: thats not an upside ;p
[13:02] <absynth_47215> of course, oncall is rewarded in overtime, but nobody's zealous in taking it or making people take it
[13:02] <absynth_47215> for me, as the boss, it is :)
[13:02] <darkfaded> absynth_47215: hehe. but they're more exhausted
[13:02] <absynth_47215> yeah, but nobody looks at them weird if they come later or do home office after a windows
[13:02] <darkfaded> thats why i did the thing with forcing them to stay home for a day
[13:02] <absynth_47215> at least i hope so
[13:03] <darkfaded> it doesn't show much if one guy goes from 500hrs+ to 492hrs+ but it helps them recover
[13:03] <darkfaded> liiwi: same reason i'd think
[13:04] <darkfaded> and yeah, coming in when you feel you'll be READY is a great bonus for a job
[13:04] <absynth_47215> hope we never get to someone with a 500 hour overtime balance...
[13:04] <darkfaded> probably not, you'd need to hire bad managers or bad designers for that to happen :>
[13:05] <liiwi> we have one day buffer, anything over that goes to manager
[13:07] * tony (~tony@60-241-229-106.static.tpgi.com.au) Quit (Quit: Colloquy for iPad - http://colloquy.mobi)
[13:07] * drafter (~drafter@62.173.129.210) has joined #ceph
[13:17] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Quit: Konversation terminated!)
[13:25] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[13:37] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Remote host closed the connection)
[13:38] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[13:53] * mikedawson_ (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:12] <sstan> gmornign
[14:16] <nhm> absynth_47215: I got a hold of wolfgang and kevin.
[14:17] <absynth_47215> awesome
[14:17] <absynth_47215> we kept ticket 157 updated
[14:17] <absynth_47215> we just need some advice what do do now - we are in a somewhat stable state at the moment
[14:18] <nhm> absynth_47215: I think my advice is to wait for advice from someone who knows what they are doing. ;)
[14:30] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[14:31] <absynth_47215> nhm: i.e. "sage"? ;)
[14:36] <nhm> absynth_47215: Luckily I'm not the one who decides those things. ;)
[14:43] * cmello (~cmello@201-23-160-68.gprs.claro.net.br) has joined #ceph
[14:44] <cmello> hi there!
[14:44] * stxShadow (~jens@p4FD0774B.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[14:44] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:52] * cmello (~cmello@201-23-160-68.gprs.claro.net.br) Quit (Ping timeout: 480 seconds)
[14:52] * stxShadow (~jens@jump.filoo.de) has joined #ceph
[14:52] * cmello (~cmello@201-23-160-68.gprs.claro.net.br) has joined #ceph
[14:58] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:58] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[15:00] * cmello (~cmello@201-23-160-68.gprs.claro.net.br) Quit (Ping timeout: 480 seconds)
[15:28] * BillK (~billk@58-7-235-12.dyn.iinet.net.au) Quit (Quit: Ex-Chat)
[15:31] * sleinen1 (~Adium@2001:620:0:46:c8d1:5f82:21f6:d1da) Quit (Quit: Leaving.)
[15:31] <dosaboy> can anyone tell me why the standard pool attrs listed in http://ceph.com/docs/master/rados/operations/pools/ cannot all be queried with 'ceph osd pool get data <key>'
[15:31] <dosaboy> e.g.
[15:31] <dosaboy> ceph osd pool get data size
[15:31] <dosaboy> "don't know how to get pool field size"
[15:31] <dosaboy> but pgp_num works
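
As an aside, the replica count dosaboy is after can still be read from the osd dump in this release; a rough example:

    # per-pool lines include "rep size", the replication factor
    ceph osd dump | grep ^pool
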
[15:32] * aliguori (~anthony@cpe-70-112-157-151.austin.res.rr.com) has joined #ceph
[15:39] * nhm manually fixes the cryptopp sources to actually compile with gcc 4.7
[15:44] * drokita (~drokita@199.255.228.128) has joined #ceph
[15:46] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[15:57] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[15:57] * fmarchand (~fmarchand@212.51.173.12) has joined #ceph
[15:58] * fmarchand_ (~fmarchand@212.51.173.12) has joined #ceph
[15:58] <fmarchand_> Hi everybody !
[16:00] <fmarchand_> I just have a newbie question ... what filter do I have to use if I want to check the value of a field? for example, only documents that have a field greater than a given value ...
[16:00] <fmarchand_> my field is a core type long
[16:01] <fmarchand_> range filter ?
[16:02] <fmarchand> Oups I'm really sorry ... not the right channel !
[16:05] <loicd> :-D
[16:06] <fmarchand_> But I love teuthology too :)
[16:06] <loicd> fmarchand and you should :-)
[16:07] <loicd> is there a way to check, from the main thread, if a thread is asleep, waiting for a mutex ?
[16:07] <fmarchand> And it rocks on my mini-cluster !
[16:07] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[16:16] <loicd> what I'm after is a way to wait for a thread to be asleep, because of a mutex
[16:16] <loicd> and then trigger code that is supposed to wake it
[16:20] <absynth_47215> there is another bunch of 0days in java
[16:20] <absynth_47215> awesome...
[16:21] * BManojlovic (~steki@91.195.39.5) Quit (Remote host closed the connection)
[16:22] <nhm> absynth_47215: saw that
[16:22] <absynth_47215> probably more where these came from...
[16:22] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) Quit (Read error: Operation timed out)
[16:23] * xiaoxi (~xiaoxiche@jfdmzpr04-ext.jf.intel.com) has joined #ceph
[16:24] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[16:25] * mattbenjamin (~matt@75.45.226.110) has joined #ceph
[16:25] <drafter> hello, could you please help with a bit silly question ) the question is
[16:26] <drafter> Is it allowed to reuse a single instance of RadosClient for sequential connection tries (imagine that we are trying to connect to a shut-off cluster and reconnecting every 5 secs) or is it required to create a new instance for every connection try?
[16:27] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[16:36] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) has joined #ceph
[16:36] * verwilst (~verwilst@d528F423A.access.telenet.be) Quit (Quit: Ex-Chat)
[16:37] * mattbenjamin (~matt@75.45.226.110) Quit (Quit: Leaving.)
[16:38] <sstan> if ceph is fault-tolerant, does it mean that if, while a client writes to some RBD, the OSDs involved with the client die, the client's write won't be interrupted?
[16:40] * drafter (~drafter@62.173.129.210) has left #ceph
[16:51] <xiaoxi> sstan: the client's write will certainly get an error (EIO or something), but if you retry after "some" while, it could work and your previous data will not be lost
[16:53] <absynth_47215> sstan: also, there's an rbd cache that might buffer the write
[16:53] <sstan> so one cannot rely on RBD it the stability of the client's system relies on it?
[16:54] <sstan> like if you mount some RBD device to / (root) and chroot into it?
[16:54] <absynth_47215> well, let me rephrase
[16:54] <absynth_47215> your original scenario is covered by any ceph installation that has a replica count of >1
[16:54] <absynth_47215> since, if the primary osd is down during the write (or dies during it), there is always a replica to write to
[16:55] <absynth_47215> after the primary osd is back up, writes get backfilled from the replica to the master
[16:55] <absynth_47215> so, in this case, rbd is reliable
[16:55] <absynth_47215> if, for some freak reason, _both_ (with replcnt=2) osds die during a write, the write is - at least we seem to see this often - suspended until at least one is back up
[16:56] <sstan> but does rbd keep in memory or has an active connection with a replica while it interacts with some OSD ?
[16:56] <absynth_47215> i am not sure i understand the question
[16:57] <absynth_47215> writes to rbd are "atomic", so if you write to the device, you can be sure the data gets eventually written to both OSDs
[16:57] <sstan> hmm does the secondary OSD know that client x is writing something on the primary
[16:57] <jmlowe1> I believe writes are transactional so they don't succeed until they actually succeed
[16:57] <liiwi> /win 12
[16:57] <liiwi> erp
[16:58] <absynth_47215> jmlowe1: yes, or that.
[16:58] <sstan> It's not very clear sorry XD
[16:58] <jmlowe1> retries abound
[16:59] <sstan> what I'm trying to figure out actually is, like I said, one relies on a RBD block that is mounted to /
[17:00] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[17:02] <jmlowe1> sstan: one of the devs may correct me if I am wrong but everything is transactional and versioned so if for any reason a write is interrupted the transaction is rolled back and tried again hopefully by this time the mon has noticed and shuffled things around so one of the replicas is now the primary and the client attempts with the new primary changing the version of the pg so that when the old primary comes back online it knows that it ne
[17:02] <jmlowe1> sstan: I do it all the time with vm's
[17:03] <sstan> It works, but I don't know if it's because of the hypervisor or not
[17:03] <Vjarjadian> which hypervisors you use?
[17:03] <jmlowe1> I don't see a reason having a root filesystem wouldn't work other than things like grub not playing nicely but the kernel itself should be ok afaik
[17:04] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[17:04] <sstan> I plan on using Xen, but I'm trying to boot a physical machine from a RADOS backed filesystem
[17:04] <jmlowe1> not really any different than iscsi or aoe
[17:05] <sstan> well ... iscsi can have multipath in case of failure. I'm trying to figure out how RBD works in case of failure
[17:05] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[17:06] <sstan> because iSCSI is stateful ... and I'm not sure if the state of RBD related operations is shared between the primary and secondary OSDs
[17:06] <jmlowe1> client always talks to primary which is decided by the mon and may be updated at any time, it will retry if any writes are interrupted
[17:08] <sstan> I'll try to compare iSCSI boot VS RBD boot, I'll tell you if it works
[17:10] <jmlowe1> I would think getting the kernel loaded would be a problem unless you have replaced the bios or use a boot drive, I don't think grub knows about rbd
[17:10] <sstan> hmm good point. That could be solved with some pxe boot that prepares the booting code
[17:11] <jmlowe1> or pxe, didn't think about that one
[17:12] * mattbenjamin (~matt@aa2.linuxbox.com) has joined #ceph
[17:14] <sstan> what I wrote isn't clear, so in summary my issue is to know how RBD compares to iscsi Multipath in a situation of OSD/iscsi-target failure
[17:15] * drokita (~drokita@199.255.228.128) Quit (Quit: Leaving.)
[17:15] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[17:16] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[17:16] <sstan> CEPH seems to have the foundation to make it as reliable
[17:16] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:20] * zzyb18 (4317e8b2@ircip3.mibbit.com) has joined #ceph
[17:21] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Quit: Leaving.)
[17:27] * zzyb18 (4317e8b2@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[17:28] * ScOut3R_ (~ScOut3R@212.96.47.215) Quit (Ping timeout: 480 seconds)
[17:29] * fmarchand (~fmarchand@212.51.173.12) Quit (Quit: Leaving)
[17:29] * fmarchand_ (~fmarchand@212.51.173.12) Quit (Quit: Leaving)
[17:35] * low (~low@188.165.111.2) Quit (Quit: Leaving)
[17:41] <nhm> hrm, I wonder if our check for sync_file_range is working properly.
[17:45] * ircolle (~ircolle@65.114.195.189) has joined #ceph
[17:52] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[17:54] * stxShadow (~jens@jump.filoo.de) Quit (Remote host closed the connection)
[17:56] * Oliver3 (~oliver1@p5483A6DC.dip.t-dialin.net) has joined #ceph
[17:56] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[17:57] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:01] * Oliver1 (~oliver1@p54839662.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[18:10] * xiaoxi (~xiaoxiche@jfdmzpr04-ext.jf.intel.com) Quit (Remote host closed the connection)
[18:13] * tnt (~tnt@120.194-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:18] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) has joined #ceph
[18:20] * jlogan1 (~Thunderbi@2600:c00:3010:1:49d6:5ead:ab1a:61ba) has joined #ceph
[18:21] * leseb_ (~leseb@193.172.124.196) Quit (Remote host closed the connection)
[18:27] * gaveen (~gaveen@112.135.144.107) has joined #ceph
[18:28] * alram (~alram@38.122.20.226) has joined #ceph
[18:30] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Read error: Operation timed out)
[18:36] * xdeller (~xdeller@62.173.129.210) Quit (Quit: Leaving)
[18:39] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[18:40] <sstan> * I meant pivot_root
[18:41] * loicd (~loic@178.20.50.225) Quit (Ping timeout: 480 seconds)
[18:43] * benpol (~benp@garage.reed.edu) has joined #ceph
[18:46] * sleinen1 (~Adium@2001:620:0:25:e457:56e4:4c92:97fa) has joined #ceph
[18:46] <benpol> So I filed this bug the other day: http://tracker.newdream.net/issues/3806
[18:47] <benpol> I have a small test cluster with a few PGs stuck in active+degraded (and now active+remapped) state. Any idea how I can fix those PGs?
[18:49] <benpol> ...preferably something short of reinitializing the cluster of course.
[18:50] * benner (~benner@193.200.124.63) Quit (Read error: Connection reset by peer)
[18:50] * benner (~benner@193.200.124.63) has joined #ceph
[18:52] * drokita (~drokita@199.255.228.128) has joined #ceph
[18:55] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:59] * chutzpah (~chutz@199.21.234.7) has joined #ceph
[19:02] <phantomcircuit> benpol, restart the osds one at a time
[19:03] * sjustlaptop (~sam@2607:f298:a:607:48e2:445c:4ddd:e3ef) has joined #ceph
[19:04] * Oliver3 (~oliver1@p5483A6DC.dip.t-dialin.net) Quit (Quit: Leaving.)
[19:04] <benpol> phantomcircuit: all of the OSDs and OSD servers have been restarted multiple times, though I haven't tried restarting each individual OSD in sequence.
[19:05] * gaveen (~gaveen@112.135.144.107) Quit (Ping timeout: 480 seconds)
[19:05] <benpol> but I'll give it a try, are you just suggesting a restart or a "stop, wait until cluster recovers, start"?
[19:06] <phantomcircuit> i was just suggesting a restart
[19:06] <phantomcircuit> reading your ticket i doubt it'll help
[19:07] <phantomcircuit> i assume the pgmap version is steadily incrementing
[19:07] <benpol> yes, as we speak
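(A minimal sketch of this kind of poking at stuck PGs, assuming a sysvinit-style install; the PG id and OSD id are placeholders:)
    ceph pg dump_stuck unclean      # list PGs stuck in a non-clean state
    ceph pg 0.2f query              # detailed state and recovery history of one stuck PG
    /etc/init.d/ceph restart osd.3  # restart a single OSD rather than the whole host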
[19:07] * gregaf (~Adium@2607:f298:a:607:3d65:f727:d43b:b7dd) Quit (Quit: Leaving.)
[19:07] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[19:10] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[19:13] * gregaf (~Adium@2607:f298:a:607:3da4:c882:d437:2dc6) has joined #ceph
[19:14] * gaveen (~gaveen@112.135.131.89) has joined #ceph
[19:20] * sjustlaptop (~sam@2607:f298:a:607:48e2:445c:4ddd:e3ef) Quit (Quit: Leaving.)
[19:21] * sjustlaptop (~sam@2607:f298:a:607:61e5:88f7:c4c:e0af) has joined #ceph
[19:22] * madkiss (~madkiss@089144192022.atnat0001.highway.a1.net) has joined #ceph
[19:23] <rlr219> benpol: what version of ceph
[19:24] * Oliver1 (~oliver1@p5483A6DC.dip.t-dialin.net) has joined #ceph
[19:24] <benpol> rlr219: 0.56.1-1~bpo60+1 (Debian Squeeze package)
[19:25] <rlr219> look at this bug. I ran into something similar and mikedawson pointed me here: http://tracker.newdream.net/issues/3720
[19:26] <rlr219> basically you have to enable tunables and let the cluster recover. However, what I found out yesterday was that you really need to be on kernel 3.6+. http://ceph.com/docs/master/rados/operations/crush-map/#tuning-crush
[19:27] <rlr219> the info on the kernels says 3.5+ but it should be 3.6+
[19:27] <benpol> That does sound awfully similar, but I already have tunables set and am running kernel version 3.7.1
[19:28] <rlr219> hhmmm. not real sure then
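(The tunables change described in the linked doc is done by editing the CRUSH map by hand; a sketch, with file paths as placeholders:)
    ceph osd getcrushmap -o /tmp/crush.bin
    crushtool -d /tmp/crush.bin -o /tmp/crush.txt
    # add the tunable lines from the doc at the top of /tmp/crush.txt, e.g.
    #   tunable choose_local_tries 0
    #   tunable choose_local_fallback_tries 0
    #   tunable choose_total_tries 50
    crushtool -c /tmp/crush.txt -o /tmp/crush.new
    ceph osd setcrushmap -i /tmp/crush.new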
[19:29] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[19:35] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:36] * sjustlaptop (~sam@2607:f298:a:607:61e5:88f7:c4c:e0af) Quit (Read error: Operation timed out)
[19:36] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[19:36] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:37] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[19:38] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[19:38] * nwat (~Adium@soenat3.cse.ucsc.edu) has left #ceph
[19:38] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[19:44] * Ryan_Lane (~Adium@216.38.130.166) has joined #ceph
[19:47] * madkiss (~madkiss@089144192022.atnat0001.highway.a1.net) Quit (Ping timeout: 480 seconds)
[19:51] <sagewk> rlr219: I'll fix the docs, thanks for the heads up.
[19:53] * ircolle (~ircolle@65.114.195.189) Quit (Quit: Leaving.)
[19:58] * yehudasa_ (~yehudasa@38.122.20.226) has joined #ceph
[19:59] * bstaz (~bstaz@ext-itdev.tech-corps.com) has joined #ceph
[20:03] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[20:06] * leseb_ (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:06] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[20:06] * Oliver1 (~oliver1@p5483A6DC.dip.t-dialin.net) has left #ceph
[20:13] * yehudasa_ (~yehudasa@38.122.20.226) Quit (Ping timeout: 480 seconds)
[20:28] <sjust> jmlowe1: you there?
[20:28] <jmlowe1> yep
[20:28] <jmlowe1> for a couple of minutes, I have a 3:00 EST
[20:28] <alexxy> hi all
[20:29] <alexxy> how can i force resync?
[20:29] <sjust> can you dump the attributes on d0c18e1d/605.00000000/head//1 in pg 1.21d on osds 0 and 6?
[20:29] <alexxy> health HEALTH_WARN 36 pgs backfill_toofull; 36 pgs stuck unclean; recovery 42672/3406940 degraded (1.253%); 2 near full osd(s)
[20:29] <alexxy> monmap e1: 1 mons at {alpha=10.0.0.254:6789/0}, election epoch 1, quorum 0 alpha
[20:29] <alexxy> osdmap e5798: 18 osds: 18 up, 12 in
[20:29] <alexxy> pgmap v181348: 3648 pgs: 3612 active+clean, 36 active+remapped+backfill_toofull; 4786 GB data, 9675 GB used, 9384 GB / 19094 GB avail; 42672/3406940 degraded (1.253%)
[20:29] <alexxy> mdsmap e34: 1/1/1 up {0=alpha=up:active}
[20:29] <sjust> oh, that's the one that we already know doesn't exist
[20:29] <sjust> nvm
[20:30] <sjust> alexxy: your disks are too full
[20:30] <alexxy> sjust: I'm in the process of migrating from raid10 to raid5
[20:31] <sjust> ok, but that's what backfill_toofull means, it doesn't want to copy data over since it will just fill the disks
[20:31] <alexxy> http://bpaste.net/show/71225/
[20:31] <alexxy> its current status
[20:31] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[20:31] <alexxy> actualy i want to remove osd.5-10
[20:31] <alexxy> and recreate raids on them
[20:32] <alexxy> as i did before on osd.0-4
[20:32] <jmlowe1> sjust: I've got about 5 more minutes, are you looking for another object for me to check?
[20:32] <alexxy> sjust: so can i force resync?
[20:32] <sjust> it would be a really bad idea
[20:33] <alexxy> ceph -s http://bpaste.net/show/71226/
[20:33] * jlogan (~Thunderbi@72.5.59.176) has joined #ceph
[20:33] <alexxy> sjust: why?
[20:33] <alexxy> i have the number of copies set to 2
[20:33] <alexxy> for both data and metadata
[20:33] <sjust> because ceph acts really badly when the disks are full, this is why it's preventing you from copying more data to those disks
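(To check how full the OSDs actually are; the path assumes the default data dir layout, and the pg dump sub-command should show per-OSD kb_used/kb_avail as the cluster sees it:)
    df -h /var/lib/ceph/osd/*       # per-OSD data partition usage on this host
    ceph pg dump osds               # per-OSD usage from the cluster's point of view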
[20:34] <sjust> what do you mean by force resync?
[20:34] <alexxy> i mean i marked osd.5-10 as out
[20:34] <alexxy> so i wanna recreate raid on them
[20:35] <alexxy> http://bpaste.net/show/71227/
[20:35] <alexxy> ceph osd tree
[20:36] <sjust> ok, but 10-17 are nearly full
[20:36] <jmlowe1> biab
[20:36] <sjust> k
[20:36] <alexxy> i will replace them in next round
[20:37] <alexxy> will it be safe to reformat osd.5-10?
[20:37] <alexxy> or should i only reformat one of them first?
[20:39] * jlogan1 (~Thunderbi@2600:c00:3010:1:49d6:5ead:ab1a:61ba) Quit (Ping timeout: 480 seconds)
[20:42] <sjust> you could try increasing the weight on osd0-4 somewhat
[20:42] <sjust> since they are gibber
[20:42] <sjust> *bigger
[20:42] <sjust> that should buy you enough room
[20:42] * jmlowe1 (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[20:43] <sjust> you should wait until all pgs are active+clean with 5-10 marked out before reformatting them
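(A sketch of that, with the weight values as placeholders to adjust for the actual disk sizes:)
    ceph osd crush reweight osd.0 1.5   # repeat for osd.1-4
    ceph health                         # keep checking (or watch 'ceph -w') until all PGs are active+clean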
[20:44] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[20:45] <phantomcircuit> hmm that's slightly annoying
[20:45] <phantomcircuit> libvirt requires you to tell it about all the monitors and you can't change them without restarting the guest
[20:45] <xmltok> im beginning to fire off a bunch of data into my rados gw and i am seeing heavy writes to only a select few drives, is that expected? I have 39 spindles and 1300 pgs
[20:45] <alexxy> sjust: can i adjust nearfull ratio?
[20:45] <alexxy> sjust: can i adjust it online?
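(That question doesn't get answered here; the ratios can usually be loosened at runtime along these lines, but whether these exact commands and options exist depends on the release, so treat this as an assumption to verify against the docs for your version:)
    ceph pg set_nearfull_ratio 0.90
    ceph tell 'osd.*' injectargs '--osd-backfill-full-ratio 0.90'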
[20:46] <phantomcircuit> xmltok, journals?
[20:46] <xmltok> i dont have specific drives set up for journals, afaik
[20:47] <xmltok> two of the disks have iostat ms writes up at 1-3s, everything else is either 0 or ~200ms
[20:47] <phantomcircuit> have you considered those drives could be bad?
[20:47] <xmltok> they could be
[20:47] <phantomcircuit> i had a similar problem and it turned out the drive was just bad
[20:48] <xmltok> good practice for removing an osd.
[20:49] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[20:50] * gaveen (~gaveen@112.135.131.89) Quit (Ping timeout: 480 seconds)
[20:51] <phantomcircuit> anybody think i'll have problems putting monitors on atoms?
[20:51] <phantomcircuit> 2GB ram plenty of disk space
[20:52] <xmltok> so those two drives were also receiving the majority of write requests, several times more than other drives
[20:52] <xmltok> 400/s compared to 50/s
[20:53] <phantomcircuit> xmltok, check the crush map make sure the weights are all even
[20:54] <xmltok> everything has 1.0, nodes are 10 except for the node which has 9 drives - and it's 9
[20:56] <xmltok> after removing those OSDs the load shifted to two other drives, so it's something in my mapping
[20:57] <xmltok> actually my pools may not have enough pgs, i created .rgw with 1300 but im not sure what pool radosgw stores the data in
[20:58] <phantomcircuit> xmltok, you can check with ceph pg dump and see which pool has data in it
[20:59] <phantomcircuit> you can map the pool numbers to names with ceph osd dump
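(For example:)
    rados df                        # per-pool object counts and usage, listed by pool name
    ceph osd dump | grep ^pool      # pool id -> name mapping, plus each pool's pg_num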
[20:59] * gaveen (~gaveen@112.135.133.129) has joined #ceph
[21:00] <xmltok> it's .rgw.buckets, which had 8 pgs; im rebuilding
[21:02] <phantomcircuit> yeah that would do it
[21:02] * jmlowe (~Adium@149.160.195.101) has joined #ceph
[21:02] <xmltok> well i messed up radosgw by doing that, it wont let me reuse my old bucket, hmm
[21:04] * ircolle (~ircolle@65.114.195.189) has joined #ceph
[21:08] * miroslav1 (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[21:10] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Read error: Operation timed out)
[21:17] <phantomcircuit> xmltok, what command did you use?
[21:20] <phantomcircuit> xmltok, im pretty sure you can't live-change the number of placement groups
[21:20] <phantomcircuit> you'd have to stop using the pool and then make a new pool with the number of placement groups you want and then rados cppool
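(A sketch of that sequence for the .rgw.buckets case being discussed; stop radosgw first, and note the pool-delete confirmation syntax may differ between releases:)
    ceph osd pool create .rgw.buckets.new 1300
    rados cppool .rgw.buckets .rgw.buckets.new
    ceph osd pool delete .rgw.buckets .rgw.buckets --yes-i-really-really-mean-it
    ceph osd pool rename .rgw.buckets.new .rgw.buckets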
[21:24] * ircolle (~ircolle@65.114.195.189) Quit (Quit: Leaving.)
[21:28] * ircolle (~ircolle@65.114.195.189) has joined #ceph
[21:36] * jmlowe1 (~Adium@149.160.195.101) has joined #ceph
[21:36] * jmlowe (~Adium@149.160.195.101) Quit (Read error: Connection reset by peer)
[21:42] <bstaz> vmware-toolbox-cmd timesync status
[21:42] <bstaz> doh, ignore that
[21:42] <bstaz> too many screen windows ;)
[21:46] <dmick_away> nhm: "(08:41:17 AM) nhm: hrm, I wonder if our check for sync_file_range is working properly."
[21:46] <dmick_away> wanna discuss?
[21:47] * drokita (~drokita@199.255.228.128) Quit (Ping timeout: 480 seconds)
[21:47] * dmick_away is now known as dmick
[21:48] <dmick> an earlier question here about "why can't you 'ceph osd pool get <pool> size' when you can set it"
[21:49] <dmick> resulted in http://tracker.newdream.net/issues/3869, now fixed in master
[21:49] <dmick> (in case anyone else was following along)
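(Once that fix is in, both directions work; the pool name below is a placeholder:)
    ceph osd pool set rbd size 3
    ceph osd pool get rbd size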
[21:58] * The_Bishop_ (~bishop@e179017075.adsl.alicedsl.de) has joined #ceph
[22:00] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:02] * jmlowe1 (~Adium@149.160.195.101) Quit (Quit: Leaving.)
[22:03] * alram (~alram@38.122.20.226) Quit (Read error: Connection reset by peer)
[22:05] * The_Bishop (~bishop@e179005167.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[22:05] * alram (~alram@38.122.20.226) has joined #ceph
[22:05] * miroslav1 (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[22:05] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[22:14] <phantomcircuit> i want to add two monitors
[22:15] <phantomcircuit> the documentation says you cant have an even number of monitors, is that cant or shouldn't? since obviously for a little while i'll have 2 and then 3
[22:15] <gregaf> shouldn't; due to needing a strict majority of monitors to make progress
[22:16] <gregaf> increasing numbers should be fine, although I do recall some people have had issues that jluis will remember more about than I do
[22:16] <phantomcircuit> ok so if i have 2 and nothing goes wrong everything is fine
[22:16] <phantomcircuit> if i have 2 and 1 dies the cluster stops
[22:17] <gregaf> basically
[22:17] <xmltok> yeah, i deleted the pool and created the pool, but radosgw didnt like that i did that
[22:17] <jluis> the only issue that comes to mind is when adding the second monitor and then being unable to bring it up due to misconfiguration
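(Roughly the add-a-monitor sequence from the docs, run from/on the new monitor host; the monitor name, address, and paths here are placeholders:)
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    ceph-mon -i beta --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add beta 10.0.0.2:6789
    ceph-mon -i beta --public-addr 10.0.0.2:6789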
[22:18] <xmltok> i can radosgw-admin list the bucket, but i can't rm it
[22:23] * The_Bishop_ (~bishop@e179017075.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[22:24] * The_Bishop (~bishop@e179017075.adsl.alicedsl.de) has joined #ceph
[22:24] <xmltok> ok, im back in business by killing the radosgw, cleaning out the pools, and creating them
[22:26] <phantomcircuit> oh debian
[22:26] <phantomcircuit> the init script on debian doesn't seem to like me having used a fqdn for the host =
[22:28] <phantomcircuit> ok only two more and i'll have my 3
[22:31] <xmltok> i've still got two hot OSDs after the increase in pg, and they are different disks so its not bad drives
[22:31] <phantomcircuit> you deleted the pool
[22:32] <phantomcircuit> the mappings are gone
[22:32] <xmltok> right, i created the pool again with 1300 pg, im seeing even distribution across them, but there is still two that are getting much more write requests than everything else
[22:32] <phantomcircuit> now that i have no idea
[22:33] * The_Bishop_ (~bishop@e179017075.adsl.alicedsl.de) has joined #ceph
[22:34] * jlogan2 (~Thunderbi@72.5.59.176) has joined #ceph
[22:34] * The_Bishop_ (~bishop@e179017075.adsl.alicedsl.de) Quit ()
[22:34] * The_Bishop (~bishop@e179017075.adsl.alicedsl.de) Quit (Remote host closed the connection)
[22:34] * The_Bishop (~bishop@e179017075.adsl.alicedsl.de) has joined #ceph
[22:34] <xmltok> i wonder if i bump my rep size to 3 if 3 drives will spike out
[22:36] <xmltok> yeah, a new drive has spun up for writes now
[22:37] * xdeller (~xdeller@broadband-77-37-224-84.nationalcablenetworks.ru) has joined #ceph
[22:39] * jlogan (~Thunderbi@72.5.59.176) Quit (Ping timeout: 480 seconds)
[22:48] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:49] <phantomcircuit> xmltok, are you writing to only one radosgw object?
[22:50] <dpippenger> can anyone suggest some ideas on how to get rid of an mds that doesn't seem to want to go away? I get notices that " 1/1/1 up {0=b=up:active(laggy or crashed)}" but I can't seem to remove b (which I think is by doing ceph mds rm 1). I get "mds gid 1 dne"
[22:50] <xmltok> im writing to one bucket with 100 subdirectories
[22:51] <xmltok> well, one million, /images/[0-9][0-9]/[0-9][0-9]/[0-9][0-9]
[22:51] <xmltok> i also had normal logging on, which may log to rados? im killing that now
[22:56] <phantomcircuit> dpippenger, is it actually laggy/crashed?
[22:56] <phantomcircuit> i see that sometimes and i dont actually use the mds at all
[22:57] <dpippenger> well it reports that state, but it's because I shut down the mds to get rid of it
[22:57] <dpippenger> I stopped using cephfs, so I'm pretty much eliminating all of my mds
[22:58] <gregaf> dpippenger: http://tracker.newdream.net/issues/2195 :(
[22:58] <dpippenger> what seemed to happen is that when I removed the rest, this process was left running by accident, so I think I did an mds remove while it was still up on that mds node
[22:58] <dpippenger> ahhh
[22:58] <loicd> I'm curious about the rationale behind https://github.com/ceph/ceph/blob/master/src/common/Throttle.h#L32 . In which case is it useful to wait on a value larger than the maximum ?
[22:58] <dpippenger> haha ok, thanks greg
[22:59] <dpippenger> ok, I'll put one back up to quiet it down, thanks for pointing me at that bug
[23:06] <joshd> loicd: 736d837e88eef74f625f9de3d6bf8f1685268073 suggests it avoided a deadlock at one point, but I'm not sure where or why (sagewk?)
[23:09] <sagewk> i can look in a minute, on a call
[23:10] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[23:14] <loicd> there is no emergency ;-) I'm not facing a problem, just writing unit tests for Throttle.cc to better understand how it works
[23:14] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[23:15] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:20] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[23:21] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) Quit (Ping timeout: 480 seconds)
[23:23] <nhm> dmick: wasn't a problem with that, just turns out we weren't properly wrapping a call to sync_file_range in the filestore with ifdefs.
[23:27] <xmltok> if i am planning to store ~300 million files, would i be better off fragmenting them across buckets in radosgw?
[23:31] * rturk-away is now known as rturk
[23:34] * rturk is now known as rturk-away
[23:34] * rturk-away is now known as rturk
[23:41] * gaveen (~gaveen@112.135.133.129) Quit (Remote host closed the connection)
[23:43] * ircolle (~ircolle@65.114.195.189) Quit (Quit: Leaving.)
[23:49] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[23:49] * glowell1 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[23:49] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:57] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: Leaving.)
[23:57] * sleinen1 (~Adium@2001:620:0:25:e457:56e4:4c92:97fa) Quit (Quit: Leaving.)
[23:57] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.