#ceph IRC Log


IRC Log for 2011-10-21

Timestamps are in GMT/BST.

[0:13] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[0:22] * mrjack (mrjack@office.smart-weblications.net) has joined #ceph
[0:22] * alex460 (~alex@ has joined #ceph
[0:28] * bencherian (~bencheria@aon.hq.newdream.net) Quit (Quit: bencherian)
[0:29] <sagewk> joshd: oh yeah
[0:32] * mgalkiewicz (~maciej.ga@staticline58722.toya.net.pl) has left #ceph
[0:37] <joshd> sagewk: looks good otherwise - much better with the comments
[0:50] <sagewk> joshd: cleaned it up a bit, but kept PG as optional arg.. otherwise we generate a string even for low debug levels, or lose the prefix entirely (which is very handy for grepping logs).
[0:56] <joshd> sagewk: why change it to non-const? and aren't we using NULL for pointers?
[0:56] <sagewk> oops yeah
[1:13] * jojy_ (~jojyvargh@ has joined #ceph
[1:13] * jojy (~jojyvargh@ Quit (Read error: Connection reset by peer)
[1:13] * jojy_ is now known as jojy
[1:25] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[1:27] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[2:06] * jojy_ (~jojyvargh@ has joined #ceph
[2:06] * jojy (~jojyvargh@ Quit (Read error: Connection reset by peer)
[2:06] * jojy_ is now known as jojy
[2:25] * RupS| (~rups@panoramix.m0z.net) has joined #ceph
[2:25] * RupS (~rups@panoramix.m0z.net) Quit (Read error: Connection reset by peer)
[2:32] <cp> What happens if I take a crush map and just change the negative number associated to a bucket?
[2:33] <joshd> no idea
[2:33] <joshd> sage would know, but he's gone
[2:33] <cp> OK. I'll ask back tomorrow
[2:41] * RupS (~rups@panoramix.m0z.net) has joined #ceph
[2:41] * RupS| (~rups@panoramix.m0z.net) Quit (Read error: Connection reset by peer)
[2:43] <sage> cp: the bucket id is one of the inputs to the hash function, so it'll effectively reshuffle everything between that bucket's items
[2:43] <cp> ok
[2:43] <cp> sage: thanks
[2:44] <sage> np
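Sage's answer above can be illustrated with a toy placement function. This is a hypothetical sketch, not CRUSH itself: because the bucket id is one of the hash inputs, renumbering the bucket remaps most of the items placed beneath it.

```python
import hashlib

def toy_place(bucket_id, obj, num_items):
    """Toy stand-in for a bucket choose step: hash(bucket id, object name).

    Not the real CRUSH algorithm -- just enough to show that the id is a
    hash input, so changing it reshuffles placements.
    """
    h = hashlib.md5(f"{bucket_id}:{obj}".encode()).hexdigest()
    return int(h, 16) % num_items

objs = [f"obj{i}" for i in range(100)]
before = [toy_place(-1, o, 4) for o in objs]   # bucket id -1, 4 items
after = [toy_place(-5, o, 4) for o in objs]    # same bucket renumbered to -5
moved = sum(b != a for b, a in zip(before, after))
print(f"{moved}/100 objects remapped")
```

With 4 items under the bucket, roughly three quarters of the objects land somewhere new, which is the "effectively reshuffle everything between that bucket's items" effect described above.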
[2:50] <ajm> is there anything to make ceph log to syslog?
[2:51] <ajm> (like in there already)
[2:52] <joshd> ajm: there's a 'log to syslog' option you can turn on in your ceph.conf
[2:55] <ajm> thx joshd
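For reference, the option joshd mentions is a ceph.conf setting. A minimal sketch (section placement under [global] is an assumption; check the documentation for your version):

```
; hypothetical ceph.conf excerpt -- send daemon logging to syslog
[global]
        log to syslog = true
```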
[3:00] * alex460 (~alex@ Quit (Quit: alex460)
[3:08] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:08] * alex460 (~alex@ has joined #ceph
[3:24] * alex460 (~alex@ Quit (Quit: alex460)
[3:26] * alex460 (~alex@ has joined #ceph
[3:29] * alex460 (~alex@ Quit ()
[3:35] * RupS (~rups@panoramix.m0z.net) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * df__ (davidf@dog.thdo.woaf.net) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * peritus (~andreas@h-150-131.a163.priv.bahnhof.se) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * Ormod (~valtha@ohmu.fi) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * NaioN (~stefan@andor.naion.nl) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * jantje_ (~jan@paranoid.nl) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * wonko_be (bernard@november.openminds.be) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * f4m8_ (~f4m8@lug-owl.de) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * bugoff (bram@november.openminds.be) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * stingray (~stingray@stingr.net) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * _Tassadar (~tassadar@tassadar.xs4all.nl) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * jrosser (jrosser@dog.thdo.woaf.net) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * damoxc (~damien@94-23-154-182.kimsufi.com) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * royh (~royh@mail.vgnett.no) Quit (reticulum.oftc.net charon.oftc.net)
[3:35] * _are_ (~quassel@vs01.lug-s.org) Quit (reticulum.oftc.net charon.oftc.net)
[3:36] * RupS (~rups@panoramix.m0z.net) has joined #ceph
[3:36] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[3:36] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[3:36] * NaioN (~stefan@andor.naion.nl) has joined #ceph
[3:36] * Ormod (~valtha@ohmu.fi) has joined #ceph
[3:36] * jantje_ (~jan@paranoid.nl) has joined #ceph
[3:36] * peritus (~andreas@h-150-131.a163.priv.bahnhof.se) has joined #ceph
[3:36] * df__ (davidf@dog.thdo.woaf.net) has joined #ceph
[3:37] * jojy (~jojyvargh@ Quit (Quit: jojy)
[3:39] * wonko_be (bernard@november.openminds.be) has joined #ceph
[3:39] * f4m8_ (~f4m8@lug-owl.de) has joined #ceph
[3:39] * bugoff (bram@november.openminds.be) has joined #ceph
[3:39] * stingray (~stingray@stingr.net) has joined #ceph
[3:39] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[3:39] * _Tassadar (~tassadar@tassadar.xs4all.nl) has joined #ceph
[3:39] * _are_ (~quassel@vs01.lug-s.org) has joined #ceph
[3:39] * royh (~royh@mail.vgnett.no) has joined #ceph
[3:39] * damoxc (~damien@94-23-154-182.kimsufi.com) has joined #ceph
[3:39] * jrosser (jrosser@dog.thdo.woaf.net) has joined #ceph
[3:59] * cp (~cp@ Quit (Quit: cp)
[4:37] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[4:37] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit ()
[4:50] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[4:50] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit ()
[5:09] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[5:15] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:01] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[6:12] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[6:15] * failbaitr (~innerheig@ Quit (Ping timeout: 480 seconds)
[6:22] * failbaitr (~innerheig@ has joined #ceph
[6:41] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[8:02] * alex460 (~alex@ has joined #ceph
[8:02] <chaos__> ajm, thanks for link to collectd repo
[8:02] <chaos__> erm ;) ceph health
[8:04] * alex460 (~alex@ Quit ()
[8:39] <NaioN> josef: I saw bug 1635 filed, Martin and I have the same error
[8:40] <NaioN> I had a heavy rsync workload and both OSDs crashed, although not at the same time, there are many hours between
[8:40] <NaioN> I've mailed my output to the mailinglist
[9:30] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[9:32] * sjust (~sam@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[9:35] * gregaf (~Adium@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[9:35] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[9:35] * sagewk (~sage@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[9:47] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[10:01] * failbaitr (~innerheig@ Quit (Ping timeout: 480 seconds)
[10:03] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[10:07] * failbaitr (~innerheig@ has joined #ceph
[10:18] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[10:19] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[10:19] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[10:28] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[10:29] * sjust (~sam@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[10:29] * sagewk (~sage@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[10:41] <chaos__> ajm, ceph health is quite inconsistent.. when mds is down everything is ok;)
[10:58] <stingray> hmm
[13:18] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[13:19] * mgalkiewicz (~mgalkiewi@ has joined #ceph
[13:21] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[13:22] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[13:31] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[13:53] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Quit: Leaving.)
[13:55] * gregaf (~Adium@aon.hq.newdream.net) Quit (Read error: Connection reset by peer)
[13:56] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[14:18] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[14:19] * fronlius (~Adium@testing78.jimdo-server.com) Quit ()
[14:24] * gregorg_taf (~Greg@ has joined #ceph
[14:24] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[14:26] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[15:06] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[15:09] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:04] * mgalkiewicz (~mgalkiewi@ Quit (Quit: Leaving)
[16:29] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[17:00] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[17:00] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[17:02] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[17:17] <ajm> is there some issue with 0.37 and memory usage? ceph-osd processes are getting very large on startup...
[17:44] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[17:45] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[17:50] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:58] <sagewk> chaos__: i just had to rewrite much of that collectd plugin, but haven't pushed it yet. don't expect what you see there to actually work :)
[18:08] * cclien (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) has joined #ceph
[18:08] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[18:23] <sagewk> sepia will probably take a while to get back online
[18:29] <sagewk> ajm: i dont think anything changed (i.e. not a _new_ memory usage problem...). altho other people complained about increased memory usage with 0.35 or 0.36 (forget which).
[18:29] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Quit: Leaving.)
[18:30] <ajm> i'm seeing ceph-osd oom on these boxes with 24gb ram
[18:31] <gregaf> wow; that's not anything we're seeing locally :/
[18:31] <gregaf> ajm: can you install the debug packages and grab some memory profiles?
[18:33] <gregaf> http://ceph.newdream.net/wiki/Memory_Profiling
[18:33] <ajm> thanks, i'll get some debug there, probably later though i'm getting on a plane in an hour or two
[18:34] <gregaf> ajm: start up your OSDs with that and you should get some output which will tell us where the memory's going :)
[18:34] <ajm> the flight has wifi though! i can try from the plane!
[18:34] <gregaf> heh
[18:34] <gregaf> we live in the future!
[18:34] <ajm> tell me about it
[18:41] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:50] * fronlius (~Adium@f054097215.adsl.alicedsl.de) has joined #ceph
[18:55] * jojy (~jojyvargh@ has joined #ceph
[19:16] * adjohn (~adjohn@ has joined #ceph
[19:25] * bchrisman (~Adium@ has joined #ceph
[19:35] <fronlius> So, quite short question that will help me to get faster into ceph: Which distribution do you recommend for use? Some features came with kernel 2.6.37 - most distributions are out of the race with that in mind - and I would like to try ceph with the latest features...
[19:35] <fronlius> Is there btw any distribution that comes with ceph?
[19:36] <jmlowe> I'm pretty happy with ubuntu
[19:37] <fronlius> and what version of ceph does latest ubuntu deliver and what kernel? The latest LTS is at 2.6.32 i think…there aren't many of the newest features in…
[19:38] <jmlowe> I use oneiric, LTS is not where you want to be if you want experimental stuff
[19:39] <jmlowe> oneiric has a 3.0.0 kernel which has ceph in it, adding the ceph repositories for the rest of it is trivial, I'm running 0.37
[19:40] <fronlius> Okay, so there does not seem to be a "the distribution"-answer as it is for hadoop…They all use the cloudera stuff...
[19:40] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[19:41] <fronlius> okay, but I think ubuntu would be pretty much fast-forward..even though I don't like Canonical's behaviour lately..
[19:44] <jmlowe> maybe fedora 15 or 14 https://build.opensuse.org/package/show?package=ceph&project=home%3Aliewegas , I can't speak from experience
[19:45] <Tv> fronlius: cloudera isn't an OS, you can run cloudera stuff on many OSes..
[19:45] <Tv> fronlius: our QA is currently mostly ubuntu 10.10, we use an autobuilt kernel
[19:46] <fronlius> yeah okay, but you can download cloudera as finished kvm image for example…I thought maybe there was something like this approach for ceph also ;)
[19:47] <Tv> fronlius: one of my todo list entries is to make the qa env boot into more OSes, but the honest summary is this: we're a Debian company, I'm using Ubuntu a lot and some of the ubuntu magic sauce makes Ceph nicer to manage, we do provide specs for rpm platforms but they definitely aren't as tested
[19:47] <Tv> *for now*
[19:48] <Tv> for testing and prototyping that might make sense, but you really shouldn't run either ceph or hadoop in a vm, for real
[19:48] <jmlowe> if you are looking for kvm quickness, as I was when I first kicked the tires, grab a ubuntu image, add the following lines adjusted for version and apt-get install ceph
[19:48] <jmlowe> deb http://ceph.newdream.net/debian/ oneiric main
[19:48] <jmlowe> deb-src http://ceph.newdream.net/debian/ oneiric main
[19:49] <fronlius> nicey!
[19:49] <Tv> yeah the base install is really easy, it's the configuration part that gets more challenging ;)
[19:49] <Tv> we're also building chef cookbooks / a crowbar barclamp / juju charms to make bootstrapping a cluster really painless
[19:49] <jmlowe> it'll take you longer to download the image than to get it installed and running
[19:49] <Tv> http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/
[19:50] <fronlius> I am totally aware that I don't want my storage in VMs - but the only testing environment we use is vagrant, which has lots of benefits in terms of staging it into our production system..
[19:51] <fronlius> But for now, where we just want to take a look at what the features are, how they work and try to get some of our software running with ceph, it is totally okay for us to do it in small vms :)
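jmlowe's repository lines above amount to an apt sources fragment. A sketch of the file (the path and the oneiric codename are examples; adjust for your release, then run `apt-get update && apt-get install ceph`):

```
# /etc/apt/sources.list.d/ceph.list -- example for Ubuntu oneiric,
# using the repository lines quoted above
deb http://ceph.newdream.net/debian/ oneiric main
deb-src http://ceph.newdream.net/debian/ oneiric main
```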
[20:05] * jamesturnbull (~Adium@ has joined #ceph
[20:05] * jamesturnbull (~Adium@ Quit ()
[20:05] * jamesturnbull (~Adium@ has joined #ceph
[20:06] <NaioN> sagewk: I see you set the bug 1624 to "need more info", what kind of info are you looking for? I've run into that bug several times...
[20:07] <jamesturnbull> Hi any ceph.newdream.net folks - just wanted to mention the Ceph wiki is currently done - site times out or responds with a DB error
[20:07] <jamesturnbull> s/done/down/
[20:08] <NaioN> It's very slow here...
[20:08] <gregaf> jamesturnbull: it's running for me? although I think the data center was having problems earlier so there might be some remaining issues
[20:08] <sagewk> naion: verification that it is not a btrfs hang, osd log (debug osd = 20, debug filestore = 10, debug ms = 1), a core file.
[20:08] <jamesturnbull> gregaf: I get "Sorry! This site is experiencing technical difficulties.
[20:08] <jamesturnbull> Try waiting a few minutes and reloading.
[20:08] <jamesturnbull> (Can't contact the database server: Lost connection to MySQL server at 'reading authorization packet', system error: 0 (mysql.ceph.newdream.net))"
[20:08] <NaioN> sagewk: ok I'll set a new workload and see if i hit him again
[20:08] <sagewk> naion: thanks!
[20:09] <NaioN> sagewk: the only thing I see of BTRFS is the warning in the dmesg
[20:09] <NaioN> WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
[20:09] <NaioN> I hoped it was solved in the rc10, but I still see the warnings
[20:10] <NaioN> And i checked if the patch from Liu Bo was included (which it was... http://marc.info/?l=linux-btrfs&m=131547325515336&w=2)
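The debug levels sagewk asks for above translate to a ceph.conf fragment along these lines (placing them under [osd] is an assumption):

```
; hypothetical ceph.conf excerpt -- verbose osd logging for bug 1624
[osd]
        debug osd = 20
        debug filestore = 10
        debug ms = 1
```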
[20:11] * jamesturnbull (~Adium@ Quit (Quit: Leaving.)
[20:36] * adjohn (~adjohn@ Quit (Quit: adjohn)
[20:41] * jojy_ (~jojyvargh@ has joined #ceph
[20:41] * jojy (~jojyvargh@ Quit (Read error: Connection reset by peer)
[20:41] * jojy_ is now known as jojy
[20:46] <damoxc> what does it mean if a pg is in a crashed+peering state?
[20:49] <jmlowe> that's a good question, I'd also like to know what crashed+failed+peering means
[20:51] <joshd> crashed means there may be writes that were sent but not committed. The pg will go into the replay state after it peers to let the clients resend the writes
[20:51] <damoxc> and if the clients are no longer there?
[20:51] <sjust> damoxc: the writes will be forgotten
[20:52] <damoxc> i've got 461 pgs in a crashed+peering state
[20:52] <sjust> damoxc: they are stuck there?
[20:52] <damoxc> sjust: yeah
[20:52] <sjust> damoxc: are there down osds?
[20:52] <damoxc> sjust: nope, osd e17467: 10 osds: 10 up, 10 in
[20:53] <sjust> damoxc: are the processes running?
[20:53] <damoxc> sjust: appear to be yhes
[20:53] <damoxc> *yes
[20:53] <damoxc> they are all in Ssl according to ps
[20:54] <sjust> logs?
[20:55] <damoxc> only minimum output unfortunately
[20:55] <damoxc> would upping the verbosity and restarting help?
[20:55] <sjust> ok, try restarting one of the osds with osd logging at 25
[20:55] <sjust> try just one
[20:56] <jmlowe> what about crashed+failed+peering?
[20:56] <damoxc> i'm seeing a bunch of no heartbeat from osd.{2,5,6,8} in the logs
[20:56] <sjust> jmlowe: I actually don't recognize failed
[20:57] <jmlowe> I had that the other day, I wound up making a new fs
[20:58] <joshd> damoxc: there's probably a deadlock in the heartbeat thread - if you have debug symbols you could attach with gdb and get a backtrace
[20:59] <damoxc> joshd: okay, i'll try and do that
[21:00] <joshd> damoxc: thanks - 'thread apply all bt' should do the trick
[21:01] * adjohn (~adjohn@ has joined #ceph
[21:05] <damoxc> typically it's stopped doing it now I've installed the debugging symbols
[21:05] <damoxc> seeing a load of journal throttle: waited for ops
[21:07] <damoxc> well i'll keep an eye out for it and give you the backtrace if it pops up again, thanks!
[21:07] <joshd> ok, great
[21:08] <joshd> I know sage has been trying to track that one down
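joshd's gdb suggestion, as a session sketch (the pidof invocation is illustrative and assumes a single ceph-osd process; the ceph debug symbol packages must be installed for useful backtraces):

```
# attach to the running ceph-osd and dump every thread's backtrace
gdb -p $(pidof ceph-osd)
(gdb) thread apply all bt
(gdb) detach
(gdb) quit
```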
[21:12] <sjust> jmlowe: are you sure that it was crashed+failed+peering?
[21:17] <jmlowe> well, I'm less confident now, I know I had crashed+<something>+peering
[21:19] * cp (~cp@ has joined #ceph
[21:20] <cp> with the crushmap rules 'data', 'metadata' and 'rbd' is it the rule number which is used for the corresponding pools or the rule number?
[21:20] <cp> rule name or rule number I mean
[21:42] <sagewk> cp: ceph osd dump, and look at the crush_ruleset to see the pool -> crush rule mapping
[21:44] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[21:47] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[21:55] * bchrisman (~Adium@ has joined #ceph
[22:51] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[22:54] * fronlius (~Adium@f054097215.adsl.alicedsl.de) Quit (Quit: Leaving.)
[23:19] * jmlowe (~Adium@mobile-166-137-142-139.mycingular.net) has joined #ceph
[23:19] * jmlowe (~Adium@mobile-166-137-142-139.mycingular.net) has left #ceph
[23:21] * sjust (~sam@aon.hq.newdream.net) Quit (Remote host closed the connection)
[23:24] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[23:25] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[23:53] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:58] * jojy (~jojyvargh@ Quit (Quit: jojy)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.