#ceph IRC Log


IRC Log for 2013-01-11

Timestamps are in GMT/BST.

[0:05] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[0:05] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[0:05] * Leseb_ is now known as Leseb
[0:09] * aliguori (~anthony@cpe-70-112-157-151.austin.res.rr.com) has joined #ceph
[0:18] * The_Bishop_ (~bishop@i59F6DFD5.versanet.de) Quit (Quit: Who the hell is this Peer? If I catch him, I'll reset his connection!)
[0:19] * tnt (~tnt@86.188-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:19] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:30] * jlogan1 (~Thunderbi@2600:c00:3010:1:9cc3:821f:978c:5b0b) Quit (Ping timeout: 480 seconds)
[0:34] * madkiss (~madkiss@ Quit (Remote host closed the connection)
[0:47] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has left #ceph
[0:49] * ScOut3R (~ScOut3R@dsl5401A397.pool.t-online.hu) Quit (Remote host closed the connection)
[0:51] <paravoid> sagewk: hey
[0:51] <paravoid> sagewk: are you available to debug interactively #3770 or should I reply to the bug?
[0:52] <gregaf> he's in a meeting
[0:52] <paravoid> oh, okay
[0:53] <paravoid> thanks
[0:54] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[0:54] <paravoid> "Can you attach a hex dump of the attributes on the current/4.f9_head collection on the crashed osd?"
[0:54] <paravoid> how should I interpret that?
[0:54] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[0:54] * aliguori (~anthony@cpe-70-112-157-151.austin.res.rr.com) Quit (Remote host closed the connection)
[0:54] <paravoid> /var/lib/ceph/osd/ceph-27/current/4.f9_head/ is empty
[0:54] <dmick> maybe rados listxattr/getxattr?
[0:55] <paravoid> those get an object as an attribute, don't they?
[0:55] <dmick> ^sjust
[0:56] <sjust> paravoid: hi
[0:56] <paravoid> hi!
[0:56] <sjust> in this case, I need the xattrs on that directory
[0:56] <sjust> attr -L
[0:56] <sjust> I think
[0:56] <dmick> oh so the linux commands, right
[0:56] <sjust> nope, attr -l
[0:56] <sjust> apparently
[0:56] <sjust> so attr -l /var/lib/ceph/osd/ceph-27/current/4.f9_head/
[0:57] <dmick> if you want the values you have to script IIRC
[0:57] <sjust> and then attr -g <attrname> <pathname> to get the attr value for each script
[0:57] <sjust> *for each key
[0:57] <paravoid> oh, fs extended attributes
[0:57] <sjust> yah
[0:57] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:57] <paravoid> didn't realize that
[0:57] <paravoid> doing it
[0:58] <dmick> for a in $(attr -lq <dir>); do attr -g $a <dir>; done or something
[0:58] <paravoid> | hd
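The xattr dump discussed above (list the keys, fetch each value, pipe through a hex dumper) can also be sketched in Python. This is only an illustration, not what was actually run on the OSD: the function names are made up here, and it relies on os.listxattr and os.getxattr, which are Linux-only. Unlike attr, which works in the user namespace by default, these calls return full attribute names across all namespaces the caller can see.

```python
import os


def hex_dump(value):
    """Format raw attribute bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in value)


def dump_xattrs(path):
    """Return {attribute name: hex string} for every xattr on path.

    Hypothetical helper for illustration; Linux only.
    """
    return {name: hex_dump(os.getxattr(path, name))
            for name in os.listxattr(path)}
```

Calling dump_xattrs on the collection directory named in the bug report (run as a user allowed to read it) would give one hex string per attribute key.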
[0:58] * jlogan1 (~Thunderbi@2600:c00:3010:1:9cc3:821f:978c:5b0b) has joined #ceph
[1:02] <paravoid> sjust: there you go
[1:02] <paravoid> and sorry for pinging sage and not you, didn't realize you were the one that replied
[1:03] * nwat (~Adium@soenat3.cse.ucsc.edu) Quit (Quit: Leaving.)
[1:05] * danieagle (~Daniel@ Quit (Quit: See you later :-) and thank you very much for everything!!! ^^)
[1:05] <AaronSchulz> yehudasa: hello, might you have any chance to take a crack at http://tracker.newdream.net/issues/3454 ?
[1:06] * miroslav (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[1:07] <sjust> paravoid: no worries
[1:07] <paravoid> want me to check anything else?
[1:07] <sjust> paravoid: ugh, that collection appears to have a well-formed attribute which really does want map version 10705
[1:07] <sjust> the directory is empty, right?
[1:07] <paravoid> yes
[1:07] <sjust> and all your pgs are active+clean?
[1:08] <paravoid> now they are
[1:08] <sjust> you can get the OSD back by deleting that directory along with the corresponding log+info from the meta directory
[1:08] <sjust> one sec though
[1:08] <sjust> want to make sure there isn't anything else I can get from this
[1:08] <yehudasa> AaronSchulz: just came back from a few weeks break, trying to figure out my priorities, not sure that it's on top
[1:09] <AaronSchulz> ah well, would be really nice though :)
[1:09] <sjust> paravoid: can you upload the output of 'find . current/meta' to cephdrop@ceph.com?
[1:09] <yehudasa> well, you can use the S3 api instead
[1:09] <sjust> sorry
[1:09] <yehudasa> for that
[1:09] <sjust> 'find current/meta'
[1:09] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[1:09] <AaronSchulz> yehudasa: hmm, how so?
[1:09] <sjust> curious as to what maps are on the OSD
[1:10] <yehudasa> AaronSchulz: S3 has a preauthenticated url scheme
[1:10] <yehudasa> which we support, you can try using it.. you'll have to set S3 key for the user
[1:14] * DrewBeer is now known as Exstatica
[1:15] <paravoid> sjust: I just attached it to the bug report (hope you don't mind)
[1:15] <sjust> sounds good, I just assumed it would be too big
[1:15] <paravoid> I gzipped it
[1:15] <paravoid> :)
[1:18] <AaronSchulz> yehudasa: hmm, http://css-tricks.com/snippets/php/generate-expiring-amazon-s3-link/ looks useful then
[1:18] <AaronSchulz> I'll try to see if that works
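For reference, the S3 "preauthenticated URL" scheme yehudasa mentions is query-string authentication: the URL carries an access key, an expiry time and an HMAC-SHA1 signature. A minimal sketch of the original (version 2) signing scheme follows, assuming a plain GET with no extra headers. The endpoint, bucket and key names in the usage note are hypothetical, and whether a given radosgw release accepts every detail is not confirmed by the log.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def presigned_get_url(endpoint, bucket, key, access_key, secret_key, expires):
    """Build an S3 query-string-authenticated GET URL (signature version 2).

    expires is an absolute Unix timestamp after which the URL stops working.
    """
    resource = f"/{bucket}/{quote(key)}"
    # String to sign: verb, Content-MD5, Content-Type, expiry, resource.
    string_to_sign = f"GET\n\n\n{expires}\n{resource}"
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    # The base64 signature may contain '+', '/' and '=', so escape all of it.
    signature = quote(base64.b64encode(digest).decode(), safe="")
    return (f"{endpoint}{resource}"
            f"?AWSAccessKeyId={quote(access_key, safe='')}"
            f"&Expires={expires}&Signature={signature}")
```

Something like presigned_get_url("http://rgw.example.com", "videos", "clip.ogv", access_key, secret_key, int(time.time()) + 3600) would then give a link that a tool such as ffmpeg can fetch for one hour without credentials.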
[1:19] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[1:20] <paravoid> AaronSchulz: that's going to be an interesting mwstore backend
[1:20] <paravoid> swift and a little of s3
[1:20] <AaronSchulz> paravoid: how so? It's just for passing urls to ffmpeg
[1:20] <AaronSchulz> not that terrible of a hack :p
[1:21] <paravoid> ;-)
[1:21] <paravoid> and here I was thinking we should move the rewrite VCL (when that happens...) closer to radosgw
[1:21] <paravoid> as to be able to use it internally as well
[1:22] <sjust> paravoid: how long had that cluster been running v0.56.1?
[1:22] <AaronSchulz> paravoid: that doesn't seem to relate to temp urls
[1:23] <AaronSchulz> I mean when we make them we do so from the full actual file url (not some url that we rewrite)
[1:23] <paravoid> sjust: this happened a few hours(?) after the upgrade
[1:23] <paravoid> maybe it didn't come back after the upgrade? not entirely sure
[1:23] <sjust> from what did you upgrade?
[1:23] <paravoid> 0.56
[1:23] <paravoid> quite messy :)
[1:25] <sjust> paravoid: you were the one having trouble with starting an osd before due to heartbeat timeout, right?
[1:25] <sjust> is this the same machine?
[1:25] <paravoid> yeah
[1:25] <sjust> same OSD?
[1:25] <paravoid> the heartbeat timeout was on all machines
[1:25] <sjust> oh
[1:27] * BManojlovic (~steki@ Quit (Quit: I'm off, and you do whatever you want...)
[1:27] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[1:29] * rlr219 (62dc9973@ircip1.mibbit.com) has joined #ceph
[1:30] * The_Bishop (~bishop@e177090223.adsl.alicedsl.de) has joined #ceph
[1:30] <AaronSchulz> yehudasa: does s3 use rgw subusers or just rgw users?
[1:30] <rlr219> Had an MDS crash in 0.56.1 and now it won't restart. mds/MDCache.cc: In function 'CDir* MDCache::rejoin_invent_dirfrag(dirfrag_t)' thread 7f08abcd4700 time 2013-01-10 19:18:34.294925 mds/MDCache.cc: 3959: FAILED assert(in->is_dir())
[1:30] <yehudasa> AaronSchulz: just users
[1:31] <rlr219> any ceph devs that can help?
[1:33] <gregaf> rlr219: you're using multiple active MDS servers? :(
[1:34] <paravoid> sjust: I don't mind leaving it like that for a few days if you want
[1:34] <sjust> paravoid: I believe I have what I need now, you can remove that pg dir/log/info if you want
[1:34] <rlr219> yes. we are testing, so not sure what is best.
[1:35] <gregaf> well, a single MDS is a lot more stable than multiple active ones, and we don't consider any of it to be production ready for general use
[1:35] <rlr219> sjust: I just rebuilt my cluster and upgraded to bobtail. do seem to like most of what I see so far.
[1:36] <rlr219> ok. so I should make the extras active stand by?
[1:36] <gregaf> rlr219: if you can add "debug mds = 20" and "debug ms = 1" and post the logs somewhere I may be able to take a look
[1:36] <gregaf> yeah, definitely a standby of some sort
[1:36] * Cube (~Cube@ has joined #ceph
[1:36] <rlr219> give me a few minutes. Thanks.
[1:36] <gregaf> you'll probably need to get all three of them up before you can successfully turn them off, though!
[1:37] * Cube1 (~Cube@ Quit (Ping timeout: 480 seconds)
[1:38] * miroslav (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[1:38] <gregaf> rlr219: ^
[1:41] <paravoid> sjust: silly question, you said I should remove the directory and pglog/pginfo
[1:41] <paravoid> where do I find pglog/pginfo?
[1:41] <paravoid> ah
[1:41] <paravoid> found them
[1:41] <paravoid> nevermind
[1:43] <paravoid> sjust: so, could you explain a bit more what's happening here if you don't mind?
[1:44] <rlr219> gregaf: http://pastebin.com/W9e5YsTY
[1:44] <paravoid> sjust: crashed again :(
[1:44] <paravoid> same assert
[1:45] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[1:45] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[1:51] <gregaf> rlr219: hmm, did you say this is a fresh cluster? or did it have data and then get upgraded and the other two MDSes restarted successfully and this one failed?
[1:56] <gregaf> I haven't debugged much that happens in this particular interaction and I'm not sure we have the bandwidth to handle it so if you can just toss the filesystem stuff that'd be easiest
[1:56] <rlr219> fresh upgrade last night. MDSs started today data added to cephfs this afternoon
[1:57] <gregaf> why were you restarting one to begin with, then?
[1:57] <gregaf> (I believe this is a bad data-on-disk bug and am wondering when it happened)
[1:58] <rlr219> I set it up to have 3. my understanding was that the MDSs would partition out responsibility for metadata.
[1:58] <gregaf> yes
[1:59] <gregaf> but the assert you're showing me is in a restarting code path, so the daemon got turned off and back on again and I'm curious if there was another crash first or something
[1:59] <rlr219> didn't know multiples weren't stable.
[2:00] <rlr219> there was a crash earlier. i could paste whole log file but it's really big (1.9 GB)
[2:01] <gregaf> okay, that's probably the more important one
[2:01] <gregaf> can you pull out just the backtrace?
[2:01] <rlr219> the original?
[2:02] <rlr219> give me a minute
[2:03] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[2:05] * buck (~buck@bender.soe.ucsc.edu) Quit (Quit: Leaving.)
[2:07] <rlr219> gregaf: http://pastebin.com/5LQde69e
[2:19] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[2:21] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[2:22] * agh (~agh@www.nowhere-else.org) has joined #ceph
[2:25] <sjust> paravoid: that means that multiple pgs were behind...
[2:25] <sjust> basically, the osd erroneously trimmed maps which were still in use by pgs
[2:25] <sjust> I need to run, but I'll be working on a fix tomorrow
[2:31] <rlr219> gregaf: you still here?
[2:31] <gregaf> yeah
[2:31] <gregaf> rlr219: I don't imagine you have the core file around from that original crash?
[2:32] <rlr219> it just so happens i do.
[2:32] <rlr219> it is 1.6 GB
[2:34] * korgon (~Peto@isp-korex- has joined #ceph
[2:37] <gregaf> n/m that thought, this assert isn't quite what I thought it was
[2:37] <gregaf> that's a bug in the migration between metadata servers and unfortunately that's something that we don't have the time for right now, sorry
[2:38] <gregaf> we're now starting to work on getting a single MDS stable enough for use ;) and there are some known issues we'll need to work through but if you run into a bug there we'll be able to deal with it a lot sooner
[2:39] <rlr219> ok. Well, I think I have my answer. again, it was testing. Unfortunate that cephfs isn't quite ready yet, But I do like the improvements so far in bobtail!
[2:39] <gregaf> yeah, unfortunate for us too ;)
[2:40] <rlr219> gregaf: Thanks for your time.
[2:40] <gregaf> np, thanks for looking!
[2:40] <rlr219> cheers
[2:40] * rlr219 (62dc9973@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[2:49] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[2:49] * LeaChim (~LeaChim@b0faeeb0.bb.sky.com) Quit (Read error: Operation timed out)
[2:54] <gregaf> Kioob`Taff: that log you posted doesn't seem to cover any scrubbing at all, so I can't look at how much bandwidth it's taking
[2:55] <gregaf> you are doing a fair bit of writing to the cluster though; it's not huge but client writes are coming in at a few MB/second
[2:57] * mattbenjamin (~matt@ Quit (Quit: Leaving.)
[3:00] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[3:05] * jlogan1 (~Thunderbi@2600:c00:3010:1:9cc3:821f:978c:5b0b) Quit (Ping timeout: 480 seconds)
[3:08] * yanzheng (~zhyan@jfdmzpr02-ext.jf.intel.com) has joined #ceph
[3:10] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[3:10] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[3:19] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:29] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:10] * dmick (~dmick@2607:f298:a:607:8063:62c7:78d5:6751) Quit (Quit: Leaving.)
[4:19] * kbad (~kbad@malicious.dreamhost.com) Quit (Quit: Lost terminal)
[4:19] * mmgaggle (~kyle@alc-nat.dreamhost.com) has joined #ceph
[4:49] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[5:02] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[5:11] * tsygrl (~tsygrl314@c-75-68-140-25.hsd1.vt.comcast.net) has joined #ceph
[5:11] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[5:12] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[5:16] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[5:18] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[5:23] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[5:28] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[5:28] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[5:30] * agh (~agh@www.nowhere-else.org) has joined #ceph
[5:30] * zK4k7g (~zK4k7g@digilicious.com) has left #ceph
[5:32] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[5:44] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[5:44] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[6:07] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[6:15] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[7:07] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[7:16] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[7:52] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[8:00] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Ping timeout: 480 seconds)
[8:01] * silversurfer (~silversur@122x212x156x18.ap122.ftth.ucom.ne.jp) has joined #ceph
[8:04] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[8:04] * The_Bishop_ (~bishop@f052098161.adsl.alicedsl.de) has joined #ceph
[8:04] * The_Bishop (~bishop@e177090223.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[8:08] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[8:10] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:16] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[8:17] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:28] * madkiss (~madkiss@ has joined #ceph
[8:30] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Some folks are wise, and some otherwise.)
[8:36] <absynth_47215> morning
[8:52] <Kioob`Taff> gregaf : thanks, there was a scrub running during those logs, but maybe it had not finished ?
[9:08] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[9:10] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[9:11] * agh (~agh@www.nowhere-else.org) has joined #ceph
[9:11] <schlitzer|work> morning
[9:11] * silversu_ (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[9:13] * silversurfer (~silversur@122x212x156x18.ap122.ftth.ucom.ne.jp) Quit (Ping timeout: 480 seconds)
[9:17] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[9:18] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[9:18] * sel (~sel@2001:16d8:eed5:4040:55ab:13e8:2898:ce7f) has joined #ceph
[9:19] * loicd (~loic@ has joined #ceph
[9:19] * silversu_ (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Ping timeout: 481 seconds)
[9:22] * ScOut3R (~ScOut3R@ has joined #ceph
[9:22] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[9:22] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[9:34] * The_Bishop__ (~bishop@e179009205.adsl.alicedsl.de) has joined #ceph
[9:35] * The_Bishop_ (~bishop@f052098161.adsl.alicedsl.de) Quit (Read error: Connection reset by peer)
[9:39] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:40] * fc (~fc@ Quit (Quit: leaving)
[9:41] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[9:45] * fc__ (~fc@ has joined #ceph
[9:47] * tore_ (~tore@ Quit (Remote host closed the connection)
[9:49] * sbadia (~seb@yasaw.net) Quit (Quit: WeeChat 0.3.8)
[9:50] <loicd> sileht: good morning :-)
[9:51] <sileht> loicd, good morning
[9:51] <loicd> sileht: my mission today is to figure out how to run lcov with teuthology ( which https://github.com/ceph/teuthology/blob/master/coverage/cov-init.sh is supposed to do ). Do you have previous experience with that ?
[9:52] <sileht> loicd, a bit
[9:52] <loicd> nice :-)
[9:54] <sileht> loicd, the first step for teuthology is to describe your test platform and which tasks you want to run
[9:54] <loicd> I assume the idea is to run https://github.com/ceph/teuthology/blob/master/coverage/cov-init.sh on each teuthology target, then run teuthology so that it generates data for lcov, then run https://github.com/ceph/teuthology/blob/master/coverage/cov-analyze.sh to get results.
[9:56] <sileht> all tasks that can be done are here: https://github.com/ceph/teuthology/tree/master/teuthology/task
[9:56] * sleinen1 (~Adium@2001:620:0:46:bc7a:622:8693:fbb5) Quit (Ping timeout: 480 seconds)
[9:57] <loicd> sileht: I managed to run teuthology successfully already. ( http://marc.info/?l=ceph-devel&m=135785969127686&w=2 ) but I can't figure out how to get code coverage.
[9:57] <sileht> loicd, :)
[9:58] * loicd digging
[9:58] <sileht> loicd, I have never taken a look at the coverage script
[10:00] * low (~low@ has joined #ceph
[10:05] * Leseb (~Leseb@ has joined #ceph
[10:07] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[10:07] * BManojlovic (~steki@ has joined #ceph
[10:08] * yoshi (~yoshi@p2100-ipngn4002marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:09] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[10:12] * yanzheng (~zhyan@jfdmzpr02-ext.jf.intel.com) Quit (Remote host closed the connection)
[10:17] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[10:26] * tnt (~tnt@86.188-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[10:29] * LeaChim (~LeaChim@b0faeeb0.bb.sky.com) has joined #ceph
[10:31] * EmilienM__ (~my1@ has joined #ceph
[10:32] * EmilienM__ (~my1@ has left #ceph
[10:38] * loicd1 (~loic@ has joined #ceph
[10:40] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[10:41] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[10:42] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[10:49] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[10:52] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[10:55] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[10:55] * The_Bishop__ (~bishop@e179009205.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[10:57] * agh (~agh@www.nowhere-else.org) has joined #ceph
[11:10] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[11:10] <madkiss> what's wrong when, after creating a cluster, i get a lot of "stale+peering" PGs?
[11:10] <madkiss> totally unpredictable?
[11:10] * allsystemsarego (~allsystem@5-12-241-245.residential.rdsnet.ro) has joined #ceph
[11:18] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[11:19] * sel (~sel@2001:16d8:eed5:4040:55ab:13e8:2898:ce7f) Quit (Ping timeout: 480 seconds)
[11:21] * The_Bishop (~bishop@e179009205.adsl.alicedsl.de) has joined #ceph
[11:23] * loicd1 is now known as loicd
[11:23] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[11:25] * sel (~sel@2001:16d8:eed5:4040:55ab:13e8:2898:ce7f) has joined #ceph
[11:27] <loicd> Leseb: happy new year !
[11:28] <loicd> loicd: have you ever tried getting code coverage reports from teuthology ? I'm starting to suspect that https://github.com/ceph/teuthology/blob/master/coverage/cov-init.sh is outdated
[11:28] <loicd> Leseb: ^
[11:30] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[11:34] * LeaChim (~LeaChim@b0faeeb0.bb.sky.com) Quit (Ping timeout: 480 seconds)
[11:39] <Leseb> loicd: hey! happy new year !
[11:39] <loicd> :-D
[11:40] <Leseb> loicd: hop, sorry I never tried, but you seem all of a sudden really interesting by teuthology. Is there any specific reason for that?
[11:41] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Quit: Leaving.)
[11:43] <Leseb> s/interesting/interested
[11:43] <loicd> Leseb: I like tests, unit & integration, all kinds. Reading tests helps me understand the code base. Reading coverage, if it's covered I know the code is not dead, for instance. If the logic puzzles me, I can try to read the tests and get usage patterns. Does that make sense ?
[11:44] * LeaChim (~LeaChim@b0faca2a.bb.sky.com) has joined #ceph
[11:44] <Leseb> loicd: yes it does :)
[11:46] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[12:03] <dec> hmm; performance/latency has gone to crap since going to 0.56 (from 0.53) earlier this week.
[12:04] <dec> the OSDs have been constantly doing 10x the disk reads/writes they previously were
[12:04] <dec> but everything looks healthy.
[12:04] <dec> actually, no, not reads - just writes
[12:05] <dec> write ops/sec aggregate has gone from ~80 to ~380 after the upgrade
[12:09] * LeaChim (~LeaChim@b0faca2a.bb.sky.com) Quit (Ping timeout: 480 seconds)
[12:15] <dec> I guess no one's around :)
[12:19] * LeaChim (~LeaChim@b0fadd12.bb.sky.com) has joined #ceph
[12:21] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[12:27] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Quit: Leaving)
[12:33] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[12:44] <sel> I'm a bit confused. I'm testing ceph with 4 servers each with 2 OSDs, one physical disk per OSD. The replica count is set to two. If I remove two disks from two different servers, am I right to assume that I will then lose data? As far as I understand each server in my setup is a failure domain.
[13:10] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[13:10] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit ()
[13:11] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[13:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[13:19] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[13:20] * sleinen (~Adium@ has joined #ceph
[13:21] * sleinen1 (~Adium@2001:620:0:25:55f3:49eb:e431:8ed2) has joined #ceph
[13:28] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[13:28] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[13:39] <tnt> sel: depends on your crush map, but in general, yes.
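To put a rough number on tnt's "in general, yes": with two replicas on different hosts, data is lost only for placement groups whose two copies sit on exactly the two removed disks, and only if both disks go before re-replication completes. The sketch below enumerates that for sel's layout. It assumes every cross-host OSD pair is equally likely, which is a simplification; real CRUSH placement is weighted and pseudo-random, so treat the result as an estimate only. The function name is made up for this illustration.

```python
from itertools import combinations


def fraction_of_pgs_lost(hosts, osds_per_host, failed):
    """Fraction of equally likely 2-replica placements with both copies failed.

    Assumes one replica per host (host-level failure domain) and a uniform
    choice among all cross-host OSD pairs. failed is a set of (host, disk).
    """
    osds = [(h, d) for h in range(hosts) for d in range(osds_per_host)]
    # Valid placements put the two replicas on different hosts.
    placements = [p for p in combinations(osds, 2) if p[0][0] != p[1][0]]
    lost = [p for p in placements if all(o in failed for o in p)]
    return len(lost) / len(placements)


# sel's layout: 4 hosts, 2 OSDs each; one disk pulled from host 0 and host 1.
print(fraction_of_pgs_lost(4, 2, {(0, 0), (1, 0)}))
```

Under these assumptions 1 of the 24 possible cross-host pairs is hit, so roughly 4% of placement groups would lose both copies, while pulling both disks from the same host loses none.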
[13:43] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:57] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:59] * terje_ (~terje@97-118-118-3.hlrn.qwest.net) Quit (Read error: Operation timed out)
[14:01] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[14:03] * terje__ (~joey@97-118-118-3.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[14:07] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[14:10] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:12] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[14:13] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[14:20] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Read error: No route to host)
[14:22] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[14:22] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[14:43] <fghaas> dec: not surprised, really, quite a few people have reported that their performance went down the drain post-bobtail
[14:44] <fghaas> not sure if a fix is in sight, but people are at least aware of the issue
[14:45] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:49] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[14:49] * julienhuang (~julienhua@106.242-224-89.dsl.completel.net) has joined #ceph
[14:49] * julienhuang (~julienhua@106.242-224-89.dsl.completel.net) Quit ()
[14:52] <tnt> mmm, that sucks. I thought bobtail was supposed to increase performance ...
[14:53] <dec> yeah, me too!
[14:53] <dec> damn!
[14:55] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[15:03] <nhm> fghaas: dec: I missed this conversation, let me read scrollback
[15:04] * aliguori (~anthony@cpe-70-112-157-151.austin.res.rr.com) has joined #ceph
[15:06] <dec> nhm: there wasn't much to it - just me comparing performance from ceph 0.53 to 0.56 after upgrading this week;
[15:06] <nhm> dec: Ok, more write ops. Are the writes being fragmented, or is more total data being written out?
[15:07] <dec> I don't really have any data on that
[15:07] <nhm> dec: yeah, I'd like to figure out what's going on though since I didn't see significant write degradation like that.
[15:08] <nhm> dec: Are you using rbd/cephfs/rgw?
[15:08] <dec> RBD
[15:08] <dec> for serving VM disks via KVM
[15:08] <dec> so it looks like the number of bits/sec written hasn't really changed
[15:08] <nhm> Ok. Could you try a rados bench? Maybe something broke in RBD.
[15:09] <nhm> hrm.. So throughput is the same, but latency is higher? Or just more ops/sec?
[15:09] <dec> ops/sec higher and latency higher
[15:10] <nhm> fghaas: If you've been hearing that other people are having performance problems, please let me know too.
[15:10] <nhm> dec: Are you using QEMU-KVM or the kernel RBD?
[15:11] <dec> qemu-kvm
[15:11] <nhm> Is RBD cache enabled?
[15:11] <fghaas> nhm: was referring to the stuff discussed here yesterday and the day before
[15:12] <dec> nhm: RBD cache in qemu-kvm, or some caching on the ceph servers?
[15:12] <nhm> fghaas: hrm, ok. I'll have to read scrollback. Any trends?
[15:13] <fghaas> nhm: can talk in about an hour, have a call in 2 mins
[15:13] <dec> nhm: http://i.imgur.com/hIDa2.png and http://i.imgur.com/K0BE4.png -- example from one of the OSD disks
[15:13] <nhm> fghaas: np, ttyl
[15:14] <dec> you can see when we upgraded!
[15:14] <nhm> dec: yeah, definitely!
[15:14] <nhm> dec: http://ceph.com/docs/master/rbd/rbd-config-ref/
[15:16] <nhm> dec: I'm wondering if something in 0.56 might have caused RBD to stop caching as efficiently so the OSDs are getting smaller write requests.
[15:16] <nhm> dec: was anything other than Ceph upgraded at the same time?
[15:18] <dec> other than upgrading the osd + mon daemons (the clients are still v0.53 librbd)
[15:18] <dec> the other change was the OSD disk mount options
[15:19] <dec> they're all ext4, and we changed from commit=5 , data=ordered , barrier (the defaults) to commit=30, data=writeback, nobarrier
[15:19] <dec> (to try to improve performance; we've seen big performance boost with these options elsewhere)
[15:22] <nhm> dec: Ok, I'd be really careful with nobarrier and data=writeback. You are risking data corruption during power failure.
[15:23] <dec> it should be safe with the battery-backed write-cache of the disk controllers
[15:23] <dec> and the redundant PSUs and redundant DC power, etc. :)
[15:23] <dec> but I take the warning
[15:25] <nhm> dec: you know, looking back at the performance preview, ext4 writes did suffer: http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/#conclusion
[15:26] <nhm> How are the disks configured under the OSDs?
[15:26] <dec> individual disk per OSD, with an ext4 filesystem on each
[15:27] <dec> interesting note in the performance preview.
[15:27] <tnt> What really surprises me in those perf review is that the result seems to be all over the place depending on the case ...
[15:28] <tnt> -68% to 203% is a pretty wide spread.
[15:29] <dec> yeah;
[15:29] <nhm> tnt: that's over a lot of different filesystems and configurations though.
[15:29] <dec> the preview talks about removal of explicit write flushing as a possible reason for the ext4 slowdown.
[15:30] <dec> can we reduce the new filestore_flush_min option to change this?
[15:30] <nhm> dec: yes, we changed how the filestore flusher worked because it was causing performance problems on other filesystems, but it's also possible that this hurt EXT4 performance.
[15:31] <nhm> dec: I'm doing parametric sweeps of ceph tunables right now to see if I can figure out what options are best to tune for each filesystem.
[15:31] <nhm> dec: yes, all we did was introduce that option and make it not explicitly flush writes under the default size.
[15:32] <nhm> dec: for the individual disks, are they in a R0 config or JBOD? Also, what controller are you using?
[15:33] <dec> Err, I'll have to check
[15:34] <dec> I think they have changed to JBOD presented from some LSI SAS2308s
[15:34] <nhm> dec: one reason I ask is because on the LSI SAS2208, write caching seems to be disabled if the disks are in JBOD mode, but on Areca cards it is not.
[15:35] <dec> ah, yep they have LSI 9205-8e cards, which are LSISAS2308
[15:35] <dec> and they're just HBAs so no write cache
[15:36] <dec> but none of the hardware has changed, so doesn't really explain the big performance differential between .53 and .56
[15:36] <nhm> Yeah, so those will behave sort of like the JBOD mode on the SAS2208.
[15:36] <dec> yup.
[15:37] <nhm> dec: fwiw in JBOD mode it looks like the SAS2208 saw 4k write performance degradation with ext4.
[15:37] <nhm> In the performance preview.
[15:39] <nhm> btw, what io scheduler are you using?
[15:39] <dec> noop
[15:40] <nhm> dec: I have an article coming out soon about io schedulers. With EXT4 I would give CFQ a try.
[15:41] <nhm> at least in JBOD mode.
[15:44] <jmlowe> nhm: oooh, I've been wondering about schedulers
[15:47] <nhm> tnt: oh, btw, regarding rrdtool, I wrote a perl graphics library that looks almost exactly like RRDtool. :)
[15:48] <nhm> tnt: well, library is probably the wrong term. framework? something like that.
[15:48] <nhm> jmlowe: article should be out in the next week I think. If you have any questions I can tell you what I know.
[15:49] <morpheus__> useful scheduler for xfs? :)
[15:52] * PerlStalker (~PerlStalk@ has joined #ceph
[15:53] <nhm> morpheus__: for JBOD, I'd use CFQ. If you are using any kind of RAID mode with controller cache (even single disk R0 arrays) I'd use deadline or maybe noop.
[15:54] <nhm> morpheus__: should have a comparison chart once the article is released.
[15:55] <morpheus__> great, looking forward to it
[15:55] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[15:56] <nhm> morpheus__: actually, if you are trying to optimize lots of concurrent small reads, XFS on JBOD might do better with deadline/noop too.
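To compare schedulers as nhm suggests, it helps to confirm which one a disk is actually using. On Linux the active scheduler is shown in square brackets in /sys/block/<device>/queue/scheduler. The following is a minimal sketch of reading that file; the device name "sdb" is a hypothetical placeholder, not something taken from the discussion.

```python
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked with brackets, e.g. 'noop deadline [cfq]' -> 'cfq'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler marked in {sysfs_text!r}")


def read_scheduler(device: str) -> str:
    """Read the active scheduler for a block device name such as 'sdb' (hypothetical)."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    # Parsing a sample line, so the sketch runs without touching real hardware.
    print(active_scheduler("noop deadline [cfq]"))
```

Switching schedulers is done by writing a name back to the same file as root, so a check like this is only the read side of the experiment.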
[16:06] <dec> I guess I can't change OSD filesystem options on-the-fly, and need to restart the OSDs?
[16:11] * sel (~sel@2001:16d8:eed5:4040:55ab:13e8:2898:ce7f) Quit (Quit: Leaving)
[16:15] <dec> neither CFQ/Deadline seems to make any difference over noop
[16:15] <dec> at least in the short term tests I've just done
[16:18] <nhm> dec: Was worth a shot. Are you running on ubuntu?
[16:21] <dec> Nope, EL6
[16:21] <nhm> what kernel?
[16:22] <dec> RedHat's 2.6.32
[16:23] <nhm> yikes, does that support syncfs?
[16:23] <nhm> By default 2.6.32 does not. It wasn't added until 2.6.38 afaik.
[16:24] <dec> Yes, it does
[16:27] <nhm> dec: is there any complaining in the ceph logs about it?
[16:27] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[16:27] <dec> nothing I can see...
[16:28] <dec> about 2-3 times per day I see a "failed lossy con, dropping message" in OSD logs
[16:41] <paravoid> dec: hey, any news regarding that puppet module of yours? :)
[16:42] <dec> paravoid: I've started getting it ready to release publicly, but haven't had a chance to finish it
[16:42] <dec> paravoid: haven't forgotten! :)
[16:44] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[16:51] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[16:51] <noob2> on ceph 0.56 is it normal when doing a rbd rm to see this?
[16:51] <noob2> 2013-01-11 10:50:50.264130 7f8138ac5780 -1 librbd: Error listing snapshots: (95) Operation not supported
[16:51] <noob2> it removed the image fine so i'm not overly concerned about it
[16:58] * jlogan (~Thunderbi@2600:c00:3010:1:9cc3:821f:978c:5b0b) has joined #ceph
[16:59] * low (~low@ Quit (Quit: Leaving)
[17:00] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[17:00] * zK4k7g (~zK4k7g@digilicious.com) has joined #ceph
[17:00] <absynth_47215> wschulze: around?
[17:01] <wschulze> absynth_47215: yes
[17:07] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[17:08] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[17:09] * Cube1 (~Cube@ has joined #ceph
[17:10] <mikedawson> Getting an assert when trying to start an OSD process on 0.56.1 http://pastebin.com/dU2rcDEm
[17:11] <mikedawson> other osds and mons running properly
[17:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[17:12] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[17:13] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[17:14] * BManojlovic (~steki@ Quit (Quit: I'm off, and you do whatever you want...)
[17:14] <mikedawson> Actually, I'm getting this assert on 3 of my OSDs (all on separate boxes)
[17:20] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[17:21] <sstan> Is there documentation on how to set up the cluster manually?
[17:22] <noob2> sstan: yup it's on the wiki
[17:22] <noob2> i'm working on getting ceph to operate with ovirt. i'm fairly certain it's going to work
[17:24] <sstan> noob2: you mean in /docs/master ?
[17:25] <noob2> http://ceph.com/docs/master/rados/deployment/mkcephfs/
[17:25] <noob2> that's what i used to manually create the cluster
[17:25] <sstan> ah that's not manual. The script does the work
[17:25] * fc__ (~fc@ Quit (Quit: leaving)
[17:26] <noob2> oh you mean super manual. i don't think that is documented
[17:26] <sstan> I've tried that, but I'd like to see if I can deploy with raw ceph commands
[17:27] <mikedawson> sstan: mkcephfs is the traditional method, and ceph-deploy is going to be the way in the future
[17:27] <iggy> yeah, you'll likely have to piece it together from the docs, reading what the scripts do, and maybe looking at the chef stuff
[17:27] <sstan> exactly ... the information is dispersed
[17:29] <mikedawson> sstan: if you use mkcephfs on your first node, it is pretty easy to manually issue commands to add additional OSDs and MONs from that point
[17:30] <iggy> there's not a "how to setup ceph the really hard way" document
[17:30] <mikedawson> iggy: ha
[17:31] <mikedawson> any inktank devs around to look at http://pastebin.com/dU2rcDEm ? Getting this assert starting 3 of my OSDs
[17:31] <sstan> hah
[17:45] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:50] * fghaas seconds mikedawson's question
[17:51] <mikedawson> fghaas: same issue?
[17:51] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[17:51] <fghaas> sstan: you mean like this? http://ceph.com/docs/master/dev/mon-bootstrap/
[17:52] <fghaas> and then there's http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ for manually installing osds
[17:52] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[17:52] <sstan> hmm that looks good. I hadn't looked at what's in the development section
[17:53] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Read error: Connection reset by peer)
[17:53] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[17:53] <sstan> thanks
[17:53] <fghaas> mikedawson: well I wasn't sure if it was you or some other person that posted that exact assertion failure the other day, and I wonder what's behind it
[17:57] <mikedawson> fghaas: gotcha
[17:59] * chutzpah (~chutz@ has joined #ceph
[18:01] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[18:01] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[18:01] * sbadia (~sbadia@yasaw.net) has joined #ceph
[18:02] * gaveen (~gaveen@ has joined #ceph
[18:03] * sbadia (~sbadia@yasaw.net) Quit ()
[18:03] * madkiss (~madkiss@ Quit (Quit: Leaving.)
[18:03] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:05] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[18:05] * sbadia (~sbadia@yasaw.net) has joined #ceph
[18:09] * danieagle (~Daniel@ has joined #ceph
[18:16] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) has joined #ceph
[18:17] * The_Bishop (~bishop@e179009205.adsl.alicedsl.de) Quit (Quit: Who the hell is this Peer? If I catch him, I'll reset his connection!)
[18:17] <mikedawson> fghaas: this is my first time seeing this issue, but it looks like someone else has seen it http://www.tracker.newdream.net/issues/3770
[18:21] <fghaas> mikedawson: so it seems, yes
[18:25] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:28] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[18:28] <jmlowe> sjust: you around?
[18:28] * sbadia (~sbadia@yasaw.net) Quit (Quit: WeeChat 0.3.8)
[18:29] * sbadia (~sbadia@yasaw.net) has joined #ceph
[18:29] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:35] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[18:37] <mikedawson> jmlowe: he's a wanted man
[18:39] <kylehutson> Trying to get rgw working - seeing the issue mentioned at http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/8090 - answer was to "run radosgw manually" - but how to do that?
[18:39] <kylehutson> I don't have a radosgw init script (not running Ubuntu)
[18:39] <paravoid> Aaron, care to comment? Shall I open a new bug report against core or should we track this here?
[18:39] <paravoid> argh
[18:39] <paravoid> sorry, wrong paste
[18:40] <kylehutson> Downloaded the Ubuntu source, and found that it runs "radosgw … -f", but "-f" isn't mentioned anywhere in the docs or on the man page.
[18:40] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) Quit (Quit: Leaving.)
[18:41] * miroslav (~miroslav@66-117-145-163.lmi.net) has joined #ceph
[18:41] <kylehutson> When I run the command manually, it sits there for a few seconds, and then silently quits
[18:41] * tsygrl (~tsygrl314@c-75-68-140-25.hsd1.vt.comcast.net) Quit (Quit: Leaving)
[18:43] * sleinen1 (~Adium@2001:620:0:25:55f3:49eb:e431:8ed2) Quit (Quit: Leaving.)
[18:44] * sbadia (~sbadia@yasaw.net) Quit (Quit: WeeChat 0.3.8)
[18:45] * sbadia (~sbadia@ has joined #ceph
[18:47] * mattbenjamin1 (~matt@wsip-24-234-55-160.lv.lv.cox.net) has joined #ceph
[18:48] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[18:50] * mattbenjamin1 (~matt@wsip-24-234-55-160.lv.lv.cox.net) Quit ()
[18:51] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) has joined #ceph
[18:51] <jmlowe> I'm up to 4 inconsistent pg's, I'm hoping there is something useful here in terms of squashing this bug
[18:52] <joao> has anyone ever seen this happen during a teuthology run?
[18:52] <joao> ceph-fuse 84E 3.8E 81E 5% /tmp/cephtest/mnt.0
[18:56] * Aliens (~Alien@201-35-239-151.jvece701.dsl.brasiltelecom.net.br) has joined #ceph
[18:56] * Aliens (~Alien@201-35-239-151.jvece701.dsl.brasiltelecom.net.br) Quit (Remote host closed the connection)
[18:57] * The_Bishop (~bishop@2001:470:50b6:0:2d34:6844:9a8f:4304) has joined #ceph
[18:59] * mattbenjamin (~matt@wsip-24-234-55-160.lv.lv.cox.net) Quit (Ping timeout: 480 seconds)
[19:05] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) has joined #ceph
[19:08] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[19:10] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[19:14] * sjustlaptop (~sam@2607:f298:a:697:3162:16db:f5bc:2399) has joined #ceph
[19:14] * miroslav (~miroslav@66-117-145-163.lmi.net) Quit (Read error: No route to host)
[19:15] * miroslav (~miroslav@66-117-145-163.lmi.net) has joined #ceph
[19:19] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:22] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[19:35] <yehudasa> elder / joao: channel title can be updated
[19:36] <AaronSchulz> paravoid: hm?
[19:38] * joao changes topic to 'v0.56.1 has been released -- http://goo.gl/4OJw6 || argonaut v0.48.3 released -- http://goo.gl/80aGP || argonaut vs bobtail performance preview -- http://goo.gl/Ya8lU'
[19:38] <yehudasa> joao: thanks
[19:40] * sjustlaptop (~sam@2607:f298:a:697:3162:16db:f5bc:2399) Quit (Ping timeout: 480 seconds)
[19:40] * miroslav (~miroslav@66-117-145-163.lmi.net) Quit (Quit: Leaving.)
[19:47] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[19:49] * madkiss (~madkiss@p57A1CFCD.dip.t-dialin.net) has joined #ceph
[19:54] <gregaf> loicd: probably your best bet for teuthology code coverage is when Josh gets back from vacation since he set that all up; I'll point him at you on Monday
[19:55] <gregaf> madkiss: stale means the monitor hasn't gotten a report on that PG in a while; if you just created the cluster and you have a bunch of PGs it probably means the OSDs are taking a while to create them
[19:56] <madkiss> hu?
[19:56] <madkiss> what?
[19:56] * BManojlovic (~steki@ has joined #ceph
[19:57] <loicd> gregaf: thanks for the hint, much appreciated :-) Have a nice week-end !
[19:57] <madkiss> gregaf: what's happening with bobtail right now is that as soon as one node with 14 OSDs bails out of the cluster, the two remaining MONs lose their connectivity, then both thinking that they're out of quorum (although there is no load worth mentioning present, neither i/o wise nor network wise nor CPU wise)
[20:01] <gregaf> mikedawson: Sam's diagnosed it and has a fix planned, update coming to the bug shortly
[20:02] <fghaas> madkiss: I think you need to tell gregaf that all your 3 osd nodes are also mons, which is not 100% evident from context
[20:02] <madkiss> fghaas: correct, thanks
[20:02] * miroslav (~miroslav@sjc-static- has joined #ceph
[20:02] <gregaf> I was responding to
[20:02] <gregaf> [11:10] <madkiss> what's wrong when after cluster creating, i get a lot of "stale+peering" PGs?
[20:02] <gregaf> [11:10] <madkiss> totally unpredictable?
[20:02] <madkiss> aah
[20:03] <madkiss> I see. Well, that was argonaut earlier today. and there was no change after about 20 minutes w/ regards to the number of "stale+peering" PGs in there
[20:03] <gregaf> are you saying you have three nodes, that each have a single monitor and a bunch of OSDs?
[20:04] <gregaf> and you kill one node?
[20:04] <gregaf> if your monitors were out of the quorum you wouldn't be able to do anything; is that the case?
[20:04] <gregaf> fghaas: have you seen any bad performance reports apart from RBD? I've seen that one but not anything else
[20:05] <madkiss> three nodes, each having a single monitor and a bunch of OSDs. I kill a node, the two remaining MONs go out of quorum and the cluster hangs (e.g. a VM running on top of it) stops working right away.
[20:05] <fghaas> gregaf: yeah, the one madkiss is describing to you right now :)
[20:06] <madkiss> we even have separated networks on this one (for client and OSD traffic)
[20:06] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[20:06] <gregaf> fghaas: it doesn't count as a performance regression until more than one person sees it ;)
[20:06] <gregaf> madkiss: is that the state it's currently in?
[20:07] * scheuk (~scheuk@ has joined #ceph
[20:07] <fghaas> gregaf: (12:03:47 PM) dec: hmm; performance/latency has gone to crap since going to 0.56 (from 0.53) earlier this week.
[20:07] <fghaas> (12:04:12 PM) dec: the OSDs have been constantly doing 10x the disk reads/writes they previously were
[20:08] <gregaf> and he changed several different things with the way his disks were set up :/ (not trying to weasel out of it, there was just a lot more to that conversation)
[20:08] <gregaf> but yeah, that report concerns me a bit
[20:08] <madkiss> gregaf: well, right now it's in an "all-well state" again, we have re-powered on the third node before the weekend, but we have access to the machines
[20:08] <madkiss> and it was fairly reproducible
[20:08] <gregaf> hmm
[20:08] <scheuk> has anyone upgraded from 0.48.2 to 0.56.1 yet? Have you seen any issues yet?
[20:09] <madkiss> i'll have dinner now, should be back in 30 minutes
[20:09] <gregaf> madkiss: it almost sounds like the remaining two monitors can't talk to each other, otherwise they would be joining up
[20:09] <gregaf> you should check their communications next time it happens
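The reason gregaf expects the two surviving monitors to keep working is that a monitor quorum needs a strict majority of the configured monitors. A minimal sketch of that rule follows; the function name is an illustrative assumption and this is not Ceph's election code.

```python
def has_quorum(reachable: int, total: int) -> bool:
    """True when the reachable monitors form a strict majority of all monitors."""
    return reachable > total // 2


if __name__ == "__main__":
    # Three monitors, one node killed: two of three is still a majority.
    print(has_quorum(2, 3))
    # If the two survivors cannot talk to each other, each sees only itself.
    print(has_quorum(1, 3))
```

So in madkiss's three-node setup, losing quorum after a single node failure points at the remaining two monitors failing to reach each other rather than at the node count.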
[20:09] <madkiss> they *are* joining up again and again
[20:10] <gregaf> otherwise if you can get a log with "debug ms = 1" and "debug mon = 10" that would be helpful to diagnose
[20:10] <madkiss> sure
[20:10] <madkiss> shouldn't be a problem
[20:11] <madkiss> dinner, brb
[20:12] * madkiss (~madkiss@p57A1CFCD.dip.t-dialin.net) Quit (Quit: Leaving.)
[20:13] <sander> hiya
[20:13] <sander> any teuthology experts on here?
[20:15] <fghaas> sander: loicd has been digging into teuthology all day, but he left just 10 minutes ago
[20:16] <sander> cool, thanks for letting me know
[20:16] <sander> trying to follow the readme and it doesn't match what I'm seeing
[20:16] <fghaas> gregaf: I'll be hopping on the systems madkiss has been dealing with and give it a shot to reproduce, just so we have an extra pair of eyes on this
[20:16] <sander> will scurry up some help elsewhere
[20:19] <gregaf> cool, thanks fghaas
[20:20] <noob2> ceph seems to work out of the box with ovirt 3.1
[20:20] <noob2> you tell it to use local storage on the host and attach it like any other disk
[20:21] <gregaf> sander: try contacting slang1; he's been doing more work in teuthology tasks than with setup and admin but he or Dan are good options right now
[20:21] <jmlowe> scheuk: if you use rbd my advice would be to NOT do it, right now I'm losing data with 0.56.1
[20:21] <sander> will do greg, thanks. I forgot josh was away.
[20:21] * slang1 waves
[20:27] * jjgalvez (~jjgalvez@ has joined #ceph
[20:27] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[20:27] <scheuk> jmlowe: what are you seeing?
[20:28] <scheuk> I am using RBD pretty much exclusively
[20:28] <scheuk> with a little cephfs
[20:28] <jmlowe> occasional truncated objects making for inconsistent pg's
[20:29] * mikey (~mikey@catv-213-222-190-74.catv.broadband.hu) Quit (Read error: Connection reset by peer)
[20:29] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Read error: Operation timed out)
[20:29] * Ryan_Lane (~Adium@ has joined #ceph
[20:29] * sleinen (~Adium@2001:620:0:25:39d5:eb8a:b24f:aec) has joined #ceph
[20:30] <scheuk> jmlowe: that's not good, I saw that in 0.48.2 if an OSD crashed
[20:30] <scheuk> or the hardware under the OSD crashed
[20:34] * mikey (~mikey@catv-213-222-190-74.catv.broadband.hu) has joined #ceph
[20:39] <jmlowe> scheuk: I'm seeing that right now with healthy osd's that aren't going down, I haven't had it outside of qemu to the best of my knowledge
[20:41] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[20:45] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:47] * madkiss (~madkiss@p57A1CFCD.dip.t-dialin.net) has joined #ceph
[20:49] * sleinen (~Adium@2001:620:0:25:39d5:eb8a:b24f:aec) Quit (Quit: Leaving.)
[20:52] * dmick (~dmick@2607:f298:a:607:9528:e89b:6c31:616) has joined #ceph
[20:52] <dwm37> Howdy, world.
[20:52] <dwm37> I think I'm seeing a block-size problem when mounting a CephFS filesystem.
[20:53] <dwm37> `df` is returning bogus values.
[20:54] <dwm37> ... while `ceph -s` indicates 1059GB available on this toy system, `df -h` instead returns 4.2GB.
[20:56] <dwm37> Looks like it's out by a factor of 256. That would suggest CephFS is using a block size of 128kb?
[20:58] <dwm37> (This is using the Debian packages from ceph.com/debian-testing; 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7))
[21:00] <dmick> dwm37: I know there's some funniness in space accounting, on purpose, but I don't remember the details and searching is failing me
[21:01] <dwm37> I can actually write files larger than the FS, but it causes issues for e.g. Samba re-exports -- clients refuse to try to write files larger than the claimed filesystem size.
[21:03] <dwm37> (This *used* to work fine; perhaps the 3.7 trunk Debian kernel image is causing issues?)
[21:13] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[21:19] * danieagle (~Daniel@ Quit (Quit: See you later :-) and thank you very much for everything!!! ^^)
[21:23] <gregaf> dwm37: CephFS uses a non-default block size in order to be able to report large sizes
[21:23] <fghaas> gregaf: if you had a cluster where all osds were up, some osds were previously marked out, you marked them back in, then after some time your peering count would stop decreasing and you had 266 PGs peering, and you checked one with ceph pg <pg> query, and their peering_blocked_by lists were empty, and then after a while the peering count went down only to get stuck again at 125, where would you start looking?
[21:24] <dwm37> gregaf: Yeah, I read. Seems like the reporting of that block size to the kernel is going awry.
[21:24] <gregaf> it's not been a problem on any of our systems but I've seen a couple reports of gentoo toolchains getting it wrong and one of some sort of samba or cifs re-export not handling it properly
[21:25] <gregaf> about all I can tell you is that that's where your problem is and if you manage to track it down to something more specific and report back we would be grateful so we can report it upstream or wherever
[21:26] <gregaf> fghaas: I would look at sjust if I'd concluded that it actually was stuck
[21:26] <sjust> fghaas: version?
[21:27] <fghaas> 0.56.1-1precise (i.e. your latest packaged build for ubuntu 12.04 iiuc)
[21:28] <sjust> k
[21:28] <sjust> fghaas: can you package up the output of ceph pg query for a random 4 pgs?
[21:28] <sjust> fghaas: do you think you can reproduce with logging?
[21:28] <fghaas> sec. now, magically, at some point it's gone unstuck and my cluster is healthy again, but I'm sure I can reproduce
[21:29] <fghaas> what log levels do you need?
[21:29] <sjust> osd 20, filestore 20, ms 1
[21:30] <sjust> fghaas: depending on the circumstances, being in peering for a minute or two might just be a backed up filestore
[21:31] * miroslav (~miroslav@sjc-static- Quit (Ping timeout: 480 seconds)
[21:32] <fghaas> sjust: stand by
[21:33] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[21:33] <janos> is there some way to rebalance how much each osd (disk in this case) is storing? i have two hosts, 3 osd's each. one host is pretty evenly full across osd's and another is 53%, 48%, 68%
[21:33] <janos> it's that 68% one i worry about
[21:34] <jmlowe> sjust: I've had 2 more pg's go inconsistent since yesterday
[21:34] <janos> it stays ahead of all other osd's as i load more in
[21:34] <gregaf> janos: what are you using the cluster for, and what are your pg counts in each pool?
[21:35] <janos> my pg counts are about 256
[21:35] <sjust> jmlowe: refresh my memory, you haven't had any recent osd failures, and you recently upgraded from 0.56 to 0.56.1?
[21:35] <jmlowe> sjust: correct
[21:35] <janos> gregaf: large files. iso's movies
[21:35] <gregaf> janos: I meant what interface are you using — the gateway, CephFS, librados?
[21:35] <janos> 3 pools, x2 replication. each with 256 pg's
[21:36] <janos> ah
[21:36] <janos> RBD
[21:36] <janos> latest fedora rpm version
[21:36] <fghaas> sjust: ok, what I've done now was this: marked all 12 osds on one node out. after that, peering continues in an acceptable time. bringing them back in, peering gets dead slow
[21:36] <gregaf> janos: okay, so you're actually only using the RBD pool and its 256 placement groups, which is a bit low to get a statistically even distribution over 6 devices
[21:36] <gregaf> if you could transition to a pool with more PGs that ought to work
[21:37] <janos> will certainly give that a shot
[21:37] <sjust> jmlowe: at what version did this cluster start?
[21:37] <fghaas> sjust, so would you like for me to do that pg query for a random 4 PGs?
[21:37] <janos> keep with the power of 2?
[21:37] <janos> for size
[21:37] <gregaf> janos: but if you just want a reasonable band-aid you can also reduce the CRUSH weight on the OSD which has the extra data
[21:37] <sjust> fghaas: maybe, but I think it's just a slow filestore
[21:37] <janos> naw i don't mind making a new pool and trying that out
[21:37] <sjust> or rather, slightly overloaded filestore
[21:37] <jmlowe> sjust: 0.48
[21:38] <sjust> rbd?
[21:38] <janos> btw, i moved from on-osd journals to dedicated ssd partitions yesterday. what a difference
[21:38] <sjust> jmlowe: ^
[21:38] <gregaf> we generally recommend you do 100 placement groups per OSD; the power of 2 thing isn't that important
[21:38] <jmlowe> sjust: inconsistent pg's started after 0.56.1, yes using rbd
[21:38] <gregaf> janos: ^ and yes, SSD journals reduce burst latency quite a lot!
[21:38] <sjust> jmlowe: can you re-scrub one of the inconsistent pgs?
[21:38] <janos> gregaf: sounds good. i hear the recommendation on that ping pong around a bit
[21:39] <janos> no harm trying!
[21:39] <sjust> jmlowe: if it's a bug in scrubbing itself, it shouldn't detect the error again
[21:39] <gregaf> nhm freaks out about it occasionally, but if you have a sufficient number to begin with it's just fine if not a power of 2 ;)
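Taking gregaf's rule of thumb literally, a rough placement group target can be worked out from the OSD count. This is only a sketch of that arithmetic: the function name and the optional rounding to a power of two are illustrative assumptions, and the rule itself is the "100 placement groups per OSD" figure quoted above, nothing more precise.

```python
def suggested_pg_count(num_osds: int, pgs_per_osd: int = 100,
                       round_to_power_of_two: bool = False) -> int:
    """Return num_osds * pgs_per_osd, optionally rounded up to a power of two."""
    target = num_osds * pgs_per_osd
    if not round_to_power_of_two:
        return target
    power = 1
    while power < target:
        power *= 2
    return power


if __name__ == "__main__":
    # janos has two hosts with three OSDs each.
    print(suggested_pg_count(6))
    print(suggested_pg_count(6, round_to_power_of_two=True))
```

For six OSDs this gives a target well above the 256 placement groups janos is currently using, which fits gregaf's remark that 256 is a bit low for an even distribution.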
[21:39] <jmlowe> sjust: sure, would getting on the cluster be at all helpful to you?
[21:39] <sjust> we'll see, if the error goes away, then I have a good idea of where to look
[21:40] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[21:42] <fghaas> sjust. ok, well my problem is not only do I have a bunch of PGs peering, I also have over a thousand stuck and unclean -- which basically means I now have degraded functionality where I really ought not to have it, no?
[21:43] <jmlowe> sjust: ok, pretty sure it's not a scrub problem
[21:43] <jmlowe> /data/osd.8# ls -l ./current/2.1b5_head/DIR_5/DIR_B/DIR_7/rb.0.1edd.74b0dc51.00000000082b__head_4467A7B5__2
[21:43] <jmlowe> -rw-r--r-- 1 root root 3792896 Jan 10 15:05 ./current/2.1b5_head/DIR_5/DIR_B/DIR_7/rb.0.1edd.74b0dc51.00000000082b__head_4467A7B5__2
[21:43] <jmlowe> /data/osd.0# ls -l ./current/2.1b5_head/DIR_5/DIR_B/DIR_7/rb.0.1edd.74b0dc51.00000000082b__head_4467A7B5__2
[21:43] <jmlowe> -rw-r--r-- 1 root root 4194304 Jan 10 15:05 ./current/2.1b5_head/DIR_5/DIR_B/DIR_7/rb.0.1edd.74b0dc51.00000000082b__head_4467A7B5__2
[21:44] <jmlowe> 2.1b5 398 0 0 0 1641009152 149940 149940 active+clean+inconsistent 2013-01-11 09:22:28.657291 4331'6420 4316'16923 [8,0] [8,0] 4331'6420 2013-01-11 09:22:28.657243 1466'4227 2013-01-08 23:28:39.629722
[21:44] <jmlowe> sjust: those really should be the same size right?
[21:44] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[21:45] <sjust> jmlowe: yeah...
[21:45] <jmlowe> sjust: I'm thinking the primary is truncated since it should be closer to a 4MB object since it's part of an rbd image
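The comparison jmlowe did by hand with ls can be sketched as a small check over the per-OSD sizes of one object. The sizes below are the ones from his listing (4194304 bytes is exactly 4 MiB); the function name and the dictionary layout are assumptions made for illustration.

```python
def compare_replicas(sizes: dict) -> dict:
    """Map each OSD whose copy is shorter than the largest copy to the missing byte count."""
    largest = max(sizes.values())
    return {osd: largest - size for osd, size in sizes.items() if size != largest}


if __name__ == "__main__":
    # On-disk sizes of the same object in pg 2.1b5, as listed by jmlowe.
    sizes = {"osd.8": 3792896, "osd.0": 4194304}
    print(compare_replicas(sizes))  # {'osd.8': 401408}
```

A size difference only shows that the copies disagree. Which copy is correct still has to be worked out from the logs, as sjust goes on to ask.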
[21:47] * sleinen1 (~Adium@2001:620:0:26:a9f9:9e00:60e9:2327) has joined #ceph
[21:49] <sjust> jmlowe: most likely, but the primary and the replica fundamentally just apply the same transactions. Usually, this sort of error indicates an OSD failure causing filestore corruption or a bug during recovery.
[21:49] <sjust> the fact that this happened while all osds were healthy is very strange
[21:50] <sjust> can you post your mon log?
[21:50] <gregaf> are we sure it happened while they were healthy, and not just that it was detected while they were healthy?
[21:50] <sjust> it would be helpful to know when the last scrub happened on that pg
[21:52] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[21:56] <dwm37> Okay, yup, filesystem size misreporting matches. Ceph reports a 1MB block-size; however, something's going wrong and the kernel is assuming a 4k block-size instead.
[21:57] <dwm37> The numbers I'm seeing work for exactly that case.
[21:58] <jmlowe> gregaf: all the osd's have been running since 12/8 inconsistent pg's between 12/9 and 12/11
[21:58] <jmlowe> gregaf: ceph health is good except for inconsistent pg's
[21:58] <fghaas> gregaf: just managed to reproduce exactly the bug that madkiss mentioned earlier. "out" all osds on a given node, boom all your mons get flaky
[21:59] <fghaas> kinda cool to watch, but not quite so impressive from the user's viewpoint
[22:00] <fghaas> the interesting part is, you leave _one_ osd in and the issue doesn't trigger
[22:01] <gregaf> fghaas: ah, it doesn't require a power cut either?
[22:02] <fghaas> nope, evidently it doesn't matter whether the osds are up or down, you just mark all of them out and boom
[22:02] <jmlowe> sjust: get those logs?
[22:03] <fghaas> they don't really go down though, they just stop talking and eventually they decide they're no longer quorate
[22:04] <gregaf> fghaas: you got logs?
[22:04] <fghaas> then they might come back after a while, for a few seconds
[22:04] <gregaf> I wonder if they're just going through too much disk activity
[22:05] <gregaf> I think I asked madkiss before, but are they maybe sharing disks with the OSD logs (and what levels are the OSD logs at)?
[22:05] <fghaas> they are, yes, and the osd logs are probably rather hefty as per sjust's recommendation earlier
[22:06] <gregaf> not that it should matter, but the options are basically that the disks are somehow so slow that the monitors are getting stuck, or that they're somehow burning up so much CPU time that they can't handle it
[22:06] <noob2> does anyone remember what kernel and forward you need for ceph? i thought it was 2.6.32?
[22:06] <gregaf> obviously neither of those should be able to happen, but I've not seen this behavior before even on very large clusters
[22:06] <gregaf> noob2: you mean for the kernel clients?
[22:06] <noob2> yeah
[22:06] <noob2> for the rbd module
[22:06] <gregaf> those went in as of 2.6.34 for CephFS and 2.6.37 I think for rbd
[22:07] <noob2> ok
[22:07] <gregaf> but I don't know how good a state they're in; a lot of bugs got fixed in later cycles
[22:07] <noob2> yeah i know
[22:07] <noob2> i should prob recommend kernels 3+ for the rbd kernel usage
[22:07] <noob2> i'm writing up some docs for work
[22:08] <gregaf> yeah; I think elder really wants people to have 3.4 (I think he still backports rbd fixes to there)
[22:08] <noob2> gotcha
[22:08] <sjust> jmlowe: looking
[22:08] <noob2> and ubuntu 12.04 has backports of fixes also right?
[22:09] <gregaf> not sure what the policy is there, but I think it's running post-3.4 so it should be?
[22:09] <jmlowe> gregaf: oh, that's good to know, I didn't think any ceph stuff was being backported
[22:10] <noob2> i think 12.04 is running 3.2.0-35 or something
[22:10] <gregaf> jmlowe: it's fairly new and I really want elder for the details (not even sure if it's a thing or just he wanted to make it available)
[22:10] <fghaas> gregaf: ok. let me dig into that
[22:11] <dwm37> Okay. "stat -f /mnt/ceph" shows different values for "Fundamental block size" and "block size" -- 4k for the former, 1MB for the latter.
[22:12] <gregaf> fghaas: I'd go through the logs and see what they say when exiting quorum (generally it'll be the leader not talking to them recently enough, or the leader ending it because he didn't get an accept response fast enough), then go look at the corresponding time log for the naughty daemon and see what it was doing
[22:12] <gregaf> hopefully it's obvious...
[22:12] <elder> I have backported stuff to 3.4, gregaf. I haven't checked to see if it's going into the stable tree (yet)
[22:12] <fghaas> gregaf: obvious? forget that :)
[22:12] <sjust> jmlowe: hmm, none of the three had been scrubbed cleanly since the oldest log
[22:13] <dwm37> But, on this kernel, at least, df is using the fundamental block size for calculating total filesystem size.
[22:13] <elder> Looks like my changes did not get into 3.4.24.
[22:13] <sjust> so, these started appearing 2 months ago?
[22:13] <gregaf> backwards date, sjust ;)
[22:13] <gregaf> at least I'm assuming
[22:14] <sjust> oops, 3 months
[22:14] <gregaf> oh wait, I'm just wrong, somehow had it in my head they were for January
[22:17] <dwm37> ... huh. Let's see what happens when I downgrade coreutils ...
[22:17] <jmlowe> sjust: asking me?
[22:19] <gregaf> dwm37: that is interesting…I'm not too familiar with these reporting interfaces but googling indicates maybe they both need to be increased in size
[22:19] <noob2> if i would like my crush map to store replicas on different racks should i change my 'step chooseleaf firstn 0 type host' -> 'step chooseleaf firstn 0 type rack' ?
[22:20] <dwm37> gregaf: I may have updated my coreutils roughly at the same time I first started seeing this problem.
[22:20] <gregaf> noob2: yep!
[22:20] <noob2> awesome
[22:20] <gregaf> assuming you've defined racks with hosts in them, anyway
[22:20] <dwm37> So I'm trying that first, just in case I've been maligning Ceph unfairly..
[22:20] <gregaf> and hopefully you have more racks than you want replicas
[22:20] <noob2> so here's my next question. if i only have 2 racks that i defined and want replica= 3 what would it do?
[22:20] <jmlowe> sjust: the affected rbd images were created Jan 9, 2013, I had December on the brain when I referenced earlier dates, they should have been January not December
[22:20] <noob2> gregaf: haha that was my next question
[22:20] <dwm37> gregaf: ding-ding-ding, bingo, we have a winner.
[22:21] <dwm37> The version of `df` provided by coreutils 8.13 (current Debian Testing) reports 1.1TB correctly.
[22:21] <dwm37> The version of `df` provided by coreutils 8.20 (current Debian unstable) reports 4.2GB incorrectly.
[22:21] <gregaf> noob2: ah, that's trickier and not easy to do automatically
[22:22] <noob2> ok so if i only have 2 racks i shouldn't do by rack if i want replica 3
[22:22] <noob2> i'll leave it on host
[22:22] <noob2> i think when i add more hosts i'll have a 3rd rack to specify
[22:23] <sjust> jmlowe: osd logs from when it happened would help a lot
[22:23] <dwm37> So, modern versions of df and ceph, at least, disagree on how to report filesystem sizes.
[22:23] <gregaf> noob2: yeah, unfortunately
[22:23] <sjust> any chance you could try to reproduce with filestore/osd logging at 20 and ms at 1?
[22:23] <gregaf> if you had enough different pools and things you could do stuff like have some pools put two copies in rack1 and one copy in rack2, and some pools do it opposite
[22:23] <gregaf> but that's about it right now
[22:24] <noob2> gregaf: that's ok. if i change the crushmap in the future to use racks will it cause it to thrash about and redistribute everything?
[22:24] <gregaf> dwm37: I'm googling a bit and running git grep in the kernel source
[22:24] <gregaf> noob2: oh yes
[22:24] <noob2> ok haha
[22:24] <noob2> i was thinking it would do that
[22:24] <gregaf> dwm37: it looks like Linux got f_frsize fairly recently and previously only had f_bsize
[22:24] <dwm37> Aha.
[22:25] <dwm37> Hmm, Samba exports have similar issues.
[22:25] <dwm37> So it's presumably using the same interfaces.
[22:25] <gregaf> I don't think we set f_frsize at all so there are probably some tools or interfaces which set it to a default size rather than just copying the f_bsize :(
[22:26] <dwm37> gregaf: f_frsize is set to PAGE_CACHE_SIZE
[22:26] <dwm37> gregaf: See: http://lxr.free-electrons.com/source/fs/ceph/super.c?a=mips#L75
[22:26] <gregaf> oh, haha
[22:26] <dwm37> (Google found the MIPS LXR first, it seems..)
[22:28] <dwm37> Reportedly, the number of blocks and blocks free refer to f_frsize units, not f_bsize.
[22:28] <dwm37> f_bsize is used to set the optimal transfer block size.
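A minimal Python sketch of the mismatch described above. It assumes illustrative values (a 1 MiB f_bsize and a 4 KiB f_frsize); neither number is stated in the log, and the helper names are hypothetical. It shows how one block count yields either roughly 1.1 TB or roughly 4.3 GB depending on which unit a tool multiplies by.

```python
import os
from collections import namedtuple

# Total size computed under each interpretation of the block count.
Sizes = namedtuple("Sizes", ["by_frsize", "by_bsize"])


def total_sizes(f_blocks, f_frsize, f_bsize):
    """Return the filesystem size in bytes under both interpretations.

    by_frsize: f_blocks counted in f_frsize units (what the statvfs
               documentation describes).
    by_bsize:  f_blocks counted in f_bsize units (what a tool gets if
               it multiplies by the preferred I/O block size instead).
    """
    return Sizes(f_blocks * f_frsize, f_blocks * f_bsize)


def total_sizes_for_path(path):
    """Apply total_sizes() to a mounted filesystem (Unix only)."""
    st = os.statvfs(path)
    return total_sizes(st.f_blocks, st.f_frsize, st.f_bsize)


if __name__ == "__main__":
    # Assumed, illustrative values: 2**20 blocks, 4 KiB fragment size,
    # 1 MiB preferred block size.
    example = total_sizes(f_blocks=1 << 20, f_frsize=4096, f_bsize=1 << 20)
    print("by f_frsize:", example.by_frsize, "bytes")
    print("by f_bsize: ", example.by_bsize, "bytes")
```

On a local filesystem the two fields are usually equal, so `total_sizes_for_path("/")` will typically return the same number twice; the discrepancy only appears when a filesystem reports different values for the two.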
[22:30] <gregaf> dwm37: I see that in a random website on the internet but would love to find it in kernel source somewhere
[22:30] <gregaf> gah
[22:38] <noob2> gregaf: holy christ this is causing my cluster to thrash
[22:38] <dwm37> gregaf: Relevant coreutils commit: http://git.savannah.gnu.org/cgit/coreutils.git/commit/src?id=0863f018f0fe970ffdb9cc2267a50c018d3944c5
[22:39] <jmlowe> sjust: restarting osd's with debugging turned up
[22:39] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[22:39] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[22:39] <jmlowe> sjust: I'll let you know if I can capture something in the osd logs
[22:39] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:40] <sjust> jmlowe: unfortunately, the part I need logging on is the event that caused the inconsistency, rather than the scrub
[22:40] <sjust> so you'd have to create a new rbd image and hope it goes inconsistent
[22:41] <gregaf> dwm37: thanks; I've created http://tracker.newdream.net/issues/3793 and 3794 to track this in kernel- and user-space
[22:41] <jmlowe> sjust: I think I can do that
[22:42] <gregaf> I suspect it'll just be making the change but I don't want to do it in a hurry without researching a bit more and letting other people take some responsibility for it ;)
[22:42] <gregaf> noob2: yes; you're asking it to move all the data it holds if you change the CRUSH rules….
[22:44] <wer> is the performance of radosgw and ceph extremely crazy sick when it comes to completing a get on a 1MB(ish) file? I am getting unreasonably crazy fast results. This is sort of sick... and dangerous.
[22:45] <wer> ~96 osd's and 4 nodes..... not even 10gig yet....
[22:45] <sjust> wer: as in low latency?
[22:45] <wer> yup and high throughput.
[22:45] <sjust> if your working set is tiny, you are reading from osd page cache most likely
[22:46] <sjust> that does tend to help
[22:46] <wer> ahhh. When will I begin bumping into not hitting that cache.... err where is it?
[22:46] <dwm37> gregaf: Makes sense! I think I've found some authoritative details in the statvfs manpages; I'll update the ticket.
[22:46] <sjust> well, the osds use a standard linux fs (xfs in your case) and so the standard linux page cache
[22:47] <gregaf> dwm37: thanks a bunch!
[22:47] <wer> ok sjust, that makes sense. well, WOW for the moment anyway.
[22:51] <sjust> wer: what kind of latency and throughput are you seeing?
[22:51] <sjust> also, spinning or ssd?
[22:51] <wer> sjust: 35 to 50 ms request times... end to end, 1MB file.
[22:52] <wer> spinny
[22:52] <sjust> ah, that I would characterize as "not super great"
[22:52] <sjust> that's actually reasonable even if the object isn't in cache
[22:54] <wer> This is only an object store... not running an mds... still not sure what that does completely.
[22:55] <dwm37> wer: An object store is like a filesystem, minus a few features.
[22:55] <dwm37> wer: Not having to implement those features makes scalability *much* easier.
[22:56] <wer> yes. :) I think that is why I didn't do it. But would the mds improve access latency in this case?
[22:57] <gregaf> not likely
[22:57] <gregaf> and it's significantly less stable
[22:57] <wer> ahhh ok.
[22:59] <wer> Writes are waaay slower though. Pretty much an order of magnitude. 400ms end to end and I can not get the throughput on simultaneous writes above 100mb.... but we have not scaled out our client horizontally yet. But am thinking the OSD itself would limit throughput per request... but I seem to never get more than 100mbps no matter what I do.
[23:00] <dwm37> wer: Is that megabytes or megabits?
[23:00] <gregaf> use of radosgw means a single write actually requires several (don't recall the exact number) of sequential disk writes, unfortunately
[23:01] <gregaf> it should scale horizontally with both the number of clients and (once it becomes a bottleneck) the number of radosgw daemons, though!
[23:02] <gregaf> and if you aren't saturating a single client's bandwidth, try issuing requests in parallel; I don't know the exact radosgw limits (yehudasa?) but they're well above 100mbps!
[23:02] <dmick> mb: millibits
[23:02] <dmick> Mb: megabits
[23:02] <dmick> MB: megabytes
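A small Python sketch of the unit arithmetic behind this exchange. It assumes decimal prefixes (1 MB = 1,000,000 bytes) and uses hypothetical helper names; it is only meant to show why the bits-versus-bytes distinction changes the numbers by a factor of eight.

```python
BITS_PER_BYTE = 8


def megabits_to_megabytes(megabits):
    """Convert a rate or size in megabits (Mb) to megabytes (MB)."""
    return megabits / BITS_PER_BYTE


def transfer_rate_mbps(size_bytes, seconds):
    """Rate in megabits per second (Mbps) for size_bytes moved in seconds."""
    return size_bytes * BITS_PER_BYTE / seconds / 1_000_000


if __name__ == "__main__":
    # 100 Mbps expressed in megabytes per second.
    print(megabits_to_megabytes(100), "MB/s")
    # A 1 MB object fetched in 50 ms and in 35 ms, as single requests.
    print(transfer_rate_mbps(1_000_000, 0.050), "Mbps")
    print(transfer_rate_mbps(1_000_000, 0.035), "Mbps")
```

These are per-request rates only; aggregate throughput with many concurrent requests is a separate measurement.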
[23:03] <wer> gregaf: ~100mbps bits. And the sessions stack up, and concurrency stays at about 120.
[23:03] <dwm37> dmick: That's certainly the convention I prefer to use, but I don't want to take it for granted that everyone does. :)
[23:04] <dmick> every once in a while one has to wield the "say what you mean" sword
[23:05] <wer> I always do mbps and Mbps ... but then there is base 10 or base 8, and network vs throughput... take your picks
[23:05] <dmick> wer: so what you're saying is millibits per second and Megabits per second
[23:05] <dmick> mbps isn't very useful
[23:06] <wer> I have never worked with millibits man. Anyways, years ago I invented the reid system for just this problem.
[23:06] <wer> people that can't convert
[23:06] <wer> converting to a reid is simple.
[23:06] <wer> 1mbps = 1 reid
[23:07] <dmick> but you're still using millibits per second. *maybe* you mean 1Mbps = 1 reid
[23:07] <wer> Then you are free to make any conversion you want from there. So pretty much it is easy to convert speed to pounds... or millibits to MEGABits ;)
[23:08] <wer> cause everything can = 1 reid.
[23:08] <dmick> I cannot use your reid system for anything useful :)
[23:08] <wer> so yeah, I get 1reid of throughput and sessions stack up.
[23:08] <wer> bwah?! what?! I just did ;)
[23:09] <dmick> if you want to communicate, you sorta hafta farb the same shilnots that everyone else does
[23:09] <wer> I think the osd itself will limit throughput to 100Mbps per session.... but I have yet to go past 100Mbps period on writes... which isn't making sense to me.
[23:10] <wer> farbing?! looking that one up :)
[23:10] <dmick> it's okay. you can convert it to any word you want
[23:11] * jbarbee (17192e61@ircip3.mibbit.com) has joined #ceph
[23:12] * jbarbee (17192e61@ircip3.mibbit.com) has left #ceph
[23:16] <wer> Well I am going to ditch the tsung for a second and spawn a bash loop to see if the write concurrency changes :)
[23:25] <noob2> gregaf: you still around?
[23:25] * tnt (~tnt@86.188-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:26] * chutzpah (~chutz@ Quit (Remote host closed the connection)
[23:28] <Kioob> please : is there a way to disable scrubs ?
[23:29] * jjgalvez1 (~jjgalvez@ has joined #ceph
[23:29] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[23:29] <wer> "I don't want no scrubs...." -- sorry.
[23:30] <wer> I do not know offhand Kioob.
[23:30] <Kioob> I know it's not normal, but scrubs really add a lot of latency
[23:31] <Kioob> so, while I find out why, I would like to keep production running...
[23:31] <iggy> Kioob: I thought I saw something in one of the last couple of articles on my ceph feed that mentioned disabling them
[23:31] <dmick> there are several config opts that might work
[23:32] <Kioob> dmick: I didn't find any option about scrub :S
[23:32] <Kioob> maybe in source code ?
[23:32] <dmick> osd_max_scrubs and osd_scrub_load_threshold are the two I'm looking at
[23:32] <Kioob> great
[23:33] <dmick> it seems like if you set osd_max_scrubs to 0 it won't ever scrub
[23:33] <Kioob> osd scrub thread timeout & osd scrub finalize thread timeout
[23:33] <dmick> those are there too but don't look like what you want
[23:33] <Kioob> very good one
[23:33] <Kioob> I'll take that for now
[23:34] <Kioob> thanks
[23:34] <dmick> cheers
[23:35] * Cacolord (~Cacolord@dsl-173-248-192-146.acanac.net) has joined #ceph
[23:35] * Cacolord (~Cacolord@dsl-173-248-192-146.acanac.net) has left #ceph
[23:35] <fghaas> sjust, gregaf, looks like we found the culprit (two, really)... if you're still around in a few mins, we can share some details
[23:36] <nhm> fghaas: what we were talking about earlier?
[23:36] <fghaas> nhm: related, yes
[23:36] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:36] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:39] <dwm37> gregaf: Have added lots more notes to #3793; hope it helps!
[23:41] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[23:42] * vata (~vata@2607:fad8:4:6:221:5aff:fe2a:d1dd) Quit (Quit: Leaving.)
[23:42] * chutzpah (~chutz@ has joined #ceph
[23:43] <jmlowe> sjust: hey there we go, crashed osd
[23:44] <Kioob> great great. Bandwidth drops from 2Gbps to 10Mbps when I disable scrub... there is really a problem with scrub in my setup
[23:46] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) Quit (Ping timeout: 480 seconds)
[23:50] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[23:51] * jpieper_ (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Read error: Operation timed out)
[23:55] <gregaf> noob2: back
[23:56] <noob2> yeah i found a nasty bug with exporting rbd over lio fibre
[23:56] <gregaf> thanks dwm37
[23:56] <gregaf> fghaas: we're both here now
[23:56] <noob2> when the ceph cluster goes into remapping mode the kernel panics
[23:57] <noob2> i'm not sure why though. i can't scroll back far enough to see what the problem is
[23:57] <gregaf> noob2: you mean the LIO gateway machine that mounts the RBD volume kernel panicked after Ceph started shuffling data around due to the new CRUSH rules you set?
[23:57] <noob2> yup
[23:57] <noob2> you got it
[23:58] <noob2> i'm using ubuntu 12.10 on my gateways
[23:58] <noob2> latest everything
[23:59] <noob2> i see a bunch of messages saying WRITE_SAME w/o UNMAP bit not support for block discard emulation
[23:59] <noob2> and then it pops
[23:59] <gregaf> sorry, what's the message?

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.