#ceph IRC Log


IRC Log for 2012-09-10

Timestamps are in GMT/BST.

[0:04] * stass (stas@ssh.deglitch.com) Quit (Read error: Connection reset by peer)
[0:04] * stass (stas@ssh.deglitch.com) has joined #ceph
[0:10] * MarkN (~nathan@ has joined #ceph
[0:10] * MarkN (~nathan@ has left #ceph
[0:31] * jtang (~jtang@ Quit (Ping timeout: 480 seconds)
[0:39] * jtang (~jtang@ has joined #ceph
[0:46] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[0:59] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[1:07] * amatter_ (amatter@c-174-52-137-136.hsd1.ut.comcast.net) has joined #ceph
[1:13] * amatter (~amatter@ Quit (Ping timeout: 480 seconds)
[1:14] * amatter (~amatter@c-174-52-137-136.hsd1.ut.comcast.net) has joined #ceph
[1:19] * jtang (~jtang@ Quit (Ping timeout: 480 seconds)
[1:19] * amatter_ (amatter@c-174-52-137-136.hsd1.ut.comcast.net) Quit (Ping timeout: 480 seconds)
[1:21] * pentabular (~sean@adsl-70-231-131-129.dsl.snfc21.sbcglobal.net) Quit (Remote host closed the connection)
[1:36] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[1:36] * maelfius1 (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[2:04] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[2:26] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[2:42] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Connection reset by peer)
[2:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[2:49] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[2:59] * danieagle (~Daniel@ has joined #ceph
[3:11] * mistur (~yoann@kewl.mistur.org) Quit (Ping timeout: 480 seconds)
[3:14] * mistur (~yoann@kewl.mistur.org) has joined #ceph
[3:34] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[4:11] * amatter_ (~amatter@ has joined #ceph
[4:16] * amatter (~amatter@c-174-52-137-136.hsd1.ut.comcast.net) Quit (Ping timeout: 480 seconds)
[4:17] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[4:22] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[5:48] * nhmlap (~nhm@174-20-43-18.mpls.qwest.net) Quit (Read error: Operation timed out)
[5:53] * amatter_ (~amatter@ Quit (Ping timeout: 480 seconds)
[6:05] * gohko (~gohko@natter.interq.or.jp) Quit (Quit: Leaving...)
[6:14] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[6:38] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[6:50] * MikeMcClurg (~mike@cpc18-cmbg15-2-0-cust437.5-4.cable.virginmedia.com) Quit (Quit: Leaving.)
[6:51] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[7:29] * loicd (~loic@ has joined #ceph
[7:42] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[7:42] * EmilienM (~EmilienM@ADijon-654-1-74-63.w109-217.abo.wanadoo.fr) has joined #ceph
[8:05] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[8:09] * loicd (~loic@ Quit (Quit: Leaving.)
[8:25] * andret (~andre@pcandre.nine.ch) has joined #ceph
[8:39] * jtang (~jtang@ has joined #ceph
[9:03] * BManojlovic (~steki@ has joined #ceph
[9:18] * MikeMcClurg (~mike@cpc18-cmbg15-2-0-cust437.5-4.cable.virginmedia.com) has joined #ceph
[9:20] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:33] * Leseb (~Leseb@ has joined #ceph
[9:34] * loicd (~loic@magenta.dachary.org) has joined #ceph
[9:38] * fc (~fc@ has joined #ceph
[10:02] * jtang (~jtang@ Quit (Ping timeout: 480 seconds)
[11:09] * MikeMcClurg (~mike@cpc18-cmbg15-2-0-cust437.5-4.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[11:21] * MikeMcClurg (~mike@cpc18-cmbg15-2-0-cust437.5-4.cable.virginmedia.com) has joined #ceph
[11:57] * MikeMcClurg (~mike@cpc18-cmbg15-2-0-cust437.5-4.cable.virginmedia.com) Quit (Quit: Leaving.)
[12:08] * joao (~JL@ has joined #ceph
[12:48] * MikeMcClurg (~mike@ has joined #ceph
[13:40] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[13:46] * MikeMcClurg1 (~mike@ has joined #ceph
[13:47] * nhmlap (~nhm@174-20-43-18.mpls.qwest.net) has joined #ceph
[13:52] * MikeMcClurg (~mike@ Quit (Ping timeout: 480 seconds)
[14:07] * eternaleye_ (~eternaley@tchaikovsky.exherbo.org) has joined #ceph
[14:08] * eternaleye (~eternaley@tchaikovsky.exherbo.org) Quit (Read error: Connection reset by peer)
[14:15] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[14:27] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:06] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) has joined #ceph
[15:21] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[15:24] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[15:25] * loicd (~loic@magenta.dachary.org) has joined #ceph
[15:26] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[15:35] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[15:35] * loicd (~loic@ has joined #ceph
[15:52] * loicd (~loic@ Quit (Quit: Leaving.)
[15:52] * loicd (~loic@magenta.dachary.org) has joined #ceph
[15:53] * loicd (~loic@magenta.dachary.org) Quit ()
[16:06] * markl (~mark@tpsit.com) has joined #ceph
[16:16] * markl (~mark@tpsit.com) Quit (Quit: leaving)
[16:16] * markl (~mark@tpsit.com) has joined #ceph
[16:16] * nhmlap_ (~nhm@67-220-20-222.usiwireless.com) has joined #ceph
[16:18] * nhmlap (~nhm@174-20-43-18.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[17:08] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:44] * amatter (amatter@c-174-52-137-136.hsd1.ut.comcast.net) has joined #ceph
[17:48] * amatter_ (~amatter@ has joined #ceph
[17:54] * amatter (amatter@c-174-52-137-136.hsd1.ut.comcast.net) Quit (Ping timeout: 480 seconds)
[18:10] <amatter_> I have 890 pgs stale+active+clean. As far as I can tell from a pg dump they are all empty and were on two osd hosts that crashed and have been rebuilt. All of the non-empty PGs had replicas elsewhere and were recreated, but the empty ones did not. Is there a way to remove the pgs altogether? Here's my pg dump_stuck stale: http://pastebin.com/3Yj8nhHH
[18:11] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) has joined #ceph
[18:14] * BManojlovic (~steki@ has joined #ceph
[18:16] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:21] * sagelap (~sage@2600:1013:b024:9653:4c83:dbe9:6afb:6fd0) has joined #ceph
[18:29] * amatter_ (~amatter@ Quit ()
[18:29] * mgalkiewicz (~mgalkiewi@staticline58611.toya.net.pl) has joined #ceph
[18:30] * sagelap (~sage@2600:1013:b024:9653:4c83:dbe9:6afb:6fd0) Quit (Ping timeout: 480 seconds)
[18:33] * pentabular (~sean@ has joined #ceph
[18:33] * jlogan (~Thunderbi@2600:c00:3010:1:8131:e4ec:e12c:5709) has joined #ceph
[18:33] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:41] * thingee_zz is now known as thingee
[18:47] * amatter (~amatter@ has joined #ceph
[18:49] * sagelap (~sage@ has joined #ceph
[18:51] <sjust> amatter_: yeah, one sec
[18:52] * slangmo (~slangmo@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[18:56] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:03] * slangmo (~slangmo@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Colloquy for iPhone - http://colloquy.mobi)
[19:05] * deepsa (~deepsa@ has joined #ceph
[19:07] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[19:11] <sagewk> elder: finally got the btrfs stuff into testing. rebased on latest -rc.
[19:12] <joao> does anyone have any idea why this is failing the build? http://ceph.com/gitbuilder-precise-amd64/log.cgi?log=9b564d19b51bc5e9a221b905500ead321d872fd4
[19:12] <sagewk> though i probably should have merged to avoid re-rebasing some of them
[19:14] <mgalkiewicz> hi guys, I have huge problems with performance in postgresql in my production cluster. Basically everything takes a lot of time; a simple insert into a table lasts 4-12 seconds.
[19:14] <sagewk> joao: weird. try looking at the commit that gitbuilder bisected it down to.. probably a makefile change
[19:14] <nhmlap_> huh, "all circuits are busy" trying to call into the standup.
[19:14] <mgalkiewicz> I would really appreciate some debugging tips. I am using argonaut, kernel 3.2, and the postgres data is on an rbd volume. In my staging cluster everything works fine.
[19:15] <mgalkiewicz> I can see a lot of writes on the osd machine even though clients do not use it heavily
[19:16] <sagewk> where are the writes coming from?
[19:17] <elder> sagewk, I'd like to know what your thoughts are on updating master. I just want our stuff being tested with -next soon so we know it's going to be good for the next release.
[19:17] * chutzpah (~chutz@ has joined #ceph
[19:17] <mgalkiewicz> ceph-osd, ceph-mon processes and [btrfs-submit-1]
[19:18] <jmlowe> how insane would it be to crc32 all objects on write or update at the osd a la btrfs or zfs?
[19:18] <nhmlap_> ok, I'm still not able to call into the stand-up. Not sure what the deal is. I'm working on testing and getting more drives ordered for the test box.
[19:18] <jmlowe> strike all the objects, any object that is written or updated
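jmlowe's checksum-on-write idea above can be sketched as a toy object store in Python. This is purely illustrative (the class, field layout, and object names are made up, not Ceph's OSD code): compute a CRC32 at write time, verify it on every read.

```python
import zlib

class ChecksummedStore:
    """Toy object store that CRC32s every object on write, btrfs/zfs-style."""
    def __init__(self):
        self.objects = {}  # name -> (crc, data)

    def write(self, name, data):
        # compute the checksum once, at write/update time
        self.objects[name] = (zlib.crc32(data) & 0xffffffff, data)

    def read(self, name):
        crc, data = self.objects[name]
        # verify on every read; a mismatch means silent corruption happened
        if zlib.crc32(data) & 0xffffffff != crc:
            raise IOError("object %s failed CRC32 check" % name)
        return data

store = ChecksummedStore()
store.write("obj1", b"some object payload")
print(store.read("obj1") == b"some object payload")  # → True
```

The trade-off being asked about is the CPU cost of hashing every write; btrfs and zfs pay it to catch bit rot that the disk itself never reports.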
[19:28] * dmick (~dmick@2607:f298:a:607:912d:3cba:f807:4929) has joined #ceph
[19:28] * MikeMcClurg1 (~mike@ Quit (Quit: Leaving.)
[19:31] * maelfius (~mdrnstm@ has joined #ceph
[19:31] * Ryan_Lane (~Adium@ has joined #ceph
[19:33] <joshd> mgalkiewicz: you might check whether the slowness is related to particular osds
[19:37] <nhmlap_> joshd: can you get to newdream's redmine?
[19:39] <joshd> nhmlap_: yeah, no problems here
[19:40] <nhmlap_> hrm, maybe something is up with newdream's vpn. I can get to a page about 10% of the time.
[19:43] <joao> sagewk, can it be this?
[19:43] <joao> # Ran 5 tests, 0 skipped, 0 failed.
[19:43] <joao> no such file: ./src/test/cli/mon-store-tool/*.t
[19:44] <sagewk> joao: could be... are there no .t files in that dir or something?
[19:45] <mgalkiewicz> joshd: looks like both osds write a similar amount of data
[19:45] <sjust> mikeryan: vidyo sprint planning
[19:45] <joao> sagewk, nope, none; I didn't create any
[19:45] <joao> I had no idea it was required though :x
[19:45] <dmick> it's a fun little hidden test :)
[19:46] <joao> oh, I see
[19:46] <mgalkiewicz> joshd: it is based on the observation of iotop output and iostat
[19:49] <joao> wait, are we having a sprint planning today?
[19:49] <joshd> mgalkiewicz: could it be postgres' vacuum/other background processing giving extra load?
[19:50] <dmick> joao: just asked, you don't need to connect
[19:50] <joshd> mgalkiewicz: it'd be useful to try to determine if this is a difference from your staging environment purely on the client side, or if the osds themselves are slower
[19:51] <joao> dmick, okay, thanks
[19:51] <mgalkiewicz> joshd: If the database is running on a local partition on the same machine it works fine, a lot faster
[19:52] <mgalkiewicz> joshd: after moving the data to an rbd volume it slows down a lot
[19:53] <mgalkiewicz> clients in both clusters work the same
[19:53] <joshd> ok, so the next step would be looking at where the operations are taking a long time on the osds
[19:54] <mgalkiewicz> how to check this?
[19:56] <joshd> if you have the admin socket enabled on the osds, you can see the operations in progress with ceph --admin-daemon /path/to/osd/admin_socket dump_ops_in_flight
[20:00] <mgalkiewicz> joshd: got sth like this https://gist.github.com/3692559
[20:01] <joshd> hmm, that age is high - it's in seconds
[20:01] <joshd> I'm surprised there aren't more ops too
[20:03] <mgalkiewicz> running it several times gave me num_ops: 22 max
[20:05] <mgalkiewicz> https://gist.github.com/3692593
[20:07] <joshd> there's definitely some high latency going on there, which is probably causing the slowness
[20:08] <mgalkiewicz> how do you know?
[20:08] <joshd> the age > 4 of a bunch of those ops
[20:09] <joshd> age shouldn't be more than 1
[20:10] <joshd> it means something is backed up (osd filesystem or journal)
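joshd's rule of thumb (ops in flight older than about a second indicate a backed-up OSD) is easy to apply mechanically to the JSON that `dump_ops_in_flight` emits. A sketch; the sample below is made up and the field names are approximate, so adapt them to what your ceph version actually prints:

```python
import json

# sample shaped like dump_ops_in_flight output (hypothetical ops, field names approximate)
sample = json.loads("""
{ "num_ops": 3,
  "ops": [
    { "description": "osd_op(client.4105.0:1 rb.0.2 [write 0~4096])",
      "age": "4.27", "flag_point": "waiting for sub ops" },
    { "description": "osd_op(client.4105.0:2 rb.0.3 [write 4096~4096])",
      "age": "0.35", "flag_point": "started" },
    { "description": "osd_op(client.4105.0:3 rb.0.4 [write 8192~4096])",
      "age": "6.91", "flag_point": "waiting for sub ops" } ] }
""")

def slow_ops(dump, threshold=1.0):
    """Return ops whose age exceeds threshold seconds."""
    return [op for op in dump["ops"] if float(op["age"]) > threshold]

# print the laggards with where they are stuck (flag_point)
for op in slow_ops(sample):
    print("%ss  %s  %s" % (op["age"], op["flag_point"], op["description"]))
```

Feeding it real output (`ceph --admin-daemon /path/to/osd/admin_socket dump_ops_in_flight`) and looking at the dominant flag_point is exactly the triage done later in this log.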
[20:12] * MooingLemur (~troy@ has joined #ceph
[20:12] <mgalkiewicz> backed up like what?
[20:13] <joshd> like too many requests for it to handle efficiently, or something has gone wrong with the underlying fs
[20:14] <joshd> do your osd logs contain entries saying 'JOURNAL FULL'?
[20:15] <joshd> if so, the underlying fs isn't keeping up with the journal
[20:15] <mgalkiewicz> joshd: here some output from iostat https://gist.github.com/3692659 btrfs partiton with ceph data is dm-2
[20:17] <mgalkiewicz> log from osd.0 https://gist.github.com/3692679
[20:18] <mgalkiewicz> osd.1 looks similar and I have noticed a weird line: osd.1 328 mon hasn't acked PGStats in 30.636403 seconds, reconnecting elsewhere
[20:18] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[20:19] <MooingLemur> when creating a new cluster via mkcephfs, it appears that it doesn't create keys for the osds, or at least it doesn't put a keyring file in the osd's data dir. http://pastie.org/private/gb1bnsmxwwxbddh72rxqpa Is this expected behavior for a new cluster? Is that a manual step when auth is enabled?
[20:19] <mgalkiewicz> the next weird thing is mon.n12c1@2(peon).paxos(monmap active c 1..3) lease_expire from mon.0 is in the past (2012-09-10 20:18:30.902044); clocks are too skewed for us to function
[20:20] <mgalkiewicz> even though I have configured mon lease wiggle room = 0.5 and all ceph servers are using ntpd
[20:21] <joshd> it's possible ntp isn't keeping them as in sync as you'd like
[20:21] <joshd> that would cause the pg stats issue, and could cause other problems
[20:23] <mgalkiewicz> joshd: my ntpd reports high precision
[20:23] <mgalkiewicz> precision = 0.642 usec
[20:25] <mgalkiewicz> I dont think it causes those problems; staging has the same ntpd settings
[20:27] * wijet (~wijet@staticline58611.toya.net.pl) has joined #ceph
[20:27] <joshd> do your monitors in staging report the clock skew issue though?
[20:29] <dmick> MooingLemur: it should. Did you specify cephx in the conf file?
[20:30] <joshd> mgalkiewicz: it's also possible the slowness is from aging btrfs. has your staging gotten similar load for a similar period of time?
[20:31] <mgalkiewicz> joshd: yes but not so often. Take a look at https://gist.github.com/3692758. Why does it complain about "0.230939s in the future" if the lease wiggle room is set to 0.5?
[20:32] <joshd> mgalkiewicz: were the mons never restarted after you changed the wiggle room setting?
[20:32] <mgalkiewicz> joshd: btrfs filesystem was created in july
[20:32] <mgalkiewicz> joshd: sure
[20:34] <joshd> mon_clock_drift_allowed looks to be one of those settings that can't be set dynamically, but is read at startup
[20:35] <joshd> so restarting the mons will make it take effect
[20:35] <mgalkiewicz> I have restarted all components after changing this
[20:36] <joshd> did you actually say 'wiggle room' in the config? the setting is 'mon clock drift allowed'
[20:37] <mgalkiewicz> hmm
[20:38] * dmick (~dmick@2607:f298:a:607:912d:3cba:f807:4929) Quit (Quit: Leaving.)
[20:39] <mgalkiewicz> mon lease wiggle room
[20:39] <mgalkiewicz> I had to google it and now I have found out that it was renamed a long time ago :)
[20:43] * mtk0 (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[20:44] <mgalkiewicz> it should not start with an unknown option, should it?
[20:45] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[20:51] <joao> nice
[20:51] <joao> just found that we have a test for message encoding/decoding
[20:51] <amatter> to convert an existing osd to use an external journal device, do I need to do anything to move the existing journal to the device?
[20:51] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[20:52] <amatter> i'm getting "journal FileJournal::open: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected dcc138ab-1e9d-4df4-bd2c-2f54d06a33ec, invalid" which suggests I need to move the existing journal
[20:53] <joshd> mgalkiewicz: yeah, there should have been a warning about that at least
[20:53] <amatter> hmm. maybe not: an earlier line says "journal open /dev/sdd1 fsid dcc138ab-1e9d-4df4-bd2c-2f54d06a33ec fs_op_seq 1068076" showing that id being assigned to the journal dev
[20:53] * dmick (~dmick@2607:f298:a:607:1d88:5b53:8eec:5ac2) has joined #ceph
[20:55] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[20:55] <joshd> amatter: to be safe you should flush the old journal (ceph-osd -i $OSD_NUM --flushjournal), change the ceph.conf to point to the new journal, then initialize it (ceph-osd -i $OSD_NUM --mkjournal)
[20:55] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[20:57] * mtk0 (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[20:57] <amatter> joshd: thanks. I get unrecognized arg --flushjournal. I'm on 0.48.1argonaut. Has this feature been added since?
[20:58] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[20:58] <joshd> oh, there's a dash in there --flush-journal, sorry
[20:58] <amatter> joshd: bingo. Thanks.
[21:02] <mgalkiewicz> joshd: ok I have changed that option, restarted everything, logs are clear, I will check performance now
[21:04] <joao> can anyone please sum up what the ceph-object-corpus is all about?
[21:05] <MooingLemur> dmick: my ceph.conf is http://bpaste.net/show/44816/
[21:06] <dmick> MooingLemur: so, yeah, there's no cephx mentioned there. You want auth supported = cephx in [global], I believe
[21:06] <MooingLemur> oh, I need [global]. I have the auth supported line outside of any section.
[21:06] <MooingLemur> line 1 :)
[21:06] <dmick> doh
[21:07] <dmick> I missed that completely :0
[21:07] <dmick> um...not sure, but try it in [global]
[21:07] <mgalkiewicz> joshd: hmm, a simple insert into the database still takes 1.5-2.3s
[21:07] <MooingLemur> it appeared to do something a little half-assed with my config :)
[21:07] <MooingLemur> thanks.
[21:08] <dmick> well let's see if that fixes it first :)
[21:08] <mgalkiewicz> joshd: and mons report from time to time mon.n12c1@2(peon).paxos(monmap active c 1..3) lease_expire from mon.0 is in the past (2012-09-10 21:03:41.894316); clocks are too skewed for us to function
[21:08] * mtk0 (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[21:09] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[21:11] * eternaleye_ is now known as eternaleye
[21:12] * mtk0 (~mtk@ool-44c35bb4.dyn.optonline.net) Quit ()
[21:13] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[21:13] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[21:14] <joshd> mgalkiewicz: so the easiest way to track this down is probably to figure out what's different from your staging environment
[21:14] <joshd> mgalkiewicz: try doing 'ceph osd tell \* bench', and see the results in ceph -w
[21:15] <MooingLemur> dmick: it does not appear to change the behavior. there are no /var/lib/ceph/mon/ceph-*/keyring files
[21:15] <MooingLemur> err. not /mon/
[21:15] <dmick> well it was a nice try
[21:15] <MooingLemur> /osd/
[21:15] <dmick> ok
[21:15] <MooingLemur> the /mon/ subdirs do have a keyring
[21:15] <dmick> did you say you were using mkcephfs?
[21:15] <MooingLemur> yep
[21:17] <joshd> mgalkiewicz: if that's the same on your staging cluster, I'd be surprised
[21:18] <dmick> what args are you giving to mkcephfs?
[21:19] <MooingLemur> dmick: cd /etc/ceph; mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring
[21:19] <mgalkiewicz> joshd: production https://gist.github.com/3693141
[21:19] <dmick> MooingLemur: seems sane to me. hm.
[21:20] <MooingLemur> using the gentoo package. ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac)
[21:20] <dmick> perhaps try with -v, and examine the output carefully.
[21:20] <dmick> (you do have passwordless root ssh to all nodes working?...and you're doing mkcephfs as root/sudo?)
[21:22] <MooingLemur> passwordless ssh from the host that I'm running on at least for now. doing as root.
[21:22] <MooingLemur> https://gist.github.com/de5d930a5e9c2530916d
[21:22] <MooingLemur> that's the most recent mkcephfs after correcting the ceph.conf. Not with -v though
[21:22] <joshd> mgalkiewicz: that's 30MB/s for a single osd... are your journals on the same fs as your data or something?
[21:26] * EmilienM (~EmilienM@ADijon-654-1-74-63.w109-217.abo.wanadoo.fr) has left #ceph
[21:28] <mgalkiewicz> by data you mean files not related to ceph?
[21:29] <mgalkiewicz> staging did: bench: wrote 1024 MB in blocks of 4096 KB in 28.588518 sec at 36678 KB/sec
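A bench result line like the one above (as emitted into `ceph -w` by `ceph osd tell \* bench`) can be turned into MB/s with a small parser. A sketch; the regex assumes exactly the format shown in this log, so treat it as illustrative:

```python
import re

def parse_osd_bench(line):
    """Pull (MB written, seconds, KB/s) out of an 'osd tell bench' result
    line, e.g. 'bench: wrote 1024 MB in blocks of 4096 KB in 28.58 sec at
    36678 KB/sec'. Returns None if the line doesn't match."""
    m = re.search(r"wrote (\d+) MB .* in ([\d.]+) sec at (\d+) KB/sec", line)
    if not m:
        return None
    return int(m.group(1)), float(m.group(2)), int(m.group(3))

line = "bench: wrote 1024 MB in blocks of 4096 KB in 28.588518 sec at 36678 KB/sec"
mb, secs, kbs = parse_osd_bench(line)
print("%.1f MB/s" % (mb / secs))
```

Run over the bench lines from both clusters, this gives a single comparable number per OSD, which is what the staging-vs-production comparison here needs.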
[21:34] <joshd> no, I meant the ceph osd data dir
[21:34] <joshd> but it sounds like there's not much difference from staging there
[21:35] <joshd> there isn't more network latency between the production osds, is there?
[21:35] <mgalkiewicz> 100mbit ethernet, so I dont think so
[21:36] <joao> sagewk, gregaf1, any tips on how-not-to-get-bitten by the encoding tests whenever one changes a message format?
[21:36] <mgalkiewicz> the osd directory is on the same filesystem as osd.0.journal
[21:36] <sagewk> joao: if they're biting you, that probably means you didn't change them in a safe way...
[21:37] <joao> sagewk, yeah, I thought that would probably be the reason
[21:37] <joao> not sure what would be considered "safe", since I changed a couple of fields on MMonProbe and updated the HEADER and COMPAT versions
[21:38] <joshd> mgalkiewicz: having the journal on the same fs (even same disk) will hurt performance
[21:38] <sagewk> joao: you can add new fields, but not change existing ones, without making an incompatible change
[21:39] <sagewk> if you want to change something, though, we may find that we need to do something trickier with the upgrade transition
[21:39] <joao> so should I just leave the old slurp fields on MMonProbe?
[21:39] <joshd> mgalkiewicz: can you try a rados bench on an unused pool in each environment?
[21:40] <mgalkiewicz> sure how to perform it?
[21:41] <joshd> rados -p data bench 60 write
[21:42] <MooingLemur> dmick: looks like I needed to specify a keyring file in the [osd] and [mds] sections, too. Now it seems a bit happier :)
[21:43] <joao> sagewk, I'm just going to revert MMonProbe to its state post-sync/pre-slurp-stuff-removal
[21:44] <joao> although it feels somewhat kind of wrong that we're leaving slurp code lying around
[21:45] <mgalkiewicz> joshd: production https://gist.github.com/3693354
[21:46] <sagewk> necessary, at least initially
[21:48] <mgalkiewicz> joshd: I forgot that staging has slightly slower disks and its btrfs filesystem is on raid1; production has the same filesystem on raid0
[21:48] <joao> sagewk, this begs the question: should we also leave slurp code on the monitor and leave it being reachable?
[21:49] <mgalkiewicz> joshd: staging https://gist.github.com/3693370
[21:49] * mtk0 (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[21:49] <sagewk> i think no. i think the new code should migrate its local data to the new format (where possible), and then either catch-up or sync.
[21:49] <sagewk> and old code trying to slurp can fail
[21:49] <sagewk> (slurp from us
[21:49] <sagewk> )
[21:50] <joao> should we handle their messages and let them, somehow, know that it won't work? or just drop their messages?
[21:50] <Tv_> gmane is not keeping up with vger :(
[21:50] <mgalkiewicz> and staging has argonaut 48.1 and production argonaut 48
[21:50] <gregaf1> joao: sagewk: I thought we were going to prevent old monitors from even connecting to new monitors
[21:51] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[21:51] <gregaf1> so I don't think we need to worry about them getting slurp requests
[21:51] <sagewk> hmm yeah, okay. that would be easiest...
[21:51] <sagewk> in which case yeah, you can change encoding at will
[21:51] <sagewk> (for the msgs)
[21:52] * mtk0 (~mtk@ool-44c35bb4.dyn.optonline.net) Quit ()
[21:53] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[21:53] <joao> alright
[21:53] <joshd> mgalkiewicz: 48.1 vs 48 might make a difference for small i/o, I can't remember if the journaling fix for small i/o was in 0.48 or not
[21:53] <joshd> mgalkiewicz: but the rados bench clearly shows the production cluster is getting much higher latency for each operation
[21:54] <mgalkiewicz> yep
[21:54] <joao> sagewk, correct me if I'm wrong, I'm assuming the simplest way to get teuthology to run would be to revert the MMonProbe message for the time being, no?
[21:54] <sagewk> joao: hmm, how does teuthology depend on slurp at all?
[21:55] <joao> it does not; but the gitbuilder isn't compiling it
[21:55] <joao> erm, compiling my branch
[21:55] <joao> actually
[21:55] <joao> it compiles, it just doesn't build
[21:55] <joao> because the ceph-dencoding test fails, due to the MMonProbe message in the ceph-object-corpus being unable to be decoded
[21:57] <sjust> mgalkiewicz: try rados -p todo-list bench 60 write -b 1024 -t 1
[21:57] <sagewk> ooh, i see.
[21:58] <dmick> MooingLemur: I'm surprised; I would have thought it would default
[21:58] <dmick> but maybe that was a more recent addition
[21:58] <sagewk> there is a way to mark an incompat change in the object corpus
[21:58] <sagewk> and commit that as a submodule commit in your tree
[21:59] <joao> I suppose that would require changing the public object-corpus repository, and that it would make the other branches fail, no?
[22:00] <sagewk> no, the parent repo specifies the submodule commit to use
[22:00] <sagewk> so it'd only be updated in your branch
[22:00] <mgalkiewicz> sjust: production https://gist.github.com/3693459
[22:00] <dmick> MooingLemur: probably fixed in commit 3c90ff4e96481daa0ee6042ead516dbc1864ef4a
[22:01] <sjust> and then do ceph --admin-daemon /path/to/osd/admin_socket dump_ops_in_flight again while the test is running until you see some output
[22:01] <mgalkiewicz> staging https://gist.github.com/3693464
[22:02] <joao> sagewk, would the gitbuilder be able to deal with that?
[22:02] <dmick> MooingLemur: that wasn't until 0.51. Sorry to mislead
[22:03] <sjust> can you post your two ceph.conf's?
[22:04] <joao> well, dinner is ready; bbiab
[22:04] <sagewk> joao: yep!
[22:05] <mgalkiewicz> sjust: output from production https://gist.github.com/3693485
[22:07] <sjust> mgalkiewicz: can you try that again on the other production osd, looks like most ops are waiting on the other osd
[22:07] <mgalkiewicz> sjust: config https://gist.github.com/3693495
[22:07] <mgalkiewicz> sjust: ok
[22:08] <sjust> staging only has one osd?
[22:08] <mgalkiewicz> yep
[22:08] <sjust> that generally will buy you lower latency by itself, but not a factor of 10
[22:09] <sjust> can you post the output of ceph osd tree?
[22:10] <mgalkiewicz> https://gist.github.com/3693517
[22:11] <mgalkiewicz> sjust: the other osd https://gist.github.com/3693526
[22:11] <mgalkiewicz> sjust: how did you know that most ops are waiting on the other osd?
[22:12] <sjust> most of the ops from the most recent dump are at flag_point "waiting for subops"
[22:12] <sjust> mgalkiewicz: sorry, I meant the dump_ops_in_flight on the other osd
[22:12] <sjust> during the same rados bench workload
[22:13] <sjust> it only dumps the ops from the osd whose admin socket you connected to
[22:13] <mgalkiewicz> ok
[22:14] <sjust> also, how many pgs are in the todo-list pool?
[22:15] <sjust> on production vs todo-list-test on not-production
[22:15] <mgalkiewicz> how to check this?
[22:16] <mikeryan> sagewk: sjust: just got my summary email out to the list, with a big fat disclaimer on top
[22:16] <sjust> cool
[22:18] <sjust> mgalkiewicz: sorry, need to start a cluster to test the command
[22:19] <mgalkiewicz> sjust: k
[22:19] <joshd> 'ceph pg dump' will tell you
[22:20] <sjust> I think it's pool 37 on production, so it'll be the pgs that look like 37.[0-9]+
[22:20] <sjust> how many of those are there?
[22:22] <nhmlap_> mikeryan: were you ever able to figure out if it was rados bench or the messenger?
[22:23] <mgalkiewicz> # ceph pg dump | grep '^37\.' | wc -l returns 8
[22:23] <sjust> ok, that's the default, that should be ok
[22:23] <sjust> actually, can you post the output of ceph pg dump | grep '^37\.' ?
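The grep sjust suggests (`ceph pg dump | grep '^37\.'`) can equally be done in Python when you want to count or filter PGs per pool; pg ids are `<pool>.<seq>`, so matching on the pool prefix is enough. The sample dump lines below are made up for illustration:

```python
# each data line of 'ceph pg dump' starts with the pg id, e.g. '37.4' = pool 37
# (hypothetical sample lines; real dumps have more columns)
pg_dump_lines = [
    "37.0\t0\t0\t0\t0\tactive+clean",
    "37.1\t0\t0\t0\t0\tactive+clean",
    "3.0\t12\t0\t0\t48\tactive+clean",
]

def pgs_in_pool(lines, pool):
    """Return the pg dump lines belonging to the given pool id."""
    prefix = "%d." % pool
    return [l for l in lines if l.split("\t")[0].startswith(prefix)]

print(len(pgs_in_pool(pg_dump_lines, 37)))  # → 2
```

Note the prefix includes the dot, so pool 3 does not accidentally match pool 37's pgs the way a sloppy `grep '^3'` would.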
[22:24] <mgalkiewicz> https://gist.github.com/3693591
[22:25] <sjust> was the first dump_ops_in_flight from 0 or 1?
[22:25] <mgalkiewicz> from 0 I guess
[22:26] <mikeryan> nhmlap_: rados bench has one bottleneck at 250 mbyte/sec
[22:26] <mikeryan> the next bottleneck after we fix that will most likely be in the messenger
[22:26] <mikeryan> at around 500 mbyte/sec per connection
[22:26] <mikeryan> having more than one osd almost completely mitigates that
[22:32] <sjust> mgalkiewicz: if you can get me the other dump_ops_in_flight, we can get a more complete picture, but it looks like the other osd is much slower. Is it possible that the disk is going bad?
[22:34] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Quit: Leaving)
[22:34] <mgalkiewicz> Well, the machines were rented the same day so they are probably the same age. I dont think that the disks are broken because in raid0 they would have already crashed the system
[22:34] <sjust> each disk is a raid 0?
[22:34] <sjust> oh, both osds on the same disk
[22:34] <sjust> ?
[22:35] <mgalkiewicz> there are 2 servers with 2 disks each. Each server has both of them in raid0
[22:35] <mgalkiewicz> one server is osd0 and the other osd1
[22:35] <sjust> how are these rented?
[22:36] <mgalkiewicz> www.hetzner.de
[22:37] <mgalkiewicz> so I will run the benchmark on both servers at the same time and provide you with dump_ops_in_flight from both, ok?
[22:39] <sjust> yep
[22:39] <nhmlap_> mikeryan: what's the 250MB/s bottleneck you are seeing?
[22:39] <sjust> are they virtualized?
[22:40] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:41] <mgalkiewicz> clients are on vms but osds on physical machines
[22:41] <sjust> k
[22:41] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[22:43] <sjust> can you test the throughput on each raid0 with dd oflag=dsync bs=4k if=/dev/zero of=<test_file>
[22:43] <sjust> ?
[22:43] <sjust> that should give us an idea of the relative speed
[22:43] <amatter> if I wanted to migrate all the data off of an osd so that I could remove it from the cluster, what's the best approach? set weight = 0?
[22:45] <joshd> amatter: that will do it. if you don't want too much data reshuffling at once, you can slowly lower the weight to 0
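joshd's gradual-drain advice (lower the weight in stages rather than jumping straight to 0) could be scripted. A sketch that only prints the commands rather than running them; the `ceph osd reweight <id> <weight>` syntax is assumed from argonaut-era ceph, so verify it against your version first:

```python
def reweight_steps(osd_id, start=1.0, step=0.25):
    """Generate 'ceph osd reweight' commands that walk an OSD's weight down
    to 0 in stages, so data migrates off gradually instead of all at once.
    (Command syntax is an assumption; check your ceph release.)"""
    cmds = []
    w = start
    while w > 0:
        w = max(0.0, w - step)
        cmds.append("ceph osd reweight %d %.2f" % (osd_id, w))
    return cmds

# hypothetical osd id 4; run each command, then wait for the cluster to
# return to active+clean before issuing the next one
for cmd in reweight_steps(4):
    print(cmd)
```

Waiting for recovery to settle between steps is the point: each step reshuffles only a fraction of the data, which keeps client I/O usable on a small cluster.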
[22:45] <amatter> I have one machine that has terrible i/o performance (bad cable, maybe, not sure yet) and it's killing the performance across the whole cluster of eight osds
[22:47] <joshd> yeah, that can be an issue with a small cluster
[22:47] <mgalkiewicz> osd0 https://gist.github.com/3693704, osd1 https://gist.github.com/3693707
[22:49] <elder> Stepping away for a bit.
[22:49] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) has joined #ceph
[22:50] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Read error: No route to host)
[22:51] <mgalkiewicz> sjust: will dd finish on its own after some time?
[22:51] * trhoden (~trhoden@pool-108-28-184-124.washdc.fios.verizon.net) has joined #ceph
[22:52] <joshd> mgalkiewicz: I think you need a count=1000 or something in the dd too
[22:53] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[22:55] <mikeryan> nhmlap_: i'm not sure where it's coming from
[22:56] <nhmlap_> mikeryan: hrm, I meant more is it a per-osd limit?
[22:56] <mikeryan> ah, that's a good question
[22:56] <mikeryan> i only ran against a single OSD
[22:56] <mikeryan> i can run against multiple
[22:57] <mgalkiewicz> sjust: osd0 4096000 bytes (4.1 MB) copied, 153.372 s, 26.7 kB/s
[22:57] <nhmlap_> mikeryan: that sounds about like what I remember seeing 1 OSD getting.
[22:57] <mgalkiewicz> sjust: osd1 4096000 bytes (4.1 MB) copied, 223.697 s, 18.3 kB/s
[22:58] <nhmlap_> mikeryan: I think that was on an SSD that could only do around 250MB/s though, so my results weren't very conclusive.
[22:58] <mgalkiewicz> written to the btrfs filesystem designated to ceph
[23:00] <amatter> I think oflag=dsync (syncing after each 4k) is not going to be reflective of real world performance. Try something like dd bs=4096 count=40960 if=/dev/zero of=/data/test conv=fdatasync
[23:01] <joshd> mgalkiewicz: could you try the ops_in_flight from each osd during a rados bench with -t 100? there will be more ops in progress then
[23:01] <mgalkiewicz> k
[23:02] <joshd> and yeah, amatter's right about the dd flags, but the ops in flight should tell us more
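[editor's note: the two dd variants under discussion can be compared directly; a minimal sketch, with an arbitrary /tmp path and a deliberately small 1MB size so it finishes quickly]

```shell
# oflag=dsync forces a flush after every 4k block, which mostly measures
# sync latency; conv=fdatasync flushes once at the end, which is closer
# to real sequential-write throughput.
dd if=/dev/zero of=/tmp/ddtest bs=4k count=256 oflag=dsync 2>&1 | tail -n1
dd if=/dev/zero of=/tmp/ddtest bs=4k count=256 conv=fdatasync 2>&1 | tail -n1
rm -f /tmp/ddtest
```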
[23:04] <mgalkiewicz> osd0 https://gist.github.com/3693836, osd1 https://gist.github.com/3693841
[23:04] <nhmlap_> mikeryan: the msbench results are interesting. The 1900MB/s result makes me think that we should be able to continue pushing performance up so long as we have enough rados bench instances.
[23:06] <nhmlap_> sjust: btw, with 6 OSDs I was able to get ~130MB/s per OSD. You should be proud of the filestore. ;)
[23:11] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) Quit (Quit: Leaving.)
[23:18] * EmilienM (~EmilienM@ADijon-654-1-61-178.w92-130.abo.wanadoo.fr) has joined #ceph
[23:20] <joshd> mgalkiewicz: it seems like osd0 is slowing down osd1, since the number waiting for subops is 81 vs 146
[23:21] <joshd> mgalkiewicz: that probably means the fs or disk underneath osd0 has some problems
[23:26] <mgalkiewicz> is it a good idea to temporarily shut down the problematic osd and check performance then?
[23:26] * EmilienM (~EmilienM@ADijon-654-1-61-178.w92-130.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[23:26] <sjust> mgalkiewicz: that might provide an interesting data point
[23:30] <mgalkiewicz> might s.m.a.r.t. provide some data to confirm your theory?
[23:31] <joshd> possibly, but smart doesn't always warn when disks are dying, and slowness might not be due to the disk dying
[23:32] <joshd> temporarily running one osd would do it
[23:33] <joshd> or adding a new one, and taking osd0 offline after things are reshuffled
[23:33] <gregaf1> there's some natural variance that can happen; just because one disk is slower than the other doesn't mean they're dying
[23:34] * lofejndif (~lsqavnbok@1RDAADKJA.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:34] <mgalkiewicz> joshd: and what about the disk writes reported by iostat: dm-2 462.04 2.19 2873.79 4680424 6130397528, where ~2.6MB/s is constantly being written to disk even when the clients do nothing?
[23:36] <joshd> are you sure that's coming from ceph-osd?
[23:37] <joshd> try looking at iotop on the osd while the client is idle
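[editor's note: one way to run the check joshd suggests; iotop needs root and must be installed, so it is shown here as a dry run, and the flags are a common but not the only choice]

```shell
# Batch mode (-b), 3 samples (-n 3), accumulated I/O (-a), and only
# processes actually doing I/O (-o); printed, not run.
echo "iotop -b -n 3 -a -o"
```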
[23:38] <mgalkiewicz> it's not that easy because I have many clients, but the osd is definitely the most active process
[23:39] <mgalkiewicz> some output from iotop
[23:39] <mgalkiewicz> https://gist.github.com/3694084
[23:40] <amatter> I have 890 pgs stale+active+clean. As far as I can tell from a pg dump they are all empty and were on two osd hosts that crashed and have been rebuilt. All of the non-empty PGs had replicas elsewhere and were recreated, but the empty ones did not. Is there a way to remove the pgs all together? Here's my pg dump_stuck stale: http://pastebin.com/3Yj8nhHH
[23:40] <amatter> hate to repeat, maybe I missed the response
[23:41] <sjust> amatter: totally forgot, really looking for the command now :)
[23:41] <amatter> sjust: thanks
[23:42] <sjust> are they all in the same pools?
[23:42] <sjust> can you post ceph pg dump?
[23:43] <mgalkiewicz> sjust, joshd: thx for help I will setup another osd and shutdown osd.0
[23:48] * pentabular (~sean@ has left #ceph
[23:48] <joshd> mgalkiewicz: you're welcome, I hope that fixes it
[23:51] <joao> sagewk, I suppose I should push a branch to the ceph-object-corpus public repo to make this work, no?
[23:53] <sagewk> yeah
[23:54] <joao> should I have write access to it?
[23:54] <joao> I believe it is hosted on ceph.newdream.net, is it not?
[23:54] <joao> I might have a *really* outdated repo though
[23:56] <sagewk> oh, it is
[23:58] * pentabular (~sean@ has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.