#ceph IRC Log


IRC Log for 2012-09-12

Timestamps are in GMT/BST.

[0:01] * The_Bishop (~bishop@2a01:198:2ee:0:a519:8bc:b4e3:6dc6) has joined #ceph
[0:03] * Ryan_Lane (~Adium@25.sub-166-250-42.myvzw.com) has joined #ceph
[0:11] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:12] * lofejndif (~lsqavnbok@1GLAAADE9.tor-irc.dnsbl.oftc.net) has joined #ceph
[0:16] * EmilienM (~EmilienM@ADijon-654-1-133-33.w90-56.abo.wanadoo.fr) has left #ceph
[0:28] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[0:33] * Ryan_Lane (~Adium@25.sub-166-250-42.myvzw.com) Quit (Quit: Leaving.)
[0:45] * Glowplug (~chatzilla@c-69-246-99-102.hsd1.mi.comcast.net) Quit (Quit: ChatZilla [Firefox 15.0.1/20120907164141])
[0:46] * Karcaw (~evan@96-41-198-212.dhcp.elbg.wa.charter.com) has joined #ceph
[0:59] * lofejndif (~lsqavnbok@1GLAAADE9.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[1:10] <mrjack_> http://pastebin.com/77zQF3ui fixes http://tracker.newdream.net/issues/3052 for me
[1:11] <dmick> mrjack_: thanks. I'll update the bug with your proposed fix
[1:14] <mrjack_> the fix was from gregaf
[1:15] <dmick> ah
[1:15] <dmick> will note that as well :)
[1:15] <mrjack_> [00:56] <gregaf> the patch I gave you is a bit of a wide brush so it probably won't go in, but I made a bug so we'll investigate it a bit more (see if ext3's ABI is off, or our assumptions about return codes, etc) and come up with something to handle it
[1:16] <dmick> and yes, I agree with his notes
[1:17] <mrjack_> ok
[1:21] <mrjack_> i cannot login to tracker.newdream.net
[1:21] <mrjack_> i reset password but cannot login
[1:21] <mrjack_> do i need openID ?
[1:26] <dmick> you don't
[1:26] <dmick> but you do need to click through on the confirmation email, if you got one
[1:26] <dmick> (it's not really clear about the fact that it's sending you one)
[1:27] * nhmlap (~nhm@67-220-20-222.usiwireless.com) Quit (Ping timeout: 480 seconds)
[1:27] <dmick> PM me your username and I can doublecheck, maybe fix
[1:29] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:29] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Read error: Connection reset by peer)
[1:29] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[1:29] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:33] <mrjack_> f.wiessner
[1:33] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:36] * nhmlap (~nhm@67-220-20-222.usiwireless.com) has joined #ceph
[1:48] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[1:49] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[1:49] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[1:52] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[1:54] * sagelap (~sage@163.sub-70-197-139.myvzw.com) has joined #ceph
[1:58] <Tv_> sjust, sagelap: http://tracker.newdream.net/issues/3135 http://tracker.newdream.net/issues/3136 http://tracker.newdream.net/issues/3137 http://tracker.newdream.net/issues/3138
[1:58] <sjust> k
[2:00] * Tv_ (~tv@2607:f298:a:607:5905:afb4:18b:79c5) Quit (Quit: Tv_)
[2:06] * sagelap is now known as sage
[2:07] * sage is now known as sagelap
[2:19] * chutzpah (~chutz@ Quit (Quit: Leaving)
[2:23] <joshd> sagelap: does wip-rbd-snap-race look good?
[2:23] <joshd> I've got a version for stable too
[2:35] * pentabular (~sean@ has joined #ceph
[2:37] * sagelap (~sage@163.sub-70-197-139.myvzw.com) Quit (Ping timeout: 480 seconds)
[3:01] * Cube (~Adium@ Quit (Quit: Leaving.)
[3:27] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[3:32] * iggy_ (~iggy@theiggy.com) has joined #ceph
[3:52] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[4:47] * maelfius (~mdrnstm@ Quit (Quit: Leaving.)
[5:35] * pentabular (~sean@ has left #ceph
[5:41] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[5:47] * dmick (~dmick@2607:f298:a:607:1d88:5b53:8eec:5ac2) Quit (Quit: Leaving.)
[6:02] * ivan` (~ivan`@li125-242.members.linode.com) Quit (Remote host closed the connection)
[6:03] * ivan` (~ivan`@li125-242.members.linode.com) has joined #ceph
[6:23] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[6:29] * pentabular (~sean@adsl-70-231-142-192.dsl.snfc21.sbcglobal.net) has joined #ceph
[6:30] * pentabular is now known as Guest6751
[6:30] * Guest6751 is now known as pentabular
[6:30] * ivan` (~ivan`@li125-242.members.linode.com) Quit (Read error: Operation timed out)
[6:30] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Operation timed out)
[6:30] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) Quit (Read error: Operation timed out)
[6:30] * johnl (~johnl@2a02:1348:14c:1720:1ddb:3b25:17ee:d009) Quit (Remote host closed the connection)
[6:30] * ivan` (~ivan`@li125-242.members.linode.com) has joined #ceph
[6:31] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[6:31] * MK_FG (~MK_FG@ Quit (Remote host closed the connection)
[6:32] * MK_FG (~MK_FG@ has joined #ceph
[6:32] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) has joined #ceph
[6:36] * johnl (~johnl@2a02:1348:14c:1720:b1b7:4a10:6713:9272) has joined #ceph
[6:40] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[6:40] * loicd (~loic@magenta.dachary.org) has joined #ceph
[6:44] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[6:47] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has left #ceph
[7:08] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[7:48] * sileht (~sileht@sileht.net) Quit (Ping timeout: 480 seconds)
[7:52] * loicd (~loic@ has joined #ceph
[7:56] * sileht (~sileht@sileht.net) has joined #ceph
[8:07] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[8:08] * themgt (~themgt@96-37-22-79.dhcp.gnvl.sc.charter.com) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[8:19] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[8:30] * pentabular (~sean@adsl-70-231-142-192.dsl.snfc21.sbcglobal.net) has left #ceph
[8:56] * luckky (~73b83113@2600:3c00::2:2424) has joined #ceph
[9:02] * luckky (~73b83113@2600:3c00::2:2424) Quit (Quit: TheGrebs.com CGI:IRC (Ping timeout))
[9:04] * BManojlovic (~steki@ has joined #ceph
[9:20] * Leseb (~Leseb@ has joined #ceph
[10:00] * EmilienM (~EmilienM@ADijon-654-1-133-33.w90-56.abo.wanadoo.fr) has joined #ceph
[10:01] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[10:11] * steki-BLAH (~steki@237-173-222-85.adsl.verat.net) has joined #ceph
[10:15] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[10:22] * pradeep (~dfb65940@2600:3c00::2:2424) has joined #ceph
[10:24] <pradeep> hi
[10:31] * pradeep (~dfb65940@2600:3c00::2:2424) Quit (Quit: TheGrebs.com CGI:IRC (EOF))
[10:41] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:42] <Leseb> hi guys
[10:48] <joao> morning :)
[10:52] * loicd (~loic@ Quit (Read error: No route to host)
[10:54] * loicd (~loic@ has joined #ceph
[10:56] * deepsa (~deepsa@ Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[10:57] * deepsa (~deepsa@ has joined #ceph
[11:03] <jamespage> yehudasa, when you are around do you have some time to help me get radosgw running with apache2 mod-fcgid
[11:03] <jamespage> I'd like to provide a 'default' configuration in the Ubuntu packages for apache2 so its a bit easier to get up and running
[11:05] * deepsa_ (~deepsa@ has joined #ceph
[11:05] <jamespage> objective would be that it supports both mod_fcgid or mod_fastcgi depending on user preference
[11:05] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[11:05] * deepsa_ is now known as deepsa
[11:11] * deepsa_ (~deepsa@ has joined #ceph
[11:13] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[11:13] * deepsa_ is now known as deepsa
[11:24] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:29] * deepsa_ (~deepsa@ has joined #ceph
[11:29] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[11:33] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[11:33] * deepsa_ is now known as deepsa
[11:34] * steki-BLAH (~steki@237-173-222-85.adsl.verat.net) Quit (Ping timeout: 480 seconds)
[11:34] * fc (~fc@ Quit (Ping timeout: 480 seconds)
[11:44] * BManojlovic (~steki@ has joined #ceph
[12:00] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[12:00] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit ()
[12:09] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[13:08] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[13:09] * loicd (~loic@ has joined #ceph
[13:15] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[13:17] * deepsa (~deepsa@ has joined #ceph
[13:19] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[14:00] * loicd (~loic@ has joined #ceph
[14:14] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[14:24] * loicd (~loic@ has joined #ceph
[14:29] * benner (~benner@ Quit (Remote host closed the connection)
[14:29] * benner (~benner@ has joined #ceph
[14:48] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[15:06] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) has joined #ceph
[15:07] * loicd (~loic@ has joined #ceph
[15:43] * loicd1 (~loic@ has joined #ceph
[15:45] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[15:45] * loicd (~loic@jem75-2-82-233-234-24.fbx.proxad.net) has joined #ceph
[15:46] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) has joined #ceph
[15:51] * loicd1 (~loic@ Quit (Ping timeout: 480 seconds)
[15:52] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[15:53] * loicd (~loic@jem75-2-82-233-234-24.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[15:53] * deepsa (~deepsa@ has joined #ceph
[16:02] * lofejndif (~lsqavnbok@82VAAGEWA.tor-irc.dnsbl.oftc.net) has joined #ceph
[16:06] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:25] * cattelan (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) Quit (Ping timeout: 480 seconds)
[16:27] * danieagle (~Daniel@ has joined #ceph
[16:29] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[16:31] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[16:47] * loicd (~loic@ has joined #ceph
[16:52] * jmlowe (~Adium@173-161-9-146-Illinois.hfc.comcastbusiness.net) has joined #ceph
[16:53] * Leseb_ (~Leseb@ has joined #ceph
[16:53] * Leseb (~Leseb@ Quit (Read error: Connection reset by peer)
[16:53] * Leseb_ is now known as Leseb
[17:01] * Leseb (~Leseb@ Quit (Quit: Leseb)
[17:02] * loicd (~loic@ Quit (Quit: Leaving.)
[17:19] * sagelap (~sage@180.sub-70-197-147.myvzw.com) has joined #ceph
[17:24] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:28] * loicd (~loic@ has joined #ceph
[17:30] * glowell (~Adium@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[17:32] * jmlowe (~Adium@173-161-9-146-Illinois.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[17:33] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[17:36] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[17:47] * Tv_ (~tv@2607:f298:a:607:5905:afb4:18b:79c5) has joined #ceph
[17:56] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[17:56] * hjvjhv (~kuyggvj@9KCAABNMH.tor-irc.dnsbl.oftc.net) has joined #ceph
[17:58] * jmlowe (~Adium@2001:18e8:2:28a4:310b:3b7d:e7eb:7b8e) has joined #ceph
[17:59] * jmlowe1 (~Adium@140-182-59-128.dhcp-bl.indiana.edu) has joined #ceph
[18:00] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[18:00] * jmlowe2 (~Adium@140-182-59-128.dhcp-bl.indiana.edu) has joined #ceph
[18:00] * jmlowe1 (~Adium@140-182-59-128.dhcp-bl.indiana.edu) Quit (Read error: Connection reset by peer)
[18:02] * jlogan (~Thunderbi@2600:c00:3010:1:d92d:2040:c9ba:d0e3) has joined #ceph
[18:03] * hjvjhv (~kuyggvj@9KCAABNMH.tor-irc.dnsbl.oftc.net) has left #ceph
[18:04] * jmlowe2 (~Adium@140-182-59-128.dhcp-bl.indiana.edu) Quit (Read error: Operation timed out)
[18:04] * sagelap (~sage@180.sub-70-197-147.myvzw.com) Quit (Read error: Connection reset by peer)
[18:04] * sagelap (~sage@180.sub-70-197-147.myvzw.com) has joined #ceph
[18:05] * jmlowe (~Adium@2001:18e8:2:28a4:310b:3b7d:e7eb:7b8e) Quit (Read error: Operation timed out)
[18:06] * sagelap1 (~sage@ has joined #ceph
[18:06] * jmlowe (~Adium@129-79-193-122.dhcp-bl.indiana.edu) has joined #ceph
[18:12] * sagelap (~sage@180.sub-70-197-147.myvzw.com) Quit (Read error: Operation timed out)
[18:14] <sagewk> elder: ping!
[18:15] <damien> joshd: sorry I'd headed off when you responded yesterday, how would I print out the tv.tv_nsec in gdb?
[18:16] <joao> sagewk, let me know when wip-mon-gv hits testing or master; I expect to be done with the unit-tests Real Soon Now and will focus on getting the store conversion working next
[18:17] <sagewk> joao: how about you start working on conversion by merging it into a temp branch? i'd like to make sure it is sufficient for the conversion before merging it into master...
[18:17] * lofejndif (~lsqavnbok@82VAAGEWA.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[18:18] <joshd> damien: just 'print tv.tv_nsec' - it's already in frame 0
[18:18] <joshd> damien: btw I put your backtrace in http://www.tracker.newdream.net/issues/3133
[18:19] <damien> joshd: I can give an improved one now with the qemu bits in as well if it's any help?
[18:19] <joshd> they're probably irrelevant
[18:19] <damien> joshd: and I get "$1 = <optimised out>"
[18:20] <joshd> figures
[18:20] <damien> got that earlier, assumed I was doing it wrong
[18:20] <joshd> is there anything in particular that triggers this? like a really long-lived vm?
[18:20] <damien> it happens on boot
[18:20] <joshd> or one doing lots of i/o?
[18:20] <joshd> ah
[18:20] <damien> only with caching enabled
[18:20] <elder> sagewk, I'm here
[18:20] <damien> if I disable caching its fine
[18:21] <damien> its running windows on the guest if that helps on the sort of boot workload it's under
[18:21] <joshd> sigfpe generally occurs when you divide by zero, or INT_MIN/-1
[18:21] <joshd> the line in question is (double)sec() + ((double)nsec() / 1000000000.0L);, which is obviously not doing either of those
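A minimal sketch of the point joshd is making, written in Python rather than the original C++. The function name `utime_to_double` and the helper `fits_in_int32` are hypothetical, introduced only for illustration. The quoted expression divides by a nonzero constant, so neither a divide-by-zero nor the INT_MIN/-1 integer overflow can come from it directly.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def utime_to_double(sec: int, nsec: int) -> float:
    """Mirror the quoted expression: seconds plus nanoseconds scaled by 1e9."""
    # The divisor is a nonzero constant, so no divide-by-zero is possible here.
    return float(sec) + float(nsec) / 1_000_000_000.0


def fits_in_int32(value: int) -> bool:
    """Return True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# INT_MIN / -1 is mathematically 2**31, one past INT32_MAX, which is why a
# fixed-width integer division of that pair can trap on some hardware.
quotient = INT32_MIN // -1
```

Python integers are arbitrary precision, so the division above simply succeeds; the sketch only shows why the result cannot be stored in a 32-bit integer.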
[18:22] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:22] <joshd> is the host 32-bit?
[18:22] <damien> it's quite odd, other instances are fine, I've tried on multiple hosts, even destroyed and re-created the instance and it still suffers
[18:23] <damien> no 64bit
[18:23] <joshd> hmm, so it's something weird with the I/Os this particular instance does at boot
[18:24] <joshd> could you get a log with --debug-rbd 20 so we can see what pattern of requests triggers it?
[18:25] <damien> where can I add that in?
[18:25] * nick`m (~kuyggvj@9KCAABNMH.tor-irc.dnsbl.oftc.net) has joined #ceph
[18:25] <damien> qemu-kvm complains on its command line
[18:25] <sagewk> elder: all of the format 1 'rbd map' commands failed in qa last night
[18:25] <damien> and adding it to [global] in ceph.conf doesn't seem to have helped
[18:25] <sagewk> elder: presumably due to the new stuff in the testing branch...
[18:25] <elder> Well that sort of sucks.
[18:25] <sagewk> ENOENT
[18:25] <elder> Yes, presumably...
[18:26] <sagewk> probably something pretty simple
[18:26] <elder> OK, I'll look at it.
[18:26] <joshd> damien: add it to the qemu command line rbd drive spec, i.e. -drive rbd:pool/image:debug_rbd=20 etc.
[18:26] <damien> joshd: gotcha
[18:26] <joshd> probably want log_file=/something too
[18:28] <damien> joshd: http://damoxc.net/rbd.log.gz
[18:35] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[18:36] <joshd> damien: would it be possible to compile a debugging patch to print out the relevant values?
[18:36] * jlogan (~Thunderbi@2600:c00:3010:1:d92d:2040:c9ba:d0e3) Quit (Remote host closed the connection)
[18:37] * jlogan (~Thunderbi@2600:c00:3010:1:d92d:2040:c9ba:d0e3) has joined #ceph
[18:37] <damien> joshd: sure thing, to qemu or librbd?
[18:37] <joshd> librbd
[18:39] <damien> no probs
[18:41] <damien> what version will I be patching against?
[18:42] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) has joined #ceph
[18:43] <joshd> you're using argonaut, right?
[18:43] <nhmlap> ok, looking at our crc32c implementation, it looks like we are using slicing-by-8?
[18:43] <damien> currently got 0.51 installed on that box
[18:44] <joshd> oh, ok
[18:44] <damien> wanted to make sure it wasn't fixed in a newer release before bugging you guys
[18:44] <joshd> no, definitely haven't seen any SIGFPE bugs
[18:44] <joshd> I'll do the patch against 0.51
[18:45] <damien> cheers
[18:48] <damien> gotta pop out for a bit but will be back in ~1hr and will apply and build with the patch
[18:48] <joshd> ok, thanks
[18:49] * maelfius (~mdrnstm@ has joined #ceph
[18:52] * loicd (~loic@ has joined #ceph
[18:53] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[18:58] * thingee (~thingee@ps91741.dreamhost.com) has left #ceph
[18:58] <amatter> morning all. I'm trying to isolate a performance issue. I have 7 osd's, high spec machines with raid0 (2x 3tb 7.2k drives). All my pgs are active+clean. All access is via cephfs + samba on dedicated gateway machines, also high-spec with 4x bonded gb-nics. I'm getting a lot of slow requests in the monitor log, but there is barely any io traffic into/out of the cluster. Pointers on what to examine?
[19:00] * The_Bishop (~bishop@2a01:198:2ee:0:a519:8bc:b4e3:6dc6) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[19:00] <joshd> amatter: on the osds, use the admin socket to see requests in progress, i.e. on an osd machine 'ceph --admin-daemon /path/to/osd/admin/socket dump_ops_in_flight'
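Once the admin socket output is saved, the oldest requests can be picked out with a few lines of Python. This is a rough sketch only: the sample text and the field names `ops`, `description`, and `age` are assumptions for illustration, and the real `dump_ops_in_flight` output may be shaped differently depending on the Ceph version.

```python
import json

# Hypothetical sample; real output may use different field names.
SAMPLE_DUMP = """
{"num_ops": 3,
 "ops": [
   {"description": "osd_op(client.1 write)", "age": 58.2},
   {"description": "osd_sub_op(client.2)", "age": 3.1},
   {"description": "osd_op(client.3 read)", "age": 41.7}
 ]}
"""


def oldest_ops(dump_text, limit=2):
    """Return (age, description) pairs for the longest-running ops."""
    dump = json.loads(dump_text)
    # Sort by the assumed "age" field, oldest first.
    ops = sorted(dump.get("ops", []), key=lambda op: op["age"], reverse=True)
    return [(op["age"], op["description"]) for op in ops[:limit]]


for age, description in oldest_ops(SAMPLE_DUMP):
    print(f"{age:6.1f}s  {description}")
```

Running this against each osd's saved output would show which daemons are holding the oldest requests.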
[19:02] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[19:02] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) Quit (Ping timeout: 480 seconds)
[19:09] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) has joined #ceph
[19:09] <joshd> damien: https://gist.github.com/3708213
[19:11] * dmick (~dmick@2607:f298:a:607:9c04:b691:ad19:b925) has joined #ceph
[19:11] <amatter> joshd: http://pastebin.com/k55cWkBr showing 8 ops in progress, on several other osds there's like zero or one ops, however, it's not consistent which machine is backlogged
[19:12] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[19:12] <gregaf> damn, those are some old ops
[19:14] * Cube (~Adium@ has joined #ceph
[19:14] <jmlowe> good god I'm about to have a melt down here, tell me what makes the iphone 5 new!
[19:14] <Tv_> i see a lot of mentions of mds there
[19:15] <Tv_> amatter: are you using cephfs?
[19:15] <amatter> Tv_: yes
[19:15] <Tv_> amatter: as the mythbusters like to say.. there's your problem! ;)
[19:16] <dmick> jmlowe: it's one more than 4
[19:16] <Tv_> amatter: cephfs has not had as much qa coverage as the other parts.. though i don't know how it'd make an osd slow
[19:16] <Tv_> amatter: just eliminating a likely reason
[19:16] <amatter> Wow, now we've got 37 ops_in_flight on that machine, oldest 58 secs
[19:17] <sjust> TV_: that looks like an osd bottleneck, not a mds bottleneck
[19:17] <Tv_> sjust: yeah it does
[19:17] <amatter> yes, the osds can't keep up
[19:17] <sjust> amatter: can you try running rados bench?
[19:18] <Tv_> amatter: what filesystem are the osds on?
[19:18] <Tv_> amatter: where's their journal stored?
[19:19] <amatter> btrfs, the journal is on the same device, however, in one case, on one osd, I'm using an ssd for the journal just to compare
[19:19] <amatter> the device is a raid0, though
[19:20] <amatter> the issue seems more to be that the load is not being distributed very well, I'll have two osds with load averages of >5.0 and the rest 0 to 1.0
[19:20] * Ryan_Lane (~Adium@39.sub-166-250-35.myvzw.com) has joined #ceph
[19:20] <amatter> the two or three osds rotate around the group, not consistently the same machines
[19:21] <amatter> sjust: is rados bench the same as ceph tell osd x bench?
[19:22] <amatter> when ceph-osd is not running, using dd with sync I get consistently around 70MB/sec on each machine.
[19:22] <jmlowe> dmick: one more than 4, I must have it or maybe a 12 step program for apple device junkies
[19:22] <sjust> no, it's a separate utility
[19:23] <sjust> rados bench -p <scratch pool> 50 write
[19:23] <sjust> where <scratch pool> is a pool you don't mind scribbling on
[19:23] <amatter> also, I'm not swamping the cluster, I'm writing a few 10MB files to the cluster and everything comes grinding to a halt
[19:24] * chutzpah (~chutz@ has joined #ceph
[19:25] <amatter> sjust: okay, doing rados bench now
[19:28] <amatter> http://pastebin.com/LJKcYimB rados write bench. not sure what kind of figures I should be expecting to see. I would expect higher than what I'm seeing.
[19:28] <sjust> that indeed is not working
[19:29] <sjust> can you post the output of ceph pg dump and the binary from ceph osd getmap -o <map>
[19:35] <amatter> http://www.mattgarner.com/ceph/pgdump_12092012.txt http://www.mattgarner.com/ceph/osdmap_12092102.bin
[19:36] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[19:38] <dmick> jmlowe: I hear anxiolytics can help
[19:38] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:41] <joshd> amatter: regarding badly distributed load, how many pgs do you have in the pools you're using?
[19:42] <sjust> joshd: data and metadata appear to have 256 each
[19:42] <sjust> which is on the low side, but almost certainly not a problem
[19:43] * amatter_ (~amatter@ has joined #ceph
[19:43] <sjust> amatter: hs-san-1-ha has 2048, which I assume is the rbd pool?
[19:43] <sjust> or the filesystem?
[19:43] <amatter_> sorry, got disconnected for a minute. may have missed conversation
[19:43] * amatter (~amatter@ Quit (Ping timeout: 480 seconds)
[19:43] <sjust> amatter: what is hs-san-1-ha used for?
[19:44] <amatter_> cephfs
[19:44] <sjust> k
[19:45] <sjust> amatter_: can you restart osd.0 with debug tracker = 20 in your ceph.conf under [osd] ?
[19:45] <sjust> that should generate a better picture of where the bottleneck is
[19:45] <amatter_> sjust: ok
[19:46] <sjust> actually, come to think of it, can you post your ceph.conf first?
[19:46] <sjust> sorry, debug optracker = 20
[19:48] <nhmlap> sjust: so it looks like we are using a pretty good crc32 checksum implementation. I'm still trying to figure out if it's the same as the 3 cycles per byte version that intel developed a couple of years ago.
[19:48] <sjust> k
[19:49] <jmlowe> nhmlap: crc32 checksum, what are you using it for? any chance the objects are crc
[19:50] <sjust> nope, just the on wire messages
[19:50] <joao> is there any way to make teuthology ignore a failure on a daemon?
[19:50] <jmlowe> nhmlap: 'ed and we can get some nice scrubbing/repair action going?
[19:50] <joao> has anyone ever needed such a thing?
[19:50] <sjust> joao: what are you trying to do?
[19:50] <nhmlap> sjust: there is a very interesting implementation and whitepaper for a set of very fast implementations here: http://code.google.com/p/crcutil/
[19:50] <joao> sjust, failing monitors by injecting options to make them assert at predefined points
[19:51] <sjust> jmlowe: we do now do a deep scrub of object contents as well as metadata in the background courtesy of mikeryan :)
[19:51] <joao> to force another monitor to recover from that same failure
[19:51] <jmlowe> nhmlap: maybe even some dedupe, although that may take sha or md
[19:51] <nhmlap> jmlowe: Good question. One of the other guys can chime in. I'm just trying to figure out how to speed things up. ;)
[19:51] <jmlowe> sjust: damn, I could have used that back in April
[19:52] <sjust> heh
[19:52] <jmlowe> sjust: probably would have caught that fiemap problem before it trashed my vms
[19:52] <sjust> ouch
[19:52] * adjohn is now known as Guest6830
[19:52] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[19:56] * Guest6830 (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[19:57] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[19:57] * adjohn is now known as Guest6832
[19:57] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[19:58] <joao> well, got to change this whole approach
[19:59] <joao> it appears that making a daemon assert will make it hard (if not impossible) to bring it back up
[19:59] <gregaf> joao: look at the osd thrasher?
[19:59] <sjust> osd thrasher kills from the outside, he needs to kill from the inside
[20:00] <joao> yeah, what sjust said
[20:00] <sjust> you just need to hook into where ceph.py launches the daemons and prevent it from propagating the exception, instead restarting the daemon
[20:02] <joao> alright, will take a look at that
[20:02] * Guest6832 (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[20:05] <amatter_> sjust: http://pastebin.com/MeJdUiVD ceph.conf from osd.0
[20:05] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:06] <amatter_> also, I've enabled optracker debugging
[20:09] * mgalkiewicz (~mgalkiewi@staticline57333.toya.net.pl) has joined #ceph
[20:09] <sjust> ok, run rados bench again and then post the log
[20:09] <mgalkiewicz> hi guys
[20:09] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[20:10] <mgalkiewicz> sjust: we spoke recently about poor osd performance. I have installed a third osd and shut down the one you pointed out as a possible source of the problem
[20:10] <sjust> did it help?
[20:10] <mgalkiewicz> nope
[20:11] <mgalkiewicz> however I have collected new interesting data
[20:12] <sjust> ?
[20:12] * nhmlap perks up
[20:13] <mgalkiewicz> sjust: here is the output from sar command http://pastie.org/4709337
[20:14] <amatter_> sjust: http://www.mattgarner.com/ceph/rados_bench_12092102.txt http://www.mattgarner.com/ceph/ceph-osd.0.log
[20:16] <mgalkiewicz> it looks like osd is generating 250 IOPS but the clients dont generate more than 1 and I have around 22 clients
[20:16] <mgalkiewicz> 9 of them are using the storage for mongodb and 13 for postgresql
[20:18] <sjust> amatter_: ok, can you turn on debug optracker = 20 on all of your osds, repeat the rados bench run and repost?
[20:18] <mgalkiewicz> here is the average for shut down osd.0 (the machine still runs mon and mds but mds is not used by rbd) https://gist.github.com/3708795
[20:18] <sjust> osd0 appears to be behaving properly, so it's one of the other osds
[20:18] <amatter_> sjust: ok, thanks for checking
[20:19] <sjust> amatter_: no problem, I'd like to know what's slowing down the osds :)
[20:19] * jjgalvez (~jjgalvez@ has joined #ceph
[20:21] <amatter_> sjust: what am I looking for in the log to identify the issue
[20:21] <mgalkiewicz> osd1 (also running) generates even more IOPS around 400
[20:21] <sjust> I was just going to run all of the logs through my analyzer script
[20:21] <amatter_> oic
[20:22] <sjust> which you can find here if you're interested:
[20:22] <sjust> https://github.com/ceph/ceph-tools/blob/master/analysis/log_analyzer.py
[20:25] <sjust> mgalkiewicz: what are md0,1,2,3?
[20:25] <mgalkiewicz> it looks to me like the osd is writing much more data than the clients; of course there is always some overhead, but it is still quite a high value
[20:25] <mgalkiewicz> md3 is raid0 on 2 disks sda and sdb
[20:25] <sjust> k
[20:26] <mgalkiewicz> 0 swap, 1 boot, 2 root
[20:26] <sjust> k
[20:27] <sjust> this is data from the old node or the new node?
[20:27] <mgalkiewicz> first gist is from the new one and second one from the old one with osd0 shut down
[20:28] <sjust> amatter_: the script pieces together the events from all of the ops in the log and gives you the 100 slowest and the events that happened for those ops in chronological order
[20:28] <mgalkiewicz> do you want from osd1 (old but still running) as well?
[20:28] <sjust> amatter_: I hope to see that one osd is the replica or primary for all of those ops, that'll indicate that one disk is bad
[20:28] <sjust> amatter_: alternately, I hope to see a pattern as to which event is taking the bulk of the time
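The approach sjust describes (group events by op, order them chronologically, report the slowest ops) can be sketched as follows. This is not the actual `log_analyzer.py`; the event tuples, op ids, and event names below are hypothetical stand-ins for whatever the real script parses out of the osd logs.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Hypothetical pre-parsed events: (timestamp, osd, op id, event name).
EVENTS = [
    ("2012-09-12 12:30:00.000000", "osd.0", "op-a", "received"),
    ("2012-09-12 12:30:00.500000", "osd.0", "op-b", "received"),
    ("2012-09-12 12:30:02.000000", "osd.5", "op-a", "applied"),
    ("2012-09-12 12:30:00.750000", "osd.0", "op-b", "applied"),
]


def slowest_ops(events, limit=100):
    """Return (duration, op id, ordered events) rows, slowest op first."""
    by_op = defaultdict(list)
    for stamp, osd, op_id, event in events:
        by_op[op_id].append((datetime.strptime(stamp, FMT), osd, event))

    summary = []
    for op_id, items in by_op.items():
        items.sort(key=lambda item: item[0])  # chronological order
        duration = (items[-1][0] - items[0][0]).total_seconds()
        summary.append((duration, op_id, [(osd, ev) for _, osd, ev in items]))

    summary.sort(key=lambda row: row[0], reverse=True)
    return summary[:limit]


for duration, op_id, trail in slowest_ops(EVENTS):
    print(f"{op_id}: {duration:.3f}s {trail}")
```

If one osd shows up in the event trail of most of the slow ops, that points at a single slow disk; if one event name dominates the elapsed time, that points at a particular stage instead.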
[20:29] <nhmlap> sjust: I imagine we'll see that all of the ops are backing up on one or two OSDs.
[20:29] <sjust> nhmlap: that's the hope
[20:29] * gvkhjv (~kuyggvj@82VAAGE27.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:33] <amatter_> on side note: all of the osds use a usb memory stick for the OS, the raid0 is dedicated to the osd. The usb stick isn't very fast. maybe this is affecting things I didn't anticipate?
[20:33] <sjust> amatter_: hmm, it should not matter
[20:34] <sjust> unless the journal or data ended up inadvertently on the root fs
[20:34] * nick`m (~kuyggvj@9KCAABNMH.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[20:34] <jmlowe> atime?
[20:35] <amatter_> the journal is on the raid (except for one which is on an ssd) and the root fs is mounted noatime
[20:36] * pradeep (~dfe3ea16@2600:3c00::2:2424) has joined #ceph
[20:37] <amatter_> sjust: all the logs http://www.mattgarner.com/ceph/ceph-osd-12092012.tgz
[20:38] <pradeep> hi all, what is the solution to this problem:root@pradeep:~# sudo ceph-fuse -m :6789 /home/pradeep/cephfs ceph-fuse[2834]: starting ceph client ceph-fuse[2834]: ceph mount failed with (110) Connection timed out ceph-fuse[2832]: mount failed: (110) Connection timed out
[20:40] * pradeep (~dfe3ea16@2600:3c00::2:2424) Quit ()
[20:40] <amatter_> hi pradeep: perhaps no spaces between the ip address and port?
[20:40] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) has joined #ceph
[20:40] <mgalkiewicz> now I have disabled mds and iops dropped to 56 and mon dropped to 0 (all on the machine with osd0). Is it possible that the mon itself does so much writing?
[20:41] <joshd> the mons do get periodic stats from the osds and write them to disk, but I wouldn't expect them to be that much I/O
[20:43] <mgalkiewicz> I wasnt able to eliminate clock skew on mons even with mon clock drift allowed = 0.5
[20:45] <mgalkiewicz> and I am quite sure that ntp server synchronizes the time properly
[20:47] <joshd> you don't have clock_offset set, do you?
[20:47] <mgalkiewicz> dont think so
[20:48] * pentabular (~sean@adsl-70-231-142-192.dsl.snfc21.sbcglobal.net) has joined #ceph
[20:49] * pentabular is now known as Guest6838
[20:49] * Guest6838 is now known as pentabular
[20:49] <pentabular> okay now fix github... go!
[20:49] <pentabular> ...1 ...2 ...3 ...
[20:50] <pentabular> aw well, worked last time. \)
[20:50] * lofejndif (~lsqavnbok@1GLAAADMT.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:50] <pentabular> ;)
[20:52] <joshd> mgalkiewicz: try setting osd_mon_report_interval_min = 30 on the osds
[20:52] <joshd> it defaults to 5, so that will make them report stats less often
[20:53] <mgalkiewicz> ok but i dont think it will change anything
[20:54] <mgalkiewicz> is it possible to get live r/w statistics per rados pool or rbd volume?
[20:55] <joshd> per rbd volume, you can get the current ops via debugfs for kernel rbd, or the admin socket for librbd
[20:55] <mgalkiewicz> I have checked almost all clients and none of them are writing any significant amount of data
[20:56] <mgalkiewicz> hmm I must be using kernel rbd but I am not sure how to check this?
[20:56] <joshd> are you using rbd map and have /dev/rbd*?
[20:56] <joshd> that's kernel rbd
[20:56] <mgalkiewicz> yep on the client
[20:57] <mgalkiewicz> but I would prefer to get overall stats on osd side not the client side
[20:58] <amatter_> sjust: here's the result of the python script on those log files. Probably the same thing you're looking at. http://www.mattgarner.com/ceph/analysis.txt
[20:58] <joshd> mgalkiewicz: you can get overall stats from the osd (I forget if they're per-pg or pool), but they're not tied to a particular client
[20:58] <joshd> mgalkiewicz: via the admin socket, perf dump
[20:59] <mgalkiewicz> I have one pool for each client so it will be sufficient
[20:59] <mgalkiewicz> well maybe not for each but still is quite good
[21:00] <nhmlap> mgalkiewicz: earlier you were mentioning that the clients are writing out little data, but the OSDs were active right?
[21:00] <mgalkiewicz> yes
[21:00] <nhmlap> mgalkiewicz: out of curiosity, were the OSDs still active if the clients are idle?
[21:01] <joshd> amatter_: there's a large gap between journal and data on osd.5 sometimes: 2012-09-12 12:30:59.045798 (osd.5): sub_op_commit 2012-09-12 12:32:29.986875 (osd.5): sub_op_applied
[21:01] <joshd> amatter_: is there anything from btrfs in syslog on osd.5?
[21:02] <mgalkiewicz> it looks so I am not able to check all clients at once if they are idle or writing out little data but osds are writing constantly megabytes to disks
[21:03] <mgalkiewicz> joshd: perf dump returns quite nasty json is it possible to format it a little bit?
[21:04] <joshd> mgalkiewicz: pipe it to python -mjson.tool
[21:04] * jlogan (~Thunderbi@2600:c00:3010:1:d92d:2040:c9ba:d0e3) Quit (Quit: jlogan)
[21:04] <amatter_> joshd: no. that's the one with the ssd. Let me move the journal back to the raid
[21:06] <sjust> amatter_: looking now
[21:08] <mgalkiewicz> joshd: well it does not look like I would expect maybe it is more readable for you https://gist.github.com/3709165
[21:08] <amatter_> I've moved the journal on osd.5 back to the raid just in case of a problem with the ssd, waiting for things to settle again before another bench
[21:08] <nhmlap> mgalkiewicz: I ask, because it would be interesting to know what exactly is being written out to disk.
[21:09] * Ryan_Lane (~Adium@39.sub-166-250-35.myvzw.com) Quit (Quit: Leaving.)
[21:09] <mgalkiewicz> nhmlap: tell me what to do and I will provide you the output/logs
[21:10] <nhmlap> mgalkiewicz: one thing to keep in mind is that if your journal is on the same disk, and your writes are small, you may see far more data being written to disk because of the way the data is aligned for journal writes.
[21:10] <mgalkiewicz> well journal is on the same disk
[21:10] * Ryan_Lane (~Adium@145.sub-166-250-37.myvzw.com) has joined #ceph
[21:11] <mgalkiewicz> but we can move it
[21:12] <amatter_> hmm. still dismal bench results
[21:12] <nhmlap> mgalkiewicz: Any idea how big the writes from the clients are?
[21:12] <mgalkiewicz> I can estimate give me a sec
[21:13] <amatter_> could there be some sync() in the ceph logging which is holding things up due to slow io on the root partition? Let me move ceph logging to tmpfs
[21:13] * jlogan (~Thunderbi@2600:c00:3010:1:e09b:e760:9ba1:c8ae) has joined #ceph
[21:14] <sjust> amatter_: that might help
[21:14] <sjust> amatter_: are you running 12.04?
[21:15] <amatter_> sjust: yes
[21:15] <sjust> btrfs?
[21:15] <amatter_> sjust: yes
[21:16] <sjust> is osd.4 the one with the ssd journal?
[21:16] <amatter_> osd.5
[21:17] <joshd> sjust: do you notice anything in mgalkiewicz's perf dump?
[21:17] <sjust> joshd: haven't looked at it yet
[21:17] <sjust> amatter_: let's try marking osd.5 out, waiting for recovery, and see if that helps
[21:18] <sjust> ceph osd out 5
[21:18] <sjust> and then wait until all pgs are active+clean
[21:18] <sjust> it's taking osd.5 a minute in several cases to apply transactions to btrfs
[21:18] <amatter_> sjust: ok, that could be a long time
[21:18] <sjust> actually, how much data is there?
[21:18] <amatter_> 4277 GB used
[21:19] <mgalkiewicz> nhmlap: I dont expect it to be more than 200kB/s (writes) and even less reads. These values are rather overestimated.
[21:19] <sjust> amatter_: hmm, one sec
[21:19] <nhmlap> mgalkiewicz: Are you talking about the throughput, or the size of the individual IOs?
[21:20] <mgalkiewicz> writes and reads to rbd volume from all clients combined
[21:20] <amatter_> I think there was a problem with the journal being on the ssd (the ssd is giving poor performance) but now I've moved the journal back to the raid on that one (osd.5)
[21:21] <nhmlap> mgalkiewicz: hrm, what I'm interested in is the size of the individual IOs that the clients are generating. IE are there tons of very very small writes.
[21:21] <sjust> but no improvement?
[21:21] * sakib (~sakib@ has joined #ceph
[21:21] <joshd> mgalkiewicz: if you do perf dump, sleep 5, then another perf dump you can see how much I/O the osd is seeing in that time
[21:21] <amatter_> sjust: not apparently
[21:22] <joshd> mgalkiewicz: op_w, op_r, etc are counts of writes and reads
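The two-sample approach joshd describes (perf dump, sleep 5, perf dump again) can be sketched as a counter diff. This is a minimal illustration, not a Ceph tool: the sample JSON is invented and heavily trimmed, and the assumption that `op_w` and `op_r` sit under an `osd` section is a guess at the dump layout, so adjust the keys to whatever the real output contains.

```python
import json


def counter_deltas(before, after, prefix=""):
    """Recursively diff two perf-dump-style dicts; return only counters that changed."""
    deltas = {}
    for key, new in after.items():
        old = before.get(key)
        path = prefix + key
        if isinstance(new, dict) and isinstance(old, dict):
            # Descend into nested sections, building dotted counter names.
            deltas.update(counter_deltas(old, new, path + "."))
        elif isinstance(new, (int, float)) and isinstance(old, (int, float)):
            if new != old:
                deltas[path] = new - old
    return deltas


# Hypothetical samples taken 5 seconds apart (values are made up).
first = json.loads('{"osd": {"op_w": 1000, "op_r": 200, "op": 1200}}')
second = json.loads('{"osd": {"op_w": 1040, "op_r": 200, "op": 1240}}')

interval = 5.0  # seconds between the two dumps
for name, delta in sorted(counter_deltas(first, second).items()):
    print(f"{name}: +{delta} ({delta / interval:.1f}/s)")
```

Counters that did not move between the two samples are left out, which keeps the output short on a mostly idle OSD.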
[21:22] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[21:25] <mgalkiewicz> ok so sar reports that the average request size in sectors is around 17 on each client
[21:26] <nhmlap> how big are your sectors?
[21:26] <mgalkiewicz> checking
[21:26] * loicd (~loic@ has joined #ceph
[21:27] * mistur (~yoann@kewl.mistur.org) Quit (Ping timeout: 480 seconds)
[21:29] <mgalkiewicz> the only info I can find about rbd volume is size 5120 MB in 1280 objects
[21:30] <mgalkiewicz> fdisk reports 512bytes per sector
[21:30] * mistur (~yoann@kewl.mistur.org) has joined #ceph
[21:31] <nhmlap> mgalkiewicz: Ok, that makes it sound like your average write size is around 4KB
[21:31] <nhmlap> er 8KB, sorry
[21:32] <mgalkiewicz> yep
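The arithmetic behind the corrected "8KB" figure, as a small sketch using only the two numbers quoted above (17 sectors per request from sar, 512 bytes per sector from fdisk). The function name is illustrative.

```python
SECTOR_BYTES = 512         # bytes per sector, as reported by fdisk in the log
AVG_REQUEST_SECTORS = 17   # average request size in sectors, as reported by sar


def avg_request_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Convert an average request size in sectors to bytes."""
    return sectors * sector_bytes


size = avg_request_bytes(AVG_REQUEST_SECTORS)
print(f"{size} bytes = {size / 1024:.1f} KiB")
```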
[21:33] <nhmlap> Unfortunately the averages don't tell us the whole story. It could typically be much smaller with occasional big writes.
[21:33] <mgalkiewicz> there are 0.33 iops per second
[21:33] <nhmlap> mgalkiewicz: yikes
[21:34] <nhmlap> mgalkiewicz: this is a database ontop of rbd?
[21:34] * Ryan_Lane (~Adium@145.sub-166-250-37.myvzw.com) Quit (Quit: Leaving.)
[21:34] <mgalkiewicz> yep 13 postgresql clients and 9 mongodb
[21:35] <nhmlap> mgalkiewicz: Have you ever used fio or any other benchmarking tools?
[21:35] <mgalkiewicz> nope
[21:36] <amatter_> so I've moved all the ceph logging on osds to tmpfs but no improvement. going to regather the logs and do another ops analysis
[21:37] <mgalkiewicz> I didnt need benchmark to realize that all databases are very slow. When everything worked fine I was satisfied.
[21:37] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[21:38] <nhmlap> mgalkiewicz: Hrm, would you mind stopping your clients and running "rados -p <pool> -b 8192 bench 300 write -t 16"
[21:38] <nhmlap> mgalkiewicz: ah, sorry, I'm coming into this late. It worked fine before?
[21:39] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[21:40] <mgalkiewicz> nhmlap: I think so but I dont really remember when
[21:40] <mgalkiewicz> it is not possible to stop the clients
[21:40] <joshd> nhmlap: the problem was high latency for each op (around 3-4 seconds)
[21:40] <mgalkiewicz> I have already did the bench last time
[21:40] <mgalkiewicz> done*
[21:40] <joshd> we thought one of the osds was the cause, but replacing it didn't help
[21:41] <joshd> mgalkiewicz: you said turning off the mds and monitors left no I/Os going?
[21:42] * BManojlovic (~steki@ has joined #ceph
[21:42] <mgalkiewicz> but on the node with osd.0 which is shut down right now
[21:43] <joshd> but even with the monitors on a separate node, and the osd disks relatively idle, the high latency is still there?
[21:44] <sjust> amatter_: cool
[21:44] <mgalkiewicz> node1: osd.0 (down), mds.0 (up), mon0 (up), node2: osd1 (up), mds1 (up), mon1 (up), node3: mon2 (up), node4: osd2 (up)
[21:46] <mgalkiewicz> nhmlap: so how to check what osd is writing to disk?
[21:48] <nhmlap> mgalkiewicz: I usually just watch sar/collectl for the disks with 1 second intervals to get a rough idea.
[21:49] <nhmlap> mgalkiewicz: based on your sar output, it doesn't look like your raid arrays have much activity though.
[21:49] <nhmlap> at least on that node.
[21:51] <mgalkiewicz> http://pastie.org/4709337 looks rather busy (vg-ceph) considering very little writes from clients
[21:51] <amatter_> http://www.mattgarner.com/ceph/ops_analysis2.txt
[21:51] <nhmlap> mgalkiewicz: do any of the OSD nodes have high wait, svctms, or util% on any of the raid arrays? Do any of them have high cpu utilization in general?
[21:52] <amatter_> looks like a lot of waiting on osd.4 but I think that's just because once one gets backed up it just keeps getting further behind.
[21:53] <mgalkiewicz> look pastie above 40 %utils and 137 await isnt high enough?
[21:53] <nhmlap> mgalkiewicz: with small IOs, I've seen 100% util and 6000+ wait times!
[21:54] <nhmlap> mgalkiewicz: but with more IOPs than what you are getting.
[21:54] <nhmlap> mgalkiewicz: I'm just thinking that it doesn't look like the layer below ceph is holding you back.
[21:55] <mgalkiewicz> this is the opposite conclusion from sjust's who said last time that filesystem or disks might be slow
[21:55] <mgalkiewicz> so I have checked them and for me sth is wrong with ceph
[21:56] <nhmlap> mgalkiewicz: that probably means one of us is wrong. ;)
[21:56] <mgalkiewicz> for sure but still I have no idea how to debug it further
[21:56] <nhmlap> mgalkiewicz: Does the sar output look like that for all of your OSDs? I don't know how many nodes you have.
[21:57] * jjgalvez1 (~jjgalvez@ has joined #ceph
[21:57] <nhmlap> Also, you mentioned you had run rados bench before. Did you do it with small writes? What was the performance?
[21:57] <mgalkiewicz> this is the fastest node installed today (only osd) without mon and mds
[21:58] * senner (~Wildcard@68-113-228-222.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[21:58] <nhmlap> mgalkiewicz: how many osds total, and how many osds per node?
[21:58] <sjust> amatter_: can you try it again, but with rados -p <pool> bench 100 write -t 1 -b 1024 ?
[21:59] <sjust> that will do a small number of small writes
[21:59] <mgalkiewicz> https://gist.github.com/3693526
[22:00] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[22:00] <nhmlap> Ah, interesting.
[22:01] <nhmlap> mgalkiewicz: Was that test run on a client node?
[22:01] <mgalkiewicz> nhmlap: 2 running osd, one osd per node (physical server)
[22:01] <mgalkiewicz> no on osd
[22:02] <Tv_> who knows redmine well.. why isn't #3140 visible on http://tracker.newdream.net/rb/master_backlogs/lab ?
[22:02] <Tv_> oh is it the bug thing?
[22:02] <Tv_> yup
[22:02] <Tv_> hate that
[22:03] <nhmlap> mgalkiewicz: networking between the two nodes is reasonable?
[22:03] <Tv_> i dislike redmine :(
[22:04] <nhmlap> Tv_: I'm not fond of it.
[22:04] <mgalkiewicz> nhmlap: yep 100mbit not very busy
[22:04] <dmick> all bug trackers suck
[22:04] <dmick> some suck nonsuicidally, but they all suck
[22:05] <gregaf> mgalkiewicz: sorry, 100mbit, not 1gigabit?
[22:05] <nhmlap> mgalkiewicz: what's the ping time between the two servers?
[22:05] <sjust> can you try running iperf from one to the other?
[22:05] <Tv_> whee i just turned off all other categories in the admin view ;)
[22:05] <amatter_> sjust: new bench http://pastebin.com/i97N15eq
[22:05] <Tv_> there are no bugs in the sepia lab now!
[22:06] <mgalkiewicz> 100mbit not 1 gigabit because it is not easy to change mons ip address to force data to go through second interface
[22:06] <nhmlap> Tv_: awesome
[22:06] <sjust> do you have the corresponding logs?
[22:06] <mgalkiewicz> ping between two nodes time=0.487 ms
[22:06] <sjust> amatter_: yeah, looks like some ops are taking 10 times longer
[22:07] <sjust> op tracker logs should give some useful insight
[22:07] <mgalkiewicz> but the clients communicate with ceph components through 1gigabit
[22:07] <mgalkiewicz> but the nodes among themselves through 100mbit
[22:07] <sjust> mgalkiewicz: doesn't matter, osd-osd communication will be the bottleneck
[22:07] <sjust> assuming you have replication turned no
[22:07] <sjust> *on
[22:07] <sjust> or rather, don't have it turned off
[22:08] <nhmlap> sjust: he's not getting anywhere close to 100mbit speeds though.
[22:08] <mgalkiewicz> yes I have replication but still 100mbits are not saturated
[22:09] <amatter_> perhaps I have too many pgs?
[22:09] * Ryan_Lane (~Adium@145.sub-166-250-37.myvzw.com) has joined #ceph
[22:09] <sjust> amatter_: no, more likely too few
[22:09] <amatter_> sjust: 2048 between 7 osd nodes
[22:10] <nhmlap> mgalkiewicz: The latencies in your rados bench test sure are high though.
[22:11] <sjust> amatter_:
[22:11] <sjust> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0 crash_replay_interval 45
[22:11] <sjust> pool 1 'metadata' rep size 4 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 12 owner 0
[22:11] <sjust> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0
[22:11] <sjust> pool 3 'hs-san-1' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 16 owner 18446744073709551615
[22:11] <sjust> pool 6 'hs-san-1-ha' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 79 owner 0
[22:11] <sjust> pool 8 'test-pool' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 735 owner 18446744073709551615
[22:11] <sjust> hs-san-1-ha has 2048
[22:11] <mgalkiewicz> lets investigate why, what might be the reason and how to check it
[22:11] <sjust> data and metadata have 256 each
[22:11] <sjust> and test-pool has 8
[22:11] <sjust> you want around 100/osd for each pool
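sjust's rule of thumb (around 100 PGs per OSD for each pool) can be turned into a quick estimate. This is only a sketch of that rule: the function name is made up, and rounding up to a power of two is an added assumption that the log itself does not state.

```python
def suggested_pg_num(num_osds, pgs_per_osd=100, round_to_power_of_two=True):
    """Estimate a pool's pg_num from the ~100 PGs per OSD rule of thumb."""
    target = num_osds * pgs_per_osd
    if not round_to_power_of_two:
        return target
    # Assumption (not from the log): round up to the next power of two.
    pg_num = 1
    while pg_num < target:
        pg_num *= 2
    return pg_num


# amatter_ mentions 7 OSD nodes; test-pool currently has pg_num 8.
print(suggested_pg_num(7, round_to_power_of_two=False))
print(suggested_pg_num(7))
```

By this estimate the 8-PG pools are far below the target for a 7-OSD cluster, while 256 is under it but of the same order, which fits sjust's remark that 256 is not low enough to be a big problem.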
[22:12] <amatter_> sjust: true, I only use hs-san-1-ha for cephfs
[22:12] <nhmlap> mgalkiewicz: your drive wait and service time don't seem high enough to justify ~1s latencies in rados bench.
[22:12] <sjust> but that doesn't explain the variance in latency for the most recent test
[22:12] <amatter_> but there may be an issue with the metadata pool
[22:12] <amatter_> for cephfs
[22:12] <sjust> maybe, the 256 isn't really low enough to be a big problem
[22:12] <sjust> anyway, do you have the op tracker logs from the most recent rados bench?
[22:12] <nhmlap> mgalkiewicz: and those pauses...
[22:14] <nhmlap> mgalkiewicz: could you try doing 8kb writes to the underlying OSD filesystems?
[22:14] <mgalkiewicz> k
[22:15] <nhmlap> mgalkiewicz: it'd need to be lots of 8kb files. fileop from iozone3 can do that.
[22:16] <mgalkiewicz> thought about dd
[22:16] <mgalkiewicz> isnt good enough?
[22:16] <nhmlap> you could do that, but the io will look a lot different than what the OSD does.
[22:17] <nhmlap> If the result is pathetically slow than it's good enough. ;)
[22:17] <nhmlap> s/than/tehn
[22:17] <nhmlap> bah
[22:20] <amatter_> sjust: created test-pool2 w/2048 pgs. Much better bench results: http://pastebin.com/rJKWq59U http://pastebin.com/ybkpqXxc . I should have realized the bench on the pool w/8 pgs wasn't reliable
[22:21] <sjust> each disk is a raid0 of two 7200 spinning disks, right?
[22:21] <nhmlap> amatter_: I've done that too. No worries. :)
[22:21] <amatter_> sjust: yes
[22:22] <sjust> can you post the op tracker logs from that/
[22:22] <sjust> ?
[22:22] <sjust> from the one with small few writes
[22:22] <sjust> or redo it
[22:23] <sjust> a small portion of the writes are taking >.5s while the rest are <.05s
[22:23] <sjust> that is what is killing your performance
[22:23] <amatter_> is there a way to reset the log w/o restarting the osd? just "cat > ceph-osd0.log"
[22:23] <sjust> that would probably do it
[22:25] <amatter_> sjust: the bench w/ -t 1 -b 1024?
[22:26] <gregaf> amatter_: sjust: don't think so, just replaces the filesystem link to the inode data...
[22:26] <mgalkiewicz> nhmlap: https://gist.github.com/3709630
[22:26] <sjust> gregaf: ok, how do you reset the log?
[22:26] * Ryan_Lane1 (~Adium@39.sub-166-250-35.myvzw.com) has joined #ceph
[22:26] <gregaf> once upon a time you could send a SIGHUP, not sure if there's any way to do it now or not
[22:27] <sjust> truncating the log should do it
[22:27] <amatter_> gregaf: it appears the cat > worked
[22:27] * sakib (~sakib@ Quit (Quit: leaving)
[22:28] <gregaf> yeah, I'm just confused about how the piping works :)
[22:28] <gregaf> thought it was an unlink rather than a truncate
[22:28] * Ryan_Lane (~Adium@145.sub-166-250-37.myvzw.com) Quit (Ping timeout: 480 seconds)
[22:28] <amatter_> oic
[22:29] <nhmlap> mgalkiewicz: huh. You don't have any weird settings that would make buffered IO flush rapidly do you? That seems very slow...
[22:31] <mgalkiewicz> without conv=fdatasync it is 1.2GB/s
[22:31] <mgalkiewicz> /dev/mapper/vg-ceph on /srv/ceph type btrfs (rw,relatime,space_cache)
[22:31] <amatter_> sjust: here's the analysis of the 2048 pg test2 pool using the one-op-at-a-time bench: http://www.mattgarner.com/ceph/ops_analysis3.txt
[22:32] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:32] * pentabular (~sean@adsl-70-231-142-192.dsl.snfc21.sbcglobal.net) has left #ceph
[22:32] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[22:33] * mikegrb (~michael@mikegrb.netop.oftc.net) has joined #ceph
[22:34] <nhmlap> mgalkiewicz: hrm, you could increase the count to like 100000
[22:34] <nhmlap> that's probably too small of a filesize.
[22:34] <mgalkiewicz> with or without fdatasync?
[22:34] <nhmlap> with please
[22:35] <sjust> amatter_: are the clocks the same on the 7 nodes?
[22:36] <nhmlap> for reference, if I do "dd bs=8096 count=100000 if=/dev/zero of=blah.out conv=fdatasync" on my old desktop with an ancient 250MB sata disk, I get about 30MB/s.
[22:36] <nhmlap> sorry, 250GB, no MB.
[22:36] <nhmlap> not that old. ;)
[22:36] <amatter_> sjust: should be, they sync with ntp via cronjob. let me check
[22:37] <sjust> amatter_: they seem to be off by at least a few seconds
[22:37] <damien> joshd: sorry took a little longer than anticipated, made the patched build and generated a new log, was that what you wanted?
[22:37] <mgalkiewicz> nhmlap: 809600000 bytes (810 MB) copied, 7.3204 s, 111 MB/s
[22:38] <nhmlap> mgalkiewicz: much better
[22:38] <joshd> damien: yeah, it should hopefully contain the time values that cause the crash
[22:38] <Tv_> Quote of the day: " i looked into the ceph source codes, it seems very complicated for me."
[22:38] <damien> joshd: http://damoxc.net/rbd.log.gz
[22:39] <nhmlap> mgalkiewicz: so buffered writes to a single file are reasonably fast. That's not what ceph is doing, but at least it means your disk isn't totally broken.
[22:40] <mgalkiewicz> nice
[22:40] <dmick> you would think something as mature as gdb would actually have complete doc by now, wouldn't you? well, if you did, you'd be wrong
[22:41] <nhmlap> mgalkiewicz: So I'm back to wondering why the ops are taking so long to complete when you only issue 1 tiny op at a time, when the underlying svctime for requests to the disk aren't *that* bad.
[22:42] <nhmlap> oh wait, was the database still writing data out during your rados bench test?
[22:42] <mgalkiewicz> like I said clients cannot be stopped so all clients were probably writing sth
[22:43] <nhmlap> ah, so they could have been hogging some queue.
[22:44] <mgalkiewicz> but still they are not generating many data
[22:45] <amatter_> sjust: ok. I've sync'd all the clocks on the osds, mons and mdss
[22:45] <amatter_> sjust: rerun the bench analysis?
[22:45] <sjust> yeah, it was throwing off the script
[22:46] <joshd> damien: do you still get the same backtrace there? it looks like it's crashing in a different place, or not getting to flush the log before exiting
[22:46] <nhmlap> mgalkiewicz: It'd be helpful if we had a tool that could look at all of the objects on the filesystem and plot a histogram of the object sizes.
[22:46] <mgalkiewicz> does it exist?
[22:46] <nhmlap> mgalkiewicz: nope. :(
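A tool like the one nhmlap wishes for can be approximated in a few lines. The sketch below walks a directory tree and buckets file sizes by power of two. It is not an existing Ceph utility, and it only reports how large the files are on disk, not the size of the individual writes that produced them.

```python
import os
from collections import Counter


def size_histogram(root):
    """Count files under root, keyed by power-of-two size bucket (upper bound in bytes)."""
    buckets = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file disappeared while we were walking
            # Smallest power of two >= size; empty files get their own bucket 0.
            bucket = 0 if size == 0 else 1 << (size - 1).bit_length()
            buckets[bucket] += 1
    return buckets


def print_histogram(buckets, width=50):
    """Render the buckets as a simple text bar chart."""
    if not buckets:
        return
    peak = max(buckets.values())
    for bucket in sorted(buckets):
        bar = "#" * max(1, buckets[bucket] * width // peak)
        print(f"<= {bucket:>12} B {buckets[bucket]:>8} {bar}")
```

It could be pointed at an OSD data directory, for example `print_histogram(size_histogram("/srv/ceph/current"))`, where the path is a guess based on the mount point mentioned later in the log.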
[22:47] <mgalkiewicz> so maybe we should enable more debugging options and just go through them
[22:48] <mgalkiewicz> go through generated logs
[22:48] <nhmlap> mgalkiewicz: ok, would you mind installing collectl and running "collectl -sD -oT -i0.1"?
[22:48] * jmlowe (~Adium@129-79-193-122.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[22:49] <mgalkiewicz> ok
[22:49] <nhmlap> mgalkiewicz: that will take samples for your disks every 0.1 seconds. With the higher resolution we might get a better idea of what the typical io size is like.
[22:50] <nhmlap> mgalkiewicz: we could also push up debugging like you said. That will report the sizes, but we'll have to wade through the logs which will be more work.
[22:50] <amatter_> sjust: http://www.mattgarner.com/ceph/ops_analysis4.txt
[22:52] <mgalkiewicz> nhmlap: will this command end or should I kill it after a while?
[22:52] <sjust> amatter_: can you try turning journal aio = true on?
[22:52] <nhmlap> mgalkiewicz: won't end, just kill it after you are satisfied.
[22:53] <sjust> and restarting the osds
[22:53] <nhmlap> mgalkiewicz: I forgot, there is also a --dskfilt option that will let you filter which devices to monitor.
[22:53] <nhmlap> might come in handy
[22:54] <mgalkiewicz> https://gist.github.com/3709855
[22:55] <mgalkiewicz> ceph filesystem is dm-2
[22:55] <nhmlap> hrm, I think there must be an option I'm missing to up the time resolution. It's annoying not having the sub-second values.
[22:55] <nhmlap> I mean the data is there, just not the label.
[22:57] <nhmlap> hrm, resolution still isn't really high enough to get an idea of how big typical writes are.
[22:58] <nhmlap> looks like collectl thinks it's more around 4-5KB though.
[22:58] <damien> joshd: yeah that was the same trace
[22:58] <nhmlap> for the averages.
[22:59] <damien> joshd: http://dpaste.com/800142/
[22:59] <nhmlap> And the journal writes will be a minimum of 8KB if I remember right, so I'm going to guess that the actual data writes are going to be more like 2KB.
[23:00] <mgalkiewicz> nhmlap: check it with -i0.01?
[23:00] <nhmlap> mgalkiewicz: you could try. I'm not sure at what point collectl will become inaccurate. try with -oTm this time instead of -oT
[23:01] <nhmlap> probably only take a couple of seconds of output though. :)
[23:01] <joshd> damien: shoot, it's converting the other utime_t, elapsed
[23:01] <nhmlap> oh, you can do --dskfilt dm-2 too.
[23:02] <nhmlap> theoretically that should limit output to just your osd disk.
[23:02] <joshd> damien: in gdb, can you go to frame 1 (f 1) and print elapsed?
[23:02] <joshd> probably optimized out, but worth a shot
[23:02] <mgalkiewicz> https://gist.github.com/3709909
[23:02] <mgalkiewicz> nhmlap: already pasted sry
[23:03] <mgalkiewicz> without dskfilt
[23:03] <joshd> damien: otherwise, modify the patch to print elapsed's sec and nsec values instead of start_times
[23:03] <nhmlap> mgalkiewicz: no worries
[23:05] <nhmlap> mgalkiewicz: lots of cases of 4KB writes.
[23:07] <amatter_> sjust: okay, enabled aio on all osds but I don't think it went so well: died on the third op : http://pastebin.com/ct0QqkiC
[23:08] <damien> joshd: $1 = {tv = {tv_sec = 0, tv_nsec = 58000}}
[23:09] <sjust> can you restart with debug osd = 20, debug filestore = 20, debug journal = 20 on one of the nodes/
[23:10] * jmlowe (~Adium@173-15-112-198-Illinois.hfc.comcastbusiness.net) has joined #ceph
[23:10] <amatter_> sjust: http://pastebin.com/ct0QqkiC
[23:10] <amatter_> sjust: oops: http://www.mattgarner.com/ceph/ops_analysis5.txt
[23:10] <nhmlap> mgalkiewicz: if you look in the underlying osd mount point under the current directory, there should be a bunch of subdirectories. Would you mind just doing a "ls -alR" on one of the directories that starts with a number?
[23:10] <sjust> are the osds still running?
[23:10] <sjust> or did they crash?
[23:10] <nhmlap> mgalkiewicz: the actual directory is named "current"
[23:10] <amatter_> sjust: still running
[23:11] <sjust> oh, nvm on the additional debugging
[23:11] <amatter_> sjust: i can restart the bench and then it goes a few ops then hangs again
[23:11] <nhmlap> mgalkiewicz: there should be files of various sizes.
[23:12] <nhmlap> mgalkiewicz: not as good as a histogram, but maybe we can get a rough idea.
[23:12] <sjust> amatter_: all pgs active+clean?
[23:12] <amatter_> sjust: yes
[23:12] <amatter_> sjust: pgmap v301800: 4880 pgs: 4880 active+clean; 2003 GB data, 4302 GB used, 27351 GB / 31671 GB avail
[23:13] <joshd> damien: hmm, that doesn't cause a SIGFPE on my machine
[23:13] <sjust> can you post the log tar ball?
[23:15] <mgalkiewicz> nhmlap: https://dl.dropbox.com/u/5820195/ls_current.gz
[23:16] <amatter_> sjust: http://www.mattgarner.com/ceph/ceph-osd.log.tgz
[23:16] <sjust> amatter_: something really crazy is going on with your journals, turn aio back off, turn on debug filestore = 20, debug journal = 20, debug osd = 20, debug optracker = 20, truncate the logs, rerun rados bench ... -t 1 -b 1024, and post the log tar ball
[23:17] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[23:18] <amatter_> seem to be getting much better read performance from cephfs w/ aio
[23:18] <sjust> aio won't affect read performance, probably something else
[23:18] <joshd> damien: that's very strange. could you do a 'thread apply all bt' in gdb?
[23:18] <amatter_> hmm. ok.
[23:19] <nhmlap> mgalkiewicz: are you using rbd caching?
[23:20] <mgalkiewicz> since I am using kernel rbd it is not possible or sth
[23:20] <mgalkiewicz> at least this is what I was told here
[23:20] <amatter_> hmm, with all that logging, i'm going to run out of ram on tmpfs before I can complete the bench
[23:21] <amatter_> we'll see. :)
[23:22] <damien> joshd: do you know if there's an easy way to save all that to a file?
[23:23] <nhmlap> mgalkiewicz: Ok. so the ls idea ended up not being useful. I think if I remember right it's maybe doing small writes into existing 4MB objects for RBD or something.
[23:23] <joshd> damien: I'm not sure
[23:23] <nhmlap> mgalkiewicz: joshd would know how it works in that circumstance.
[23:24] <nhmlap> mgalkiewicz: but my take away from all of this so far is that writes are almost certainly 4k or smaller, and possibly much smaller.
[23:25] <nhmlap> mgalkiewicz: That would explain seeing far more traffic on the backend OSD than on the client due to the additional 4K that is used to align the journal writes.
[23:25] <joshd> nhmlap: there's no caching in the kernel client, although there can be a small amount of request merging
[23:26] <mgalkiewicz> I have found sth like this http://www.digipedia.pl/usenet/thread/11905/8845/
[23:26] <mgalkiewicz> I am using argonaut
[23:27] * jmlowe (~Adium@173-15-112-198-Illinois.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[23:29] <damien> joshd: http://dpaste.com/800173/ threads 2-16 appear to be missing
[23:30] <nhmlap> mgalkiewicz: ok, so 8KB write for postgres. You also have clients doing mongodb or something?
[23:30] <amatter_> sjust: just waiting for all the pgs to settle due to restarting the osds
[23:30] <mgalkiewicz> yes
[23:30] <nhmlap> And yeah, it might be worth trying out 0.51.
[23:31] <mgalkiewicz> how safe is upgrade from 0.48argonaut to 0.51?
[23:31] <mgalkiewicz> is it possible to do this without any downtime?
[23:32] <gregaf> it should be; I think we're testing it now — but I wouldn't do it in production ;)
[23:32] <nhmlap> mgalkiewicz: do you have any other hardware you could test on first?
[23:32] <mgalkiewicz> staging cluster
[23:32] <joshd> damien: unfortunately nothing else looks suspicious there
[23:33] <mgalkiewicz> 48 is lts so I would prefer using it anyway
[23:33] <amatter_> sjust: new bench http://pastebin.com/h0sejUVy uploading log tarball now
[23:36] <nhmlap> mgalkiewicz: what version of mongodb are you running?
[23:37] <mgalkiewicz> 2.0.6
[23:37] <mgalkiewicz> or close to not sure if all db are already upgraded
[23:38] <nhmlap> mgalkiewicz: ok. It looks like in versions prior to 2.0 there was a server level read/write lock. IE if a write is happening and is slow, the entire thing stops.
[23:39] <nhmlap> mgalkiewicz: In newer versions there is some kind of locking-with-yield support.
[23:39] <mgalkiewicz> I am sure that all dbs are 2.0.X
[23:40] <amatter_> sjust: ok, logs from all osds are uploaded here: http://www.mattgarner.com/ceph/ceph-osd.log.tgz warning, big files
[23:43] <nhmlap> mgalkiewicz: any idea how active mongodb is vs postgres?
[23:43] <joshd> damien: could you post the core file? there might be some more information about the nature of the floating point exception
[23:44] <mgalkiewicz> nhmlap: checking
[23:44] <joshd> damien: or just print $_SIGINFO
[23:45] <mgalkiewicz> probably most active client: Average: dev254-0 1.00 0.00 17.23 17.23 1.08 1046.53 733.27 73.33
[23:45] <damien> joshd: $1 = void
[23:45] <mgalkiewicz> postgres is around 3 times less active
[23:46] <mgalkiewicz> avg
[23:47] <nhmlap> mgalkiewicz: Ok. Does your mongodb working set mostly fit into memory? Any idea how often it's doing page faults?
[23:48] <joshd> damien: could you try running qemu with the problematic guest under gdb, and when it crashes print $_SIGINFO?
[23:49] <joshd> it sounds like it's not saved in the core file
[23:49] <damien> joshd: yeah, that's what I did
[23:49] <mgalkiewicz> nhmlap: https://gist.github.com/3710184
[23:49] <damien> joshd: core file is http://damoxc.net/core.gz
[23:49] <nhmlap> mgalkiewicz: oh, also, are you sharding?
[23:50] <joshd> damien: thanks, I'll try to dig around to see if the siginfo is hidden in there somewhere
[23:50] <mgalkiewicz> nhmlap: what do you mean?
[23:51] <nhmlap> mgalkiewicz: like scaling mongo over multiple servers
[23:52] <nhmlap> mgalkiewicz: anyway, it looks like even in 2.0+ the global lock will still result in writes occurring serially.
[23:52] <mgalkiewicz> there are 9 mongodb each one on a different VM with only one database
[23:52] <damien> joshd: thanks, going to head off for the night now, if you need any more information let me know what and I'll sort it out tomorrow morning
[23:52] <mgalkiewicz> nhmlap: why it is bad?
[23:53] <joshd> damien: ok, thanks. I'm afraid I'll run out of ideas soon though
[23:54] <nhmlap> mgalkiewicz: basically Mongodb wants to write out small chunks of data really fast and can only write the next chunk of data once the previous one completes. If any given little chunk of data takes longer it slows the whole thing down.
[23:55] <mgalkiewicz> it should not influence postgresql on a different vm
[23:56] <nhmlap> mgalkiewicz: that's true, but postgresql is much less active than mongodb?
[23:57] <mgalkiewicz> it depends on database, I have checked some clients and it looks like postgresql is 3 times less active (IOPS) and bytes written
[23:58] <mgalkiewicz> btw mongostat did not show any lock
[23:59] <nhmlap> mgalkiewicz: ah, that's true. I've never used mongostat.
[23:59] <nhmlap> mgalkiewicz: It seems like it should be locked all the time if the writes are so slow.
[23:59] <nhmlap> But if it's never hitting disk...

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.