#ceph IRC Log

IRC Log for 2012-04-19

Timestamps are in GMT/BST.

[0:02] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[0:03] * BManojlovic (~steki@212.200.243.246) Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:04] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[0:05] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[0:07] * lofejndif (~lsqavnbok@1RDAAA0ZM.tor-irc.dnsbl.oftc.net) has joined #ceph
[0:11] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[0:16] <todin> on an osd with btrfs a new btrfs snapshot should be created every 30 sec, or when the journal is full and gets wrapped; why is there a new snap_* dir on my osd every 5 sec?
[0:17] <gregaf> todin: unless you changed the defaults then every 5 seconds is correct
[0:19] <sjust> todin: it should be removing old ones as it goes
[0:20] <todin> gregaf: ok, I thought the time was 30 sec
[0:20] <todin> sjust: yes, the old ones are removed, therefore I have high load with the btrfs-cleaner process
[0:20] <sjust> ah
[0:21] <todin> therefore I thought that that interval could be increased
[0:22] <gregaf> todin: assuming your journal is large enough to absorb the writes for longer, you can bump up the snapshot interval using filestore_max_sync_interval (default 5)
[0:23] <todin> gregaf: my journal partition could absorb ca. 30 sec
[0:23] <gregaf> yeah; I'd bump it up to 20-30 seconds then :)
[0:23] <todin> gregaf: that's runtime configurable?
[0:23] <gregaf> yep, in the config or the command line, same as our standard config stuff
[0:24] <gregaf> (in the config it's "filestore max sync interval", of course)
[0:24] <gregaf> *config file
[0:25] <todin> ok, I will try it
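
The sizing argument above (the journal has to absorb every write that arrives between two syncs) can be sketched as a rough calculation. This is only an illustration: the journal size, write rate, and the `headroom` safety factor are hypothetical values, and placing the option in an `[osd]` section is an assumption; only the option name "filestore max sync interval" comes from the discussion.

```python
def max_sync_interval(journal_bytes, write_bytes_per_sec, headroom=0.5):
    """Return the longest sync interval, in seconds, the journal can absorb.

    headroom is the fraction of the journal allowed to fill between syncs.
    It is an illustrative safety margin, not a Ceph setting.
    """
    if write_bytes_per_sec <= 0:
        raise ValueError("write_bytes_per_sec must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return journal_bytes * headroom / write_bytes_per_sec


def conf_fragment(interval_sec):
    """Render a ceph.conf snippet; the [osd] section placement is an assumption."""
    return "[osd]\n    filestore max sync interval = %d\n" % interval_sec


if __name__ == "__main__":
    # Hypothetical numbers: 6 GiB journal, 100 MiB/s of sustained writes.
    interval = max_sync_interval(6 * 1024**3, 100 * 1024**2)
    print(conf_fragment(int(interval)))
```

With these made-up numbers the journal would fill halfway in roughly 30 seconds, which is in line with the 20-30 second range suggested above.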
[0:27] <todin> in the osd log with journal debug, is the latency value the journal write latency in ms?
[0:27] * Tv (~tv@md10536d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[0:27] <gregaf> I think so; sjust?
[0:28] <sjust> I think it's in seconds
[0:28] <sjust> one sec
[0:28] <sjust> the lines beginning "do_write latency "?
[0:29] <todin> huu, I do not have one right now here
[0:29] <sjust> looks like seconds to me
[0:30] <todin> so 0.05 will be 50ms?
[0:30] <sjust> yeah
[0:30] <todin> hmm that's quite high for an ssd
[0:31] <sjust> can you paste the line, that would indeed be high
[0:31] <sjust> ?
[0:31] <todin> give me a sec
[0:35] <todin> 2012-04-19 00:35:18.842173 7fdcbcda8700 10 journal queue_completions_thru seq 399186 queueing seq 399186 0x25c34a0 lat 0.060338
[0:37] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[0:44] <todin> the latency sometimes rises to lat 2.438129 and the whole cluster's performance drops sharply
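
Since the "lat" value is reported in seconds, a small script can pull it out of journal debug lines and convert it to milliseconds. This is a minimal sketch that assumes the line layout matches the one pasted above (the latency is the last field, after "lat"); the regular expression, function names, and the 1-second spike threshold are illustrative choices, not part of Ceph.

```python
import re

# Matches journal debug lines that end in "lat <seconds>", as in the pasted line.
LAT_RE = re.compile(r"\bjournal\b.*\blat (\d+(?:\.\d+)?)\s*$")


def journal_latency_seconds(line):
    """Return the latency in seconds, or None if the line has no lat field."""
    match = LAT_RE.search(line)
    return float(match.group(1)) if match else None


def find_spikes(lines, threshold_sec=1.0):
    """Return latencies (in ms) above threshold_sec; the threshold is arbitrary."""
    spikes = []
    for line in lines:
        lat = journal_latency_seconds(line)
        if lat is not None and lat > threshold_sec:
            spikes.append(lat * 1000.0)
    return spikes


if __name__ == "__main__":
    sample = ("2012-04-19 00:35:18.842173 7fdcbcda8700 10 journal "
              "queue_completions_thru seq 399186 queueing seq 399186 "
              "0x25c34a0 lat 0.060338")
    print("%.1f ms" % (journal_latency_seconds(sample) * 1000.0))
```

Run over a whole OSD log, `find_spikes` would make it easier to see how often and for how long the multi-second stalls occur.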
[0:46] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[0:49] * technicool (~manschutz@pool-96-226-55-169.dllstx.fios.verizon.net) Quit (Quit: technicool)
[1:03] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[1:24] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[1:24] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[1:31] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[1:49] * LarsFronius (~LarsFroni@31-18-137-57-dynip.superkabel.de) has joined #ceph
[1:56] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[1:58] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[1:59] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[1:59] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Connection reset by peer)
[1:59] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[2:00] * lofejndif (~lsqavnbok@1RDAAA0ZM.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[2:02] * adjohn (~adjohn@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[2:06] * LarsFronius (~LarsFroni@31-18-137-57-dynip.superkabel.de) Quit (Quit: LarsFronius)
[2:32] * yoshi (~yoshi@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:39] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[3:09] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[3:11] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[3:13] * loicd (~loic@204.16.154.194) has joined #ceph
[3:16] * adjohn (~adjohn@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: adjohn)
[3:27] * loicd (~loic@204.16.154.194) Quit (Quit: Leaving.)
[3:36] * joao (~JL@89-181-153-140.net.novis.pt) Quit (Ping timeout: 480 seconds)
[3:51] * chutzpah (~chutz@216.174.109.254) Quit (Quit: Leaving)
[3:53] * loicd (~loic@99-7-168-244.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[4:34] * technicool (~manschutz@pool-96-226-55-169.dllstx.fios.verizon.net) has joined #ceph
[4:35] * technicool (~manschutz@pool-96-226-55-169.dllstx.fios.verizon.net) Quit ()
[4:47] * dmick (~dmick@aon.hq.newdream.net) has left #ceph
[6:26] * tjikkun (~tjikkun@82-169-255-84.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[7:19] * cattelan is now known as cattelan_away
[8:48] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[8:50] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit ()
[8:52] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:12] * tjikkun (~tjikkun@82-169-255-84.ip.telfort.nl) has joined #ceph
[9:17] * f4m8_ is now known as f4m8
[9:32] * oliver1 (~oliver@p4FFFE564.dip.t-dialin.net) has joined #ceph
[9:39] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[9:49] * jks2 (jks@193.189.93.254) Quit (Read error: Connection reset by peer)
[10:26] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[11:07] * joao (~JL@89-181-153-140.net.novis.pt) has joined #ceph
[11:10] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[12:06] * yoshi (~yoshi@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:20] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:21] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[12:22] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:23] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[12:32] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:32] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[12:41] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:42] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[15:45] * f4m8 is now known as f4m8_
[15:46] <nhm> good morning #ceph
[15:48] <nhm> todin: regarding latency: how often do those spikes happen and how long do they last for?
[15:58] * lofejndif (~lsqavnbok@9KCAAETAD.tor-irc.dnsbl.oftc.net) has joined #ceph
[16:12] <todin> nhm: ca. every 30 sec for 3-5 sec
[16:13] <todin> nhm: but I think I found it, it seems to be a faulty ssd
[16:18] <nhm> todin: ah, ok. Glad you were able to figure it out!
[16:22] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[16:36] * technicool (~manschutz@pool-96-226-55-169.dllstx.fios.verizon.net) has joined #ceph
[16:42] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[16:43] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has left #ceph
[16:43] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[16:43] * cattelan_away is now known as cattelan
[17:15] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:24] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:29] * technicool (~manschutz@pool-96-226-55-169.dllstx.fios.verizon.net) Quit (Quit: technicool)
[17:44] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[18:05] * aliguori (~anthony@32.97.110.59) has joined #ceph
[18:10] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Read error: Connection reset by peer)
[18:11] <sagewk> elder: yay, david howells reposted xstat patches
[18:17] * MoXx (~Spooky@fb.rognant.fr) has joined #ceph
[18:18] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[18:27] * oliver1 (~oliver@p4FFFE564.dip.t-dialin.net) has left #ceph
[18:40] <elder> I didn't realize it was something you thought was important.
[18:40] <elder> It allows us to leverage things like aggregate space used by a directory stored in ceph maybe?
[18:41] <gregaf> I think the selective stats is what gets him excited
[18:42] <elder> Oh yeah, so we can return efficient results when the "easy" stuff is all that's requested.
[18:42] <gregaf> there's a much higher chance of satisfying things locally for many workloads
[18:42] <gregaf> and the MDS is the bottleneck in metadata-heavy workloads
[18:48] <sagewk> yeah, ceph's getattr really wants a mask telling it what fields it cares about
[18:48] <sagewk> so things like access() that need uid/gid/mode don't go wandering off to the mds to get a valid file size
[18:49] <sagewk> and maybe someday ls --color won't request a file size either
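
The idea of a getattr mask can be illustrated with a toy client-side cache: the caller says which fields it needs, and the client only contacts the metadata server for fields it does not already hold valid copies of. This is a hypothetical sketch, not Ceph's client code; the field bits, `CachedInode`, `getattr_masked`, and the `fetch_from_mds` callback are all invented for illustration.

```python
# Hypothetical field bits; a real interface would define its own mask values.
MODE, UID, GID, SIZE = 1, 2, 4, 8
ALL_FIELDS = (MODE, UID, GID, SIZE)


class CachedInode:
    """Client-side attribute cache with a bitmask of fields known to be valid."""

    def __init__(self):
        self.values = {}  # field bit -> cached value
        self.valid = 0    # bitmask of fields that may be served locally


def getattr_masked(inode, want, fetch_from_mds):
    """Return the requested fields, asking the MDS only for the missing ones.

    fetch_from_mds(mask) stands in for the remote call and must return a
    dict mapping each requested field bit to its value.
    """
    missing = want & ~inode.valid
    if missing:
        inode.values.update(fetch_from_mds(missing))
        inode.valid |= missing
    return {f: inode.values[f] for f in ALL_FIELDS if f & want}
```

In this model an access()-style check that asks for `MODE | UID | GID` is answered from the cache when those bits are valid, and only a request that includes `SIZE` triggers a round trip.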
[18:52] * bchrisman (~Adium@108.60.121.114) has joined #ceph
[18:57] <sagewk> force10 switch being replaced now.. estimating about an hour
[18:57] <elder> Good timing, meeting soon.
[19:08] <nhm> excellent
[19:08] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[19:14] * chutzpah (~chutz@216.174.109.254) has joined #ceph
[19:16] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) has joined #ceph
[19:17] * loicd (~loic@99-7-168-244.lightspeed.sntcca.sbcglobal.net) Quit (Quit: Leaving.)
[19:21] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) Quit ()
[19:25] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Quit: LarsFronius)
[19:31] * grape (~grape@216.24.166.226) has joined #ceph
[19:36] * lofejndif (~lsqavnbok@9KCAAETAD.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[19:40] * Tv (~tv@ma30536d0.tmodns.net) has joined #ceph
[19:50] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[19:50] <sagewk> 30 more minutes
[19:51] <dmick> until?...
[19:52] <joao> <sagewk> force10 switch being replaced now.. estimating about an hour
[19:52] <joao> I suppose this is the missing context :p
[19:52] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[19:52] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[19:52] <sagewk> they're done
[19:53] <nhm> nice
[19:54] <elder> Time flies.
[19:54] <elder> That was a very quick 30 minutes.
[19:55] <dmick> I'd...thought we were coordinating. Oh well. Hopefully it didn't screw up anyone's testing.
[19:59] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Operation timed out)
[20:00] <sagewk> oh, i mean 30 minutes until they're done
[20:02] <sagewk> dmick: they are coordinating, i announced in earlier
[20:02] <sagewk> s/in/it/
[20:04] <nhm> dmick: at least for me, I assumed the worst and have just avoided those machines entirely. ;)
[20:04] <nhm> dmick: so it's all good
[20:05] <dmick> sagewk: yes, I'd thought they'd be going through me. I'm fine with it as long as no one was affected, just not what I thought was happening this morning
[20:06] <dmick> and, apparently, we *do* have a link from the F10s to the 4948; somehow that wasn't communicated either, but that means I can make some more vlan progress
[20:07] <sagewk> ah. i think you weren't in yet when they started
[20:07] <dmick> yep
[20:20] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[20:20] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[20:21] * Tv (~tv@ma30536d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[20:28] <nhm> huh, HP is certifying Ubuntu 12.04 for proliant servers. Interesting.
[20:33] * tjikkun (~tjikkun@82-169-255-84.ip.telfort.nl) Quit (Read error: Connection reset by peer)
[20:36] * tjikkun (~tjikkun@82-169-255-84.ip.telfort.nl) has joined #ceph
[20:42] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[20:47] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[20:54] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[20:55] <dmick> network looking good; switch transition seems finished
[21:12] * adjohn (~adjohn@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[21:22] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[21:44] * lofejndif (~lsqavnbok@1RDAAA11L.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:45] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[22:00] * adjohn (~adjohn@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: adjohn)
[22:14] <nhm> dmick: good deal
[22:15] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[22:16] <dmick> plana03 and 58 seem dead; I assume the owners know about this
[22:21] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[22:22] * adjohn (~adjohn@mfc0536d0.tmodns.net) has joined #ceph
[22:27] * adjohn (~adjohn@mfc0536d0.tmodns.net) Quit ()
[22:28] * BManojlovic (~steki@212.200.243.246) has joined #ceph
[22:31] * adjohn (~adjohn@mfc0536d0.tmodns.net) has joined #ceph
[22:31] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[22:32] * adjohn (~adjohn@mfc0536d0.tmodns.net) Quit ()
[22:33] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[22:33] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Connection reset by peer)
[22:33] * lofejndif (~lsqavnbok@1RDAAA11L.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[22:40] * adjohn (~adjohn@mfc0536d0.tmodns.net) has joined #ceph
[22:40] * adjohn (~adjohn@mfc0536d0.tmodns.net) Quit ()
[22:41] * lofejndif (~lsqavnbok@83TAAE3ME.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:16] * imjustmatthew (~imjustmat@pool-74-110-201-39.rcmdva.fios.verizon.net) has joined #ceph
[23:18] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) has joined #ceph
[23:20] <joao> sagewk, sjust, three different mismatches for the same failure point, depending on journaling being enabled or disabled on each one of the stores
[23:20] <sagewk> cool.
[23:20] <sagewk> do you have logs?
[23:21] <joao> I'll reproduce it again and keep them
[23:21] <sagewk> k
[23:21] <joao> this is the greatest thing about this test...
[23:21] <joao> I can say "I'll reproduce it"
[23:25] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Read error: Connection reset by peer)
[23:25] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) has joined #ceph
[23:29] * alo (~alo@host90-43-static.242-95-b.business.telecomitalia.it) has joined #ceph
[23:29] <alo> hi
[23:30] <alo> could someone help me? I have to add an osd node to a cluster...
[23:30] <gregaf> alo: have you found the instructions?
[23:31] <alo> not yet :(
[23:31] <gregaf> there's a little bit here: http://ceph.newdream.net/docs/master/ops/manage/grow/osd/#adding-a-new-osd-to-the-cluster
[23:31] * aliguori (~anthony@32.97.110.59) Quit (Remote host closed the connection)
[23:31] <alo> thanks!
[23:32] <alo> i'll look
[23:32] <gregaf> and while it's out of date, this is a little more informative on stuff like changing the crush map: http://ceph.newdream.net/wiki/OSD_cluster_expansion/contraction
[23:33] <alo> ok
[23:33] <gregaf> just try and keep track of what you've done so if (when) you have any questions we can tell what state you're in ;)
[23:33] <joao> sagewk, metropolis:~joao/logs/journal-tests/
[23:33] <alo> great!
[23:33] <alo> really thanks
[23:34] <joao> sagewk, read the DESC file
[23:34] <joao> in case you want to reproduce
[23:34] <alo> I wanna test a conf with more nodes
[23:34] <gregaf> alo: welcome :)
[23:34] * lofejndif (~lsqavnbok@83TAAE3ME.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[23:35] <alo> I've run some tests with 2 osd, 2 mds and 3 mon... really amazing performance for a distributed filesystem.
[23:35] <gregaf> feel free to ask whatever, whenever; if nobody's around you might have to idle but you'll get an answer eventually
[23:35] <gregaf> thanks!
[23:36] * lofejndif (~lsqavnbok@09GAAE4D7.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:37] <nhm> alo: what kind of performance are you seeing?
[23:42] <sagewk> joao: broken test... it's cloning from/to the same object
[23:42] <sagewk> let's add an assert in the filestore layer to catch that, too.
[23:42] <sagewk> oh wait, nevermind..
[23:42] <sagewk> 2!=7
[23:50] <joao> two of the mismatches happen on 0.3_head/obj7, they're just "inverted" depending on which store has the journaling sync'ing; the other mismatch is on 0.1_head/obj2
[23:51] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[23:51] <alo> nhm: about 60/70 MB/s for a copy of large files, 15/20 Mb/s copying a large number of small files (old hardware, low ram, sata disks)
[23:52] <joao> sagewk, I'll be afk for about an hour or two, but will be available on gtalk (as soon as I park the car at least)
[23:52] <joao> brb
[23:52] <sagewk> k
[23:52] <dmick> sagewk: can I quote you on 2!=7?
[23:52] <joao> lol
[23:52] <sagewk> joao thanks!
[23:53] <sagewk> dmick: they're close..
[23:53] <sagewk> small type and all that
[23:53] <dmick> heh
[23:53] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[23:54] <nhm> alo: that looks like about what we see internally with 2 OSDs on 7200rpm drives. Working on getting those up a bit.
[23:55] <joao> oh, btw
[23:55] <joao> before I go
[23:55] <joao> sagewk, I pushed onto wip-journal-no-sync
[23:56] <sagewk> and that's the code you ran to generate this?
[23:56] <joao> it has some fixes to the idempotent test that should also be on wip-2226-minorfixes
[23:56] <joao> yes
[23:56] <sagewk> k thanks
[23:56] <joao> add --filestore-journal-sync-enable 0 to disable the journal sync
[23:57] <joao> it's obviously enabled by default, and (mental note) I should change that to '-disable 1' when I get back
[23:57] <sagewk> :)
[23:58] <joao> later

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.