#ceph IRC Log


IRC Log for 2011-10-19

Timestamps are in GMT/BST.

[0:03] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[0:15] * Nightdog (~karl@190.84-48-62.nextgentel.com) Quit (Remote host closed the connection)
[0:15] <sandeen_> ok, after a bit of probably-embarrassing hackityhack I have it making 2 attrs, one large one small.
[0:15] <sandeen_> and no corruption
[0:18] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (Ping timeout: 480 seconds)
[0:34] <sandeen_> in a loop... I can't break it. grr. well, call it a day.
[0:34] * jmlowe (~Adium@mobile-166-137-140-087.mycingular.net) has joined #ceph
[0:34] * jmlowe (~Adium@mobile-166-137-140-087.mycingular.net) has left #ceph
[0:39] <sandeen_> sjust, how often did you run into this corruption?
[0:39] <sjust> are you using ext4?
[0:41] <sagelap2> sandeen_: you might try flipping between large and small attrs.. iirc that was what the previous bug was related to
[0:41] <sjust> I actually ran into a similar problem on btrfs, not ext4
[0:42] <sandeen_> sjust, yes
[0:42] <sandeen_> sagelap2, I'm doing 8 byte write, big attr, small attr, which is what the other case seems to do ....
[0:42] <sandeen_> sjust, oh.
[0:43] <sandeen_> i'm letting it run in a loop to see if I can tickle something.
[0:44] <sandeen_> sagelap2, the files with corruption had both sized attrs on them, and they were written first big, then small
[0:46] <yehudasa> sandeen_: that reminds me of a fiemap issue we saw a few months back
[0:47] <sandeen_> well, this one corrupts metadata ....
[0:47] <sandeen_> wrong block count.
[0:48] <yehudasa> yeah, actually thinking about it that issue didn't involve xattrs at all
[0:53] <sandeen_> :)
[0:56] <yehudasa> .. and I can't reproduce the problem anymore any way
[1:17] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:30] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:48] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:55] * gregaf1 (~Adium@aon.hq.newdream.net) has joined #ceph
[2:55] * gregaf (~Adium@aon.hq.newdream.net) Quit (Read error: Connection reset by peer)
[2:55] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit (Quit: cp)
[3:04] * jojy (~jojyvargh@ Quit (Quit: jojy)
[3:14] * bencherian_ (~bencheria@aon.hq.newdream.net) has joined #ceph
[3:19] * bencherian_ (~bencheria@aon.hq.newdream.net) Quit (Read error: Connection reset by peer)
[3:19] * bencherian_ (~bencheria@aon.hq.newdream.net) has joined #ceph
[3:25] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[4:10] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[4:21] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit (Quit: jojy)
[4:38] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[4:45] * sagelap2 (~sage@soenat3.cse.ucsc.edu) Quit (Ping timeout: 480 seconds)
[5:47] * votz (~votz@pool-108-52-121-23.phlapa.fios.verizon.net) Quit (Remote host closed the connection)
[6:59] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[7:30] * sandeen_ (~sandeen@sandeen.net) Quit (Quit: This computer has gone to sleep)
[7:56] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[7:59] * bencherian_ (~bencheria@aon.hq.newdream.net) Quit (Quit: bencherian_)
[8:27] * bencherian (~bencheria@cpe-76-173-232-163.socal.res.rr.com) has joined #ceph
[8:46] * fronlius (~Adium@f054104165.adsl.alicedsl.de) has joined #ceph
[8:59] * pmjdebruijn (~pascal@overlord.pcode.nl) has joined #ceph
[9:00] <pmjdebruijn> hi again
[9:00] <pmjdebruijn> https://github.com/NewDreamNetwork/ceph/blob/master/debian/changelog
[9:00] <pmjdebruijn> it seems the debian packages haven't been updated
[9:00] <pmjdebruijn> or am I looking on the wrong git repo again?
[9:02] <NaioN> pmjdebruijn: http://ceph.newdream.net/debian/pool/main/c/ceph/
[9:02] <NaioN> pmjdebruijn: http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/#installing-the-packages
[9:49] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[10:47] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[10:50] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[11:31] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[11:34] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit ()
[11:40] * alex460 (~alex@per92-2-212-194-143-97.dsl.sta.abo.bbox.fr) has joined #ceph
[11:46] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[12:23] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:23] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[12:37] * hijacker (~hijacker@ Quit (Quit: Leaving)
[14:40] * bencherian (~bencheria@cpe-76-173-232-163.socal.res.rr.com) Quit (Quit: bencherian)
[15:29] * hijacker (~hijacker@ has joined #ceph
[15:55] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:36] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[16:37] <jmlowe> so am I out of luck here or is there something to be done about this error "log 2011-10-18 22:12:32.277818 osd.2 14 : [ERR] 1.2c log bound mismatch, info (226'114,226'115]+backlog actual [89'6,226'114]"
[16:46] * iribaar (~iribaar@ Quit (Quit: Leaving)
[16:51] * gregorg_taf (~Greg@ Quit (Ping timeout: 480 seconds)
[16:52] * gregorg (~Greg@ has joined #ceph
[16:56] * wido (~wido@rockbox.widodh.nl) Quit (Read error: Connection reset by peer)
[17:00] * wido (~wido@rockbox.widodh.nl) has joined #ceph
[17:23] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[17:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:55] * bencherian_ (~bencheria@mobile-166-205-140-202.mycingular.net) has joined #ceph
[17:59] <SpamapS> How critical is the order of ops at this page? http://ceph.newdream.net/wiki/Monitor_cluster_expansion
[18:00] <sagewk> pmjdebruijn: that's just because the stable branch hasn't been merged into master yet
[18:00] <SpamapS> if I copy the data, and start mon on beta, and *then* add beta to the ceph.conf on all nodes, is that going to cause major problems?
[18:00] <SpamapS> (so basically swapping steps 3 and 4)
[18:00] <sagewk> (1,2,3 any order) then 4
[18:01] <sagewk> hmm
[18:01] <sagewk> 3 needs to happen before 4 on the beta node for the init script to start that daemon. the conf update can happen at any other time on other nodes
[18:02] <Tv> probably 1 before 2, too?
[18:02] <Tv> i so wish to make that rsync unnecessary, there..
[18:04] <SpamapS> I can do 3 on beta itself, thats easy
[18:04] <SpamapS> and I can coordinate it that way on all other nodes too..
[18:04] <SpamapS> but.. more code.. more problems.. ;)
[18:05] <Tv> hence me wanting to get rid of the rsync, etc
[18:05] <SpamapS> seems pretty doable
[18:05] <SpamapS> the first mon should be willing to share with new mon's, right?
[18:05] <Tv> also, i have a plan to get rid of the [osd.X] [mon.X] etc sections ;)
[18:06] <Tv> SpamapS: yeah, the real challenge is what if you bootstrap a cluster from scratch and there's no "first"
[18:06] <SpamapS> Currently with the juju charm I'm writing I'm starting with a single node cluster, mkcephfs'ing on that, but then treating every added node as a hot-add.
[18:07] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[18:07] <Tv> SpamapS: yeah, i feel the need for something more automatic than that
[18:07] <Tv> can't rely on humans to time it right
[18:07] <SpamapS> juju is timing it right. ;)
[18:07] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[18:07] <Tv> SpamapS: so you treat mon.alpha as the special first one?
[18:07] <SpamapS> but it would be *so* nice if I didn't have to write code to do that.
[18:08] <SpamapS> Tv: right, the first node up is the special "leader" and does the mkcephfs.
[18:08] <SpamapS> Tv: and becomes the one that does all the 'ceph mon add' ops
[18:08] <SpamapS> if it goes away, the next one in the chain does so
[18:08] <Tv> SpamapS: and then mon.alpha goes down and your cluster is broken?
[18:09] <Tv> ah
[18:09] <SpamapS> I do believe there may be a period where things require manual fixing if your first one goes away and you haven't actually removed it from juju
[18:09] <SpamapS> because all the other ones will be waiting for the leader to act
[18:10] <SpamapS> but that would only be relevant if needing to add nodes while the leader is in a half-down state
[18:10] <SpamapS> so just remove it before adding and no harm done
[18:10] <SpamapS> (this is all in theory, I'm only barely at the point where the ceph mon add works ;)
[18:10] <Tv> SpamapS: you might want to look at the osd autodeploy stuff we have on the chef side
[18:11] <Tv> SpamapS: https://github.com/NewDreamNetwork/ceph-cookbooks/tree/wip-simple (for now, merging to master very soon)
[18:11] <Tv> bootstrap_osd recipe
[18:11] <Tv> pushed all the mon actions to be done from the osd host itself
[18:12] <SpamapS> Tv: I'm actually planning on comparing side by side what you have there once I'm done cranking it out in pure shell, so I can make an argument for/against writing charms in chef. :)
[18:13] <Tv> SpamapS: yeah, all of the "smarts" should be split out of chef/crowbar/juju when possible
[18:13] <SpamapS> Tv: if I get stuck tho, I will peek.
[18:13] <Tv> separate glue from things glued
[18:13] * bencherian_ (~bencheria@mobile-166-205-140-202.mycingular.net) Quit (Quit: bencherian_)
[18:16] <sagewk> ajm: ping
[18:17] <ajm> sagewk: hi
[18:19] <sagewk> ajm: pushed new wip-unfound-backport branch based on 0.34
[18:19] <ajm> yay
[18:19] <ajm> let me get it compiled
[18:20] <Tv> SpamapS: i've been hitting my head on upstart a lot lately.. is there an easy way to say "start if not already running", without getting errors?
[18:20] <Tv> like start-stop-daemon --oknodo
[18:21] <SpamapS> Tv: start foo || :
[18:22] <Tv> SpamapS: that still spams stderr :(
[18:22] <SpamapS> Tv: start foo 2>/dev/null || :
[18:22] <SpamapS> you knew that was coming
[18:22] <Tv> i don't like your hammer
[18:24] <SpamapS> It strives pretty hard to only do one thing.. have oft thought that a wrapper that makes it more script-friendly would be useful.
[18:24] <Tv> i don't like the fact that the above assumes things like i didn't typo foo
[18:24] <Tv> i want confirmation that the job is known to upstart etc
[18:25] <Tv> thinking of "if status | something; then start ...; fi"
[18:25] <SpamapS> status foo would help w/ that
[18:25] <Tv> but the format is not nice to parse :
[18:25] <Tv> (
[18:25] <SpamapS> its very nice
[18:25] <SpamapS> before the comma is very predictable
[18:26] <SpamapS> jobname goal/status
[18:26] <Tv> "|grep /running," ???
[18:26] <Tv> that's kinda ugly
[18:26] <sagewk> ajm: hold off on actually deploying that new code until I do some final testing :)
[18:27] <Tv> actually i care about goal not status; if it's already "told to be up", that's all i was going to do anyway
[18:27] <SpamapS> yeah
[18:27] <SpamapS> but start --oknodo would be a nice feature
[18:27] <SpamapS> Just to assert that its goal is start
[18:28] <Tv> SpamapS: honestly i think those commands should be idempotent
[18:28] <Tv> SpamapS: "set goal to this"
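The "jobname goal/status" layout described above can be parsed without grepping for "/running,". The following is a minimal sketch, not anything from the channel: the function name upstart_goal is hypothetical, and the sample lines assume the usual upstart output shape of "job goal/state, process PID". It reads only the goal, which is the part Tv says he cares about.

```python
def upstart_goal(status_line: str) -> str:
    """Return the goal ("start" or "stop") from one line of `status JOB` output.

    Assumes the "jobname goal/state[, process PID]" layout discussed above.
    """
    # Everything before the first comma is "jobname goal/state".
    head = status_line.split(",", 1)[0].strip()
    # The last whitespace-separated token is "goal/state".
    goal_state = head.rsplit(None, 1)[-1]
    goal, _, _state = goal_state.partition("/")
    if goal not in ("start", "stop"):
        raise ValueError(f"unrecognised status line: {status_line!r}")
    return goal


if __name__ == "__main__":
    # Hypothetical sample lines; a real caller would capture `status foo`.
    for line in ("foo start/running, process 1234", "foo stop/waiting"):
        print(line, "->", upstart_goal(line))
```

A caller could then run "start foo" only when the goal is "stop", and treat an unparseable or missing job as a real error instead of silencing stderr.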
[18:30] * SpamapS digs through the bug list to see if its already been proposed
[18:30] <ajm> sagewk: ok, lmk
[18:33] <SpamapS> Tv: https://bugs.launchpad.net/upstart/+bug/878322 feel free to subscribe/expand/submit a patch. ;)
[18:34] <Tv> SpamapS: for reference, http://tickets.opscode.com/browse/CHEF-1424
[18:36] <SpamapS> restart is a really silly command in upstart
[18:36] <SpamapS> stop/start is much safer
[18:37] <SpamapS> restart doesn't re-load the job definition.. and doesn't work at all if there is a pre-stop
[18:37] <Tv> yeah i sort of agree there, restart has often acted weird for me on sysvrc
[18:37] <SpamapS> Hah, as Dan states.. in the first comment. :)
[18:47] <Tv> SpamapS: https://github.com/NewDreamNetwork/ceph-cookbooks/commit/a3461156b80636588f175dba8e4a81c938ebc1ac
[18:48] <pmjdebruijn> sagewk: can I access the stable branch?
[18:49] <sagewk> pmjdebruijn: git checkout -b stable origin/stable
[18:49] <SpamapS> Tv: looks like a winner to me. :)
[18:49] <df__> sagewk, btw, have you ever had issues with ceph kernel client interacting badly with apparmor?
[18:50] <sagewk> df__: never used it. there's no acl support currently...
[18:50] <sagewk> df__; tho noah is pretty close to having something working
[18:50] <df__> no, likewise, we don't use it and aren't running acls with ceph
[18:51] <df__> but i had too kernel bugs last night that killed processes accessing stuff from a ceph mountpoint
[18:51] <pmjdebruijn> sagewk: thanks
[18:51] <df__> gah, s/too/two/
[18:52] <sagewk> df__: are you running master or for-linus?
[18:53] <df__> master
[18:53] * bchrisman (~Adium@ has joined #ceph
[18:53] <sagewk> df__; if you have the time to try for-linus, and/or bisect (even partially) that would be extremely helpful :)
[18:54] <df__> will do if i can reproduce it
[18:54] <df__> out of a several hundred step job over multiple machines, it only affected a single job
[18:54] <sagewk> ajm: ok pushed updated wip-unfound-backport. had to backport 2 other fixes from the other day
[18:55] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:55] <sagewk> df__: suppose it could be worse... :/ i really hope its something new in master tho and doesn't affect for-linus (which is what 3.1 is getting)
[18:56] <df__> btw, woo to the kernel resolver, i hope that has gone in 3.1 (can't remember where the for-linus branched)
[18:57] <sagewk> it'll be 3.2
[18:57] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit (Quit: jojy)
[18:57] <sagewk> df__; mount.ceph does it for you though, it really shouldn't matter in practice?
[18:58] <df__> hmm, wonder why that wasn't doing it ... will have to check the setup here
[19:05] * gregaf1 (~Adium@aon.hq.newdream.net) Quit (Quit: Leaving.)
[19:10] * sandeen_ (~sandeen@sandeen.net) has joined #ceph
[19:10] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[19:11] * fronlius (~Adium@f054104165.adsl.alicedsl.de) Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * mrjack (mrjack@office.smart-weblications.net) Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * alexxy (~alexxy@ Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * RupS (~rups@panoramix.m0z.net) Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * NaioN (~stefan@andor.naion.nl) Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * Ormod (~valtha@ohmu.fi) Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * df__ (davidf@dog.thdo.woaf.net) Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * peritus (~andreas@h-150-131.a163.priv.bahnhof.se) Quit (charon.oftc.net solenoid.oftc.net)
[19:11] * jantje_ (~jan@paranoid.nl) Quit (charon.oftc.net solenoid.oftc.net)
[19:12] * fronlius (~Adium@f054104165.adsl.alicedsl.de) has joined #ceph
[19:12] * mrjack (mrjack@office.smart-weblications.net) has joined #ceph
[19:12] * alexxy (~alexxy@ has joined #ceph
[19:12] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[19:12] * RupS (~rups@panoramix.m0z.net) has joined #ceph
[19:12] * NaioN (~stefan@andor.naion.nl) has joined #ceph
[19:12] * Ormod (~valtha@ohmu.fi) has joined #ceph
[19:12] * jantje_ (~jan@paranoid.nl) has joined #ceph
[19:12] * peritus (~andreas@h-150-131.a163.priv.bahnhof.se) has joined #ceph
[19:12] * df__ (davidf@dog.thdo.woaf.net) has joined #ceph
[19:13] <joshd> jmlowe: if you can reproduce with osd debugging on, that'd be great - it's http://tracker.newdream.net/issues/1526, but we haven't seen it happen for a few weeks
[19:21] <ajm> sagewk: up on the new osds, i only have 160 degraded now but its increasing slowly (was 192 before)
[19:22] <NaioN> I see a lot of these messages in the osd logs, does anybody know what they mean?
[19:22] <NaioN> 2011-10-19 19:21:19.707477 7fac5a6b4700 journal throttle: waited for bytes
[19:22] <NaioN> 2011-10-19 19:21:19.780174 7fac5aeb5700 journal throttle: waited for ops
[19:24] <NaioN> I have a setup with two osds (both have these messages in the log)
[19:29] <sandeen_> Inode 1051537, i_blocks is 24, should be 16. Fix? yes
[19:29] <sandeen_> \o/
[19:29] * sandeen_ reproduces
[19:30] * alex460 (~alex@per92-2-212-194-143-97.dsl.sta.abo.bbox.fr) Quit (Quit: alex460)
[19:32] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[19:37] <NaioN> I've filestore btrfs snap = 0, because else I also get the problem with blocking osds
[19:38] <sjust> NaioN: are the OSDs processing write operations?
[19:39] <NaioN> yes
[19:39] <sjust> NaioN: journal throttle indicates that the journal is full and is being flushed before that op can go through
[19:39] <NaioN> it's an initial rsync
[19:39] <NaioN> well multiple rsyncs (about 20)
[19:40] <NaioN> the journal is a ssd disk of 160GB...
[19:40] <Tv> NaioN: did you set "osd journal size" too?
[19:40] <NaioN> no
[19:41] <NaioN> 2011-10-19 16:28:22.161999 7f302b1c0720 journal _open /dev/sda fd 14: 160041885696 bytes, block size 4096 bytes, directio = 1
[19:41] <NaioN> from the osd log
[19:41] <Tv> yeah it should do the right thing then
[19:41] <NaioN> so it detects the size
[19:41] <sjust> yeah, I think it autodetects in that case
[19:42] <NaioN> and it didn't reach the 160G of size at the moment
[19:42] <ajm> sagewk: i'm back up to 192 degraded
[19:42] <Tv> NaioN: what do you mean by that last line?
[19:42] <NaioN> 2011-10-19 19:37:33.087813 7fac5aeb5700 -- send_message dropped message osd_op_reply(1283902 10000151c9d.00000000 [write 0~263220 [1@-1]] ondisk = 0) v1 because of no pipe on con 0x1b01640
[19:43] <NaioN> Tv: which last line?
[19:43] <Tv> NaioN: and it didn't reach the 160G of size at the moment
[19:43] <Tv> what didn't reach 160G, observed how
[19:43] <NaioN> i'm seeing these messages also at the moment
[19:43] <NaioN> oh the rsync
[19:43] <NaioN> it's about 140G
[19:43] <Tv> NaioN: i'll easily believe a 14% overhead
[19:43] <sjust> across all 20?
[19:44] <NaioN> so the whole rsync workload fitted into the journal
[19:44] <gregaf> that many bytes is actually 149GB…
[19:44] <NaioN> yes across all 20
[19:44] <gregaf> stupid base-10 disk manufacturers ;)
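The 160 versus 149 figures are the same byte count in base-10 and base-2 units. A quick check, using the journal size from the osd log line quoted above (the variable names are only for illustration):

```python
JOURNAL_BYTES = 160041885696  # size reported by "journal _open /dev/sda"

gb = JOURNAL_BYTES / 10**9   # base-10 gigabytes, as printed on the drive label
gib = JOURNAL_BYTES / 2**30  # base-2 gibibytes, as most tools report

print(f"{gb:.2f} GB (base 10), {gib:.2f} GiB (base 2)")
# prints: 160.04 GB (base 10), 149.05 GiB (base 2)
```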
[19:44] <NaioN> :)
[19:44] <NaioN> it's an initial rsync that's still running
[19:44] <Tv> so you filled your journal, and now are waiting for it to be flushed to slower disk?
[19:44] <NaioN> in total it's i think about a T
[19:44] <gregaf> Tv: yeah, that's basically what that message means
[19:44] <Tv> that sounds like what's just going to happen if you have a fast journal (ssd) and a slower actual storage (hdd)
[19:45] <NaioN> well I see activity on the disks too
[19:45] <Tv> now, if it gets *intolerably* slow, then that's an issue
[19:45] <Tv> or, if you pause the rsync, it doesn't free up journal space by itself
[19:45] <NaioN> well the disk is an mdraid of 12 disks
[19:45] <NaioN> in raid6
[19:45] <NaioN> and it can read and write with more than 100M/s
[19:45] <Tv> sagewk: btw one good osd perf counter would be free journal space.. is that in there already?
[19:45] <NaioN> so that shouldn't be a problem
[19:46] <Tv> NaioN: but it'll still be slower than the ssd journal, so if you keep pushing until you fill the journal, it'll get backlogged
[19:46] <Tv> NaioN: the journal will make bursts of writes faster, but can't take a sustained load any faster than the actual disks can
[19:47] <NaioN> well the bottleneck is the 1g network interface
[19:48] <NaioN> so it never gets more than 100M/s
[19:48] <NaioN> but I see bursts of writes on the ssd
[19:48] <NaioN> and continuous writes on the md0 (mdraid)
[19:49] <gregaf> NaioN: you have 2 OSDs and 1 client?
[19:49] <NaioN> yes
[19:49] <NaioN> its server -> rsync -> server -> cephfs -> cluster
[19:50] <NaioN> so the intermediate server mounts the cephfs and exposes a rsyncd
[19:50] <NaioN> and the first server spawns 20 rsync clients that connect to the rsyncd
[19:50] <NaioN> and writes to directories on the cephfs of the intermediate server
[19:51] <NaioN> Tv: it looks like the workload is sent to the ssd and the mdraid device...
[19:52] <NaioN> because i see no reads on the ssd
[19:52] <Tv> NaioN: journal is only ever read on recovery
[19:52] <NaioN> ok
[19:52] <NaioN> that explains
[19:52] <NaioN> so the commits come faster
[19:53] <gregaf> in general blocking on the journal just isn't a problem, though I am curious about why that would be happening in this scenario — 160GB is a lot and it ought to be network bound...
[19:53] <Tv> actually, yeah ignore my earlier comment about journal making bursts of writes faster; ceph writes to both journal and backing store at once, it doesn't have separate journal->backing store workers
[19:53] <Tv> gregaf: yeah it smells odd
[19:53] <NaioN> gregaf: yes that was my thought also
[19:54] <NaioN> and some clients (rsync) got disconnected with error 12's
[19:54] <gregaf> Tv: it only has to wait on one of them to hit disk though; the journal definitely makes bursts faster :)
[19:54] <NaioN> rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Connection reset by peer (104)
[19:54] <Tv> gregaf: only until you fill up the write cache of the disks, that's not that much
[19:54] <NaioN> rsync: read error: Connection reset by peer (104)
[19:54] <Tv> gregaf: like, <<160GB
[19:54] <NaioN> so there are 18 left
[19:55] <Tv> gregaf: oh i mean disk+OS.. but the same thing, <<160GB
[19:55] <gregaf> Tv: it can stay in page cache for the main store as long as it's hit disk in the journal
[19:55] <gregaf> oh, yeah, <<160GB
[19:55] <gregaf> which is what makes it odd
[19:56] <gregaf> NaioN: well that's an rsync error, not a ceph error?
[19:57] <NaioN> yes thats the rsync error
[19:57] <NaioN> but i suspect the error occurs if the IO stalls for too long for that rsync client
[19:58] <gregaf> *shrug* I dunno rsync that well, but I doubt that's it; probably some kind of network congestion due to running 20 clients over a GigE
[19:59] * bencherian (~bencheria@aon.hq.newdream.net) has joined #ceph
[19:59] <gregaf> (it's just one client to Ceph and it's just not capable of preferring one over the others given the workload you're doing)
[19:59] <NaioN> gregaf: hmmmm welll haven't thought of that
[19:59] <df__> gregaf, you feeling better today?
[19:59] <NaioN> but we use this as our primary backup method
[19:59] <gregaf> heh, yeah, thanks
[19:59] <Tv> rsyncd has a configurable IO timeout
[19:59] <Tv> it's highly likely that's what triggered here
[20:00] <gregaf> wait, NaioN, you turned off btrfs snaps?
[20:00] <NaioN> yes
[20:00] <Tv> frankly, my experience says 20 parallel rsyncs is a bad idea no matter what your storage system is
[20:00] <NaioN> Tv: :)
[20:00] <NaioN> normally we do a lot more
[20:00] <gregaf> does anybody remember the specifics of how it does the syncing in that case?
[20:00] * Iribaar (~Iribaar@ has joined #ceph
[20:01] <NaioN> but with different storage
[20:01] <NaioN> with the snaps on I get btrfs errors...
[20:01] * bencherian (~bencheria@aon.hq.newdream.net) Quit ()
[20:06] * bencherian (~bencheria@aon.hq.newdream.net) has joined #ceph
[20:07] <sjust> NaioN: can you estimate the frequency of those messages?
[20:08] <NaioN> I also see a lot of these messages: 2011-10-19 19:37:33.087813 7fac5aeb5700 -- send_message dropped message osd_op_reply(1283902 10000151c9d.00000000 [write 0~263220 [1@-1]] ondisk = 0) v1 because of no pipe on con 0x1b01640
[20:09] <NaioN> sjust: they come in bursts of about 10 to 20 with about 5 to 10 minutes between
[20:10] <NaioN> well sometimes less minutes between
[20:11] <NaioN> they came after most of the rsync clients started, so no messages at the start
[20:12] <jmlowe> Any hints on what this is or what to do about it "log 2011-10-19 14:11:33.855237 osd.1 2 : [ERR] 1.8f log bound mismatch, empty but (226'137,226'138]"
[20:12] <NaioN> at the moment everything is still running (except for 2 clients) and I still get the errors of the journal
[20:14] <joshd> jmlowe: if you can reproduce with osd debugging on, that'd be great - it's http://tracker.newdream.net/issues/1526, but we haven't seen it happen for a few weeks
[20:15] <sjust> NaioN: In general, that isn't an error, just a notification
[20:15] <NaioN> sjust: ah ok
[20:16] <sjust> NaioN: it may indicate that you are hitting a bottleneck though
[20:16] <sjust> NaioN: with btrfs snapshot off, I think we do a btrfs sync to flush the journal
[20:17] <NaioN> could that take a while?
[20:17] <gregaf> sjust: NaioN: could a sync take long enough to hit that limit?
[20:17] * bencherian (~bencheria@aon.hq.newdream.net) Quit (Quit: bencherian)
[20:17] <sjust> NaioN: generally, it shouldn't, and the journal should absorb the delay anyway
[20:18] <NaioN> and the btrfs transactions?
[20:18] <NaioN> are they needed and could they have an effect?
[20:19] <sjust> those should only affect recovery, I think
[20:24] * bencherian (~bencheria@aon.hq.newdream.net) has joined #ceph
[20:28] <NaioN> sjust: hmmmm I'm just realizing something
[20:28] <NaioN> both osds have 1 gigabit connection
[20:28] <NaioN> but because of replication in effect they only have half of it?
[20:38] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[20:41] <joshd> NaioN: that's true if they use the same nic to talk to the cluster and the client (set by 'cluster addr' and 'public addr' in ceph.conf)
[20:41] <NaioN> joshd: ok so it's better to split them
[20:41] <NaioN> sjust: I just found the error on the rsyncd side:
[20:41] <NaioN> 2011/10/19 15:34:47 [6539] rsync error: timeout in data send/receive (code 30) at io.c(137) [receiver=3.0.7]
[20:42] <NaioN> 2011/10/19 15:34:47 [6539] rsync: connection unexpectedly closed (359 bytes received so far) [generator]
[20:42] <NaioN> 2011/10/19 15:34:47 [6539] rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]
[20:42] <NaioN> so it looks like the error is between the rsync client and server
[20:42] * bencherian (~bencheria@aon.hq.newdream.net) Quit (Quit: bencherian)
[20:44] * bencherian (~bencheria@aon.hq.newdream.net) has joined #ceph
[20:45] <NaioN> joshd: I can use those options for the individual osds/mons/mdss?
[20:47] <joshd> yeah, although it's only relevant for osds
[20:47] <joshd> the rest ignore public_addr
[20:49] <NaioN> ok
[20:49] <joshd> err, other way around
[20:49] <joshd> cluster_addr is only used for osds
[20:51] <NaioN> so you could build a separate network for the osds to communicate and push data to be replicated
[20:51] <joshd> exactly
[21:10] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[21:17] <gregaf> NaioN: in your particular case with only one client that can only push a 1Gbit connection the replication traffic isn't a problem
[21:37] <NaioN> gregaf: yes i noticed
[21:37] <NaioN> because the initial write goes 50-50 to both osds
[21:37] <NaioN> so the replication write goes also 50-50 to both osds
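The 50-50 reasoning can be written out as arithmetic. This is a rough sketch under assumptions that are not stated in the log: two replicas per object, client writes spread evenly over the OSDs, and full-duplex NICs. The function name per_osd_traffic_gbit is hypothetical.

```python
def per_osd_traffic_gbit(client_gbit: float, num_osds: int, replicas: int) -> dict:
    """Estimate per-OSD network load when one NIC carries client and replication traffic."""
    # Client writes are assumed to be spread evenly over the primaries.
    from_client = client_gbit / num_osds
    # Each primary forwards (replicas - 1) copies of what it received.
    replication_out = from_client * (replicas - 1)
    # With an even spread, every OSD receives as much replica traffic as it sends.
    replication_in = replication_out
    return {"in": from_client + replication_in, "out": replication_out}


if __name__ == "__main__":
    # One client limited to 1 Gbit/s, two OSDs, assumed replication factor 2.
    print(per_osd_traffic_gbit(1.0, 2, 2))  # {'in': 1.0, 'out': 0.5}
```

Under those assumptions each OSD receives at most what its own 1 Gbit link can carry, so the client's link remains the limit, which matches gregaf's remark.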
[22:21] <SpamapS> Is there a clear automated way to tell if a mon has already been added before running 'ceph mon add ..' ? I want to distinguish "couldn't add because its already there" failures from "couldn't add because of something else"
[22:21] <SpamapS> right now I'm manually extracting the list with monmaptool .. but that seems hacky
[22:22] <Tv> SpamapS: yup; we're brainstorming soon about that whole topic, it'll change to be better..
[22:22] <SpamapS> cool
[22:22] <Tv> sagewk: on that topic.. whenever is good for you to talk about mon bootstrapping, i'm working on that general topic anyway
[22:23] <sagewk> tv: k give me a few minutes
[22:32] * sandeen_ (~sandeen@sandeen.net) Quit (Quit: This computer has gone to sleep)
[22:36] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[22:55] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:57] * sandeen_ (~sandeen@ has joined #ceph
[23:32] * hijacker (~hijacker@ Quit (Ping timeout: 480 seconds)
[23:33] * hijacker (~hijacker@ has joined #ceph
[23:37] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:47] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[23:48] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:52] <sandeen_> sagewk, dunno if you saw, but I can reproduce it now with a very hacky streamtest
[23:52] <sagewk> sandeen_: yay!
[23:52] <sandeen_> I'm sure I've abused it horribly, if I ever show it to you please don't laugh, I just forced it to work ;)
[23:52] <sagewk> no worries, we definitely want to see it so we can verify this is fixed in the future.
[23:53] <sandeen_> now to see what's wrong :( turning off delalloc makes it go away... and the small attr update doesn't seem to matter.
[23:53] <sagewk> and/or morph it into any more general stress test tool
[23:53] <sandeen_> sent you a patch to show you what I am running now, I wasn't quite sure how to make the setattr work
[23:54] <sandeen_> it seemed to want a "bufferlist" passed to it?
[23:54] <sandeen_> so... I did something like that ;)
[23:56] <sagewk> sandeen_: hmm so just setting a small attr is enough
[23:56] <sandeen_> no just the big one
[23:56] <sandeen_> I think?
[23:56] * sandeen_ re-checks
[23:56] <sandeen_> yeah the big one
[23:57] <sandeen_> because it takes another block
[23:57] <sandeen_> bl2 has the big buffer
[23:57] <sandeen_> hm I should test not modifying the buffer so it shares blocks
[23:57] * sandeen_ tries
[23:58] <sagewk> oh i see, yeah.
[23:58] <sandeen_> er
[23:58] <sandeen_> yeah. :)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.