#ceph IRC Log


IRC Log for 2012-02-07

Timestamps are in GMT/BST.

[0:20] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[0:24] * fronlius (~fronlius@f054097033.adsl.alicedsl.de) Quit (Quit: fronlius)
[0:28] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[0:29] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[0:46] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[0:57] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[0:59] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[0:59] <eightyeight> ok. i've made some changes
[1:00] <eightyeight> if i want to take advantage of the btrfs checksumming, then it doesn't make sense for me to use linux software raid underneath btrfs
[1:00] <eightyeight> so, i've done the following:
[1:00] <eightyeight> http://ae7.st/p/48
[1:01] <eightyeight> as well as http://ae7.st/p/3c for the osd journal
[1:02] <eightyeight> my /etc/ceph/ceph.conf looks like: http://ae7.st/p/83
[1:02] <eightyeight> yet, i get the following errors when starting /etc/init.d/ceph:
[1:02] <eightyeight> http://ae7.st/p/2b
[1:02] <eightyeight> thoughts?
[1:03] <eightyeight> oh. i did issue mkcephfs(8) before starting the service
[1:04] <eightyeight> http://ae7.st/p/6f is the result of mkcephfs(8)
[1:07] <joshd> eightyeight: what does the osd log say?
[1:08] <Sargun> eightyeight: I didn't figure I'd see you here
[1:09] <eightyeight> i turned off debugging. so, here's the relevant log: http://ae7.st/p/5d
[1:09] <eightyeight> Sargun: :)
[1:09] <Sargun> Ceph is the shit, isn't it.
[1:09] <eightyeight> we'll see. :)
[1:10] <eightyeight> i like the architecture much more than gluster or moose
[1:10] <eightyeight> seems to be the perfect fit for what i'm architecting
[1:10] <joshd> eightyeight: oh, you probably just need to mkdir /data/osd.0 - mkcephfs doesn't do it for you so you don't accidentally put it in the wrong place
[1:11] <eightyeight> joshd: /data/osd.0 already exists
[1:14] <joshd> eightyeight: is /dev/sda mounted at /data/osd.0?
[1:15] <eightyeight> /dev/sda on /data/osd.0 type btrfs (rw)
[1:16] * joao (~joao@ Quit (Quit: joao)
[1:22] <joshd> eightyeight: are there 'magic', 'fsid', 'whoami', and 'ceph_fsid' files in /data/osd.0?
[1:22] * lollercaust (~paper@85.Red-83-41-151.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[1:23] <eightyeight> joshd: the dir is empty, actually
[1:23] <joshd> ok, then it wasn't initialized properly by mkcephfs
[1:24] <eightyeight> unmount, try again?
[1:25] <joshd> you can init it manually: http://ceph.newdream.net/wiki/OSD_cluster_expansion/contraction#Format_the_OSD
[1:26] <eightyeight> i'm assuming the "cosd" command is "ceph-osd"?
[1:26] <joshd> yeah
[1:26] <joshd> fixed that
[1:27] <joshd> that assumes your monitors are running already too
[1:27] <eightyeight> and "/path/to/osd/keyring" is "/data/keyring.osd.0"?
[1:27] <eightyeight> right
[1:27] <joshd> yup
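[editor's note] The manual format step being discussed can be sketched as a single command. This is a sketch, not the authoritative procedure: the flag spelling matches the ceph-osd invocation that appears later in this log, the keyring path is the one agreed above, and the monitors must already be running. Check `ceph-osd --help` on your version.

```shell
# Manually format osd.0, per the wiki steps linked above (sketch).
# /tmp/monmap is assumed to hold a current monmap fetched beforehand.
ceph-osd --mkfs -i 0 -c /etc/ceph/ceph.conf \
    --monmap /tmp/monmap \
    --keyring /data/keyring.osd.0
```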
[1:27] * __nolife (~Lirezh@83-64-53-66.kocheck.xdsl-line.inode.at) has joined #ceph
[1:30] <eightyeight> heh. it appears i've left the old mount mounted on the client this whole time, and now it's timing out on the umount(8)
[1:30] <joshd> if you want to force it, umount -lf
[1:31] <eightyeight> yeah
[1:34] <eightyeight> mount(8) seems to be hanging again
[1:36] <joshd> but the osd is up and everything is fine according to ceph -s?
[1:36] <eightyeight> no, actually
[1:37] <eightyeight> 2012-02-06 17:36:48.635883 osd e2: 0 osds: 0 up, 0 in
[1:37] <eightyeight> from "ceph -s"
[1:41] <eightyeight> ah. crap
[1:43] <joshd> figured it out?
[1:44] <eightyeight> nope
[1:44] <eightyeight> thought so
[1:44] <eightyeight> still 0 osds
[1:45] <joshd> what's the osd log?
[1:45] <eightyeight> nothing new in the past 20 minutes
[1:46] <joshd> is the ceph-osd process running?
[1:46] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[1:47] <eightyeight> new logs: http://ae7.st/p/7b
[1:48] <joshd> that all looks normal
[1:48] <eightyeight> yeah
[1:48] <joshd> just fyi, you might want to use a new kernel if you're using btrfs
[1:49] <eightyeight> definitely
[1:49] <eightyeight> i'll likely be rolling my own kernel by hand, if i can't convince the team to use debian unstable for this
[1:49] <eightyeight> ubuntu is just too frozen
[1:50] <joshd> so is the osd process still running, and deadlocked? or has it exited with nothing in the log?
[1:51] <eightyeight> there is a pid for ceph-osd
[1:51] <joshd> if it gets stuck in D state, check dmesg - it's probably btrfs
[1:52] <eightyeight> how to tell if it is stuck in "D state"? nothing useful in dmesg(1)
[1:53] <joshd> ps aux | grep ceph-osd
[1:53] <eightyeight> root 1133 0.2 0.6 373248 74860 ? Ssl 17:46 0:01 /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf
[1:54] <joshd> Ssl is the state column, so that's normal
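[editor's note] Checking the state column directly can be done as below; `$$` (this shell's own pid) stands in for the ceph-osd pid.

```shell
# Read just the ps STAT field for one pid; a leading 'D' means
# uninterruptible sleep (stuck in the kernel, e.g. btrfs I/O).
state=$(ps -o stat= -p $$ | tr -d ' ')
case "$state" in
  D*) echo "D state: stuck in kernel I/O - check dmesg" ;;
  *)  echo "state $state: not stuck" ;;
esac
```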
[1:54] <eightyeight> ah
[1:55] <joshd> I'd suggest restarting the osd with 'debug osd = 20', 'debug monc = 10', 'debug ms = 1'
[1:55] <joshd> it may be stuck trying to talk to the monitors
[1:55] <eightyeight> ok
[1:59] <eightyeight> 'debug mon = 10' i assume you meant?
[1:59] <joshd> no, monc is monitor client
[1:59] <eightyeight> oh
[2:00] <eightyeight> so, all this under "[osd]" in the /etc/ceph/ceph.conf on the server?
[2:00] <joshd> yeah
[2:00] <eightyeight> ok
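[editor's note] Collected into a ceph.conf fragment, joshd's suggested settings look like this (note 'debug monc', the monitor client, not 'debug mon'):

```ini
[osd]
    debug osd = 20
    debug monc = 10
    debug ms = 1
```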
[2:06] <eightyeight> heh. that's a bit verbose
[2:06] <eightyeight> http://ae7.st/p/40
[2:06] <eightyeight> probably more than you're looking for. maybe not enough. let me know.
[2:09] <joshd> that's enough - it looks like it's working normally
[2:10] <joshd> the question is whether the monitors accepted the osd_boot message, I think
[2:12] <joshd> if you add 'debug ms = 1' and 'debug mon = 20' to the monitor section and restart them, the monitor logs should have a clue
[2:13] <joshd> particularly after osd_boot shows up in the monitor logs
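[editor's note] And the corresponding monitor-side fragment from the suggestion above:

```ini
[mon]
    debug ms = 1
    debug mon = 20
```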
[2:21] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) has joined #ceph
[2:35] * Tv|work (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:49] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[2:58] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) Quit (Quit: Ex-Chat)
[4:04] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:18] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:38] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[6:32] * Meths_ (rift@ has joined #ceph
[6:37] * Meths (rift@ Quit (Read error: Operation timed out)
[7:19] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (Read error: Connection reset by peer)
[7:24] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[8:02] * Guest1785 (~matthew@pool-96-228-59-130.rcmdva.fios.verizon.net) has joined #ceph
[8:05] <Guest1785> I'm getting a bunch of "...send_message dropped message osd_op_reply(...) v1 because of no pipe on con 0x..." in my OSD logs when writing large batches of small files to a cluster with 2 OSDs, is this normal/okay?
[8:24] * henrycc (~Henry@ has joined #ceph
[8:40] * henrycc (~Henry@ Quit (Quit: Leaving)
[8:53] * amichel (~amichel@ip68-230-60-21.ph.ph.cox.net) has joined #ceph
[8:55] * adjohn (~adjohn@50-0-164-170.dsl.dynamic.sonic.net) has joined #ceph
[9:00] <amichel> So I was in a few days ago talking about OSD sizing on a single server with many disks (backblaze-ish design). @nhm I think you were helping me get some insight into it.
[9:01] <amichel> I'm getting ready to finally configure the OSDs, but someone had expressed some desire for testing against this kind of design and since mine is essentially a clean box with no fixed production schedule, I thought I might see if there was any sort of "test plan" you guys might want executed to get some numbers?
[9:05] <amichel> The port multipliers create some unfortunate bottlenecks, so I'm not sure how granular it will make sense to go, but I'll try about anything that would be helpful.
[9:25] <iggy> amichel: separate disks for journals are generally favored for performance... maybe some tests with one of the drives broken up for just journaling, or maybe staggering the journals with the osds
[9:26] <amichel> How big are the journals typically?
[9:27] <iggy> we had this discussion the other night
[9:27] <iggy> nothing firm was stated really, basically enough to cover 5 secs or so of osd writes
[9:29] <amichel> hhmm
[9:35] <amichel> Looks like my best case sequential for a single stripe is ~600MB/s
[9:36] <amichel> so I need what, 3 gigs or so by that measure?
[9:36] <amichel> Per stripe that is
[9:38] <amichel> I'm using SSD for system disk, would it make sense to just chop slices off that for journal space?
[9:42] <amichel> Seems like slicing off a full 3T spindle for journal would be a bit wasteful
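[editor's note] amichel's estimate works out as below; the 2x headroom factor is my assumption, not a Ceph default.

```shell
# Journal sizing rule of thumb from the discussion: enough to absorb
# ~5 seconds of writes at the best-case sequential rate.
THROUGHPUT_MB_S=600   # measured best case per stripe
FLUSH_INTERVAL_S=5    # "5 secs or so of osd writes"
SAFETY_FACTOR=2       # headroom (assumption)
JOURNAL_MB=$(( THROUGHPUT_MB_S * FLUSH_INTERVAL_S * SAFETY_FACTOR ))
echo "${JOURNAL_MB} MB per journal"   # 6000 MB, i.e. ~6 GB with headroom
```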
[10:04] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:16] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[10:16] * joao (~joao@ has joined #ceph
[10:17] * amichel (~amichel@ip68-230-60-21.ph.ph.cox.net) Quit (Quit: Bad news, everyone!)
[10:25] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[10:27] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[10:42] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[11:02] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[11:02] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[11:06] * adjohn (~adjohn@50-0-164-170.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[11:06] * adjohn (~adjohn@50-0-164-170.dsl.dynamic.sonic.net) has joined #ceph
[11:06] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[11:07] * adjohn (~adjohn@50-0-164-170.dsl.dynamic.sonic.net) Quit ()
[11:24] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[11:41] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:51] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:51] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[11:51] * fronlius_ is now known as fronlius
[12:02] * fronlius_ (~fronlius@p578b21b6.dip0.t-ipconnect.de) has joined #ceph
[12:02] * bugoff_ (bram@november.openminds.be) Quit (Remote host closed the connection)
[12:02] * bugoff (bram@november.openminds.be) has joined #ceph
[12:03] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Ping timeout: 480 seconds)
[12:03] * fronlius_ is now known as fronlius
[12:05] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[12:05] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[12:11] * fronlius (~fronlius@p578b21b6.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[12:13] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:28] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Read error: No route to host)
[12:28] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[12:32] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has left #ceph
[12:37] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[12:43] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[13:24] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[13:28] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[13:36] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[13:36] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[13:36] * fronlius_ is now known as fronlius
[14:18] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[14:22] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[14:42] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has left #ceph
[15:03] <lxo> hey, remember an alleged btrfs bug I mentioned that could be causing zero-sized files to appear. I'm now convinced it may be a ceph bug after all
[15:04] <lxo> AFAICT snapshots can be taken when files have just been created and not written into yet. this seems to occur particularly often for pginfo files, and recovery doesn't behave very well when it encounters empty files
[15:04] <lxo> empty pginfo files, I mean
[15:07] <lxo> shouldn't pginfo updates follow the same “write to temp then move” protocol that AFAICT we follow for data files?
[15:08] <lxo> (and if ICT correctly, this leaves open the question of how small data files end up empty when they shouldn't)
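[editor's note] The "write to temp then move" protocol lxo refers to can be sketched as follows (hypothetical file names, not the actual OSD code; the real implementation would fsync the temp file itself rather than calling a global sync):

```shell
# Write new contents to a temp file in the same directory, flush, then
# rename over the old name. rename(2) is atomic within one filesystem,
# so a concurrent snapshot sees either the old or the new complete
# file - never a zero-length half-written one.
dir=$(mktemp -d)
printf 'new pg info' > "$dir/pginfo.tmp"
sync                        # flush before the rename becomes visible
mv -f "$dir/pginfo.tmp" "$dir/pginfo"
cat "$dir/pginfo"           # prints: new pg info
```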
[15:08] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[15:14] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[15:30] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[15:41] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[15:41] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[16:06] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[16:07] * henrycc (~Henry@219-86-164-64.dynamic.tfn.net.tw) has joined #ceph
[16:29] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Operation timed out)
[16:34] * jmlowe (~Adium@c-98-223-195-84.hsd1.in.comcast.net) has joined #ceph
[16:35] <jmlowe> I've got one pg stuck in active+backfill, any hints?
[16:35] <jmlowe> dumping the pg's gives me this
[16:35] <jmlowe> 0.2dc 1 0 1 0 1114112 150093 150093 active+backfill 1119'1927 772'1465 [10,0] [7,10,5] 108'322 2012-01-25 13:13:27.613815
[16:39] * jmlowe (~Adium@c-98-223-195-84.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[16:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[17:00] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[17:00] <- *jmlowe* help
[17:02] <lxo> how could I go about getting osd to take an empty pginfo file as if it contained e.g. a sequence of 8 NULs or so, so that it doesn't crash upon encountering such empty files? what would I lose if I did that instead of failing more gracefully than a segfault, for manual intervention?
[17:05] * andreask (~andreas@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[17:05] <lxo> on a related note, should ceph *ever* create zero-sized files, even as data files? or would it be safe to assume that empty files are symptoms of this bug, and have them reported and dropped?
[17:05] <lxo> (to be re-fetched from another replica or regarded as lost)
[17:16] * Ludo (~Ludo@88-191-129-65.rev.dedibox.fr) Quit (Server closed connection)
[17:16] * Ludo (~Ludo@88-191-129-65.rev.dedibox.fr) has joined #ceph
[17:40] <jmlowe> ok, I've got some other strange things going on here, rbd map fails with add failed: (2) No such file or directory but I can see the target with ls
[17:56] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:01] * lollercaust (~paper@85.Red-83-41-151.dynamicIP.rima-tde.net) has joined #ceph
[18:02] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Read error: No route to host)
[18:03] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[18:05] <jmlowe> ok, looks like I've got it good and broken, not sure how to proceed
[18:14] * Tv|work (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:14] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[18:25] * henrycc (~Henry@219-86-164-64.dynamic.tfn.net.tw) Quit (Ping timeout: 480 seconds)
[18:26] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[18:39] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[18:43] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[18:50] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Ping timeout: 480 seconds)
[18:53] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:05] * lollercaust (~paper@85.Red-83-41-151.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[19:06] <joshd> jmlowe: pgs stuck in active+backfill is definitely a bug
[19:07] * bchrisman (~Adium@ has joined #ceph
[19:07] * chutzpah (~chutz@ has joined #ceph
[19:10] <joshd> jmlowe: if you could turn on osd debugging and restart the primary for the stuck pg (osd.7) we'd like to see the logs
[19:11] <joshd> jmlowe: it may not get stuck again after the restart, but if it does, we'd need a log with debugging to figure out why
[19:12] <joshd> jmlowe: for the rbd map problem, is there anything in dmesg?
[19:13] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[19:26] * _are_ (~quassel@vs01.lug-s.org) Quit (Server closed connection)
[19:26] * _are_ (~quassel@vs01.lug-s.org) has joined #ceph
[19:29] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[19:32] * fronlius (~fronlius@testing78.jimdo-server.com) Quit ()
[19:38] <jmlowe> I've got osd's crashing left and right
[19:40] <joshd> lxo: ceph can create 0-length data files
[19:42] <lxo> anyone else experiencing btrfs snapshot failure causing ceph osds to die, like this http://pastebin.com/nVEqcn78 ?
[19:43] <joshd> lxo: I think we've seen that in a few qa runs
[19:43] <lxo> joshd, oh well... of course one can create zero-length files in the filesystem (but will these have osd objects created for them?), or even zero-length objects using the object store interface
[19:44] <joshd> lxo: there's an osd operation that just creates a file by touching it, and it's exposed by librados, I'm not sure if the mds or clients use it though
[19:45] <lxo> I'm getting that all the time after adding disks or changing crushmaps in a large cluster. looks like the orphan queue grows too large and takes too long to clean before a new snapshot can complete or something, and then it fails. easy to work around, but still a pain
[19:45] <lxo> I've seen empty files with snapdir in their names, too, but I haven't had any problem with those
[19:49] <joshd> lxo: snapdirs should always be empty (just have some xattrs)
[19:51] <lxo> excellent
[19:52] <joshd> jmlowe: what backtraces are you seeing?
[19:54] <jmlowe> ok, the osd's are now staying up, I'm here:
[19:54] <jmlowe> 2012-02-07 13:54:19.944399 pg v1422196: 2376 pgs: 2330 active+clean, 24 active+clean+replay, 2 active+clean+scrubbing, 16 active+backfill, 4 active+replay+backfill; 946 GB data, 2362 GB used, 18854 GB / 22334 GB avail; 2143/504411 degraded (0.425%)
[19:55] <joshd> lxo: did the osd where you're seeing 0-length pginfo ever get powered off unexpectedly, or did the osd crash before?
[19:55] <jmlowe> and tons of restarting backfills
[19:56] <jmlowe> monitor election
[19:56] * fghaas (~florian@85-127-86-65.dynamic.xdsl-line.inode.at) has joined #ceph
[19:57] * Meths_ is now known as Meths
[19:58] <lxo> joshd, sort of. the osd won't start with empty pginfo files, so when I see them, I have to fix them up by hand before it comes back up. but then, even during normal operation, I see empty pginfo files in snap_* dirs that were supposed to be stable, but that, if they're made current for recovery, will get the osd to crash
[19:59] <lxo> the sort of is because I know the osds don't have such files when they (re)join the cluster, for I had to fix them up by hand, even if the osd failed before due to time-outs or snapshotting failures like the ones above
[19:59] <Tv|work> sjust: http://blitiri.com.ar/p/libfiu/
[20:00] <lxo> at some point I even saw one of these pginfo files get size zero in several snap_$V, before it finally was filled in within current/
[20:02] <lxo> now, this *could* still be a btrfs bug, say, misreporting sizes or something, but my straces seem to indicate the pginfo file is opened in place, after being unlinked, and nothing seems to stop another thread from taking a snap_$V before the pginfo data is actually written
[20:03] <lxo> now, if only I could find out where in the sources the osd deals with pginfo files... grep hasn't been exactly helpful :-)
[20:10] <jmlowe> ok, part of my problem is hardware/driver related
[20:20] <lxo> ok, I found read_state in PG.cc. now... what happens if I catch exceptions in info decoding and return gracefully, pretty much like we deal with exceptions decoding the logs?
[20:21] <lxo> would that lead to information loss, or can it all be recovered (more expensively) from the objects' xattrs?
[20:28] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Ping timeout: 480 seconds)
[20:35] * amichel (~amichel@salty.uits.arizona.edu) has joined #ceph
[20:37] * jmlowe (~Adium@129-79-134-204.dhcp-bl.indiana.edu) has joined #ceph
[21:01] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[21:13] <Guest1785> I'm getting a bunch of "...send_message dropped message osd_op_reply(...) v1 because of no pipe on con 0x..." in my OSD logs when writing large batches of small files to a cluster with 2 OSDs, is this normal/okay?
[21:23] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[21:50] * lollercaust (~paper@85.Red-83-41-151.dynamicIP.rima-tde.net) has joined #ceph
[21:51] <jmlowe> ok I'm back to a stable point where I'm stuck with active+backfill
[21:52] <jmlowe> 0.2dc 1 0 1 0 1114112 126222 126222 active+backfill 2025'2662 2064'1166 [10,0] [8,10,5] 108'3227 2012-01-25 13:13:27.613815
[21:52] <jmlowe> which one should I restart?
[21:52] <jmlowe> 10,8,5,0 ?
[21:52] <joshd> jmlowe: 8
[21:53] <joshd> the first list is of "up" osds that are supposed to have the pg, the second is the list of "acting" osds, which might be different from the up set until the ones in the up set can be brought fully up to date
[21:54] <jmlowe> : [ERR] 0.71 osd.10 missing 5d9fc71/rb.0.14.00000000151e/head
[21:54] <jmlowe> 2012-02-07 15:53:38.794279 log 2012-02-07 15:53:32.811785 osd.10 1112 : [ERR] 0.71 scrub 275 missing, 0 inconsistent objects
[21:54] <jmlowe> 2012-02-07 15:54:22.323978 pg v1422536: 2376 pgs: 2366 active+clean, 6 active+clean+replay, 1 active+clean+scrubbing, 2 active+clean+inconsistent, 1 active+replay+backfill; 947 GB data, 2022 GB used, 18864 GB / 22004 GB avail; 1/499907 degraded (0.000%)
[21:55] <jmlowe> 2012-02-07 15:54:57.249868 pg v1422540: 2376 pgs: 2367 active+clean, 6 active+clean+replay, 2 active+clean+inconsistent, 1 active+replay+backfill; 947 GB data, 2022 GB used, 18864 GB / 22004 GB avail; 1/499907 degraded (0.000%)
[21:55] <joshd> Guest1785: that might happen occasionally, when clients time out under high load, but they'll resend the requests so they'll be fine
[21:56] <jmlowe> what should I do now?
[21:57] <jmlowe> I'm getting scrub errors
[21:58] <jmlowe> now pd dump is 0.2dc 1 0 1 0 1114112 230208 230208 active+replay+backfill 2025'2662 2093'1150 [10,0] [5,10,7] 108'3227 2012-01-25 13:13:27.613815
[21:59] <Guest1785> joshd: Thanks. At what point should I be concerned if I'm seeing a bunch during testing? Or how do I find out what's causing timeouts?
[22:02] <joshd> jmlowe: for the scrub errors, I'm guessing those are from the pgs inconsistent?
[22:03] <jmlowe> should I just let it run, is it going to do anything?
[22:03] <jmlowe> 2012-02-07 16:03:04.527688 pg v1422656: 2376 pgs: 2354 active+clean, 6 active+clean+replay, 15 active+clean+inconsistent, 1 active+replay+backfill; 947 GB data, 2022 GB used, 18864 GB / 22004 GB avail; 1/499907 degraded (0.000%)
[22:04] <joshd> jmlowe: check dmesg on the osds with inconsistent pgs, and if there's nothing there, turn on debugging with "ceph tell osd N injectargs '--debug-osd 20'"
[22:05] <joshd> jmlowe: then you can re-run scrub manually with 'ceph scrub N', where N is the osd number
[22:05] <joshd> jmlowe: and the logs will tell us why the pgs are inconsistent
[22:07] <jmlowe> tons of 2012-02-07 16:06:53.384661 7f210c74d700 log [ERR] : 0.1a7 osd.10 missing 134fffa7/rb.0.1a.000000011c8f/head
[22:07] <joshd> jmlowe: what's the pg dump?
[22:08] <jmlowe> whole thing?
[22:08] <joshd> Guest1785: you could turn on ms debugging on the clients, but it shouldn't be a problem due to the resending. if the clients aren't making progress, that's a problem
[22:08] <joshd> jmlowe: pastebin or similar
[22:10] <jmlowe> https://slashtmp.iu.edu/files/download?FILE=jomlowe%2F63978WfKA6S
[22:10] <jmlowe> password: pgdump.txt
[22:12] <joshd> jmlowe: almost all the inconsistent ones have 10 as the primary, is there anything in dmesg on it?
[22:15] <jmlowe> not sure what I'm looking for
[22:17] <joshd> lxo: sjust tells me that the pg_info file is used to store extra info that wouldn't fit in xattrs, so ignoring the decode failure wouldn't work. the way to do it fully consistently would be to stop the osd, manually delete the directory for the bad pg, then start the osd again and let recovery resync it
[22:17] <joshd> jmlowe: any kind of error from the filesystem
[22:18] <joshd> jmlowe: were you doing anything special that triggered all these problems?
[22:19] <jmlowe> woke up this morning to find 8 of 10 osd's down
[22:19] <jmlowe> been running for a couple of weeks just fine
[22:19] <jmlowe> https://slashtmp.iu.edu/files/download?FILE=jomlowe%2F19941n5eKay
[22:19] <jmlowe> password: osd.10.log
[22:21] <jmlowe> Going to be around for a bit Josh?
[22:21] <joshd> yeah
[22:21] <joshd> it'll take a little while to look through the log
[22:21] <jmlowe> ok, need to relocate, I've had my fill of datacenter VOCs
[22:22] <joshd> hehe, no problem
[22:23] <jmlowe> ok back in 30 or so
[22:23] * jmlowe (~Adium@129-79-134-204.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[22:35] * fronlius_ (~fronlius@f054115095.adsl.alicedsl.de) has joined #ceph
[22:41] <Guest1785> joshd: Thanks for your help.
[22:45] <joshd> Guest1785: you're welcome
[22:53] * jmlowe (~Adium@c-98-223-195-84.hsd1.in.comcast.net) has joined #ceph
[22:56] <jmlowe> back, find anything?
[22:57] <joshd> jmlowe: well, the initial crash was from a sync that took > 10 minutes
[22:57] <jmlowe> ok
[22:58] <jmlowe> any thoughts on what to do next?
[22:58] <joshd> jmlowe: after that, it was trying to replay the journal, but debugging wasn't on yet, and then some time later starting scrubbing and found it was missing a bunch of objects
[23:00] <joshd> jmlowe: check syslog and kern.log on osd10 - there should be at least a warning about a stalled task around 12:20-12:40
[23:00] <joshd> if there's anything else around that time, it might be interesting
[23:02] <jmlowe> I shut everything down and rebooted around that time, came up with the 3.2 kernel rather than the 3.0 kernel I have been using; 3.2 causes my raid controller to become extremely slow after heavy writes, 100 KB/s slow
[23:02] <joshd> wow, that might cause a sync to take 10 minutes
[23:03] <jmlowe> took me awhile to figure out that was what was happening, forgot that kernel was on there
[23:07] <joshd> jmlowe: the slow raid controller driver might be why that one pg seems to be stuck in backfill as well
[23:09] <jmlowe> should be better now, also I believe that started before I rebooted into the broken kernel
[23:12] <joshd> jmlowe: well, without debug logging we can't tell much about what caused it
[23:12] <joshd> jmlowe: the big question to me is, why did osd 10 go active+clean when it had so many missing objects?
[23:13] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[23:14] <joshd> jmlowe: did you upgrade all your osds to 3.2, so they were all crashing due to the slow raid controller?
[23:14] <joshd> jmlowe: or did you see different crashes?
[23:14] <jmlowe> yes, all osd's were crashing due to slow raid controller
[23:15] <jmlowe> initial problem this morning with crashed osd's was not due to raid controller, no idea why that happened
[23:15] <joshd> do you still have logs from that time?
[23:17] <joshd> also, when you upgraded, are you sure the raid controller flushed its buffers before the reboot? some controllers will tell you they finished writing, but then fail to store the data on disk before the power goes off
[23:18] <jmlowe> doesn't look like I have any logs except the one you have, everything is zero length
[23:20] <jmlowe> not sure about raid controller buffer flush
[23:20] <jmlowe> they do have bbu's
[23:21] <jmlowe> is there any way for me to recover?
[23:24] <joshd> since 10 seems to be the problem for most of the pgs, I'd suggest stopping it, moving its data directory, and recreating it
[23:28] <joshd> the inconsistent pgs will still be marked inconsistent, but once there's a new 10, the repair command is more likely to use the correct data (right now it treats the primary's copy as correct if there's no other way to tell)
[23:31] * Guest1785 (~matthew@pool-96-228-59-130.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[23:34] <jmlowe> ceph-osd --mkfs -i 10 --monmap /tmp/monmap seems to be taking its time
[23:35] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[23:37] <nhm> amichel: just wanted to mention I saw your comments last night. I've got to run now, but please email me at nhm@clusterfaq.org. I'd be really interested in your test results.
[23:39] <jmlowe> right, doesn't like having anything not ceph related in the osd dir
[23:40] * ircleuser (~ivsipi@216-239-45-4.google.com) has joined #ceph
[23:52] * ircleuser (~ivsipi@216-239-45-4.google.com) has left #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.