#ceph IRC Log


IRC Log for 2012-01-31

Timestamps are in GMT/BST.

[0:16] * lollercaust (~paper@40.Red-88-19-215.staticIP.rima-tde.net) Quit (Quit: Leaving)
[0:18] * MarkN (~nathan@ has joined #ceph
[0:19] * MarkN (~nathan@ has left #ceph
[0:44] <sagewk> gregaf, sjust: you guys looked at http://tracker.newdream.net/issues/1983 right?
[0:44] <sjust> I helped greg with it a little bit
[0:45] <gregaf> umm, a little?
[0:45] <gregaf> I seem to recall it not being a trivial thing that I saw the problem for so I thought somebody who knew the state machine better should look at it
[0:45] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: No route to host)
[0:45] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[0:46] * ameen (~ameen@unstoppable.gigeservers.net) has joined #ceph
[0:54] * MarkN (~nathan@ has joined #ceph
[0:54] * MarkN (~nathan@ has left #ceph
[0:54] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) Quit (Quit: Who the hell is this Peer? If I catch him, I'll reset his connection!)
[0:55] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Remote host closed the connection)
[0:55] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[1:40] * yoshi (~yoshi@p11133-ipngn3402marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:43] * Tv|work (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:54] * ceph (~hylick@ Quit (Quit: Leaving.)
[2:19] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:57] * adjohn is now known as Guest1041
[2:57] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[2:59] * Guest1041 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Operation timed out)
[3:05] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[3:30] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Remote host closed the connection)
[3:30] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[3:30] * jmlowe (~Adium@c-98-223-195-84.hsd1.in.comcast.net) has joined #ceph
[3:36] * isaac54 (~isaac@68-119-70-42.dhcp.mtgm.al.charter.com) Quit (Read error: Operation timed out)
[3:54] * adjohn is now known as Guest1047
[3:54] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[3:54] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit ()
[4:00] * Guest1047 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Ping timeout: 480 seconds)
[4:01] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:42] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:45] * ameen (~ameen@unstoppable.gigeservers.net) Quit (Ping timeout: 480 seconds)
[5:19] * jiaju (~jjzhang@ has joined #ceph
[5:20] * jiaju (~jjzhang@ Quit ()
[5:20] * jiaju (~jjzhang@ has joined #ceph
[6:07] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[6:38] * ameen (~ameen@unstoppable.gigeservers.net) has joined #ceph
[7:04] * darkfader (~floh@ Quit (Remote host closed the connection)
[7:04] * darkfader (~floh@ has joined #ceph
[7:26] * jmlowe1 (~Adium@c-98-223-195-84.hsd1.in.comcast.net) has joined #ceph
[7:26] * jmlowe (~Adium@c-98-223-195-84.hsd1.in.comcast.net) Quit (Read error: Connection reset by peer)
[7:32] * jmlowe (~Adium@c-98-223-195-84.hsd1.in.comcast.net) has joined #ceph
[7:32] * jmlowe1 (~Adium@c-98-223-195-84.hsd1.in.comcast.net) Quit (Read error: Connection reset by peer)
[7:45] * jmlowe (~Adium@c-98-223-195-84.hsd1.in.comcast.net) Quit (Read error: Connection reset by peer)
[8:36] * jiaju (~jjzhang@ Quit (Remote host closed the connection)
[9:07] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[10:01] * yoshi (~yoshi@p11133-ipngn3402marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:24] * MK_FG (~MK_FG@ Quit (Ping timeout: 480 seconds)
[11:17] * joao (~joao@89-181-157-105.net.novis.pt) has joined #ceph
[11:42] * stass (stas@ssh.deglitch.com) Quit (Ping timeout: 480 seconds)
[11:49] * stass (stas@ssh.deglitch.com) has joined #ceph
[12:35] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[13:33] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[13:39] * joao is now known as Guest1080
[13:39] * joao_ (~joao@89-181-157-105.net.novis.pt) has joined #ceph
[13:39] * joao_ is now known as joao
[13:43] * joao is now known as Guest1082
[13:43] * joao_ (~joao@89-181-157-105.net.novis.pt) has joined #ceph
[13:43] * joao_ is now known as joao
[13:43] * Guest1080 (~joao@89-181-157-105.net.novis.pt) Quit (Ping timeout: 480 seconds)
[13:49] * Guest1082 (~joao@89-181-157-105.net.novis.pt) Quit (Ping timeout: 480 seconds)
[13:54] * joao is now known as Guest1083
[13:54] * joao_ (~joao@89-181-157-105.net.novis.pt) has joined #ceph
[13:54] * joao_ is now known as joao
[13:57] * MK_FG (~MK_FG@ has joined #ceph
[13:59] * Guest1083 (~joao@89-181-157-105.net.novis.pt) Quit (Ping timeout: 480 seconds)
[14:07] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Connection reset by peer)
[14:09] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:29] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[14:32] * gregorg_taf (~Greg@ has joined #ceph
[14:37] * gregorg (~Greg@ Quit (Ping timeout: 480 seconds)
[14:38] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[15:07] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:39] * lollercaust (~paper@40.Red-88-19-215.staticIP.rima-tde.net) has joined #ceph
[15:49] * gregorg (~Greg@ has joined #ceph
[15:59] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[16:08] * ceph (~hylick@ has joined #ceph
[16:18] * Tv|work (~Tv|work@aon.hq.newdream.net) has joined #ceph
[17:22] * lollercaust (~paper@40.Red-88-19-215.staticIP.rima-tde.net) Quit (Quit: Leaving)
[17:34] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[17:34] <Kioob`Taff> Hi
[17:40] <nhm> hello!
[17:43] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[17:47] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:54] <lxo> wow, the 0.41 recovery dance is quite a show! well done, folks!
[17:56] <lxo> I'm running into osd/PG.cc: 4142: FAILED assert(pg->log.tail <= pg->info.last_complete) but I'm hoping a sufficient number of osd restarts will eventually get me past that
[17:56] <Kioob`Taff> I have some frozen clients (ceph 0.41), a simple "ls" doesn't answer
[17:56] <Kioob`Taff> how can I see what's happening?
[17:57] <Tv|work> lxo: now you're just making me think of http://www.youtube.com/watch?v=ywWBy6J5gz8
[17:57] <lxo> the failed asserts may also be related with my having to roll back one of the osds to an older snapshot of the filesystem
[17:58] <lxo> Tv|work, heh
[17:59] <Tv|work> i still can't tell how serious those people were, making that
[18:08] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[18:21] * ceph (~hylick@ Quit (synthon.oftc.net larich.oftc.net)
[18:21] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (synthon.oftc.net larich.oftc.net)
[18:21] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (synthon.oftc.net larich.oftc.net)
[18:21] * darkfader (~floh@ Quit (synthon.oftc.net larich.oftc.net)
[18:23] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[18:24] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[18:24] * ceph (~hylick@ has joined #ceph
[18:25] * darkfader (~floh@ has joined #ceph
[18:31] <gregaf> lxo: that assert is not going to fix itself with restarts, either your snapshot rollback broke the world or some of the (new?) code did
[18:32] <yehudasa> lxo: sounds like that's a roll back of one of your osds
[18:33] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[18:33] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Remote host closed the connection)
[18:33] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[18:33] <yehudasa> lxo: I'm not sure what would be the best way to proceed from here.. sjust?
[18:37] <lxo> FWIW, ceph -w shows some “log 2012-01-31 15:36:57.229043 osd.4 1 : [ERR] 1.31 last_complete 56'2791 < log.tail 3401'23492” from osds that later on crash on that assert. are these realted?
[18:38] <lxo> or perhaps related? :-)
[18:38] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[18:39] <yehudasa> lxo: it's the same error that you see in the assertion
[18:40] <lxo> yeah, but this message is from a different spot AFAICT. and IIRC I was already getting these on 0.40 and before
[18:41] <lxo> we surely don't crash right away when we issue that ERR; some osds spit out a handful of these and run on for several minutes before crashing with that assert
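
The values in the [ERR] line quoted above can be checked against the assert condition by hand. The following is a minimal sketch, assuming the `epoch'version` notation orders by epoch first and version second; the `EVersion` type and `parse_eversion` helper are hypothetical names for illustration, not Ceph code.

```python
from typing import NamedTuple


class EVersion(NamedTuple):
    """An (epoch, version) pair as printed in the log, e.g. 56'2791."""
    epoch: int
    version: int


def parse_eversion(text: str) -> EVersion:
    """Split the epoch'version notation into two integers."""
    epoch, version = text.split("'")
    return EVersion(int(epoch), int(version))


# Values copied from the [ERR] line reported for pg 1.31.
last_complete = parse_eversion("56'2791")
log_tail = parse_eversion("3401'23492")

# Tuples compare element by element, so the epoch is compared first
# (assumed ordering). This mirrors the failed assert:
#   pg->log.tail <= pg->info.last_complete
invariant_holds = log_tail <= last_complete
print(invariant_holds)  # False
```

Under that assumed ordering the log tail is far ahead of last_complete, which is the same condition the [ERR] message and the assert both report.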
[18:42] <sjust> lxo: were there any messages prior to that assert?
[18:42] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:46] <lxo> sjust, not really. on one of the osds (osd.2), after the osd journal replay, a couple of read_log dups on an unrelated PG (1.4a), 3 last_complete < log_tail (for 1.4a, 1.79 and 1.31), and then several minutes of journal throttle waiting for ops and starting backfilling of other osds
[18:46] <lxo> oh, and a few pipe nothing to send messages
[18:47] <lxo> and then, after a bunch of journal throttle messages, the assert comes out of the blue
[18:47] * vodka (~paper@40.Red-88-19-215.staticIP.rima-tde.net) has joined #ceph
[18:56] <lxo> just checked earlier logs, I used to get the read_log dup, but not the last_complete ERRs. which is odd, because they looked familiar
[18:57] <lxo> and code to print that ERR *was* present in 0.40, too
[18:57] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:59] * bchrisman (~Adium@ has joined #ceph
[19:04] * chutzpah (~chutz@ has joined #ceph
[19:04] <lxo> ok, any point in my trying to dig up further info?
[19:05] <lxo> before rolling back the entire cluster to the same snapshot, that is
[19:05] <sagewk> lxo: those warnings popped up in the past, but not recently.
[19:05] <sagewk> the osds are still failing when they start up? can you generate a log for one of them?
[19:05] <sagewk> how old was the snapshot?
[19:05] * aliguori (~anthony@ has joined #ceph
[19:06] <lxo> a few days old. the data in the cluster hadn't changed much since then, but it had propagated from the 2 osds that got the initial data dump to the remaining 9 osds
[19:07] <lxo> what's more: I'd rolled back each of the two osds to it a few times after rearranging the crushmap
[19:07] <lxo> 0.40 didn't seem to mind ;-)
[19:08] <lxo> I guess I'll just roll back and be done with it, unless you think we can dig up useful info from the current broken cluster
[19:10] <lxo> speaking of rearranging the crushmap... I ran into an interesting situation after loading a broken map into the cluster: mons and osds would crash because of the second BUG_ON in crush_choose
[19:11] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[19:11] <lxo> I ended up having to relax the BUG_ON into a skip in order to get the mons to let me load a new crushmap. but the osds would still crash after that, so I had to restart them with a similarly-relaxed osdmap
[19:12] <lxo> err similarly relaxed crush_choose
[19:13] <lxo> I was a bit surprised crushtool didn't catch the bug. AFAICT the problem was that I mixed different-type children in a single parent
[19:29] * vodka (~paper@40.Red-88-19-215.staticIP.rima-tde.net) Quit (Quit: Leaving)
[19:29] <sagewk> when you say roll back, you're doing something manually in btrfs, or using the cluster_snapshot mechanism?
[19:37] <lxo> well... I'm using the clustersnap osd state as created by cluster_snap, but I'm snapshotting it into current with btrfs. something like “for d in snap_* current; do btrfs su del $d; done; btrfs su snap clustersnap_saved current; btrfs su snap current snap_$(cat current/commit_op_seq); ceph-osd -i N --mkjournal”
[19:37] <lxo> is there any other way to do it?
[19:42] <lxo> oh, sometimes the first start-up of such a rolled back snapshot fails to find a recent-enough osdmap and crashes (or, as it happens, it crashes because of the invalid crushmap in the incremental osdmap chain), but it suffices to figure out which osdmap file it tries to open before failing and copy it out of a mon to get it going again
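
The rollback one-liner lxo quotes can be laid out step by step. This is a dry-run sketch only: it builds the command list and runs nothing. The function name and its parameters are hypothetical, and `commit_op_seq` is passed in here, whereas the original reads it from `current/commit_op_seq` after the snapshot is taken.

```python
def rollback_commands(subvolumes, saved_snapshot, commit_op_seq, osd_id):
    """Return the commands from the log for one OSD data directory.

    Nothing is executed; the caller decides what to do with the list.
    """
    cmds = []

    # Step 1: delete every snap_* subvolume, then "current",
    # matching the order of `for d in snap_* current`.
    snaps = sorted(name for name in subvolumes if name.startswith("snap_"))
    for name in snaps + ["current"]:
        cmds.append(["btrfs", "su", "del", name])

    # Step 2: recreate "current" from the saved cluster snapshot.
    cmds.append(["btrfs", "su", "snap", saved_snapshot, "current"])

    # Step 3: snapshot "current" again under the name snap_<commit_op_seq>.
    cmds.append(["btrfs", "su", "snap", "current", f"snap_{commit_op_seq}"])

    # Step 4: recreate the journal for this OSD.
    cmds.append(["ceph-osd", "-i", str(osd_id), "--mkjournal"])
    return cmds


# Hypothetical directory contents and sequence number, for illustration only.
for cmd in rollback_commands(
    ["snap_10", "current", "clustersnap_saved"], "clustersnap_saved", 42, 2
):
    print(" ".join(cmd))
```

Keeping the steps as data makes it easy to review the sequence before running anything against a real OSD directory.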
[19:43] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[19:44] <lxo> ok, cluster rolled back to the last-known-good snapshot, all osds are up, let the fun begin! :-)
[19:49] <sagewk> lxo: yeah, the second part of making the cluster_snapshot thing actually usable is snapshotting the monitors too, but i didn't get that far
[19:53] <lxo> is that actually necessary? I mean, I was surprised that rolling back just the osds worked; I'd expected to have to roll back the mon state as well, but it turned out that, in spite of changes to the cluster, the osds and the mon recovered just fine after rolling back just the osds (0.40)
[19:53] <sagewk> it works in some cases, but the monitors trim old osdmaps based on osd progress w/ peering and recovery, so that can screw you.
[19:54] <sagewk> also, it doesn't protect you from corrupted monitor state then (like the bad crushmaps)
[19:54] <lxo> right
[20:05] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) has joined #ceph
[20:05] * adjohn is now known as Guest1148
[20:05] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[20:11] <lxo> dance, little sweet octopuses, dance!
[20:13] * Guest1148 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Ping timeout: 480 seconds)
[21:09] <Kioob> root! front5:/ceph/radins/libs/_core# stat geoloc/
[21:09] <Kioob> File: «geoloc/»
[21:09] <Kioob> Size: 18446744073709547695 Blocks: 0 IO Block: 65536 répertoire
[21:09] <Kioob> huge size for a directory, no ? :p
[21:09] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[21:10] <iggy> Kioob: there's some emails on the list about that
[21:10] <Kioob> mmm I think I should subscribe to the list
[21:11] <iggy> it's not very high traffic
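
The directory size Kioob pasted sits just below 2^64, which is what a small negative number looks like when printed as an unsigned 64-bit integer. The sketch below only reinterprets the reported value; the log does not say what caused it, so the negative-counter reading is an assumption.

```python
import struct

# Size reported by stat for the geoloc/ directory.
reported_size = 18446744073709547695

# Pack as unsigned 64-bit, then unpack the same bytes as signed 64-bit.
(signed_size,) = struct.unpack("<q", struct.pack("<Q", reported_size))
print(signed_size)  # -3921
```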
[21:31] * Tv|work (~Tv|work@aon.hq.newdream.net) has left #ceph
[21:35] * Tv|work (~Tv|work@aon.hq.newdream.net) has joined #ceph
[21:45] <sagewk> joshd: something up with the teuthology lock server?
[21:45] <joshd> sagewk: looks fine to me
[21:47] <Tv|work> sagewk: networking :(
[21:47] <sagewk> blarg
[21:49] * BManojlovic (~steki@ has joined #ceph
[22:18] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[22:24] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:40] * lollercaust (~paper@212.Red-83-55-54.dynamicIP.rima-tde.net) has joined #ceph
[23:04] <lxo> is it expected that, after ceph osd lost N, bringing up osdN will mark it up but not in? should I expect any problems if I issue ceph osd in N so that it is used again? lost_at remains set after that
[23:14] * adjohn is now known as Guest1162
[23:14] * Guest1162 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Connection reset by peer)
[23:14] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[23:24] <lxo> so, rolling back to the osd snapshots from a few days ago worked fine. now what I don't get is why it uses backfill for a few pgs, but not for most
[23:26] * lollercaust (~paper@212.Red-83-55-54.dynamicIP.rima-tde.net) Quit (Ping timeout: 480 seconds)
[23:28] * lollercaust (~paper@212.Red-83-55-54.dynamicIP.rima-tde.net) has joined #ceph
[23:29] <lxo> I'm also a bit surprised that the degraded counts seem to be off. say, when the primary is one of the osds that held all the data, and one of the two replicas is the other, the degraded object count implies the data is missing in two replicas, not just one
[23:29] <lxo> in 0.40 the counts were correct in this regard
[23:30] <sjust> lxo: that's odd
[23:30] <sjust> lxo: backfill would only come into play if the logs aren't contiguous
[23:30] <sjust> if the cluster was reasonably healthy at the time the snapshot was taken, it wouldn't do backfill in most cases
[23:31] <lxo> aah. it was healthy, but with only 2 active osds in spite of 3 requested replicas
[23:31] <sjust> lxo: hmm, it should need to backfill to get the 3rd replica
[23:33] <lxo> it's replicating most pgs in the old-fashioned way. it started only 2 backfills at first, and when one completed it started another
[23:34] <lxo> in the mean time, 3k+ pgs show up as “active” and are replicated as before
[23:35] <sjust> lxo: it throttles the number of concurrent recovery operations on a per-osd basis
[23:35] <lxo> oh, the degraded count is also off when the primary is not one of the preloaded osds, and only one of the replicas is. it says the degraded count is nearly 3 times the number of objects. I suppose this could be correct if the pg was to end up in none of the two preloaded osds, but I don't think that's the case
[23:37] <lxo> well, each osd seems to be the primary for 5 of the pgs that are currently being replicated (i.e., that are active and have their degraded-object counts going down)
[23:37] <sjust> lxo: actually, scratch that, it marks backfill on activate, not when recovery starts
[23:38] <sjust> can you post ceph pg dump output?
[23:38] <lxo> meanwhile, the backfill for the only remaining pg is *not* making progress
[23:38] <lxo> you probably don't want all the 3k+ lines. anything specific you're looking for?
[23:39] <lxo> though if you really want I can pastebin it all somewhere
[23:39] <sjust> lxo: pastebinning it would be good
[23:46] <lxo> hmm, pastebin.com is down, and pastebin.ca won't accept 356KB > 150KB. what now?
[23:46] <darkfader> uuencode and gzip!
[23:48] <sjust> lxo: hmm
[23:49] <lxo> hmm, xz and base64 encoding would probably make it fit, but it would be quite a pain to access
[23:50] <sjust> lxo: you can email me gzipped at sam.just@dreamhost.com
[23:54] <lxo> sent
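
The xz-plus-base64 idea lxo mentions can be tried quickly to see whether a text dump would fit under a paste size limit. This is a rough sketch: the helper names are hypothetical, the sample text is an invented stand-in for real `ceph pg dump` output, and only the 150KB limit comes from the log.

```python
import base64
import lzma


def pack_for_paste(text: str) -> str:
    """Compress with xz (LZMA) and encode as ASCII-safe base64."""
    compressed = lzma.compress(text.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def unpack_from_paste(blob: str) -> str:
    """Reverse pack_for_paste: base64-decode, then decompress."""
    return lzma.decompress(base64.b64decode(blob)).decode("utf-8")


LIMIT = 150 * 1024  # the 150KB limit mentioned for pastebin.ca

# Hypothetical stand-in for a large, repetitive dump.
sample = "1.31\tactive+clean\t[2,4,7]\n" * 13000
packed = pack_for_paste(sample)
print(len(sample), len(packed), len(packed) <= LIMIT)
```

As lxo notes, the recipient has to decode and decompress before reading, which is the trade-off against simply emailing a gzipped file.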
[23:58] * kirby_ (~kfiles@pool-151-199-52-228.bos.east.verizon.net) Quit (Quit: Ex-Chat)
[23:58] <sjust> lxo: looking
[23:58] <lxo> FWIW, I found it odd that, before I brought up the other osds, all pgs were in active+clean state, rather than active+clean+degraded, as expected because of the missing third replica

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.