#ceph IRC Log


IRC Log for 2010-08-23

Timestamps are in GMT/BST.

[0:07] * MarkN (~nathan@ has joined #ceph
[2:09] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) has joined #ceph
[2:23] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[4:58] * alexxy (~alexxy@ Quit (Quit: No Ping reply in 180 seconds.)
[4:59] * alexxy (~alexxy@ has joined #ceph
[5:07] * alexxy (~alexxy@ Quit (Quit: No Ping reply in 180 seconds.)
[5:07] * alexxy (~alexxy@ has joined #ceph
[5:10] * alexxy (~alexxy@ Quit ()
[5:16] * bbigras_ (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Remote host closed the connection)
[5:31] * bbigras (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[5:32] * bbigras is now known as Guest495
[7:07] * f4m8_ is now known as f4m8
[8:22] * MarkN (~nathan@ Quit (Quit: Leaving.)
[8:31] * allsystemsarego (~allsystem@ has joined #ceph
[10:05] * Yoric (~David@ has joined #ceph
[10:13] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[10:14] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[10:27] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) has joined #ceph
[14:26] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[14:46] * ghaskins_mobile (~ghaskins_@ Quit (Quit: This computer has gone to sleep)
[15:01] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[15:55] * f4m8 is now known as f4m8_
[17:17] * neale_ (~neale@va.sinenomine.net) has joined #ceph
[18:06] * ghaskins_mobile (~ghaskins_@ Quit (Quit: This computer has gone to sleep)
[18:48] * Yoric (~David@ Quit (Quit: Yoric)
[19:08] <sagewk> todinini: the kernel crashes after boot, even without mapping an rbd device or mounting a ceph fs?
[19:25] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[19:26] * NoahWatkins (~jayhawk@waterdance.cse.ucsc.edu) has joined #ceph
[20:08] * cowbar_ (3af5b6381a@dagron.dreamhost.com) has joined #ceph
[20:09] * cowbar (cfa287e760@dagron.dreamhost.com) Quit (Read error: Connection reset by peer)
[20:12] * cowbar_ is now known as cowbar
[20:43] * Osso (osso@AMontsouris-755-1-5-251.w86-212.abo.wanadoo.fr) has joined #ceph
[20:46] * Osso (osso@AMontsouris-755-1-5-251.w86-212.abo.wanadoo.fr) Quit ()
[20:46] * Osso (osso@AMontsouris-755-1-5-251.w86-212.abo.wanadoo.fr) has joined #ceph
[21:08] * sage (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[21:23] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:29] <wido> hi
[21:33] <wido> i've got something weird, my cluster stays in a degraded state. Due to #371 one of my OSD's is down, but the cluster won't recover, not even after 24 hours. This morning it was still 1.600% degraded, i then restarted the 11 (of the 12) running OSD's, it then got back to 1.085% but stays there
[21:33] <wido> "pg v166251: 3240 pgs: 2896 active+clean, 4 peering, 251 crashed+peering, 89 degraded+peering; 1350 GB data, 1529 GB used, 4528 GB / 6058 GB avail; 31516/2950803 degraded (1.068%)"
[21:34] <wido> and my replication level is at 2, how come my "used" is not at 2700GB?
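The arithmetic behind wido's question can be sketched as below, using only the figures from the status line quoted above. It assumes, as wido does, that the "data" figure counts bytes before replication; the log does not confirm that, and the function names are illustrative only.

```python
# Figures copied from the status line quoted in the log.
DATA_GB = 1350            # "1350 GB data"
USED_GB = 1529            # "1529 GB used"
REPLICATION = 2           # replication level stated by wido
DEGRADED_OBJECTS = 31516
TOTAL_OBJECTS = 2950803


def expected_used_gb(data_gb, replication):
    """Naive expectation: every byte of data is stored `replication` times.

    Assumption (not confirmed in the log): "data" is the pre-replication size.
    """
    return data_gb * replication


def degraded_percent(degraded, total):
    """Share of degraded objects, as a percentage."""
    return 100.0 * degraded / total


print(expected_used_gb(DATA_GB, REPLICATION))                          # 2700
print(f"{degraded_percent(DEGRADED_OBJECTS, TOTAL_OBJECTS):.3f}%")     # 1.068%
```

The second figure matches the 1.068% in the status line. The gap between the naive 2700 GB and the reported 1529 GB is exactly what is being asked; the sketch only reproduces the expectation and does not explain the difference.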
[21:46] <sagewk> hmm i'll take a look
[21:51] <sagewk> somehow osd3 came up with a bad address in the map
[21:55] <wido> could i have broken that with my osdmap modification?
[21:55] <wido> injecting the new crushmap
[21:57] <sagewk> don't think so.. i suspect it's a bug in the monitor.
[21:58] <sagewk> i'm going to turn up monitor debugging.. let's hope it comes up again (i saw a few instances of it in the logs)
[21:59] <wido> ok, but the amount of data used is also pretty weird
[22:00] <wido> but i might know an explanation for this, a few days ago i removed a rados pool (S3 bucket) which contained a few hundred GB of data, but removing it didn't decrease the amount of data
[22:00] <sagewk> any sparse files/objects?
[22:01] <wido> i'm not sure what you mean by that?
[22:03] <sagewk> are any of the files or objects you stored sparse? (i.e. create file, seek some big offset, write some data, but don't write anything in between)
[22:04] <sagewk> probably not if it's mostly movies and such in there
[22:04] <wido> oh no, i don't think so. It was just movies indeed
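The sparse-file recipe sagewk describes (create a file, seek to a big offset, write some data, leave the gap unwritten) can be sketched as follows. This is a minimal illustration for a local POSIX filesystem, not anything Ceph-specific; the helper names and the 64 MB hole size are made up for the example, and whether the hole is actually left unallocated depends on the filesystem.

```python
import os
import tempfile


def make_sparse_file(path, hole_bytes, payload=b"data"):
    """Seek past `hole_bytes` unwritten bytes, then write a small payload."""
    with open(path, "wb") as f:
        f.seek(hole_bytes)
        f.write(payload)


def apparent_and_allocated(path):
    """Return (apparent size, bytes allocated on disk) for `path`.

    st_blocks is counted in 512-byte units on POSIX systems and is not
    available on Windows.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sparse.bin")
    make_sparse_file(path, 64 * 1024 * 1024)
    size, allocated = apparent_and_allocated(path)
    print("apparent size:", size)
    print("allocated on disk:", allocated)
```

On a filesystem that supports holes, the allocated figure is typically far smaller than the apparent size, which is why sparse objects can make "data" and "used" numbers diverge.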
[22:04] <sagewk> btw cluster recovering after restart of osd3.
[22:04] <wido> uploading random content to test the gateway and just fill the cluster
[22:08] <wido> about osd3, a bug i assume?
[22:08] <sagewk> yeah, tho i'm not sure where. opening an issue so we don't forget about it.
[22:10] <wido> one good thing, the cluster has been degraded for a long time, flapping OSD's and so on
[22:11] <wido> but my VM's just started nicely, ext4 not complaining
[22:12] <sagewk> yay
[22:13] <wido> oh, not completely true. After a few I/O's ext4 went into RO
[22:22] <wido> there seems to be some data corruption, i've checked various files too, not all the checksums are correct
[22:23] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:25] <wido> for example "ubuntu-10.04-desktop-amd64.iso", the md5sum doesn't match the sum listed in the MD5SUMS file
[22:26] <wido> in "/mnt/ceph/static/ubuntu/10.04"
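The check wido describes, comparing a file's MD5 against its entry in an MD5SUMS file, can be sketched as below. It assumes the usual `md5sum` line format (hex digest, whitespace, optional `*`, file name); the function names are illustrative and not part of any Ceph tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a large ISO is never read into memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_md5sums(text):
    """Parse md5sum-style lines: '<digest>  <name>' or '<digest> *<name>'."""
    sums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(None, 1)
        sums[name.lstrip("*")] = digest.lower()
    return sums


def matches(path, name, md5sums_text):
    """True only if `name` is listed and the file at `path` has that digest."""
    expected = parse_md5sums(md5sums_text).get(name)
    return expected is not None and md5_of_file(path) == expected
```

A call might look like `matches("/mnt/ceph/static/ubuntu/10.04/ubuntu-10.04-desktop-amd64.iso", "ubuntu-10.04-desktop-amd64.iso", open("MD5SUMS").read())`, where the location of the MD5SUMS file is an assumption.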
[22:34] * ghaskins_mobile (~ghaskins_@ Quit (Quit: This computer has gone to sleep)
[22:46] <wido> i'm going afk, i'll do some more tests tomorrow
[22:46] <wido> and see what got corrupted
[22:46] <sagewk> k
[22:46] <sagewk> there's a vbindiff util in /usr/src on logger that is helpful for finding what changed in the file
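Finding what changed between a corrupted file and a known-good copy can also be scripted. The sketch below is not vbindiff and does not reproduce its interface; it is a hypothetical helper that reports the byte ranges where two files differ, which is the same question vbindiff answers interactively.

```python
def differing_ranges(path_a, path_b, chunk_size=64 * 1024):
    """Return (start, end) byte offsets, end exclusive, where the files differ.

    If one file is longer, the extra tail is reported as a differing range.
    """
    ranges = []
    start = None      # start offset of the currently open differing range
    offset = 0        # absolute offset of the current chunk
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a == b:
                # Fast path: identical chunk closes any open range.
                if start is not None:
                    ranges.append((start, offset))
                    start = None
                offset += len(a)
                continue
            common = min(len(a), len(b))
            for i in range(common):
                if a[i] != b[i]:
                    if start is None:
                        start = offset + i
                elif start is not None:
                    ranges.append((start, offset + i))
                    start = None
            if len(a) != len(b) and start is None:
                # One file ended here; the other file's tail counts as different.
                start = offset + common
            offset += max(len(a), len(b))
    if start is not None:
        ranges.append((start, offset))
    return ranges
```

The per-byte loop only runs on chunks that differ, so a mostly intact ISO is scanned quickly. The resulting offsets may hint at whether the damage lines up with object or stripe boundaries, though the log does not say what those boundaries are for this cluster.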
[22:47] <wido> for now my VM "alpha" is fine, but beta won't start anymore
[22:47] <wido> ah cool, i'll play around with that to match the original ISO
[22:48] <wido> ttyl!
[23:01] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[23:09] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[23:44] * Osso (osso@AMontsouris-755-1-5-251.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.