#ceph IRC Log


IRC Log for 2010-08-13

Timestamps are in GMT/BST.

[0:02] <sagewk> todinini: i found your bug. fixed in the rbd-separate branch of ceph-client.git
[0:13] * DJCapelis (~djc@capelis.dj) Quit (Server closed connection)
[0:13] * DJCapelis (~djc@capelis.dj) has joined #ceph
[0:35] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Ping timeout: 480 seconds)
[0:42] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[1:16] * darkfade1 (~floh@host-82-135-62-109.customer.m-online.net) Quit (Server closed connection)
[1:16] * darkfader (~floh@host-82-135-62-109.customer.m-online.net) has joined #ceph
[1:19] * allsystemsarego (~allsystem@188.25.130.190) Quit (Quit: Leaving)
[3:06] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[3:07] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit ()
[7:13] * mtg (~mtg@port-87-193-189-26.static.qsc.de) has joined #ceph
[8:10] * allsystemsarego (~allsystem@188.25.130.190) has joined #ceph
[8:17] <klp> how stable is ceph these days?
[8:27] <cowbar> klp: the people who could give a good answer on that are likely sleeping now
[8:27] <cowbar> but if you stick around til morningish PDT I'm sure you can get an answer
[8:29] <MarkN> as far as my test system goes - i have around 100GB replicated 2 times, with one mds, and it has been OK. it needs the occasional restart, but i have not lost any of the data yet
[8:29] <MarkN> with my experimental cluster it is less stable however.
[8:29] <klp> yeah I used it months ago and it had some ways to go :)
[8:30] <MarkN> for a basic set up it is not too bad at the moment, but as soon as you try some exotic things you may run into issues
[8:34] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[8:56] <wido> klp: what do you want to do with it?
[8:56] <wido> i think that should be a good question to start with
[8:56] <wido> do you fully control all the IOps on your system? i.e. is it your own applications or is it a shared env?
[9:00] <klp> I control it
[9:09] <wido> ok, what kind of data?
[10:54] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Server closed connection)
[10:55] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[12:58] * allsystemsarego (~allsystem@188.25.130.190) Quit (Ping timeout: 480 seconds)
[13:16] * allsystemsarego (~allsystem@188.27.166.200) has joined #ceph
[13:17] * Anticimex (anticimex@netforce.csbnet.se) Quit (Remote host closed the connection)
[13:59] <todinini> how can I list the rbd images in kvm rbd? I created a few but cannot remember the names
[14:19] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:28] <wido> todinini: rbd list
[14:28] <wido> the kvm rbd images are the same as any other rbd images
[14:40] <todinini> wido: with rbd list my rbd images are listed, but not my rbd kvm images, maybe I did not convert them correctly
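What wido means, roughly: kvm/qemu images stored in rbd are ordinary rbd images, so one listing covers both. A minimal sketch, assuming the default "rbd" pool and an illustrative image name:

    # list all images in the default rbd pool
    rbd list
    # an image created for kvm, e.g. "vm0", shows up in the same listing
    # if it was actually written into the rbd pool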
[14:41] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[15:01] <todinini> with the rbd kernel from today the /sys/class/rbd/add is no longer there
[15:39] <todinini> I found the problem, there is a new module called rbd
[15:39] <todinini> but the echo "10.0.1.2,10.0.1.3 name=admin rbd foo" > /sys/class/rbd/add
[15:40] <todinini> doesn't return
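For reference, the sysfs interface todinini is using looked roughly like this in the rbd branch at the time (monitor addresses, pool and image name are taken from his line above; the resulting device name is an assumption):

    # map image "foo" from pool "rbd" via monitors 10.0.1.2 and 10.0.1.3 as client.admin
    echo "10.0.1.2,10.0.1.3 name=admin rbd foo" > /sys/class/rbd/add
    # on success a block device such as /dev/rbd0 should appear; here the write hangs instead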
[16:39] <sagewk> todinini: which commit are you on?
[16:45] <todinini> I get this bt on my cmon http://pastebin.com/kQxaL138
[16:46] <todinini> sagewk: rbd branch from today ba468ce0ad4d6865c858fe8a080946e0add42369
[16:47] <sagewk> k i'm seeing the same thing, looking at it.
[16:53] <todinini> sagewk: which of the two issues can you reproduce?
[16:53] <sagewk> add hanging
[16:53] * gregphone (~gregphone@166.205.136.3) has joined #ceph
[16:54] <todinini> sagewk: ok
[17:12] * mtg (~mtg@port-87-193-189-26.static.qsc.de) Quit (Quit: Verlassend)
[17:15] <sagewk> todinini: pushed fix
[17:16] <sagewk> i'm refactoring the rbd code to make it more palatable upstream, missed a few things
[17:19] <todinini> sagewk: ok, building new kernel
[17:33] <todinini> sagewk: and the bt in the cmon?
[17:33] <sagewk> not sure about that one.. any idea what triggered it? was the mon debug level turned up?
[17:35] <todinini> I have no idea, it cores right after the start, the debug level doesn't matter
[17:36] <sagewk> did you recently upgrade the cluster from v0.21 to unstable?
[17:39] <todinini> yep, a few days ago. right now I'm building the userspace tools from the rbd branch
[17:39] <sagewk> are all the cmon's running the unstable version?
[17:45] <todinini> sagewk: all cmon's are running the unstable version, yesterday I switched from unstable to rbd
[17:46] <sagewk> can you post a mon log of it crashing (with debug mon = 20, debug ms = 1 in [mon])?
[17:48] <todinini> http://pastebin.com/EtUfuf6P
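The debug settings sagewk asked for would sit in the [mon] section of ceph.conf, roughly:

    [mon]
            debug mon = 20
            debug ms = 1

As noted later in the log, a restart of cmon is needed for the change to take effect.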
[17:56] * gregphone (~gregphone@166.205.136.3) has left #ceph
[17:58] <sagewk> do you mind putting that $mon_data/monmap/latest somewhere i can take a look?
[18:02] <todinini> http://tuxadero.com/multistorage/latest
[18:02] <todinini> gotta go
[18:03] <sagewk> thanks
[18:07] <sagewk> file looks ok... what's ./cmon -v say?
[18:23] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) has joined #ceph
[18:40] <wido> hi sagewk
[18:42] <wido> your fix seems to work pretty well. Had some "SLUB: Unable to allocate memory on node" messages today, but no OOM killer coming by
[18:42] <wido> i changed the recovery ops to 1 this morning and then started testing, i could now keep accessing the cluster while it was degraded and even uploading files via the S3 gateway went fine
[18:43] <wido> right now i have one OSD down (killed it) and am uploading a lot of data, to see how it recovers when it comes back after some big changes
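wido's "recovery ops" tweak presumably refers to the per-OSD recovery throttle in ceph.conf; the exact option name below is my assumption, not something confirmed in the log:

    [osd]
            ; limit concurrent recovery operations per OSD (option name assumed)
            osd recovery max active = 1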
[18:43] <sagewk> ok cool
[18:53] <wido> btw, will a daemon load its new config with a reload or is a restart needed?
[18:53] <wido> cosd / cmon / cmds
[18:58] <sagewk> restart is needed
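So after editing ceph.conf the daemons have to be bounced; with the init script shipped at the time that might look like the following (script path and the -a flag are assumptions based on a typical install of that era):

    # restart the ceph daemons defined in ceph.conf (-a also covers remote hosts via ssh)
    /etc/init.d/ceph -a restart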
[19:12] <darkfader> i love recovery
[19:12] <darkfader> it has so nice magic
[19:12] <darkfader> i'm always a bit happy if one of the osd's dies and it just magically makes copies elsewhere
[19:15] <wido> well, this is pretty weird. My OSD was down for a few hours, i uploaded about 10GB of data, but then it came back, the cluster expanded, but was never degraded
[19:15] <wido> no data movement at all
[19:16] <darkfader> hm.
[19:17] <wido> i just emptied the OSD fully, fresh mkfs on the OSD, see what that does
[19:17] <wido> btw, will the current rbd code compile/run on 2.6.35?
[19:19] <gregaf> there's no data on it at all?
[19:19] <gregaf> if the OSD is down long enough it'll get marked out and so the replication will happen across all remaining nodes
[19:19] <wido> gregaf: there was, but i killed it for a few hours, when it came back, nothing seemed to move back to it
[19:19] <gregaf> and when it comes back in it'll get data replicated to it but the cluster won't ever be degraded
[19:20] <wido> ah, ok, i get it
[19:20] <wido> so it moved without me really noticing it
[19:20] <gregaf> well, it should have!
[19:20] <wido> right now i just removed all the data and did a mkfs on it, so when it joins now, it will be fully empty. Like a HDD which crashed
[19:20] <gregaf> you're not supposed to notice unless you're watching, and a recovery mechanism that takes the cluster from not-degraded to degraded would be a bad mechanism ;)
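A rough sketch of how to watch this from the outside; both invocations were part of the ceph tool at the time, and the output details are only indicative:

    # one-shot summary: shows degraded object counts while replicas are missing
    ceph -s
    # follow cluster state changes live while an osd goes down, gets marked out, and rejoins
    ceph -w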
[19:21] <wido> you're right :)
[20:12] <wido> is the current rbd branch (and rbd-separate) for 2.6.35 or for the next branch?
[20:12] <gregaf> it's always for the current development kernel
[20:14] <wido> tnx, need to upgrade then :)
[21:00] <wido> the recovery of the fully emptied node went almost fine. I think it got a kernel panic (rebooted out of the blue), but after the reboot the cluster recovered
[21:01] <wido> memory usage of the osd still seems a bit high, but much, much, much better than yesterday
[21:08] * Guest139 (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Server closed connection)
[21:08] * bbigras (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[21:08] * bbigras is now known as Guest327
[21:25] * conner (~conner@leo.tuc.noao.edu) Quit (Server closed connection)
[21:25] * conner (~conner@leo.tuc.noao.edu) has joined #ceph
[21:30] * bbigras_ (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[21:34] * Guest327 (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Remote host closed the connection)
[23:05] * allsystemsarego (~allsystem@188.27.166.200) Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.