#ceph IRC Log
IRC Log for 2011-10-27

Timestamps are in GMT/BST.

[0:06] * cp (~cp@dhcp184-48-60-82.whsj.sjc.wayport.net) has joined #ceph
[0:06] * cp (~cp@dhcp184-48-60-82.whsj.sjc.wayport.net) Quit ()
[0:24] * nwatkins (~nwatkins@kyoto.soe.ucsc.edu) has left #ceph
[0:40] <ajm> sagewk: yes I upgraded from 0.34 -> 0.37
[0:40] <ajm> sagewk: is it just slow / needs more time, or did I break things again?
[0:40] <sagewk> ajm: ok cool. the above patch should do the trick
[0:41] <ajm> oic patch now
[0:41] <ajm> lemme try
[0:41] <sagewk> ajm: there was a bug that made it O(N^2) instead of O(N)
[0:41] <ajm> ::needs to figure out / read that O() shit::
[0:43] <ajm> that patch is against HEAD I guess? offsets don't match.
[0:50] <sagewk> yeah
[0:50] <sagewk> i pushed it to the stable branch too, you can grab it from there as well
[0:52] <ajm> eh
[0:52] <ajm> one-liner is easier to do by hand :)
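[An editorial aside on the O() notation ajm mentions above: the difference between O(N) and O(N^2) is how the amount of work grows with input size. The sketch below is purely illustrative, not the actual Ceph code — it counts the steps taken by a linear pass versus a quadratic re-scan over the same data.]

```python
# Illustrative sketch (not Ceph code): the same task, summing a list,
# done with a single linear pass versus a quadratic re-scan.

def sum_linear(items):
    """O(N): each element is touched exactly once."""
    steps = 0
    total = 0
    for x in items:
        total += x
        steps += 1
    return total, steps

def sum_quadratic(items):
    """O(N^2): recomputes a running prefix sum from scratch each step."""
    steps = 0
    prefix = 0
    for i in range(len(items)):
        prefix = 0
        for x in items[: i + 1]:
            prefix += x
            steps += 1
    return prefix, steps

# For N=100 the linear version does 100 steps; the quadratic one does
# 1 + 2 + ... + 100 = 5050. The gap widens fast as N grows.
_, fast = sum_linear(list(range(100)))
_, slow = sum_quadratic(list(range(100)))
print(fast, slow)  # prints: 100 5050
```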
[1:04] <damoxc> sagewk: would wrapping libcephfs for python be at all useful?
[1:05] <sagewk> damoxc: to someone, i'm sure. i would probably wait for a use case before coding it up, though
[1:05] <damoxc> sagewk: yeah thought as much
[1:05] <sagewk> damoxc: there are already librados python bindings that probably cover many use cases
[1:05] <damoxc> sagewk: yeah, i've made some using cython too, i started them before the rbd bindings existed
[1:07] <damoxc> sagewk: and figured may as well wrap some of rados whilst i was doing that
[1:08] <ajm> sagewk: seems happier
[1:11] * nwatkins (~nwatkins@kyoto.soe.ucsc.edu) has joined #ceph
[1:13] <nwatkins> gregaf: wrt your hadoop updates, I see that ceph_replication has been updated to take a string (on the Java side), but the JNI code still expects the integer file handle. Did an update get lost?
[1:13] <sagewk> ajm: great
[1:13] <sagewk> ajm: thanks
[1:14] <ajm> sagewk: no thank you :)
[1:18] <gregaf> nwatkins: hrmm, let me look
[1:26] <gregaf> nwatkins: looks like sagewk and colin managed to break it between them, though I've no idea why I didn't have trouble with it — maybe I did manage to lose a patch :/
[1:30] <nwatkins> gregaf: it looks weird: recently updated code (ceph_replication) matches the header file from 2009, but the recently updated definition matches the old header. I guess I'll just wait for that to get sorted out?
[1:30] <gregaf> yeah
[1:30] <gregaf> I presume you noticed this because it failed to build in some fashion?
[1:31] <gregaf> I actually remember looking at this and sighing, but I can't find anywhere that I changed it, even though I definitely got things running
[1:31] <nwatkins> It built fine (I didn't check warnings). But Java noticed the mismatch in function specs
[1:31] <gregaf> probably lost it somewhere in a rebase :/
[1:31] <nwatkins> at runtime
[1:31] <gregaf> or maybe I just didn't actually invoke that function and nothing complained beforehand so I forgot
[1:31] <gregaf> anyway, won't take too long, I'll poke you once I've pushed it
[1:31] <nwatkins> Ok, thanks greg
[1:35] * adjohn (~adjohn@ Quit (Quit: adjohn)
[2:02] <gregaf> nwatkins: can you try this patch and make sure it works before I push? http://pastebin.com/MVgiunbv
[2:02] <gregaf> aside: %&*@()@&% freaking jni and include paths
[2:07] <nwatkins> gregaf: what's that aside?
[2:08] <gregaf> I'm just not a fan of working with JNI
[2:08] <nwatkins> gregaf: heh. yeh i hear ya
[2:08] <gregaf> especially since for some reason java 6 doesn't want to add its headers to my default includes so I always need to remember to set env variables for compilation
[2:09] <nwatkins> there are some m4 scripts in the m4 repository for finding the JNI headers. i think after some cleanup we might be able to get Maven2 to do this for us. Not sure yet about those details.
[2:14] <nwatkins> gregaf: that fixed the issue. there is still a new bug I am seeing now: http://pastebin.com/a3vcjPEU
[2:15] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:15] <gregaf> can you get a backtrace on that core?
[2:16] <nwatkins> gregaf: yeh... do you happen to remember off hand how to tell linux where to stash the core dumps?
[2:19] <gregaf> nwatkins: you modify some file somewhere :p
[2:19] <gregaf> http://linux.die.net/man/5/core
[2:19] <gregaf> /proc/sys/kernel/core_pattern
[2:19] <gregaf> I think it defaults to root
[2:20] <nwatkins> gregaf: yeh that's right. Hrmm, i set that but i'm not getting a core.
[2:20] <gregaf> try "ulimit -c unlimited"
[2:20] <gregaf> it might be too large
[2:20] <gregaf> or if it is trying to create it in the root dir and you're not running as root it'll get EPERM and fail
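[An editorial aside pulling together the core-dump knobs discussed above: where Linux writes cores is controlled by /proc/sys/kernel/core_pattern (see core(5)), and the per-process size cap is what "ulimit -c unlimited" raises. The sketch below shows the same two knobs from Python's stdlib `resource` module; paths are only examples.]

```python
# Sketch of the two core-dump knobs mentioned above. Assumes Linux.
import os
import resource

# Equivalent of `ulimit -c unlimited`, bounded by the hard limit
# (only root may raise the hard limit itself): lift the soft limit
# up to whatever the hard limit already is.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

# Where cores are written. Reading is unprivileged; writing needs root,
# e.g.:  echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
pattern_file = "/proc/sys/kernel/core_pattern"
if os.path.exists(pattern_file):
    with open(pattern_file) as f:
        print("core_pattern:", f.read().strip())

print("RLIMIT_CORE soft limit:", resource.getrlimit(resource.RLIMIT_CORE)[0])
```

As gregaf notes, if core_pattern points somewhere the crashing process cannot write, the dump silently fails, so both knobs have to be right.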
[2:21] <ajm> i keep breaking these things: http://adam.gs/osd.10.log.bz2
[2:24] <nwatkins> gregaf: that worked. I'm having trouble getting the symbols to load
[2:27] <gregaf> nwatkins: I'm not the guy to help you with that unfortunately — do you have anything at all or is it just dumping addresses?
[2:28] <nwatkins> Yeh, just addresses
[2:28] <gregaf> bah humbug :(
[2:28] <gregaf> didn't you do a local build and install?
[2:28] <gregaf> I don't think I've ever had a problem with locally-built stuff :/
[2:29] <gregaf> if you didn't, I guess make sure debug packages are installed?
[2:29] <gregaf> that's the best I can come up with
[2:32] <nwatkins> gregaf: ok, i'll get back to you in a few minutes
[2:32] <gregaf> nwatkins: I'm heading out pretty soon, but if you put it here I can check it later tonight
[2:33] <nwatkins> I'll shoot ya an email
[2:34] <gregaf> that works too — thanks!
[2:35] <nwatkins> gregaf: thx a lot
[2:35] * nwatkins (~nwatkins@kyoto.soe.ucsc.edu) has left #ceph
[2:52] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[3:21] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:22] * The_Bishop (~bishop@port-92-206-76-12.dynamic.qsc.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[4:26] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[5:04] * adjohn is now known as Guest14907
[5:04] * Guest14907 (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Read error: Connection reset by peer)
[5:04] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[5:05] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Read error: No route to host)
[5:06] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[6:04] * gregaf (~Adium@aon.hq.newdream.net) Quit (synthon.oftc.net charm.oftc.net)
[6:04] * nhm (~mark@penguin.msi.umn.edu) Quit (synthon.oftc.net charm.oftc.net)
[6:04] * Iribaar (~Iribaar@ Quit (synthon.oftc.net charm.oftc.net)
[6:04] * in__ (~n0de@ Quit (synthon.oftc.net charm.oftc.net)
[6:04] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (synthon.oftc.net charm.oftc.net)
[6:04] * nolan (~nolan@phong.sigbus.net) Quit (synthon.oftc.net charm.oftc.net)
[6:04] * elder (~elder@cfcafwp.sgi.com) Quit (synthon.oftc.net charm.oftc.net)
[6:04] * efoster (~efoster@ Quit (synthon.oftc.net charm.oftc.net)
[6:04] * __jt__ (~james@jamestaylor.org) Quit (synthon.oftc.net charm.oftc.net)
[6:04] * chaos__ (~chaos@hybris.inf.ug.edu.pl) Quit (synthon.oftc.net charm.oftc.net)
[6:05] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[6:05] * nhm (~mark@penguin.msi.umn.edu) has joined #ceph
[6:05] * Iribaar (~Iribaar@ has joined #ceph
[6:05] * in__ (~n0de@ has joined #ceph
[6:05] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[6:05] * nolan (~nolan@phong.sigbus.net) has joined #ceph
[6:05] * elder (~elder@cfcafwp.sgi.com) has joined #ceph
[6:05] * efoster (~efoster@ has joined #ceph
[6:05] * __jt__ (~james@jamestaylor.org) has joined #ceph
[6:05] * chaos__ (~chaos@hybris.inf.ug.edu.pl) has joined #ceph
[8:00] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[8:22] * primusinterpares (~lans@cpe-98-151-252-191.socal.res.rr.com) has joined #ceph
[8:33] * primusinterpares (~lans@cpe-98-151-252-191.socal.res.rr.com) Quit (Quit: Leaving.)
[9:20] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[9:58] * Nadir_Seen_Fire (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[9:58] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[10:05] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[10:29] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[10:48] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[10:56] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:03] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[11:53] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:09] <stingray> src/common/config_opts.h is full of unused stuff
[14:10] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[14:12] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[14:41] * fronlius1 (~Adium@testing78.jimdo-server.com) has joined #ceph
[14:41] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[14:55] * mgalkiewicz (~mgalkiewi@ has joined #ceph
[14:55] <mgalkiewicz> hello
[14:55] <mgalkiewicz> I have kernel 2.6.32 (debian squeeze) and I want to use rbd
[14:56] <mgalkiewicz> is it possible to compile rbd module for this kernel?
[14:56] <mgalkiewicz> I would like to avoid kernel upgrade
[14:57] <mgalkiewicz> and the second thing: do I have to modprobe the module on server and client, or client only?
[15:09] * stingray headdesk
[15:17] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:18] <stingray> for some reason, when recovery is happening, stuff like rbd rm is stuck.
[15:41] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[15:47] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:50] <pmjdebruijn> mgalkiewicz: no clue, we just upgraded ours
[15:50] <pmjdebruijn> mgalkiewicz: we're considering pushing rbd via iscsi or something, so we have dedicated storage headends
[15:51] <pmjdebruijn> so kernel versions/upgrades/choices won't affect other machines
[15:51] <pmjdebruijn> sorry for not really answering your question though :(
[15:52] <mgalkiewicz> ok and what about modprobing rbd?
[15:52] <mgalkiewicz> is it required for server and client?
[15:52] <mgalkiewicz> by server I mean machine with mon
[15:53] <pmjdebruijn> afaik only for clients
[15:53] * pmjdebruijn isn't an expert though
[15:53] * pmjdebruijn is just scraping the surface as well
[15:59] <mgalkiewicz> hmm upgrading kernel for clients might be an option
[16:00] <mgalkiewicz> thx anyway
[16:13] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[16:30] * ognatortcele (~ognatortc@ has joined #ceph
[16:55] * mgalkiewicz (~mgalkiewi@ Quit (Remote host closed the connection)
[16:58] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[17:15] * MK_FG (~MK_FG@ Quit (Quit: o//)
[17:16] * MK_FG (~MK_FG@ has joined #ceph
[17:34] * felly (~felly@19NAAENB6.tor-irc.dnsbl.oftc.net) has joined #ceph
[17:40] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[17:43] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:04] <sagewk> mgalkiewicz, pmjdebruijn: modprobe rbd only needed on the clients
[18:13] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:15] * ognatortcele (~ognatortc@ Quit (Read error: Connection reset by peer)
[18:15] * ognatortcele (~ognatortc@ has joined #ceph
[18:17] * morse (~morse@supercomputing.univpm.it) Quit (Read error: Connection reset by peer)
[18:18] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:20] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:21] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:21] * nwatkins (~nwatkins@kyoto.soe.ucsc.edu) has joined #ceph
[18:31] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:55] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:15] * bchrisman (~Adium@ has joined #ceph
[19:21] * felly (~felly@19NAAENB6.tor-irc.dnsbl.oftc.net) Quit (Quit: Leaving.)
[19:31] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[19:31] * fronlius1 (~Adium@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[19:36] <nwatkins> gregaf: here is the trace with debug turned on. Turning on debug for messenger and client didn't do anything. Are the arguments still the same (they are listed on the first line of the trace)? http://pastebin.com/KhJN5zfS
[19:41] <joshd> nwatkins: I think there was a change that made it necessary to add --log-file '' to get debugging output
[19:41] <gregaf> nwatkins: can you up the java debug levels too?
[19:41] <gregaf> not actually sure how to do that offhand, but it's probably pretty simple
[19:42] <nwatkins> Yeh, I can do that
[19:42] <gregaf> …oh, wait, you did do the Java
[19:42] <gregaf> heh, n/m
[19:42] <gregaf> so it's just trying to list the root like it should be
[19:42] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) has joined #ceph
[19:43] <nwatkins> Ahh, ok. So you were just interested in the logs from Ceph File System, not the rest of Hadoop?
[19:44] <gregaf> mostly, yeah
[19:44] <gregaf> joshd might be right about the --log-file option, I'm not really up on the current state of the bizarre logging interface :x
[19:44] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Quit: Leaving.)
[19:45] <gregaf> but it'd be good if I could see whether it's sending out messages or doing everything locally
[19:47] <nwatkins> gregaf: here is the output with the client logging http://pastebin.com/FraPb60D
[19:50] <gregaf> well that's weird
[19:50] <nwatkins> gregaf: that was dumped to console, and here is a much more detailed log from --log-file http://pastebin.com/W2e5Zp5b
[19:51] <gregaf> nwatkins: can you paste your patch for that "new buflen" output too?
[19:57] <nwatkins> gregaf: http://pastebin.com/9KKncxUq
[19:58] <nwatkins> The only thing missing from that patch is that originally I was returning 1 from get replication (when I was hunting around for the original bug). I returned that to normal.
[20:09] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[20:12] <sjust> stingray: I looked at that patch, it's actually not quite right
[20:13] <sjust> stingray: it'll either return end or begin if the entry is not present, either of which would be incorrect
[20:13] <sjust> stingray: it would be better to assert(0) if we don't find the entry
[20:22] * fronlius (~Adium@f054115090.adsl.alicedsl.de) has joined #ceph
[20:22] <stingray> sjust: okay, now I have osd that is crashing repeatedly. What next?
[20:23] <sjust> stingray: the bug seems to be either a corrupt log or an error in how scrub handles logs, can you post logs with osd debug at 20?
[20:23] <sjust> *a log
[20:23] <stingray> (in other news, ARM announced 604-bit v8)
[20:24] <stingray> 64-bit
[20:24] <stingray> damn keyboard
[20:24] <stingray> sjust: okay, I will
[20:24] <stingray> I need to thoroughly break the cluster first.
[20:25] <stingray> 'cause it was running with my broken patch
[20:25] <stingray> which apparently affects nothing critical
[20:25] <stingray> ;[
[20:25] <sjust> in cases where find_entry is called correctly (i.e. not in scrub) it shouldn't matter
[20:26] <sjust> and scrub doesn't normally have much in the way of side effects :)
[20:26] <stingray> [stingray@stingr ceph]$ grep -r find_entry src
[20:26] <stingray> src/osd/PG.cc: p = log.find_entry(v);
[20:26] <stingray> src/osd/PG.h: list<Entry>::iterator find_entry(eversion_t v) {
[20:26] <stingray> === cut here ===
[20:27] <sjust> ah
[20:28] <stingray> so I have a monsterlog from when I was tracking this done but it's very hard to figure out what's relevant
[20:28] <stingray> s/done/down/
[20:28] * stingray headdesk
[20:29] <sjust> have you posted it in the channel?
[20:31] <stingray> no
[20:31] <gregaf> nwatkins: okay, I think I see what's happening here, just need to make sure the obvious fix isn't asinine
[20:31] <stingray> it's the same logfile that is active now
[20:31] <stingray> 876017342 bytes long
[20:31] <sjust> probably too big for pastebin :(
[20:33] <nwatkins> gregaf: ahh, great. let me know if you want me to test out a patch or anything.
[20:33] <stingray> I have pretty much unlimited hosting space but not bandwidth :)
[20:34] * aliguori (~anthony@ has joined #ceph
[20:58] <Tv> FYI office dwellers (sagewk!): I think I have a fever rising. I'm going to take the rest of the day as a sick day and head home before I get too dizzy, which is what my body always does when I get a fever. Crowbar status: PXE BIOSes suck, not sure how to work around everything.
[21:06] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[21:21] <stingray> hehe, tell me about pxe bioses
[21:21] <stingray> I mean, really
[21:23] <stingray> when Tv is back and functioning, can someone refer him to me - maybe I know just that about pxe bios that he wants (I am assuming the problem is localboot. It's always localboot)
[21:23] <stingray> in the meantime I'll go home too.
[21:24] <joshd> it was localboot - don't know the details though
[21:29] * FoxMURDER (~fox@ip-89-176-11-254.net.upcbroadband.cz) Quit (Ping timeout: 480 seconds)
[21:32] <NaioN> what's the problem with it?
[21:32] <NaioN> we use PXE a lot without much troubles
[21:32] <NaioN> also with localboot
[21:34] <NaioN> the only trouble we had was that we use large pxe images...
[21:36] <stingray> some pxe bioses don't localboot with gpxe
[21:36] <NaioN> we use it for diskless servers
[21:36] <stingray> or gpxelinux
[21:36] <stingray> they only localboot with pxelinux.0
[21:36] <NaioN> oh ok
[21:36] <NaioN> we use that one
[21:37] <stingray> I needed to distribute pxe configs by http
[21:37] <stingray> because I had some complex request triggered boot-once thing
[21:37] <stingray> that boots an image from network that does something to local system and restarts
[21:37] <stingray> on next reboot I need localboot
[21:37] <NaioN> ok
[21:37] <stingray> I struggled with gpxe and gpxelinux for a while
[21:37] <stingray> at the end I had to write a tftp proxy
[21:37] <stingray> in python
[21:37] <NaioN> we do it to provide the image for the server
[21:38] <stingray> that relayed some requests to http server
[21:38] <stingray> :(
[21:38] <NaioN> but you could do a always boot with pxe
[21:38] <NaioN> and in the pxe you decide to do somtehing or do a localboot
[21:39] <NaioN> then you are much more flexible
[21:39] <stingray> I always boot with pxe
[21:39] <NaioN> We use that for our imaging of workstations
[21:39] <stingray> the question is - which config I show
[21:39] <stingray> yes yes I use it for that too
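[An editorial aside on the localboot discussion above: the pxelinux directive in question is LOCALBOOT, which hands control back to the local disk. The fragment below is a hypothetical pxelinux.cfg/default sketch — label names and file paths are illustrative. Per the discussion, this directive behaves reliably under pxelinux.0 but can fail on some PXE BIOSes when served via gpxe/gpxelinux.]

```
# Hypothetical pxelinux.cfg/default (labels and paths are examples only)
DEFAULT local

LABEL local
    # Return control to the local disk's boot sequence
    LOCALBOOT 0

LABEL netinstall
    # One-shot network boot image, selected by switching DEFAULT
    KERNEL vmlinuz
    APPEND initrd=initrd.img
```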
[21:49] * fronlius1 (~Adium@f054105239.adsl.alicedsl.de) has joined #ceph
[21:53] * fronlius (~Adium@f054115090.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[22:15] <sjust> stingray: a bit of background, were the affected osds recently restarted before the first instance of that bug?
[22:49] * MarkN (~nathan@ has joined #ceph
[22:49] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[23:05] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:11] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[23:12] * bchrisman (~Adium@ has joined #ceph
[23:13] <ajm> sagewk: http://adam.gs/osd.10.log.bz2 i pasted this late yesterday, this one broke too :(
[23:54] <gregaf> nwatkins: sorry this is taking so long, turned out to be a little more complicated than I thought
[23:58] <nwatkins> gregaf: that's ok. Is the issue in ceph proper or in the wrappers?
[23:58] <gregaf> the userspace client, or some portions of it

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.