#ceph IRC Log


IRC Log for 2011-04-07

Timestamps are in GMT/BST.

[0:02] * sagelap (~sage@ has joined #ceph
[0:21] * cmccabe (~cmccabe@ Quit (Ping timeout: 480 seconds)
[0:28] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:41] * cmccabe (~cmccabe@m310536d0.tmodns.net) has joined #ceph
[0:41] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[1:04] * sagelap (~sage@ has joined #ceph
[1:10] * sagelap (~sage@ has left #ceph
[1:34] * cmccabe1 (~cmccabe@ has joined #ceph
[1:40] * cmccabe (~cmccabe@m310536d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[1:49] * julienhuang (~julienhua@mtl93-4-82-226-130-144.fbx.proxad.net) has joined #ceph
[1:59] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[2:23] * julienhuang (~julienhua@mtl93-4-82-226-130-144.fbx.proxad.net) Quit (Quit: julienhuang)
[2:27] * cmccabe1 (~cmccabe@ Quit (Quit: Leaving.)
[2:39] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:53] * greglap (~Adium@ has joined #ceph
[3:10] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:13] <mwodrich> I don't suppose there's anyone still in here who can fix the RADOS gateway like yesterday?
[3:14] <mwodrich> it's been acting up all day today again
[3:19] <greglap> mwodrich: do you know what they did to fix it?
[3:46] <mwodrich> let me take a look at my logs from yesterday
[3:47] <mwodrich> assuming I have a log set up...
[3:48] <mwodrich> no log from yesterday...
[3:48] <mwodrich> all I remember was that Sage said something with authorization was broken and he made some sort of temporary fix
[3:49] <greglap> oh, I vaguely recall something about the OSD breaking its auth somehow
[3:49] <mwodrich> yes, I think that's right
[3:49] <greglap> I don't think I can do much with that though, sorry
[3:50] <mwodrich> that's ok, I have to go anyway - I have a train to catch
[4:00] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[4:43] * Juul (~Juul@slim.dhcp.lbl.gov) has joined #ceph
[5:00] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[5:47] * lidongyang_ (~lidongyan@ Quit (Remote host closed the connection)
[5:47] * lidongyang (~lidongyan@ has joined #ceph
[5:47] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:08] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[6:14] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[7:22] * samsung (~samsung@ has joined #ceph
[7:45] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (synthon.oftc.net oxygen.oftc.net)
[7:45] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (synthon.oftc.net oxygen.oftc.net)
[7:45] * johnl (~johnl@johnl.ipq.co) Quit (synthon.oftc.net oxygen.oftc.net)
[7:45] * jbdenis (~jbdenis@brucciu.sis.pasteur.fr) Quit (synthon.oftc.net oxygen.oftc.net)
[7:45] * todin_ (tuxadero@kudu.in-berlin.de) Quit (synthon.oftc.net oxygen.oftc.net)
[7:45] * cclien_ (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) Quit (synthon.oftc.net oxygen.oftc.net)
[7:45] * Jiaju (~jjzhang@ Quit (synthon.oftc.net oxygen.oftc.net)
[7:45] * nolan (~nolan@phong.sigbus.net) Quit (synthon.oftc.net oxygen.oftc.net)
[7:45] * iggy (~iggy@theiggy.com) Quit (synthon.oftc.net oxygen.oftc.net)
[7:47] * iggy (~iggy@theiggy.com) has joined #ceph
[7:49] * nolan (~nolan@phong.sigbus.net) has joined #ceph
[7:58] * Jiaju (~jjzhang@ has joined #ceph
[8:01] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:02] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[8:02] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[8:02] * johnl (~johnl@johnl.ipq.co) has joined #ceph
[8:02] * jbdenis (~jbdenis@brucciu.sis.pasteur.fr) has joined #ceph
[8:02] * todin_ (tuxadero@kudu.in-berlin.de) has joined #ceph
[8:02] * cclien_ (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) has joined #ceph
[8:41] * cclien_ (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) Quit (synthon.oftc.net oxygen.oftc.net)
[8:41] * todin_ (tuxadero@kudu.in-berlin.de) Quit (synthon.oftc.net oxygen.oftc.net)
[8:41] * jbdenis (~jbdenis@brucciu.sis.pasteur.fr) Quit (synthon.oftc.net oxygen.oftc.net)
[8:41] * johnl (~johnl@johnl.ipq.co) Quit (synthon.oftc.net oxygen.oftc.net)
[8:41] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (synthon.oftc.net oxygen.oftc.net)
[8:41] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (synthon.oftc.net oxygen.oftc.net)
[8:45] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[8:45] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[8:45] * johnl (~johnl@johnl.ipq.co) has joined #ceph
[8:45] * jbdenis (~jbdenis@brucciu.sis.pasteur.fr) has joined #ceph
[8:45] * todin_ (tuxadero@kudu.in-berlin.de) has joined #ceph
[8:45] * cclien_ (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) has joined #ceph
[8:55] * alexxy (~alexxy@ has joined #ceph
[9:17] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:40] * MK_FG (~MK_FG@ Quit (Quit: o//)
[9:41] * MK_FG (~MK_FG@ has joined #ceph
[10:10] * Yoric (~David@ has joined #ceph
[10:27] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[10:31] * Juul (~Juul@slim.dhcp.lbl.gov) Quit (Quit: Leaving)
[11:14] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[12:06] * Yoric (~David@ Quit (Quit: Yoric)
[13:14] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[13:15] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[14:32] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[15:06] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[15:07] * lxo (~aoliva@ has joined #ceph
[15:45] * todin_ (tuxadero@kudu.in-berlin.de) Quit (Quit: leaving)
[15:54] * allsystemsarego (~allsystem@ has joined #ceph
[16:29] * samsung (~samsung@ Quit (Ping timeout: 480 seconds)
[17:17] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:29] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:40] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:54] * greglap (~Adium@ has joined #ceph
[18:22] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[18:26] * alexxy (~alexxy@ has joined #ceph
[18:30] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[18:38] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:49] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[18:59] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:02] * pombreda (~Administr@ has joined #ceph
[19:03] <pombreda> howdy :) are there any ceph devs hanging out @ the linux collaboration summit today?
[19:04] <cmccabe> pombreda: I think sage is there
[19:05] <pombreda> cmccabe: I am hanging in the filesystem session this morning
[19:06] <pombreda> sage: if you are around, I would be honored to meet in person :P
[19:06] <cmccabe> pombreda: sage was definitely there yesterday and tuesday (at the FS+VM sessions)
[19:16] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:17] * lxo (~aoliva@ Quit (Read error: Connection reset by peer)
[19:17] * lxo (~aoliva@ has joined #ceph
[19:22] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[19:27] <Tv> gregaf: FYI meeting room is taken
[19:28] <gregaf> argh
[19:29] <gregaf> there's management stuff going on, I'll set up in here
[19:32] * pombreda (~Administr@ Quit (Quit: Leaving.)
[19:57] <cmccabe> I seem to have created an undeletable bucket on amazon's servers
[19:57] <cmccabe> does anyone have a client that can handle control characters in object names?
[19:58] <cmccabe> I can rule out /usr/bin/s3 and anything based on libboto...
[20:03] * pombreda (~Administr@ has joined #ceph
[20:04] <pombreda> I am now listening to sage's ceph presentation @ the linux collaboration summit :)
[20:09] <gregaf> heh, I didn't even know he was giving one — let us know how it is! :)
[20:15] <mwodrich> maybe try CrossFTP or Cyberduck?
[20:16] <mwodrich> of course we've seen that CrossFTP might have some wires crossed as it were, but you might see if it works
[20:17] * Yoric (~David@138-154.79-83.cust.bluewin.ch) has joined #ceph
[20:20] <pombreda> gregaf: he is neato good so far :)
[20:21] <pombreda> gregaf: I saw falshed for a short while a slide with my playground user id on it :D
[20:21] <pombreda> *flashed :P
[20:21] <gregaf> heh
[20:23] <gregaf> if you don't mind my asking, how'd you get an invite to the summit?
[20:28] <bchrisman> gregaf: I know cfuse doesn't support locking… do you know the current/intended behavior when a lock call is made? Should it assert? Silently return true? We saw a fuse crash when an application did that and I'm wondering whether it's worth retesting and finding the crash.
[20:28] <gregaf> no idea
[20:28] <Tv> cmccabe: aws also has its own webby s3 file explorer, iirc
[20:29] <gregaf> it doesn't have any stubs or anything so whatever behavior it exhibits is provided by fuse
[20:29] <bchrisman> okay.. thx
[20:29] <Tv> bchrisman: sounds like it should return an error (not crash)
[20:29] <Tv> bchrisman: if you know what exact call, i may be able to say more
[20:29] <Tv> like, fcntl etc
[20:29] <pombreda> gregaf: good round of applause on sage's ceph presentation @ linux summit :P
[20:30] <wido> an active, non-clean pg shouldn't block writes, should it?
[20:30] <gregaf> bchrisman: maybe fuse provides local locking through the standard vfs bits and the application was expecting data to exist that didn't, due to network effects?
[20:30] <pombreda> gregaf: good volume of questions from the audience, which I think is a *good thing*
[20:30] * aliguori (~anthony@ has joined #ceph
[20:30] <gregaf> yay
[20:30] <bchrisman> I'll chase down the cfuse exit/crash then…
[20:30] <gregaf> someday maybe I will get to do one :(
[20:31] <gregaf> wido: don't think it should, but sjust could say more
[20:32] <wido> Ok, 5 OSD's failed today, I'm sure my crushmap would allow failure of those 5 OSD's, but my FS has started blocking
[20:32] <wido> running with replication = 3, 4 OSD's on the same node failed and another due to a bad disk
[20:32] <sjust> active, non-clean pg should only block writes on objects it hasn't recovered yet
[20:32] * Yoric (~David@138-154.79-83.cust.bluewin.ch) Quit (Quit: Yoric)
[20:33] <wido> sjust: What's the easiest way to find that out? I currently have 45 active PG's
[20:33] <wido> pg v239261: 10608 pgs: 45 active, 10563 active+clean; 4944 GB data, 16268 GB used, 48893 GB / 65205 GB avail; 17329/3806706 degraded (0.455%)
[20:33] <wido> Oh, got to run, I'll be back in about 1.5 hours!
[20:33] <sjust> ok
[20:34] <gregaf> bchrisman: on #989 was your copy using posix file locking for some reason?
[20:34] <sjust> wido: actually, those pg's might have unfound objects, I'm not completely sure on the details there
[20:34] <gregaf> just noticed the log output about file_setlock etc and I don't know why a copy would be using that
[20:35] <gregaf> lunchtime, bbl
[20:48] <bchrisman> gregaf: on that one… it should not have been as I was just doing a copy. However, we do have some other daemons which may have been running that might've been using the posix locking there.
[20:48] <bchrisman> gregaf: yeah.. that might have been our other daemons… ok
[20:49] <bchrisman> what I can do is drop those daemons and retry
[20:56] <Tv> whoa libtool broke gitbuilder.. fixing.. somehow.. :-/
[20:57] <Tv> it's trying to relink crap when running clitests
[20:57] <Tv> i wonder how that can happen
[21:05] * pombreda (~Administr@ Quit (Quit: Leaving.)
[21:12] <Tv> back to green
[21:33] * Yoric (~David@138-154.79-83.cust.bluewin.ch) has joined #ceph
[22:08] <gregaf> bchrisman: I doubt the locking is breaking the rstats, just confused about how it got in there
[22:16] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[22:19] <wido> sjust: back
[22:20] <wido> my osd 0,1,2,3 and 35 are down, but for example, one of the active PG's is on OSD 38,34,16
[22:20] <wido> that should recover I assume?
[22:30] <wido> my 'osdc' and 'mdsc' are empty, so no stalling writes it seems?
[22:30] * pombreda (~Administr@ has joined #ceph
[22:31] * pombreda (~Administr@ Quit ()
[22:32] <sjust> wido: sorry for the delay
[22:32] <sjust> osdc?
[22:32] <wido> yes, in /sys/kernel/debug/ceph/x
[22:32] <wido> outstanding writes to the OSD's or MDS'es
[22:32] <sjust> ah, ok
[22:33] <wido> a 'rados -p data ls' also continues and exits
[22:33] <wido> same goes for metadata
[22:34] * samlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[22:37] <sjust> wido: can you try rados -p casdata bench 10 write?
[22:37] * samlaptop (~sam@ip-66-33-206-8.dreamhost.com) has left #ceph
[22:38] <gregaf> who cares about casdata?
[22:38] <wido> sjust: works, gives me about 80MB/sec
[22:38] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[22:39] <sjust> wido: but the fs hangs?
[22:39] <wido> I've got a second client where the fs is mounted to, browsing works, but reading a file stalled after 36MB
[22:39] <wido> another file stalled after 32MB
[22:39] <sjust> wido: try checking ceph pg dump for unfound objects
[22:40] <sjust> 4th column of the output
[22:42] <wido> from what I can see now, no unfound objects ( ceph pg dump -o -|grep active|grep -v clean|awk '{print $1" "$4}' )
[22:44] <sjust> have you tried mounting the cluster elsewhere (might be a kernel client problem)
[22:45] <wido> sjust: Yes, I have it mounted on a second client which was idle, that one is stalling too
[22:46] <wido> my rsync on the first one has been stalling now for about 4 hours
[22:46] <wido> and the second one I just tried, didn't do any I/O for the last few days
[22:46] <sjust> sounds like at least one pg is broken
[22:46] <sjust> hmm
[22:48] <wido> 'rados df' tells me the same, no unfound objects, only degraded ones
[22:48] <sjust> can you get me the output of ceph pg dump?
[22:49] <wido> sjust: sure, but I think your key is still loaded, you could log on if that's easier
[22:49] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[22:49] <sjust> ah, ok
[22:49] <wido> samuelj@slider, that's your key?
[22:49] <sjust> yeah
[22:49] <wido> root@logger.ceph.widodh.nl and from there you can go to 'root@amd' (first client) and 'root@noisy' (second client)
[22:50] <wido> amd is the one where the rsync is stalling in the screen
[22:51] <wido> sjust: logger is part of a second cluster, that one is still running fine, the cluster where 'amd' and 'noisy' are in is broken
[22:51] <sjust> ok
[22:51] <wido> those machines are IPv6 only, so 'logger' is a gateway to get into those machines
[22:51] <sjust> gotcha
[22:52] <wido> logs go to noisy in /var/log/remote/ceph
[22:52] * Yoric (~David@138-154.79-83.cust.bluewin.ch) Quit (Quit: Yoric)
[22:58] <sjust> wido: looks like several of the pg's still have missing objects
[22:59] <wido> sjust: that's weird, I have 5 failed OSD's, but the crushmap should have prevented all replicas of a PG from being stored on those OSD's
[23:04] <wido> I'm going afk, I'll take a look at it tomorrow, see what it does when I bring osd 0,1,2,3 back
[23:04] <sjust> ok
[23:05] <wido> tnx for your help!
[23:05] <sjust> sure!
[23:05] <sjust> thanks for the testing!
[23:05] <wido> (but still, it should have recovered ;) )
[23:05] <wido> If it is interesting enough, I could leave it in this state
[23:05] <sjust> seems to be a problem with recovery
[23:06] <wido> ok, we'll see, no rush
[23:06] <wido> ttyl!
[23:14] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:46] * sagelap (~sage@m1f0436d0.tmodns.net) has joined #ceph
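
Editor's sketch: the shell pipeline wido ran at [22:42] (grep active | grep -v clean | awk '{print $1" "$4}') keeps the rows that mention "active" but not "clean" and prints their first and fourth whitespace-separated fields. The Python below mirrors that filter. The function name and the sample rows are hypothetical and are not real `ceph pg dump` output; the actual column layout depends on the Ceph version, so treat the fourth column only as "whatever sjust pointed at".

```python
def active_not_clean(lines):
    """Mirror: grep active | grep -v clean | awk '{print $1" "$4}'."""
    rows = []
    for line in lines:
        # Keep rows that mention "active" but not "clean".
        if "active" not in line or "clean" in line:
            continue
        fields = line.split()
        # Unlike awk, skip rows that have fewer than four fields.
        if len(fields) >= 4:
            rows.append((fields[0], fields[3]))
    return rows


# Hypothetical sample rows (invented for illustration only).
SAMPLE = [
    "0.1a 120 0 0 0 active+clean",
    "0.2b 95 3 0 0 active",
    "1.7 40 0 2 0 active",
]

if __name__ == "__main__":
    for pgid, col4 in active_not_clean(SAMPLE):
        print(pgid, col4)
```

In practice the lines would come from the saved output of the command rather than from a hard-coded list; a non-zero value in the second printed column is what would need a closer look.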

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.