#ceph IRC Log


IRC Log for 2012-06-12

Timestamps are in GMT/BST.

[0:00] <sagewk> it will add ancestors, and move the leaf, but not adjust internal tree structure (yet)
[0:06] <sagewk> tv_: i'm worried about a single command (or osd with bad hierarchy information) making potentially huge changes to the tree structure. would rather blame that on an admin for now
[0:10] <Tv_> sagewk: yup
[0:10] <Tv_> sagewk: on the topic, thought of this: http://tracker.newdream.net/issues/2540
[0:10] <Tv_> sagewk: it was just confusing when you're not dealing with an actual layout, but made up entries ;)
[0:10] <sagewk> yeah
[0:11] <Tv_> and things like.. now i have a host at top-level, that refuses to go under a rack
[0:11] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit (Quit: adjohn)
[0:11] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[0:11] <Tv_> but as long as this can be edited decently somewhere else, that's good enough
[0:12] <sagewk> yeah
[0:12] * chutzpah (~chutz@ Quit (Quit: Leaving)
[0:17] <sagewk> tv_: pushed 2540
[0:25] * stass (stas@ssh.deglitch.com) Quit (Remote host closed the connection)
[0:27] <Tv_> sagewk: see branch upstart-set-crush; veto the ceph.conf change if you want
[0:28] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Quit: LarsFronius)
[0:29] <sagewk> looks good to me.
[0:30] <sagewk> and there is no ceph.conf change?
[0:30] <Tv_> sagewk: osd_crush_location, osd_crush_weight
[0:30] <sagewk> oh, the names.. those sound good to me
[0:30] <Tv_> sagewk: ok, putting it in master
[0:30] <sagewk> cool
[0:31] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:31] <Tv_> well, my vms are in different racks and it does the right thing ;)
[0:43] * MarkN (~nathan@ has joined #ceph
[0:44] * MarkN (~nathan@ has left #ceph
[1:01] * lofejndif (~lsqavnbok@09GAAF21X.tor-irc.dnsbl.oftc.net) has joined #ceph
[1:11] * The_Bishop (~bishop@p5DC11566.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[1:35] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[1:52] * Tv_ (~tv@2607:f298:a:607:bd15:990e:65cd:46db) Quit (Quit: Tv_)
[2:02] <elder> dmick, to get the fixed functionality for rebooting the right kernel, do I need to run chef on my target nodes?
[2:05] <elder> (I'm trying it, just in case.) I had some machines not come back with the right kernel, but I don't know if the problem is the grub thing or not.
[2:06] <dmick> All but a very few of the planas should be updated
[2:06] <dmick> you can tell if you're on one of the ones that got missed
[2:06] <dmick> look at /etc/grub.d/10_linux and see if the line near the bottom that says
[2:06] <dmick> if [ "$list" ] && ! $in_subment
[2:06] <dmick> *menu
[2:07] <dmick> has an additional "&& false" that disables it
[2:07] <dmick> (that's what the chef edit does)
[2:07] <elder> I'll look, but I just ran the "chef" task so maybe that's going to update it for me.
[2:08] <dmick> oh look at that. I was not even aware that existed
[2:09] <elder> Apparently it runs the chef task like the overnight test runs do.
[2:10] <elder> I think doing it once in a while gets you updated.
[2:10] <dmick> didn't know overnight did either.
[2:10] <dmick> but yes, I run that chef-solo manually a lot
[2:11] <elder> Well, maybe it doesn't. I'm just going on my limited understanding and some somewhat cryptic notes.
[2:13] <elder> Here's what I have:
[2:13] <elder> if [ "$list" ] && ! $in_submenu; then
[2:13] <elder> echo "submenu \"Previous Linux versions\" {"
[2:13] <elder> in_submenu=:
[2:13] <elder> fi
[2:13] <elder> So I suppose that means I don't have the fix.
[2:13] <dmick> yeah, that's not right
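
The check dmick describes above can be sketched in Python. This is only an illustration: the helper name `has_submenu_fix` is hypothetical, and the exact shape of the patched line (`... && false; then`) is an assumption based on the description of an additional "&& false".

```python
import re

# Matches the submenu guard quoted from /etc/grub.d/10_linux and captures
# whatever sits between "$in_submenu" and the closing ";".
GUARD = re.compile(r'if \[ "\$list" \] && ! \$in_submenu(?P<rest>[^;]*);')


def has_submenu_fix(script_text):
    """Return True if the guard carries '&& false', False if it does not,
    and None if the guard line is not present at all."""
    for line in script_text.splitlines():
        match = GUARD.search(line)
        if match:
            return "&& false" in match.group("rest")
    return None


if __name__ == "__main__":
    # Reading the real file needs no special privileges on most systems.
    with open("/etc/grub.d/10_linux") as handle:
        print(has_submenu_fix(handle.read()))
```

The output elder pastes would make this return False, which matches the conclusion that the fix is missing.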
[2:14] <elder> Is it easy to update? Otherwise I don't mind just doing my manual fix for the time being.
[2:18] <dmick> well by the looks of it running the chef task should do so
[2:18] <dmick> or you could run this manually:
[2:18] <dmick> wget -q -O- https://raw.github.com/ceph/ceph-qa-chef/master/solo/solo-from-scratch | sh
[2:19] <elder> WTF, that seemed to have downloaded the entire fricken repository.
[2:20] <elder> Well, maybe not... I don't know what that was about.
[2:21] <elder> A second attempt returned immediately.
[2:23] <dmick> it pulls down a git repo of the ceph-qa-chef repo as a way to get the various pieces local to execute them
[2:23] <elder> Yikes.
[2:23] <dmick> and then throws it away
[2:23] <elder> Well, as long as it gets the job done.
[2:24] <dmick> there's a lot of "github is our network service that's always there" assumption running around
[2:24] <elder> So is that just manually updating the script so future grub.cfg files get built with the && false on it?
[2:24] <dmick> among a bunch of other stuff, yeah, but that's the part relevant to you right now
[2:25] <dmick> /etc/grub.d are the "source" for /boot/grub/grub.cfg
[2:25] <elder> OK, well I did that one machine and will do the same on the other two in this cluster.
[2:25] <elder> Thanks.
[2:25] <dmick> sure
[2:39] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[2:42] * fzylogic (~fzylogic@ Quit (Quit: DreamHost Web Hosting http://www.dreamhost.com)
[2:43] * stass (stas@ssh.deglitch.com) has joined #ceph
[2:56] * lofejndif (~lsqavnbok@09GAAF21X.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[3:22] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:23] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:54] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit (Quit: adjohn)
[3:56] * renzhi_away is now known as renzhi
[4:09] <renzhi> Hi, how stable is the cephfs library? production ready? :)
[5:12] * cattelan_away is now known as cattelan_away_away
[5:31] * Kioob`Taff1 (~plug-oliv@local.plusdinfo.com) Quit (Ping timeout: 480 seconds)
[5:33] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) has joined #ceph
[5:33] * renzhi (~renzhi@ Quit (Ping timeout: 480 seconds)
[5:42] * renzhi (~renzhi@ has joined #ceph
[5:58] * The_Bishop (~bishop@2a01:198:2ee:0:a053:1eb7:3d22:9c57) has joined #ceph
[6:20] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[6:24] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Quit: LarsFronius)
[6:44] <iggy> renzhi: depends what features of the rest of the system you are going to use
[6:46] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit (Quit: adjohn)
[7:12] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[7:15] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit ()
[7:34] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[7:43] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Ping timeout: 480 seconds)
[8:02] <renzhi> iggy: I looked up the API, and I'm not sure if it's complete
[8:02] <renzhi> or is there anyone using it in production?
[8:03] <iggy> oh, you mean in that respect... yeah i think not, unless you are willing to track changes
[8:03] <iggy> they were talking about changing something the other day
[8:04] <renzhi> we are using native rados, with librados now
[8:05] <iggy> that should be good to go
[8:05] <renzhi> librados has been giving quite an interesting run in the last week or so :)
[8:06] <renzhi> we were load testing it with thousands of client connections, and it seems like each connection is taking quite a few threads internally
[8:06] <renzhi> we ran out of thread resources very quickly.
[8:07] <renzhi> right now, we are sharing one instance of the cluster handle among multiple io contexts, but I'm not too sure about race conditions.
[8:07] <renzhi> There is this #2525 issue.
[8:07] <renzhi> so we are still in testing mode.
[8:30] * renzhi (~renzhi@ Quit (Ping timeout: 480 seconds)
[8:44] * renzhi (~renzhi@ has joined #ceph
[8:54] * BManojlovic (~steki@ has joined #ceph
[8:59] * Kioob`Taff1 (~plug-oliv@89-156-116-126.rev.numericable.fr) has joined #ceph
[9:14] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:28] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:35] * gregaf (~Adium@2607:f298:a:607:91ec:25b8:8ba1:3069) Quit (Read error: Connection reset by peer)
[9:36] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) Quit (Read error: Operation timed out)
[9:39] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[10:02] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[10:29] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:29] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Read error: Connection reset by peer)
[10:29] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[10:33] * fghaas (~florian@ has joined #ceph
[10:41] <fghaas> morning, I'm trying to troubleshoot a radosgw issue, but can't convince radosgw to up its logging verbosity. anyone able to give information as to whether the log level should be set via environment variables, as in https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf, or via "debug rgw" in ceph.conf?
[10:42] <fghaas> or, more basically, whether rgw debug output should end up somewhere in /var/log/ceph, in the web server error log, or in syslog?
[11:23] * fghaas (~florian@ Quit (Ping timeout: 480 seconds)
[11:29] * fghaas (~florian@ has joined #ceph
[11:59] * ninkotech (~duplo@ has joined #ceph
[12:25] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:28] * fikhool (~fikh@1RDAACK2B.tor-irc.dnsbl.oftc.net) has joined #ceph
[12:35] * fikhool (~fikh@1RDAACK2B.tor-irc.dnsbl.oftc.net) Quit (autokilled: This host violated network policy. If you feel an error has been made, please contact support@oftc.net, thanks. (2012-06-12 10:35:38))
[12:50] * renzhi (~renzhi@ Quit (Ping timeout: 480 seconds)
[13:00] * renzhi (~renzhi@ has joined #ceph
[13:47] * morse (~morse@supercomputing.univpm.it) Quit (Quit: Bye, see you soon)
[13:55] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[14:03] * morse (~morse@supercomputing.univpm.it) Quit (Quit: Bye, see you soon)
[14:04] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[14:04] * renzhi is now known as renzhi_away
[14:12] * renzhi_away (~renzhi@ Quit (Ping timeout: 480 seconds)
[14:26] * renzhi_away (~renzhi@ has joined #ceph
[14:26] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[15:43] * mtk (trP9T6oSW2@panix2.panix.com) Quit (Remote host closed the connection)
[15:44] * mtk (~mtk@ool-44c358a7.dyn.optonline.net) has joined #ceph
[15:55] * cattelan_away_away is now known as cattelan_away
[16:01] * szaydel (~szaydel@c-67-169-107-121.hsd1.ca.comcast.net) has joined #ceph
[16:02] * szaydel (~szaydel@c-67-169-107-121.hsd1.ca.comcast.net) has left #ceph
[16:34] * nhm (~nh@184-97-241-32.mpls.qwest.net) has joined #ceph
[17:00] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:06] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:06] * BManojlovic (~steki@ has joined #ceph
[17:06] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[17:32] * bchrisman1 (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:33] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Quit: Ex-Chat)
[17:53] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[17:59] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[18:16] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit (Quit: adjohn)
[18:26] * Tv_ (~tv@2607:f298:a:607:bd15:990e:65cd:46db) has joined #ceph
[18:52] * bchrisman (~Adium@ has joined #ceph
[19:01] * Kioob`Taff1 (~plug-oliv@89-156-116-126.rev.numericable.fr) Quit (Quit: Leaving.)
[19:01] * Kioob`Taff2 (~plug-oliv@89-156-116-126.rev.numericable.fr) has joined #ceph
[19:19] * fghaas (~florian@ Quit (Ping timeout: 480 seconds)
[19:29] * adjohn (~adjohn@ has joined #ceph
[19:33] <joao> sagewk, sage, how bad was I breaking up?
[19:33] <joao> *badly
[19:34] <joao> for some reason, my internet connection has been having huge latency spikes over the last couple of days (and I notice it especially when ssh'ing into the planas)
[19:35] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Quit: LarsFronius)
[19:38] <gregaf> joao: so what's this bug you were mentioning?
[19:38] <joao> basically, the paxos isn't properly recovering
[19:39] <joao> not sure where the bug lies yet, as I've been working on it
[19:39] <gregaf> does it start up properly the first time?
[19:39] <gregaf> or are you talking about killing a monitor and then bringing it back up after changes have been made?
[19:39] <joao> yeah, but then I hit yet another bug (somewhere down the line) and was not able to reproduce it due to this one
[19:40] <gregaf> given that slurping doesn't exist and there isn't a replacement, recovery certainly isn't going to work properly -- am I missing something?
[19:41] <joao> during the new proposal, any monitor with a higher pn should send it to the leader
[19:41] <joao> that means sharing its state
[19:41] <joao> that does happen, and the leader proposes that new value
[19:41] <joao> that new value is accepted
[19:41] <joao> everything goes according to plan
[19:42] <joao> except that the leader does not store that value, or if it does it is a corrupted value
[19:42] <gregaf> ah, I see
[19:42] <joao> which will screw someone during the next update_from_paxos()
[19:42] <gregaf> the new value is not stored/is corrupted? or one of the historic values?
[19:44] <joao> by historical you mean a higher pn than the one the leader is initially proposing?
[19:44] <joao> if so, yes, it's an historical
[19:44] <joao> basically, I mkfs'd the leader and rerun the whole thing
[19:44] <gregaf> that's the new value then, right?
[19:44] <gregaf> by historical I mean "previously committed but not stored everywhere"
[19:44] <joao> so yes, it's the new value
[19:45] <joao> it's not stored everywhere because I mkfs'd the leader
[19:45] <gregaf> okay
[19:45] <joao> and the leader has no idea about it
[19:45] <joao> I'm going to look into the whole paxos in-memory variables first
[19:46] <joao> I have a feeling there may be a variable that is no longer used and may be the culprit, and I may have forgotten to either remove it or reuse it
[19:46] * chutzpah (~chutz@ has joined #ceph
[19:47] <gregaf> okay
[19:47] <joao> after all, the store is clearly not stored
[19:47] <joao> mon.a@0(leader).paxos(paxos recovering c 1..6) store_state nothing to commit
[19:47] <joao> and this is a blatant lie
[19:47] <gregaf> ah, not so good then
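
The recovery flow joao describes (a monitor holding a value with a higher pn shares it, the leader re-proposes it, and every monitor including the leader should then store it) can be illustrated with a toy model. This is a sketch under heavy assumptions: the `Monitor` class and `recover` function are hypothetical, quorum handling is omitted, and none of this is the actual Ceph Paxos code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Monitor:
    name: str
    accepted_pn: int = 0                  # pn of the uncommitted value, if any
    uncommitted: Optional[bytes] = None   # accepted but not yet committed
    store: dict = field(default_factory=dict)  # version -> committed value


def recover(leader, peons, version):
    """Pick the uncommitted value with the highest pn and commit it everywhere."""
    best = leader
    for peon in peons:
        if peon.uncommitted is not None and peon.accepted_pn > best.accepted_pn:
            best = peon
    if best.uncommitted is None:
        return None  # genuinely nothing to commit
    value = best.uncommitted
    # The step under suspicion in the discussion: the leader must persist
    # the recovered value too, not only the peons.
    for mon in [leader, *peons]:
        mon.store[version] = value
        mon.uncommitted = None
    return value
```

With a freshly wiped leader and one peon holding an uncommitted value, `recover` leaves the value in the leader's `store`, which is the behaviour the "nothing to commit" log line contradicts.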
[19:47] <gregaf> okay, let me know if you'd like some help or a rubber ducky or something :)
[19:48] * chutzpah (~chutz@ Quit ()
[19:48] <joao> although I don't get the rubber ducky thing, thank you :p
[19:48] <gregaf> really?
[19:48] <gregaf> http://en.wikipedia.org/wiki/Rubber_duck_debugging
[19:49] <joao> oh, there's a name for that
[19:49] <joao> awesome
[19:53] * fghaas (~florian@ has joined #ceph
[19:54] <joao> btw, does any of you know of a tool to go through leveldb's keys and values?
[19:55] <joao> do we have such a thing?
[19:55] <sjust> joao: nope
[19:55] <joao> k thanks :)
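
A key/value walker of the kind joao asks about can be sketched as a small formatter over `(key, value)` byte pairs. The function name `dump_pairs` is hypothetical, and the commented usage assumes the third-party `plyvel` binding, which is not mentioned in the discussion.

```python
def dump_pairs(pairs, limit=None):
    """Yield one printable line per (key, value) pair of bytes.

    Only the first 32 bytes of each value are shown so that large
    values do not flood the terminal.
    """
    for index, (key, value) in enumerate(pairs):
        if limit is not None and index >= limit:
            break
        yield f"{key!r}\t{len(value)} bytes\t{value[:32]!r}"


# Assumed usage with the third-party plyvel binding (not part of the log):
#   import plyvel
#   db = plyvel.DB("/path/to/store", create_if_missing=False)
#   for line in dump_pairs(db.iterator(), limit=100):
#       print(line)
```

Keeping the formatter separate from the database handle means it can be exercised with any iterable of pairs, including a plain list.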
[19:58] * chutzpah (~chutz@ has joined #ceph
[19:59] <nhm> gregaf: I didn't realize there was a name for that either. Awesome.
[20:05] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[20:23] * aliguori (~anthony@ has joined #ceph
[20:28] * tjiftjaf (~Pete@09GAAF34P.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:28] <tjiftjaf> why this channel has no topic?!
[20:32] <joao> apparently, because we have no channel operator to set it ;)
[20:32] <tjiftjaf> apparently, because we have no channel operator to set it ;)
[20:33] <tjiftjaf> lol
[20:33] * sakib (~Adium@ip. has joined #ceph
[20:37] * sakib (~Adium@ip. has left #ceph
[20:48] * BManojlovic (~steki@ has joined #ceph
[21:01] * fghaas (~florian@ has left #ceph
[21:19] * tjiftjaf (~Pete@09GAAF34P.tor-irc.dnsbl.oftc.net) Quit (Quit: K-Lined)
[21:28] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[21:34] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[21:51] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[21:52] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[22:10] <Tv_> so leveldb broke make distcheck
[22:10] <Tv_> hrrmph
[22:10] * lofejndif (~lsqavnbok@09GAAF39I.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:18] <jmlowe> I have a crashing osd
[22:19] <jmlowe> http://pastebin.com/N9qU1c5t
[22:20] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:35] <Tv_> jmlowe: that looks familiar..
[22:35] <jmlowe> what's my best course of action?
[22:35] <Tv_> jmlowe: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6831/focus=6850
[22:36] <Tv_> jmlowe: sjust is your best course of action ;)
[22:38] <jmlowe> I suspect I have a bad copy of an object hanging out there, I'm going from 2 to 3 osds and replication level 3
[22:39] <jmlowe> so if I could knock out the bad copy I might be ok
[22:49] <Tv_> sagewk: #2415 is in progress without assignee?
[22:49] <sagewk> that's me
[22:49] * aliguori (~anthony@ Quit (Remote host closed the connection)
[22:56] <sjust> I'm here now
[22:57] <sjust> jmlowe: that has happened twice now?
[23:00] <jmlowe> a couple of times
[23:00] * yehudasa_ (~yehudasa@aon.hq.newdream.net) has joined #ceph
[23:01] * gregaf (~Adium@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[23:01] * sjust (~sam@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[23:01] * LarsFronius (~LarsFroni@2a02:8108:3c0:5a:e56a:dce8:47f5:f0c9) has joined #ceph
[23:02] * gregaf (~Adium@2607:f298:a:607:1dbd:59b6:79e3:4e20) has joined #ceph
[23:03] <jmlowe> actually it happens every time
[23:03] <jmlowe> on startup
[23:04] <jmlowe> most recent log snippet http://pastebin.com/pU8nRc7R
[23:05] * sagewk1 (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) has joined #ceph
[23:06] * LarsFronius (~LarsFroni@2a02:8108:3c0:5a:e56a:dce8:47f5:f0c9) Quit ()
[23:07] * sagewk (~sage@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:08] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:08] * sjust-phone (~sjust@aon.hq.newdream.net) has joined #ceph
[23:10] <sjust-phone> hmm, i guess the leveldb store is corrupted
[23:10] <jmlowe> what's my best course of action?
[23:10] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) has joined #ceph
[23:10] * sjust-phone (~sjust@aon.hq.newdream.net) Quit ()
[23:11] <sjust> hmm, can you tar up the omap directory under current in the misbehaving osd?
[23:11] <jmlowe> mkfs.btrfs, ceph-osd --mkfs?
[23:11] <sjust> yeah, I guess so
[23:12] <sjust> the current/omap contains the leveldb stuff, if i can get a copy I can figure out what happened
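
Packing up the `current/omap` directory as sjust requests could look like the following. It is a minimal sketch: the helper name `tar_omap` and the example paths are assumptions, and the OSD data directory layout is taken only from the `current/omap` path mentioned above.

```python
import os
import tarfile


def tar_omap(osd_data_dir, out_path):
    """Write a gzip-compressed tarball of <osd_data_dir>/current/omap."""
    omap_dir = os.path.join(osd_data_dir, "current", "omap")
    if not os.path.isdir(omap_dir):
        raise FileNotFoundError(omap_dir)
    with tarfile.open(out_path, "w:gz") as archive:
        # arcname keeps the archive rooted at "omap/" instead of the full path.
        archive.add(omap_dir, arcname="omap")
    return out_path


# Hypothetical invocation; the data directory depends on the local setup:
#   tar_omap("/path/to/osd/data", "/tmp/omap.tar.gz")
```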
[23:13] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[23:14] * bchrisman (~Adium@ has joined #ceph
[23:17] <jmlowe> let's see if this works https://iu.box.com/s/d97d34c0ccda51ad7d72
[23:17] <joao> woohoo
[23:18] <joao> nothing is blowing up
[23:18] * yehudasa__ (~yehudasa@aon.hq.newdream.net) has joined #ceph
[23:18] <jmlowe> sjust: any luck getting the tarball?
[23:25] * yehudasa_ (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:26] <sjust> one sec
[23:27] <sjust> got it, thanks!
[23:27] <sjust> the best way to recover is going to be to blast the osd, sorry
[23:32] <jmlowe> ok, that's no big deal
[23:33] <sjust> still, that's the second time we've seen that
[23:35] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[23:38] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Operation timed out)
[23:39] <iggy> sort of OT... You guys that are using chef, satisfied? Have you used any other config mgmt system?
[23:45] <Tv_> iggy: i've used plenty, not satisfied with any of them ;)
[23:45] <Tv_> iggy: chef is definitely getting a lot of attention out there, so that's why we feel we need to support it
[23:52] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:53] <iggy> it seems to be one of the few "big names" that doesn't have a free/enterprise split
[23:55] <sjust> jmlowe: fyi: bug #2563

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.