#ceph IRC Log


IRC Log for 2011-01-25

Timestamps are in GMT/BST.

[0:00] <bchrisman> yeah.. so I think you're right about it serving up basic ls data out of a cache and then having something else waiting on a response from a dead node or something?
[0:00] <greglap> I'm not sure about that though
[0:00] <greglap> it doesn't seem very likely to me but I think maybe you lost some PGs and didn't notice somehow?
[0:00] <greglap> although that ought to be lighting up the world if that were the case, so maybe not
[0:01] <greglap> briefly, on a subject I do actually know about
[0:01] <greglap> you're starting up 3 MDSes but you're actually only using one of them
[0:01] <bchrisman> my rack doesn't reference the devices directly though.. it references them through the host entries….that's how I was hoping it was saying "ahh rack is composed of three hosts with four devices each.. let's distribute things appropriately".. but maybe I was thinking of things a little too abstractly.
[0:02] <greglap> you need to run "ceph mds set_max_mds #" to change that, it's not automatically derived from the config file
[0:02] <bchrisman> ahhh
[0:02] <greglap> since you can do a few different things with them, like have standbys in case an MDS goes down
[0:02] <bchrisman> okay… I can set that and retest.. that would make a bit of sense from the symptoms.
[0:02] <greglap> while you're doing failure cases I'd stick to just having one active MDS and a standby, though — failed MDSes aren't tested nearly as well, and in the short to medium term at least to recover from a lost MDS you need to spin up another daemon to take over for it
[0:03] <bchrisman> if I had multiple PG's on the same node that went down… would it show that in ceph -s as PGs that aren't available or objects that are lost somehow?
[0:03] <bchrisman> ahhh
[0:04] <greglap> sjust has been handling these cases more recently than me
[0:04] <greglap> if he's around
[0:04] <bchrisman> ok… umm… since we're on the topic.. how do I mark an mds as primary/standby? :)
[0:04] <greglap> right now, it's just random
[0:05] <greglap> I'm merging in some stuff that'll let you set up standby-replay behaviors on specific nodes
[0:05] <greglap> generally speaking when a PG changes ownership the new OSD gets notified by the old OSDs that were still up
[0:05] <bchrisman> ok.. so 'ceph mds set_max_mds' will configure the max.. and one will be primary the others standby?
[0:06] <greglap> and I'm realizing that I don't know what happens if you lose all the OSDs at once
[0:06] <greglap> ceph mds set_max_mds sets how many active MDSes there can be
[0:06] <greglap> it's not assigning a specific one to the role or anything
[0:06] <greglap> but if you set_max_mds 3 you can have up to 3 active MDSes that will partition the filesystem among themselves
[0:07] <greglap> that means if you have 5 MDSes up then 2 of them will be standbys
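[Editor's note] The command greglap describes can be sketched as follows. This assumes a running cluster and the CLI syntax of this era, so treat it as illustrative rather than authoritative:

```shell
# Allow up to 3 active MDSes; the filesystem tree is partitioned among them.
# Any extra running MDS daemons become standbys automatically.
ceph mds set_max_mds 3

# Inspect the active/standby breakdown on the "mds" line of the status output
ceph -s
```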
[0:08] <greglap> okay, sjust is coming in but I have to go now, ttyl
[0:08] <bchrisman> hmm
[0:08] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[0:09] <sjust> bchrisman: sorry for the delay, I'm trying to catch up on the log
[0:12] <sjust> if you lose an osd, other osds should be able to recover its pgs using the replicated objects
[0:13] <sjust> if you lose all of them and then bring them up, it should come back up to a consistent state with only the operations which had not been committed missing
[0:13] <sjust> does that help?
[0:15] <bchrisman> sjust: perhaps.. in this case, I was testing a single node failure in a 3-node cluster.
[0:16] <sjust> how many osd daemons and mds daemons?
[0:16] <bchrisman> originally 12 osd and 3 mds
[0:17] <sjust> did the hang still occur once all of the daemons were brought back up?
[0:17] <bchrisman> I didn't bring them back up.. left the node down...
[0:17] <sjust> ok
[0:17] <bchrisman> I can boot it back up.
[0:18] <bchrisman> I/O still went through
[0:18] <sjust> no, that's ok, looking at your crushmap now
[0:18] <bchrisman> I can create files....
[0:18] <bchrisman> but the ls (with default redhat color which does lookup of file type) hangs and /bin/ls doesn't.
[0:18] <sjust> I'm guessing that the mds caches that information and does not need to go out to an osd
[0:18] <sjust> but I'm not sure
[0:19] <bchrisman> yeah.. Lustre also caches the data required for /bin/ls, but not the --color data.
[0:19] <bchrisman> so I figured there's something similar going on here...
[0:20] <sjust> maybe, I'm trying to decode your crushmap (I haven't actually configured one myself)
[0:21] <bchrisman> ahh ok.. yeah.. that was my first-stab approach.. things seemed to run okay.. and worked with the 'pulled drive' so long as I killed off the appropriate cosd.
[0:21] <sjust> yeah
[0:21] <sjust> it *should* work as long as all pg's have replicas on different hosts
[0:21] <sjust> oh
[0:22] <sjust> actually, if the mds is still down, parts of the file hierarchy should be inaccessible if there wasn't a standby mds designated
[0:22] <bchrisman> ahh was wondering how to designate that in the config file.
[0:22] <bchrisman> or rather.. setup the thing in general.. :)
[0:22] <sjust> :)
[0:23] <bchrisman> I could just run one mds on a node that I'm not going to yank.. would be a good starting test...
[0:23] <bchrisman> but if it should handle an mds failure… we'd eventually want to test that.
[0:23] <sjust> indeed
[0:24] <sjust> actually, I think you did have a standby mds designated
[0:24] <bchrisman> can I tell which mds was/is active?
[0:24] <sjust> the ceph -s you posted was after you killed a machine?
[0:24] <bchrisman> yup
[0:25] <sjust> if I'm reading this right, one of the standbys took over correctly leaving one more standby
[0:25] <bchrisman> this line here: mds e5: 1/1/1 up {0=up:active}, 1 up:standby
[0:26] <sjust> yeah, 0=up:active indicates that there is an active mds
[0:26] <bchrisman> not clear to me what the 1/1/1 is unless that's a bitmap of up/standby
[0:26] <sjust> I don't remember what that means myself
[0:26] <bchrisman> ahh so mds0 is 'up' and mds1 is standby?
[0:27] <bchrisman> heh.. maybe I'll go look for that in the source.. :)
[0:27] <sjust> I believe it means that some mds has role 0 and is active, and there is one additional mds in up:standby
[0:27] <sjust> hmm, looking at this, 16 creating pgs has me wondering
[0:27] <sjust> what commands did you execute after you took down one of the machines?
[0:30] <bchrisman> a couple of 'touch' files.. maybe a couple of dd's...
[0:30] <sjust> oh, but no ceph osd commands?
[0:31] <sjust> ok
[0:31] <bchrisman> no ceph commands
[0:31] <bchrisman> except status
[0:31] <sjust> what does ceph osd dump -o - output?
[0:32] <sedulous> gregaf: i found a possible solution! the NBD server (Network Block Device) can use multiple files on any file system to emulate a block device
[0:34] <sedulous> gregaf: for example, one can create 1000 * 10 MB files on a WebDAV share and get a 10 GB block device
[0:34] <bchrisman> sjust: one sec
[0:34] <sjust> bchrisman: ok
[0:35] <bchrisman> sjust: http://pastebin.com/Z8RnADtm
[0:35] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[0:35] <bchrisman> pasted more there than I meant.. but
[0:37] <bchrisman> trying to interpret the output there..
[0:38] <bchrisman> would be nice to have a rosetta stone for that stuff :)
[0:39] <greglap> sjust: I don't think he actually had 3 MDSes active — 3 configured but only one running...the issue's not with the MDS dying
[0:39] <sjust> right
[0:39] <sjust> greglap: I think it may be the crushmap
[0:39] <greglap> my thought was that maybe he had 16 PGs that didn't have replicas on the nodes that stayed up
[0:39] <sjust> that is what I am thinking
[0:39] <greglap> and nobody got notified to take them over, and there's no log for them
[0:39] <greglap> so for some reason that we should fix, they're showing up as "Creating"
[0:40] <greglap> but I don't know
[0:40] <sjust> the crushmap wiki page suggests that you need to use something similar to 'step chooseleaf firstn 0 type host' to get it to split over hosts
[0:40] <bchrisman> on a quiescent fs like this, I'm guessing we should have 0 creating?
[0:40] <sjust> bchrisman: creating indicates that the pgs were completely lost and had to be recreated
[0:40] <bchrisman> ah
[0:41] <bchrisman> okay..
[0:41] <sjust> which I don't think it is supposed to do without user intervention
[0:41] <greglap> sjust: oh, is that intended behavior somehow?
[0:41] <sjust> greglap: I don't think so
[0:42] <bchrisman> is 'choose' a synonym for chooseleaf?
[0:42] <sjust> I'm not sure, still examining the wiki page
[0:42] <greglap> I mean, I realized I don't know what the expected behavior is here
[0:42] <greglap> bchrisman: no, they're different
[0:42] <bchrisman> ahh and it's referencing device rather than host there.
[0:42] <sjust> yeah
[0:42] <greglap> I don't remember the precise semantics but chooseleaf dives from whatever level you're in to the bottom
[0:42] <greglap> choose is used for going the next level down
[0:43] <sjust> ok, the first thing seems to be that some pgs had all of their replicas on the node that went down, those would not be accessible
[0:43] <bchrisman> ahh… so that's what I was hoping for.. choosing the next level down, which would be host.. but I'm referencing 'device' in that 'data' stanza
[0:43] <sjust> http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH has an example similar to what you are trying to do
[0:43] <bchrisman> sjust: good.. that means that the problem is a misconfiguration.. full positive there.
[0:44] <sjust> bchrisman: not quite, I don't think it should have given up on the downed osds without user intervention
[0:44] <sjust> how long did it go before the creating started?
[0:44] <bchrisman> ahh yes.. I was vaguely working off of that.. except the crushmap I generated from the tool had some problems.. so I went a'hacking around.
[0:44] <greglap> sjust: have we actually tested that failure mode?
[0:45] <sjust> greglap: I have tested taking osds down in groups of 1 and 2 for a few minutes at a time
[0:45] <greglap> it's not like it actually is creating them, so I wonder if we just have an interface issue
[0:45] <bchrisman> sjust: I wasn't monitoring the 'created', so I don't know for sure when they showed up.
[0:45] <greglap> yes, but in this scenario we lost all the PG data
[0:45] <sjust> bchrisman: try bringing up the missing daemons and see if the pg data comes back
[0:45] <sjust> greglap: yeah, but if you bring enough of the osds back up, the pgs would be back
[0:46] <bchrisman> okay.. I'll bring that node back online.. will a service ceph start be all that's needed to have the osds on that node rejoin?
[0:46] <bchrisman> or will I need to issue ceph commands to have them rejoin?
[0:46] <sjust> greglap: no io could occur until that happens, of course, but they should be recoverable
[0:46] <sjust> bchrisman: I think they should come back in automatically
[0:46] <bchrisman> sjust: ok...
[0:47] <sjust> once they are started
[0:47] <greglap> they've been marked both down and out, does that affect rejoin at all?
[0:48] <greglap> gotta run again, want to get out and bike while I still have sunlight :)
[0:48] <sjust> greglap: heh, ok
[0:52] <bchrisman> sjust: node back up but: osd e15: 8 osds: 8 up, 8 in
[0:52] <bchrisman> the cosd's are running on the node I brought back up
[0:53] <sjust> bchrisman: one sec
[0:54] <bchrisman> sjust: (no prob) osd dump shows no more osd's than before.
[0:56] <sjust> try running ceph osd in 4
[0:57] <bchrisman> hmm.. on one of the never-failed nodes: osd4 does not exist :)
[0:58] <sjust> your ceph.conf still has osd4-7?
[0:58] <bchrisman> I haven't changed the ceph.conf.. so yes.
[0:58] <sjust> ok
[0:59] <bchrisman> (just verified)
[0:59] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has joined #ceph
[0:59] <bchrisman> ceph -s on cycled node show same output.
[1:00] <sjust> could you post the log from one of the osds on the failed node?
[1:00] <sjust> also the monitor logs
[1:02] <bchrisman> sure… /var/log/ceph/osd rather than /var/log/ceph/stat/...?
[1:03] <sjust> yes
[1:03] <sjust> /var/log/ceph/osd.4.log for osd4, I think
[1:03] <bchrisman> yeah http://pastebin.com/k9vphfWK
[1:04] <bchrisman> hunting for new monitor
[1:05] <bchrisman> osd looks for monitor.. can't find it.
[1:06] <bchrisman> in the mon log on reboot node log [INF] : mon.1 calling new monitor election
[1:06] <bchrisman> ahhh
[1:07] <bchrisman> cephx server osd.4: unexpected key: req.key=d68c8e665337ea58 expected_key=5ea1ceffcd671806
[1:07] <bchrisman> odd..
[1:07] <sjust> oh
[1:07] <sjust> I am not familiar with the cephx stuff, I'm afraid
[1:07] <bchrisman> looked like I couldn't configure the cluster without it… which is why I used it.
[1:08] <bchrisman> plus it generates a key, which I md5 for an fsid for an nfs export
[1:08] <bchrisman> trying to figure out that timing.
[1:09] <bchrisman> those unexpected keys are only coming from osd4-7… which are the ones which were crashed.
[1:09] <sjust> right
[1:09] <bchrisman> somehow the key became invalid?
[1:09] <sjust> apparently, looking at the wiki
[1:11] <sjust> http://ceph.newdream.net/wiki/OSD_cluster_expansion/contraction has information for marking a key authorized
[1:11] <sjust> what is the output of ceph auth list?
[1:11] <bchrisman> weird… needs new key for an osd which was previously part of the cluster?
[1:12] <sjust> yeah, that seems odd
[1:12] <bchrisman> there's a key for each osd in the list
[1:12] <bchrisman> will look at config on reboot node
[1:12] <sjust> I think yehuda should have more information
[1:13] <bchrisman> May be a problem re-adding an existing osd? Maybe it expects all osds to be new?
[1:13] <sjust> bchrisman: if that's true it is likely a bug
[1:14] <bchrisman> yeah...
[1:14] <bchrisman> was looking to fix my crushmap too.
[1:15] <bchrisman> I thought I was emulating that crushmap in the wiki.. but I probably borked something in there.. will look again.
[1:19] <bchrisman> If I have a 'superfluous' layer in the crushmap, should still work I imagine? in my example, I really only have one rack… so I removed that layer of abstraction… I can put that back in to make my crushmap better match, but it'd only have one rack.. which seems like it might violate some data partitioning thing.
[1:20] <bchrisman> I guess the example starts with a 'root' element anyways.. which would be a SPOF
[1:21] <bchrisman> there's also no metadata stanza in that, yet there are in other example crushmaps.. I'm guessing that metadata is not required for librados but might be for ceph?
[1:21] <sjust> right
[1:21] <sjust> the mds stores metadata in the metadata pool
[1:23] <bchrisman> cool.. okay.. yeah.. referencing 'device' in the data pool rule looks like it would basically short-circuit the architecture I put into the file.
[1:23] <sjust> i think you just need to replace the choose line with 'step chooseleaf firstn 0 type rack
[1:23] <sjust> oop
[1:24] <sjust> *i think you just need to replace the choose line with 'step chooseleaf firstn 0 type host'
[1:24] <bchrisman> yeah.. that's what I'm doing… yup
[1:24] <sjust> ok
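[Editor's note] The data rule bchrisman ends up with would look roughly like this in the decompiled crushmap. This is a sketch modeled on the wiki example under discussion; the root name and size bounds are placeholders:

```
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take root
	step chooseleaf firstn 0 type host
	step emit
}
```

`chooseleaf firstn 0 type host` descends from each chosen host down to a device, so each replica lands on a distinct host; a plain `choose ... type device` can place multiple replicas on the same host.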
[1:26] <bchrisman> that failure mode may not really matter then, since it's basically an incorrect crushmap...
[1:26] <bchrisman> though I guess it should have come back up… will see when I test this.
[1:26] <sjust> yeah, I'll ask yehuda about the cephx problem tomorrow
[1:31] <bchrisman> should have a from-scratch reproduce with new crushmap in half an hour… not sure how long you guys are around..
[1:31] <bchrisman> will post here though regardless
[1:32] <sjust> I'll be around for about another hour
[2:01] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:08] <bchrisman> hmm.. updated crushmap.. still hang on ls (but not /bin/ls)
[2:09] <bchrisman> I'll put up my modified crushmap
[2:09] <sjust> ok
[2:10] <bchrisman> http://pastebin.com/DsMrKWZ8
[2:11] <bchrisman> ceph -s: http://pastebin.com/k34Vfqx1
[2:12] <bchrisman> oh wait..
[2:12] <bchrisman> hang went away
[2:12] <bchrisman> so it was temporary
[2:12] <bchrisman> but it was there for a fair bit
[2:13] <bchrisman> standard ls works
[2:15] <sjust> it should take a bit for the failure to be detected, and then for the replica to take over
[2:15] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has left #ceph
[2:16] <sjust> the delay before the downed osds are detected can be reduced by tweaking some of the tunables
[2:16] <bchrisman> yeah.. cool… great to see when properly config'ed, node failure works as expected.
[2:16] <sjust> just out of curiosity, can you bring the failed node back up?
[2:16] <sjust> see if we hit the same cephx problem
[2:17] <bchrisman> yeah.. will test that too.. see if I get the auth issue again.
[2:26] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[2:26] <bchrisman> heh… osds: 12 up, 12 in..
[2:26] <bchrisman> it's like key info went down with the metadata in that other config… :)
[2:27] <greglap> bchrisman: the keys do rotate, so it's possible the OSDs just got too far behind and couldn't re-join
[2:27] <bchrisman> wonder if there's a lockout condition where the keys to allow an osd in are stored on the osd? is that possible?
[2:27] <bchrisman> ahhh
[2:27] <bchrisman> that makes sense.. thanks.
[2:27] <bchrisman> it was a matter of not leaving it down too long.
[2:27] <greglap> but I'm curious — on this run what did ceph -s report before you brought the node back up?
[2:28] <greglap> last time it said osds: 8 up, 8 in
[2:28] <bchrisman> http://pastebin.com/ZC8NL3Jy
[2:28] * verwilst_ (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[2:28] <greglap> which makes me assume that the OSDs on the downed node were declared "out", not just "down"
[2:28] <greglap> yeah
[2:29] <greglap> that may also be the difference
[2:29] <sjust> bchrisman: that is what I would have expected
[2:29] <sjust> greglap: how long is it supposed to wait before marking an osd out?
[2:29] <greglap> once an OSD goes down I believe there are deliberate extra measures you have to take to bring it back in
[2:29] <greglap> not sure, it's a configurable
[2:29] <sjust> do you happen to know which one?
[2:29] <greglap> no, sorry
[2:29] <greglap> sagewk might, and could talk about the procedures for getting a node back in
[2:30] <bchrisman> if an osd wants to come back after keys rotate… then there's a way to find the current key and give it to the incoming osd's?
[2:31] <greglap> I'm afraid I really don't remember all the steps involved
[2:31] <sjust> well, you could probably use 'ceph auth add osd.4 -i keyring.osd.4' or some variant as if it were a new node
[2:31] <greglap> that might be what you're supposed to do
[2:31] <greglap> did anyone check the wiki to see if we have anything about this?
[2:31] <sjust> http://ceph.newdream.net/wiki/OSD_cluster_expansion/contraction
[2:31] <sjust> that is as close as I found
[2:31] <greglap> failure cases and repairs are definitely something that ought to be documented nicely
[2:32] * verwilst_ (~verwilst@dD576FAAE.access.telenet.be) Quit ()
[2:32] <bchrisman> yeah.. saw the expansion thing.. but wasn't sure it applied to reentry
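[Editor's note] sjust's suggestion, written out. The keyring path is hypothetical and sjust himself hedges the exact syntax ("or some variant"), so this mirrors the wiki page rather than a verified procedure:

```shell
# On a monitor host, re-register the returning OSD's key as if adding a new OSD
ceph auth add osd.4 -i /etc/ceph/keyring.osd.4

# Then mark it back "in" so data rebalances onto it
ceph osd in 4
```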
[2:33] <bchrisman> will it automatically alter the crushmap when osds ar declared out?
[2:34] <greglap> yeah, that's what "out" means, basically
[2:34] <bchrisman> ahh gotchya
[2:34] <bchrisman> cool
[2:34] <greglap> there's two states for OSDs involving the crush map: "up/down", "in/out"
[2:35] <greglap> that way an OSD can go down (planned or not) for a brief while without data rebalancing across the cluster
[2:35] <bchrisman> good good
[2:35] <greglap> so you can choose your tradeoff between reduced replication levels when an OSD goes down, and how often you rebalance
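[Editor's note] The tradeoff greglap mentions is controlled by a monitor tunable; in this era it is `mon osd down out interval`, the number of seconds a down OSD waits before being marked out and its data rebalanced. A hedged ceph.conf sketch:

```
[mon]
	; wait 5 minutes before marking a down osd "out" (300 is the default)
	mon osd down out interval = 300
```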
[2:36] <bchrisman> I see doc for ceph osd down (N)
[2:36] <greglap> yeah, that forces it down
[2:36] <bchrisman> that will also be subject to a timeout-kick?
[2:36] <greglap> basically just for testing, there's no reason to use it regularly that I can think of
[2:36] <greglap> ?
[2:37] <greglap> IIRC "ceph osd down" just makes the monitor mark an OSD down in the osd map
[2:37] <greglap> if an OSD gets a map saying it's down, it kills itself to prevent consistency issues, so it does actually die
[2:38] <bchrisman> ah
[2:38] <bchrisman> marking it down in the mon will send that map out?
[2:39] <sjust> yes
[2:39] <greglap> yeah, maps are distributed from the monitor to some sample of the OSDs (I don't remember exactly how), and then between OSDs and clients
[2:40] <greglap> I think OSDs subscribe to map updates at various points and whenever they talk to the monitor they get the newest map
[2:40] <bchrisman> ahh
[2:41] <greglap> (that is how it works between clients and OSDs, if they talk to each other they require their peer to have a map at least as new as their own is)
[2:41] <bchrisman> ahh
[2:42] <bchrisman> if they don't, the lower rev process will go hit the mon for a new map then, I assume?
[2:42] <bchrisman> thank you guys so much for your help.
[2:42] <greglap> nope, it's not so scalable to always go to the monitor
[2:42] <greglap> peers just send the map updates to whoever they're talking to
[2:43] <bchrisman> ahh okay… they're authenticated/trusted then...?
[2:43] <greglap> yeah, in the Ceph model all servers are trusted
[2:43] <greglap> clients are too, basically, although we've made some strides in that direction
[2:43] <bchrisman> ok
[3:02] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:09] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[3:11] * ajnelson (~Adium@dhcp-63-189.cse.ucsc.edu) Quit (Quit: Leaving.)
[3:23] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[3:29] * joshd (~jdurgin@adsl-75-28-69-238.dsl.irvnca.sbcglobal.net) Quit (Quit: Leaving.)
[3:43] * yx (~yx@1RDAAABLL.tor-irc.dnsbl.oftc.net) Quit (charon.oftc.net kilo.oftc.net)
[3:43] * alexxy[home] (~alexxy@ Quit (charon.oftc.net kilo.oftc.net)
[3:44] * yx (~yx@1RDAAABLL.tor-irc.dnsbl.oftc.net) has joined #ceph
[3:44] * alexxy[home] (~alexxy@ has joined #ceph
[3:56] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[4:42] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[4:44] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[4:51] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[5:07] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[5:52] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[5:52] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[7:07] * yx (~yx@1RDAAABLL.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[7:08] * yx (~yx@ip68-102-111-171.ks.ok.cox.net) has joined #ceph
[7:09] * MarkN (~nathan@ Quit (Quit: Leaving.)
[8:09] * gregorg (~Greg@ Quit (Quit: Quitte)
[8:09] * gregorg (~Greg@ has joined #ceph
[9:03] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:59] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[10:51] * allsystemsarego (~allsystem@ has joined #ceph
[11:08] * Yoric (~David@ has joined #ceph
[11:37] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[12:38] * Meths_ (rift@ has joined #ceph
[12:45] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[13:17] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[13:58] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[14:07] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[14:07] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:24] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[14:26] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:34] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:35] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[14:35] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:44] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[14:47] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[14:53] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[15:11] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[15:12] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[15:29] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[15:29] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[15:39] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:39] * Yoric (~David@ has joined #ceph
[15:47] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[15:49] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[15:52] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[15:56] * Meths_ is now known as Meths
[16:01] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[16:01] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[16:01] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[16:16] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[16:17] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[16:28] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[16:29] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[16:32] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[16:37] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[17:22] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[17:29] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:30] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[17:32] * ajnelson (~Adium@dhcp-225-235.cruznetsecure.ucsc.edu) has joined #ceph
[17:53] * greglap (~Adium@ has joined #ceph
[18:00] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:04] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:08] * ajnelson (~Adium@dhcp-225-235.cruznetsecure.ucsc.edu) Quit (Quit: Leaving.)
[18:20] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:23] * ajnelson (~Adium@dhcp-63-189.cse.ucsc.edu) has joined #ceph
[18:40] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:49] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[19:03] * cmccabe (~cmccabe@ has joined #ceph
[19:06] <jantje> hi !
[19:07] <gregaf> hey
[19:07] <jantje> you're up early? :)
[19:09] <gregaf> it's 10am here, just about everybody's at work
[19:10] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:15] * Yoric (~David@ Quit (Quit: Yoric)
[19:39] <wido> hi
[19:39] <wido> gregaf: I see you replying to e-mails around 07:00 sometimes :)
[19:39] <gregaf> I try and get up at 7, and that's often the first thing I do
[19:40] <wido> ah, ok
[20:28] <darkfader> hi
[20:29] <gregaf> hi
[20:51] <jantje> hi :)
[20:53] <cmccabe> gregaf: yeah, _PC_NAME_MAX is 255 on ext3, and PATH_MAX is 4096
[20:53] <sagewk> i think just make it PATH_MAX and be done with it. that way we don't have to bother checking what the specific limit is on this particular fs
[20:54] <sagewk> the memory isn't significant
[20:54] <cmccabe> but I read this:
[20:54] <cmccabe> http://insanecoding.blogspot.com/2007/11/pathmax-simply-isnt.html
[20:55] <cmccabe> "Now performing a test on my Linux system, I noticed that it limits a path component to 255 characters on ext3, but it doesn't stop me from making as many nested ones as I like. I successfully created a path 6000 characters long. Linux does absolutely nothing to stop me from creating such a large path, nor from mounting one large path on another. Running getcwd() in such a large path, even with a huge buffer, fails, since it doesn't work with anything past PA
[20:55] <sagewk> i think pjd tests this.. or at least the name_max part
[20:55] <cmccabe> pjd?
[20:55] <sagewk> in any case, name_max is always <= path_max, so it'll work in our case
[20:55] <gregaf> path max doesn't refer to the longest a path can get, but to the longest path you can use to change something
[20:56] <sagewk> posix unit test suite, see qa/workunits/pjd.sh
[20:56] <gregaf> it's purely for memory use, since a huge 6k-long-path-to-directory doesn't actually need to get dereferenced at any single point
[20:56] <gregaf> I'm sure if that guy was in / and tried to move something from there into his hugely-nested directory he'd find he couldn't
[20:56] <cmccabe> heh
[20:57] <cmccabe> I sincerely hope that readdir_r doesn't just blast that 6000-byte path into the buffer you give it though
[20:57] <sagewk> it returns filenames, not paths..
[20:57] <cmccabe> oh yeah
[20:58] <cmccabe> so the mount point stuff is irrelevant
[20:58] <cmccabe> yeah, perhaps PATH_MAX is the easiest way. It avoids a call to statvfs
[20:58] <sagewk> if we really want to be pedantic we can do a statfs in FileStore::mount and get the actual name_max. but it's always < PATH_MAX.
[20:59] <cmccabe> k
[20:59] <Tv|work> also see realpath(3)
[20:59] <Tv|work> ENAMETOOLONG..
[21:02] <Tv|work> readdir_r actually tells you how to handle readdir_r the right way
[21:02] <Tv|work> i mean readdir_r(3)
[21:02] <Tv|work> by asking pathconf and allocating the right amount
[21:03] <cmccabe> tv: that has a race condition because between the time you do pathconf and the time you call readdir, what is at that path could have changed
[21:04] <Tv|work> cmccabe: use fpathconf, then
[21:04] <Tv|work> and fdopendir, if necessary
[21:05] <Tv|work> or dirfd, to go the other way
[21:05] <cmccabe> tv: looks like fdopendir / dirfd are 2008 additions to POSIX
[21:06] <cmccabe> tv: which explains the wide variety of websites complaining that "there is no POSIX way to use readdir correctly!"
[21:06] <Tv|work> cmccabe: well you can always ignore opendir(3) and go for readdir(2)..
[21:06] <Tv|work> silly libc wrappers on perfectly fine syscalls ;)
[21:07] <cmccabe> tv: I never used that crazy thing. I sort of imagine it being even less portable
[21:07] <Tv|work> (yeah that's linux not posix, but i don't care much)
[21:07] <cmccabe> well anyway, mystery solved.
[21:10] <Tv|work> http://womble.decadent.org.uk/readdir_r-advisory.html
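[Editor's note] The limits under discussion are easy to inspect with POSIX getconf; the values are per-filesystem (255/4096 are what cmccabe observed on ext3, and /tmp here is just an example mount point):

```shell
getconf NAME_MAX /tmp   # longest single path component on this fs (255 on ext3)
getconf PATH_MAX /tmp   # longest relative path syscalls accept (4096 on Linux)
```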
[21:21] <Tv|work> sage: soo.. ssh to cephbooter.ceph.dreamhost.com times out. any ideas?
[21:22] <cmccabe> tv: ssh from where?
[21:23] <Tv|work> cmccabe: my desktop, ceph.newdream.net, not sure what more to try
[21:24] <cmccabe> tv: flab and flak seem to be down too?
[21:24] <Tv|work> i can get into flak
[21:25] <Tv|work> from there to cephbooter times out
[21:25] <gregaf> sage is out of the office for a while
[21:26] <Tv|work> ssh from my desktop to flab times out
[21:26] <cmccabe> doesn't seem to work from yakko either
[21:26] <gregaf> looks like for me too
[21:26] <gregaf> odd
[21:26] <Tv|work> yeah this is what stopped me yesterday
[21:26] <gregaf> they have been moving machines around down there, maybe our network broke
[21:27] <gregaf> but I think sjust has been using them...
[21:28] <cmccabe> well, clearly he should set up an openvpn for you :)
[21:28] <cmccabe> since I can't get a route to any of that stuff from yakko any more
[21:28] <cmccabe> I'm glad metropolis is still accessible
[22:05] <sagewk> fixing flab
[22:07] <darkfader> hey, would 2.6.35-22-generic from ubuntu be somewhat current enough for basic testing?
[22:08] <darkfader> i see one osd coredump at mkcephfs. if it's the kernel i'll update it instead of searching :)
[22:09] <sagewk> should be more or less okay. what's the backtrace in the osd log?
[22:09] <darkfader> 1 moment i had rebooted, will go and look
[22:11] <darkfader> these are the last two lines:
[22:11] <darkfader> 2011-01-25 16:10:15.320177 7f469e064720 filestore(/osd0) mount WARNING: no consistent snaps found, store may be in inconsistent state
[22:11] <darkfader> 2011-01-25 16:10:15.320359 7f469e064720 filestore(/osd0) mount WARNING: no journal
[22:12] <darkfader> i don't have a journal directory set in the config.
[22:12] <darkfader> like at http://ceph.newdream.net/wiki/Cluster_configuration
[22:12] <darkfader> would that make it crash?
[22:13] <darkfader> btw i love how it now logs to /var/log/ceph
[22:13] <sagewk> shouldn't, although no journal isn't a well tested config.
[22:14] <sagewk> do you have a core file?
[22:14] <sagewk> the process crashed, but there's no stack dump in the log? :/
[22:15] <darkfader> no, not stack dump, yes, got a core
[22:15] <darkfader> ah! do i need the dbg packages for the stack trace
[22:15] <sagewk> can you gdb it and get a stack trace?
[22:15] <sagewk> yeah ceph-dbg
[22:15] <darkfader> then give me 1 sec to install
[22:15] <darkfader> sorry :)
[22:16] <darkfader> i was optimistic i wouldnt need them "this time i'll start with a simple setup"
[22:19] <sagewk> cephbooter is back up too. should be all good there.
[22:19] <bchrisman> IIRC, ceph is full-data journaled not just metadata-journaled… is data journaled on both copies (when crushmap mandates two copies on different nodes)?
[22:20] <sagewk> yeah
[22:22] <bchrisman> are writes acknowledged after the journal completes write to stable store, or after the data is also committed to its final resting place?
[22:23] <sagewk> whichever comes first (usually the journal)
[22:24] <bchrisman> ahh ok.. cool
[22:25] <bchrisman> one last Q on that topic… effectively there will be four writes, log on each of two nodes and data on each of two nodes… so any one of those completing will mean I/O return complete, or one on each node?
[22:26] <sagewk> one on each node/replica. a write completes when both replica acknowledge the write.
[22:26] <bchrisman> ok cool
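sagewk's two answers combine into a simple completion rule: on each replica the journal write and the data commit race, and whichever lands first makes that replica's copy durable; the client's write completes only once every replica has reached that state. A minimal sketch of that rule (invented names, not Ceph code):

```c
#include <stdbool.h>

/* Per-replica progress of one write: the journal write and the final
 * data commit race, and whichever finishes first makes that replica's
 * copy durable ("whichever comes first (usually the journal)"). */
struct replica_state {
    bool journal_done;
    bool data_done;
};

/* One replica is durable once either write has hit stable storage. */
static bool replica_durable(const struct replica_state *r)
{
    return r->journal_done || r->data_done;
}

/* The client's write completes only when every replica is durable --
 * "a write completes when both replicas acknowledge the write". */
static bool write_complete(const struct replica_state *replicas, int n)
{
    for (int i = 0; i < n; i++)
        if (!replica_durable(&replicas[i]))
            return false;
    return true;
}
```

So with two-way replication there are up to four writes in flight (journal and data on each of two nodes), but only one acknowledgement per node is needed, and both nodes must acknowledge.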
[22:29] <darkfader> sagewk: ok, moved on a little - it dies with a segfault. the core seems not usable: gdb says
[22:29] <darkfader> "/root/core": not in executable format: File format not recognized
[22:29] <darkfader> (i installed dev and dbg stuff)
[22:30] <darkfader> and the vm has 640mb ram, need more?
[22:31] <darkfader> (but just 3 osd, 120gb total)
[22:31] <sagewk> hmm, sounds like enough. are you doing 'gdb /usr/bin/cosd /root/core'?
[22:31] <darkfader> oh. i was being a noob
[22:32] <darkfader> thanks
[22:32] <darkfader> it's right in the journaling
[22:32] <darkfader> http://pastebin.com/unM5RdNj
[22:33] <darkfader> journal become mandatory??
[22:33] <sagewk> which version of the package? 0.24.2?
[22:33] <sagewk> not intentionally :)
[22:33] <darkfader> 0.24-1
[22:33] <darkfader> i can try -2
[22:33] <sagewk> keep in mind it'll be painfully slow without it.
[22:33] <sagewk> oh, yeah that should be fixed in the latest 0.24.2-1
[22:33] <darkfader> okay then i'll rebuild
[22:34] <sagewk> there are prebuilt debs on ceph.newdream.net
[22:34] <darkfader> *blink*?
[22:34] <darkfader> since last week?
[22:34] <darkfader> thankies :)
[22:34] <sagewk> http://ceph.newdream.net/wiki/Debian
[22:34] <sagewk> since forever :)
[22:34] <darkfader> yeah but i tried deb http://ceph.newdream.net/debian/ maverick ceph-testing
[22:34] <darkfader> and got something oldish
[22:35] <sagewk> ceph-testing isn't updated regularly. ceph-stable will be the last release (this morning)
[22:35] <darkfader> lol. i didnt figure i would get something current at stable and something old at testing
[22:35] <darkfader> *ducks*
[22:37] <darkfader> ok ceph-stable works fine, i'm updating
[23:07] <darkfader> update finished and i had to kick the second osd a little for mkfs to work
[23:07] <darkfader> (rm -r /osd1 and user_xattr)
[23:07] <darkfader> now mon0 still has a hiccup
[23:07] <darkfader> root@itzibitzi02:/var/log/ceph# ceph status
[23:07] <darkfader> 2011-01-25 17:07:38.648554 mon <- [status]
[23:07] <darkfader> 2011-01-25 17:07:38.649332 mon0 -> 'unrecognized subsystem' (-22)
[23:08] <darkfader> i'm such an error magnet
[23:10] <gregaf> ceph -s
[23:10] <darkfader> it looks happy now!
[23:10] <gregaf> the ceph tool interprets extra words stuck on the end as subsystems to route commands to
[23:10] <darkfader> 2011-01-25 17:10:37.410640 pg v105: 528 pgs: 528 active+clean; 0 KB data, 1268 MB used, 16266 MB / 18035 MB avail
[23:11] <darkfader> 2011-01-25 17:10:37.412726 mds e1: 0/0/1 up
[23:11] <gregaf> ie ceph mds set_max_mds
[23:11] <darkfader> 2011-01-25 17:10:37.412821 osd e6: 2 osds: 2 up, 2 in
[23:11] <darkfader> 2011-01-25 17:10:37.413066 log 2011-01-25 17:10:33.067329 osd0 108 : [INF] 1.46 scrub ok
[23:11] <darkfader> 2011-01-25 17:10:37.413185 mon e1: 1 mons at {0=}
[23:12] <darkfader> i got that at ceph status. but i dont yet understand what you mean... it gets confused by ceph status?
[23:12] <gregaf> when you run "ceph status"
[23:12] <darkfader> ah. ceph status doesnt even exist - where did i read about that
[23:12] <gregaf> "status" isn't an option
[23:13] <gregaf> ceph --status might work, not sure
[23:13] <darkfader> yeah i just figured - i had read about it on some blog
[23:13] <darkfader> or no, one moment
[23:13] <gregaf> but if you give it words that aren't flagged as options, then it tries to interpret them as subsystems like mds or osd or pg
[23:13] <darkfader> it was ceph health i was looking for
[23:14] <darkfader> ah and that agrees with ceph -s that my mds ain't running?
[23:15] <darkfader> 0/0/1 up means one is running right?
[23:16] <gregaf> no
[23:16] <darkfader> i meant to type none
[23:16] <gregaf> the order is mdses up, mdses in, max_mds
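gregaf's reading of the mds status line can be sketched as a tiny parser: "a/b/c up" means a MDSes up, b in, with max_mds c, so "0/0/1 up" is a cluster configured for one active MDS with none running. A hedged illustration (`parse_mds_counts` is a made-up helper, not part of the ceph tool):

```c
#include <stdio.h>

/* Parse the "up/in/max_mds" triple from a status fragment like "0/0/1".
 * Order per the discussion above: MDSes up, MDSes in, then max_mds.
 * Returns 0 on success, -1 if the string doesn't match. */
static int parse_mds_counts(const char *s, int *up, int *in, int *max_mds)
{
    if (sscanf(s, "%d/%d/%d", up, in, max_mds) != 3)
        return -1;
    return 0;
}
```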
[23:16] <darkfader> will have to go to bed soon
[23:16] <darkfader> thanks
[23:17] <darkfader> simply started now when i manually tried with cmds -i 0 -c ...conf
[23:18] <darkfader> finally i got a working ceph again!
[23:37] <darkfader> 104857600 bytes (105 MB) copied, 10.3453 s, 10.1 MB/s
[23:37] <darkfader> could be worse for a worstcase setup
[23:38] <darkfader> (vmware, shared spindle under 2 osds, and no journal :)
[23:39] <sagewk> the no journal is the worst part
[23:39] <darkfader> and i didnt turn off replication too
[23:39] <darkfader> sagewk: i know, i'm positively impressed there's still 10mb left like that :)
[23:42] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:45] <bchrisman> running into a metadata issue where sometimes subdirs aren't getting returned in directory calls. This has shown up in two cases: 1) while [ ! -d foo]; do cd ..; done infinite loop while we're in a directory like /a/b/foo/c/d, 2) rm -rf fails with 'Directory not empty'.. where the rm basically missed a subdirectory five directories deep.
[23:45] <bchrisman> I'm running a 0.24rc… if there's been metadata fixes since, I can pul 0.24.2
[23:46] <bchrisman> otherwise I can take a look and try to recreate...
[23:48] <sagewk> sounds like a client side issue.
[23:48] <sagewk> can you open an issue in the tracker with steps to reproduce it?
[23:49] <bchrisman> ok
[23:51] <gregaf> bchrisman: is this in the kernel client or cfuse?
[23:51] <bchrisman> kernel client
[23:52] <bchrisman> actually.. kernel client exported via NFS...
[23:53] <sagewk> oh... does the same problem happen without nfs in the picture?
[23:53] <bchrisman> I can check…
[23:53] <bchrisman> nfs to another clustered filesystem with same setup works.
[23:53] <bchrisman> but I'll see if I can get it mounted locally.
[23:53] <bchrisman> errr… 'ceph locally'

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.