#ceph IRC Log


IRC Log for 2011-08-18

Timestamps are in GMT/BST.

[0:01] <slang> http://fpaste.org/bcxa/
[0:04] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[0:07] <sagewk> slang; hrm. how about the last ~30 lines prior to the failed assert?
[0:09] <gregaf> lxo: have you checked the wiki? I don't know it offhand but yes, there are ways to do that… ;)
[0:10] <lxo> I tried but it was down. so I resorted to the ultimate documentation. the way I could find (and I'm trying now) is to inject a default ruleset in the configuration before creating the pool
[0:11] <Tv> lxo: wiki is back up, it had some temporary db connectivity issue
[0:13] <lxo> Tv, great, thanks
[0:14] <Tv> i don't see much about crush maps for individual pools in there, and i just (still) don't know the crush syntax & tools very well
[0:16] <gregaf> I think maybe what you want to do is get the crushmap out of the cluster, edit it to add your new placement rules, inject it back in, and then the pool creation tools should let you specify the crushrule to use when creating it?
[0:16] <gregaf> sagewk has actually done this, unlike me
[0:17] <sagewk> right. iirc the ceph mkpool lets you specify the crush ruleset to use for the new pool
[0:18] <lxo> ceph mkpool?!?
[0:18] <Tv> yeah ok so there's exactly one crush config, and a pool can choose a ruleset within that config
[0:19] <Tv> rados mkpool, probably
[0:19] <lxo> yeah, I got the crush config in place, but I can't get ceph osd pool create to use it
[0:19] <Tv> mkpool <pool-name> [123[ 4]] create pool <pool-name>'
[0:19] <Tv> [with auid 123[and using crush rule 4]]
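The workflow gregaf and Tv describe above — pull the crushmap out of the cluster, edit in a new rule, inject it back, then create a pool that uses that ruleset — can be sketched as follows. This is a hedged sketch, not verbatim from the log: the pool name `mirror`, ruleset number `4`, and file names are illustrative, and exact flags may differ between Ceph versions of this era.

```shell
# 1. Extract and decompile the current crushmap.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# 2. Edit crushmap.txt to add the new placement rule, then
#    recompile and inject it back into the cluster.
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new

# 3. Create a pool using the new ruleset; per the rados usage quoted
#    above, the trailing arguments are [auid [crush-ruleset]].
rados mkpool mirror 0 4
```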
[0:19] <slang> http://fpaste.org/IEga/
[0:19] <slang> sagewk: that's the last 200 lines of the log
[0:20] <slang> sagewk: oh, also, those were originally all different filenames
[0:21] <sagewk> slang: hmm, are you running with commit 8c5e7dcf8cf7f3daa65eb9905a63014ed92c5505 ?
[0:22] <sagewk> i think damien hit this last week
[0:22] <johnl_> hi all. just a bit of a nudge on this ticket. Still having the same problem, even with more recent builds: http://tracker.newdream.net/issues/1376
[0:23] <sagewk> johnl_: tnx, sticking it on my list
[0:23] <slang> sagewk: is that from today?
[0:23] <slang> looks like no
[0:24] <sagewk> last week i think
[0:24] <slang> (not running with that)
[0:24] <sagewk> ok i think that fixes the same bug
[0:24] <johnl_> s: ta!
[0:24] <gregaf> it's in v0.33
[0:24] <slang> sagewk: ok cool
[0:25] <lxo> thanks, that worked. yay, one more ceph command for me to learn about
[0:32] <slang> sagewk: I wish I understood the code better so that I could fix these issues myself
[0:32] <slang> not there yet though
[0:33] <sagewk> slang: the snapshot stuff is pretty obtuse, and poorly documented
[0:33] <sagewk> slang: the rule here is never call issue_caps() on a non-head inode (we don't issue caps on anything snapped)
[0:48] * greglap (~Adium@ has joined #ceph
[0:57] <lxo> hmm, the ioctl issued by cephfs set_layout doesn't seem to like me at all; tried both cfuse 0.32 and ceph.ko. known problem? known solution?
[1:02] <slang> that makes sense -- it took me a while to grok that head inodes were just non-snapshot inodes
[1:02] <Tv> lxo: i will tell you this; that's not part of our regular test suite yet, it might be broken :-/
[1:03] <Tv> oh actually qa/workunits/kclient/file_layout.sh does something
[1:03] <greglap> it's not going to work in cfuse
[1:03] <greglap> should work in the kclient but not sure when it was added
[1:03] <lxo> booting server into 3.0
[1:03] <lxo> err one client
[1:10] <lxo> nope, didn't work either :-( oh, well... back to the drawing board, I guess
[1:12] <greglap> lxo: how is it breaking?
[1:13] <lxo> # cephfs /media/Shared/l/mirror set_layout => Error setting layout: Invalid argument
[1:14] <lxo> arguments tried were -p 3, with or without other values taken from show_layout from the root
[1:14] <lxo> mirror is a directory
[1:18] <sagewk> greglap: let's add a test for this to file_layout.sh while we're poking at it
[1:19] <greglap> yeah, I'm trying to work out now the last time it would have been tested
[1:19] <lxo> so I don't have to report a bug, eh? :-)
[1:22] <sagewk> yes? http://ceph.newdream.net/git/?p=ceph.git;a=commitdiff;h=11979f82937b628a2334c3359bfb24a3e3ea49ea
[1:22] <sagewk> this is re: http://tracker.newdream.net/issues/1397
[1:23] <Tv> i think our test suite doesn't include the layout tests yet
[1:24] <Tv> as in i'm staring at ceph-qa-suite.git and don't see the word "layout" anywhere
[1:24] <sagewk> it includes misc/ + kclient, which has file_layout.sh
[1:24] <sagewk> kclient_workunit_misc.yaml
[1:25] <Tv> sagewk: but file_layout.sh is in qa/workunits/kclient/ not in misc/
[1:26] <sagewk> dar
[1:26] <Tv> sagewk: perhaps there should be tasks/kclient_workunit_kclient.yaml
[1:26] <sagewk> yep
[1:27] <sagewk> oh, there is
[1:28] <Tv> oh my blindness this time
[1:29] <sagewk> in any case, that test passes.
[1:29] <sagewk> but it doesn't exercise the pool stuff
[1:29] <Tv> lxo: oh hey let's talk about the arguments you used more
[1:29] <Tv> mkpool <pool-name> [123[ 4]] create pool <pool-name>'
[1:29] <Tv> [with auid 123[and using crush rule 4]]
[1:30] <Tv> lxo: can you show the exact line, so we're sure we can reproduce the same bug?
[1:32] <lxo> Tv, rados mkpool mirror => pool #3; mkdir /cephmnt/mirror; cephfs /cephmnt/mirror set_layout -p 3 => fail
[1:33] <Tv> i wonder if that really wanted the -p
[1:33] <lxo> cephfs /cephmnt/mirror set_layout -p 3 -s 4194304 -u 4194304 -c 1 -o -1 => same failure
[1:34] <lxo> it accepted it all right. it was the ioctl syscall that returned an error
[1:34] <Tv> lxo: what's the error?
[1:34] <Tv> (sorry just trying to extract all the info so i'm sure we see the same thing)
[1:34] <lxo> => Error setting layout: Invalid argument
[1:35] <lxo> (while playing with the flags trying to set them to zero, I noticed it complains about most flags being zero, even -o, which should accept zero IMHO)
[1:36] <lxo> I'm still running 0.32, FWIW. 0.33 is already built, but not yet installed
[1:37] <lxo> (even rebooting one of the machines was scary at this point; I have one of the machines serving an old version of the filesystem, copying it all to the two other machines serving a new version of the filesystem)
[1:37] <lxo> I decided to re-create the filesystem from scratch after facing a major btrfs snafu involving external and internal disks
[1:37] <greglap> hmmm, I can't get anything to work on current masters
[1:38] <greglap> did we decide that it is working on sepia right now?
[1:39] <lxo> I could verify that all of the data in the filesystem was ok, but I suspect some metadata wasn't, because I kept on getting crashes upon mds restart after modifying certain files
[1:39] <sagewk> greglap: i just ran file_layout.sh on my uml successfully
[1:40] <greglap> hmmm, I was trying to recreate by hand on mine and couldn't get stuff to go
[1:40] <greglap> might have done something wrong though
[1:40] <greglap> I'll have to look into it tomorrow
[1:41] * greglap (~Adium@ Quit (Quit: Leaving.)
[1:45] * lxo (~aoliva@659AADOWU.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[1:46] <sagewk> lxo: oh, i see the problem.
[1:46] <sagewk> the mdsmap lists data pools that are allowed, and the new pool has to be in that list
[1:49] * lxo (~aoliva@82VAAC8Q5.tor-irc.dnsbl.oftc.net) has joined #ceph
[1:50] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:51] <sagewk> lxo: oh, i see the problem.
[1:51] <sagewk> the mdsmap lists data pools that are allowed, and the new pool has to be in that list
[1:51] <sagewk> lxo: about to push a fix
[1:58] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Quit: Leaving)
[1:58] <sagewk> lxo: just pushed 92746916dc19ef155b943c326ab0f8062811edac, which should fix your problem.
[1:59] <sagewk> lxo: you need to do 'ceph mds add_data_pool <poolid>' before you'll be able to use it via cephfs
[1:59] <sagewk> lxo: ceph mds dump -o - to verify that it's listed as a valid data pool
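Putting sagewk's fix together with lxo's earlier commands, the full sequence looks roughly like the sketch below. The pool name `mirror`, pool id `3`, and mountpoint are taken from the conversation as examples; this assumes the fix in 92746916 is installed.

```shell
rados mkpool mirror                    # create the pool (say it becomes pool 3)
ceph mds add_data_pool 3               # add it to the mdsmap's allowed data pools
ceph mds dump -o -                     # verify pool 3 is listed as a data pool
mkdir /cephmnt/mirror
cephfs /cephmnt/mirror set_layout -p 3 # the ioctl should now succeed
```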
[2:00] <lxo> fetching, thanks
[2:02] * u3q (~ben@jupiter.tspigot.net) Quit (Ping timeout: 480 seconds)
[2:41] * huangjun (~root@ has joined #ceph
[2:54] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[3:02] <lxo> sagewk, thanks, it worked
[3:02] <sagewk> lxo: yay!
[3:02] <lxo> I still find it slightly annoying that I have to override *all* settings, instead of just the pool, but I can live with that
[3:17] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[4:13] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[4:58] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:00] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit ()
[5:24] * _Shiva_ (shiva@whatcha.looking.at) Quit (Ping timeout: 480 seconds)
[5:27] * _Shiva_ (shiva@whatcha.looking.at) has joined #ceph
[7:32] * atg (~atg@please.dont.hacktheinter.net) Quit (Ping timeout: 480 seconds)
[8:21] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[11:33] * todin (tuxadero@kudu.in-berlin.de) Quit (Read error: Connection reset by peer)
[11:35] * todin (tuxadero@kudu.in-berlin.de) has joined #ceph
[12:53] * huangjun (~root@ Quit (Quit: Lost terminal)
[15:21] * verwilst (~verwilst@dD576F8E2.access.telenet.be) has joined #ceph
[16:31] * deksai (~chris@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[16:32] * deksai (~chris@96-35-100-192.dhcp.bycy.mi.charter.com) Quit ()
[16:54] <slang> pg v65565: 7960 pgs: 72 active, 6044 active+clean, 408 peering, 351 crashed+peering, 66 down+peering, 16 active+clean+degraded, 953 crashed+down+peering, 5 degraded+peering, 4 crashed+degraded+peering, 2 down+degraded+peering, 39 crashed+down+degraded+peering; 701 GB data, 428 GB used, 81792 GB / 83818 GB avail; 7749/771263 degraded (1.005%)
[16:54] <slang> pgs seem to be stuck in those states
[16:55] <slang> anything I can do to get them to stop peering and become active?
[16:55] <slang> (peering pgs are preventing the mds from being able to replay)
[17:41] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:48] <royh> is there a good way to have the same filesystem available on multiple different boxes?
[17:53] * greglap (~Adium@ has joined #ceph
[18:06] <greglap> slang: are all your OSDs up?
[18:08] <slang> greglap: not all
[18:08] <slang> slang: 30/36
[18:09] <greglap> the cluster probably thinks there's missing data on those OSDs it needs to wait for
[18:09] <slang> greglap: ok
[18:10] <slang> greglap: any way to tell it to stop waiting?
[18:10] <greglap> sjust or sagewk would be able to talk about actually checking for that better than me, though
[18:10] <greglap> and again: yes, but you'll need sjust or sagewk to tell you how ;)
[18:11] <greglap> royh: I'm not sure what your question means, since I'm presuming you want an answer other than "use NFS or Ceph or Gluster or..."
[18:13] <slang> I crashed two nodes (with multiple osds on each), I brought one back, but not the other. It looks like the pgs that are crashed+peering or crashed+down+peering are all pgs that only have a replication factor of 2
[18:15] <slang> actually, nevermind -- some of the pgs with a rep factor of 3 are also in crashed+peering
[18:15] <greglap> yeah, I wouldn't expect the replication factor to be a problem here unless you had multiple replicas on the same physical node (you do have your crushmap set up to avoid that, right?)
[18:16] <slang> greglap: yes, crushmap specifies which osds are on which hosts
[18:16] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:17] * verwilst (~verwilst@dD576F8E2.access.telenet.be) Quit (Quit: Ex-Chat)
[18:17] <greglap> although with only two nodes having gone down it is harder to construct a scenario where there's possible lost data, hmmm
[18:19] <sagewk> slang: was there any flapping going on, or did they get marked down just once?
[18:19] <slang> um
[18:20] <slang> initially just once, but I did restart the ones on the node I brought back up, because they remained marked down, even though the cosd processes were running
[18:21] <slang> and there were a few osds that crashed on various other nodes when that happened that had to be restarted
[18:21] <sagewk> slang: try this, 'ceph osd out N', then 'ceph osd lost N' where N is the osd node that is still down
[18:22] <slang> ok
[18:23] <slang> had to use --yes-i-really-mean-it
[18:24] <slang> :-)
[18:24] <sagewk> :)
[18:25] <sagewk> it's because the history is such that it is possible that node was the only surviving replica for those pgs and could conceivably have written data. (at least it can't prove that didn't happen.)
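For reference, the sequence sagewk prescribed and slang ran amounts to the following, repeated for each cosd id on the permanently failed node. This is a sketch of the commands as discussed above; `N` is a placeholder, and the confirmation flag is required because the cluster may have to give up data.

```shell
# Tell the cluster osd N will not be coming back, for each N on the dead node.
ceph osd out N
ceph osd lost N --yes-i-really-mean-it
```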
[18:29] <slang> looks like I still have a bunch of pgs in peering
[18:29] <slang> pg v67190: 7960 pgs: 72 active, 6061 active+clean, 418 peering, 366 crashed+peering, 58 down+peering, 16 active+clean+degraded, 869 crashed+down+peering, 7 degraded+peering, 6 crashed+degraded+peering, 2 down+degraded+peering, 29 crashed+down+degraded+peering, 56 active+clean+scrubbing+repair; 689 GB data, 423 GB used, 79044 GB / 81024 GB avail; 7769/921308 degraded (0.843%)
[18:29] <sagewk> oh wait, so the failed node had multiple cosds on it?
[18:29] <slang> sagewk: yes
[18:30] <sagewk> did you have a crush rule to separate replicas across nodes?
[18:30] <slang> sagewk: yes
[18:31] <sagewk> slang: ok, do 'ceph pg dump -o - | grep peering', and pick a random pg (say, the first). you'll see the osds it maps to in brackets.
[18:31] <greglap> looks like more of them have moved into peering, though
[18:31] <slang> hmm
[18:31] <slang> maybe I didn't have a crush rule setup
[18:32] <greglap> it's just going to take a while since each OSD is going to limit the number of in-progress PG peering processes
[18:32] <slang> I thought I added one when I did mkcephfs
[18:32] <sagewk> for the first osd, do ceph osd tell N injectargs '--debug-osd 20 --debug-ms 1', tail-f the log | grep pgid, and then 'ceph osd down N', and we'll see exactly why peering isn't completing
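sagewk's debugging recipe above can be sketched as a concrete sequence. `N` and `PGID` are placeholders (the first OSD in the bracket list of a stuck pg, and that pg's id), and the log path is an assumption about a typical install, not something stated in the log.

```shell
ceph pg dump -o - | grep peering                  # pick a stuck pg; note its OSDs
ceph osd tell N injectargs '--debug-osd 20 --debug-ms 1'
tail -f /var/log/ceph/osd.N.log | grep PGID &     # assumed log location
ceph osd down N                                   # force re-peering and watch where it stalls
```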
[18:32] <slang> and --crushmapsrc
[18:33] <sagewk> greglap: it's probably because i had him do ceph osd out N.. which probably wasn't strictly necessary but i wasn't certain.
[18:33] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:33] <slang> sagewk: ok
[18:33] <sagewk> there isn't actually any peering throttling, that's only on the recovery/object migration (which happens after the pgs go active)
[18:34] <greglap> sagewk: no, I mean the numbers did move away from stuck and into peering and active
[18:34] <greglap> I guess I got confused at some point about the throttling, though
[18:35] <greglap> or the reporting of peering statuses, rather
[18:35] <sagewk> slang: did you do the lost thing on all of the cosd's on that node?
[18:35] <slang> sagewk: yes
[18:35] <sagewk> k
[18:41] <slang> http://fpaste.org/JwVi/
[18:41] <slang> that's the beginning
[18:42] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:43] <sagewk> which osd(s) are the ones that are lost?
[18:43] <slang> 0-5
[18:45] <sagewk> are other nodes going up and down?
[18:45] <sagewk> looks like 13 went down in epoch 1449
[18:47] <slang> yeah 13 went down 6 minutes ago
[18:47] <slang> ../../src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)', in thread '0x7f7c98914700'
[18:47] <slang> ../../src/osd/PG.cc: 3891: FAILED assert(0 == "we got a bad state machine event")
[18:47] <slang> ceph version (commit:)
[18:48] <slang> same error as #1403
[18:51] <slang> they're dropping like flies now..
[18:52] <slang> http://fpaste.org/urAQ/
[18:58] <slang> also got that same segfault on osd14
[18:58] <slang> 5 minutes later
[19:01] <sagewk> slang: hrm, seen both of those crashes a couple of times now. do you have logs on those two machines by any chance?
[19:02] <slang> the core file for osd14 is: http://dl.dropbox.com/u/18702194/osd14.core.gz
[19:03] <sagewk> (both bugs should be pretty easy fixes with a matching log)
[19:04] <sagewk> slang: the core file won't help us too much without a matched exec, and even then it can be hard if the environment/libs are too different. a gdb backtrace and line number for the actual error is usually just as good
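As sagewk notes, a backtrace generated on the machine that produced the core is matched to the exact binary and libraries, so it is far more useful than shipping the core file itself. A minimal sketch, assuming the daemon binary lives at `/usr/bin/cosd` and the core file name is illustrative:

```shell
gdb --batch -ex 'bt' -ex 'thread apply all bt' /usr/bin/cosd core.12345 > osd14.bt.txt
```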
[19:04] <slang> the log is: http://dl.dropbox.com/u/18702194/osd.14.log
[19:05] <sagewk> oh, do you by chance have commit 77c780df54a8151195c34241c7858e0063d0edf4 ?
[19:05] <slang> http://fpaste.org/WgIw/
[19:05] * cmccabe (~cmccabe@ has joined #ceph
[19:06] <sagewk> that fixed an issue with the heartbeats that may have triggered all that flapping just now
[19:06] <sagewk> that last paste gets Error 500: Sorry, you broke our server. You might have reached the 512KiB
[19:07] <slang> oh
[19:07] <slang> sagewk: looks like I have that commit, yes
[19:08] <slang> http://pastebin.com/raw.php?i=fYFGnVPJ
[19:08] <slang> that's the stack trace from core file
[19:08] <slang> (it should have occurred to me that core files aren't useful to you without the same binary/arch)
[19:10] <slang> gotta head out for a bit
[19:10] <slang> be back in an hour or so
[19:10] <sagewk> ok, think i have a fix for that crash at least. if you can get cores/ logs for the others that'd be awesome
[19:12] <sagewk> back in ~30
[19:21] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[19:58] * aliguori (~anthony@ has joined #ceph
[19:58] * aliguori (~anthony@ Quit (Read error: Connection reset by peer)
[20:25] <sagewk> slang: commits 8ce6544764452daa6581f12c7c83790b1dd54a5f and a88c1790ffb7b708b334a6d9a1deacc300e933be hopefully address the two crashes we're seeing
[20:25] <sagewk> both are in ceph.git master
[22:36] * lxo (~aoliva@82VAAC8Q5.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[22:50] * lxo (~aoliva@9KCAAAC6H.tor-irc.dnsbl.oftc.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.