#ceph IRC Log


IRC Log for 2012-02-22

Timestamps are in GMT/BST.

[0:00] <pulsar> ok. let me script that
[0:00] <pulsar> grep peering pg_dump.txt | sed 's/\t.*//' | wc -l
[0:00] <pulsar> 1190
[0:00] <pulsar> :)
[0:00] <sjust> yup
[0:00] <sjust> for i in `grep peering pg_dump.txt | sed 's/\t.*//'`; do ceph pg force_create_pg $i; done
[0:01] <pulsar> running...
[0:01] <pulsar> this will take a while
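The one-liner above can be wrapped in a small helper. This is only a sketch: it assumes the saved `ceph pg dump` output is tab-separated with the pgid in the first column (which the sed expression in the log implies), and the function name `pgids_in_state` is made up for illustration. The actual `ceph pg force_create_pg` call is left commented out so the block runs without a cluster.

```shell
#!/bin/sh
# Print the pgid (first tab-separated column) of every line in a saved
# `ceph pg dump` file that mentions the given state.
# Assumption: tab-separated dump, pgid in column 1.
pgids_in_state() {
    # $1 = state to match (e.g. peering), $2 = path to the saved dump
    awk -F '\t' -v state="$1" 'index($0, state) { print $1 }' "$2"
}

# Usage against a live cluster, as in the log (not executed here):
# for pg in $(pgids_in_state peering pg_dump.txt); do
#     ceph pg force_create_pg "$pg"
# done
```

Counting the affected PGs first, as pulsar did with `wc -l`, is a cheap sanity check before looping over more than a thousand of them.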
[0:02] <pulsar> 2012-02-22 00:02:20.267113 pg v1382: 7920 pgs: 62 creating, 6734 active+clean, 1124 peering; 7837 bytes data, 39743 GB used, 63602 GB / 100 TB avail
[0:02] <pulsar> 2012-02-22 00:02:21.293248 pg v1383: 7920 pgs: 63 creating, 6734 active+clean, 1123 peering; 7837 bytes data, 39743 GB used, 63602 GB / 100 TB avail
[0:02] <pulsar> 2012-02-22 00:02:22.321967 pg v1384: 7920 pgs: 64 creating, 6734 active+clean, 1122 peering; 7837 bytes data, 39743 GB used, 63602 GB / 100 TB avail
[0:02] <pulsar> 1/sec
[0:02] <pulsar> ~
[0:03] <pulsar> so, the reason was somehow faulty hardware?
[0:03] <pulsar> node is up for 120 days and i did a lot of heavy network jobs on that machine
[0:04] * aliguori_ (~anthony@ Quit (Quit: Ex-Chat)
[0:05] <pulsar> i'll take that down and have a closer look at the stability - don't expect to find much though. running ecc ram and 0 visible I/O problems.
[0:07] <sjust> pulsar: it's hard to say, the basic problem from ceph's point of view is that the new primary for a pg is notified by old holders of that pg to prevent every osd from having to scan all pgs
[0:08] <sjust> in this case, the only holder was that dead pg
[0:08] <pulsar> i wonder why this problem arises with a higher number of initial nodes
[0:08] <pulsar> perhaps if i expand the cluster by adding new nodes in batches of 10...
[0:08] <pulsar> and start small
[0:09] <pulsar> i need to double the number of data nodes
[0:09] <pulsar> errr
[0:09] <pulsar> osds
[0:09] <sjust> *the dead osd
[0:09] <pulsar> targeting ~210tb
[0:09] <pulsar> evaluating if ceph can take a billion inodes
[0:09] <sjust> pulsar: we've had >100 node clusters
[0:10] <sjust> >1.5pb
[0:10] <pulsar> <3
[0:10] <sjust> pulsar: we should figure out where you are hitting trouble
[0:10] <pulsar> we'll see in a couple of minutes if the cluster comes back online
[0:11] <pulsar> after that i can reformat it and try again
[0:11] <sjust> pulsar: was there memory pressure on osd12?
[0:11] <pulsar> not really, let me give you the graphs...
[0:12] <sjust> pulsar: the osd log from osd12 would also help
[0:12] <pulsar> a sec...
[0:13] <pulsar> http://dl.dropbox.com/u/3343578/node-13.html
[0:14] <pulsar> ignore the peaks before.... 7pm or so
[0:14] <pulsar> i was deleting previous copy of a file system i stress-tested (filled up to the max)
[0:15] <pulsar> so, once the disk usage on /mnt/data2 got down to 35% i reinitialized the fs
[0:15] <pulsar> logs coming up
[0:16] <pulsar> http://dl.dropbox.com/u/3343578/logs/ceph.log
[0:17] <pulsar> http://dl.dropbox.com/u/3343578/logs/ceph.log.1.gz
[0:17] <pulsar> up to 4
[0:17] <pulsar> ah
[0:18] <pulsar> according to the log, i started the FS after formatting at 20:41
[0:18] <pulsar> and restarted the node manually after it crashed at 21:19
[0:19] <pulsar> the archived logs are of no interest, they belong to an old fs
[0:20] <pulsar> 50 peers to go
[0:22] <pulsar> script is done
[0:22] * __nolife (~Lirezh@83-64-53-66.kocheck.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[0:22] <pulsar> 2012-02-22 00:22:26.020478 pg v2554: 7920 pgs: 1013 creating, 6907 active+clean; 7837 bytes data, 39743 GB used, 63602 GB / 100 TB avail
[0:22] <pulsar> 2012-02-22 00:22:26.039298 mds e3: 1/1/1 up {0=1=up:creating}
[0:23] <darkfader> 100TB... nice :)
[0:24] <pulsar> right now it is 100tb of dead space :)
[0:24] <darkfader> (ok, once no more osd's die it will be nicer)
[0:25] <sjust> pulsar: is the number of creating pgs going down?
[0:25] <pulsar> ... still creating and a couple of "scrub ok" messages
[0:25] <pulsar> nope
[0:25] <pulsar> 1013
[0:25] <sjust> hmm, ok
[0:25] <pulsar> rock solid
[0:25] <sjust> this is a different bug
[0:25] <pulsar> \o/
[0:26] <pulsar> i am so used to it :)
[0:26] <sjust> post output of: 'ceph osd getmap -o /tmp/tmpmap; osdmaptool --test-map-pg 2.69 /tmp/tmpmap'
[0:26] <pulsar> after running hbase and cassandra's early releases and patching the crap out of it in the middle of the night, that does not even make me wonder
[0:27] <pulsar> got it
[0:27] <pulsar> ceph osd getmap -o /tmp/tmpmap; osdmaptool --test-map-pg 2.69 /tmp/tmpmap
[0:27] <pulsar> 2012-02-22 00:27:11.839738 mon <- [osd,getmap]
[0:27] <pulsar> 2012-02-22 00:27:11.840252 mon.0 -> 'got osdmap epoch 134' (0) wrote 21091 byte payload to /tmp/tmpmap
[0:27] <pulsar> osdmaptool: osdmap file '/tmp/tmpmap' parsed '2.69' -> 2.69
[0:27] <pulsar> 2.69 raw [15,39] up [15,39] acting [15,39]
[0:27] <pulsar> need that payload?
[0:27] <sjust> nope
[0:27] <sjust> so, it's supposed to be mapped to osd15 and osd39
[0:28] <sjust> are they healthy?
[0:28] <pulsar> as in online?
[0:28] <sjust> did the daemons crash/hang, basically
[0:28] <pulsar> does not look like that, nope.
[0:28] <pulsar> but let me have a closer look
[0:29] <pulsar> 2012-02-22 00:22:34.595615 7f22e232f700 log [INF] : 0.48a scrub ok
[0:29] <pulsar> 2012-02-22 00:22:35.596087 7f22e232f700 log [INF] : 1.489 scrub ok
[0:29] <pulsar> 2012-02-22 00:22:40.596376 7f22e232f700 log [INF] : 2.488 scrub ok
[0:29] <pulsar> 2012-02-22 00:28:13.260802 7f22e5436700 osd.15 134 OSD::ms_handle_reset()
[0:29] <pulsar> 2012-02-22 00:28:13.260833 7f22e5436700 osd.15 134 OSD::ms_handle_reset() s=0x1b04120
[0:29] <pulsar> still alive
[0:30] <pulsar> at least the process is there
[0:31] <pulsar> yep
[0:31] <pulsar> both of them ok
[0:31] <sjust> ok, could you post the last hour of osd.15's log?
[0:31] <pulsar> osd.39 has no "ms_handle_reset()" messages though
[0:31] <pulsar> yeah
[0:31] <sjust> or, wait
[0:31] <sjust> hang on
[0:32] <sjust> nvm, could you post the last hour of osd.15's log?
[0:32] <pulsar> yup
[0:33] <sjust> actually, you don't seem to have osd logging on
[0:33] <pulsar> http://dl.dropbox.com/u/3343578/logs/osd15.log
[0:33] <sjust> could you turn on osd logging for osd15 to 20 and restart the daemon?
[0:33] <pulsar> sure
[0:35] <pulsar> i'll wait for the journal to finish loading?
[0:35] <pulsar> 2012-02-22 00:34:14.268772 7f8d32067780 journal _open /mnt/data2/ceph/osd.15.journal fd 16: 1048576000 bytes, block size 4096 bytes, directio = 1
[0:35] <pulsar> 2012-02-22 00:34:44.686472 7f8d257c9700 journal throttle: waited for ops
[0:35] <pulsar> 2012-02-22 00:34:44.853912 7f8d257c9700 journal throttle: waited for ops
[0:35] <sjust> yeah, give it about 5 minutes
[0:35] <sjust> # creating still holding steady?
[0:36] <pulsar> pg v2679: 7920 pgs: 906 creating,
[0:36] <pulsar> got a bit lower
[0:36] <pulsar> after restarting the osd
[0:36] <pulsar> does not move since though
[0:36] <pulsar> scrubbing now
[0:37] <pulsar> scrubbing...
[0:37] <pulsar> what does that mean anyway?
[0:37] <pulsar> new terminology to me
[0:38] <pulsar> a point in the right wiki direction or something would be perfectly fine :)
[0:38] <gregaf> pulsar: scrubbing is when the OSDs are checking that the contents of the PG on each OSD match the other replicas
[0:39] <gregaf> anyway, given that, the cheap way to fix your problem is to restart all the OSDs one at a time
[0:39] <gregaf> though that doesn't tell us exactly what went wrong here
[0:39] <gregaf> let me read back through the log and catch up a bit more
[0:39] <pulsar> i see
[0:44] <pulsar> 906 creating / scrubbing
[0:44] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[0:45] * __nolife (~Lirezh@83-64-53-66.kocheck.xdsl-line.inode.at) has joined #ceph
[0:50] <pulsar> a lot of these now:
[0:50] <pulsar> 2012-02-22 00:49:47.618906 7f8d1eaa9700 -- >> pipe(0x1bdc000 sd=77 pgs=114 cs=1 l=0).fault with nothing to send, going to standby
[0:50] <gregaf> pulsar: yeah, that's pretty normal
[0:50] <pulsar> so far the number of creating pgs does not move a bit
[0:50] <gregaf> it happens when the daemons don't have anything to communicate to each other, so they drop silent, and eventually the messaging system shuts it down
[0:52] <gregaf> okay, try just "ceph pg send_pg_creates"
[0:52] <gregaf> pulsar: ^
[0:52] <pulsar> sent
[0:53] <pulsar> does not change much yet
[0:53] <pulsar> 2012-02-22 00:52:47.984735 7fb1e81dc700 mon.1@0(leader) e1 handle_command mon_command(pg send_pg_creates v 0) v1
[0:54] <gregaf> does the monitor send any messages out after that?
[0:54] <gregaf> I wasn't expecting much, just hoping
[0:55] <pulsar> not according to the log
[0:55] <pulsar> just a bunch of ms_handle_reset
[0:57] <gregaf> okay, bummer
[1:01] <gregaf> actually, can you check your monitor logs for osd_pg_create to make sure they got sent out in the past?
[1:02] <pulsar> let me see...
[1:02] <pulsar> cat ceph.log | egrep osd_pg_create | wc -l
[1:02] <pulsar> 0
[1:03] <gregaf> can you run it on each of your monitors?
[1:03] <gregaf> it should only show up on one of them :)
[1:03] <pulsar> i have only one monitor
[1:03] <gregaf> oh
[1:03] <gregaf> did you pastebin your config somewhere? I didn't see it
[1:03] <gregaf> and would like to :)
[1:03] <pulsar> oh, as in "oh oh"?
[1:03] <pulsar> nope, its rolled out via puppet
[1:03] <pulsar> but... let me grab it
[1:04] <gregaf> actually I think those messages are only logged with debug_ms = 1, which you might not have
[1:04] <gregaf> argh
[1:04] <gregaf> I hate remote debugging
[1:04] <pulsar> yeah, i am trying to find anything free for online desktop sharing. seems teamviewer has removed the presentation feature from the free version
[1:05] <pulsar> and setting up vnc on a mac was a bit painful last time i tried. damn hippies.
[1:06] <pulsar> http://dl.dropbox.com/u/3343578/logs/ceph.conf
[1:07] <gregaf> okay, that's very vanilla, but indeed doesn't have any debugging
[1:07] <gregaf> no wonder it's not showing the messages, then
[1:07] <pulsar> so, restart the whole cluster with debug = 20?
[1:07] <gregaf> no
[1:07] <sjust> pulsar: just debug mon = 20 and debug filestore = 20
[1:08] <pulsar> k
[1:08] <gregaf> filestore isn't going to do much without seeing what the upper layers are doing :)
[1:08] <pulsar> 2012-02-22 01:08:22.649294 mon.0 -> 'unrecognized subsystem' (-22)
[1:08] <pulsar> node@node-1:~$ sudo ceph debug filestore = 20
[1:08] <pulsar> ah
[1:08] <pulsar> nvm
[1:08] <gregaf> sorry, it's more complicated than that
[1:09] <pulsar> so, new config and restart, right?
[1:09] <gregaf> to do the mon you can run "ceph injectargs debug_filestore = 20" (I think, let me know if it fails)
[1:09] <pulsar> oh, cool.
[1:09] <gregaf> wait, no, that's full of typos
[1:09] <gregaf> "ceph mon injectargs debug_mon = 20"
[1:10] <pulsar> node@node-1:~$ sudo ceph mon injectargs debug_mon = 20
[1:10] <pulsar> 2012-02-22 01:10:18.441217 mon <- [mon,injectargs,debug_mon,=,20]
[1:10] <pulsar> 2012-02-22 01:10:18.441755 mon.0 -> 'unknown command injectargs' (-22)
[1:10] <gregaf> dammit, let me check the syntax...
[1:11] <sjust> pulsar: you'll be happy to hear the admin tool improvements are coming down the pipe Real Soon Now
[1:11] <pulsar> hehe
[1:11] <pulsar> well... it depends if i choose to stick with ceph or not :)
[1:11] <gregaf> okay, the format is "ceph mds tell '(mds name)' injectargs --debug_mds 20"
[1:11] <gregaf> so for this case "ceph mon tell 0 injectargs --debug_mon 20" should do it
[1:11] <gregaf> I think ;)
[1:12] <pulsar> \o/
[1:12] <pulsar> works
[1:12] <gregaf> yay
[1:12] <pulsar> but.. no noise in the logs
[1:12] <gregaf> can you run "ceph pg send_pg_creates" again?
[1:13] <gregaf> wait, no, I want to see the messenger debugging too
[1:13] <pulsar> 2012-02-22 01:13:03.176504 7fb1e81dc700 mon.1@0(leader) e1 handle_command mon_command(pg send_pg_Creates v 0) v1
[1:13] <gregaf> "ceph mon tell 0 injectargs --debug_ms 20"
[1:13] <gregaf> I mean "ceph mon tell 0 injectargs --debug_ms 1"
[1:13] <gregaf> 20 would be a lot
[1:14] <gregaf> though having done that already if you grep the log for "pg_create" you should hopefully find some useful output
[1:14] <pulsar> ok. set. re-running send_pg_creates
[1:14] <pulsar> 2012-02-22 01:14:28.028777 7fb1e81dc700 mon.1@0(leader) e1 handle_command mon_command(pg send_pg_creates v 0) v1
[1:14] <pulsar> and ...
[1:14] <pulsar> nada
[1:15] <gregaf> is that line followed by one saying "sent pg creates"?
[1:15] <pulsar> yep
[1:15] <gregaf> okay
[1:15] <pulsar> skype / desktop sharing?
[1:15] <gregaf> and probably one saying "send_pg_creates to 0 pgs"?
[1:16] <gregaf> which means it's definitely a bizarre problem on the OSDs
[1:16] <pulsar> just this one line
[1:16] <pulsar> 2012-02-22 01:14:28.028777 7fb1e81dc700 mon.1@0(leader) e1 handle_command mon_command(pg send_pg_creates v 0) v1
[1:16] <pulsar> the cluster works with half of the osds
[1:16] <pulsar> booting with ~40 instead of 20 makes one previously working node crash during startup
[1:16] <pulsar> and ... here we are
[1:17] <gregaf> yeah, the problem we're hitting here is something to do with a recovery edge case
[1:17] * joao (~joao@ Quit (Quit: joao)
[1:19] <gregaf> pulsar: can you tail the last 200 or so lines of the monitor log and post them for me?
[1:19] <pulsar> mom
[1:20] <pulsar> http://dl.dropbox.com/u/3343578/logs/mon.log
[1:20] <gregaf> and then find another PG that's still creating (and the OSD it's on), then turn up the debug ("debug osd = 20" and "debug ms = 1") on that OSD config and restart the OSD
[1:20] <gregaf> might give us a hint
[1:20] <pulsar> 20:42 where it starts
[1:21] <pulsar> find another pg that's still creating... ok.. let me see if i can do that
[1:22] <sjust> pulsar: grep 'ceph pg dump' for 'creating'
[1:22] <pulsar> yeah, already looking for that
[1:25] <pulsar> ok. got one. i see a lot more debug messages in the monitor log now
[1:25] <pulsar> i picked that one:
[1:25] <pulsar> 2.2 0 0 0 0 0 0 0 creating 0'0 0'0 [] [] 0'0 0.000000
[1:25] <pulsar> ceph osd tell 2.2 injectargs --debug_ms 1
[1:26] <pulsar> ceph osd tell 2.2 injectargs --debug_mon 20
[1:26] <pulsar> right?
[1:26] <gregaf> actually, use quote marks around the "--debug_ms 1" sections
[1:26] <pulsar> k
[1:26] <gregaf> sorry, haven't done this in a while :)
[1:27] <pulsar> yay
[1:27] <pulsar> now we are talking
[1:27] <pulsar> jeeez
[1:27] <pulsar> wasn't working before with mon either
[1:27] <gregaf> heh
[1:27] <gregaf> yeah, my bad
[1:27] <pulsar> no prob
[1:27] <pulsar> so... pg create once again?
[1:27] <gregaf> finally spun up a local instance to check these before I spew them back to you :)
[1:28] <gregaf> yeah, why not
[1:29] <pulsar> a lot of noise now
[1:29] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[1:29] <pulsar> i'll get you a fresh mon log
[1:31] <pulsar> http://dl.dropbox.com/u/3343578/logs/mon.debug.log
[1:31] <pulsar> how do i find the osd a pg is running on?
[1:31] <pulsar> pgid 2.2, osd=2?
[1:31] <gregaf> oh, haha, so it is sending creates to them
[1:32] <gregaf> ….in tons of different epochs?
[1:32] <pulsar> perhaps the commands got accumulated?
[1:33] <gregaf> oh, it's not really sending them as from that epoch, good
[1:33] <pulsar> and epoch stands for?
[1:33] <pulsar> sorry for the noob questions
[1:33] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[1:33] <gregaf> each revision of the OSD Map is a new epoch
[1:33] <pulsar> just trying to understand the language
[1:33] <pulsar> ah, revision==epoch. check.
[1:33] <gregaf> epoch == version
[1:34] <gregaf> you can probably thank Leslie Lamport for our using it
[1:34] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit ()
[1:34] <gregaf> you turned up debug on the OSDs too, right?
[1:34] <gregaf> or at least on one of them
[1:35] <pulsar> i have
[1:35] <pulsar> on 2.2
[1:35] <pulsar> pg == 2.2
[1:35] <pulsar> which osd would that map to?
[1:35] <gregaf> osds 10, 11, 13, 14, 16, 17, 18, 19 should have just gotten pg_create messages
[1:35] <pulsar> or how do i look that up
[1:35] <gregaf> sjust: best way to map PG to OSD?
[1:36] <sjust> osdmaptool --test-map-pg <pgid> <osdmap>
[1:36] <sjust> you can get the osdmap by: 'ceph osd getmap -o <osdmap>'
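To pull just the OSD ids out of that mapping line, something like the following could work. It is a sketch that assumes the output format shown earlier in the log (`2.69 raw [15,39] up [15,39] acting [15,39]`); the helper name `acting_osds` is hypothetical.

```shell
#!/bin/sh
# Read `osdmaptool --test-map-pg` output on stdin and print the acting
# OSD ids of each mapping line, space-separated.
# Lines without an "acting [...]" field (such as the "parsed" banner)
# are skipped.
acting_osds() {
    sed -n 's/.*acting \[\([0-9,]*\)].*/\1/p' | tr ',' ' '
}

# Usage against a live cluster, with the commands from the log (not executed here):
# ceph osd getmap -o /tmp/tmpmap
# osdmaptool --test-map-pg 2.2 /tmp/tmpmap | acting_osds
```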
[1:36] <pulsar> thanks
[1:37] <gregaf> since you haven't already turned up the debugging you'll need to turn it up on one of those OSDs, then resend the mon command
[1:37] <gregaf> then the OSD log will include that message coming in and information about what the OSD did with it
[1:37] <pulsar> 2.2 raw [14,4] up [14,4] acting [14,4]
[1:37] <pulsar> so osd.14 then?
[1:38] <pulsar> i have set the debugging level on that one already.
[1:38] <gregaf> any of the ones I listed, really — 14 is as good as any :)
[1:38] <gregaf> cool!
[1:39] <pulsar> hmm.. not that much in the log
[1:39] <gregaf> so it should have information in the log about a pg create
[1:39] <gregaf> which options did you turn up?
[1:39] <pulsar> debug_ms 2
[1:39] <pulsar> debug_mon 20
[1:39] <pulsar> ceph osd tell 2.2 injectargs "--debug_mon 20"
[1:40] <pulsar> and just used the osdmap to look up 2.2
[1:40] <pulsar> said [14,4]
[1:40] <gregaf> pulsar: oh, heh, for the OSD it'll want "--debug_osd 20" :)
[1:40] <gregaf> since it's an OSD, not a monitor
[1:40] <pulsar> yes, i would want that. right :)
[1:40] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[1:40] <pulsar> re-sending creates
[1:41] <pulsar> waiting for the log to do its matrix thing...
[1:41] <pulsar> nothing yet.
[1:42] <pulsar> let me set the debug level on all nodes then
[1:47] <pulsar> for i in $(ceph pg dump | grep creating | sed 's/\t.*//'); do ceph osd tell $i injectargs "--debug_ms 2"; done
[1:47] <pulsar> perhaps an overkill.
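Note that the loop above passes PG ids to `ceph osd tell`, which appears to expect an OSD id. A possible alternative, sketched below, first maps the PGs to OSDs and then emits one command per distinct OSD. It assumes the `osdmaptool --test-map-pg` output format shown earlier in the log and the quoted `injectargs` form gregaf suggested; the function name is made up, and the commands are only printed, not run.

```shell
#!/bin/sh
# Read `osdmaptool --test-map-pg` output lines on stdin and print one
# `ceph osd tell` command per distinct acting OSD (dry run only).
debug_cmds_for_mappings() {
    sed -n 's/.*acting \[\([0-9,]*\)].*/\1/p' |   # keep the "14,4" part
        tr ',' '\n' |                              # one OSD id per line
        grep -v '^$' |                             # drop empty acting sets
        sort -n -u |                               # de-duplicate, numeric order
        while read -r osd; do
            printf "ceph osd tell %s injectargs '--debug_osd 20'\n" "$osd"
        done
}
```

Printing the commands instead of executing them makes it easy to review the list before turning up logging on many daemons at once.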
[1:47] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:48] <pulsar> sent pg creates...
[1:48] <pulsar> ah, osd.1 is logging like crazy
[1:49] <gregaf> pulsar: you'll need to "--debug_osd 20" as well, in order to see the actual osd logic progressing
[1:49] <pulsar> did that too
[1:49] <gregaf> I've got to run off and climb stairs now, though, sjust should be back on for you :)
[1:49] <pulsar> http://dl.dropbox.com/u/3343578/logs/osd1.log
[1:50] <pulsar> no hurry
[1:50] <pulsar> i will call it a day now
[1:50] <pulsar> it is 2am over here
[1:50] <pulsar> and i do actually work normal business hours.
[1:50] <pulsar> :)
[1:50] <gregaf> yeesh, sleep well!
[1:50] <pulsar> tn
[1:50] <pulsar> thanks for the help!
[1:55] <sjust> pulsar: aha
[1:55] <sjust> try ceph osd lost 12
[1:55] <pulsar> aha!?
[1:55] <sjust> 'ceph osd lost 12' that is
[1:55] <pulsar> k
[1:56] <sjust> and then send pg creates again
[1:56] <pulsar> yes-i-really-mean-it
[1:56] <pulsar> k
[1:56] <pulsar> cent
[1:56] <pulsar> sent even
[1:56] <sjust> anything happening to creating?
[1:56] <pulsar> nope
[1:57] <sjust> yeah, I just had a thought, one sec
[2:03] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[2:04] <sjust> pulsar: ugh, we don't handle lost correctly in the creation code paths, we'll need to get you a patch
[2:05] <sjust> pulsar: I'll get back to you tomorrow, if that's ok
[2:05] <pulsar> hmm...
[2:05] <pulsar> right now i run on stable debian packages
[2:05] <pulsar> "stable"
[2:05] <sjust> pulsar: ah
[2:05] <pulsar> rolled out by puppet.
[2:06] <sjust> bringing osd12 back up would probably work, actually
[2:06] <pulsar> i guess i can build myself deb packages and roll these out
[2:06] <sjust> basically, the remaining osds won't create a new copy of that pg because they can't prove osd12 didn't write anything before it died
[2:06] <pulsar> if i manage to meet all dependencies
[2:06] <pulsar> well.. here is the thing,
[2:07] <pulsar> right now i am evaluating if ceph would be stable enough for my purpose
[2:07] <pulsar> 1b files, uptime for about 20 days
[2:07] <pulsar> i don't mind using workarounds to get the fs up and running
[2:08] <pulsar> so, bringing new nodes into the cluster one by one after the base fs has been initialized should be working in my case, right?
[2:08] <sjust> pulsar: yeah, but then, I haven't seen a cluster fail like this before
[2:08] <pulsar> i see. i would be glad to help you with that one then.
[2:09] <sjust> pulsar: just make sure you allocate plenty of pgs (around 100-200 per final node count)
[2:09] <pulsar> so, leave the cluster untouched as it is right now and you get me a deb package? that is, if you want me to test it.
[2:10] <pulsar> otherwise i would rebootstrap it and as you suggested preallocate 200 pgs per node
[2:10] <sjust> pulsar: that should work
[2:10] <pulsar> one last question, how do i preallocate pgs?
[2:10] <pulsar> :)
[2:10] <sjust> once the pgs are actually replicated, this should not happen
[2:10] <sjust> it's the pg_num configurable, which I now need to remind myself of the location
[2:11] <pulsar> ok. i'll find it then
[2:11] <pulsar> as long it is not "ceph pg <something> <something>"
[2:11] <pulsar> :D
[2:11] <sjust> it's part of creating the initial crushmap
[2:11] <sjust> or, I think you could populate the ceph.conf with all of the nodes and just bring them up 10 or so at a time
[2:11] <sjust> that might do it
[2:12] <pulsar> bring up 10 as in start the daemons
[2:12] <pulsar> ?
[2:12] <sjust> yeah
[2:12] <sjust> sorry
[2:12] <pulsar> straight forward then
[2:13] <pulsar> http://ceph.newdream.net/wiki/Cluster_configuration
[2:13] <pulsar> just out of curiosity
[2:13] <pulsar> didn't find the pg settings here :/
[2:13] <pulsar> oh
[2:13] <pulsar> ceph.conf is another section...
[2:13] <sjust> the defaults should work as long as you have the full number of osds specified when you initialize the cluster
[2:14] <pulsar> ok
[2:14] <sjust> the default is based on the number of osds
[2:14] <pulsar> so, format all nodes and start in batches of 10
[2:14] <sjust> yeah
[2:14] <pulsar> what is the average uptime of your instances?
[2:14] <pulsar> just curious what to expect
[2:15] <sjust> weeks on our large test cluster, I think
[2:15] <pulsar> sounds good
[2:16] <pulsar> in worst case scenarios i will expect a job to run for about 8 weeks
[2:18] <pulsar> i have also noticed that plenty of physical files are created when an osd is filling up. can the block size, and thus the number of files, be tweaked or something?
[2:19] <sjust> pulsar: yes, I think so
[2:19] <pulsar> it took hours to flush/delete the data directories on xfs earlier this day for me, just wondering if there are some adjustments to that
[2:20] <sjust> pulsar: I believe the fs chunks files into objects, and the chunk size should be configurable
[2:21] <pulsar> looking through the osd options over here: http://ceph.newdream.net/wiki/Ceph.conf
[2:24] <sjust> hmm, I don't see it
[2:24] <pulsar> me neither
[2:24] <sjust> I'll need to ask greg/sage tomorrow
[2:25] <sjust> I'm heading out for the day, I'll be back tomorrow morning
[2:25] <pulsar> what would be great. i'll call it a day now. for the second time :)
[2:25] <pulsar> 2:30am
[2:26] <pulsar> thanks for all the help!
[2:27] <pulsar> s/what/that/
[2:27] <pulsar> AFK
[2:45] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[3:05] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[3:11] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[3:36] * joshd1 (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:04] * alexxy[home] (~alexxy@ Quit (Remote host closed the connection)
[4:04] * alexxy (~alexxy@ has joined #ceph
[4:30] * alexxy[home] (~alexxy@ has joined #ceph
[4:36] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[4:41] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:43] * alexxy[home] (~alexxy@ Quit (Read error: Connection reset by peer)
[4:44] * alexxy (~alexxy@ has joined #ceph
[4:48] * alexxy (~alexxy@ Quit ()
[4:49] * alexxy (~alexxy@ has joined #ceph
[4:54] * alexxy (~alexxy@ Quit ()
[4:54] * alexxy (~alexxy@ has joined #ceph
[4:58] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[5:06] * alexxy (~alexxy@ has joined #ceph
[5:09] * alexxy (~alexxy@ Quit (Quit: No Ping reply in 180 seconds.)
[5:10] * alexxy (~alexxy@ has joined #ceph
[5:17] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[5:22] * alexxy (~alexxy@ has joined #ceph
[5:57] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[6:01] * alexxy (~alexxy@ Quit (Quit: No Ping reply in 180 seconds.)
[6:07] * alexxy (~alexxy@ has joined #ceph
[6:14] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) Quit (Remote host closed the connection)
[6:15] * alexxy (~alexxy@ Quit (Quit: No Ping reply in 180 seconds.)
[6:16] * alexxy (~alexxy@ has joined #ceph
[6:23] * alexxy (~alexxy@ Quit (Quit: No Ping reply in 180 seconds.)
[6:27] * alexxy (~alexxy@ has joined #ceph
[6:30] * alexxy (~alexxy@ Quit (Quit: No Ping reply in 180 seconds.)
[6:31] * alexxy (~alexxy@ has joined #ceph
[9:09] * gohko (~gohko@natter.interq.or.jp) Quit (Quit: Leaving...)
[9:12] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[9:20] * gohko_ (~gohko@natter.interq.or.jp) has joined #ceph
[9:20] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[9:34] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:34] * lofejndif (~lsqavnbok@229.Red-81-32-60.dynamicIP.rima-tde.net) has joined #ceph
[10:07] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:19] * joao (~joao@ has joined #ceph
[10:30] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:19] * lofejndif (~lsqavnbok@229.Red-81-32-60.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[11:30] * joao (~joao@ Quit (Quit: joao)
[12:02] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[12:33] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[12:39] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[13:18] * joao (~joao@ has joined #ceph
[13:30] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) has joined #ceph
[13:30] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) Quit (Remote host closed the connection)
[13:50] <pulsar> sjust: gregaf: giving it one more try, starting with 8 osds (of 78)
[13:50] <pulsar> already lost one osd during bootup
[13:51] <pulsar> seems i am stuck at pg v439: 15642 pgs: 3 creating, 11705 active+clean, 3934 peering
[14:27] <pulsar> seems the number of pgs is a bit high, i would like to start the cluster with a lower value. right now the system assumes 200 pgs per osd. can this default setting be changed via ceph.conf? did not find anything yet
[14:31] <pulsar> also checked common/config_opts.h ... hmmm
[14:33] <pulsar> osd_pool_default_pg_num? default is 8, so not exactly 200
[15:07] <pulsar> when running / booting 2 osds on the same server i get a lot of
[15:07] <pulsar> 2012-02-22 15:07:06.898269 7f4ddfb02700 -- >> pipe(0x3f03500 sd=39 pgs=0 cs=0 l=0).connect claims to be not - wrong node!
[15:11] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:38] * BManojlovic (~steki@ has joined #ceph
[16:06] * psomas_ (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[16:06] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (Read error: Connection reset by peer)
[16:14] * tnt_ (~tnt@212-166-48-236.win.be) has joined #ceph
[16:16] <tnt_> Hi. I'm trying to get started with ceph, creating a small cluster in several VMs. Unfortunately the osd won't start. So I tried starting one by hand and get an "Abort assertion failed".
[16:16] <tnt_> http://pastebin.com/rwgewtB6
[16:20] <pulsar> does this directory exist? /srv/osd.0/
[16:20] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[16:21] <tnt_> yes, I created it before doing the mkcephfs script. There is a btrfs volume mounted there. (empty at first, then after the mkcephfs script, it contains several files and dirs : ceph_fsid current fsid magic snap_1 store_version whoami )
[16:21] <pulsar> sorry, no idea then :/
[16:26] <tnt_> For the mon, should I create a /srv/mon.a directory with a btrfs volume as well ?
[16:27] <pulsar> i did that, yes
[16:28] <pulsar> osd_pg_bits <- is this exclusively used to calculate the number of pgs in a cluster based on the number of osds?
[16:28] <tnt_> when I do it, the mkcephfs script complains "2012-02-22 16:25:46.816298 7f853fc99780 store(/srv/mon.a) MonitorStore::mkfs: failed to remove /srv/mon.a: rm returned run_cmd(rm): exited with status 1"
[16:29] <pulsar> sounds like a permission issue, yes.
[16:29] <pulsar> and does also seem like the script is actually trying to create it
[16:30] <tnt_> it's trying to remove it ... but can't because there is my btrfs volume mounted there ...
[16:31] <pulsar> maybe you could try mounting / providing the data directories by yourself then and use a subdirectory for the whatever-storage?
[16:32] <pulsar> thats the way i do it, did not try the btrfs auto stuff yet
[16:32] <tnt_> If I let /srv/mon.a just be a directory then mkcephfs works fine. I can start mds & mon but not the osd (errors above ...)
[16:33] * joao is now known as Guest3524
[16:33] * joao (~joao@89-181-147-200.net.novis.pt) has joined #ceph
[16:36] <tnt_> mmm, if I don't use btrfs and just create empty dirs on my root ext4 part, then it works.
[16:37] <pulsar> you can still use btrfs
[16:37] <pulsar> just point the data directories to the right location
[16:38] <tnt_> what's the "right location" ? I have /dev/sda2 mounted on /srv/osd.0 that seemed right ?
[16:38] <pulsar> whatever location you have the btrfs mounted on
[16:38] <pulsar> [mon] mon data = /mnt/data2/ceph/mon.$id
[16:39] <pulsar> i have /mnt/data2/ mounted and ceph is a subdirectory
[16:39] <pulsar> [osd] osd data = /mnt/data2/ceph/osd.$id
[16:39] <pulsar> same over here
[16:39] <tnt_> mmm, I might try that. So that the ceph dir is not the 'root' of the FS.
[16:39] <pulsar> just make sure your ceph directory is on a btrfs fs
[16:39] <pulsar> yep, it is a subdirectory of the fs (xfs in my case)
[16:39] * Guest3524 (~joao@ Quit (Ping timeout: 480 seconds)
[16:43] <tnt_> Nope, same error with the OSD "terminate called after throwing an instance of 'ceph::FailedAssertion'"
[16:43] <tnt_> I'll try to remake all the VMs with ubuntu 11.10, it has a newer kernel ...
[16:43] <pulsar> and mkfs.ceph did not bail out any errors?
[16:43] <tnt_> no, mkcephfs script on the admin host worked fine
[16:43] <tnt_> (I'm following http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/#ssh-config )
[16:44] <pulsar> sounds like a btrfs issue then.
[16:54] <tnt_> For the clients is there any need for kernel support ? (i.e. I'd like some of my squeeze machines to be able to mount / access block devices of the cluster)
[16:54] <iggy> depends
[16:55] <iggy> for the kernel fs client, yes
[16:55] <iggy> you can use fuse (I think it's less tested than the kernel client though)
[16:55] <iggy> and if you just want rbd/radosgw/direct access to the object store, then no, that's mostly userspace
[16:56] <tnt_> Ok. Can it be built as module ? (basically I'd like to use the ceph cluster to store VM images as RBD on xen0)
[16:57] <iggy> rbd with xen?
[16:59] <tnt_> Yes. Sounded like a nice idea. I currently have all vm disks as iSCSI volumes and I wanted to replace that with RBD.
[17:00] <iggy> you'd have to use the kernel rbd driver I think
[17:00] <iggy> I don't know how far back it compiles
[17:08] <pulsar> iggy: do you have any idea how to bring up a fresh fs with a predefined pg number?
[17:08] <iggy> or if your xen is using a new enough qemu, it should have rbd support available
[17:09] <iggy> pulsar: nfc, it's been a while since I've messed with ceph fs... I've mostly been interested in rbd stuff lately
[17:09] <pulsar> i tried setting pg bits, pool default pg num, pool data pg num via ceph.conf in [osd] without any effects
[17:09] <pulsar> ic
[17:09] <tnt_> iggy: qemu is only used for HVM hosts I think, not paravirtualized ones.
[17:09] <iggy> tnt_: yeah, you're right
[17:10] <iggy> so yeah, for pv guests, you'd need the rbd kernel driver
[17:15] * BManojlovic (~steki@ Quit (Quit: I'm off, you do whatever you want...)
[17:37] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:58] * tnt_ (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:06] * tnt_ (~tnt@123.165-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:20] * Tv|work (~Tv__@aon.hq.newdream.net) has joined #ceph
[18:40] <gregaf> pulsar: what killed the OSD during bootup?
[18:40] <gregaf> in any case, if you bring it back up you should see the rest of your PGs get happy
[18:40] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:41] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[18:41] <^conner> I saw there is a ceph tutorial scheduled for MSST
[18:43] <gregaf> huh, so there is — you'd have to ask sagewk about that!
[18:49] <pulsar> gregaf: difficult to say, 2 osds were logging into the same file with verbosity turned up.
[18:49] * rcattelan (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[18:49] <pulsar> gregaf: i tracked the problem down to a too high number of pgs
[18:50] <pulsar> gregaf: looked several hours for a way to initialize the fs with a fixed number of pgs per osd but did not find any way to configure that
[18:50] <gregaf> hmm, it is possible I guess
[18:50] <pulsar> seems the number of pgs is determined by shifting the number of osds by 6 (bitwise) to the left
[18:50] <gregaf> we need to work on cluster bringup, ugh
[18:50] <gregaf> yeah, the default PG calculations are a bit silly
[18:50] <pulsar> but pg_bits or something had no effect
[18:51] <pulsar> tried pretty much every config parameter found in config_?.cc in the [osd] section
[18:51] <gregaf> you need to have it set in the right place, which you probably didn't :/
[18:51] <pulsar> i guessed, everything starting with osd_some_magic_foo goes into [osd] osd some magic foo = 1337
[18:52] <gregaf> those config values need to find their way into the process that generates the initial map, which iirc is a client called by mkcephfs
[18:52] <pulsar> anyway. booting the cluster with 15 nodes, adding new ones, rebalancing
[18:52] <gregaf> yeah
[18:52] <pulsar> and swearing
[18:52] <pulsar> a lot :)
[18:52] <gregaf> :(
[18:53] <gregaf> so things are progressing now?
[18:53] <pulsar> manpage for mkcephfs did not reveal any parameters like that
[18:53] <pulsar> i already suspected that
[18:53] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[18:53] <pulsar> well.. i messed up, but it worked quite well
[18:54] <pulsar> re-adding new nodes one by one and forgetting about the 2nd dirty (previous test fs) osd on the same node and calling ceph start...
[18:54] <pulsar> bad idea
[18:54] <gregaf> heh
[18:54] <pulsar> exec_all.sh "sudo rm /mnt/data2/ceph/* -Rf" *wait*
[18:55] <pulsar> but.. if you can give me a hint how to limit the number of pgs on init - now would be a great time
[18:55] <pulsar> otherwise i'll continue with the less pretty plan
[18:56] <gregaf> Tv|work: sagewk: do you remember how to do this (limit PGs on startup) nicely?
[18:56] <pulsar> this is what i tried:
[18:56] <pulsar> # osd pg bits = 4
[18:56] <pulsar> # osd pool default pg num = 2
[18:56] <pulsar> # osd pool data pg num = 2
[18:56] <pulsar> # osd pool rbd pg num = 2
[18:56] <pulsar> # osd pool metadata pg num = 2
[18:56] <pulsar> # osd pool default pgp num = 2
[18:56] <pulsar> # osd_pg_bits = 4
[18:58] <Tv|work> gregaf: i don't think i've ever needed to touch part
[18:58] <Tv|work> *that part
[18:58] <gregaf> pulsar: you need to get the "osd pg bits = 4" config into the map generator, which I think you want to do by putting it in the "global" section on the node you run mkcephfs on…
[18:58] <gregaf> but that's ugly and I could be misremembering the flow
[18:58] <pulsar> i'll give it a try
[18:59] * bchrisman (~Adium@ has joined #ceph
[18:59] <Tv|work> then there's osd_pool_default_pg_num and osd_pool_default_pgp_num...
[19:00] <Tv|work> oh heh, new favorite part of the ceph source tree:
[19:00] <Tv|work> =item some googoo notes
[19:00] <pulsar> Tv|work: tried those in the [osd] section
[19:01] <gregaf> those are for new pools but I believe don't apply to the initial map…*sigh*
[19:01] <Tv|work> yeah
[19:01] <pulsar> i'll try the bits first
[19:01] <Tv|work> hmm why do i see accesses to g_conf->osd_pg_bits in src/mon/ ?
[19:01] <pulsar> 4 or 3 should be fine (80 << 3) * 3
[19:01] <pulsar> 80 osd nodes and 3 pools. right?
[19:02] <gregaf> and remember, if the initial map creator only sees 1 OSD it'll only create 16 or so PGs…you could just not put all the OSDs into the initial config and then add them afterwards and add the new PGs then
[19:02] <pulsar> i was doing that
[19:03] <pulsar> started with 15 osds
[19:03] <pulsar> formatted everything
[19:03] <pulsar> and started adding / formatting the other nodes one by one
[19:03] <dmick> Tv|work: see also ossh.{lib,include}{.big}
[19:04] <gregaf> I dunno what the limits are on PGs/OSD — the smallest number I've heard of is several thousand but I thought you had big nodes that could handle it
[19:05] <pulsar> i had 15k
[19:05] <pulsar> which was too much
[19:05] <pulsar> for xfs as a backend i presume
[19:06] <gregaf> oh right, you died on the FS, not on the OSD doing something
[19:06] * chutzpah (~chutz@ has joined #ceph
[19:06] <gregaf> I wouldn't expect xfs to have a problem with that workload, but anything's possible!
[19:07] <pulsar> i guess it is more like a timeout issue, too many threads doing file io ... smells a bit like that
[19:07] <pulsar> root@node-1 ~ # ceph osd pool get data pg_num
[19:07] <pulsar> 2012-02-22 19:07:29.649048 mon <- [osd,pool,get,data,pg_num]
[19:07] <pulsar> 2012-02-22 19:07:29.649752 mon.0 -> 'PG_NUM: 1264' (0)
[19:07] <pulsar> yay!
[19:07] * sagewk1 (~sage@aon.hq.newdream.net) Quit (Quit: Leaving.)
[19:08] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[19:08] <pulsar> " osd pg bits = 4" in "[global]" it is
[19:08] <pulsar> *dance*
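The PG arithmetic discussed above can be sketched roughly as follows. This is only an illustration of the rule as pulsar describes it (PGs per pool = number of OSDs shifted left by "osd pg bits", with 6 as the apparent default and 3 initial pools); the function names are made up for this sketch and are not part of Ceph. The totals that `ceph -s` reports elsewhere in this log are somewhat higher than this formula gives, so treat it as an approximation.

```python
def pgs_per_pool(num_osds: int, pg_bits: int) -> int:
    """PGs for one initial pool, assuming the shift rule described in the chat."""
    return num_osds << pg_bits


def total_pgs(num_osds: int, pg_bits: int, num_pools: int = 3) -> int:
    """Total across the initial pools (3 pools assumed, as in the chat)."""
    return pgs_per_pool(num_osds, pg_bits) * num_pools


if __name__ == "__main__":
    # (OSDs seen by the map generator, "osd pg bits" value)
    for osds, bits in [(80, 6), (80, 4), (80, 3), (1, 4)]:
        print(f"osds={osds} pg_bits={bits} "
              f"per_pool={pgs_per_pool(osds, bits)} total={total_pgs(osds, bits)}")
```

Under these assumptions, 80 OSDs at 6 bits gives 15360 PGs (the "15k" mentioned in the chat), and the reported PG_NUM of 1264 equals 79 << 4, which would fit 79 OSDs being present when the initial map was generated.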
[19:08] <NaioN> pulsar: how many pg/xfs fs did you have?
[19:08] <pulsar> NaioN: 15k pgs
[19:08] <pulsar> 80 osds
[19:08] <NaioN> I'm also using xfs because I run into a btrfs bug all the time
[19:08] <pulsar> 40 physical nodes
[19:08] <pulsar> 80 disks
[19:09] <gregaf> it was during startup and he was trying to create all of them on a single OSD all at once, is probably the problem
[19:09] <NaioN> hmmm I have 4752 on 24 osds/disks
[19:09] <NaioN> 2 physical machines
[19:10] <NaioN> is that 15k in total or per osd?
[19:10] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[19:10] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) Quit (Quit: Who the hell is this Peer? If I catch him I'll reset his connection!)
[19:12] <gregaf> his was per OSD
[19:12] <gregaf> I've gotta run though, we have a meeting and then interviews!
[19:19] <pulsar> total
[19:19] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Remote host closed the connection)
[19:21] <NaioN> pulsar: hmmm if it's total then it's not that different from mine
[19:22] <NaioN> I didn't add any options for pgs so this is the default for 0.40 i think (this is the release I created the cluster with)
[19:22] <NaioN> I now run .42 on that cluster
[19:22] <NaioN> but I haven't seen any problems with XFS
[19:23] <NaioN> only made the mistake to not resize the inode size and now it takes another block for the xattrs, but it's an experimental cluster so in the next one I will use a bigger inode size
[19:28] <pulsar> i have 40 servers, 80 osds and with default settings i get 15k pgs
[19:28] <pulsar> causing my cluster to fail on the first boot. if i lower the number of pgs everything seems to be just fine
[19:28] <gregaf> pulsar: NaioN: the problem is that during startup if your OSDs come up sufficiently staggered then you can have all of those 15k PGs mapped to a single OSD, which means a lot of mkdirs in a single directory
[19:28] <gregaf> running away again!
[19:29] <pulsar> right
[19:36] <NaioN> so the question is if xfs can cope with 15k dirs in one dir?
[19:40] * lofejndif (~lsqavnbok@229.Red-81-32-60.dynamicIP.rima-tde.net) has joined #ceph
[19:41] <pulsar> it can, but it will get a bit slow doing that
[19:41] <pulsar> causing the whole bootstrapping process to time out
[19:42] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[19:42] <pulsar> and if this happens during the first boot, the peering / pg_create will deadlock
[19:43] <pulsar> i mean, the osd will time out and die. not the boot process as a whole.
[19:45] <NaioN> hmmm so the best solution would be to increase the time-out, because with enough time it would settle
[19:45] <^conner> is xfs's dir hashing that poor?
[19:45] <^conner> I thought it was good to 100K
[19:45] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[19:46] <pulsar> this is what i have figured out so far by guessing
[19:46] <pulsar> i might be wrong, and something else is causing the osd to time out
[19:46] <NaioN> well the only thing i know about it is that i switched from xfs to ext3 for my backup volumes... i used rsync with hardlinks
[19:46] <^conner> have you looked at the xfsstress tests?
[19:46] <pulsar> nope?
[19:46] <^conner> I thought one of the tests is a massive dir creation
[19:47] <NaioN> and the difference in time between xfs and ext3 is huge on such a load
[19:47] <NaioN> because of the many inodes you create every backup
[19:47] <^conner> NaioN, ya, well ext3 is preallocated inodes
[19:47] <^conner> NaioN, xfs will also break if you create a new filesystem and then put a > 16TB file on it
[19:48] <NaioN> yeah so you have to think in advance, because you need to allocate enough inodes in advance
[19:48] <^conner> it wants to allocate some inodes in 32bit addressable block space and craps itself for new file allocation
[19:48] <^conner> I guess I should have reported that one
[19:48] <NaioN> hmmm didn't create a filesystem of that size... if you have to fsck it you have a problem :)
[19:49] <pulsar> pg v706: 4266 pgs: 4266 active+clean; 8730 bytes data, 69171 GB used, 138 TB / 206 TB avail
[19:49] <pulsar> \o/
[19:49] <pulsar> active
[19:49] <pulsar> finally!
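A status line like the one pulsar pasted can be checked mechanically. The sketch below parses the "pg vN: X pgs: ...;" summary and reports whether every PG is active+clean. The format is inferred only from the lines quoted in this log (Ceph 0.4x era), and the helper names are hypothetical, so it may not match other versions.

```python
import re


def parse_pg_summary(line: str):
    """Return (map version, total PGs, {state: count}) from a pg summary line."""
    match = re.search(r"pg v(\d+): (\d+) pgs: ([^;]+);", line)
    if match is None:
        raise ValueError("not a pg summary line")
    states = {}
    # The state list looks like "62 creating, 6734 active+clean, 1124 peering".
    for part in match.group(3).split(","):
        count, state = part.strip().split(" ", 1)
        states[state] = int(count)
    return int(match.group(1)), int(match.group(2)), states


def all_active_clean(line: str) -> bool:
    """True when every PG in the summary is in the active+clean state."""
    _, total, states = parse_pg_summary(line)
    return states.get("active+clean", 0) == total
```

For example, feeding it the line above returns True, while a summary that still lists creating or peering PGs returns False.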
[19:49] <^conner> NaioN, my largest xfs filesystem is 84TB
[19:49] <^conner> mke2fs still can't create filesystems > 16TB
[19:49] <NaioN> i hope for you you never have to fsck it
[19:49] <^conner> I pulled the git code a year ago and tried 17TB ext4, it was corrupted
[19:50] <^conner> NaioN, xfs fsck times aren't bad
[19:50] <NaioN> the newer tools are somewhat memory efficient
[19:50] <NaioN> but the old ones were crap
[19:50] <^conner> well what's shipping with rhel6 is limited 16tb
[19:50] <^conner> xfs is your only good option for > 16TB
[19:50] <^conner> at least it doesn't leak memory anymore
[19:51] <NaioN> yeah xfs has made some real good steps in the last years on linux
[19:52] <^conner> it's a solid FS now
[19:52] <NaioN> i agree with that
[19:52] <^conner> btrfs seems to be years off
[19:52] <NaioN> we use it a lot in production
[19:53] <^conner> Chris Mason did a btrfs update talk at scale10x
[19:53] <^conner> and he still hasn't released the fsck code
[19:53] <NaioN> I can't get btrfs stable with ceph
[19:53] <^conner> seriously scary
[19:53] <^conner> the benchmarks aren't very good
[19:53] <^conner> I don't get all the interest
[19:53] <NaioN> last tests i did the performance dropped after some load
[19:54] <^conner> I'd rather have a working device-map raid6 personality
[19:54] <^conner> well, it works now, the performance is just horrid
[19:54] <NaioN> I had the same problem as somebody posted on the mailing list
[19:54] <^conner> I built a 4 7200rpm disk raid6 for storing photography stuff... it does about 3MB/s
[19:55] <NaioN> you use a mdraid/raid6 setup with xfs and ceph?
[19:55] <^conner> hell no
[19:55] <^conner> that's ext4
[19:56] <NaioN> aha ok
[19:56] <NaioN> why not?
[19:56] <^conner> "3MB/s"
[19:56] <^conner> for sequential writes
[19:56] <NaioN> oh because of the raid6?
[19:56] <^conner> actually, I've never tried to setup ceph
[19:57] <NaioN> hmmm i had good performance with raid5/raid6 setups
[19:57] <^conner> I've been popping into this channel once a year for about 5 years
[19:57] <^conner> but I'm thinking about giving it a try
[19:57] <NaioN> but for some reason the IO stalled after a period of time and the osds began crashing
[19:57] <^conner> I've a 48 disk box I can use for testing right now
[19:58] <^conner> I need another 300TB by august and it's looking like it's going to be GPFS
[19:58] <NaioN> I've now two 24 disks boxes (only half filled at the moment) for testing
[19:58] <NaioN> from IBM?
[19:58] <^conner> the software
[19:58] <^conner> storage may be build-my-own or maybe Dell
[19:58] <NaioN> yeah i know, but that's from IBM if I'm correct
[19:59] <^conner> Dell has given me really aggressive pricing on 3TB SAS disks
[19:59] <^conner> yes, GPFS is from IBM
[19:59] <^conner> GPFS is pretty slick and it "just works"
[19:59] <NaioN> yeah ok, but it's a commercial product
[19:59] <^conner> but it won't store more than 2 copies of a file and it doesn't really checksum
[20:00] <^conner> yes, it's commercial
[20:00] <^conner> but I already have it licensed
[20:00] <NaioN> well i was looking for a cheaper solution :)
[20:02] <NaioN> besides one of the biggest advantages of ceph is the data placement algorithm, no need for a central table
[20:03] <NaioN> I don't know the workings of GPFS
[20:04] <NaioN> aha GPFS is a shared disk filesystem...
[20:05] * BManojlovic (~steki@ has joined #ceph
[20:06] <nhm> We've never been able to get good GPFS pricing. I think IBM has been angry that they didn't win any of our recent procurements.
[20:08] <NaioN> hmmm the closest thing that I have worked with that looks like GPFS is OCFS2
[20:09] <Tv|work> nhm: "If you need to ask, it's too high" pricing?
[20:09] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[20:09] <NaioN> Tv|work: hehe, enterprise...
[20:10] <nhm> Tv|work: not quite that bad. More like "we'll reply to your RFP, but we'll obviously price ourselves higher than DDN just to spite you".
[20:11] <nhm> They were threatening to revoke the University's "AAA" customer status a while back supposedly.
[20:11] <nhm> whatever that means.
[20:13] <nhm> actually I should be fair. I think on our last storage RFP they did come in lower than DDN.
[20:13] <elder> I was asked to comment on some of the XFS discussion above... I find there are a number of things to comment on...
[20:13] <elder> First: NaioN> so the question is if xfs can cope with 15k dirs in one dir?
[20:13] <elder> Yes, XFS can cope with that just fine. I don't have any numbers, but the XFS directory structure is a btree so it should scale fairly well.
[20:14] <NaioN> elder: well pulsar seems to have a problem with it
[20:14] <elder> Second: <^conner> NaioN, xfs will also break if you create a new filesystem and then put a > 16TB file on it
[20:14] <NaioN> I would also assume XFS could handle 15k dirs in a dir
[20:14] <elder> I'm not aware of that but if it's true it's a bug.
[20:15] <elder> It would be nice if someone could give a more specific test case to reproduce that problem so it can get fixed.
[20:15] <Tv|work> pulsar: are you still here?
[20:15] <NaioN> elder: the problem began with pulsar, who is having troubles with building his ceph cluster
[20:16] <elder> Next: <^conner> well what's shipping with rhel6 is limited 16tb
[20:17] <nhm> elder: He might be talking about the maximum limit at 4k page size: http://oss.sgi.com/projects/xfs/
[20:17] <nhm> with kernel 2.4 though.
[20:18] <elder> That limitation is just Red Hat stating that beyond that is not a fully supported configuration. It's a practical thing--they don't want to offer it if they have not actually tested it to their satisfaction. 16TB they are comfortable with.
[20:18] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[20:20] <^conner> elder, no, it's a limitation in e2fstools
[20:20] <elder> I think that's all. 15000 entries in a directory shouldn't be a problem.
[20:20] <elder> You mean xfsprogs?
[20:20] <elder> I'm talking about XFS
[20:21] <^conner> elder, RH supports XFS up to 100TB by default
[20:21] <^conner> you have to contact them for anything larger
[20:21] <elder> OK, that's more like it.
[20:21] <^conner> this is with the "scalable filesystem" RHEL addon
[20:21] <^conner> it's ~$200
[20:21] <elder> Red Hat is investing a lot in XFS.
[20:22] <^conner> what they do is pretty silly. The stock kernel has xfs support but they don't include the utils in the normal repos
[20:22] <^conner> you can just grab the SL rpm and install it
[20:22] <^conner> I think I purchased 6 of the XFS addons though
[20:22] <elder> You can, but they won't support it. That's their business. You can grab all of Linux too.
[20:23] <elder> But if you want to have help and the peace of mind of their support, you have to pay for it.
[20:23] <^conner> of course, I think it's pretty silly that they don't include the utils though
[20:23] <^conner> seeing that RHEL is useless without EPEL... and they don't support that either
[20:48] <pulsar> Tv|work: yeah
[21:05] * lofejndif (~lsqavnbok@229.Red-81-32-60.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[21:13] <Tv|work> pulsar: elder here is one of the xfs developers; if you have xfs problems, let us help you figure out why
[21:13] <Tv|work> (sorry for delay, it's lunchtime here)
[21:14] <pulsar> Tv|work: to prove that it was xfs related i would need to recreate the same situation on btrfs
[21:14] <Tv|work> pulsar: no, not really.. you can recreate the scenarios outside of ceph
[21:14] <Tv|work> on just any xfs mountpoint
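Tv|work's suggestion of reproducing the load outside of Ceph could look something like the sketch below: create many directories inside one parent and time it. This is a deliberately simple, single-threaded approximation with hypothetical names; it does not reproduce what the OSD actually does (no journal, no xattrs, no concurrent threads), so a fast result here would not rule out the timeout seen in the cluster. Pass `base_dir` to point it at the XFS mountpoint you want to test.

```python
import os
import tempfile
import time


def time_mkdirs(parent: str, count: int) -> float:
    """Create `count` subdirectories directly under `parent`; return elapsed seconds."""
    start = time.perf_counter()
    for i in range(count):
        os.mkdir(os.path.join(parent, f"pg_{i:06d}"))
    return time.perf_counter() - start


def run(base_dir=None, count=15000):
    """Run the test in a scratch directory under base_dir and clean up afterwards."""
    with tempfile.TemporaryDirectory(dir=base_dir) as parent:
        elapsed = time_mkdirs(parent, count)
        created = len(os.listdir(parent))
    return created, elapsed


if __name__ == "__main__":
    # base_dir=None uses the system temp dir; set it to a path on the
    # filesystem under test to measure that filesystem instead.
    created, elapsed = run(base_dir=None, count=15000)
    print(f"created {created} directories in {elapsed:.2f}s")
```

The count of 15000 mirrors the "15k dirs in one dir" figure from the discussion.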
[21:14] <pulsar> right now i am running a stress test
[21:14] <elder> Can you characterize the problem you're having a bit better?
[21:15] <elder> (Maybe I missed something earlier)
[21:15] <pulsar> yes, so let me explain:
[21:15] <pulsar> node@node-1:~$ uname -a
[21:15] <pulsar> Linux node-1.cluster-5.intra 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011 x86_64 GNU/Linux
[21:15] <pulsar> debian squeeze
[21:15] <pulsar> ceph 0.41
[21:16] <elder> Ooh, 2.6.32 is not going to have the benefit of a LOT of performance work on XFS.
[21:16] <pulsar> deb http://ceph.newdream.net/debian/ squeeze main
[21:16] <pulsar> deb-src http://ceph.newdream.net/debian/ squeeze main
[21:16] <pulsar> yeah, i know it is not very fresh
[21:16] <Tv|work> that's the oldest kernel running ceph i've seen in months, i think
[21:16] <pulsar> debian stable :/
[21:16] <pulsar> i like stable :
[21:16] <Tv|work> s/stab/sta/ :-p
[21:17] <pulsar> anyway. should i continue with the problem description?
[21:17] <Tv|work> elder: do you think he has any chance of decent performance with that kernel?
[21:17] <elder> Yes
[21:17] <pulsar> so, i have 40 servers, 2 disks in each, makes theoretically 80 osds
[21:18] <elder> Yes, it should be fine, but the work over the last couple of years has dramatically improved metadata-intensive loads.
[21:18] <NaioN> elder: what are the chances with trying a newer kernel?
[21:18] <pulsar> i use 76 osds, 2 servers are reserved
[21:18] <pulsar> one master/mds
[21:18] <pulsar> so simplest setup possible.
[21:18] <NaioN> elder: yeah that's something we have noticed
[21:18] <pulsar> formatted all nodes
[21:19] <pulsar> and tried to start all ceph daemons (-a start)
[21:19] <pulsar> the system was creating pgs, peering
[21:19] <pulsar> and at least one node died during this process
[21:19] <pulsar> got left behind with non peered osds
[21:19] <pulsar> and uncreated pgs
[21:20] <pulsar> we tried a lot of pg_create commands yesterday
[21:20] <pulsar> sjust was helping me out to figure out whats wrong
[21:20] <pulsar> let me scroll up, he had an explanation for the problem on ceph's side
[21:21] <pulsar> 02:07 < sjust> basically, the remaining osds won't create a new copy of that pg because they can't prove osd12 didn't write anything before it died
[21:22] <elder> Hmm. So far I don't see anything that points to XFS... But to be honest my ceph knowledge so far is not very deep so I don't know what this big all-at-once start means in terms of load on the underlying FS.
[21:22] <pulsar> 02:05 < sjust> pulsar: ugh, we don't handle lost correctly in the creation code paths, we'll need to get you a patch
[21:22] <elder> When the node died, is there any evidence on it that it was the fault of XFS?
[21:22] <pulsar> and why i think it is i/o related:
[21:22] <gregaf> elder: the OSD that died was one of the first to boot up, so it was the primary for 1900 PGs
[21:22] <Tv|work> elder: "died" might mean "became very very slow", in this context
[21:23] <pulsar> the node died because of timeout issues
[21:23] <pulsar> so, the file system could not keep up creating all the pg directories on one node
[21:23] <gregaf> and it suicided on a thread that was waiting for the filesystem to get back to it
[21:23] <pulsar> yep
[21:23] <elder> Is that a verb?
[21:23] <Tv|work> he just verbed it
[21:24] <elder> OK, again, I'd say that's a problem with Ceph, not XFS. Ultimately you're saturating the hardware.
[21:24] <elder> That being said, newer XFS would likely give you a better chance of survival.
[21:24] <pulsar> so, i dont say it is a xfs bug, but rather it might be related to ceph vs xfs, xfs being slow, hd being slow.
[21:24] <elder> But I won't assume that's an option for you.
[21:24] <pulsar> therefore i have to verify that using btrfs
[21:24] <Tv|work> pulsar: with that old kernel? probably not worth the effort...
[21:24] <pulsar> or any other fs.
[21:25] <gregaf> yeah, the assumption I had was that it was trying to create some 6k directories in a single dir all at once and hit some bottleneck
[21:25] <gregaf> we need to do something to improve our cluster startup ops :/
[21:25] <Tv|work> btrfs in the 2.6.32 era was most likely very bad, too
[21:25] <pulsar> so, what i have actually found might be a bug in ceph, as sjust stated, not handling failures during the initial pg create phase?
[21:25] <elder> More modern XFS will give you performance on metadata work (like mkdirs) that is comparable to or better than what btrfs offers.
[21:26] <Tv|work> gregaf: throttling so you don't start every operation at once
[21:26] <elder> I would expect your one node was busy writing out journal entries for all the mkdirs, and pretty much saturating your disk bandwidth.
[21:26] <pulsar> very likely
[21:26] <Tv|work> pulsar: ceph definitely could use improvement, but the stress you put it through should not have been enough to break it
[21:26] <pulsar> once i lowered the number of pgs (pg_bits) the system came up just fine
[21:26] <pulsar> i guess i could bump it up a little bit now
[21:27] <gregaf> Sam and I know what happened; it's just annoying and fixing it will be non-trivial
[21:27] <gregaf> since most other large-cluster users aren't reporting issues yet, it's probably not a priority…. :/
[21:27] <NaioN> going for a newer kernel isn't an option for you pulsar?
[21:27] <pulsar> running at 5k pgs / 200tb now
[21:27] <nhm> gregaf: what's "large-cluster" in this context?
[21:28] <pulsar> well... i might consider that
[21:28] <gregaf> nhm: well, nobody else has reported issues on startup of this nature :)
[21:28] <pulsar> that cluster is puppet managed, so if i crash it, i can still bring it back
[21:28] <Tv|work> nhm: something like >>5, <<1000
[21:28] <gregaf> and I know there are a few people running similarly-sized deployments
[21:28] <pulsar> i need to backup some tb though first
[21:28] <elder> I believe that massive mkdirs in the same directory is a particular case that the "delaylog" work on XFS a year or two ago will improve really dramatically.
[21:28] <nhm> gregaf: ok. Just curious what kind of testing users are doing.
[21:28] <pulsar> i could pull a new kernel from sid
[21:29] <NaioN> pulsar: i think that will most likely help you most
[21:29] <elder> I'd have to check but I think 2.6.38 offered delaylog, or maybe that's the release at which it became the default way XFS operated.
[21:29] <pulsar> i think, i need that new kernel really bad.
[21:29] <pulsar> using fuse right now
[21:30] <pulsar> aiming at 1b directories in a ceph fs
[21:30] <pulsar> i guess fuse won't be my best option for that
[21:30] <elder> Sounds like a really great test case for ceph.
[21:30] <pulsar> i'll let you know how it goes
[21:30] <pulsar> just started the test
[21:31] <elder> Please do. I think we really would like to see you be successful with this.
[21:31] <pulsar> just passed 80k
[21:31] * greglap (~Adium@aon.hq.newdream.net) has joined #ceph
[21:31] <NaioN> pulsar: I tried to use rsync backup (hardlinks) on cephfs with the kernel client
[21:32] <NaioN> but I didn't get it really stable
[21:32] <pulsar> :/
[21:32] <NaioN> And that was nowhere near 8b dirs...
[21:33] <greglap> NaioN: iirc your problem was with hard links and the AnchorTable, right?
[21:33] <NaioN> yes
[21:33] * filoo (~jens@ip-88-153-224-220.unitymediagroup.de) has joined #ceph
[21:34] <filoo> hi
[21:34] <NaioN> I did about 20 rsyncs at the same time
[21:34] <filoo> is there any known problem with setting up a fresh ceph at the moment ?
[21:35] <NaioN> data was a linux distro with user data
[21:35] <NaioN> rsyncs to separate dirs
[21:36] <Tv|work> NaioN: lots of hardlinks might be a non-optimal use for the ceph dfs
[21:36] <greglap> filoo: depends which components you're planning to use
[21:36] <Tv|work> NaioN: essentially, ceph embeds inodes to directories, so n-1 of the hardlinks need to go through an indirection layer
[21:36] <NaioN> Tv|work: yeah i noticed, now I use an intermediate server with rbd's formatted with ext3
[21:36] <Tv|work> NaioN: ahhh that changes the whole picture
[21:37] <NaioN> yes that setup is stable
[21:37] <filoo> greglap: 1 mds and 1 mon on one machine, 2 osds on separate machines
[21:37] <filoo> i've installed the 0.42
[21:37] <filoo> -> system ran fine before on 0.40
[21:37] <NaioN> Tv|work: I used to test the cephfs, but as I said, I didn't get it stable and I was aware that I was stretching cephfs a bit with the many inodes
[21:37] <filoo> now the osds didn't start
[21:38] <greglap> filoo: we don't consider the filesystem stable, but with 1MDS it handles boring stuff pretty well :)
[21:38] <NaioN> so I switched to RBDs and that has been stable for some time now
[21:38] <greglap> ah, I don't think there are any new issues you're going to run into, no
[21:39] <NaioN> But I use XFS on the disks, because I got troubles with performance with BTRFS (in the latest kernels)
[21:39] <NaioN> with older kernels I ran into btrfs bugs
[21:40] <filoo> NaioN -> same here .... problems with BTRFS .... using XFS now
[21:40] <NaioN> filoo: which kernel?
[21:40] <filoo> 3.2.2
[21:40] <NaioN> with the latest kernels (3.2.x) I don't have any crashes but I get real poor performance after a while
[21:41] <filoo> yep .... agree
[21:41] <nhm> NaioN: on btrfs?
[21:41] <NaioN> nhm: yeps
[21:41] <nhm> NaioN: there was some discussion about that a while back.
[21:41] <greglap> we think we might have tracked down what's happening there, although we haven't verified it
[21:42] <NaioN> nhm: yeah if I'm correct it was Christian ...
[21:42] <NaioN> He had made a graph
[21:42] <NaioN> and how to reproduce
[21:43] <greglap> oh, there were a lot of things that made it a little better, but we're still seeing issues
[21:43] <NaioN> I had the same issue
[21:43] <greglap> we think it has to do with the PG logs getting highly fragmented due to the numerous small allocations, but we haven't had time to verify it or try out different allocation patterns yet
[21:43] <filoo> greglap: i've just reinitialized our test cluster ...... with a very simple config. The OSDs didnt start
[21:44] <filoo> i get the following on "ceph osd tree"
[21:44] <greglap> filoo: heh — did the logs have any useful output?
[21:44] <filoo> root@cephnode1:~# ceph osd tree
[21:44] <filoo> 2012-02-22 21:43:58.718548 mon <- [osd,tree]
[21:44] <filoo> 2012-02-22 21:43:58.718993 mon.0 -> 'dumped osdmap tree epoch 1' (0)
[21:44] <filoo> # id weight type name up/down reweight
[21:44] <filoo> -1 2 domain root
[21:44] <filoo> 0 1 osd.0 DNE
[21:44] <filoo> 1 1 osd.1 DNE
[21:45] <filoo> created with the normal "mkcephfs -a -c /etc/ceph/ceph.conf" command
[21:45] <filoo> osd is marked "running"
[21:46] <filoo> root@cephnode2:~# /etc/init.d/ceph status
[21:46] <filoo> === osd.0 ===
[21:46] <filoo> osd.0: running...
[21:46] <nhm> greglap: how were you thinking about verifying the fragmentation?
[21:47] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) has joined #ceph
[21:48] <greglap> nhm: well, we can see it's fragmented using btrfs tools; what we'll have to do is try preallocating larger chunks or something
[21:48] <greglap> filoo: hmm, I'm not familiar with the tree stuff
[21:48] <greglap> what does ceph -s report?
[21:49] <filoo> root@cephnode2:/var/log# ceph -s
[21:49] <filoo> 2012-02-22 21:49:40.212537 pg v3: 396 pgs: 396 creating; 0 bytes data, 0 KB used, 0 KB / 0 KB avail
[21:49] <filoo> 2012-02-22 21:49:40.213736 mds e3: 1/1/1 up {0=alpha=up:creating}
[21:49] <filoo> 2012-02-22 21:49:40.213763 osd e1: 0 osds: 0 up, 0 in
[21:49] <filoo> 2012-02-22 21:49:40.213815 log 2012-02-22 21:42:13.343131 mon.0 2 : [INF] mds.? up:boot
[21:50] <filoo> 2012-02-22 21:49:40.213852 mon e1: 1 mons at {alpha=}
[21:50] <nhm> greglap: Makes sense. what size chunks are you seeing now?
[21:50] <filoo> osds are up and running .... network is ok
[21:50] <greglap> nhm: I really don't remember; I'm hearing all this second-hand from various people
[21:50] <filoo> no auth
[21:50] <nhm> ah, ok
[21:50] <greglap> filoo: okay, is there any logging enabled on the OSDs?
[21:52] <filoo> no .... standard logging.... just one other comment: if i use "ceph osd unpause" -> all osds will die with "file not found errors" in the logs
[21:52] <filoo> they will show up very short in "ceph -w"
[21:53] <filoo> root@cephnode2:/var/log# ceph osd unpause
[21:53] <filoo> 2012-02-22 21:52:41.270494 mon <- [osd,unpause]
[21:53] <filoo> 2012-02-22 21:52:41.573966 mon.0 -> 'unpause rd+wr' (0)
[21:53] <filoo> root@cephnode2:/var/log# ceph -w
[21:53] <filoo> 2012-02-22 21:52:44.564770 pg v6: 396 pgs: 396 creating; 0 bytes data, 0 KB used, 0 KB / 0 KB avail
[21:53] <filoo> 2012-02-22 21:52:44.565971 mds e3: 1/1/1 up {0=alpha=up:creating}
[21:53] <filoo> 2012-02-22 21:52:44.565999 osd e4: 2 osds: 2 up, 2 in
[21:53] <filoo> 2012-02-22 21:52:44.566050 log 2012-02-22 21:52:40.322765 mon.0 4 : [INF] osd.0 boot
[21:53] <filoo> 2012-02-22 21:52:44.566088 mon e1: 1 mons at {alpha=}
[21:53] <filoo> 2012-02-22 21:42:16.037283 7f4ca09b5780 journal _open /dev/sdb1 fd 24: 5379297280 bytes, block size 4096 bytes, directio = 1, aio = 0
[21:53] <filoo> 2012-02-22 21:52:43.591005 7f4c93716700 filestore(/data/osd0) error opening file /data/osd0/current/meta/osdmap.1__0_FD6E49B1 with flags=0 and mode=0: (2) No such file or directory
[21:53] <filoo> *** Caught signal (Aborted) **
[21:53] <filoo> in thread 7f4c93716700
[21:53] <NaioN> filoo: how did you prepare the osds?
[21:53] <greglap> okay, so did you pause them?
[21:53] <filoo> no
[21:53] <filoo> i didnt
[21:54] <greglap> so why did you try the unpause command?
[21:54] <greglap> I'm confused
[21:54] <NaioN> filoo: how did you prepare the osds?
[21:55] <filoo> i've mounted the xfs filesystem to /data/osd0
[21:55] <NaioN> I think you did something wrong there
[21:55] <NaioN> and then?
[21:55] <filoo> then issued "mkcephfs -a -c /etc/ceph/ceph.conf"
[21:55] <filoo> -> runs without errors
[21:55] * lofejndif (~lsqavnbok@78.Red-88-19-214.staticIP.rima-tde.net) has joined #ceph
[21:55] <NaioN> that's not correct
[21:56] <filoo> whats wrong with that ?
[21:56] <NaioN> thats for the complete cluster
[21:56] <NaioN> for a single osd you need another "format" command
[21:57] <NaioN> filoo: http://ceph.newdream.net/wiki/Replacing_a_failed_disk/OSD
[21:57] <filoo> yes -> as i mentioned before .... i've created a new cluster a few minutes ago
[21:57] <filoo> very simple config
[21:57] <filoo> 1 mds/mon
[21:57] <NaioN> yeah but you forgot something
[21:57] <NaioN> the osds have to get initialized
[21:58] <NaioN> you could use mkcephfs -a -c ... --mkbtrfs
[21:58] <NaioN> then mkcephfs will format and initialize the osds with btrfs (if you have the correct options in the conf)
[21:59] <NaioN> but because you use xfs you have to initialize them manually
[21:59] <filoo> -> i use xfs ..... but the same happened with btrfs
[22:00] <filoo> 2012-02-22 21:42:05.637108 7fd41eaae780 created object store /data/osd0 journal /dev/sdb1 for osd.0 fsid ac911edc-d686-c354-2aae-447388c79250
[22:00] <NaioN> see the page I posted, mind that with the newer cephs cosd = ceph-osd
[22:00] <NaioN> did you do anything with the osds after formatting the filesystem?
[22:00] <filoo> no
[22:00] <NaioN> because you have to
[22:01] <NaioN> so there are no files and dirs under /data/osd0?
[22:01] <filoo> sure ... a lot
[22:01] <NaioN> oh ok
[22:01] <filoo> root@cephnode2:/data/osd0# ls
[22:01] <filoo> ceph_fsid current fsid magic store_version whoami
[22:01] <filoo> root@cephnode2:/data/osd0#
[22:02] <filoo> i've initialized a lot of versions on this cluster before
[22:02] <NaioN> ok sorry i misunderstood... what you did is correct
[22:02] <filoo> ;) *phew*
[22:02] <NaioN> :)
[22:02] <filoo> sorry for my english
[22:03] <NaioN> mine isn't better :)
[22:04] <filoo> i've used the cluster from 0.25 on for testing
[22:04] <filoo> the initialisation was always the same
[22:04] <filoo> but now the osds are not starting
[22:04] <NaioN> on the osd what do you see in the log?
[22:05] <filoo> before issuing the "ceph osd unpause" command ?
[22:05] <NaioN> what you pasted are the last two lines?
[22:05] <filoo> just one moment .... i do it again ....
[22:05] <NaioN> I mean in the log on the osd
[22:05] <NaioN> not with the ceph command
[22:07] <filoo> ok .... i've reinitialized it now (freshly with mkcephfs)
[22:07] <filoo> what infos do you need ?
[22:07] <greglap> filoo: you should add debugging to the OSD configs before starting them (or do that and restart them)
[22:07] <NaioN> on the osd you get a log in /var/log/ceph for that osd
[22:08] <filoo> yep
[22:08] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[22:09] <filoo> may i post the last 20 lines ? or is it too much ?
[22:09] <Tv|work> filoo: pastebin
[22:09] <NaioN> or use something like pastebin
[22:09] <filoo> ok
[22:09] <Tv|work> 20 lines is enough for this irc server to throttle or even kick you
[22:10] <Tv|work> plus, irc truncates long lines
[22:10] <filoo> ouch .... ok
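A minimal sketch of pulling the last 20 lines out of a log before putting them on a pastebin, as discussed above. The helper name `last_lines` and the sample data are hypothetical, used here for illustration only; nothing in it is Ceph-specific.

```python
from collections import deque

def last_lines(lines, count=20):
    """Return the final `count` lines of any iterable of lines."""
    # A bounded deque keeps only the newest `count` items while iterating,
    # so the whole log never has to be held in memory.
    return list(deque(lines, maxlen=count))

# Stand-in data; a real run would pass an open log file instead.
sample = [f"line {i}" for i in range(1, 51)]
tail = last_lines(sample)
print(tail[0], "...", tail[-1])
```

To use it on a real log, pass an open file object for the OSD's log under /var/log/ceph; the exact file name depends on the setup and is not given in this conversation.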
[22:22] <filoo> http://pastebin.com/xYeWGrk7
[22:22] <filoo> from mkfs on
[22:26] * greglap (~Adium@aon.hq.newdream.net) Quit (Quit: Leaving.)
[22:27] <filoo> Tv|work: thanks for the pastebin tip .... never heard about that
[22:27] <gregaf> okay, so the OSD is sending a message to the monitor and not getting a reply
[22:28] <gregaf> filoo: the monitor is at, correct?
[22:28] <filoo> right
[22:28] <gregaf> okay, can you turn on debugging on the monitor so we can see why it's not replying to the OSD?
[22:28] <filoo> sure
[22:28] <filoo> just a moment
[22:29] <gregaf> if you don't want to restart you can use injectargs
[22:29] <gregaf> ceph mon tell 0 injectargs "--debug_mon 20"
[22:33] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[23:04] <filoo> gregaf: http://pastebin.de/23530
[23:06] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) Quit (Quit: Ex-Chat)
[23:07] <gregaf> filoo: huh, it doesn't seem to be getting the message from the OSD
[23:07] <gregaf> can you restart the osd again, this time adding "debug ms = 20" as well?
[23:08] <gregaf> errr, that's assuming the OSD is still running and trying to connect
[23:08] <gregaf> check the osd log and see if there's an auth message going out that should be in that monitor log
[23:09] <filoo> hmmm ..... meanwhile the osds have crashed
[23:09] <filoo> 2012-02-22 22:45:31.262523 7fc56aa48700 filestore(/data/osd0) error opening file /data/osd0/current/meta/osdmap.1__0_FD6E49B1 with flags=0 and mode=0: (2) No such file or directory
[23:09] <filoo> 2012-02-22 22:45:31.262535 7fc56aa48700 filestore(/data/osd0) FileStore::read(meta/fd6e49b1/osdmap.1/0) open error: (2) No such file or directory
[23:09] <filoo> *** Caught signal (Aborted) **
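The filestore errors pasted above can be picked out of a longer OSD log mechanically. The following is a rough sketch that assumes the exact message wording seen in this paste; the function name `find_open_errors` and the regular expression are illustrative and not part of Ceph.

```python
import re

# Pattern modelled on the "error opening file" line pasted above.
# Assumption: other Ceph versions may word this message differently.
OPEN_ERROR = re.compile(
    r"filestore\((?P<store>[^)]+)\) error opening file (?P<path>\S+) "
    r"with flags=\d+ and mode=\d+: \((?P<errno>\d+)\) (?P<msg>.+)$"
)

def find_open_errors(lines):
    """Return (store, path, errno, message) for each filestore open error."""
    hits = []
    for line in lines:
        match = OPEN_ERROR.search(line)
        if match:
            hits.append((
                match["store"],
                match["path"],
                int(match["errno"]),
                match["msg"].strip(),
            ))
    return hits

# The two log lines from the paste; only the first has the targeted wording.
sample = [
    "2012-02-22 22:45:31.262523 7fc56aa48700 filestore(/data/osd0) error opening file "
    "/data/osd0/current/meta/osdmap.1__0_FD6E49B1 with flags=0 and mode=0: "
    "(2) No such file or directory",
    "2012-02-22 22:45:31.262535 7fc56aa48700 filestore(/data/osd0) "
    "FileStore::read(meta/fd6e49b1/osdmap.1/0) open error: (2) No such file or directory",
]
for hit in find_open_errors(sample):
    print(hit)
```

The second pasted line (the `FileStore::read` one) is deliberately not matched here; it reports the same failure one layer up, so counting both would double the number of errors.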
[23:10] <gregaf> okay, that looks like the same one you were triggering with the unpause
[23:10] <gregaf> actually that's probably a bug, they seem to think they have an OSDMap before they really do :/
[23:11] <gregaf> can you zip up one of those logs and create a bug on the tracker for it?
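One possible way to compress a log before attaching it to a tracker issue, sketched with Python's standard `gzip` module. The helper `compress_log` and the demo file name `osd.0.log` are hypothetical; the real log name on the OSD host is not stated in this conversation.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def compress_log(log_path):
    """Write a gzip copy of log_path next to it and return the new path."""
    src = Path(log_path)
    dst = src.with_name(src.name + ".gz")
    # Stream the copy so large logs are not loaded into memory at once.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst

# Demo on a throwaway file; the file name is a placeholder, not a real Ceph path.
demo_dir = Path(tempfile.mkdtemp())
demo_log = demo_dir / "osd.0.log"
demo_log.write_text("*** Caught signal (Aborted) **\n")
archive = compress_log(demo_log)
print(archive.name)
```

The original log is left in place, so the OSD can keep writing to it while the compressed copy is uploaded.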
[23:11] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[23:11] <filoo> yes .... i will do that tomorrow :)
[23:11] <filoo> its late here ;)
[23:11] <filoo> thanks for the support
[23:31] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[23:35] * lofejndif (~lsqavnbok@78.Red-88-19-214.staticIP.rima-tde.net) Quit (Ping timeout: 480 seconds)
[23:39] * filoo (~jens@ip-88-153-224-220.unitymediagroup.de) Quit (Quit: Verlassend)
[23:41] * yehudasa__ (~yehudasa@aon.hq.newdream.net) has joined #ceph
[23:42] * sjust1 (~sam@aon.hq.newdream.net) has joined #ceph
[23:42] * gregaf1 (~Adium@aon.hq.newdream.net) has joined #ceph
[23:42] * sjust (~sam@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[23:43] * joshd1 (~joshd@aon.hq.newdream.net) has joined #ceph
[23:43] * sagewk1 (~sage@aon.hq.newdream.net) has joined #ceph
[23:44] * dmick1 (~dmick@aon.hq.newdream.net) has joined #ceph
[23:44] * lofejndif (~lsqavnbok@207.Red-88-19-214.staticIP.rima-tde.net) has joined #ceph
[23:44] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[23:44] * gregaf2 (~Adium@aon.hq.newdream.net) has joined #ceph
[23:45] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[23:48] * Tv|work (~Tv__@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:48] * yehudasa_ (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:48] * joshd (~joshd@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:48] * dmick (~dmick@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:49] * gregaf (~Adium@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:49] * sagewk (~sage@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:49] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[23:50] * gregaf1 (~Adium@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:50] * sjust1 (~sam@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:50] * yehudasa__ (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:51] * joshd1 (~joshd@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.