#ceph IRC Log


IRC Log for 2011-08-26

Timestamps are in GMT/BST.

[0:10] * cp (~cp@74.85.19.35) Quit (Ping timeout: 480 seconds)
[0:18] * cp (~cp@74.85.19.35) has joined #ceph
[0:20] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[0:24] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[0:58] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[0:59] * aliguori (~anthony@32.97.110.59) Quit (Quit: Ex-Chat)
[0:59] <adjohn> Are there any chef recipes floating around for ceph?
[1:01] <sjust> not so far
[1:05] <lxo> mmm, ceph yummy ;-)
[1:06] <gregaf1> lxo: have you had any trouble with 1318 lately?
[1:07] <gregaf1> http://tracker.newdream.net/issues/1318
[1:35] <cp> So, one of my ceph osd nodes died and the btrfs file system I had on it (mounted from a local file) is corrupted. How do I go about adding this node back with, say, a blank osd to replace the now missing one?
[1:41] <sjust> cp: starting the osd with a blank filestore should do it
[1:44] <cp> sjust: ** ERROR: unable to open OSD superblock on /mnt/foo/osd1: No such file or directory
[1:44] <cp> failed: 'ssh node1 /usr/bin/cosd -i 1 -c /tmp/ceph.conf.4278 '
[1:44] <cp> :/mnt/foo# ls
[1:44] <cp> log osd1 out
[1:45] <cp> This is running "ceph -a start" from a monitor node.
[1:45] <sjust> ah
[1:45] <sjust> I am looking up the command
[1:45] <cp> (I mkdir'd osd1 as well to see if that would help)
[1:46] <sjust> http://ceph.newdream.net/wiki/Replacing_a_failed_disk/OSD
[1:46] <sjust> start at the cosd --mkfs part
[1:47] <sjust> you want to ssh into the node, run cosd --mkfs -i <num> --monmap <path to monmap>
[1:47] <sjust> then copy over the keyring and run ceph start
[1:48] <sjust> ah, nvm, need to do the top part to extract the monmap and keyring
[1:50] <cp> sjust: thanks
[1:50] <sjust> sure
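For reference, sjust's recipe boils down to something like the following. This is a sketch only: the monmap path, config path, and the `ceph mon getmap` extraction step are assumptions based on the era's tooling, not confirmed commands from this log.

```shell
# On a surviving node, extract the current monitor map (path illustrative)
ceph mon getmap -o /tmp/monmap

# On the failed node, initialize a blank object store for the dead osd
cosd --mkfs -i 1 -c /etc/ceph/ceph.conf --monmap /tmp/monmap

# Copy the keyring into the new osd data dir, then start the daemon
service ceph start osd1
```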
[1:55] <cp> "service ceph start osd1" didn't seem to work though (used "service ceph -a start" which seems like overkill). Also, "service ceph stop" on node1 doesn't stop osd1 (on that node). Am I missing something?
[1:56] <sjust> hmm, service ceph -a start did work?
[1:56] <cp> yup
[1:56] <sjust> ok
[1:56] <sjust> looking up the script
[1:59] <sjust> odd, I think that service ceph start from node1 should start osd1
[1:59] <sjust> same with stop
[2:00] <cp> yeah, that's what I was expecting too. Seems like it's totally crashed now too. Ah well...
[2:00] <sjust> how did it crash?
[2:00] <Tv> adjohn: we're setting up cookbooks, but they're not quite ready enough for consumption
[2:01] <Tv> adjohn: i understand our current QA cluster is set up purely by chef, but we still need to add a few features to make osd deployment work right
[2:02] <Tv> we really should make those repos public already..
[2:02] <adjohn> Tv: are those available somewhere? We are working on a crowbar deployment involving Chef, and it would be great to collaborate
[2:02] <Tv> adjohn: sadly i don't think they are; i'll nudge people in that direction
[2:03] <adjohn> cool thanks
[2:05] <cp> sjust: I'm ssh'd in. My terminal just froze. Let's see if it happens again.
[2:09] <sjust> cp: ok
[2:09] <sjust> cp: I'm heading out for the day, I'll be back on in about 15 hours
[2:09] <cp> sjust: thanks. :)
[2:10] <sjust> cp: sure
[2:11] * cmccabe (~cmccabe@69.170.166.146) has left #ceph
[2:39] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[2:57] * bchrisman (~Adium@64.164.138.146) Quit (Quit: Leaving.)
[3:26] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:26] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[3:32] * cp (~cp@74.85.19.35) Quit (Quit: cp)
[3:45] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[3:56] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[4:00] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[5:05] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:05] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit ()
[7:37] * nido_c (~nidodo@201.200.154.78) has joined #ceph
[7:37] * nido_c (~nidodo@201.200.154.78) has left #ceph
[8:26] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Server closed connection)
[8:26] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[9:18] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[9:35] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[10:36] * monrad (~mmk@domitian.tdx.dk) has joined #ceph
[10:38] * _Shiva__ (shiva@whatcha.looking.at) has joined #ceph
[10:38] * wonko_be (bernard@november.openminds.be) has joined #ceph
[10:38] * f4m8_ (~f4m8@lug-owl.de) has joined #ceph
[10:38] * Anticime1 (anticimex@netforce.csbnet.se) has joined #ceph
[10:38] * _Shiva_ (shiva@whatcha.looking.at) Quit (charon.oftc.net magnet.oftc.net)
[10:38] * monrad-51468 (~mmk@domitian.tdx.dk) Quit (charon.oftc.net magnet.oftc.net)
[10:38] * wonko_be_ (bernard@november.openminds.be) Quit (charon.oftc.net magnet.oftc.net)
[10:38] * stingray (~stingray@stingr.net) Quit (charon.oftc.net magnet.oftc.net)
[10:38] * Anticimex (anticimex@netforce.csbnet.se) Quit (charon.oftc.net magnet.oftc.net)
[10:38] * f4m8 (~f4m8@lug-owl.de) Quit (charon.oftc.net magnet.oftc.net)
[10:44] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[10:49] * stingray (~stingray@stingr.net) has joined #ceph
[10:52] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[12:18] <johnl_> hey cephers. redmine is giving connection refused.
[14:01] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[14:41] * Juul (~Juul@gw1.imm.dtu.dk) has joined #ceph
[14:49] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:27] * Juul (~Juul@gw1.imm.dtu.dk) Quit (Quit: Leaving)
[17:32] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[17:32] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[17:37] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[18:16] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[18:29] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[18:35] * Meths_ (rift@2.27.72.229) has joined #ceph
[18:38] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:38] * Meths (rift@2.25.193.139) Quit (Ping timeout: 480 seconds)
[18:40] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:47] * Meths (rift@2.25.193.199) has joined #ceph
[18:50] * Meths_ (rift@2.27.72.229) Quit (Ping timeout: 480 seconds)
[19:16] * bchrisman (~Adium@64.164.138.146) has joined #ceph
[19:36] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (Quit: Ex-Chat)
[19:42] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:53] <cp> Question: So, when I run "ceph -s" I'm getting messages like this: "ceph -s
[19:53] <cp> 2011-08-26 10:52:16.919956 7f94f12eb700 -- 172.16.16.136:0/14671 >> 172.16.16.138:6789/0 pipe(0x135cce0 sd=3 pgs=0 cs=0 l=0).fault first fault"
[19:53] <sjust> cp: that probably means that the Monitor isn't running?
[19:54] <cp> Hmm.. It claimed to start a moment ago:
[19:54] <cp> starting mon.1 rank 1 at 172.16.16.137:6789/0 mon_data /mnt/foo/mon1 fsid 723a2589-5e63-7fea-ad5f-acedbc70d24d
[19:54] <sjust> cp: is the process running?
[19:54] <cp> How do I check that?
[19:55] <sjust> pgrep cmon on the appropriate machine, I suppose
[19:56] <cp> Right, cmon isn't running on that node. How do I force it to start? "service ceph start" on the node does basically nothing
[19:56] <cp> This is my conf file:
[19:56] <sjust> actually, it may have crashed, do you see anything in the logs?
[19:56] <cp> [global]
[19:56] <cp> log dir = /mnt/foo/out
[19:56] <cp> logger dir = /mnt/foo/log
[19:56] <cp> chdir = ""
[19:56] <cp> pid file = /mnt/foo/out/$type$id.pid
[19:56] <cp> [mds]
[19:56] <cp> pid file = /mnt/foo/out/$name.pid
[19:56] <cp> lockdep = 1
[19:56] <cp> debug ms = 1
[19:56] <cp> debug mds = 20
[19:56] <cp> mds log max segments = 2
[19:56] <cp> [osd]
[19:56] <cp> lockdep = 1
[19:56] <cp> debug ms = 1
[19:57] <cp> debug osd = 25
[19:57] <cp> debug journal = 20
[19:57] <cp> debug filestore = 10
[19:57] <cp> [mon]
[19:57] <cp> lockdep = 1
[19:57] <cp> debug mon = 20
[19:57] <cp> debug paxos = 20
[19:57] <cp> debug ms = 1
[19:57] <cp> [mon0]
[19:57] <cp> host = node10
[19:57] <cp> mon addr = 172.16.16.136:6789
[19:57] <cp> mon data = /mnt/foo/mon0
[19:57] <cp> [osd0]
[19:57] <cp> host = node10
[19:57] <cp> mon addr = 172.16.16.136:6790
[19:57] <cp> osd data = /mnt/foo/osd0
[19:57] <cp> osd journal = /mnt/foo/osd0/journal
[19:57] <cp> osd journal size = 100
[19:57] <cp> [mon1]
[19:57] <cp> host = node11
[19:57] <cp> mon addr = 172.16.16.137:6789
[19:57] <cp> mon data = /mnt/foo/mon1
[19:57] <cp> [osd1]
[19:57] <cp> host = node11
[19:57] <cp> mon addr = 172.16.16.137:6790
[19:57] <cp> osd data = /mnt/foo/osd1
[19:57] <cp> osd journal = /mnt/foo/osd1/journal
[19:57] <cp> osd journal size = 100
[19:57] <cp> [mon2]
[19:57] <cp> host = node12
[19:57] <cp> mon addr = 172.16.16.138:6789
[19:57] <cp> mon data = /mnt/foo/mon2
[19:57] <cp> [osd2]
[19:57] <cp> host = node12
[19:57] <cp> mon addr = 172.16.16.138:6790
[19:57] <cp> osd data = /mnt/foo/osd2
[19:57] <cp> osd journal = /mnt/foo/osd2/journal
[19:57] <cp> osd journal size = 100
[19:57] <cp> [mds.a]
[19:57] <cp> sjust: only osd logs
[19:57] <cp> sjust: you mean the ceph logs, right?
[19:57] <sjust> yeah, there should be monitor logs as well
[20:01] <cp> Can't see anything in particular. In the mon directory there are some log files, but:
[20:02] <cp> -rw-r--r-- 1 root root 83449 2011-08-26 10:25 log
[20:02] <cp> -rw-r--r-- 1 root root 83449 2011-08-26 10:25 log.debug
[20:02] <cp> -rw-r--r-- 1 root root 55 2011-08-26 09:56 log.err
[20:02] <cp> -rw-r--r-- 1 root root 83449 2011-08-26 10:25 log.info
[20:02] <cp> drwxr-xr-x 1 root root 4972 2011-08-26 10:59 logm
[20:02] <cp> they haven't been updated in a while
[20:02] <cp> Still can't get the monitors to start
[20:04] <cp> OK, so I have monitors up on two of the three nodes (other one refuses to actually start), but I'm still getting the message " ceph -s
[20:04] <cp> 2011-08-26 11:03:21.914247 7f3334f39700 -- :/16158 >> 172.16.16.138:6789/0 pipe(0xdf2ce0 sd=3 pgs=0 cs=0 l=0).fault first fault"
[20:04] <cp> (.138 has a cmon process up and running)
[20:05] <sjust> you can point the ceph tool to the working one with ceph -m <ip>
[20:05] <sjust> cp: could you post the command and output of attempting to start mon0?
[20:06] <cp> root@ubuntu:/mnt/foo/log# service ceph -a start
[20:06] <cp> === mon.0 ===
[20:06] <cp> Starting Ceph mon0 on node10...already running
[20:06] <cp> === mon.1 ===
[20:06] <cp> Starting Ceph mon1 on node11...
[20:06] <cp> ** WARNING: Ceph is still under heavy development, and is only suitable for **
[20:06] <cp> ** testing and review. Do not trust it with important data. **
[20:06] <cp> starting mon.1 rank 1 at 172.16.16.137:6789/0 mon_data /mnt/foo/mon1 fsid 723a2589-5e63-7fea-ad5f-acedbc70d24d
[20:06] <cp> === mon.2 ===
[20:06] <cp> Starting Ceph mon2 on node12...
[20:06] <cp> ** WARNING: Ceph is still under heavy development, and is only suitable for **
[20:06] <cp> ** testing and review. Do not trust it with important data. **
[20:06] <cp> starting mon.2 rank 2 at 172.16.16.138:6789/0 mon_data /mnt/foo/mon2 fsid 723a2589-5e63-7fea-ad5f-acedbc70d24d
[20:06] <cp> === mds.a ===
[20:06] <cp> Starting Ceph mds.a on ubuntu...already running
[20:06] <cp> === osd.0 ===
[20:06] <cp> Starting Ceph osd0 on node10...already running
[20:06] <cp> === osd.1 ===
[20:06] <cp> Starting Ceph osd1 on node11...
[20:06] <cp> ** WARNING: Ceph is still under heavy development, and is only suitable for **
[20:06] <cp> ** testing and review. Do not trust it with important data. **
[20:06] <cp> starting osd1 at 0.0.0.0:6804/9652 osd_data /mnt/foo/osd1 /mnt/foo/osd1/journal
[20:06] <cp> object store /mnt/foo/osd1 is currently in use (cosd already running?)
[20:06] <cp> ** ERROR: initializing osd failed: Device or resource busy
[20:06] <cp> failed: 'ssh node11 /usr/bin/cosd -i 1 -c /tmp/ceph.conf.16592
[20:07] <sjust> there isn't any monitor log file at /mnt/foo/out/ or /mnt/foo/log on node 11?
[20:07] <sjust> */mnt/foo/log/
[20:07] <cp> root@ubuntu:/mnt/foo# cd log
[20:07] <cp> root@ubuntu:/mnt/foo/log# ls
[20:07] <cp> osd.1.log
[20:08] <cp> root@ubuntu:/mnt/foo# ls
[20:08] <cp> log mon1 osd1 out
[20:08] <sjust> nothing in /mnt/foo/out/?
[20:09] <cp> ah, yes
[20:09] <cp> root@ubuntu:/mnt/foo/out# tail mon.1.log
[20:09] <cp> 2011-08-26 10:51:00.400295 7f0367fa7740 -- 172.16.16.137:6789/0 messenger.start
[20:09] <cp> 2011-08-26 10:51:00.400325 7f0367fa7740 -- 172.16.16.137:6789/0 messenger.start daemonizing
[20:09] <cp> 2011-08-26 10:51:00.405328 7f0367fa7740 -- 172.16.16.137:6789/0 accepter.start
[20:09] <cp> 2011-08-26 10:51:00.649373 7f0367fa7740 mon.1@1(starting) e1 init fsid 723a2589-5e63-7fea-ad5f-acedbc70d24d
[20:09] <cp> 2011-08-26 10:51:00.658270 7f0367fa7740 store(/mnt/foo/mon1) get_int pgmap/last_pn = 201
[20:09] <cp> 2011-08-26 10:51:00.665686 7f0367fa7740 store(/mnt/foo/mon1) get_int pgmap/accepted_pn = 400
[20:10] <cp> 2011-08-26 10:51:00.665826 7f0367fa7740 store(/mnt/foo/mon1) get_int pgmap/last_committed = 505
[20:10] <cp> 2011-08-26 10:51:00.665906 7f0367fa7740 store(/mnt/foo/mon1) get_int pgmap/first_committed = 5
[20:10] <cp> 2011-08-26 10:51:00.665931 7f0367fa7740 mon.1@1(starting).paxos(pgmap recovering lc 505) init
[20:10] <cp> 2011-08-26 10:51:00.666817 7f0367fa7740 store(/mnt/foo/mon1) get_int mdsmap/last_pn = 201
[20:10] <Tv> cp: i just realized he had "log dir" in the config -- you really want to use "log file" instead
[20:11] <Tv> s/he/you/
[20:12] <cp> So what should those lines look like?
[20:13] <Tv> something like log file = /var/log/ceph/$name.$id.log
[20:13] <Tv> err $name.log
[20:14] <Tv> because $name already contains $id
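In ceph.conf terms, Tv's suggestion amounts to a fragment like this (the directory is illustrative, not taken from the log):

```
[global]
        log file = /var/log/ceph/$name.log
```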
[20:14] <gregaf1> cp: can you run ceph -s -m [working_monitor_address] and paste it?
[20:16] <cp> gregaf1: hmmm...
[20:16] <cp> root@ubuntu:/mnt/foo/log# pgrep cmon
[20:16] <cp> 14126
[20:16] <cp> root@ubuntu:/mnt/foo/log# ceph -s -m 172.16.16.136
[20:16] <cp> ^C
[20:16] <cp> It just freezes
[20:16] <sjust> port?
[20:16] <gregaf1> yeah
[20:16] <gregaf1> ceph -s -m 172.16.16.136:6789
[20:17] <cp> root@ubuntu:/mnt/foo/log# ceph -s -m 172.16.16.136:6789
[20:17] <cp> freezes too
[20:17] <gregaf1> although it shouldn't freeze
[20:17] <gregaf1> oh, maybe it guesses if you leave port off
[20:17] <gregaf1> okay, so the whole thing's just borked
[20:17] <gregaf1> do you have any data in here or has it never started up properly?
[20:18] <cp> How I got here (I think) was by suspending two of the three nodes and then resuming them.
[20:18] <cp> (I'm running with VMs right now to make testing easier)
[20:19] <gregaf1> oh, so it was working, and then you suspended 2/3, and when you resumed everything broke?
[20:19] <cp> basically
[20:20] <gregaf1> okay
[20:20] <cp> I then restarted .138 to see if that changed anything. stopped all the services and started them again. No luck
[20:21] <gregaf1> so no new maps could have gotten committed while they were suspended, but maybe when they came back up....
[20:21] <gregaf1> all right, we're going to have to do this piecemeal I suspect
[20:21] <gregaf1> I don't think any of us have dealt with this before
[20:21] <gregaf1> so we're going to shut all the nodes down, then try and build the system back up to working from a single monitor
[20:22] <cp> ok
[20:22] <gregaf1> first step: stop all the daemons
[20:22] <cp> done
[20:23] <gregaf1> second step: start up mon0 manually:
[20:23] <cp> what command should do that?
[20:23] <cp> ok
[20:23] <gregaf1> we're going to use some extra arguments to turn up the debugging:
[20:23] <cp> cmon -i 0
[20:23] <gregaf1> cmon -i 0 —debug_ms 1 —debug_mon 10
[20:23] * TuckerB (~Tucker@64.124.146.138) has joined #ceph
[20:23] <gregaf1> sjust: think we want any other logging?
[20:24] <gregaf1> oh, sorry, forgot config file
[20:24] <sjust> probably not
[20:24] <gregaf1> cmon -i 0 -c /path/to/conf —debug_ms 1 —debug_mon 10
[20:25] <cp> root@ubuntu:/mnt/foo/mon0# cmon -i 0 -c /etc/ceph/ceph.conf —debug_ms 1 —debug_mon 10
[20:25] <cp> ** WARNING: Ceph is still under heavy development, and is only suitable for **
[20:25] <cp> ** testing and review. Do not trust it with important data. **
[20:25] <cp> usage: cmon -i monid [--mon-data=pathtodata] [flags]
[20:25] <cp> --debug_mon n
[20:25] <cp> debug monitor level (e.g. 10)
[20:25] <cp> --mkfs
[20:25] <cp> build fresh monitor fs
[20:25] <cp> --debug_ms N
[20:25] <cp> set message debug level (e.g. 1)
[20:25] <cp> -D debug (no fork, log to stdout)
[20:25] <cp> -f foreground (no fork, log to file)
[20:25] <cp> -c ceph.conf or --conf=ceph.conf
[20:25] <cp> get options from given conf file
[20:25] <gregaf1> oh, looks like you copied my auto-corrected text
[20:25] <gregaf1> you want a double-dash, not an em-dash on those word arguments :)
[20:26] <cp> still no luck with "cmon -i 0 -c /etc/ceph/ceph.conf —-debug_ms 1 —-debug_mon 10"
[20:26] <cp> ah
[20:26] <cp> I understand
[20:26] <cp> done
[20:27] <gregaf1> okay, can you pastebin the mon log?
[20:27] <gregaf1> and see if you can get a ceph -s off that monitor
[20:28] <cp> how much
[20:29] <gregaf1> there shouldn't be too much since you started it up? all of it if it'll fit
[20:29] <cp> here's the tail:
[20:29] <cp> lease_expire=0.000000 has v0 lc 4
[20:29] <cp> 2011-08-26 11:28:22.411441 7f204ef72700 mon.0@0(starting).paxosservice(auth) waiting for paxos -> readable (v0)
[20:29] <cp> 2011-08-26 11:28:22.411481 7f204ef72700 -- 172.16.16.136:6789/0 <== osd1 172.16.16.137:6800/8212 1 ==== auth(proto 0 26 bytes) v1 ==== 52+0+0 (2050109519 0 0) 0x12d1490 con 0x12cf8f0
[20:29] <cp> 2011-08-26 11:28:22.411502 7f204ef72700 mon.0@0(starting) e1 do not have session, making new one
[20:29] <cp> 2011-08-26 11:28:22.411516 7f204ef72700 mon.0@0(starting) e1 ms_dispatch new session MonSession: osd1 172.16.16.137:6800/8212 is open for osd1 172.16.16.137:6800/8212
[20:29] <cp> 2011-08-26 11:28:22.411531 7f204ef72700 mon.0@0(starting) e1 setting timeout on session
[20:29] <cp> 2011-08-26 11:28:22.411541 7f204ef72700 mon.0@0(starting).paxosservice(auth) dispatch auth(proto 0 26 bytes) v1 from osd1 172.16.16.137:6800/8212
[20:29] <cp> 2011-08-26 11:28:22.411555 7f204ef72700 mon.0@0(starting).paxos(auth recovering lc 4) is_readable now=2011-08-26 11:28:22.411555 lease_expire=0.000000 has v0 lc 4
[20:29] <cp> 2011-08-26 11:28:22.411568 7f204ef72700 mon.0@0(starting).paxosservice(auth) waiting for paxos -> readable (v0)
[20:29] <cp> 2011-08-26 11:28:23.986165 7f204e771700 mon.0@0(starting).elector(1) election timer expired
[20:29] <cp> 2011-08-26 11:28:23.986266 7f204e771700 mon.0@0(starting).elector(1) start -- can i be leader?
[20:29] <cp> 2011-08-26 11:28:23.986280 7f204e771700 mon.0@0(starting) e1 starting_election 1
[20:29] <cp> 2011-08-26 11:28:23.986317 7f204e771700 mon.0@0(starting).paxos(pgmap recovering lc 506) election_starting -- canceling timeouts
[20:29] <cp> 2011-08-26 11:28:23.986332 7f204e771700 mon.0@0(starting).paxos(mdsmap recovering lc 5) election_starting -- canceling timeouts
[20:29] <cp> 2011-08-26 11:28:23.986357 7f204e771700 mon.0@0(starting).paxos(osdmap recovering lc 34) election_starting -- canceling timeouts
[20:29] <cp> 2011-08-26 11:28:23.986370 7f204e771700 mon.0@0(starting).paxos(logm recovering lc 847) election_starting -- canceling timeouts
[20:29] <cp> 2011-08-26 11:28:23.986405 7f204e771700 mon.0@0(starting).paxos(class recovering lc 3) election_starting -- canceling timeouts
[20:29] <cp> 2011-08-26 11:28:23.986445 7f204e771700 mon.0@0(starting).paxos(monmap recovering lc 1) election_starting -- canceling timeouts
[20:29] <cp> 2011-08-26 11:28:23.986458 7f204e771700 mon.0@0(starting).paxos(auth recovering lc 4) election_starting -- canceling timeouts
[20:29] <cp> 2011-08-26 11:28:23.986470 7f204e771700 mon.0@0(starting).paxosservice(pgmap) election_starting
[20:29] <cp> 2011-08-26 11:28:23.986481 7f204e771700 mon.0@0(starting).paxosservice(mdsmap) election_starting
[20:29] <cp> 2011-08-26 11:28:23.986491 7f204e771700 mon.0@0(starting).paxosservice(osdmap) election_starting
[20:29] <cp> 2011-08-26 11:28:23.986501 7f204e771700 mon.0@0(starting).paxosservice(logm) election_starting
[20:29] <cp> 2011-08-26 11:28:23.986533 7f204e771700 mon.0@0(starting).paxosservice(class) election_starting
[20:29] <cp> 2011-08-26 11:28:23.986545 7f204e771700 mon.0@0(starting).paxosservice(monmap) election_starting
[20:29] <cp> 2011-08-26 11:28:23.986555 7f204e771700 mon.0@0(starting).paxosservice(auth) election_starting
[20:29] <cp> 2011-08-26 11:28:23.986576 7f204e771700 -- 172.16.16.136:6789/0 --> mon1 172.16.16.137:6789/0 -- election(propose 1) v1 -- ?+0 0x12d19c0
[20:29] <cp> 2011-08-26 11:28:23.986717 7f204e771700 -- 172.16.16.136:6789/0 --> mon2 172.16.16.138:6789/0 -- election(propose 1) v1 -- ?+0 0x12d1bf0
[20:29] <gregaf1> eek
[20:30] <gregaf1> I meant pastebin.org :)
[20:30] <gregaf1> okay, so that actually looks okay
[20:30] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[20:30] <gregaf1> cp: start up mon.1 with the same command set
[20:32] <cp> ok
[20:32] <cp> root@ubuntu:/mnt/foo/mon1/monmap# cmon -i 1 -c /etc/ceph/ceph.conf --debug_ms 1 --debug_mon 10
[20:32] <cp> ** WARNING: Ceph is still under heavy development, and is only suitable for **
[20:32] <cp> ** testing and review. Do not trust it with important data. **
[20:32] <cp> starting mon.1 rank 1 at 172.16.16.137:6789/0 mon_data /mnt/foo/mon1 fsid 723a2589-5e63-7fea-ad5f-acedbc70d24d
[20:32] <gregaf1> okay
[20:33] <gregaf1> so give it a few seconds and then try ceph -s -m live_monitor_ip
[20:33] <gregaf1> see if it still hangs
[20:34] <cp> root@ubuntu:/mnt/foo/mon1/monmap# ceph -s -m 172.16.16.137
[20:34] <cp> 2011-08-26 11:34:17.848043 7f4650030700 -- :/12119 >> 172.16.16.137:6789/0 pipe(0x2450b20 sd=3 pgs=0 cs=0 l=0).fault first fault
[20:34] <cp> ^C
[20:34] <cp> root@ubuntu:/mnt/foo/mon1/monmap# ceph -s -m 172.16.16.136
[20:34] <cp> second one hangs; so behavior is just as before.
[20:34] <gregaf1> 172.16.16.136 is one of the ones we just started up?
[20:35] <cp> mon0 yes
[20:35] <cp> 137 is mon1
[20:35] <gregaf1> okay, can you paste the tail of the log into pastebin.org and give me the link?
[20:36] <cp> http://pastebin.com/faU7XBMC
[20:38] <gregaf1> okay, can you run "ceph -s -m 172.16.16.137:6789"?
[20:38] <gregaf1> I just want to make sure the behavior is the same, I don't remember if it guesses ports
[20:38] <cp> Here are the mon1 logs, which are a little more interesting
[20:38] <cp> http://pastebin.com/FW2UuV3s
[20:39] <gregaf1> umm, hrm
[20:39] <gregaf1> OH
[20:40] <gregaf1> your clocks are off by over an hour
[20:40] <gregaf1> I just assumed they would resync properly on resume; all the VM stuff I've used does that
[20:40] <gregaf1> that is almost certainly your culprit
[20:41] <cp> yes - I just noticed that too
[20:41] <gregaf1> you'll need to sync up the time on all the machines running monitors
[20:41] <gregaf1> or they won't play well together
[20:42] <cp> that makes sense. Is there a standard way to do that kind of thing?
[20:42] <gregaf1> ummmm.....yes
[20:42] <gregaf1> *looks around for linux admin*
[20:42] <gregaf1> I think it's ntp?
[20:42] <gregaf1> sjust: ntp, right?
[20:43] <sjust> uh, probably
[20:43] <gregaf1> Tv says ntp or ntpdate
[20:44] <gregaf1> in this case probably ntpdate, and then ntp in the future
[20:56] <gregaf1> okay, lunchtime here; we'll be back in 30 or so
[20:56] <cp> have a good lunch
[20:59] <Tv> suspending a vm is no reason for clocks to go wrong though
[20:59] <Tv> ntp won't save you if you're that far off; ntpdate is customarily done at boot, so you'd need to do that manually
[21:00] <Tv> hmm perhaps the vm doesn't notice that it was suspended, and thus doesn't get to fix its clock
[21:00] <Tv> i was thinking it'd work more like how with laptop suspend you get do actions before/after
[21:00] <Tv> i do wonder why the kernel wouldn't detect huge jiffie jump and recalibrate clock
[21:00] <Tv> err well not jiffie jump, something like tsc jump
[21:01] <Tv> hmm
[21:02] <Tv> google gives me relevant suse docs: http://doc.opensuse.org/products/opensuse/openSUSE/opensuse-kvm/cha.libvirt.config.html#sec.kvm.managing.clock
[21:03] <Tv> i can confirm my ubuntu 10.10 vm under kvm uses kvm_clock automagically
[21:03] <Tv> and then you should *not* use ntp inside the vm ;)
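The one-shot-then-daemon sequence Tv and gregaf1 describe would look roughly like this on each monitor host. The NTP server and init script name are assumptions; and per Tv's point, skip this entirely inside a VM that already uses kvm_clock.

```shell
# Step the clock immediately; ntpd alone refuses corrections this large
ntpdate pool.ntp.org

# Then keep it in sync going forward
service ntp start
```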
[21:05] * bchrisman (~Adium@64.164.138.146) Quit (Remote host closed the connection)
[21:06] * bchrisman (~Adium@64.164.138.146) has joined #ceph
[21:08] <cp> I'm using VMware fusion, which I'm surprised would be off.
[21:08] <cp> The timestamps on the files actually looked fine too
[21:08] <cp> forgot to check "date" before doing ntp
[21:09] <cp> anyway, at least there's no freezing when I do "ceph -s" now
[21:09] <cp> just regular errors: ceph -s
[21:09] <cp> 2011-08-26 11:56:36.005907 7f443549b700 -- 172.16.16.137:0/14846 >> 172.16.16.137:6789/0 pipe(0x11e5b00 sd=3 pgs=0 cs=0 l=0).fault first fault
[21:10] <cp> This one is interesting because it's from the machine to itself (137 to 137). Any ideas?
[21:10] <Tv> oh linux vm clocks under vmware were totally broken the last time i had to deal with that
[21:10] <Tv> installing vmware-tools helped somewhat, not always
[21:11] <cp> OK. I have them installed at least
[21:11] <Tv> it's been a while though
[21:12] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[21:17] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[21:24] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[21:30] * bchrisman (~Adium@64.164.138.146) Quit (Quit: Leaving.)
[21:32] <gregaf1> cp: once you've got all your clocks synced you probably want to start up each of your monitors and see if they live
[21:33] <gregaf1> if they still hit that assert we'll need to work out what's going on there, otherwise you probably just need to re-add any OSDs that have been marked out (just a command) and start up the MDSes
[21:41] <cp> gregaf1: I'm not sure if it was a time problem. mon1 is starting up (though it dies again), but the log file isn't getting any longer. The last entries are still 10:51, even though the system clock is certainly at the right time
[21:42] <cp> The time on the file mon.1.log is up to date too, so something is touching it, but nothing is being written to it
[21:51] * greglap (~Adium@aon.hq.newdream.net) has joined #ceph
[21:52] <greglap> cp: the logs won't get longer without debugging enabled, and without turning on the debugging in the conf or when you start it up it won't log most stuff :)
[21:53] <greglap> so try with those commands and we'll see if it's the same crash, which was something to do with storage going badly
[21:55] <cp> still no logs from mon1
[21:55] <cp> root@ubuntu:/mnt/foo/out# ls -la
[21:55] <cp> total 61256
[21:55] <cp> drwxr-xr-x 1 root root 122 2011-08-26 11:31 .
[21:55] <cp> dr-xr-xr-x 1 root root 28 2011-08-26 09:55 ..
[21:55] <cp> -rw-r--r-- 1 root root 0 2011-08-26 12:54 mds.a.log
[21:55] <cp> -rw-r--r-- 1 root root 0 2011-08-26 10:59 mds.a.pid
[21:55] <cp> -rw-r--r-- 1 root root 0 2011-08-26 11:56 mon.0.log
[21:55] <cp> -rw-r--r-- 1 root root 10817618 2011-08-26 12:40 mon.1.log
[21:55] <cp> -rw-r--r-- 1 root root 0 2011-08-26 12:49 mon1.pid
[21:55] <cp> -rw-r--r-- 1 root root 51897390 2011-08-26 12:54 osd.1.log
[21:55] <cp> -rw-r--r-- 1 root root 0 2011-08-26 10:52 osd1.pid
[21:55] <cp> root@ubuntu:/mnt/foo/out# cmon -i 1 -c /etc/ceph/ceph.conf --debug_ms 1 --debug_mon 10
[21:55] <cp> ** WARNING: Ceph is still under heavy development, and is only suitable for **
[21:55] <cp> ** testing and review. Do not trust it with important data. **
[21:55] <cp> starting mon.1 rank 1 at 172.16.16.137:6789/0 mon_data /mnt/foo/mon1 fsid 723a2589-5e63-7fea-ad5f-acedbc70d24d
[21:55] <cp> root@ubuntu:/mnt/foo/out# ls -la
[21:55] <cp> total 61256
[21:55] <cp> drwxr-xr-x 1 root root 122 2011-08-26 11:31 .
[21:55] <cp> dr-xr-xr-x 1 root root 28 2011-08-26 09:55 ..
[21:56] <cp> -rw-r--r-- 1 root root 0 2011-08-26 12:55 mds.a.log
[21:56] <cp> -rw-r--r-- 1 root root 0 2011-08-26 10:59 mds.a.pid
[21:56] <cp> -rw-r--r-- 1 root root 0 2011-08-26 11:56 mon.0.log
[21:56] <cp> -rw-r--r-- 1 root root 10817618 2011-08-26 12:55 mon.1.log
[21:56] <cp> -rw-r--r-- 1 root root 0 2011-08-26 12:55 mon1.pid
[21:56] <cp> -rw-r--r-- 1 root root 51897390 2011-08-26 12:55 osd.1.log
[21:56] <cp> -rw-r--r-- 1 root root 0 2011-08-26 10:52 osd1.pid
[21:56] <cp> root@ubuntu:/mnt/foo/out# tail mon.1.log
[21:56] <cp> 2011-08-26 10:51:00.400295 7f0367fa7740 -- 172.16.16.137:6789/0 messenger.start
[21:56] <cp> 2011-08-26 10:51:00.400325 7f0367fa7740 -- 172.16.16.137:6789/0 messenger.start daemonizing
[21:56] <cp> 2011-08-26 10:51:00.405328 7f0367fa7740 -- 172.16.16.137:6789/0 accepter.start
[21:56] <cp> 2011-08-26 10:51:00.649373 7f0367fa7740 mon.1@1(starting) e1 init fsid 723a2589-5e63-7fea-ad5f-acedbc70d24d
[21:56] <cp> 2011-08-26 10:51:00.658270 7f0367fa7740 store(/mnt/foo/mon1) get_int pgmap/last_pn = 201
[21:56] <cp> 2011-08-26 10:51:00.665686 7f0367fa7740 store(/mnt/foo/mon1) get_int pgmap/accepted_pn = 400
[21:56] <cp> 2011-08-26 10:51:00.665826 7f0367fa7740 store(/mnt/foo/mon1) get_int pgmap/last_committed = 505
[21:56] <cp> 2011-08-26 10:51:00.665906 7f0367fa7740 store(/mnt/foo/mon1) get_int pgmap/first_committed = 5
[21:56] <cp> 2011-08-26 10:51:00.665931 7f0367fa7740 mon.1@1(starting).paxos(pgmap recovering lc 505) init
[21:56] <cp> 2011-08-26 10:51:00.666817 7f0367fa7740 store(/mnt/foo/mon1) get_int mdsmap/last_pn = 201
[21:56] <cp> root@ubuntu:/mnt/foo/out#
[21:58] <cp> gregaf1: sorry for the long dump there
[21:59] <greglap> cp: did you start it up again with the debugging options?
[22:00] <cp> yup, I killed the process and used "cmon -i 1 -c /etc/ceph/ceph.conf --debug_ms 1 --debug_mon 10"
[22:00] <greglap> sjust: I guess I'm going to be sitting in on a meeting; can you look at this more?
[22:00] <greglap> cp: odd
[22:00] <greglap> I really have no idea then why it wouldn't update...
[22:02] <cp> oh - no space left on device. I wonder how I managed to chew up that much space already
[22:04] <cp> Well, mon0 and mon2 are up at least
[22:05] <greglap> cp: well, that would also cause things to break
[22:05] <cp> :)
[22:05] <greglap> actually, it probably was the cause of that crash in the log
[22:06] <cp> Ah. So it would be great to get the cluster up with mon0 and mon2 and then rebuild mon1 from a fresh file system
[22:06] <greglap> well if you can clean up some space you can probably get them all up
[22:07] <cp> that would be great too
[22:08] <jojy> is there an inject arg cmd for changing mds log level
[22:08] <greglap> jojy: yeah, use the inject-args framework with —debug_mds x
[22:09] <greglap> you can update all the config options with inject_args, and some of them that get transferred to other variables won't take but all the logging stuff will
[22:10] <jojy> greglap: thanks. I am sorry i am trying to find the inject-arg wiki and am having a hard time.
[22:10] <greglap> iirc it's just a tell
[22:11] <greglap> ceph mds tell * inject_args —debug_mds 10
[22:11] <greglap> somebody else might correct me
[22:15] <Tv> quote the *
[22:16] <Tv> wow greg gave you a fancy em-dash.. he meant "--" ;)
[22:16] <Tv> greglap: your client is too fancy for a geek channel
[22:17] <greglap> yeah, I could probably disable that for this
[22:26] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[22:27] <jojy> greglap: not sure that worked. changing log level to 0 still logs stuff
[22:29] <greglap> what's it logging?
[22:32] <jojy> greglap: i tried "ceph mds tell * inject_args --debug_mds 0" (also tried replacing * with IP address)
[22:33] <jojy> i still see "2011-08-26 20:31:59.300582 7ff06fffe700 mds0.4 send_message_client_counted client4114 seq 16972 client_caps(grant ino 1000000001a "
[22:33] <jojy> on the active mds log
[22:33] <greglap> jojy: is it actually sending the message?
[22:33] <greglap> sjust will need to take over now, meeting starting
[22:33] * greglap (~Adium@aon.hq.newdream.net) Quit (Quit: Leaving.)
[22:34] <jojy> 2011-08-26 20:31:59.300582 7ff06fffe700 mds0.4 send_message_client_counted client4114 seq 16972 client_caps(grant ino 1000000001a
[22:34] <jojy> sorry
[22:34] <jojy> 192.168.98.115:0/32046 --> 192.168.98.115:6789/0 -- mon_command(mds tell anaconda-ks.cfg install.log install.log.syslog last_nuke_file post-ks.log inject_args v 0) v1 -- ?+0 0xecc270 con 0xecf780
[22:34] <jojy> 192.168.98.115:0/32046 <== mon0 192.168.98.115:6789/0 4 ==== mon_command_ack([install.log,install.log.syslog,last_nuke_file,post-ks.log,inject_args]=0 ok v28) v1 ==== 117+0+0 (3643399399 0 0) 0x7f3a4c000bd0 con 0xecf780
[22:34] <jojy> 2011-08-26 20:31:32.748490 mon0 -> 'ok' (0)
[22:39] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[22:43] <sjust> jojy: it looks like bash expanded the *
[22:43] * greglap (~Adium@aon.hq.newdream.net) has joined #ceph
[22:44] * greglap (~Adium@aon.hq.newdream.net) Quit ()
[22:46] <jojy> you are right. I would have guessed that giving the IP address shouldn't have that issue
[22:47] <sjust> oh, you would have to specify the mds name, not the ip
[22:48] <jojy> ahh
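sjust's diagnosis is easy to reproduce without ceph at all: bash expands an unquoted `*` against the files in the current directory before the command ever runs, which is exactly why the mon_command log above shows file names like `install.log` where the mds name should be. A minimal demonstration:

```shell
# In a fresh directory with two files, compare unquoted vs quoted *
cd "$(mktemp -d)"
touch install.log post-ks.log
echo tell *     # unquoted: bash expands it -> "tell install.log post-ks.log"
echo tell '*'   # quoted: the command sees a literal * -> "tell *"
```

Hence `ceph mds tell '*' ...` (with the `*` quoted, as Tv says) is what reaches the monitor as a wildcard.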
[22:56] * greglap (~Adium@aon.hq.newdream.net) has joined #ceph
[23:04] * greglap (~Adium@aon.hq.newdream.net) Quit (Quit: Leaving.)
[23:21] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[23:34] * TuckerB (~Tucker@64.124.146.138) Quit (Remote host closed the connection)
[23:43] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[23:45] * adjohn (~adjohn@50.0.103.34) Quit ()
[23:51] * adjohn (~adjohn@50.0.103.34) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.