#ceph IRC Log

IRC Log for 2011-06-14

Timestamps are in GMT/BST.

[13:50] -coulomb.oftc.net- *** Looking up your hostname...
[13:50] -coulomb.oftc.net- *** Checking Ident
[13:50] -coulomb.oftc.net- *** No Ident response
[13:50] -coulomb.oftc.net- *** Found your hostname
[13:50] * CephLogBot (~PircBot@rockbox.widodh.nl) has joined #ceph
[13:54] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[14:18] * bhem (~bhem@9KCAAA409.tor-irc.dnsbl.oftc.net) has joined #ceph
[14:22] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[14:54] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:01] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[15:01] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[15:02] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[15:03] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[15:07] * mtk (TXfTZmmdxw@panix2.panix.com) has joined #ceph
[15:08] * mtk (TXfTZmmdxw@panix2.panix.com) Quit ()
[15:08] * mtk (pvWV0uQLvX@panix2.panix.com) has joined #ceph
[16:06] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[16:12] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[16:30] * mtk (pvWV0uQLvX@panix2.panix.com) Quit (Remote host closed the connection)
[16:30] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[16:39] * jmlowe (~Adium@mobile-166-137-143-169.mycingular.net) Quit (Read error: Connection reset by peer)
[16:54] * Yulya_ (~Yu1ya_@ip-95-220-189-27.bb.netbynet.ru) has joined #ceph
[16:57] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[17:08] * bhem (~bhem@9KCAAA409.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[17:14] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:15] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[17:45] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:45] * ghaskins_mobile (~ghaskins_@130.57.22.201) has joined #ceph
[17:57] * ghaskins_mobile (~ghaskins_@130.57.22.201) Quit (Quit: This computer has gone to sleep)
[17:59] * slang (~slang@chml01.drwholdings.com) has joined #ceph
[17:59] <slang> hi all
[18:00] <slang> I'm trying to set up a ceph deployment with my own crushmap, on a 6-node rig, with 5 devices (osds) running on each node
[18:02] <slang> if I specify the crushmap at mkcephfs time, everything starts fine but it looks like the osds just remain in the peering state forever (days), and I'm never able to mount
[18:04] <slang> but if I use the default crushmap when I make the fs, the peering process completes much more quickly, and I'm able to mount, and then go back and change the crushmap with ceph osd getcrushmap/setcrushmap, and still able to access the fs after a minor delay
[18:05] <slang> is this expected behavior, and if so, can someone explain what's happening when the crushmap is specified at mkcephfs time that seems to get it stuck forever?
[18:05] <slang> I should probably note that I'm using a really wimpy setup, the nodes are just laptops with a gige network
[18:06] <slang> so all the peering could just be flooding the network and causing timeouts
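The getcrushmap/setcrushmap round trip slang describes can be sketched as a short command sequence. This is a minimal sketch, not slang's actual procedure: the file names and the `crushmap_edit_commands` helper are hypothetical, and the flags follow the documented `ceph` and `crushtool` CLIs, which may differ on a 2011 build. It only builds the commands; it does not run them, and it says nothing about why peering stalls when the map is supplied at mkcephfs time.

```python
import shlex


def crushmap_edit_commands(workdir="/tmp"):
    """Return argv lists for a fetch / decompile / recompile / inject round trip.

    All file names are hypothetical placeholders chosen for illustration.
    """
    binary = f"{workdir}/crushmap.bin"      # compiled map fetched from the cluster
    text = f"{workdir}/crushmap.txt"        # decompiled map, edited by hand
    new_binary = f"{workdir}/crushmap.new"  # recompiled map to inject
    return [
        ["ceph", "osd", "getcrushmap", "-o", binary],
        ["crushtool", "-d", binary, "-o", text],
        ["crushtool", "-c", text, "-o", new_binary],
        ["ceph", "osd", "setcrushmap", "-i", new_binary],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in crushmap_edit_commands():
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch safe to run on a machine with no cluster attached.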
[18:11] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[18:12] * ghaskins_mobile (~ghaskins_@130.57.22.201) has joined #ceph
[18:13] * sagewk1 (~sage@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[18:16] * ghaskins_mobile (~ghaskins_@130.57.22.201) Quit ()
[18:20] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[18:24] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:27] * sjust (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:27] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) Quit (Remote host closed the connection)
[18:33] * NashTrash (~Adium@129.59.105.122) has left #ceph
[18:40] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:51] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:59] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:00] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:01] * cmccabe (~cmccabe@208.80.64.174) has joined #ceph
[19:06] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[19:08] * alexxy[home] (~alexxy@79.173.81.171) has joined #ceph
[19:09] * alexxy (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[19:17] * alexxy[home] (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[19:18] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[19:40] <cmccabe> meeting?
[19:40] <Tv> cmccabe: sage and yehuda are still interviewing a candidate i think
[19:40] <sjust> sage and yehuda are talking to someone in their office
[19:41] <cmccabe> k
[19:43] <bchrisman> got another mtg...
[19:51] * fred_ (~fred@200-228.77-83.cust.bluewin.ch) has joined #ceph
[19:51] <fred_> hello
[19:52] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[19:53] <joshd> hi
[19:54] <joshd> are you 'ar Fred' in our bug tracker?
[19:54] <fred_> I managed to reproduce #998, I posted stacktraces
[19:54] <fred_> yes
[19:54] <fred_> that's why I'm here, in case you need more info about #998
[19:54] <joshd> cool, I'm looking at it now
[19:55] <fred_> good, it would be very nice to have a fix for it as it is quite a pain for me... I'm available almost full time for debugging, ...
[19:58] <joshd> awesome, you're the first one to reproduce with the full trace. What ceph version is this? a few line numbers have changed in master
[19:59] <fred_> indicated in the bug. this is the stable branch from 2-4 days ago
[19:59] <fred_> yeah, I battled the whole weekend to be able to generate these core dumps!
[19:59] <joshd> ah, right
[20:03] * ghaskins_mobile (~ghaskins_@130.57.22.201) has joined #ceph
[20:11] <sagewk> standup!
[20:16] * alexxy[home] (~alexxy@79.173.81.171) has joined #ceph
[20:19] * alexxy (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[20:39] <sjust> wido: are you there?
[20:43] <joshd> fred_: sorry for the wait, could you open one of the cores in gdb, 'frame 8', and 'p *this'?
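The request above (open a core, select frame 8, print `*this`) can also be run non-interactively, which makes it easy to repeat across several core files. A minimal sketch, assuming gdb's standard `-batch` and `-ex` options; the `gdb_batch_argv` helper and the binary and core paths are hypothetical, since the log does not say where fred_'s files live.

```python
import shlex


def gdb_batch_argv(binary, core, frame=8, expr="*this"):
    """Build a gdb command line that selects a frame, prints an expression, and exits."""
    return [
        "gdb", "-batch",
        "-ex", f"frame {frame}",  # select the stack frame joshd asked for
        "-ex", f"p {expr}",       # print the object behind 'this' in that frame
        binary, core,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(shlex.join(gdb_batch_argv("/usr/bin/cosd", "/tmp/core.1234")))
```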
[21:09] * alexxy[home] (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[21:10] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[21:23] * alexxy[home] (~alexxy@79.173.81.171) has joined #ceph
[21:26] * alexxy (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[21:57] * alexxy[home] (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[21:59] <wido> sjust: here now
[21:59] <sjust> wido: it seemed that some of the atom machines did have rather high load
[21:59] <sjust> wido: and so osds were being marked down incorrectly
[22:00] <wido> sjust: I just saw your update
[22:00] <sjust> wido: I turned off logging (probably contributing to the slowness) and turned the heartbeat grace interval way up and restarted the osds
[22:01] <wido> ok, cool
[22:01] <wido> I did notice that osd.17 crashed, but I'll report an issue for that separately
[22:01] <sjust> ok
[22:01] <wido> have full logs + core dump
[22:02] <wido> sjust: Your guess is lack of CPU power?
[22:03] <sjust> well, more a cascading failure
[22:05] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[22:06] <wido> Ok, I'm pretty happy that it isn't raining asserts today
[22:07] <cmccabe> :)
[22:07] <cmccabe> are you using multi-mds?
[22:08] <sjust> wido: doing that seems to be causing it to make some progress in peering, but atom0 has 3 osds I can't restart
[22:09] <wido> cmccabe: me? No, single MDS. First things first, getting the OSD's up and running
[22:09] <cmccabe> wido: k
[22:09] <wido> sjust: Only one zombie left now on atom0
[22:10] <wido> That's some behaviour I'm seeing lately, OSD's going Zombie after killing them. Sometimes it takes up to a few hours before they die
[22:12] <sjust> wido: easiest thing to do might be to just restart the machines with stuck processes, most of the ones that didn't restart seem to be because of stuck cosd processes
[22:17] <wido> sjust: I gave all the machines a reboot this afternoon for their upgrade to .39.1. You'd recommend rebooting atom0 for now?
[22:17] <wido> My guess, I'll run into the same issues within a few hours?
[22:17] <sjust> yeah, probably
[22:19] * alexxy[home] (~alexxy@79.173.81.171) has joined #ceph
[22:20] <wido> sjust: Anything I can do to help you track this down? Or any suggestions?
[22:21] <sjust> wido: I'm thinking about it now
[22:21] <wido> sjust: K, just update the issue if you have anything. I'm going afk in a bit
[22:21] <wido> Feel free to reboot the machines if you want to
[22:21] <sjust> ok, thanks!
[22:22] * alexxy (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[22:22] <sjust> now that they aren't getting marked out quickly, the loads seem to be pretty bad
[22:22] <sjust> 22:22:03 up 11:03, 1 user, load average: 229.00, 229.04, 229.44
[22:22] <sjust> 22:22:04 up 11:03, 0 users, load average: 131.52, 131.85, 132.37
[22:22] <sjust> 22:22:19 up 11:00, 0 users, load average: 6.26, 5.35, 4.95
[22:22] <sjust> 22:22:20 up 11:03, 0 users, load average: 145.14, 144.43, 144.27
[22:22] <sjust> 22:22:20 up 11:03, 1 user, load average: 125.05, 128.81, 127.91
[22:22] <sjust> 22:22:21 up 11:03, 0 users, load average: 141.04, 140.88, 140.61
[22:22] <sjust> 22:22:23 up 11:03, 0 users, load average: 3.16, 6.46, 7.32
[22:22] <sjust> 22:22:35 up 11:03, 0 users, load average: 7.10, 7.27, 5.98
[22:22] <sjust> 22:22:35 up 11:00, 0 users, load average: 96.21, 96.46, 94.60
[22:22] <sjust> 22:22:37 up 10:59, 0 users, load average: 2.77, 2.09, 1.97
[22:24] <darkfaded> 229 aint bad :)
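The uptime lines sjust pasted can be scanned mechanically to pick out the overloaded machines. A minimal sketch: the sample lines are copied from the paste above, while the threshold of 50 and the helper names are assumptions made for illustration. The pasted lines carry no host names, so results are keyed by position in the list.

```python
import re

# Matches the three load figures at the end of an uptime line.
LOAD_RE = re.compile(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)")

# Three of the lines from the paste above.
SAMPLE = [
    "22:22:03 up 11:03, 1 user, load average: 229.00, 229.04, 229.44",
    "22:22:19 up 11:00, 0 users, load average: 6.26, 5.35, 4.95",
    "22:22:37 up 10:59, 0 users, load average: 2.77, 2.09, 1.97",
]


def parse_load(line):
    """Return the (1, 5, 15)-minute load averages, or None if the line has none."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def overloaded(lines, threshold):
    """Return (index, one-minute load) for every line above the threshold."""
    hits = []
    for index, line in enumerate(lines):
        load = parse_load(line)
        if load is not None and load[0] > threshold:
            hits.append((index, load[0]))
    return hits


if __name__ == "__main__":
    # The threshold is an arbitrary illustrative value, not a Ceph recommendation.
    print(overloaded(SAMPLE, threshold=50.0))
```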
[22:25] * verwilst (~verwilst@dD5769271.access.telenet.be) has joined #ceph
[22:29] <sjust> wido: Tv and I are looking at vmstat and we are starting to think we are seeing a btrfs bug
[22:38] * alexxy[home] (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[22:39] <wido> sjust: Ok, cool. I upgraded to .39.1 to rule out any already fixed bugs in btrfs
[22:39] <wido> I'm going afk now, tnx guys!
[22:39] <sjust> ok!
[22:45] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[23:01] * alexxy[home] (~alexxy@79.173.81.171) has joined #ceph
[23:02] * alexxy (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[23:10] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[23:12] * alexxy[home] (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[23:15] * alexxy[home] (~alexxy@79.173.81.171) has joined #ceph
[23:18] * alexxy (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[23:19] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has joined #ceph
[23:20] * lx0 (~aoliva@186.214.49.80) Quit (Read error: Connection reset by peer)
[23:21] * lx0 (~aoliva@83TAABT1T.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:35] * ghaskins_mobile (~ghaskins_@130.57.22.201) Quit (Quit: This computer has gone to sleep)
[23:41] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[23:42] * wilfrid (5138106c@ircip2.mibbit.com) has joined #ceph
[23:43] * alexxy[home] (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[23:43] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[23:47] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has joined #ceph
[23:56] <sagewk> tv: can i tell orchestra to ignore missing host keys?
[23:56] <sagewk> -o StrictHostKeyChecking=no
[23:58] * fred_ (~fred@200-228.77-83.cust.bluewin.ch) Quit (Quit: Leaving)
[23:58] <Tv> sagewk: there's options in paramiko for that but.. hold on
[23:58] <Tv> sagewk: *missing* host keys, not wrong ones?
[23:58] <sagewk> missing yeah
[23:58] <Tv> i think that was supposed to work
[23:58] <Tv> let me check
[23:58] <sagewk> didn't for me
[23:58] <sagewk> once they were added all was well tho
[23:59] <Tv> yeah i see it, should be a short fix, i remember looking at this
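Tv's question about missing versus wrong host keys maps onto distinct settings of the OpenSSH option sagewk quoted. A minimal sketch using the plain `ssh` command line; it is not how orchestra works, since the log does not show its code. The `ssh_argv` helper and host name are hypothetical, only a subset of the option's values is handled, and `accept-new` requires OpenSSH 7.6 or later, which postdates this log. On the paramiko side, the documented counterpart for accepting unknown hosts is `SSHClient.set_missing_host_key_policy` with `AutoAddPolicy`.

```python
import shlex

# Subset of StrictHostKeyChecking values covered by this sketch:
#   "yes"        refuse hosts whose key is missing or has changed
#   "accept-new" add missing keys, still refuse changed ones (OpenSSH 7.6+)
#   "no"         add missing keys and also proceed when a key has changed
ALLOWED_POLICIES = {"yes", "accept-new", "no"}


def ssh_argv(host, command, policy="yes"):
    """Build an ssh command line with an explicit host-key checking policy."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unsupported policy: {policy!r}")
    return ["ssh", "-o", f"StrictHostKeyChecking={policy}", host, command]


if __name__ == "__main__":
    # Hypothetical host name, for illustration only.
    print(shlex.join(ssh_argv("node1.example.com", "uptime", policy="accept-new")))
```

Where it is available, `accept-new` is the closer match to what sagewk asked for, because a changed key is still treated as an error.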

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.