#ceph IRC Log

IRC Log for 2012-02-15

Timestamps are in GMT/BST.

[0:34] <darkfader> re kdump over network: it's been fucking slow for me at $oldjob
[0:35] <darkfader> even with all possible page exclusion smartness and so on
[0:35] <darkfader> i think (cannot prove) that it was bandwidth spikes on their network and latency
[0:35] <darkfader> but it's also not multithreaded and so on
[0:36] <darkfader> dump time even to disk was 3-4 fold over hp-ux
[0:36] <darkfader> besides it usually doesn't work if the box didn't crash due to a very easy issue
[0:38] <darkfader> you'd manually test sysrq-c (works), test nmi (works), and then it locks up for real, you trigger an nmi, and nothing happens :)
[0:48] <Tv|work> darkfader: the amount of data to shuffle is the same, it should be a race of hdd vs 10gige-to-hdd-with-buffering.. with both driven without interrupts etc annoyances, but still.
[0:49] <Tv|work> i'd just like to have it be a feature we can safely tell customers to turn on
[0:49] <Tv|work> at $very_old_job, we used lkcd for that, but those boxes never had more than transient data; this time, the data is way more important
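
For context, the exchange above is about dumping a crashed kernel's memory over the network with kdump. A minimal sketch of that kind of setup on a RHEL-style system is below; the remote host, key path and exact values are illustrative assumptions, not details from this conversation.

    # /etc/kdump.conf (RHEL-style; illustrative values only)
    # Ship the vmcore to a remote collector over SSH instead of local disk.
    ssh kdump@dumphost.example.com
    sshkey /root/.ssh/kdump_id_rsa

    # Drop zero, cache, free and userspace pages (-d 31) and compress (-c)
    # so far less data has to cross the network.
    core_collector makedumpfile -c -d 31 --message-level 1
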
[0:50] * joao is now known as Guest2610
[0:50] * joao_ (~joao@89-181-154-123.net.novis.pt) has joined #ceph
[0:50] * joao_ is now known as joao
[0:56] * Guest2610 (~joao@89.181.154.123) Quit (Ping timeout: 480 seconds)
[0:59] * joao (~joao@89-181-154-123.net.novis.pt) Quit (Quit: joao)
[1:03] * BManojlovic (~steki@212.200.241.85) Quit (Remote host closed the connection)
[1:30] * Tv|work (~Tv__@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:07] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:10] * bchrisman (~Adium@108.60.121.114) Quit (Quit: Leaving.)
[2:19] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Connection reset by peer)
[2:23] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[2:32] * lofejndif (~lsqavnbok@57.Red-88-19-214.staticIP.rima-tde.net) Quit (Quit: Leaving)
[2:36] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[2:45] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[2:53] * jantje (~jan@paranoid.nl) has joined #ceph
[3:54] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:08] * chutzpah (~chutz@216.174.109.254) Quit (Quit: Leaving)
[4:12] * bergwolf (~983e2c39@webuser.thegrebs.com) has joined #ceph
[4:20] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Operation timed out)
[4:22] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:24] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[4:26] * bergwolf (~983e2c39@webuser.thegrebs.com) Quit (Quit: TheGrebs.com CGI:IRC)
[4:29] * joshd1 (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:30] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[4:30] * joshd (~joshd@aon.hq.newdream.net) Quit ()
[4:32] * yoshi (~yoshi@u662141.xgsfmg3.imtp.tachikawa.mopera.net) has joined #ceph
[4:33] * yoshi_ (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[4:40] * yoshi (~yoshi@u662141.xgsfmg3.imtp.tachikawa.mopera.net) Quit (Ping timeout: 480 seconds)
[5:34] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Read error: Operation timed out)
[5:35] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[5:44] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[5:56] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[6:17] * alexxy (~alexxy@79.173.81.171) Quit (Quit: No Ping reply in 180 seconds.)
[6:17] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[6:26] * alexxy (~alexxy@79.173.81.171) Quit (Quit: No Ping reply in 180 seconds.)
[6:27] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[6:50] * jantje (~jan@paranoid.nl) has joined #ceph
[7:58] * MT`AwAy (~meuh@110.50.70.24) has joined #ceph
[8:02] <MT`AwAy> hm
[8:20] * MT`AwAy (~meuh@110.50.70.24) has left #ceph
[8:22] * fronlius (~fronlius@f054184098.adsl.alicedsl.de) has joined #ceph
[8:30] * f4m8_ is now known as f4m8
[8:31] <f4m8> with the debian wheezy version 0.40-1 i'm trying to set up rbd
[8:32] <f4m8> i run mkcephfs, which populates my four machines: osd00, osd01, mon and mds
[8:33] <f4m8> the fifth machine, the client, where i run "mkcephfs -a -c rbd-cluster.conf -k rbd-cluster.keyring", is not able to auth against the mon
[8:34] <f4m8> my conf is here http://paste.debian.net/156288
[8:34] <f4m8> Where does ceph_mon read its auth key?
[8:35] <f4m8> a "ceph-conf -s mon --lookup keyring" gives me a filename which exists and has the same section and key = value as my client keyring
[8:35] * yoshi_ (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[8:36] <f4m8> i'm pretty sure i'm missing something, but i don't see it..
[8:47] <f4m8> an strace of ceph_mon starting up doesn't show it opening any file for the key, so i wonder where ceph_mon reads its auth key
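
For reference, a 0.40-era setup like f4m8's would normally point the daemons and the admin client at keyrings via ceph.conf; a minimal sketch follows, with paths and values assumed for illustration rather than taken from f4m8's paste. The monitor also imports keys into its own mon data store when mkcephfs runs, which may be why the strace shows no separate keyring file being opened, though that is a guess.

    ; illustrative ceph.conf fragment for cephx authentication (assumed values)
    [global]
            auth supported = cephx

    [mon]
            ; keyring path the monitor is pointed at (assumed location)
            keyring = /etc/ceph/keyring.$name

    [client.admin]
            keyring = /etc/ceph/keyring.admin
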
[8:47] * joao (~joao@89.181.154.123) has joined #ceph
[8:59] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[9:00] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit ()
[9:17] * joao (~joao@89.181.154.123) Quit (Quit: joao)
[9:52] * joao (~joao@193.136.122.17) has joined #ceph
[11:05] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[12:25] * jantje (~jan@paranoid.nl) has joined #ceph
[12:33] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[12:34] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[12:53] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) has left #ceph
[13:06] * jantje (~jan@paranoid.nl) has joined #ceph
[14:54] * ninkotech_lite (~dp@ip-89-103-90-23.net.upcbroadband.cz) has joined #ceph
[14:54] * ninkotech_lite (~dp@ip-89-103-90-23.net.upcbroadband.cz) Quit ()
[14:56] * ninkotech_lite (~dp@ip-89-103-90-23.net.upcbroadband.cz) has joined #ceph
[15:53] * ninkotech_lite (~dp@ip-89-103-90-23.net.upcbroadband.cz) Quit (Quit: Konversation terminated!)
[16:00] * ninkotech_lite (~dp@ip-89-103-90-23.net.upcbroadband.cz) has joined #ceph
[16:33] * ninkotech_lite (~dp@ip-89-103-90-23.net.upcbroadband.cz) Quit (Ping timeout: 480 seconds)
[16:39] * andreask (~andreas@62.178.13.131) has joined #ceph
[17:32] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[17:35] * andresambrois (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[17:35] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Read error: Connection reset by peer)
[17:42] * aa_ (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[17:43] * andresambrois (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Read error: Operation timed out)
[17:52] * andreask (~andreas@62.178.13.131) Quit (Quit: Leaving.)
[17:55] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[17:56] <jmlowe> joshd around today?
[18:08] * joao (~joao@193.136.122.17) Quit (Ping timeout: 480 seconds)
[18:12] * gregorg_taf (~Greg@78.155.152.6) Quit (Quit: Quitte)
[18:13] * joao (~joao@193.136.122.17) has joined #ceph
[18:15] <SpamapS> will 'make check-local' run a standalone test suite that doesn't need any network access?
[18:15] * SpamapS is doing the build now.. but it takes.. so..long..
[18:27] * joao (~joao@193.136.122.17) Quit (Ping timeout: 480 seconds)
[18:56] * joao (~joao@89.181.154.123) has joined #ceph
[18:57] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[18:58] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[19:02] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Connection reset by peer)
[19:03] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[19:12] * fronlius_ (~fronlius@e182095021.adsl.alicedsl.de) has joined #ceph
[19:13] * chutzpah (~chutz@216.174.109.254) has joined #ceph
[19:14] * lofejndif (~lsqavnbok@57.Red-88-19-214.staticIP.rima-tde.net) has joined #ceph
[19:18] * fronlius (~fronlius@f054184098.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[19:18] * fronlius_ is now known as fronlius
[19:29] <Ludo> /server +chat.ofoto.com
[19:29] <Ludo> oups
[19:29] <Ludo> sorry
[19:34] * aa_ (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Read error: Connection reset by peer)
[19:35] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[19:35] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[19:35] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[19:38] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[19:39] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[19:46] <SpamapS> Hm.. can the python tests be run w/o doing any pip installs?
[19:51] <SpamapS> seems like not.. :-/
[20:06] <jmlowe> nhm: you were interested in the arrays we are using, I've got some news there
[20:08] <nhm> jmlowe: cool, what's new?
[20:08] <SpamapS> hm, and 'cram' is not packaged, so at least for now, I think the test suite will have to remain "off"
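
For reference, cram is a small pip-installable test harness; a sketch of pulling it into a throwaway virtualenv instead of a system package is below (the virtualenv location and the test directory are assumptions).

    # One-off environment just for running the cram-based tests
    virtualenv /tmp/ceph-test-env
    /tmp/ceph-test-env/bin/pip install cram
    # Run the *.t test files (directory path assumed)
    /tmp/ceph-test-env/bin/cram src/test/cli
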
[20:09] <jmlowe> I have 5 of 24 drives in predictive failure and when these go into predictive failure they apparently don't go offline
[20:10] <jmlowe> also I think they are cooking themselves off
[20:10] <jmlowe> Current Temperature (C): 36
[20:10] <jmlowe> Maximum Temperature (C): 45
[20:10] <jmlowe> specs say max operating temp is 35
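
For reference, those temperatures and health flags can be read per physical drive through an HP Smart Array (cciss) controller with smartmontools; a hedged sketch is below, with the device node, port count and port numbering assumed rather than taken from jmlowe's setup.

    # Query one physical disk sitting behind the cciss controller (port 0 assumed)
    smartctl -H -A -d cciss,0 /dev/cciss/c0d0 | egrep -i 'health|temperature'

    # Walk all 24 assumed ports looking for hot or failing drives
    for port in $(seq 0 23); do
        echo "== physical drive on cciss port $port =="
        smartctl -H -A -d cciss,$port /dev/cciss/c0d0 | egrep -i 'health|temperature'
    done
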
[20:11] <nhm> yikes, that's in the HP units?
[20:11] <jmlowe> yep
[20:12] <jmlowe> I start getting read errors and ceph, how should I put this, loses control of its bowels in its own bed
[20:13] <nhm> jmlowe: yeah, lustre does the same when hardware starts misbehaving.
[20:13] <nhm> We'll need to make ceph smarter about it.
[20:13] <jmlowe> I started 12 vm's running iozone -a in while loops on monday and today I had 6/10 osd's up
[20:14] <jmlowe> everything was running just fine yesterday
[20:14] <nhm> hardware problems on all of them?
[20:14] <jmlowe> 2 machines 4 raid 0's and 6 raid 0's respectively
[20:15] <jmlowe> 10 osd's in total
[20:15] <nhm> which 4 osds went down?
[20:15] <jmlowe> all the problems seem to be confined to the oss with 6 osd's
[20:16] <nhm> ok. all 5 drives were on the second machine then?
[20:16] <nhm> (the ones that were in in predictive failure)
[20:16] * sagelap (~sage@205.158.58.247.ptr.us.xo.net) has joined #ceph
[20:18] <jmlowe> I should back up, 4 predictive failure drives were on machine A and were taken out of service but left in the chassis, this morning I have a new predictive failure drive in machine B which was running 6 osd's
[20:19] <nhm> ok, so the first set of predictive failures didn't take any OSDs out on machine A, but the one failure on machine B took out 4 of the 6 OSDs there?
[20:19] <jmlowe> bone chilling things like this in the logs: kernel: [604698.792979] end_request: I/O error, dev cciss/c0d0, sector 4205008
[20:19] <jmlowe> no idea of the timeline for machine A, took them out of service to do load testing monday
[20:19] <nhm> that's the kind of stuff that would have me up at 1am trying to get our scratch file systems back online after raid controller failures. ;(
[20:20] <jmlowe> but yes, seemingly one failed drive took down 4 osd's
[20:21] <nhm> ok, good to know. So when that one drive failed did the raid0 it was participating in fail?
[20:21] <jmlowe> all attached through the same sas/sata controller, so I'm thinking they are spewing garbage out on the sas backplane
[20:21] <nhm> yeah, that could be.
[20:21] <jmlowe> here is the kicker, the controller doesn't seem to think there is anything wrong with the array
[20:22] <jmlowe> it would be one thing if it would fail the drive and knock the array out so it wouldn't cause any more damage and ceph could go about its way bringing the replication back up until I rebuild the raid0/osd
[20:23] <nhm> jmlowe: so what about the two OSDs that didn't fail? Anything special/different about them?
[20:25] <jmlowe> not as far as I can tell
[20:26] <jmlowe> could just be luck of the draw with the pg's
[20:26] * sagelap (~sage@205.158.58.247.ptr.us.xo.net) Quit (Ping timeout: 480 seconds)
[20:26] <nhm> I suppose they probably were just lucky enough that they didn't try to do something while the funkyness was going on maybe.
[20:27] <nhm> so were there I/O errors on all of the devices behind the OSDs that failed?
[20:28] <jmlowe> not as far as I can tell, but I do have the osd associated with the predictive failure drive in an uninterruptible wait state
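
For reference, an OSD wedged like that shows up as a process in uninterruptible sleep (state D); a quick, generic way to spot it and see where in the kernel it is blocked (standard Linux tooling, nothing ceph-specific):

    # List D-state processes and the kernel function they are sleeping in
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

    # For a specific stuck PID (substitute the real one), dump its kernel stack
    cat /proc/<pid>/stack
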
[20:29] <nhm> If you want to pastie any logs or anything I'd be happy to take a look.
[20:29] <nhm> not that I'm probably going to be able to offer you much more than you already know. ;)
[20:34] <jmlowe> so I'm thinking about scrapping this external array and just populating the machine with 2.5in sata disks, it has bays I believe
[20:34] <jmlowe> make that 12 bays
[20:34] <nhm> jmlowe: It's always a crapshoot with old hardware.
[20:35] <jmlowe> it's not that old, I've got 4x HP ProLiant DL180 G6
[20:35] <jmlowe> staked out for oss's
[20:36] <nhm> oh, sorry. I think I had you confused with someone else I was talking to.
[20:38] * sagelap (~sage@205.158.58.247.ptr.us.xo.net) has joined #ceph
[20:46] * sagelap (~sage@205.158.58.247.ptr.us.xo.net) Quit (Ping timeout: 480 seconds)
[20:51] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[21:24] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[21:42] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) has joined #ceph
[22:09] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:11] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[22:56] <SpamapS> hm, in 0.41, does mkcephfs still actually work?
[22:57] <dmick> most of the people that usually hang here from Ceph are at FAST today
[22:57] <SpamapS> whats FAST?
[22:57] <dmick> conference
[22:58] <dmick> I don't see any open bugs on it
[22:58] <SpamapS> http://paste.ubuntu.com/843634/
[22:59] <jmlowe> last I tried it worked
[23:00] <SpamapS> ceph-authtool doesn't seem to be producing any .key files
[23:02] <SpamapS> seems like the keys are already in the keyring..
[23:03] <jmlowe> wasn't using auth
[23:06] <SpamapS> Yeah I'm guessing this bit is just broken now
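
For reference, SpamapS's observation above matches how ceph-authtool works: the keys live inside the keyring file itself rather than in separate .key files. A sketch of generating and inspecting them follows; the file paths are assumptions.

    # Create a keyring and generate a key for client.admin in one step
    ceph-authtool --create-keyring /etc/ceph/keyring.admin --gen-key -n client.admin

    # List every entity stored in a keyring, or print one key to stdout
    ceph-authtool -l /etc/ceph/keyring.admin
    ceph-authtool -p -n client.admin /etc/ceph/keyring.admin
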
[23:07] <dmick> sorry I can't help SpamapS; joshd is probably your man
[23:10] <SpamapS> Hm, is there a bug tracker?
[23:11] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[23:11] * jmlowe (~Adium@140-182-217-226.dhcp-bl.indiana.edu) has joined #ceph
[23:12] <dmick> http://ceph.newdream.net/, Bug Tracker on the right hand side
[23:13] <SpamapS> dmick: ty
[23:13] <dmick> yw
[23:14] * jmlowe (~Adium@140-182-217-226.dhcp-bl.indiana.edu) Quit (Read error: Operation timed out)
[23:21] * fronlius (~fronlius@e182095021.adsl.alicedsl.de) Quit (Quit: fronlius)
[23:29] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) Quit (Quit: Ex-Chat)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.