#ceph IRC Log


IRC Log for 2013-10-17

Timestamps are in GMT/BST.

[0:01] <albionandrew> The other node won't start - http://pastebin.com/TAVEiygf
[0:02] * rendar (~s@host249-181-dynamic.20-87-r.retail.telecomitalia.it) Quit ()
[0:03] <dmick> yes, you'll need to resolve that. looks like it's missing keys/keyrings
[0:03] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) Quit (Ping timeout: 480 seconds)
[0:04] <albionandrew> OK Thanks, any pointers where I might start?
[0:05] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) Quit (Ping timeout: 480 seconds)
[0:07] * lanfear2667 (~oftc-webi@rtp-isp-nat1.cisco.com) Quit (Quit: Page closed)
[0:12] * hughsaunders (~hughsaund@2001:4800:780e:510:fdaa:9d7a:ff04:4622) Quit (Ping timeout: 480 seconds)
[0:12] * hughsaunders (~hughsaund@2001:4800:780e:510:fdaa:9d7a:ff04:4622) has joined #ceph
[0:13] * Tamil1 (~Adium@cpe-142-136-96-212.socal.res.rr.com) Quit (Quit: Leaving.)
[0:13] * Tamil1 (~Adium@cpe-142-136-96-212.socal.res.rr.com) has joined #ceph
[0:13] * Cube (~Cube@12.248.40.138) Quit (Read error: Operation timed out)
[0:16] * sarob (~sarob@nat-dip4.cfw-a-gci.corp.yahoo.com) has joined #ceph
[0:18] * JoeGruher (~JoeGruher@jfdmzpr02-ext.jf.intel.com) Quit (Remote host closed the connection)
[0:23] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[0:24] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[0:28] * xarses (~andreww@64-79-127-122.static.wiline.com) has joined #ceph
[0:28] * dmsimard (~Adium@2607:f748:9:1666:312a:4191:cfe2:1de3) Quit (Ping timeout: 480 seconds)
[0:28] * sarob (~sarob@nat-dip4.cfw-a-gci.corp.yahoo.com) Quit (Ping timeout: 480 seconds)
[0:32] * albionandrew (~albionand@64.25.15.100) Quit (Quit: albionandrew)
[0:36] * ScOut3R (~scout3r@dsl51B61603.pool.t-online.hu) Quit (Remote host closed the connection)
[0:37] * mschiff (~mschiff@85.182.236.82) Quit (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[0:38] * yanzheng (~zhyan@101.82.164.110) Quit (Ping timeout: 480 seconds)
[0:39] * Tamil1 (~Adium@cpe-142-136-96-212.socal.res.rr.com) Quit (Quit: Leaving.)
[0:40] * nwat (~nwat@eduroam-232-148.ucsc.edu) Quit (Ping timeout: 480 seconds)
[0:51] * sarob (~sarob@nat-dip4.cfw-a-gci.corp.yahoo.com) has joined #ceph
[0:53] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) Quit (Quit: ...)
[0:53] * Tamil1 (~Adium@cpe-142-136-96-212.socal.res.rr.com) has joined #ceph
[0:55] * xarses (~andreww@64-79-127-122.static.wiline.com) Quit (Quit: Leaving.)
[0:55] * xarses (~andreww@64-79-127-122.static.wiline.com) has joined #ceph
[0:56] * sarob (~sarob@nat-dip4.cfw-a-gci.corp.yahoo.com) Quit (Read error: Operation timed out)
[1:05] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) Quit (Quit: Leaving.)
[1:07] * nwat (~nwat@c-24-5-146-110.hsd1.ca.comcast.net) has joined #ceph
[1:12] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[1:15] * Cube (~Cube@66-87-67-16.pools.spcsdns.net) has joined #ceph
[1:19] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[1:19] * ChanServ sets mode +o scuttlemonkey
[1:20] * mtanski (~mtanski@69.193.178.202) Quit (Ping timeout: 480 seconds)
[1:21] * sarob (~sarob@nat-dip28-wl-b.cfw-a-gci.corp.yahoo.com) has joined #ceph
[1:24] * sarob_ (~sarob@nat-dip28-wl-b.cfw-a-gci.corp.yahoo.com) has joined #ceph
[1:24] * sarob (~sarob@nat-dip28-wl-b.cfw-a-gci.corp.yahoo.com) Quit (Read error: Connection reset by peer)
[1:24] * diegows (~diegows@190.190.11.42) Quit (Ping timeout: 480 seconds)
[1:30] * sarob_ (~sarob@nat-dip28-wl-b.cfw-a-gci.corp.yahoo.com) Quit (Read error: Operation timed out)
[1:40] * vata (~vata@2607:fad8:4:6:402f:56f7:bb1a:8352) Quit (Quit: Leaving.)
[1:41] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[1:48] * diegows (~diegows@190.190.11.42) has joined #ceph
[1:50] * maurone (~root@95-210-233-66.ip.skylogicnet.com) has joined #ceph
[1:51] <maurone> ciao
[2:01] * LPG_ (~LPG@c-76-104-197-224.hsd1.wa.comcast.net) Quit (Remote host closed the connection)
[2:02] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Read error: Connection reset by peer)
[2:05] * ircolle (~Adium@2601:1:8380:2d9:a042:caca:d54f:47ec) Quit (Quit: Leaving.)
[2:05] * dmsimard (~Adium@69-165-206-93.cable.teksavvy.com) has joined #ceph
[2:05] * dmsimard (~Adium@69-165-206-93.cable.teksavvy.com) Quit ()
[2:07] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[2:11] * danieagle (~Daniel@179.176.61.26.dynamic.adsl.gvt.net.br) Quit (Quit: inte+ e Obrigado Por tudo mesmo! :-D)
[2:11] * sputnik13 (~sputnik13@64-73-250-90.static-ip.telepacific.net) Quit (Quit: My MacBook has gone to sleep. ZZZzzz…)
[2:18] * LeaChim (~LeaChim@host86-174-76-26.range86-174.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[2:19] * mattbenjamin (~matt@aa2.linuxbox.com) Quit (Quit: Leaving.)
[2:19] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[2:19] * diegows (~diegows@190.190.11.42) Quit (Remote host closed the connection)
[2:26] * yehudasa (~yehudasa@2607:f298:a:607:ea03:9aff:fe98:e8ff) Quit (Ping timeout: 480 seconds)
[2:28] * sagelap1 (~sage@2600:1012:b019:fe41:6997:5864:5572:2f95) has joined #ceph
[2:28] * Guest2566 (~a@209.12.169.218) Quit (Quit: This computer has gone to sleep)
[2:30] * bandrus (~Adium@c-98-238-148-252.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:30] * sagelap (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) Quit (Ping timeout: 480 seconds)
[2:30] * bandrus (~Adium@c-98-238-148-252.hsd1.ca.comcast.net) has joined #ceph
[2:35] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) Quit (Ping timeout: 480 seconds)
[2:36] * xarses1 (~andreww@64-79-127-122.static.wiline.com) has joined #ceph
[2:37] * xarses (~andreww@64-79-127-122.static.wiline.com) Quit (Ping timeout: 480 seconds)
[2:43] * maurone (~root@95-210-233-66.ip.skylogicnet.com) Quit (Quit: WeeChat 0.4.0)
[2:45] * nwat (~nwat@c-24-5-146-110.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[2:46] * xarses1 (~andreww@64-79-127-122.static.wiline.com) Quit (Read error: No route to host)
[2:46] * xarses (~andreww@64-79-127-122.static.wiline.com) has joined #ceph
[2:46] * xarses (~andreww@64-79-127-122.static.wiline.com) Quit ()
[2:47] * xarses (~andreww@64-79-127-122.static.wiline.com) has joined #ceph
[2:51] * gregaf (~Adium@2607:f298:a:607:7886:7b21:6806:aaa5) Quit (Quit: Leaving.)
[2:54] * gregaf (~Adium@2607:f298:a:607:dcd:f9b1:80f4:84cc) has joined #ceph
[2:58] * xarses (~andreww@64-79-127-122.static.wiline.com) Quit (Ping timeout: 480 seconds)
[2:59] * Tamil1 (~Adium@cpe-142-136-96-212.socal.res.rr.com) has left #ceph
[3:03] * yanzheng (~zhyan@101.83.146.215) has joined #ceph
[3:07] * mattbenjamin (~matt@76-206-42-105.lightspeed.livnmi.sbcglobal.net) has joined #ceph
[3:11] * yy-nm (~Thunderbi@122.224.154.38) has joined #ceph
[3:13] * nerdtron (~Administr@202.60.8.250) has joined #ceph
[3:14] <nerdtron> hi all, I use a separate OS drive for my ceph cluster, what happens when the OS drive fails? how do i recover?
[3:15] <sagelap1> nerdtron: swap the drives into another machine, or swap in a (working) os drive
[3:15] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) has joined #ceph
[3:16] <nerdtron> sagelap1, really that would work?
[3:16] <sagelap1> yeah
[3:16] <sagelap1> if it's a fresh os drive, you'll need to do a few things with ceph-deploy to get ceph installed and get a valid ceph.conf and keys in place
[3:16] <nerdtron> all three nodes were using a flash drive as the OS drive, however they all failed only 1 week apart.. now all my data on the ceph osds is lost
[3:17] <sagelap1> but then udev will see the osd data disks and start up the daemons
[3:17] <sagelap1> as long as you didn't lose the osd journals or osd drives the data is intact
[3:18] <nerdtron> yup, that is why i still have hope.. but i don't know where to start
[3:18] * mozg (~andrei@host86-184-120-113.range86-184.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[3:18] <nerdtron> can i just get a fresh drive, install the os and ceph, and then ceph will just do its thing?
[3:19] <sagelap1> first just get your machine to boot :)
[3:19] <nerdtron> sagelap1, then?
[3:20] <sagelap1> then ceph-deploy install HOST, ceph-deploy config push HOST, and then give it a kick to start up the osds (ceph-disk activate-all or just reboot the node)
[3:20] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[3:20] * sagelap1 (~sage@2600:1012:b019:fe41:6997:5864:5572:2f95) Quit (Quit: Leaving.)
[3:21] <nerdtron> hmmm..i hope this will work, but will ceph really know which osd is which? i mean all three nodes are dead, i can still find three hard drives for the os though so i can try it
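A minimal sketch of the recovery sequence sagelap1 describes above, run from the admin (ceph-deploy) node once the rebuilt machine boots; the hostname newhost is a placeholder, and it assumes the OSD data and journal disks survived:

    # reinstall the ceph packages on the rebuilt node
    ceph-deploy install newhost
    # push the cluster's ceph.conf (ceph-deploy gatherkeys can help if keyrings are missing)
    ceph-deploy config push newhost
    # start the OSDs whose data disks are intact, or simply reboot the node
    ssh newhost sudo ceph-disk activate-all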
[3:24] * yy-nm (~Thunderbi@122.224.154.38) Quit (Quit: yy-nm)
[3:33] * wenjianhn (~wenjianhn@114.245.46.123) has joined #ceph
[3:38] * yy-nm (~Thunderbi@122.224.154.38) has joined #ceph
[3:49] * yanzheng (~zhyan@101.83.146.215) Quit (Ping timeout: 480 seconds)
[3:51] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[3:51] <aarontc> hey guys - if I add 'rbd cache = true' to the [client] section of /etc/ceph/ceph.conf on a host running qemu VMs through libvirtd, will that enable caching for any VMs using rbd that don't specify caching explicitly in the libvirt domain XML?
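For reference, a minimal sketch of the stanza aarontc is describing; note that with newer qemu (1.2+, per the tracker issue cited later in this log) the cache= attribute in the libvirt disk XML also influences the effective rbd caching, so this setting alone may not be the whole story:

    [client]
        rbd cache = true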
[4:01] * yanzheng (~zhyan@101.82.7.117) has joined #ceph
[4:04] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[4:08] * angdraug (~angdraug@64-79-127-122.static.wiline.com) Quit (Quit: Leaving)
[4:10] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[4:14] * JoeGruher (~JoeGruher@134.134.139.72) has joined #ceph
[4:22] * JoeGruher (~JoeGruher@134.134.139.72) Quit ()
[4:23] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[4:23] * yanzheng (~zhyan@101.82.7.117) Quit (Ping timeout: 480 seconds)
[4:24] * themgt (~themgt@201-223-204-131.baf.movistar.cl) Quit (Ping timeout: 480 seconds)
[4:26] * themgt (~themgt@201-223-204-131.baf.movistar.cl) has joined #ceph
[4:32] * john_barbee_ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) has joined #ceph
[4:39] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) Quit (Quit: mtanski)
[4:40] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[4:41] <terje-> hi all
[4:41] * Administrator_ (~Administr@202.60.8.250) has joined #ceph
[4:41] <terje-> I have a cluster where one of my osd's has been marked DNE
[4:43] <terje-> it's osd.0, and here's the log entries from that osd: http://pastie.org/8408097
[4:43] <terje-> not sure how to troubleshoot this
[4:45] * john_barbee_ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90.1 [Firefox 24.0/20130910160258])
[4:45] <Administrator_> terje-, there is no quorum?? how many monitors?
[4:46] * nerdtron (~Administr@202.60.8.250) Quit (Ping timeout: 480 seconds)
[4:55] * xarses (~andreww@c-71-202-167-197.hsd1.ca.comcast.net) has joined #ceph
[4:55] <dmick> terje-: I assume restarting the OSD doesn't help?
[4:58] * aliguori (~anthony@74.202.210.82) Quit (Remote host closed the connection)
[4:59] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) has joined #ceph
[5:01] <terje-> no, it didn't help
[5:01] * fireD (~fireD@93-139-179-118.adsl.net.t-com.hr) Quit (Remote host closed the connection)
[5:01] <terje-> I'm actually just going to rebuild this cluster. It was hanging on initial peering.
[5:01] <terje-> 3 mon's but I had quorum.
[5:05] * fireD (~fireD@93-139-172-198.adsl.net.t-com.hr) has joined #ceph
[5:06] * adam4 (~adam@46-65-111-12.zone16.bethere.co.uk) has joined #ceph
[5:07] * Administrator_ is now known as nerdtron
[5:07] * Cube (~Cube@66-87-67-16.pools.spcsdns.net) Quit (Quit: Leaving.)
[5:08] * adam3 (~adam@46-65-111-12.zone16.bethere.co.uk) Quit (Read error: Operation timed out)
[5:14] * themgt (~themgt@201-223-204-131.baf.movistar.cl) Quit (Ping timeout: 480 seconds)
[5:17] * jantje (~jan@paranoid.nl) Quit (Read error: Connection reset by peer)
[5:17] * jantje (~jan@paranoid.nl) has joined #ceph
[5:17] * Cube (~Cube@66-87-67-16.pools.spcsdns.net) has joined #ceph
[5:18] * themgt (~themgt@201-223-204-131.baf.movistar.cl) has joined #ceph
[5:20] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) has joined #ceph
[5:21] * themgt_ (~themgt@201-223-249-218.baf.movistar.cl) has joined #ceph
[5:26] * themgt (~themgt@201-223-204-131.baf.movistar.cl) Quit (Ping timeout: 480 seconds)
[5:26] * themgt_ is now known as themgt
[5:34] * yanzheng (~zhyan@101.84.204.29) has joined #ceph
[5:37] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Quit: Leaving.)
[5:39] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[5:42] * yanzheng (~zhyan@101.84.204.29) Quit (Ping timeout: 480 seconds)
[5:48] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) Quit (Quit: mtanski)
[5:51] * The_Bishop (~bishop@2001:470:50b6:0:6915:da3e:ce9d:d5df) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[5:53] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[5:57] * yanzheng (~zhyan@134.134.137.75) has joined #ceph
[5:57] * Cube (~Cube@66-87-67-16.pools.spcsdns.net) Quit (Read error: Connection reset by peer)
[6:00] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) has joined #ceph
[6:01] * themgt_ (~themgt@201-223-255-38.baf.movistar.cl) has joined #ceph
[6:06] * themgt (~themgt@201-223-249-218.baf.movistar.cl) Quit (Ping timeout: 480 seconds)
[6:06] * themgt_ is now known as themgt
[6:10] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[6:11] * sputnik13 (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) has joined #ceph
[6:11] * yanzheng (~zhyan@134.134.137.75) Quit (Remote host closed the connection)
[6:16] * yy-nm (~Thunderbi@122.224.154.38) Quit (Quit: yy-nm)
[6:27] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[6:42] * a (~a@pool-173-55-143-200.lsanca.fios.verizon.net) has joined #ceph
[6:42] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[6:42] * a is now known as Guest2619
[6:44] * The_Bishop (~bishop@f055147133.adsl.alicedsl.de) has joined #ceph
[6:50] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[6:51] * bandrus (~Adium@c-98-238-148-252.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[6:51] * sputnik13 (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) Quit (Quit: My MacBook has gone to sleep. ZZZzzz…)
[6:51] * bandrus (~Adium@c-98-238-148-252.hsd1.ca.comcast.net) has joined #ceph
[6:51] * sputnik13 (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) has joined #ceph
[6:56] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[7:05] * Guest2619 (~a@pool-173-55-143-200.lsanca.fios.verizon.net) Quit (Quit: This computer has gone to sleep)
[7:23] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) Quit (Remote host closed the connection)
[7:35] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:37] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Read error: Operation timed out)
[7:42] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[7:42] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[7:53] * themgt (~themgt@201-223-255-38.baf.movistar.cl) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[7:54] * BillK (~BillK-OFT@58-7-67-236.dyn.iinet.net.au) Quit (Read error: Connection reset by peer)
[7:55] * glzhao (~glzhao@118.195.65.67) has joined #ceph
[7:56] * BillK (~BillK-OFT@58-7-67-236.dyn.iinet.net.au) has joined #ceph
[7:59] * papamoose1 (~kauffman@hester.cs.uchicago.edu) Quit (Quit: Leaving.)
[8:07] * blindcat (~blindcat@c-24-6-67-74.hsd1.ca.comcast.net) has joined #ceph
[8:08] <blindcat> Hi, guys, I have a quick question. I understand that each OSD needs an osd daemon; if I set up 3 monitors for my cluster, how many monitor daemons am I looking at? The official document says I need 1GB for one monitor daemon, but I can't find the answer on google, thanks!
[8:11] <blindcat> The original paragraph is "RAM: Metadata servers and monitors must be capable of serving their data quickly, so they should have plenty of RAM (e.g., 1GB of RAM per daemon instance)."
[8:12] * MACscr1 (~Adium@c-98-214-103-147.hsd1.il.comcast.net) Quit (Read error: Connection reset by peer)
[8:13] <blindcat> so if I have 3 monitors, and 4 nodes with 24 OSDs, how many monitor daemons would it be? 3 or 24?
[8:20] <yanzheng> blindcat, 3
[8:21] <blindcat> Thank you for your kind help!
[8:22] * erice (~erice@c-98-245-48-79.hsd1.co.comcast.net) Quit (Read error: Operation timed out)
[8:27] * tziOm (~bjornar@194.19.106.242) Quit (Remote host closed the connection)
[8:29] * glzhao_ (~glzhao@118.195.65.67) has joined #ceph
[8:32] * gucki (~smuxi@p549F96D5.dip0.t-ipconnect.de) has joined #ceph
[8:32] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) Quit (Remote host closed the connection)
[8:33] * rendar (~s@host200-180-dynamic.1-87-r.retail.telecomitalia.it) has joined #ceph
[8:36] * glzhao (~glzhao@118.195.65.67) Quit (Ping timeout: 480 seconds)
[8:37] * Vjarjadian (~IceChat77@94.1.37.151) Quit (Quit: Don't push the red button!)
[8:44] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) Quit (Quit: Leaving.)
[8:45] * glzhao_ (~glzhao@118.195.65.67) Quit (Ping timeout: 480 seconds)
[8:51] * glzhao (~glzhao@118.195.65.67) has joined #ceph
[8:58] * mattt_ (~textual@94.236.7.190) has joined #ceph
[9:03] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) has joined #ceph
[9:03] * mattt__ (~textual@92.52.76.140) has joined #ceph
[9:06] * mattt_ (~textual@94.236.7.190) Quit (Ping timeout: 480 seconds)
[9:06] * sputnik13 (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) Quit (Quit: My MacBook has gone to sleep. ZZZzzz…)
[9:11] * blindcat (~blindcat@c-24-6-67-74.hsd1.ca.comcast.net) Quit (Quit: Nettalk6 - www.ntalk.de)
[9:14] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[9:17] * MACscr (~Adium@c-98-214-103-147.hsd1.il.comcast.net) has joined #ceph
[9:21] * Cube (~Cube@66-87-67-213.pools.spcsdns.net) has joined #ceph
[9:23] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[9:25] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:26] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[9:31] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) has joined #ceph
[9:31] * ChanServ sets mode +v andreask
[9:37] * andreask1 (~andreask@h081217067008.dyn.cm.kabsi.at) has joined #ceph
[9:37] * ChanServ sets mode +v andreask1
[9:37] * andreask is now known as Guest2634
[9:37] * andreask1 is now known as andreask
[9:37] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) has left #ceph
[9:38] * Shmouel (~Sam@fny94-12-83-157-27-95.fbx.proxad.net) has joined #ceph
[9:38] * Shmouel (~Sam@fny94-12-83-157-27-95.fbx.proxad.net) Quit ()
[9:41] * Guest2634 (~andreask@h081217067008.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[9:57] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) has joined #ceph
[10:10] * capri (~capri@212.218.127.222) has joined #ceph
[10:21] * LeaChim (~LeaChim@host86-174-76-26.range86-174.btcentralplus.com) has joined #ceph
[10:23] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[10:26] * Cube (~Cube@66-87-67-213.pools.spcsdns.net) Quit (Quit: Leaving.)
[10:29] * glzhao (~glzhao@118.195.65.67) Quit (Quit: leaving)
[10:30] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) has joined #ceph
[10:30] * ChanServ sets mode +v andreask
[10:43] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[10:43] * ninkotech (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[10:50] * odyssey4me (~odyssey4m@165.233.71.2) has joined #ceph
[10:57] * capri (~capri@212.218.127.222) Quit (Ping timeout: 480 seconds)
[10:58] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) has joined #ceph
[11:01] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) Quit (Remote host closed the connection)
[11:02] * cofol1986 (~xwrj@120.35.11.138) has joined #ceph
[11:14] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[11:16] <AndreyGrebennikov> hi there all. Something happened with my cluster, now when I try to execute "rados lspools" all works fine, but when I do "rados ls -p <any pool>" it just hangs
[11:16] <AndreyGrebennikov> what could it be?
[11:17] <AndreyGrebennikov> ps: I found there is single pool which I could list - volumes pool
[11:17] <AndreyGrebennikov> yesterday everything was fine
[11:21] <AndreyGrebennikov> moreover, I'm able to "mkpool", it is shown in the list, but when I try to list it, rados hangs too
[11:33] <mattt> what does 'ceph health detail' say ?
[11:34] <AndreyGrebennikov> mattt, something very strange happening by the way
[11:34] <AndreyGrebennikov> # ceph -s
[11:34] <AndreyGrebennikov> cluster aae22644-178d-4d38-bb04-32b28c4b1709
[11:34] <AndreyGrebennikov> health HEALTH_WARN 25 pgs peering; 1 pgs stuck inactive; 25 pgs stuck unclean; 2 mons down, quorum 0,1,2,3 ceph-01,ceph-02,ceph-03,ceph-07
[11:34] <AndreyGrebennikov> monmap e4: 6 mons at {ceph-01=10.20.0.132:6789/0,ceph-02=10.20.0.133:6789/0,ceph-03=10.20.0.134:6789/0,ceph-04=10.20.0.135:6789/0,ceph-05=10.20.0.136:6789/0,ceph-07=10.20.0.131:6789/0}, election epoch 1078, quorum 0,1,2,3 ceph-01,ceph-02,ceph-03,ceph-07
[11:34] <AndreyGrebennikov> osdmap e8969: 6 osds: 6 up, 6 in
[11:34] <AndreyGrebennikov> pgmap v124765: 328 pgs: 303 active+clean, 2 peering, 23 remapped+peering; 34423 MB data, 108 GB used, 179 TB / 180 TB avail
[11:34] <AndreyGrebennikov> mdsmap e1: 0/0/1 up
[11:35] <AndreyGrebennikov> mattt, but
[11:35] <AndreyGrebennikov> 2013-10-17 09:34:52.648723 mon.0 10.20.0.131:6789/0 1082 : [INF] osd.6 10.20.0.136:6800/19041 failed (30 reports from 5 peers after 25.001271 >= grace 23.783753)
[11:35] <AndreyGrebennikov> 2013-10-17 09:34:52.649013 mon.0 10.20.0.131:6789/0 1083 : [DBG] osd.6 10.20.0.136:6800/19041 reported failed by osd.4 10.20.0.131:6804/19116
[11:35] <AndreyGrebennikov> 2013-10-17 09:34:52.649228 mon.0 10.20.0.131:6789/0 1084 : [DBG] osd.6 10.20.0.136:6800/19041 reported failed by osd.1 10.20.0.133:6800/15117
[11:35] <AndreyGrebennikov> 2013-10-17 09:34:52.670402 mon.0 10.20.0.131:6789/0 1085 : [DBG] osd.6 10.20.0.136:6800/19041 reported failed by osd.3 10.20.0.134:6800/26959
[11:35] <AndreyGrebennikov> mattt, they all report one another as failed
[11:39] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) Quit (Remote host closed the connection)
[11:40] <AndreyGrebennikov> ok, it was the iptables issue
[11:47] <AndreyGrebennikov> well, dear members, could anyone tell me which ports should be opened so that osd nodes work fine?
[11:49] <mattt> AndreyGrebennikov: good question, an OSD on one of my hosts binds to: 6800, 6801, 6802, 6803
[11:49] <mattt> not sure what each port does
[11:50] <andreask> http://ceph.com/docs/master/rados/configuration/network-config-ref/#ceph-daemons
[11:52] <AndreyGrebennikov> mattt, great. Now I switched iptables off on all nodes. Netstat shows me a lot of open connections from 6800,6801,6802 to various dynamic ports on all the others
[12:02] * allsystemsarego (~allsystem@5-12-37-46.residential.rdsnet.ro) has joined #ceph
[12:03] * mzohrehie (~mzohrehie@217.25.5.178) has joined #ceph
[12:04] <mattt> andreask: neat, thanks!
[12:04] <mattt> AndreyGrebennikov: good! my first thought was a potential network issue, glad you have it sorted
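Condensing the network-config doc linked above into firewall terms (a sketch, not a substitute for the doc): monitors listen on 6789/tcp and each OSD/MDS daemon binds a handful of ports starting at 6800, so the usual approach is to open that whole range between cluster hosts. The 10.20.0.0/24 subnet below is inferred from the ceph -s output earlier, and the exact upper bound of the range varies between doc versions:

    # allow ceph traffic from the other cluster nodes
    iptables -A INPUT -s 10.20.0.0/24 -p tcp --dport 6789 -j ACCEPT        # monitors
    iptables -A INPUT -s 10.20.0.0/24 -p tcp --dport 6800:7100 -j ACCEPT   # osd/mds daemons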
[12:06] * mzohrehie (~mzohrehie@217.25.5.178) Quit ()
[12:07] * max-100 (~max-100@217.25.5.178) has joined #ceph
[12:11] <max-100> Hi, I'm having some trouble bringing my OSDs up after creating them with ceph-deploy. When i use the command ceph-deploy osd create machine:sdb:/dev/sdc1 there are no errors logged to the screen but when i check ceph osd tree the osd is listed as down. When i check the logs on the OSD server it shows mkjournal error creating journal on /var/lib/ceph/tmp/mnt.MKI6hL/journal and ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.MKI6hL. I understand this is common when not properly zapping the partitions on the journal, but this was a fresh install so there were no prior ceph installations. Any ideas on how to troubleshoot this? Thanks in advance.
[12:11] * jbd_ (~jbd_@2001:41d0:52:a00::77) has joined #ceph
[12:20] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) Quit (Quit: Leaving.)
[12:25] * JustEra (~JustEra@89.234.148.11) has joined #ceph
[12:26] <JustEra> Hello, I'm getting some blocked requests, how can I debug and find out why?
[12:26] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[12:31] * ScOut3R (~ScOut3R@catv-89-133-21-203.catv.broadband.hu) has joined #ceph
[12:35] * AfC (~andrew@2001:44b8:31cb:d400:6e88:14ff:fe33:2a9c) has joined #ceph
[12:48] * i_m (~ivan.miro@deibp9eh1--blueice1n1.emea.ibm.com) has joined #ceph
[12:50] * fedepalla (~fedepalla@201-213-4-67.net.prima.net.ar) Quit (Read error: Connection reset by peer)
[12:58] * rudolfsteiner (~federicon@181.167.96.123) has joined #ceph
[13:00] * rudolfsteiner (~federicon@181.167.96.123) Quit ()
[13:05] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[13:05] * rudolfsteiner (~federicon@181.167.96.123) has joined #ceph
[13:05] * rudolfsteiner (~federicon@181.167.96.123) Quit ()
[13:14] * carlos (~carlos@lxbifi81.bifi.unizar.es) has joined #ceph
[13:14] * carlos is now known as cgimeno
[13:26] * BillK (~BillK-OFT@58-7-67-236.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[13:28] * glzhao (~glzhao@118.195.65.67) has joined #ceph
[13:33] <niklas> Hi. How come, after I replaced 1 out of 60 OSDs, my cluster has 5000 out of 12000 PGs in recovery_wait?
[13:33] <niklas> also, why does the new osd (same size as the others) have a weight of -3.052e-05?
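The weight question goes unanswered here; if that -3.052e-05 is the CRUSH weight column of ceph osd tree, it can be set by hand. A sketch under that assumption, with osd.12 and the 2.73 value (roughly matching a 3 TB peer) as placeholders:

    ceph osd tree                         # compare the new osd's CRUSH weight with its peers
    ceph osd crush reweight osd.12 2.73   # give it the same weight as the other same-size OSDs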
[13:38] * glzhao (~glzhao@118.195.65.67) Quit (Ping timeout: 480 seconds)
[13:44] <cgimeno> hi, anyone know how i can force ceph to remount an osd? i want to change the mount options
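This also goes unanswered in the log. One approach (an assumption, not something confirmed here) is to remount live and, for ceph-disk managed OSDs, persist the options in ceph.conf so future activations pick them up; the osd id and the option values are placeholders:

    mount -o remount,noatime,inode64 /var/lib/ceph/osd/ceph-3
    # persistent form, followed by a restart of the osd daemon:
    #   [osd]
    #       osd mount options xfs = rw,noatime,inode64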
[13:47] <brother> What does 'rados df' report for pools with snapshots? The current usage stats or the usage including snapshots?
[13:48] * glzhao (~glzhao@118.195.65.67) has joined #ceph
[13:50] * tziOm (~bjornar@194.19.106.242) has joined #ceph
[13:51] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[13:52] <max-100> when spinning up a new cluster i was able to get the OSDs up by not zapping them first. When i try to delete OSDs and then re-add them (zapping them first) i get OSD down
[13:54] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) has joined #ceph
[13:54] * ChanServ sets mode +v andreask
[13:56] * AfC (~andrew@2001:44b8:31cb:d400:6e88:14ff:fe33:2a9c) Quit (Quit: Leaving.)
[13:56] * cgimeno (~carlos@lxbifi81.bifi.unizar.es) Quit (Quit: Konversation terminated!)
[13:59] * phoenix (~phoenix@vpn1.safedata.ru) has joined #ceph
[13:59] <phoenix> hello all
[13:59] <phoenix> i have a little problem
[13:59] <phoenix> can anybody help me?
[14:00] * BillK (~BillK-OFT@58-7-67-236.dyn.iinet.net.au) has joined #ceph
[14:01] <phoenix> after adding the new data nodes and the cluster rebuild that followed, ceph has not recovered
[14:01] <phoenix> health HEALTH_WARN 314 pgs down; 314 pgs peering; 314 pgs stuck inactive; 314 pgs stuck unclean; 5 requests are blocked > 32 sec; mds cluster is degraded
[14:06] <paravoid> loicd: fwiw, I reimplemented ceph/puppet, as the enovance one wasn't good enough for my uses
[14:06] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:07] <paravoid> loicd: mine's public too but a) it doesn't do mds, b) it's not documented well c) it assumes all nodes get the admin key, which can be unacceptable from a security PoV (but the other module has the admin key in the puppetdb, so it's even worse in this aspect)
[14:07] <kraken> http://i.imgur.com/BwdP2xl.gif
[14:08] <loicd> paravoid: cool, could you add a reference to your work in https://wiki.openstack.org/wiki/Puppet-openstack/ceph-blueprint#Related_tools_and_implementations ? It will be helpful as there will be a lot of cherry-picking from existing implementations to bootstrap the work.
[14:13] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:16] * glzhao_ (~glzhao@118.195.65.67) has joined #ceph
[14:17] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[14:18] * glzhao (~glzhao@118.195.65.67) Quit (Ping timeout: 480 seconds)
[14:19] * yanzheng (~zhyan@jfdmzpr04-ext.jf.intel.com) has joined #ceph
[14:21] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:24] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) has joined #ceph
[14:30] * rudolfsteiner (~federicon@181.29.144.4) has joined #ceph
[14:34] * sputnik13 (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) has joined #ceph
[14:37] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[14:38] * sputnik13 (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) Quit ()
[14:42] * t0rn (~ssullivan@2607:fad0:32:a02:d227:88ff:fe02:9896) has joined #ceph
[14:42] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) has joined #ceph
[14:42] * nerdtron (~Administr@202.60.8.250) Quit (Quit: Leaving)
[14:46] * andreask1 (~andreask@h081217067008.dyn.cm.kabsi.at) has joined #ceph
[14:46] * ChanServ sets mode +v andreask1
[14:46] * andreask is now known as Guest2656
[14:46] * andreask1 is now known as andreask
[14:49] * sleinen (~Adium@2001:620:0:46:d44a:2517:df4b:ebc9) has joined #ceph
[14:50] <JustEra> I've got 1 active+clean+scrubbing+deep but the data isn't available, how can I repair that?
[14:52] * Guest2656 (~andreask@h081217067008.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[15:00] <Gugge-47527> JustEra: active+clean suggest that it is available
[15:00] <JustEra> Gugge-47527, some parts yes, some parts no, just a hang on the filesystem :/
[15:01] <Gugge-47527> stop the bad osd then
[15:01] <Gugge-47527> and let the system heal
[15:03] * odyssey4me2 (~odyssey4m@41-132-104-169.dsl.mweb.co.za) has joined #ceph
[15:07] * gsaxena (~gsaxena@pool-71-178-225-182.washdc.fios.verizon.net) has joined #ceph
[15:08] * odyssey4me (~odyssey4m@165.233.71.2) Quit (Ping timeout: 480 seconds)
[15:20] <JustEra> Gugge-47527, doesn't work, it's still in active+clean+scrubbing+deep :(
[15:27] * mikedawson (~chatzilla@23-25-46-107-static.hfc.comcastbusiness.net) has joined #ceph
[15:32] <Gugge-47527> JustEra: that is not an error, you know that right?
[15:33] <Gugge-47527> its just telling you that a deep scrub is running
[15:34] * toabctl (~toabctl@toabctl.de) has joined #ceph
[15:35] * papamoose1 (~kauffman@hester.cs.uchicago.edu) has joined #ceph
[15:35] * rudolfsteiner (~federicon@181.29.144.4) Quit (Quit: rudolfsteiner)
[15:36] * erice (~erice@c-98-245-48-79.hsd1.co.comcast.net) has joined #ceph
[15:39] <JustEra> Gugge-47527, yep but it seems that pg is in fact "recovering", because some data is not available ..
[15:43] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) Quit (Quit: mtanski)
[15:44] * dmsimard (~Adium@2607:f748:9:1666:890d:6d98:9fc5:a429) has joined #ceph
[15:46] <gsaxena> I have a bunch of Dell servers with about 6 to 10 3TB disks each. To avoid wasting these disks, I was thinking of doing a software raid 10 on the disks for say 100 GB and putting the OS on it, and leaving the rest of each disk for Ceph. I haven't seen anyone do it this way (based on my googling), but I don't see any harm in it? (My servers don't have dedicated small disks for just the OS, and my thoughts are that we s
[15:46] * erice (~erice@c-98-245-48-79.hsd1.co.comcast.net) Quit (Read error: Operation timed out)
[15:46] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[15:49] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[15:50] * mattbenjamin (~matt@76-206-42-105.lightspeed.livnmi.sbcglobal.net) Quit (Quit: Leaving.)
[15:51] * RuediR (~Adium@2001:620:0:25:8414:19ff:fe71:adf8) has joined #ceph
[15:52] * sjm (~sjm@38.98.115.249) has joined #ceph
[15:56] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) has joined #ceph
[16:01] <peetaur> gsaxena: raid would add overhead; you can simply make one OSD per disk and get better performance (with separate journals on some disks too)
[16:01] <peetaur> but if you're short on memory, then I think people would do some RAID since there's some requirement per OSD
[16:04] <gsaxena> but shouldn't the OS (not OSD, I mean the centos linux operating system) be raided? Otherwise, if I put the OS (not OSD) on just one disk and that disk fails, the whole machine goes down.
[16:05] <absynth> yup, although i'm not sure if software raid is still the way to go these days
[16:05] <absynth> 2 bay raid controllers are about a dime a dozen
[16:05] <gsaxena> But if I spread the operating system across all hard disks (using software-based RAID 1), I can at least protect the OS. And I'm assuming the OS doesn't get written much, so hopefully there's not much CPU time being wasted on handling the RAID
[16:06] <gsaxena> can I just add a 2 bay RAID controller to a typical Dell machine and have the OS installed on it? I didn't know I could do that... Would that perform well and be stable? Is that what people do?
[16:08] * rudolfsteiner (~federicon@200.68.116.185) Quit (Quit: rudolfsteiner)
[16:09] * marrusl (~mark@209-150-43-182.c3-0.wsd-ubr2.qens-wsd.ny.cable.rcn.com) Quit (Quit: sync && halt)
[16:09] * yanzheng (~zhyan@jfdmzpr04-ext.jf.intel.com) Quit (Remote host closed the connection)
[16:10] * marrusl (~mark@209-150-43-182.c3-0.wsd-ubr2.qens-wsd.ny.cable.rcn.com) has joined #ceph
[16:10] <dmsimard> leseb: ping
[16:11] <peetaur> gsaxena: and yeah I'd raid the OS
[16:12] <peetaur> swap too if any
[16:12] * mattbenjamin (~matt@aa2.linuxbox.com) has joined #ceph
[16:14] <peetaur> gsaxena: and as for the os disk being shared by ceph, this affects performance, but who can say if it's relevant... is this the bottleneck, or is it the journal, or the network? And you can horizontally scale it out anyway
[16:14] <peetaur> I was thinking of doing this on 4 disk machines ... and not sure if it's the best on a 6-10 disk machine
[16:15] <peetaur> an SSD journal shared by the OS root would seem smart, but don't know
[16:19] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[16:21] <leseb> dmsimard: pong
[16:26] * Qu310 (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Read error: Connection reset by peer)
[16:26] * Qu310 (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[16:32] <dmsimard> leseb: have you come across https://wiki.openstack.org/wiki/Puppet-openstack/ceph-blueprint and https://groups.google.com/a/puppetlabs.com/forum/?fromgroups=#!topic/puppet-openstack/tPknixlc5ds
[16:32] <dmsimard> ?
[16:33] * tziOm (~bjornar@194.19.106.242) Quit (Remote host closed the connection)
[16:33] <pmatulis_> is anyone running native ceph rbd images as install disks for kvm guests? i'm seeing a lot of instability in terms of the guest not booting properly 30% of the time. it gets stuck at different places too
[16:34] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) has joined #ceph
[16:34] <saaby> hi
[16:34] <saaby> any of you tried 0.61.9 yet?
[16:34] <mattch> pmatulis_: I'm using rbd disks with kvm guests (with libvirt & opennebula) without any issues that I can see...
[16:35] <leseb> dmsimard: yes, why? I still haven't read the last messages though
[16:35] <pmatulis_> mattch: but are they native? or did you go through the kernel (rbd map...)
[16:35] <mattch> pmatulis_: Using the librbd support in qemu, not rbd map (no kernel support for that on these vm hosts)
[16:36] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Quit: sprachgenerator)
[16:36] <saaby> we just tried running 0.61.9 on one of our data nodes, and that apparently stalled most I/O to the entire cluster..
[16:36] <saaby> anyone else seen that?
[16:36] <dmsimard> leseb: Didn't see you weigh in on the topic is why :) EmilienM did though
[16:36] <dmsimard> leseb: They're going to be looking for core reviewers for gerrit, thought about you right away
[16:36] <mattch> saaby: I think 0.56 -> 0.61 has some schema/protocol changes, and you have to upgrade everything for it to work, maybe?
[16:36] <pmatulis_> mattch: so you have pool/image stuff in your xml files? as well as specifying the monitors and so on?
[16:36] <leseb> dmsimard: yes but I'll just in soon :)
[16:37] <saaby> mattch: nah, everyone else runs 0.61.8 in the cluster
[16:37] <dmsimard> leseb: quoi? :D
[16:37] <leseb> dmsimard: s/just/jump/
[16:37] <mattch> pmatulis_: Yep - xml had rbd disk 'type', monitors specified either in xml, or in /etc/ceph/ceph.conf
[16:38] <mattch> s/had/has
[16:38] <dmsimard> leseb: Ah, great. Meet us in #puppet-openstack if anything :)
[16:38] <mattch> saaby: OK - just a thought :)
[16:38] <leseb> dmsimard: I'll join :)
[16:38] <saaby> mattch: sure :)
[16:39] <pmatulis_> mattch: using cephx for access to those pools/images too?
[16:39] <mattch> pmatulis_: Not in production, but it's been tried successfully in testing (opennebula doesn't have cephx support yet)
[16:42] <pmatulis_> mattch: my disk 'type' is 'network', not 'rbd'. did you mean protocol='rbd'?
[16:43] <mattch> pmatulis_: Sorry, yes - not got xml in front of me...
[16:43] <pmatulis_> mattch: ok. this is what i have if you ever have time to compare
[16:43] <pmatulis_> http://paste.ubuntu.com/6251348/
[16:44] * rudolfsteiner (~federicon@200.68.116.185) Quit (Quit: rudolfsteiner)
[16:44] <mattch> pmatulis_: http://pastebin.com/Sv5TdEdR is my config
[16:45] <pmatulis_> mattch: thx, i wonder if i should be using cache='writeback'
[16:46] <mattch> pmatulis_: Effectively the same configs it seems
[16:46] <mattch> pmatulis_: Depends on what your ceph setup looks like... iirc qemu pre 1.1 or 1.2 ? doesn't support cache options properly anyway
[16:46] * Cube (~Cube@66-87-67-213.pools.spcsdns.net) has joined #ceph
[16:48] <mattch> pmatulis_: it's 1.2: http://tracker.ceph.com/issues/2295
[16:50] <pmatulis_> mattch: yeah. i have 1.4
[16:52] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[16:52] * rudolfsteiner (~federicon@200.68.116.185) Quit ()
[16:53] <pmatulis_> i wonder how i can query librbd to see if it's really turned on
[16:54] * Cube (~Cube@66-87-67-213.pools.spcsdns.net) Quit (Ping timeout: 480 seconds)
[16:55] <mattch> pmatulis_: Should be enabled then... though I'd recommend reading up on writeback vs writethrough etc to see which option to choose... you can always take it out for a bit too and see if that's affecting your guests?
[16:56] <pmatulis_> mattch: yup, looking at just that
[16:57] <mattch> pmatulis_: writethrough only caches read requests, writeback caches writes too, so relies on you having something like battery-backup etc to avoid data loss
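For comparison with the two pastes above (which may have expired), a typical qemu/libvirt rbd disk definition looks roughly like this; the pool/image name, monitor host and secret UUID are placeholders, and the <auth> block is only needed when cephx is enabled:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source protocol='rbd' name='rbd/vm-disk-1'>
        <host name='mon1.example.com' port='6789'/>
      </source>
      <auth username='libvirt'>
        <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
      </auth>
      <target dev='vda' bus='virtio'/>
    </disk>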
[16:58] <peetaur> pmatulis_: which kernel version?
[16:58] * gregaf1 (~Adium@cpe-172-250-69-138.socal.res.rr.com) has joined #ceph
[16:59] <pmatulis_> peetaur: 3.8.0 on my kvm host
[16:59] <pmatulis_> Ubuntu 13.04
[16:59] <peetaur> k, 3.8 is likely a very suitable version
[17:00] <pmatulis_> well, something seems wrong, going to change to writethrough
[17:02] <pmatulis_> damn, broke on the second boot
[17:02] <peetaur> do you see any stack traces in syslog or anything? It's hard to troubleshoot a thing like that without logs
[17:03] <peetaur> or a screenshot
[17:06] <saaby> hah, we just found out that the slowdown/stall wasn't because of 0.61.9 but because 10 OSDs came back to the cluster after having been out for ~24 hours.
[17:06] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[17:07] <saaby> so we just ran a few tests; turns out that putting back two newly (re)created OSDs is fine, no I/O impact
[17:07] <saaby> but putting back two OSD's which have been "out" for ~24 hours almost stalls all I/O to that cluster
[17:07] <saaby> disturbing...
[17:07] <saaby> anyone seen something like that before?
[17:08] <saaby> this is all on 0.61.8
[17:09] <saaby> ...and the I/O busy rate on those OSDs' drives is never more than ~60% during either test btw.
[17:09] <saaby> network usage even less.
[17:11] <peetaur> out and up? or down?
[17:12] <saaby> asking me?
[17:12] <peetaur> I'd think there would be no effect at all if it was out and up ... all data would be already in the in+up cluster
[17:12] <peetaur> yes
[17:12] <saaby> those osd's where down,out for 24h
[17:12] * RuediR (~Adium@2001:620:0:25:8414:19ff:fe71:adf8) Quit (Quit: Leaving.)
[17:12] * RuediR (~Adium@130.59.94.143) has joined #ceph
[17:13] <peetaur> and was it all active+clean at the point they rejoined?
[17:13] * JustEra (~JustEra@89.234.148.11) Quit (Ping timeout: 480 seconds)
[17:13] <peetaur> if so, sounds like a painful bug
[17:13] <saaby> so, I am pretty sure we could get these 10 osds back into the cluster by deleting all data on the drives and recreating them (just as we do when a drive fails).
[17:13] <saaby> no, the slowdown happened when they began recovering/backfilling
[17:13] <peetaur> maybe try with fresh disks just in case .. so you still have a copy of the old one
[17:14] <saaby> but no slowdown when the re-created osd's backfilled
[17:14] <saaby> we did
[17:14] <peetaur> but I don't know much about that. I would never have expected that problem.
[17:14] <saaby> same server - one test with fresh disks - one test with osd's being down,out for 24h
[17:14] <saaby> two very different results
[17:14] <peetaur> but I'll try it out ... I'll shut down one of my test nodes and test it tomorrow or monday
[17:14] <saaby> no... this is quite surprising
[17:15] <peetaur> any suggestion on how to reproduce... just shutdown -h now on a node?
[17:15] <saaby> ok, you should probably make sure there is lots of I/O going on while they are down,out
[17:15] <saaby> thats what we did.
[17:15] <peetaur> okay ... so need to script some writes, which I don't have yet
[17:15] <saaby> but this is a 720 OSD cluster
[17:15] <saaby> size probably matters here..
[17:16] <saaby> 3 PB cluster...
[17:16] <peetaur> when I test a thing, I really abuse it ;) and so I've done lots of shutdown -r, got some kernel panics with btrfs, etc. and the only time it died was a messed-up monitor, and then I injected a different mon map all over and broke it ;)
[17:16] <peetaur> so it really surprises me... it's pretty resilient
[17:16] <saaby> right
[17:16] <peetaur> ah well, ... my test cluster is not quite that large ;)
[17:16] * sputnik13 (~sputnik13@64-73-250-90.static-ip.telepacific.net) has joined #ceph
[17:16] <saaby> we have been doing much of the same - and haven't seen anything like this before
[17:17] <saaby> all right... I just wanted to know if this was somehow common knowledge on cuttlefish or something..
[17:18] <saaby> we have support, so it's probably better we proceed on that route from here..
[17:19] <saaby> ticket created.
[17:19] <saaby> thanks for your feedback
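Not raised in the conversation itself, but the usual mitigation when rejoining OSDs starve client I/O is to throttle recovery and backfill; a sketch assuming cuttlefish-era option names, applied per OSD at runtime and persisted in ceph.conf:

    ceph tell osd.0 injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'   # repeat per osd
    # persistent form:
    #   [osd]
    #       osd max backfills = 1
    #       osd recovery max active = 1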
[17:19] * i_m (~ivan.miro@deibp9eh1--blueice1n1.emea.ibm.com) Quit (Ping timeout: 480 seconds)
[17:20] <peetaur> and thanks for your experience
[17:21] * RuediR (~Adium@130.59.94.143) Quit (Ping timeout: 480 seconds)
[17:21] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[17:25] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[17:26] * claenjoy (~leggenda@37.157.33.36) has joined #ceph
[17:27] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[17:28] * rudolfsteiner (~federicon@200.68.116.185) Quit ()
[17:30] * wenjianhn (~wenjianhn@114.245.46.123) Quit (Ping timeout: 480 seconds)
[17:31] * JustEra (~JustEra@ALille-555-1-102-208.w90-34.abo.wanadoo.fr) has joined #ceph
[17:32] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[17:36] * rudolfsteiner (~federicon@200.68.116.185) Quit ()
[17:36] <pmatulis_> peetaur: there were some apparmor problems i identified in the beginning of this but i disabled all my profiles
[17:37] <peetaur> pmatulis_: disable, or complain?
[17:37] <peetaur> I learned the hard way this week that complain is not disabled ... a "deny" line still denies.
[17:38] <Azrael> sjm: ping
[17:38] <sjm> hey
[17:38] <pmatulis_> peetaur: srsly?
[17:38] <pmatulis_> http://paste.ubuntu.com/6251605/
[17:38] <pmatulis_> peetaur: i did aa-complain
[17:39] <peetaur> so then it's complain mode rather than disabled (unconfined)
[17:39] <Azrael> sjm: josh here
[17:39] <peetaur> so if you have no deny lines, then in theory with my knowledge, it's the same as disabled + logging anyway
[17:39] <sjm> Azrael: whats up?
[17:39] <Azrael> sjm: you are sheldon, yes? the google hangout.
[17:39] <peetaur> but if you have deny lines, they still deny, which defeats the purpose of learning mode... how can it learn what a program needs to be allowed next if that access is denied?
[17:40] <sjm> Azrael: correct are you able to connect?
[17:40] <Azrael> sjm: i'm there but nobody else is there
[17:40] <Azrael> sjm: i can try to rejoin
[17:40] <sjm> Azrael: sorry let me join now
[17:40] * ScOut3R (~ScOut3R@catv-89-133-21-203.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[17:42] <mikedawson> saaby: what is the ticket number you created? Does it sound similar to my ticket? http://tracker.ceph.com/issues/6333#change-27664
[17:43] <pmatulis_> peetaur: disabled everything, broke at 3rd boot
[17:43] * gregaf1 (~Adium@cpe-172-250-69-138.socal.res.rr.com) Quit (Quit: Leaving.)
[17:43] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) has joined #ceph
[17:44] * pmatulis_ wonders why he still gets DENIED after disabling all apparmor profiles...
[17:45] <peetaur> pmatulis_: check for deny lines, and comment them out before aa-complain
[17:46] * gregsfortytwo (~Adium@cpe-172-250-69-138.socal.res.rr.com) has joined #ceph
[17:46] <peetaur> and what is the name of the profile you are using aa-complain with
[17:46] * aliguori (~anthony@74.202.210.82) has joined #ceph
[17:47] <pmatulis_> peetaur: i did 'sudo aa-disable /etc/apparmor.d/*' but 'apparmor_status' shows the weird kvm guest ones are still enforced
[17:48] <peetaur> disable might only do it for next startup
[17:48] <peetaur> did you restart the process?
[17:48] <peetaur> check with "ps -Z" also
[17:48] <pmatulis_> peetaur: no
[17:49] <pmatulis_> peetaur: so far, i've found aa-* commands are runtime
[17:50] <pmatulis_> peetaur: status shows only the weird kvm guest profiles enforced, but that's what the logs are pointing at
[17:50] * gregsfortytwo (~Adium@cpe-172-250-69-138.socal.res.rr.com) Quit ()
[17:50] <peetaur> you can go frm enforce to complain and back at runtime
[17:50] <peetaur> but you can't go from unconfined to complain/enforce without restart
[17:50] <peetaur> and I don't know if you can go to unconfined without restart
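A short recap of the checks peetaur is describing, using only tools already mentioned in this exchange; the libvirt-UUID profile name is a placeholder for the per-domain profile:

    sudo apparmor_status                                     # which profiles are enforced vs complaining
    ps -Zp "$(pgrep -f qemu)"                                # confinement of the running guests
    sudo aa-complain /etc/apparmor.d/libvirt/libvirt-UUID    # enforce -> complain works at runtime
    sudo aa-disable /etc/apparmor.d/libvirt/libvirt-UUID     # going unconfined may need the guest restarted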
[17:52] <pmatulis_> this is what i have: http://paste.ubuntu.com/6251675/
[17:52] <pmatulis_> and 'libvirt-d4c8b8fe-3554-11e3-bf96-005b78bdeff2' is in the logs (DENIED)
[17:52] * sprachgenerator (~sprachgen@130.202.135.218) has joined #ceph
[17:53] <pmatulis_> hm, i managed to disable that individual one
[17:55] <peetaur> heh that shows them in enforce even ... not complain
[17:56] * sagelap (~sage@2600:1012:b007:5c23:ecc6:8f80:a454:66ed) has joined #ceph
[17:58] <pmatulis_> peetaur: i got 7 consecutive good boots, i never got more than 3 before, after trying about 10 reboot sessions. but this 8th broke, might be something else though
[17:58] <peetaur> did you get denys logged this time?
[17:59] <pmatulis_> peetaur: no
[17:59] * mattt__ (~textual@92.52.76.140) Quit (Read error: Connection reset by peer)
[17:59] <pmatulis_> but now this: apparmor="STATUS" operation="profile_remove" info="profile does not exist" error=-2 name="libvirt-d4c8b8fe-3554-11e3-bf96-005b78bdeff2"
[18:00] <peetaur> that's likely from an aa-... command (possibly run by libvirt)
[18:00] <peetaur> I'm not sure how libvirt does it.
[18:00] <peetaur> I write my own qemu scripts, and they append to /etc/apparmor.d/local/... files and then run aa-enforce to replace
[18:01] <mikedawson> saaby: Just added a link to your trouble on my ticket (http://tracker.ceph.com/issues/6333#note-6). Could you let me know if you get anywhere on your ticket? I can confirm this affects Dumpling as well.
[18:04] <pmatulis_> peetaur: well, i can't even start my domain/guest now. i can enforce or disable that profile but i can't start this guest at all now, let alone get it to boot all the way
[18:04] * glzhao_ (~glzhao@118.195.65.67) Quit (Quit: Lost terminal)
[18:05] <pmatulis_> i don't think one is supposed to mess with those profiles
[18:05] * angdraug (~angdraug@64-79-127-122.static.wiline.com) has joined #ceph
[18:06] * themgt (~themgt@201-223-255-38.baf.movistar.cl) has joined #ceph
[18:12] <peetaur> which profiles?
[18:13] <peetaur> the ones with the hex number things after them?
[18:13] <peetaur> and BTW there's an #apparmor channel on this irc network
[18:14] <peetaur> I don't know what those hex number ones are
[18:15] * xarses (~andreww@c-71-202-167-197.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[18:17] * erice (~erice@71-208-244-175.hlrn.qwest.net) has joined #ceph
[18:17] * a_ (~a@209.12.169.218) has joined #ceph
[18:19] <pmatulis_> peetaur: they must be for each libvirt domain
[18:21] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[18:31] * yehudasa (~yehudasa@2607:f298:a:607:ea03:9aff:fe98:e8ff) has joined #ceph
[18:32] * sagelap (~sage@2600:1012:b007:5c23:ecc6:8f80:a454:66ed) Quit (Read error: Connection reset by peer)
[18:34] * ircolle (~Adium@c-67-172-132-222.hsd1.co.comcast.net) has joined #ceph
[18:35] * albionandrew (~albionand@64.25.15.100) has joined #ceph
[18:37] * Vjarjadian (~IceChat77@94.1.37.151) has joined #ceph
[18:39] <albionandrew> If I run "ceph osd repair osd.1" I get "unknown command repair" - http://ceph.com/docs/master/rados/operations/control/ says "ceph osd repair N"; I've tried "ceph osd repair 1" too
[18:40] <albionandrew> version - ceph version 0.61.8 (a6fdcca3bddbc9f177e4e2bf0d9cdd85006b028b)
[18:41] * xarses (~andreww@64-79-127-122.static.wiline.com) has joined #ceph
[18:43] * JustEra (~JustEra@ALille-555-1-102-208.w90-34.abo.wanadoo.fr) Quit (Quit: This computer has gone to sleep)
[18:48] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:48] * mtanski (~mtanski@69.193.178.202) Quit (Quit: mtanski)
[18:50] * xarses (~andreww@64-79-127-122.static.wiline.com) Quit (Quit: Leaving.)
[18:50] * xarses (~andreww@64-79-127-122.static.wiline.com) has joined #ceph
[18:56] * jbd_ (~jbd_@2001:41d0:52:a00::77) has left #ceph
[18:59] * albionandrew (~albionand@64.25.15.100) Quit (Quit: albionandrew)
[19:01] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[19:01] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[19:03] <pmatulis_> peetaur: fyi, disabling a/a globally (apparmor=0 kernel option) did not help
[19:14] * Vjarjadian (~IceChat77@94.1.37.151) Quit (Quit: Always try to be modest, and be proud about it!)
[19:15] * joshd1 (~jdurgin@2602:306:c5db:310:5840:6456:9bc4:d59) has joined #ceph
[19:15] <n1md4> hi. still trying to get xenserver techreview to talk with ceph. I've disabled auth on the cluster, but still getting this error:
[19:15] <n1md4> virStorageBackendRBDOpenRADOSConn:169 : internal error failed to connect to the RADOS monitor on: 10.11.4.52:6789
[19:16] <n1md4> I can telnet to 10.11.4.52:6789
[19:16] <n1md4> so this must be some internal mechanism of libvirt's that's getting in the way
[19:22] * mtanski (~mtanski@69.193.178.202) Quit (Quit: mtanski)
[19:25] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Read error: Operation timed out)
[19:26] * mattt_ (~textual@cpc25-rdng20-2-0-cust162.15-3.cable.virginmedia.com) has joined #ceph
[19:28] * gregaf (~Adium@2607:f298:a:607:dcd:f9b1:80f4:84cc) Quit (Quit: Leaving.)
[19:29] * gregsfortytwo (~Adium@2607:f298:a:607:dcd:f9b1:80f4:84cc) has joined #ceph
[19:30] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) has joined #ceph
[19:30] * ChanServ sets mode +v andreask
[19:39] * sputnik13 (~sputnik13@64-73-250-90.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[19:39] * Grasshopper (~quassel@rrcs-74-218-204-10.central.biz.rr.com) Quit (Read error: Connection reset by peer)
[19:39] * mattt_ (~textual@cpc25-rdng20-2-0-cust162.15-3.cable.virginmedia.com) Quit (Quit: Computer has gone to sleep.)
[19:40] * Grasshopper (~quassel@rrcs-74-218-204-10.central.biz.rr.com) has joined #ceph
[19:42] * Cube (~Cube@vfw00.dincloud.com) has joined #ceph
[19:47] * sleinen (~Adium@2001:620:0:46:d44a:2517:df4b:ebc9) Quit (Ping timeout: 480 seconds)
[19:54] * ScOut3R (~scout3r@dsl51B61603.pool.t-online.hu) has joined #ceph
[20:00] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[20:03] * gsaxena (~gsaxena@pool-71-178-225-182.washdc.fios.verizon.net) Quit (Remote host closed the connection)
[20:05] * jmlowe (~Adium@2601:d:a800:511:345c:56cb:4960:1280) has joined #ceph
[20:05] <jmlowe> So, how about those saucy builds?
[20:14] <mikedawson> jmlowe: you can likely install the raring debs on saucy (at least you could install quantal debs on raring)
[20:15] <jmlowe> mikedawson: the raring debs are working fine, I'd just like to have them for correctness
[20:16] * Pedras (~Adium@c-24-130-196-123.hsd1.ca.comcast.net) has joined #ceph
[20:16] <mikedawson> jmlowe: yep, +1 cc: glowell
[20:17] <jmlowe> mikedawson: saucy upgrade going as smoothly for you as it is for me?
[20:17] <mikedawson> jmlowe: waiting on some dev hardware to arrive on Monday
[20:18] <jmlowe> mikedawson: my qemu 1.4.2 vm's are live migrating to the distro 1.5.0, no problems yet
[20:19] <mikedawson> jmlowe: nice. We've been running 1.5.0 on raring without issue for a few months
[20:19] * The_Bishop (~bishop@f055147133.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[20:20] <tsnider> DQOTD: does the osd pool default size of 2 include the primary object or just the number of object replicas to create?
[20:20] * mtanski (~mtanski@69.193.178.202) Quit (Quit: mtanski)
[20:21] <jmlowe> mikedawson: biggest part of my upgrade process is doing the firmware on my hypervisor disks, apparently a firmware bug can cause the heads to scrape the rust off the platters
[20:21] <mikedawson> tsnider: default size of 2 gets a primary and a single replica
[20:22] <tsnider> mikedawson: thx. I've been arguing with myself and losing for a while. :)
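Putting mikedawson's answer in config terms: size counts every copy, so 2 means one primary plus one replica. A minimal snippet, with the pool name as a placeholder:

    # ceph.conf default for newly created pools
    [global]
        osd pool default size = 2      # total copies: primary + 1 replica
    # change an existing pool
    ceph osd pool set <pool> size 2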
[20:22] <jmlowe> mikedawson: yet another in a long line of "oops, our oem's firmware can trash the disk" from hp
[20:22] <mikedawson> jmlowe: yeah, that's not cool
[20:22] * alfredodeza (~alfredode@c-24-99-84-83.hsd1.ga.comcast.net) has joined #ceph
[20:23] <peetaur> tsnider: I did the same when I started reading about ceph :D
[20:23] * bandrus1 (~Adium@c-98-238-148-252.hsd1.ca.comcast.net) has joined #ceph
[20:28] * bandrus (~Adium@c-98-238-148-252.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[20:30] * The_Bishop (~bishop@i59F6C731.versanet.de) has joined #ceph
[20:33] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[20:37] * Cube (~Cube@vfw00.dincloud.com) Quit (Quit: Leaving.)
[20:42] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[20:43] * mtanski_ (~mtanski@69.193.178.202) has joined #ceph
[20:43] * mtanski_ (~mtanski@69.193.178.202) Quit ()
[20:44] * erice (~erice@71-208-244-175.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[20:47] * mtanski (~mtanski@69.193.178.202) Quit (Ping timeout: 480 seconds)
[20:55] * sagelap (~sage@38.122.20.226) has joined #ceph
[20:58] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[20:59] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) has joined #ceph
[20:59] * ChanServ sets mode +v andreask
[21:00] * andreask (~andreask@h081217067008.dyn.cm.kabsi.at) Quit ()
[21:00] * aurora (~a@78-61-117-10.static.zebra.lt) has joined #ceph
[21:02] <aurora> http://addland.blogspot.com/2013/10/amazing-dog.html
[21:04] * aurora (~a@78-61-117-10.static.zebra.lt) Quit ()
[21:05] * ScOut3R (~scout3r@dsl51B61603.pool.t-online.hu) Quit (Remote host closed the connection)
[21:10] * odyssey4me2 (~odyssey4m@41-132-104-169.dsl.mweb.co.za) Quit (Ping timeout: 480 seconds)
[21:17] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[21:18] * rudolfsteiner (~federicon@200.68.116.185) Quit (Quit: rudolfsteiner)
[21:20] * bandrus1 is now known as bandrus
[21:21] * Cube (~Cube@vfw00.dincloud.com) has joined #ceph
[21:22] * newb (~oftc-webi@egress.nitrosecurity.com) has joined #ceph
[21:23] <newb> I'd like to experiment with the ceph class methods technology. Are there any step-by-step documents on how to compile, link and deploy the method .so files?
[21:26] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[21:29] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[21:31] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) Quit (Remote host closed the connection)
[21:31] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) has joined #ceph
[21:38] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[21:38] * mattt_ (~textual@cpc25-rdng20-2-0-cust162.15-3.cable.virginmedia.com) has joined #ceph
[21:39] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) Quit (Ping timeout: 480 seconds)
[21:40] * newb (~oftc-webi@egress.nitrosecurity.com) Quit (Quit: Page closed)
[21:46] * JustEra (~JustEra@ALille-555-1-102-208.w90-34.abo.wanadoo.fr) has joined #ceph
[21:52] * nwat (~nwat@c-24-5-146-110.hsd1.ca.comcast.net) has joined #ceph
[21:55] * davidz (~Adium@ip68-5-239-214.oc.oc.cox.net) Quit (Quit: Leaving.)
[21:56] * Shmouel (~Sam@fny94-12-83-157-27-95.fbx.proxad.net) has joined #ceph
[21:59] * gregmark (~Adium@68.87.42.115) Quit (Quit: Leaving.)
[22:01] * Cube (~Cube@vfw00.dincloud.com) Quit (Quit: Leaving.)
[22:02] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) has joined #ceph
[22:02] * sagelap (~sage@38.122.20.226) Quit (Ping timeout: 480 seconds)
[22:06] * sagelap (~sage@38.122.20.226) has joined #ceph
[22:13] * amospalla (~amospalla@0001a39c.user.oftc.net) Quit (Remote host closed the connection)
[22:13] * amospalla (~amospalla@0001a39c.user.oftc.net) has joined #ceph
[22:14] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) Quit (Ping timeout: 480 seconds)
[22:17] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) has joined #ceph
[22:21] * rudolfsteiner (~federicon@mail.bittanimation.com) has joined #ceph
[22:30] * Vjarjadian (~IceChat77@94.1.37.151) has joined #ceph
[22:34] * alram (~alram@38.98.115.249) has joined #ceph
[22:40] * gucki (~smuxi@p549F96D5.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[22:50] * allsystemsarego (~allsystem@5-12-37-46.residential.rdsnet.ro) Quit (Quit: Leaving)
[22:51] * sarob (~sarob@nat-dip4.cfw-a-gci.corp.yahoo.com) has joined #ceph
[22:57] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[23:00] * sleinen1 (~Adium@2001:620:0:26:94fc:24a6:8fd3:36ce) has joined #ceph
[23:03] <JustEra> can we disable placement groups and just set it to replicate across all osds?
[23:04] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Read error: Operation timed out)
[23:09] * claenjoy (~leggenda@37.157.33.36) Quit (Quit: Leaving.)
[23:10] * alfredodeza (~alfredode@c-24-99-84-83.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[23:11] * sagelap (~sage@38.122.20.226) Quit (Ping timeout: 480 seconds)
[23:15] * yehudasa (~yehudasa@2607:f298:a:607:ea03:9aff:fe98:e8ff) Quit (Ping timeout: 480 seconds)
[23:23] * sagelap (~sage@38.122.20.226) has joined #ceph
[23:24] * sprachgenerator (~sprachgen@130.202.135.218) Quit (Quit: sprachgenerator)
[23:25] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) Quit (Ping timeout: 480 seconds)
[23:25] <Azrael> hi sagelap
[23:26] <Azrael> have you ever seen a group of osd's that have been offline for ~24 hours come back into a cluster and end up causing a drastic reduction in the op/sec the cluster can handle?
[23:29] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) Quit (Ping timeout: 480 seconds)
[23:30] <mikedawson> Azrael: are you with saaby?
[23:31] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Quit: Leaving.)
[23:31] <mikedawson> Azrael: Does it sound at all like this? http://tracker.ceph.com/issues/6333#note-6
[23:33] <infernix> what was the benchmark tool again to test raw rados performance on a box with OSDs?
[23:34] <Azrael> mikedawson: yep
[23:34] <Azrael> mikedawson: and let me take a look
[23:34] <mikedawson> infernix: rados bench (tests aggregate cluster throughput) or ceph tell osd.* bench (tests individual osds)
[23:35] * infernix builds a 48 SSD osd box
[23:36] <infernix> this thing does 23GByte/s random large block reads in raid 10, 10GByte/s random large block writes
[23:36] <infernix> let's see what rados can do
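As mikedawson points out above, the two usual tools are rados bench for aggregate throughput and the per-osd bench command; a minimal sketch, with the pool name and osd id as placeholders:

    rados bench -p rbd 60 write   # drive 60 seconds of writes against pool 'rbd' and report throughput
    ceph tell osd.0 bench         # exercise a single osd's backing store
    ceph tell osd.\* bench        # all osds at once (escape the * so the shell does not expand it)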
[23:36] <Azrael> mikedawson: sounds similar
[23:36] <mikedawson> Azrael: we see that type of issue with Cuttlefish or Dumpling. Watch for saturated spindles (iostat %util at 100%). We see some of our rbd-backed VMs hang (on reads, we believe) while others run without much disruption
[23:37] <Azrael> that's just it
[23:37] <Pedras> infernix: what kind/how many network interfaces on that box?
[23:37] <Azrael> we don't get past 60% util on any disk
[23:37] <Azrael> but the entire cluster is affected
[23:37] <Azrael> one server out of 30
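The per-disk saturation both are describing is the %util column of extended iostat output, e.g.:

    iostat -x 1   # %util near 100% means the spindle is saturated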
[23:38] <mikedawson> Azrael: if you watch ceph -w or the monitor logs, do you get slow request warnings during the time of degraded performance?
[23:38] <Azrael> hmm i did not use ceph -w at that time
[23:38] <Azrael> i will check the logs now
[23:38] * houkouonchi-work (~linux@12.248.40.138) Quit (Read error: Connection reset by peer)
[23:38] <Azrael> yes
[23:39] <Azrael> lots of slow requests
[23:39] <Azrael> tons
[23:39] <mikedawson> Azrael: it only takes a single OSD process on my cluster to affect everything
[23:39] <Azrael> 1.3 million to be exact
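The warnings can be watched live or counted after the fact; the paths below are the stock defaults:

    ceph -w                                                # live cluster log; slow request warnings scroll past here
    grep -c 'slow request' /var/log/ceph/ceph.log          # count them in the cluster log on a monitor host
    grep -c 'slow request' /var/log/ceph/ceph-osd.12.log   # or in an individual osd's log (osd id is an example)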
[23:39] <Azrael> interesting
[23:39] <Azrael> so what about that osd causes the slowdown
[23:39] <Azrael> is that osd brand new? fully zero'd/wiped/clean?
[23:39] <Azrael> or is it one that was just out for a while?
[23:40] * Cube (~Cube@vfw00.dincloud.com) has joined #ceph
[23:40] <mikedawson> Azrael: I have occasionally had luck restarting all osd processes in parallel. They seem to clear all the slow requests and start over. More often than not, they will get bogged down again, though
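The exact restart command depends on the init system; roughly, for the sysvinit and upstart setups of this era:

    service ceph restart osd   # sysvinit: restart every osd defined in ceph.conf on this host
    restart ceph-osd-all       # upstart (Ubuntu): restart all osd jobs on this host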
[23:42] * iii8 (~Miranda@91.207.132.71) Quit (Read error: Connection reset by peer)
[23:42] <Azrael> it's very strange. we have been running cuttlefish stably and with good performance for a long while now.
[23:42] * onizo (~onizo@wsip-184-182-190-131.sd.sd.cox.net) Quit (Remote host closed the connection)
[23:43] <mikedawson> Azrael: I think it can happen with new OSDs or OSDs that have been out for a while
[23:43] <mikedawson> Azrael: do you run XFS under your OSDs?
[23:44] <mikedawson> Azrael: If so, will you check for xfs extent fragmentation for me? xfs_db -c frag -r /dev/sdb1
[23:45] <Azrael> yes, xfs
[23:45] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) Quit (Quit: ...)
[23:46] <Azrael> i'm guessing the xfs_db run will take a few mins to return
[23:46] * mattt_ (~textual@cpc25-rdng20-2-0-cust162.15-3.cable.virginmedia.com) Quit (Quit: Computer has gone to sleep.)
[23:47] * ScOut3R (~scout3r@dsl51B61603.pool.t-online.hu) has joined #ceph
[23:47] <mikedawson> Azrael: several hours, possibly (if it is a large disk with heavy fragmentation). Our 3TB'ers were over 80%
[23:48] <Azrael> ahh
[23:48] <Azrael> i'll run it later then likely in screen
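The check itself is read-only; defragmenting is a separate (and also long-running) step. A sketch, with the device and osd mount point as examples only:

    xfs_db -c frag -r /dev/sdb1            # report extent fragmentation; can take hours on a large, full disk
    xfs_fsr -v /var/lib/ceph/osd/ceph-12   # optionally defragment the mounted osd filesystem afterwards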
[23:48] <loicd> is it possible for a pre-formatted disk to join a cluster and get an osd id automagically?
[23:49] <loicd> I mean if a disk has a partition whose magic matches the udev magic ... I don't know how that works :-(
[23:49] <dmick> I just filed a bug about that
[23:49] * loicd investigating to be able to formulate the question properly
[23:49] <loicd> dmick: about me asking nonsensical questions? :-)
[23:50] * alram (~alram@38.98.115.249) Quit (Quit: leaving)
[23:51] <dmick> sigh. or maybe I didn't actually hit "submit", I can't find it. grr.
[23:51] <dmick> anyway, see the udev files in the src tree; the magic GPT partition type IDs are the key
[23:51] <dmick> then udev kicks off the process
[23:51] <dmick> "new partition" events happen at boot or at rescan time
[23:53] * Cube (~Cube@vfw00.dincloud.com) Quit (Quit: Leaving.)
[23:56] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) has joined #ceph
[23:57] * sjm (~sjm@38.98.115.249) has left #ceph
[23:58] * mozg (~andrei@host86-184-120-113.range86-184.btcentralplus.com) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.