#ceph IRC Log


IRC Log for 2016-10-13

Timestamps are in GMT/BST.

[0:02] <T1> doppelgrau's suggestion of only mirroring the OS partitions and then having a 1:4 ratio would probably be the better solution
[0:03] <T1> I just don't like the thought of a single SSD failure taking 4 OSDs down with it
[0:03] <doppelgrau> T1: a failing mainboard or bad powersupply could take the whole server down
[0:04] <T1> yes, but we're going with redundant PSUs just for that (one of the only places we have redundancy in those nodes)
[0:05] * doppelgrau had simply accepted a whole server as a failure domain and with distributing the data to nodes on different switches some network errors are tolerated by ceph too
[0:06] <T1> alas I cannot say I've never had a motherboard or CPU fail on me, but it's only happened once in the past 11 years
[0:06] <doppelgrau> T1: assuming "fail silent", but not failing with too high voltage ...
[0:07] <T1> the "best part" was when we saw the replacement motherboard had a faulty dimm slot, so we had to schedule yet another replacement the next day
[0:07] <doppelgrau> but "murphy" always catches you with some errors you did not think of before - once I got many many slow requests with no reason at a first look
[0:07] <T1> at least it only cost me a bit of missed sleep - yay for mission-critical 4H support
[0:08] <doppelgrau> everything was healthy, load even lower than normal ...
[0:08] <bstillwell> Slow requests can be painful to track down the first time.
[0:09] <doppelgrau> in the end, it was a "zombie" HDD, nearly dead (only very few IO/s), but alive enough that the OSD did not kill itself
[0:09] <bstillwell> doppelgrau: Sick hardware is the worst
[0:09] <T1> hopefully you could see what PGs were slow and trace it to the same OSD
[0:09] <bstillwell> Wait until you get sick network links between your spine and leaf switches...
[0:10] <T1> hah! been there
[0:10] <T1> iperf in one direction - all good and performed as expected
[0:10] <T1> iperf in reverse.. <10% of expected throughput
[0:11] <T1> "but nothing is changed"
[0:11] <T1> made someone swap a cable anyway
[0:11] <T1> .. and back to normal and expected throughput in both directions
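T1's test above can be sketched with iperf3; the hostname here is a placeholder, not one of the actual nodes from the log:

```shell
# Run a server on the far end of the suspect link first:
#   iperf3 -s
# Then test from the near end, in both directions:
iperf3 -c storage-node1             # forward direction (client sends)
iperf3 -c storage-node1 --reverse   # reverse direction (server sends)
# A large asymmetry between the two runs often points at a bad cable,
# a failing SFP/transceiver, or a negotiation problem on one side.
```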
[0:12] * ira (~ira@c-24-34-255-34.hsd1.ma.comcast.net) Quit (Quit: Leaving)
[0:14] * Racpatel (~Racpatel@2601:87:3:31e3::34db) Quit (Ping timeout: 480 seconds)
[0:14] <kjetijor> my least favorite instance of sick hardware so far has been "sick" memory - 1 out of 1300 compute nodes, orders of magnitude slower than the others for memory-intensive things. Running memtest, the only difference observed was a factor-of-3600 difference in total execution time.
[0:14] <kjetijor> swap dimms between identical compute nodes - problem gone.
[0:14] <T1> ewww
[0:15] <doppelgrau> "nice"
[0:15] <T1> I've had a server with excessive ECC correction ratio on one dimm slot coming and going
[0:16] <T1> it was not related to the dimms used
[0:16] <T1> and it was not the same slot from time to time
[0:16] <T1> it was just freaky
[0:18] <kjetijor> Didn't have any complaints from the EDAC (or whatever it may have been called at the time) about ECC making correction(s), although this could also be due to missing/buggy implementations interacting with the EDAC subsystem.
[0:18] <T1> evil
[0:20] <kjetijor> for the throughput drop - I do find the tcp retransmit counters useful for proving its existence, although if it's switch-switch (i.e. between leaf and spine) you're sort-of reliant on *some* counter from the switch about dropped packet(s).
[0:20] <kjetijor> .. for locating where it is.
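A hedged sketch of the counters kjetijor mentions; the interface name is an assumption and the exact output format varies by kernel and driver:

```shell
# Host-side TCP retransmit counters (Linux):
netstat -s | grep -i retrans        # or: nstat -az TcpRetransSegs
# Per-NIC drop/error counters (replace eth0 with your interface):
ethtool -S eth0 | grep -iE 'drop|err'
# For a leaf-spine (switch-to-switch) link you depend on the switch's
# own interface counters, e.g. ifInErrors/ifOutDiscards via SNMP.
```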
[0:21] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[0:21] <T1> mmm - never underestimate various counters
[0:22] * BrianA1 (~BrianA@fw-rw.shutterfly.com) has joined #ceph
[0:25] * BrianA (~BrianA@fw-rw.shutterfly.com) Quit (Ping timeout: 480 seconds)
[0:28] * w0lfeh (~ZombieL@tsn109-201-152-26.dyn.nltelcom.net) Quit ()
[0:29] * dneary (~dneary@nat-pool-bos-u.redhat.com) Quit (Remote host closed the connection)
[0:29] * BrianA1 (~BrianA@fw-rw.shutterfly.com) Quit (Read error: Connection reset by peer)
[0:29] * xinli (~charleyst@ Quit (Ping timeout: 480 seconds)
[0:29] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[0:30] * mattbenjamin (~mbenjamin@97-117-49-192.slkc.qwest.net) Quit (Ping timeout: 480 seconds)
[0:30] * dneary (~dneary@nat-pool-bos-u.redhat.com) has joined #ceph
[0:33] * haplo37 (~haplo37@ Quit (Ping timeout: 480 seconds)
[0:38] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[0:38] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) Quit (Quit: doppelgrau)
[0:39] * bene3 (~bene@nat-pool-bos-t.redhat.com) Quit (Quit: Konversation terminated!)
[0:45] * fdmanana (~fdmanana@2001:8a0:6e0c:6601:a095:7e95:6611:bc7) Quit (Ping timeout: 480 seconds)
[0:45] * mattbenjamin (~mbenjamin@97-117-49-192.slkc.qwest.net) has joined #ceph
[0:50] * davidzlap (~Adium@2605:e000:1313:8003:6156:a079:6823:cfb6) Quit (Quit: Leaving.)
[0:52] * Unai (~Adium@50-115-70-150.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[0:52] * lcurtis (~lcurtis@ Quit (Read error: Connection reset by peer)
[0:52] * joao sets mode -o joao
[0:52] * davidzlap (~Adium@2605:e000:1313:8003:6156:a079:6823:cfb6) has joined #ceph
[0:53] * ChanServ sets mode -v joao
[0:53] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[0:56] * Pulp (~Pulp@151-168-35-213.dyn.estpak.ee) has joined #ceph
[0:58] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[1:03] * dneary (~dneary@nat-pool-bos-u.redhat.com) Quit (Ping timeout: 480 seconds)
[1:07] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[1:13] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) Quit (Ping timeout: 480 seconds)
[1:14] * rakeshgm (~rakesh@ has joined #ceph
[1:16] * oms101 (~oms101@p20030057EA491300C6D987FFFE4339A1.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[1:23] * bniver (~bniver@pool-71-174-250-171.bstnma.fios.verizon.net) has joined #ceph
[1:25] * diver (~diver@cpe-2606-A000-111B-C12B-8017-6EC9-7523-3BEA.dyn6.twc.com) has joined #ceph
[1:25] * rwheeler (~rwheeler@174-23-105-60.slkc.qwest.net) Quit (Remote host closed the connection)
[1:25] * oms101 (~oms101@p20030057EA44F600C6D987FFFE4339A1.dip0.t-ipconnect.de) has joined #ceph
[1:27] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[1:33] * diver (~diver@cpe-2606-A000-111B-C12B-8017-6EC9-7523-3BEA.dyn6.twc.com) Quit (Ping timeout: 480 seconds)
[1:35] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[1:35] * sudocat (~dibarra@ Quit (Ping timeout: 480 seconds)
[1:42] * wushudoin (~wushudoin@ Quit (Ping timeout: 480 seconds)
[1:44] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[1:47] * xarses_ (~xarses@ Quit (Ping timeout: 480 seconds)
[1:48] * mattbenjamin (~mbenjamin@97-117-49-192.slkc.qwest.net) Quit (Ping timeout: 480 seconds)
[1:53] * davidzlap (~Adium@2605:e000:1313:8003:6156:a079:6823:cfb6) Quit (Quit: Leaving.)
[1:56] * davidzlap (~Adium@2605:e000:1313:8003:6156:a079:6823:cfb6) has joined #ceph
[1:59] * xarses_ (~xarses@c-73-202-191-48.hsd1.ca.comcast.net) has joined #ceph
[2:04] * debian112 (~bcolbert@c-73-184-103-26.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[2:12] * towo (~towo@towo.netrep.oftc.net) Quit (Ping timeout: 480 seconds)
[2:13] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[2:14] * Skaag (~lunix@ Quit (Quit: Leaving.)
[2:14] * debian112 (~bcolbert@c-73-184-103-26.hsd1.ga.comcast.net) has joined #ceph
[2:21] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[2:23] * salwasser (~Adium@c-73-219-86-22.hsd1.ma.comcast.net) has joined #ceph
[2:25] * Bobby (~CobraKhan@ has joined #ceph
[2:31] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) Quit (Quit: wes_dillingham)
[2:36] * vbellur (~vijay@ Quit (Remote host closed the connection)
[2:40] * mattt (~mattt@lnx1.defunct.ca) Quit (Quit: leaving)
[2:41] * kristen (~kristen@jfdmzpr03-ext.jf.intel.com) Quit (Quit: Leaving)
[2:41] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) has joined #ceph
[2:46] * davidzlap (~Adium@2605:e000:1313:8003:6156:a079:6823:cfb6) Quit (Quit: Leaving.)
[2:47] * dneary (~dneary@pool-96-233-46-27.bstnma.fios.verizon.net) has joined #ceph
[2:47] * vata (~vata@ Quit (Remote host closed the connection)
[2:49] * Racpatel (~Racpatel@2601:87:3:31e3::34db) has joined #ceph
[2:49] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[2:50] * Pulp (~Pulp@151-168-35-213.dyn.estpak.ee) Quit (Read error: Connection reset by peer)
[2:53] * cmart (~cmart@ Quit (Ping timeout: 480 seconds)
[2:55] * Bobby (~CobraKhan@ Quit ()
[2:57] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[2:58] * etamponi (~etamponi@net-93-71-251-206.cust.vodafonedsl.it) Quit ()
[3:02] * davidzlap (~Adium@2605:e000:1313:8003:6156:a079:6823:cfb6) has joined #ceph
[3:03] * davidzlap (~Adium@2605:e000:1313:8003:6156:a079:6823:cfb6) Quit ()
[3:05] * towo (~towo@metadatenhafen.de) has joined #ceph
[3:09] * topro (~prousa@p578af414.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[3:11] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) Quit (Quit: wes_dillingham)
[3:12] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[3:14] * evelu (~erwan@aut78-1-78-236-183-64.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[3:14] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[3:15] * salwasser (~Adium@c-73-219-86-22.hsd1.ma.comcast.net) Quit (Quit: Leaving.)
[3:19] * diver (~diver@cpe-98-26-71-226.nc.res.rr.com) has joined #ceph
[3:19] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) has joined #ceph
[3:22] * vata (~vata@ has joined #ceph
[3:23] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) Quit ()
[3:24] * diver (~diver@cpe-98-26-71-226.nc.res.rr.com) Quit (Remote host closed the connection)
[3:24] * diver (~diver@cpe-2606-A000-111B-C12B-9033-CF90-B387-1CDD.dyn6.twc.com) has joined #ceph
[3:26] * diver (~diver@cpe-2606-A000-111B-C12B-9033-CF90-B387-1CDD.dyn6.twc.com) Quit ()
[3:26] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[3:27] * Racpatel (~Racpatel@2601:87:3:31e3::34db) Quit (Ping timeout: 480 seconds)
[3:34] * Craig (~quassel@ has joined #ceph
[3:35] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[3:36] * Racpatel (~Racpatel@2601:87:3:31e3:4e34:88ff:fe87:9abf) has joined #ceph
[3:52] * derjohn_mobi (~aj@x4db0f8e6.dyn.telefonica.de) has joined #ceph
[3:53] * yanzheng (~zhyan@ has joined #ceph
[3:53] * steve_ (~steve@ip68-98-63-137.ph.ph.cox.net) Quit (Ping timeout: 480 seconds)
[3:58] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[3:58] * jfaj (~jan@p20030084AD2EBE006AF728FFFE6777FF.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[3:59] * derjohn_mob (~aj@x590c583f.dyn.telefonica.de) Quit (Ping timeout: 480 seconds)
[4:02] * haplo37 (~haplo37@107-190-42-94.cpe.teksavvy.com) has joined #ceph
[4:03] * steve_ (~steve@ip68-98-63-137.ph.ph.cox.net) has joined #ceph
[4:03] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[4:08] * jfaj (~jan@p20030084AD3FB7016AF728FFFE6777FF.dip0.t-ipconnect.de) has joined #ceph
[4:10] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) has joined #ceph
[4:11] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[4:15] * vasu (~vasu@c-73-231-60-138.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[4:19] * georgem (~Adium@ has joined #ceph
[4:21] * toastydeath (~toast@pool-71-255-253-39.washdc.fios.verizon.net) has joined #ceph
[4:30] * vicente (~~vicente@125-227-238-55.HINET-IP.hinet.net) has joined #ceph
[4:32] * Racpatel (~Racpatel@2601:87:3:31e3:4e34:88ff:fe87:9abf) Quit (Quit: Leaving)
[4:39] * georgem (~Adium@ Quit (Quit: Leaving.)
[4:39] * georgem (~Adium@ has joined #ceph
[4:40] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[4:43] * kefu (~kefu@ has joined #ceph
[4:45] * newdave (~newdave@36-209-181-180.cpe.skymesh.net.au) Quit (Quit: My Mac has gone to sleep. ZZZzzz...)
[4:48] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[5:01] * haplo37 (~haplo37@107-190-42-94.cpe.teksavvy.com) Quit (Read error: Connection reset by peer)
[5:08] * efirs (~firs@ has joined #ceph
[5:16] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[5:16] * Skaag (~lunix@cpe-172-91-77-84.socal.res.rr.com) has joined #ceph
[5:21] * georgem (~Adium@ Quit (Quit: Leaving.)
[5:24] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[5:30] * rakeshgm (~rakesh@ Quit (Quit: Peace :))
[5:33] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[5:33] * topro (~prousa@ has joined #ceph
[5:35] * kefu (~kefu@ Quit (Read error: Connection reset by peer)
[5:35] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:36] * kefu (~kefu@ has joined #ceph
[5:39] * jidar (~jidar@ Quit (Quit: stuff.)
[5:39] * Vacuum_ (~Vacuum@ has joined #ceph
[5:41] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Ping timeout: 480 seconds)
[5:46] * Vacuum__ (~Vacuum@ Quit (Ping timeout: 480 seconds)
[5:56] * Skaag (~lunix@cpe-172-91-77-84.socal.res.rr.com) Quit (Quit: Leaving.)
[6:02] * janos_ (~messy@static-71-176-211-4.rcmdva.fios.verizon.net) has joined #ceph
[6:05] * Skaag (~lunix@cpe-172-91-77-84.socal.res.rr.com) has joined #ceph
[6:05] * ivve (~zed@cust-gw-11.se.zetup.net) has joined #ceph
[6:09] * janos (~messy@static-71-176-211-4.rcmdva.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[6:10] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[6:12] * walcubi (~walcubi@p5795AB17.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[6:12] * walcubi (~walcubi@p5797AA84.dip0.t-ipconnect.de) has joined #ceph
[6:12] * cmart (~cmart@ip68-226-23-16.tc.ph.cox.net) has joined #ceph
[6:19] * jidar (~jidar@ has joined #ceph
[6:23] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[6:24] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[6:29] * wkennington (~wak@0001bde8.user.oftc.net) has joined #ceph
[6:31] * TomasCZ (~TomasCZ@yes.tenlab.net) Quit (Quit: Leaving)
[6:34] * kefu_ (~kefu@ has joined #ceph
[6:34] * kefu (~kefu@ Quit (Read error: Connection reset by peer)
[6:36] * kefu_ is now known as kefu
[6:38] * mnc (~mnc@c-50-137-214-131.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[6:38] * diq (~diq@2620:11c:f:2:c23f:d5ff:fe62:112c) Quit (Ping timeout: 480 seconds)
[6:40] * diq (~diq@2620:11c:f:2:c23f:d5ff:fe62:112c) has joined #ceph
[6:44] * Hemanth (~hkumar_@ has joined #ceph
[6:48] * dneary (~dneary@pool-96-233-46-27.bstnma.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[6:53] * newdave (~newdave@36-209-181-180.cpe.skymesh.net.au) has joined #ceph
[6:54] * mivaho_ (~quassel@2001:983:eeb4:1:c0de:69ff:fe2f:5599) has joined #ceph
[6:54] * Craig (~quassel@ Quit (Quit: No Ping reply in 180 seconds.)
[6:56] * Craig (~quassel@125-227-147-112.HINET-IP.hinet.net) has joined #ceph
[6:56] * mivaho (~quassel@2001:983:eeb4:1:c0de:69ff:fe2f:5599) Quit (Ping timeout: 480 seconds)
[6:59] * cmart (~cmart@ip68-226-23-16.tc.ph.cox.net) Quit (Ping timeout: 480 seconds)
[7:03] * karnan (~karnan@ has joined #ceph
[7:06] * Hemanth (~hkumar_@ Quit (Ping timeout: 480 seconds)
[7:13] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[7:17] * rdas (~rdas@ has joined #ceph
[7:22] * font (~ifont@ Quit (Ping timeout: 480 seconds)
[7:28] * Hemanth (~hkumar_@ has joined #ceph
[7:34] * font (~ifont@ has joined #ceph
[7:36] * font (~ifont@ Quit ()
[7:36] * font (~ifont@ has joined #ceph
[7:38] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) has joined #ceph
[7:39] * font (~ifont@ has left #ceph
[7:46] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[7:46] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) Quit (Ping timeout: 480 seconds)
[7:54] * toastyde1th (~toast@pool-71-255-253-39.washdc.fios.verizon.net) has joined #ceph
[7:58] * vimal (~vikumar@ has joined #ceph
[7:59] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[8:01] * toastydeath (~toast@pool-71-255-253-39.washdc.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[8:13] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) has joined #ceph
[8:23] * Popz (~BillyBobJ@ has joined #ceph
[8:26] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[8:26] * kefu (~kefu@ Quit (Max SendQ exceeded)
[8:27] * Be-El (~blinke@nat-router.computational.bio.uni-giessen.de) has joined #ceph
[8:27] * kefu (~kefu@ has joined #ceph
[8:29] * rraja (~rraja@ has joined #ceph
[8:30] * Skaag (~lunix@cpe-172-91-77-84.socal.res.rr.com) Quit (Quit: Leaving.)
[8:37] * Hemanth (~hkumar_@ Quit (Ping timeout: 480 seconds)
[8:38] * blizzow (~jburns@2601:280:4a00:4c00:fae4:f4b0:66e2:e966) has joined #ceph
[8:45] * T1w (~jens@node3.survey-it.dk) has joined #ceph
[8:53] * Popz (~BillyBobJ@ Quit ()
[8:54] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[8:55] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[9:02] * ffilzwin2 (~ffilz@c-67-170-185-135.hsd1.or.comcast.net) has joined #ceph
[9:03] * jowilkin (~jowilkin@184-23-213-254.fiber.dynamic.sonic.net) Quit (Quit: Leaving)
[9:03] * ade (~abradshaw@pool- has joined #ceph
[9:03] * ade (~abradshaw@pool- Quit (Remote host closed the connection)
[9:04] * ade (~abradshaw@pool- has joined #ceph
[9:07] * branto (~branto@transit-86-181-132-209.redhat.com) has joined #ceph
[9:09] * ffilzwin (~ffilz@c-67-170-185-135.hsd1.or.comcast.net) Quit (Ping timeout: 480 seconds)
[9:11] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) has joined #ceph
[9:11] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[9:12] * Jeeves_ (~mark@2a03:7900:1:1:4cac:cad7:939b:67f4) Quit (Remote host closed the connection)
[9:14] * Jeeves_ (~mark@2a03:7900:1:1:4cac:cad7:939b:67f4) has joined #ceph
[9:15] * blahdodo (~blahdodo@ Quit (Ping timeout: 480 seconds)
[9:17] * trololo (~cypher@fra06-3-78-243-232-119.fbx.proxad.net) has joined #ceph
[9:18] * eth00 (~eth00@ has joined #ceph
[9:18] * spgriffinjr (~spgriffin@ Quit (Read error: Connection reset by peer)
[9:18] * eth00- (~eth00@ Quit (Read error: Connection reset by peer)
[9:18] * spgriffinjr (~spgriffin@66-46-246-206.dedicated.allstream.net) has joined #ceph
[9:19] * rakeshgm (~rakesh@ has joined #ceph
[9:20] * benhuang (~ben@ has joined #ceph
[9:20] * benhuang (~ben@ has left #ceph
[9:21] * analbeard (~shw@ has joined #ceph
[9:22] * benhuang (~ben@ has joined #ceph
[9:22] <benhuang> hi
[9:30] * benhuang (~ben@ Quit ()
[9:40] * fsimonce (~simon@ has joined #ceph
[9:40] * Hemanth (~hkumar_@ has joined #ceph
[9:43] * maybebuggy (~maybebugg@2a01:4f8:191:2350::2) has joined #ceph
[9:45] <thoht> hi; yesterday i removed an OSD from a host
[9:45] <thoht> i removed the osd from the crush map
[9:46] <thoht> but i still see the hostname of this node in "ceph osd tree" : https://gist.github.com/nlienard/ef6ba37b87be7eb48b328275ea2b5932
[9:46] <thoht> any way to remove the host completely?
[9:46] <Be-El> you can remove the crush bucket representing the host
[9:47] <thoht> Be-El: how ?
[9:48] <Be-El> afaik 'ceph osd crush remove <bucket name>'
[9:48] <thoht> oh like for the osd
[9:48] * rmart04 (~rmart04@support.memset.com) has joined #ceph
[9:48] <thoht> Be-El: all good it worked
[9:48] <thoht> thanks !
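The sequence Be-El describes, as a sketch; osd.12 and the host bucket name are hypothetical stand-ins:

```shell
ceph osd crush remove osd.12         # remove the OSD from the CRUSH map
ceph osd crush remove dnxovh-hy001   # then remove the now-empty host bucket
ceph osd tree                        # verify the hostname no longer appears
```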
[9:49] * Mibka (~andy@ has joined #ceph
[9:49] <Be-El> you're welcome
[9:49] <Mibka> I have a couple of Dell R710 servers with an integrated H700 controller. I wanted to use Ceph on them but now I realize these H700 controllers do not support jbod mode.. So, I need a different sas card in all of them, right?
[9:50] <IcePic> see if you can make a bunch of one-disk raid0 setups, one per disk
[9:50] <IcePic> many LSI controllers I tried allowed for that, as a workaround when you want to expose the separate disks
[9:51] <Be-El> there's also usually a single command to convert all unassociated disks into single-disk-raid0
[9:51] * wjw-freebsd2 (~wjw@smtp.digiware.nl) Quit (Ping timeout: 480 seconds)
[9:51] <Be-El> as an added benefit the raid0 setup is also able to use the controller's cache
[9:51] <Mibka> IcePic: I think it can do this but what if I want to add more disks later? I will have to reboot every server and create the single-disk-raid setup in the h700 bios?
[9:52] <IcePic> depends on if the OS you run has raid tools or not
[9:52] <Mibka> IcePic: it's Linux.. Not sure if there are tools for the H700 but probably there are ...
[9:53] * TMM (~hp@dhcp-077-248-009-229.chello.nl) Quit (Quit: Ex-Chat)
[9:53] <Mibka> I think there's MegaCLI for this
[9:53] <IcePic> yes, or Storcli and whatever their names are
[9:54] <IcePic> might as well look into that immediately, so you can use it later to read info about the disks and so on for monitoring
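A hedged example of the single-disk RAID0 workaround on an LSI-based controller (the H700 is an LSI rebrand); the controller index and the enclosure:slot number are assumptions, so check the `show` output first:

```shell
storcli /c0 show                            # list controller, enclosures, drives
storcli /c0 add vd type=raid0 drives=32:4   # one-disk RAID0 from enclosure 32, slot 4
# Rough MegaCli equivalent for the same hypothetical slot:
#   MegaCli -CfgLdAdd -r0 [32:4] -a0
```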
[9:54] <thoht> Be-El: i need also to remove the MON of this host. what is the best way ?
[9:54] <Mibka> True ;)
[9:55] <Be-El> thoht: use either ceph-deploy (if you setup the cluster with ceph-deploy), or use the manual way described in the documentation
[9:55] <Mibka> IcePic: And it won't be a performance problem, right? I added 3x 960Gb SSDs in every R710. Maybe IOPS-wise this single disk as raid0 is a bad idea?
[9:55] <Be-El> thoht: how many mons do you currently use?
[9:55] <IcePic> Mibka: no, ceph (and other filesystems like zfs and so on) actually want to talk to every single disk
[9:55] <thoht> Be-El: doc is unreachable. i've 4 MON at this very moment, and i'll go to 3 after removing this one
[9:56] <IcePic> as far as iops go, the raid card does the work, so the "extra" of being raid0 on a single disk is unnoticeable
[9:57] <Mibka> IcePic: okay ;) So there is no point in replacing the H700 with something else. And Be-El just even said there's the advantage of the disk being able to use the controller's cache too ;)
[9:57] <Mibka> I'll look into MegaCLI so I'm sure I can manage and monitor the disks correctly before I put it in production
[9:57] <IcePic> Mibka: yes, and if this is the standard setup at your place, sticking to the defaults will also help
[9:57] <Be-El> thoht: the manual command is 'ceph mon rm <monitor>'. not sure whether you need to stop the monitor process before or after running that command
[9:58] <thoht> Be-El: it returns : Error EINVAL: removing mon.dnxovh-hy002 at, there will be 3 monitors
[9:59] <thoht> Be-El: but it worked
[9:59] <thoht> i got 3 MON now :) thanks :
[9:59] <thoht> the output is a bit tricky
[9:59] <Be-El> don't forget to stop the process itself
[9:59] <IcePic> and fix ceph.conf files if they still refer to the removed one.
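A sketch of the manual removal path discussed above; the monitor name comes from the error message in the log, and the service invocation depends on your init system:

```shell
# Stop the daemon, then remove the monitor from the monmap:
service ceph stop mon.dnxovh-hy002   # or: systemctl stop ceph-mon@dnxovh-hy002
ceph mon rm dnxovh-hy002
ceph quorum_status                   # confirm the remaining mons form quorum
# Finally, drop the host from 'mon initial members' / 'mon host' in ceph.conf
```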
[9:59] <Mibka> I do have another Ceph related question too: I will have 3x R710 servers, all with 3x 960Gb SSD's. I was thinking to configure Ceph so it only creates one single copy. That's already better than raid10, right? since the copy will be on a different server. So, my available space would be 9*960/2 = 4320Gb?
[10:01] <IcePic> Mibka: its basically true, yes.
[10:01] <IcePic> You will handle a different kind of fault modes than R10 on single boxes would.
[10:03] <Mibka> IcePic: Okay. Then what worries me is the following: 4320Gb of space. Let's assume I write 3000Gb on it and then suddenly 1 server dies completely. Ceph will try to rebalance if I understand correctly. So it'll try to rebalance 3000Gb of data on 6x960Gb ssd's. But 6*960/2=2880Gb .. So in fact, there's not enough room to rebalance?
[10:03] <thoht> is it possible that MON has a Memory LEAK ? here a memory graph of my MON : http://imgur.com/a/RzdsH
[10:04] <thoht> you can see that the memory increased then the device swapped until mon was restarted this morning
[10:04] <Mibka> sorry, had not enough sleep tonight ;)
[10:04] <IcePic> Mibka: as long as you have "one server extra" in terms of space, you should be able to handle one server lost.
[10:05] <Mibka> IcePic: okay. But how would Ceph react in the situation I just said? Will it completely break or will it just pause/stop the rebalancing process?
[10:05] * Kurt (~Adium@2001:628:1:5:5450:738e:2f55:70f5) Quit (Quit: Leaving.)
[10:06] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[10:06] <IcePic> Mibka: pause/stop I guess, yes. the monitoring will whine about not being able to place stuff. But "write 3000G" is more or less 1500G of real data and one copy for each Gb, so it wont try to rebalance all 3000GB
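The capacity arithmetic in this exchange, checked as a quick script (3 nodes x 3 x 960 GB SSDs, replica count 2, all from the discussion):

```shell
# Usable capacity = raw capacity / replication factor.
total=$(( 3 * 3 * 960 / 2 ))       # all 3 nodes up: 4320 GB usable
after_loss=$(( 2 * 3 * 960 / 2 ))  # one node down:  2880 GB usable
# Per IcePic's point: 3000 GB written is roughly 1500 GB of user data
# at 2 replicas, which still fits in the 2880 GB left after a node loss.
echo "$total $after_loss"
```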
[10:08] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) Quit (Quit: doppelgrau)
[10:08] <IcePic> Mibka: in short, ceph works better and better the more nodes you have, since a lost node will be less and less noticeable as the cluster grows.
[10:08] <Be-El> Mibka: what kind of network connection do you plan for these hosts?
[10:08] <Mibka> I understand but right now I can only afford 3 or 4 nodes ;)
[10:09] <Mibka> I have X520-DA2 cards in them and I'm still looking for an SFP+ 10G switch
[10:09] <Mibka> that's what I'll be using.
[10:09] <Be-El> and more important, what kind of ssd?
[10:09] <Mibka> Samsung SM863
[10:09] * rraja (~rraja@ Quit (Remote host closed the connection)
[10:09] <thoht> ****** http://docs.ceph.com is back online *****
[10:10] <thoht> \o/
[10:10] <Be-El> ok, they should work for ceph. some ssd have problems with the synchronous writes used for the ceph full data journal
[10:10] <Mibka> Be-El: okay, good to hear they are fine. I bought them after advice from someone else here a while ago :-)
[10:11] <IcePic> thoht: yay
[10:11] <Tetard_> Mibka: for a 10G switch, I'd grab 3064PQs off of ebay - or Dell 8024s
[10:12] <Mibka> Tetard_: thanks for the tip. I'll look into them.
[10:13] * DanFoster (~Daniel@office.34sp.com) has joined #ceph
[10:13] <Mibka> Be-El: I was also thinking about creating a second Ceph pool for VM's which aren't disk intensive at all. I do have 4x Samsung 850 PRO 512Gb, 2x Intel 730 480Gb and 4x Intel S3510 DC 480Gb SSD's that I have no use for right now. Would they be any good for a second storage for those VM's? .. If not I'll better leave them out :)
[10:14] <Be-El> Mibka: check their write endurance capabilities.
[10:14] <Be-El> Mibka: as pure data osds they might be fine, but we had bad experience with the samsung pro drives as journals
[10:15] <Be-El> Mibka: you might want to use the samsung sm drives as journals for the other drives
[10:15] <Be-El> (which introduces another failure mode..... )
[10:16] <Mibka> Be-El: okay. I would like to keep both pools separated so maybe I better get rid of those samsung/intel ssd's and get some 480Gb SM863 instead
[10:17] <Mibka> Actually.. Why am I considering SSD's for VM's that are not i/o intensive? If I'm going to buy something new I'd probably better get some 10K or 15K sas drives instead?
[10:20] <peetaur2> can I take an exported rbd image nad merge a diff without starting any ceph daemons?
[10:21] * Kioob1 (~Kioob@LMontsouris-656-1-1-206.w80-12.abo.wanadoo.fr) has joined #ceph
[10:22] <Be-El> Mibka: in case of HDD drives you definitely want to put the journal on a suitable SSD
[10:23] * bitserker (~toni@ has joined #ceph
[10:27] * hbogert (~Adium@ has joined #ceph
[10:29] <Mibka> Be-El: Okay. So in fact I need at least one ssd+hdd in each server then?
[10:29] <Be-El> Mibka: that's the recommended setup
[10:31] * hbogert1 (~Adium@ has joined #ceph
[10:31] * hbogert (~Adium@ Quit (Read error: Connection reset by peer)
[10:37] <IcePic> Mibka: or, have the journal on parts of an ssd in front of each HDD
[10:38] <IcePic> as in, one decent ssd partitioned in X parts, each being the journal for one out of X HDDs
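A hedged sketch of that layout for the pre-BlueStore (FileStore) journals being discussed; the device names and the 5 GB journal size are assumptions:

```shell
# Carve two journal partitions out of one SSD (/dev/sdf is hypothetical):
sgdisk -n 1:0:+5G -c 1:"journal for osd on sdb" /dev/sdf
sgdisk -n 2:0:+5G -c 2:"journal for osd on sdc" /dev/sdf
# Prepare each HDD-backed OSD with its journal partition (ceph-disk era):
ceph-disk prepare /dev/sdb /dev/sdf1
ceph-disk prepare /dev/sdc /dev/sdf2
```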
[10:39] * DV_ (~veillard@2001:41d0:a:f29f::1) has joined #ceph
[10:39] * fdmanana (~fdmanana@2001:8a0:6e0c:6601:e87c:2761:738e:5817) has joined #ceph
[10:40] <Mibka> IcePic: Okay, that's probably a cheaper solution that I can live with :) .. I haven't read into the journal to be honest, but how exactly should I partition the setup with only 3 SSD's in each server? Should I create two partitions on each SSD and use one for journal and the other for data? .. S
[10:40] * Kurt (~Adium@2001:628:1:5:25a0:84ac:7bc0:80c4) has joined #ceph
[10:42] * dgurtner (~dgurtner@ has joined #ceph
[10:45] <peetaur2> IcePic: you can keep adding more osd journals to the same SSD until the load is too high
[10:47] * DV (~veillard@2001:41d0:a:f29f::1) Quit (Ping timeout: 480 seconds)
[10:52] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[10:52] * doppelgrau (~doppelgra@ has joined #ceph
[10:57] * hbogert1 (~Adium@ Quit (Quit: Leaving.)
[10:59] <IcePic> Mibka: in a ssd environment, you wont gain from the journals I think. Its mostly for HDD setups where the OSD can ack the writes as soon as its on the journal, then worry about writing to X HDDs later (depending on your amount of copies)
[10:59] * TMM (~hp@ has joined #ceph
[11:00] * wkennington (~wak@0001bde8.user.oftc.net) Quit (Quit: Leaving)
[11:00] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[11:05] * bitserker (~toni@ Quit (Quit: Leaving.)
[11:06] <Mibka> IcePic: I definitely need to start reading the Ceph docs more ;) So you're saying I should use 1 partition on the SSD's as an OSD and no journal?
[11:08] * rotbeard (~redbeard@ has joined #ceph
[11:09] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[11:09] * b0e (~aledermue@ has joined #ceph
[11:11] <IcePic> if you have hdds and want to use ssds in front of them, then you may partition the ssd to smaller pieces and have a smaller journal on the ssd and still gain performance on the hdd pools
[11:11] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[11:11] <IcePic> if (as you started out earlier) you are to use ssds only, then skip the layered approach and just do one osd per ssd.
[11:11] <IcePic> and no journal
[11:13] <Mibka> Okay. For the first pool I'll be using these 9 SSD only and for the second pool I'm considering buying SM863 120 or 240Gb SSD's for the journal and 1x 1.8Tb 2.5" HDD, for each server. So I get about 2.7Tb of ceph storage for vm's that don't do much i/o
[11:14] <Mibka> It's probably not a good idea to have 2 HDD's and their journal on a single SSD?
[11:19] <IcePic> depends on what you expect will happen and the budget one has
[11:19] <IcePic> not everyone can stick it all on ssd, and the write buffer will make the hdds more tolerable.
[11:22] * hbogert (~Adium@ has joined #ceph
[11:24] <peetaur2> oh I finally found you can merge rbd images and diffs without the daemons running... just need to export-diff the first one instead of export, and then can use merge-diff
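What peetaur2 describes, sketched; the image and snapshot names are hypothetical. The exports need a running cluster, but the merge-diff step itself is purely local:

```shell
rbd export-diff rbd/vm-disk base.diff                         # full image as a diff
rbd export-diff --from-snap snap1 rbd/vm-disk@snap2 inc.diff  # incremental on top
rbd merge-diff base.diff inc.diff merged.diff                 # offline merge, no daemons needed
```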
[11:25] * foxxx0 (~fox@valhalla.nano-srv.net) Quit (Quit: WeeChat 1.6)
[11:25] <Mibka> IcePic: if the journal ssd dies then both osd's are gone, isn't it? That's why I thought a single ssd operating as a journal for 2 hdd's is maybe a bad choice
[11:25] * Kioob1 (~Kioob@LMontsouris-656-1-1-206.w80-12.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[11:25] * foxxx0 (~fox@2a01:4f8:171:1fd3::2) has joined #ceph
[11:26] * ivve (~zed@cust-gw-11.se.zetup.net) Quit (Ping timeout: 480 seconds)
[11:32] * derjohn_mobi (~aj@x4db0f8e6.dyn.telefonica.de) Quit (Ping timeout: 480 seconds)
[11:32] * derjohn_mobi (~aj@x4db0f8e6.dyn.telefonica.de) has joined #ceph
[11:34] * masteroman (~ivan@93-142-249-147.adsl.net.t-com.hr) has joined #ceph
[11:35] * masteroman (~ivan@93-142-249-147.adsl.net.t-com.hr) Quit ()
[11:36] * masteroman (~ivan@ has joined #ceph
[11:37] * masteroman (~ivan@ Quit ()
[11:39] * Kioob (~Kioob@LMontsouris-656-1-1-206.w80-12.abo.wanadoo.fr) has joined #ceph
[11:42] * derjohn_mobi (~aj@x4db0f8e6.dyn.telefonica.de) Quit (Ping timeout: 480 seconds)
[11:42] * yanzheng (~zhyan@ Quit (Quit: This computer has gone to sleep)
[11:44] * ledgr (~ledgr@88-119-196-104.static.zebra.lt) has joined #ceph
[11:49] * topro (~prousa@ Quit (Quit: Konversation terminated!)
[11:51] * nardial (~ls@p5DC06480.dip0.t-ipconnect.de) has joined #ceph
[11:53] * amarao (~oftc-webi@static-nbl2-118.cytanet.com.cy) has joined #ceph
[11:53] <amarao> Hello. Has anyone noticed that docs.ceph.org has been down for a second day?
[11:54] * blahdodo (~blahdodo@ has joined #ceph
[11:55] * minnesotags (~herbgarci@c-50-137-242-97.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[12:01] * Dominik_H (~Dominik_H@ has joined #ceph
[12:01] <thoht> oh i removed a ceph node but now i have to modify all my libvirt XML defs
[12:01] <thoht> <host name='' port='6789'/> and so on :)
[12:01] <thoht> oops
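One way to soften this for next time is to list every monitor in the disk's `<source>` element, so a removed or dead node does not strand the client. A hedged libvirt sketch; the second and third addresses and the image name are examples, only the first address comes from the log:

```xml
<!-- Sketch: list all monitors in the libvirt disk definition so that a
     removed or dead node does not strand the client (extra addresses
     and the image name are assumptions). -->
<source protocol='rbd' name='rbd/vm-disk-1'>
  <host name='' port='6789'/>
  <host name='' port='6789'/>
  <host name='' port='6789'/>
</source>
```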
[12:08] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Quit: Leaving)
[12:08] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[12:14] <ahadi> Hey guys, I was asking about hardware RAID vs. software a few days ago. T1w helped me a lot, and the other participants all told me to avoid HW raid where possible. I have a machine with 8 SSDs and set up a HW raid 10 vs software raid 10, and I see that the HW raid performs 100% (!) better. There must be an error somewhere I guess?
[12:18] <peetaur2> ahadi: is there a BBU?
[12:20] <IcePic> Mibka: yes, that is true. But in a larger setup, even losing a few osds isn't a huge deal.
[12:20] <IcePic> Mibka: as in, if you build a cluster to survive one or two complete boxes falling over, a lost osd or two won't be a huge disaster
[12:23] <ahadi> You mean on the controller? Yes
[12:23] <ahadi> peetaur2
[12:24] <Mibka> IcePic: Okay, I understand. But if I configure Ceph to store a copy on a different server, even with a 3 node setup it should survive a complete node failure. So I think I'm going to go with 2x HDD and 1x SSD and use the SSD for the journal of both HDDs... If the SSD fails I can just replace it and let Ceph rebalance
[12:24] <Mibka> IcePic: that's for the second ceph pool. The first, main, pool will be on SSDs only.
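Mibka's plan relies on Ceph keeping each replica on a different server. A hedged sketch of the pool-side settings; the pool name and PG count are assumptions:

```shell
# Sketch: a 3-way replicated pool that survives a full node failure
# (pool name and pg count are examples).
ceph osd pool create hdd-pool 128 128 replicated
ceph osd pool set hdd-pool size 3       # one copy per server in a 3-node cluster
ceph osd pool set hdd-pool min_size 2   # keep serving IO with one node down
# the default CRUSH rule places replicas on distinct hosts; confirm with:
ceph osd crush rule dump
```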
[12:25] * Keiya1 (~jakekosbe@tor-exit.squirrel.theremailer.net) has joined #ceph
[12:26] * kefu_ (~kefu@ has joined #ceph
[12:26] <Mibka> I also have a HP-6600ML-24G-4XG switch here with 4x SFP+ ports. I'm thinking I can use this one for now and replace the 10G switch with one having more ports when I need it. Right now I will be connecting 4 servers only. 3x for the ceph cluster and a 4th as a backup server.
[12:27] * Ramakrishnan (~ramakrish@ has joined #ceph
[12:27] * ivve (~zed@cust-gw-11.se.zetup.net) has joined #ceph
[12:29] * kefu (~kefu@ Quit (Read error: Connection reset by peer)
[12:29] * derjohn_mob (~aj@ip-178-203-145-53.hsi10.unitymediagroup.de) has joined #ceph
[12:33] * rmart04 (~rmart04@support.memset.com) has left #ceph
[12:33] * rmart04 (~rmart04@support.memset.com) has joined #ceph
[12:39] * kefu_ (~kefu@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[12:41] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[12:41] * [0x4A6F]_ (~ident@p4FC271BD.dip0.t-ipconnect.de) has joined #ceph
[12:42] * bniver (~bniver@pool-71-174-250-171.bstnma.fios.verizon.net) Quit (Remote host closed the connection)
[12:44] * salwasser (~Adium@2601:197:800:6ea1:650f:efc2:de5c:f218) has joined #ceph
[12:45] * [0x4A6F] (~ident@0x4a6f.user.oftc.net) Quit (Ping timeout: 480 seconds)
[12:45] * [0x4A6F]_ is now known as [0x4A6F]
[12:49] * gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[12:49] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[12:52] * karnan (~karnan@ Quit (Ping timeout: 480 seconds)
[12:53] <T1w> ahadi: I never suggested using software raid10..
[12:54] <ahadi> No you didn't, but you clearly wrote to avoid HW Raid where possible. Is software RAID 10 so problematic?
[12:54] <ahadi> T1w:
[12:54] * gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) has joined #ceph
[12:54] <T1w> I said that you should avoid raid at all when it comes to OSD data disks
[12:54] * jcsp (~jspray@82-71-16-249.dsl.in-addr.zen.co.uk) Quit (Quit: Ex-Chat)
[12:55] * Keiya1 (~jakekosbe@tor-exit.squirrel.theremailer.net) Quit ()
[12:55] * derjohn_mob (~aj@ip-178-203-145-53.hsi10.unitymediagroup.de) Quit (Ping timeout: 480 seconds)
[12:55] <T1w> for the OS, raid would provide some protection against a single disk failure, and at certain times you can also benefit from some sort of raid setup for journals
[12:56] <T1w> but it depends on the actual setup
[12:57] <ahadi> ah okay, thank you for the help once again. The question about the hardware RAID this time has nothing to do with ceph, it's for a Postgres cluster we also need to build
[13:01] * topro (~prousa@p578af414.dip0.t-ipconnect.de) has joined #ceph
[13:01] * valeech (~valeech@pool-96-247-203-33.clppva.fios.verizon.net) Quit (Quit: valeech)
[13:03] * webertrlz (~oftc-webi@ has joined #ceph
[13:04] * karnan (~karnan@ has joined #ceph
[13:05] * dgurtner (~dgurtner@ Quit (Read error: Connection reset by peer)
[13:07] * T1w (~jens@node3.survey-it.dk) Quit (Ping timeout: 480 seconds)
[13:08] * rraja (~rraja@ has joined #ceph
[13:09] * AG_Scott (~DougalJac@tor2r.ins.tor.net.eu.org) has joined #ceph
[13:09] * vicente (~~vicente@125-227-238-55.HINET-IP.hinet.net) Quit (Quit: Leaving)
[13:10] <darkfaded> my 2 cents would be: I ran sw raid10 really long, till i hit some io size issue with io from xen where md couldn't process what xen said
[13:10] <darkfaded> i went and bought a highend lsi and, uh, like, iops changed from ~200 to 1500-2000, no issues ever, until it had some ecc error
[13:11] <darkfaded> so i like, swapped it, and done
[13:11] <darkfaded> hw raid is a big issue if one thinks release notes are for other people or that spending $200 instead of $500 is a smart thing
[13:12] <darkfaded> i also have had some areca controller for example
[13:12] <darkfaded> that "thing" is well used where it is - a box i only use to do filesystem forensics and such
[13:13] <darkfaded> like, it defaults to enabling writeback even if it has no BBU
[13:13] <darkfaded> there's crap and there's server hardware
[13:13] <darkfaded> and yeah, hw can break, but sw also does :)
[13:13] <darkfaded> i didn't even mention when md nuked out the lvm headers
[13:14] <peetaur2> darkfaded: what's your opinion on MegaRAID?
[13:15] <darkfaded> peetaur2: it's the standard, they're ok
[13:15] <darkfaded> but btw for all-ssd raid10 i think sw raid also isn't a disaster
[13:16] <darkfaded> you just need better practice for hot-swapping etc
[13:16] <peetaur2> ok well I found them to be terrible :)
[13:17] * bniver (~bniver@71-9-144-29.static.oxfr.ma.charter.com) has joined #ceph
[13:17] <darkfaded> peetaur2: the cli sucks
[13:17] <darkfaded> but idk, if you get paid you just learn it anyway
[13:17] <peetaur2> like they delay telling you disks are bad for a really long time, probably to save DELL money on returns... and they frequently kick 2 disks out at a time... the patrol read has never once said anything was wrong... if you create a raid and tell it do not initialize, sometimes it does anyway...
[13:18] <peetaur2> also sometimes they randomly without explanation call their own disks foreign
[13:18] * gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[13:18] * gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) has joined #ceph
[13:18] <darkfaded> i've seen neither of those, either in my totally messed up playground or at customers
[13:19] <darkfaded> i can easily believe the dell thing
[13:19] <darkfaded> but that'd be a thing dell does in their firmware
[13:20] <peetaur2> k, well 100% of my megaraid experience is with dell machines
[13:21] <darkfaded> hp also had something like that - you'd get MCE errors from the cpu because of shitty ram, open a case, and they'd not do anything until the thresholds for iLO's own monitoring are reached
[13:21] <darkfaded> i quite hate that
[13:21] <darkfaded> $customer has lotsa of them and it's really not a thing - but they also really force same tested fw version everywhere
[13:22] <darkfaded> what i do myself is to always backup the nvram to some other box or usb
[13:22] <darkfaded> well
[13:22] <darkfaded> 'always'
[13:23] * georgem (~Adium@2605:8d80:683:8b2f:e55d:f807:3ba2:102c) has joined #ceph
[13:23] * nardial (~ls@p5DC06480.dip0.t-ipconnect.de) Quit (Quit: Leaving)
[13:24] <darkfaded> peetaur2: i've read about the foreign disk thing a few times
[13:24] <darkfaded> it might not be dell
[13:24] <darkfaded> but definitely picking fw well helps
[13:24] <darkfaded> i also know a cloud isp who did their infiniband mirrors with IB and a patched md and corrupted shitloads of data
[13:24] <darkfaded> so, really, i think we always lose
[13:25] <peetaur2> what kind of patch?
[13:25] <darkfaded> i honestly don't remember what the patch was for
[13:26] <darkfaded> the storage/linux guy once told me, but it was 3-4y ago
[13:26] <darkfaded> they used SRP
[13:26] * KindOne (kindone@0001a7db.user.oftc.net) Quit (Quit: ...)
[13:26] <darkfaded> so native ib blockdevs, and i think it was for that
[13:27] * salwasser (~Adium@2601:197:800:6ea1:650f:efc2:de5c:f218) Quit (Quit: Leaving.)
[13:34] * derjohn_mob (~aj@x4db0f8e6.dyn.telefonica.de) has joined #ceph
[13:35] * KindOne (kindone@0001a7db.user.oftc.net) has joined #ceph
[13:38] * AG_Scott (~DougalJac@tor2r.ins.tor.net.eu.org) Quit ()
[13:40] * Ramakrishnan (~ramakrish@ Quit (Ping timeout: 480 seconds)
[13:42] * kutija (~kutija@ has joined #ceph
[13:43] * bene2 (~bene@nat-pool-bos-t.redhat.com) has joined #ceph
[13:46] * Ramakrishnan (~ramakrish@ has joined #ceph
[13:46] * dneary (~dneary@pool-96-233-46-27.bstnma.fios.verizon.net) has joined #ceph
[13:47] * Joppe4899 (~dicko@ has joined #ceph
[13:48] * ccourtaut (~ccourtaut@ has joined #ceph
[13:50] * bene3 (~bene@nat-pool-bos-t.redhat.com) has joined #ceph
[13:51] * georgem (~Adium@2605:8d80:683:8b2f:e55d:f807:3ba2:102c) Quit (Quit: Leaving.)
[13:51] * georgem (~Adium@ has joined #ceph
[13:52] * wes_dillingham (~wes_dilli@ has joined #ceph
[13:53] * bene2 (~bene@nat-pool-bos-t.redhat.com) Quit (Read error: Connection reset by peer)
[13:59] * dgurtner (~dgurtner@ has joined #ceph
[14:00] * georgem (~Adium@ Quit (Quit: Leaving.)
[14:01] * rakeshgm (~rakesh@ Quit (Quit: Peace :))
[14:02] * derjohn_mob (~aj@x4db0f8e6.dyn.telefonica.de) Quit (Ping timeout: 480 seconds)
[14:03] * trociny (~mgolub@ Quit (Quit: ??????????????)
[14:10] * DanFoster (~Daniel@office.34sp.com) Quit (Quit: Leaving)
[14:12] * DanFoster (~Daniel@2a00:1ee0:3:1337:f59a:fc4b:b6a6:43e6) has joined #ceph
[14:17] * Joppe4899 (~dicko@ Quit ()
[14:17] <mistur> Hello
[14:18] <mistur> I have a radosgw
[14:18] <mistur> I delete a bucket
[14:18] <mistur> then I force gc
[14:19] * dneary (~dneary@pool-96-233-46-27.bstnma.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[14:19] <mistur> now I still have plenty of shadow objects remaining
[14:19] <mistur> any one know how I can force the purge of this object ?
[14:19] * lmb (~Lars@ip5b404bab.dynamic.kabel-deutschland.de) Quit (Ping timeout: 480 seconds)
[14:20] * georgem (~Adium@ has joined #ceph
[14:20] * owasserm (~owasserm@2001:984:d3f7:1:5ec5:d4ff:fee0:f6dc) has joined #ceph
[14:24] <mistur> those objects*
[14:27] * Kurt (~Adium@2001:628:1:5:25a0:84ac:7bc0:80c4) Quit (Quit: Leaving.)
[14:28] * mattbenjamin (~mbenjamin@174-23-175-206.slkc.qwest.net) has joined #ceph
[14:30] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[14:32] * rdas (~rdas@ Quit (Quit: Leaving)
[14:34] * AlexB_ (~AlexB@ has joined #ceph
[14:34] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:38] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[14:40] <wes_dillingham> I have an mds setup such that one is active and the other standby. Currently the one that is active is not the one I want to be active. I would like to switch which one is active and which is standby, and then set the new standby one to mds_standby_replay true. Is there a better way to swap which one is active than to "fail" the current active mds?
[14:42] <AlexB_> Hello, all. I'm currently taking a look at doing an orphans find/orphans finish on the latest Hammer release. I was hoping to clarify, is the actual cleanup done by orphans finish async? Is any familiar with how that cleanup functions?
[14:44] <peetaur2> wes_dillingham: why do you need to switch it?
[14:44] <peetaur2> ideally you should not care which one is active, and let it pick
[14:44] <wes_dillingham> peetaur2: I was doubling up a mon to run the mds, i now have dedicated hardware for the mds
[14:45] <wes_dillingham> i want one of the mons (current active mds) to be the failover mds
[14:46] <peetaur2> ok well I don't really know what's best, but I trust that service stop ... would be a graceful failover
[14:46] <peetaur2> but I have no idea if it'll switch over again later for some reason
[14:47] <wes_dillingham> thanks peetaur2
[14:48] <wes_dillingham> I am testbedding outside of production anyways, so I can just test but thought there might somehow be a way to force a standby mds to become active
[14:48] <wes_dillingham> it seems there may be a 15 second window for it to be marked as laggy by the mon
[14:49] <wes_dillingham> which might mean a 15 second hiccup in the FS
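One way to do the swap wes_dillingham describes is to explicitly fail the active daemon. A sketch, assuming Jewel-era syntax; the daemon name is an example:

```shell
# Sketch of swapping active/standby MDS daemons (daemon name assumed).
ceph mds stat                  # see which daemon is currently active
ceph mds fail mds-old          # mark it failed; the standby takes over
# then make the old active the preferred standby-replay via ceph.conf:
#   [mds.mds-old]
#   mds_standby_replay = true
#   mds_standby_for_rank = 0
```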
[14:50] * vbellur (~vijay@2601:18f:700:55b0:5e51:4fff:fee8:6a5c) has joined #ceph
[14:51] <mistur> AlexB_: do you talk about shadow object not purge on radosgw ?
[14:51] <AlexB_> mistur: correct
[14:52] <mistur> AlexB_: I just had an issue with this
[14:52] <mistur> I have deleted the last bucket in a pool
[14:52] <mistur> started the gc process
[14:52] <mistur> most of the objects have been deleted
[14:52] <peetaur2> wes_dillingham: yes maybe if you kill -9 or pull network, but I think a service stop should be very fast
[14:53] <mistur> but I still have shadow objects
[14:53] <peetaur2> log will say things like "mds marked itself down" and then the other takes over
[14:53] <mistur> this pool is supposed to be empty
[14:53] <wes_dillingham> service stop would likely inform the monitor its going down
[14:53] <mistur> I need to delete it and recreate it
[14:53] <wes_dillingham> yep.
[14:53] <mistur> I don't know if I can do that if I still have objects in it
[14:54] <AlexB_> Yep. That's similar to what I'm unclear on. We're losing a lot of storage to what appear to be shadow files, and we're planning to try cleaning them up. However, I'm not finding much documentation on how the commands work.
[14:55] <mistur> I run Jewel right now
[14:55] <AlexB_> I've run them in a test environment, and I'm not seeing them cleaned up when I run the orphans finish, so I'm wondering if it's async.
[14:56] * dneary (~dneary@nat-pool-bos-u.redhat.com) has joined #ceph
[14:56] <mistur> I don't know
[14:56] * karnan (~karnan@ Quit (Remote host closed the connection)
[14:56] <mistur> AlexB_: https://www.spinics.net/lists/ceph-users/msg31396.html
[14:56] <mistur> we are not alone !
[14:57] * pfactum is now known as post-factum
[14:58] <AlexB_> Here are Yehuda's emails about it: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001172.html they just don't cover much.
[14:59] * derjohn_mob (~aj@b2b-94-79-172-98.unitymedia.biz) has joined #ceph
[15:02] <mistur> AlexB_: I'll give a look at orphan find/finish
[15:02] <mistur> I didn't try this before
[15:02] <mistur> I'm gonna test and see if it works
[15:04] <webertrlz> is there any known problem with hardlinks in cephfs?
[15:07] * mhack (~mhack@nat-pool-bos-t.redhat.com) has joined #ceph
[15:16] * Bromine (~dusti@178-175-128-50.static.host) has joined #ceph
[15:17] * AlexB_ (~AlexB@ Quit (Remote host closed the connection)
[15:19] * mhack (~mhack@nat-pool-bos-t.redhat.com) Quit (Ping timeout: 480 seconds)
[15:19] * mhack (~mhack@nat-pool-bos-t.redhat.com) has joined #ceph
[15:20] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[15:21] * Racpatel (~Racpatel@2601:87:3:31e3::3840) has joined #ceph
[15:21] * AlexB_ (~AlexB@ has joined #ceph
[15:21] <AlexB_> mistur: yes, I believe that's the suggested way to clean up shadow files.
[15:21] <mistur> AlexB_: how long does it take for an "orphan find"
[15:21] <mistur> ?
[15:21] * derjohn_mob (~aj@b2b-94-79-172-98.unitymedia.biz) Quit (Ping timeout: 480 seconds)
[15:22] <AlexB_> I don't know the time complexity, but it's a function of the size of the pool.
[15:22] <mistur> right now, it's in a loop on a couple of objects
[15:22] <mistur> storing 1 entries at orphan.scan.erasure.linked.5
[15:22] <AlexB_> Make sure you create a .log pool first.
[15:22] <mistur> erasure.rgw.buckets.data 36 11838M 0 75362G 4735
[15:23] <mistur> AlexB_: "log_pool": "default.rgw.log",
[15:23] <mistur> this one ?
[15:24] <AlexB_> I believe it specifically needs to be called .log
[15:24] <mistur> ok
[15:25] <AlexB_> Again, I'm just researching this myself.
[15:25] <mistur> AlexB_: ok
[15:25] <mistur> me too :)
[15:25] <AlexB_> No problem, just don't want to give the impression I'm 100% familiar with this!
[15:25] <mistur> sure don't worry
[15:26] <mistur> and for information, the pool .log exists :)
[15:27] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[15:28] * salwasser (~Adium@ has joined #ceph
[15:28] * derjohn_mob (~aj@b2b-94-79-172-98.unitymedia.biz) has joined #ceph
[15:30] * mattbenjamin (~mbenjamin@174-23-175-206.slkc.qwest.net) Quit (Ping timeout: 480 seconds)
[15:32] <AlexB_> As a more general question, can anyone clarify when it is normal to see shadow files in a pool?
[15:32] <AlexB_> I assume that doesn't always indicate leaked files.
[15:33] <mistur> AlexB_: I have no idea
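For reference, the commands under discussion look roughly like this (Hammer/Jewel-era syntax; the pool and job names are assumptions). As far as I can tell, `orphans finish` only cleans up the scan's own bookkeeping rather than deleting the orphans, which would match what AlexB_ observed:

```shell
# Sketch of an orphan scan against an RGW data pool (pool/job names assumed).
radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=scan1
# inspect the objects the scan reports, then remove them by hand, e.g.:
#   rados -p default.rgw.buckets.data rm <object-name>
# 'finish' removes the intermediate scan data for the job, not the orphans:
radosgw-admin orphans finish --job-id=scan1
```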
[15:35] * yanzheng (~zhyan@ has joined #ceph
[15:35] * Ramakrishnan (~ramakrish@ Quit (Ping timeout: 480 seconds)
[15:41] * kefu (~kefu@ has joined #ceph
[15:42] * ivve (~zed@cust-gw-11.se.zetup.net) Quit (Ping timeout: 480 seconds)
[15:43] * DeMiNe0 (~DeMiNe0@ Quit (Ping timeout: 480 seconds)
[15:43] * DeMiNe0 (~DeMiNe0@ has joined #ceph
[15:46] * Bromine (~dusti@178-175-128-50.static.host) Quit ()
[15:50] * derjohn_mob (~aj@b2b-94-79-172-98.unitymedia.biz) Quit (Ping timeout: 480 seconds)
[15:51] * xarses_ (~xarses@c-73-202-191-48.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[15:55] * drupal (~mLegion@tor-exit.squirrel.theremailer.net) has joined #ceph
[16:03] * mattbenjamin (~mbenjamin@97-117-49-192.slkc.qwest.net) has joined #ceph
[16:06] * derjohn_mob (~aj@b2b-94-79-172-98.unitymedia.biz) has joined #ceph
[16:06] * Mibka (~andy@ Quit (Quit: leaving)
[16:07] * kefu (~kefu@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[16:09] * vata1 (~vata@ has joined #ceph
[16:11] * Nicho1as (~nicho1as@00022427.user.oftc.net) has joined #ceph
[16:16] * xarses (~xarses@ has joined #ceph
[16:16] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[16:17] * xarses (~xarses@ Quit (Remote host closed the connection)
[16:17] * xarses (~xarses@ has joined #ceph
[16:18] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[16:25] * drupal (~mLegion@tor-exit.squirrel.theremailer.net) Quit ()
[16:25] * vimal (~vikumar@ Quit (Quit: Leaving)
[16:25] * haplo37 (~haplo37@ has joined #ceph
[16:27] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:28] * evelu (~erwan@ has joined #ceph
[16:31] * wushudoin (~wushudoin@2601:646:8200:c9f0:2ab2:bdff:fe0b:a6ee) has joined #ceph
[16:32] * newdave (~newdave@36-209-181-180.cpe.skymesh.net.au) Quit (Ping timeout: 480 seconds)
[16:34] * kristen (~kristen@ has joined #ceph
[16:39] * yanzheng (~zhyan@ Quit (Quit: This computer has gone to sleep)
[16:40] * magicrobot (~oftc-webi@d-65-175-145-252.cpe.metrocast.net) has joined #ceph
[16:40] <magicrobot> does anyone run radosgw exposed to the internet?
[16:41] * cmart (~cmart@ has joined #ceph
[16:42] * kefu (~kefu@ has joined #ceph
[16:47] <IcePic> magicrobot: yes, via loadbalancer.
[16:48] * atheism (~atheism@ Quit (Remote host closed the connection)
[16:49] * atheism (~atheism@ has joined #ceph
[16:50] <Dominik_H> Is an erasure coded pool usable as an ordinary pool? (adding and moving OSDs, growing the pool live, etc.)
[16:50] * wgao (~wgao@ Quit (Ping timeout: 480 seconds)
[16:53] * FidoNet (~FidoNet@seattle.morby.org) Quit (Read error: Connection reset by peer)
[16:53] * FidoNet (~FidoNet@seattle.morby.org) has joined #ceph
[16:54] * Rickus (~Rickus@office.protected.ca) has joined #ceph
[16:57] * analbeard (~shw@ Quit (Quit: Leaving.)
[17:00] * kefu (~kefu@ Quit (Ping timeout: 480 seconds)
[17:02] * wgao (~wgao@ has joined #ceph
[17:04] * kefu (~kefu@ has joined #ceph
[17:04] * mattbenjamin (~mbenjamin@97-117-49-192.slkc.qwest.net) Quit (Ping timeout: 480 seconds)
[17:09] * amarao (~oftc-webi@static-nbl2-118.cytanet.com.cy) Quit (Remote host closed the connection)
[17:09] <magicrobot> IcePic: do you have security concerns with it?
[17:10] * Uniju1 (~allenmelo@ has joined #ceph
[17:12] * hbogert (~Adium@ Quit (Quit: Leaving.)
[17:12] <wes_dillingham> Dominik_H: erasure coded pools cant be used for RBD directly and you cant modify the erasure-code-profile after initial pool creation http://docs.ceph.com/docs/jewel/rados/operations/erasure-code/
[17:13] <SamYaple> wes_dillingham: is that still accurate with bluestore?
[17:13] * dgurtner (~dgurtner@ Quit (Read error: Connection reset by peer)
[17:13] <wes_dillingham> SamYaple: unsure..
[17:13] <SamYaple> i thought one of the big things with bluestore was partial writes without the cache tier
[17:13] <SamYaple> allowing for erasure-rbd-on-bluestore
[17:14] * kutija (~kutija@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[17:15] <wes_dillingham> probably the case, but from what I understand bluestore is still an experimental feature, and I'm not sure if Dominik_H is inclined to use experimental features
[17:15] * winmutt (~rolfmarti@ has joined #ceph
[17:15] <winmutt> anyone having memory leaks with radosgw/civetweb?
[17:16] <SamYaple> wes_dillingham: no youre correct there. this was a side point
[17:17] * mattbenjamin (~mbenjamin@75-162-245-215.slkc.qwest.net) has joined #ceph
[17:18] <dillaman> SamYaple: that is still accurate for now -- the goal will be to get direct RBD on EC pool* support in the kraken release
[17:18] * newdave (~newdave@36-209-181-180.cpe.skymesh.net.au) has joined #ceph
[17:18] <SamYaple> dillaman: cool. thats what i thought
[17:19] <dillaman> * note: the RBD image data can be stored on the EC pool, but the associated image metadata must remain in a replicated pool. Kraken will introduce a new "--data-pool" option when creating RBD images to place the data objects on a different pool from the metadata
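dillaman's note would sketch out roughly like this once the feature lands; the pool and image names are assumptions, and the exact flag behaviour is as described above for Kraken:

```shell
# Sketch of the planned --data-pool split (Kraken-era feature; names assumed).
ceph osd pool create rbd_meta 64 64 replicated
ceph osd pool create rbd_data 64 64 erasure
rbd create --size 10G --data-pool rbd_data rbd_meta/myimage
# image metadata (header, omap) stays in rbd_meta; data objects go to rbd_data
```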
[17:19] * aiicore_ (~aiicore@s30.linuxpl.com) Quit (Quit: leaving)
[17:19] * aiicore (~aiicore@s30.linuxpl.com) has joined #ceph
[17:20] * sudocat (~dibarra@ has joined #ceph
[17:22] <SamYaple> dillaman: good info. that makes sense. I was wondering how some of those issues were going to be solved
[17:23] <wes_dillingham> dillaman: so when doing an rbd image-meta operation which pool do you specify when referring to the rbd device?
[17:23] * ade (~abradshaw@pool- Quit (Quit: Too sexy for his shirt)
[17:23] <dillaman> wes_dillingham: you would always refer to the pool where the image metadata lives -- the fact that the data is stored in a different pool is hidden by rbd and handled under the hood
[17:24] * AlexB_ (~AlexB@ Quit (Quit: Leaving...)
[17:24] <wes_dillingham> glorious, exciting, looking forward to trying it
[17:25] * doppelgrau (~doppelgra@ Quit (Ping timeout: 480 seconds)
[17:25] * krypto (~krypto@G68-90-102-226.sbcis.sbc.com) has joined #ceph
[17:26] <dillaman> e.g.: you could have a glance image pool named "glance" and create a new "glance_data" EC pool. you would update your glance backend to point to a new "ceph.conf" file which overrides "rbd_default_data_pool = glance_data" to ensure all new images get their data objects stored in the EC pool
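The override dillaman describes would be a ceph.conf fragment along these lines; the client section name and pool name are taken from his example, the exact section is an assumption:

```ini
# Hedged ceph.conf fragment for the glance example (section name assumed)
[client.glance]
rbd_default_data_pool = glance_data
```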
[17:27] * pdrakewe_ (~pdrakeweb@cpe-71-74-153-111.neo.res.rr.com) has joined #ceph
[17:28] * branto (~branto@transit-86-181-132-209.redhat.com) Quit (Quit: ZNC 1.6.3 - http://znc.in)
[17:28] * b0e (~aledermue@ Quit (Quit: Leaving.)
[17:30] * pdrakeweb (~pdrakeweb@oh-76-5-101-140.dhcp.embarqhsd.net) Quit (Ping timeout: 480 seconds)
[17:35] * DV (~veillard@2001:41d0:a:f29f::1) has joined #ceph
[17:39] * raphaelsc (~raphaelsc@ has joined #ceph
[17:40] * rmart04 (~rmart04@support.memset.com) Quit (Ping timeout: 480 seconds)
[17:40] * Uniju1 (~allenmelo@ Quit ()
[17:41] * xinli (~charleyst@ has joined #ceph
[17:42] * DV_ (~veillard@2001:41d0:a:f29f::1) Quit (Ping timeout: 480 seconds)
[17:42] * nilez (~nilez@ec2-52-37-170-77.us-west-2.compute.amazonaws.com) Quit (Ping timeout: 480 seconds)
[17:44] * Nicho1as (~nicho1as@00022427.user.oftc.net) Quit (Quit: A man from the Far East; using WeeChat 1.5)
[17:48] * TMM (~hp@ Quit (Quit: Ex-Chat)
[17:51] * karnan (~karnan@ has joined #ceph
[17:51] * newbie (~kvirc@host217-114-156-249.pppoe.mark-itt.net) has joined #ceph
[17:51] * karnan (~karnan@ Quit ()
[17:54] * Hemanth (~hkumar_@ Quit (Ping timeout: 480 seconds)
[17:58] * nilez (~nilez@ec2-52-37-170-77.us-west-2.compute.amazonaws.com) has joined #ceph
[17:59] * derjohn_mob (~aj@b2b-94-79-172-98.unitymedia.biz) Quit (Ping timeout: 480 seconds)
[18:03] * hbogert (~Adium@ has joined #ceph
[18:03] * Skaag (~lunix@ has joined #ceph
[18:05] * treypalmer (~treypalme@ has joined #ceph
[18:09] * TomasCZ (~TomasCZ@yes.tenlab.net) has joined #ceph
[18:10] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) has joined #ceph
[18:11] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) has joined #ceph
[18:11] * xinli (~charleyst@ Quit (Remote host closed the connection)
[18:12] <btaylor> when i'm attaching my rbd slice to a qemu process, i'm only seeing ~100MBps in transfer inside of the VM. is that normal? i'd think i should see more
[18:12] <btaylor> am i missing something?
[18:12] * xinli (~charleyst@ has joined #ceph
[18:12] <herrsergio> 3~
[18:13] <blizzow> So if I have a couple drives fail, and wait for my cluster to be rebalanced, Is it better to add a replacement OSD and then remove the failed one, or is it better to remove the old OSD from the cluster and then add the replacement?
[18:15] <doppelgrau> btaylor: watch the cpu-usage of qemu, in some setups I found qemu eating lots of CPU emulating the HW => that was the bottleneck
[18:16] <doppelgrau> btaylor: assuming you have fast enough network and the cluster is fast with small IO (usually 4-40kb)
[18:16] <bstillwell> blizzow: I usually wait for the recovery to finish, remove the failed osd from the cluster (ceph osd crush remove osd.NN; ceph osd rm NN; ceph auth del osd.NN), then add the replacement drive back in.
[18:16] <bstillwell> Although for me the recovery for a single drive failure is something like 20 minutes or less.
[18:17] <bstillwell> Benefits of large clusters...
[18:18] <bstillwell> BTW, doing those changes with nobackfill set helps quite a bit.
[18:18] * Kioob (~Kioob@LMontsouris-656-1-1-206.w80-12.abo.wanadoo.fr) Quit (Quit: Leaving.)
[18:19] <bstillwell> That way it doesn't do a re-balance after the OSD is removed from your crush map, and the new OSD takes the same PGs as the old OSD
[18:19] <diq> there's no reason to remove the OSD from the crush map
[18:19] <diq> just use the old OSD's UUID when you add it
[18:20] <diq> I have no idea why the official docs say to do that; it's quite wasteful and unnecessary
[18:20] <bstillwell> diq: I've never tried that. Does it work with ceph-disk?
[18:20] <diq> https://www.couyon.net/blog/swapping-a-failed-ceph-osd-drive
[18:20] <diq> I wrote about it, so you can read it ;)
[18:20] <bstillwell> Will do
[18:21] <blizzow> diq: thanks. Will look at it.
[18:21] <diq> the redhat bug for "our drive swapping documentation is terrible" is quite telling
[18:22] <bstillwell> They do accept documentation pull requests...
[18:22] * rmart04 (~rmart04@host109-157-171-8.range109-157.btcentralplus.com) has joined #ceph
[18:22] <diq> reported in 2015 and still open
[18:22] <blizzow> diq: what do you do about the journal?
[18:22] <diq> bstillwell, this is for RH enterprise storage which is $200k per PB
[18:23] <diq> bstillwell, ceph-disk prepare takes care of that, no?
[18:23] <diq> we put journals on the same drives as the data for easy disk swapping
[18:24] <diq> if you're putting the journal elsewhere, well, then you already know that you have a much more complicated use case
[18:24] * mattbenjamin1 (~mbenjamin@97-117-49-192.slkc.qwest.net) has joined #ceph
[18:24] * walcubi (~walcubi@p5797AA84.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[18:25] <bstillwell> diq: I have to write documentation on swapping a drive on our clusters soon, so I might as well try and update the docs.ceph.com entry while I'm at it.
[18:25] <blizzow> diq: I've been using ceph-deploy to add OSDs and do prep. I didn't realize the ceph-disk prepare does the journal too.
[18:25] <bstillwell> blizzow: Yeah, you can use something like: ceph-disk prepare /dev/sdy /dev/sdg3
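Putting bstillwell's nobackfill tip and the ceph-disk call together, one possible swap sequence looks like this; the OSD id and device paths are examples, not from the log:

```shell
# Sketch of swapping a failed data disk behind an existing journal partition
# (osd id and device paths are examples).
OSD_ID=12
systemctl stop ceph-osd@${OSD_ID}
ceph osd set nobackfill               # avoid a second rebalance during the swap
ceph-disk prepare /dev/sdy /dev/sdg3  # new data disk, existing journal partition
ceph osd unset nobackfill             # let PGs backfill onto the replacement
```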
[18:26] * rmart04_ (~rmart04@support.memset.com) has joined #ceph
[18:27] * hbogert (~Adium@ Quit (Ping timeout: 480 seconds)
[18:27] * mattch (~mattch@w5430.see.ed.ac.uk) Quit (Ping timeout: 480 seconds)
[18:27] * mattbenjamin (~mbenjamin@75-162-245-215.slkc.qwest.net) Quit (Ping timeout: 480 seconds)
[18:28] * squizzi (~squizzi@ Quit (Quit: bye)
[18:30] * rmart04 (~rmart04@host109-157-171-8.range109-157.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[18:30] * rmart04_ is now known as rmart04
[18:33] * vasu (~vasu@c-73-231-60-138.hsd1.ca.comcast.net) has joined #ceph
[18:39] * Ramakrishnan (~ramakrish@ has joined #ceph
[18:41] * dgurtner (~dgurtner@ has joined #ceph
[18:43] * rmart04_ (~rmart04@support.memset.com) has joined #ceph
[18:45] * jowilkin (~jowilkin@184-23-213-254.fiber.dynamic.sonic.net) has joined #ceph
[18:46] * walcubi (~walcubi@p5099a7c3.dip0.t-ipconnect.de) has joined #ceph
[18:49] * Unai (~Adium@50-115-70-150.static-ip.telepacific.net) has joined #ceph
[18:49] * ben1 (ben@pearl.meh.net.nz) Quit (Read error: Connection reset by peer)
[18:49] * rmart04 (~rmart04@support.memset.com) Quit (Ping timeout: 480 seconds)
[18:53] * rmart04_ (~rmart04@support.memset.com) Quit (Ping timeout: 480 seconds)
[18:55] * dgurtner (~dgurtner@ Quit (Ping timeout: 480 seconds)
[18:55] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) Quit (Quit: Leaving.)
[18:57] * rotbeard (~redbeard@ Quit (Quit: Leaving)
[18:57] <winmutt> hello
[18:59] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) has joined #ceph
[19:01] * Discovery (~Discovery@ has joined #ceph
[19:06] * Be-El (~blinke@nat-router.computational.bio.uni-giessen.de) Quit (Quit: Leaving.)
[19:06] * vata1 (~vata@ Quit (Quit: Leaving.)
[19:08] <btaylor> doppelgrau: the qemu process is basically doing nothing. the host is seeing maybe 1% cpu usage and like 25-50MBps net usage
[19:10] * trololo (~cypher@fra06-3-78-243-232-119.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[19:11] <doppelgrau> btaylor: how fast is a rados bench with small objects and only one thread?
[19:11] <btaylor> i'll rerun, 1 min.
[19:13] <btaylor> rados bench -p qemu 10 write --no-cleanup
[19:13] <btaylor> Max bandwidth (MB/sec): 232
[19:13] <btaylor> Min bandwidth (MB/sec): 160
[19:13] <btaylor> all the ceph nodes had 4x 10gbps bonded.
[19:13] <doppelgrau> 4MB objects, 16 threads
[19:14] <doppelgrau> rados bench -p qemu 10 write -o 32k -t 1
[19:14] <diq> that's a lotta network for an OSD
[19:16] <btaylor> osds and journals are ssd too
[19:16] <btaylor> that bench command isn't working
[19:16] <btaylor> rados bench -p qemu 10 write -b 32K -t 1
[19:16] <btaylor> Max bandwidth (MB/sec): 8.59375
[19:16] <btaylor> Min bandwidth (MB/sec): 6.03125
[19:17] * ircolle (~ircolle@ has joined #ceph
[19:17] * ircolle (~ircolle@ has left #ceph
[19:19] <doppelgrau> 250 IO/s => 4ms
[19:19] <doppelgrau> sounds reasonable
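doppelgrau's arithmetic above (one thread, ~4 ms per round trip, hence ~250 IO/s) can be sketched in a few lines; the latency and the 32K block size are the illustrative numbers from this benchmark, not measured constants:

```python
# With a single thread issuing one synchronous op at a time,
# IOPS is simply the reciprocal of the per-op latency.

def iops_from_latency(latency_s: float) -> float:
    """IOPS achievable by one thread waiting for each op to complete."""
    return 1.0 / latency_s

def throughput_mb_s(iops: float, block_size_kb: float) -> float:
    """Bandwidth implied by an op rate and a block size."""
    return iops * block_size_kb / 1024.0

iops = iops_from_latency(0.004)   # ~4 ms round trip -> 250 IOPS
bw = throughput_mb_s(iops, 32)    # 32K writes -> ~7.8 MB/s
print(f"{iops:.0f} IOPS, {bw:.2f} MB/s")
```

That lines up with the 6-8.6 MB/s that `rados bench ... -b 32K -t 1` reported above.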
[19:19] * Rickus (~Rickus@office.protected.ca) Quit (Read error: Connection reset by peer)
[19:21] <doppelgrau> btaylor: I don't know how deep the command queue of the qemu device is, but that limits the number of parallel io, so that might be the limit
[19:22] <btaylor> Average IOPS: 250
[19:22] <btaylor> Stddev IOPS: 23
[19:22] <btaylor> thats from that test too
[19:22] <doppelgrau> maybe rbd cache can help if the IO is not sync, coalescing several small writes into one larger one
[19:22] <btaylor> that's right on an OSD host. not in a qemu process, btw
[19:23] <btaylor> inside the vm i was running sysbench and fio to test
[19:23] * Rickus (~Rickus@office.protected.ca) has joined #ceph
[19:23] <doppelgrau> btaylor: try enabling rbd-cache for qemu
[19:24] <btaylor> for that, i just do "rbd_cache = true" in ceph.conf right?
[19:24] <doppelgrau> might help, I guess you hit the number of IO/s possible, since the latency is the bigger limit
[19:24] <btaylor> http://docs.ceph.com/docs/master/rbd/rbd-config-ref/ says rbd cache is true by default
[19:25] * Rickus (~Rickus@office.protected.ca) Quit (Read error: Connection reset by peer)
[19:25] * Rickus (~Rickus@office.protected.ca) has joined #ceph
[19:26] <btaylor> oh i see, one sec.
[19:27] * kefu (~kefu@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[19:28] * kefu (~kefu@li1445-134.members.linode.com) has joined #ceph
[19:28] * kefu (~kefu@li1445-134.members.linode.com) Quit ()
[19:30] * diver (~diver@ has joined #ceph
[19:31] <diver> did anyone try kraken already with blue(new?)store?
[19:31] * bauruine (~bauruine@2a01:4f8:130:8285:fefe::36) Quit (Ping timeout: 480 seconds)
[19:32] <iggy> I tried it on jewel
[19:32] <iggy> it ate my data
[19:32] <iggy> hopefully it's improved in kraken
[19:33] * bauruine (~bauruine@2a01:4f8:130:8285:fefe::36) has joined #ceph
[19:33] <diver> in my case with jewel osd's were just crashing when I put 20M objects
[19:34] <diver> saw on the slides yesterday that
[19:34] <diver> in kraken bluestore becomes 'stable'
[19:34] <diver> "fall(?) 2016", end of quote
[19:36] <bstillwell> I'm surprised we haven't seen any development releases of Kraken yet... Maybe they're going to start with RCs?
[19:36] <blizzow> I have 2048PGs with 50 OSDs (7200 RPM spinners) spread across 11 servers, connected with 10GB ethernet. Anyone here have an idea how long should I expect a rebalance/recovery to take If I lose a 1TB OSD containing ~120GB of data?
[19:39] <SamYaple> blizzow: certainly that is going to depend on how active your cluster is and what options you have. what crush tuneables you have. recovery can be limited and throttled. there is io priority as well
[19:39] <SamYaple> a whole lotta factors
[19:41] * Ramakrishnan (~ramakrish@ Quit (Quit: Leaving)
[19:42] <blizzow> SamYaple: just asking for a ballpark.
[19:42] * oem (~oftc-webi@c-68-83-233-239.hsd1.pa.comcast.net) has joined #ceph
[19:46] <blizzow> default crush tunables.
[19:46] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) Quit (Quit: valeech)
[19:48] <Dominik_H> so i cannot grow an ec pool after initial creation ? can i use it in combination with cephfs, so without rbd ?
[19:48] <blizzow> My feels tell me more than an hour to rebalance 120GB of data is a little excessive.
[19:48] <Dominik_H> i'm looking for a (of course slow but) secure solution for growable cold storage
[19:49] <SamYaple> blizzow: lose an osd as in remove osd from crush map, or lose osd like recover (duplicate objects) until the osd comes back?
[19:50] <SamYaple> blizzow: if you remove an osd entirely it has to remap all of the pgs on that osd to other osds and that changes other osds and will result in far more than 120GB of data moving
[19:50] <blizzow> Let's just say the drive gets yanked.
[19:50] <blizzow> Not removed from the map.
[19:50] <SamYaple> its got to duplicate all of the objects to the recovery pgs
[19:51] <SamYaple> what makes you say that 120GB taking more than an hour is unreasonable?
[19:52] * rraja (~rraja@ Quit (Ping timeout: 480 seconds)
[19:54] * atheism (~atheism@ Quit (Remote host closed the connection)
[19:55] <diq> It takes days to backfill a drive on my EMC Isilon cluster
[19:55] * Behedwin (~Eman@ has joined #ceph
[19:55] <blizzow> because reading from two copies to duplicate the data to a different drive should be enough to saturate a 7200 RPM disk write, and even assuming 100MB/sec read, and taking 20% off the top for overhead leaves me 80MB/sec of write stream.
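blizzow's back-of-envelope estimate can be written out; the 100 MB/s disk stream and 20% overhead are his assumed figures, and the single-writer case is the pessimistic one:

```python
def recovery_time_hours(data_gb: float, disk_mb_s: float = 100.0,
                        overhead: float = 0.2, writers: int = 1) -> float:
    """Naive lower bound on rebalance time: data divided by the
    aggregate write stream of the disks receiving it."""
    effective_mb_s = disk_mb_s * (1.0 - overhead) * writers
    return data_gb * 1024.0 / effective_mb_s / 3600.0

# 120 GB onto one 80 MB/s write stream:
print(f"{recovery_time_hours(120):.2f} h")  # ~0.43 h if nothing throttles it
```

In practice Ceph fans the backfill out over many OSDs but throttles it by default, so wall-clock time is governed by the recovery and backfill settings rather than raw disk speed, as SamYaple notes.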
[19:56] * ledgr (~ledgr@88-119-196-104.static.zebra.lt) Quit (Remote host closed the connection)
[19:57] * ledgr (~ledgr@88-119-196-104.static.zebra.lt) has joined #ceph
[19:57] <diq> it depends on the overall cluster load
[19:57] * ledgr (~ledgr@88-119-196-104.static.zebra.lt) Quit (Remote host closed the connection)
[19:58] * ledgr (~ledgr@88-119-196-104.static.zebra.lt) has joined #ceph
[19:59] <diq> on an unloaded cluster, yeah it's reasonable to expect 120GB to be finished in an hour
[19:59] * Skaag (~lunix@ Quit (Quit: Leaving.)
[19:59] <diq> the speed is directly related to the # of recovery threads
[19:59] <diq> the defaults are low (for good reason)
[20:00] <SamYaple> blizzow: it doesnt read full out like that. though you can configure recovery to go damn near that fast if you don't care about the cluster being usable in the mean time
[20:01] <blizzow> SamYaple: I don't want to find myself in the DreamHost/Ceph bucket ;)
[20:01] <diq> For most people, keeping their cluster performant and available is more important than recovering a failed drive ;)
[20:01] <SamYaple> diq: agreed
[20:01] <btaylor> doppelgrau: adding ,cache=writeback, to my -drive line hasn't seemed to offer any better speed. actually seems worse.
[20:01] <diq> which is why I said my Isilon cluster takes days to recover from a failed drive
[20:02] <btaylor> sysbench mysql (OLtP) test, read/write requests: 1791415 (2985.61 per sec.)
[20:02] <btaylor> i was getting 500-1000 more before. and i get 13000/s with a native lvm slice
[20:05] * Skaag (~lunix@ has joined #ceph
[20:06] * ledgr (~ledgr@88-119-196-104.static.zebra.lt) Quit (Ping timeout: 480 seconds)
[20:07] * squizzi (~squizzi@2001:420:2240:1268:3944:b0b0:1c70:4014) has joined #ceph
[20:08] * atheism (~atheism@ has joined #ceph
[20:10] * vbellur (~vijay@2601:18f:700:55b0:5e51:4fff:fee8:6a5c) Quit (Ping timeout: 480 seconds)
[20:13] * xinli (~charleyst@ Quit (Ping timeout: 480 seconds)
[20:13] * xinli (~charleyst@ has joined #ceph
[20:14] * ivve (~zed@c83-254-15-40.bredband.comhem.se) has joined #ceph
[20:17] * Hemanth (~hkumar_@ has joined #ceph
[20:22] * Hemanth (~hkumar_@ Quit ()
[20:22] * Hemanth (~hkumar_@ has joined #ceph
[20:25] * Behedwin (~Eman@ Quit ()
[20:26] * Skaag (~lunix@ Quit (Quit: Leaving.)
[20:30] * vbellur (~vijay@nat-pool-bos-t.redhat.com) has joined #ceph
[20:32] * ledgr (~ledgr@88-222-11-185.meganet.lt) has joined #ceph
[20:33] * winmutt (~rolfmarti@ Quit (Ping timeout: 480 seconds)
[20:34] * Skaag (~lunix@ has joined #ceph
[20:34] * krypto (~krypto@G68-90-102-226.sbcis.sbc.com) Quit (Ping timeout: 480 seconds)
[20:36] * ivve (~zed@c83-254-15-40.bredband.comhem.se) Quit (Ping timeout: 480 seconds)
[20:50] <blizzow> For comparison, I removed a drive from my elasticsearch cluster containing 150GB. The cluster took less than 20 minutes to rebalance :/
[20:50] <diq> again, it depends on cluster activity. That's a tiny ES cluster. What's your ingest rate? How much are you reading from it? etc etc
[20:51] * winmutt (~rolfmarti@ has joined #ceph
[20:52] * salwasser (~Adium@ Quit (Read error: Connection reset by peer)
[20:52] * salwasser (~Adium@ has joined #ceph
[20:53] <blizzow> diq, 300 million docs with 2TB of data in it. Taking in a couple hundred docs per second, and a fair bit of reading on it. I'd say it's busier than my ceph cluster.
[20:54] <SamYaple> blizzow: as i said, you can remove all the caps and safeties and get that speed with ceph
[20:54] * maybebuggy (~maybebugg@2a01:4f8:191:2350::2) Quit (Ping timeout: 480 seconds)
[20:54] <diq> yep
[20:54] <diq> good ol injectargs
[20:54] <SamYaple> but the cluster likely wont be very responsive while doing it
[20:55] <blizzow> Guess I'm having difficulty knowing what performance to expect out of ceph or how much more I can get out of tuning it.
[20:55] <SamYaple> blizzow: i think by default it is a single recovery thread per osd
[20:55] <SamYaple> and recovery is lower priority than everything else
[20:55] <diq> we get several GB per second out of ours. Only 6 nodes.
[20:55] <T1> that is - unfortunately - a common concern
[20:55] <diq> not gb, but GB
[20:55] <diq> and our hardware is nothing wild
[20:56] <T1> pure ssd?
[20:56] <diq> hah. zero ssd.
[20:56] <T1> large read/writes?
[20:56] <diq> yeah
[20:56] <SamYaple> i do bcache osds and can saturate 10Gbit easy
[20:56] <diq> http://www.qct.io/Product/Storage/Storage-Server/4U/QuantaPlex-T21P-4U-p291c77c71c150c222
[20:56] <T1> that helps.. :)
[20:56] <SamYaple> for recovery
[20:56] <SamYaple> its all rbds though, 8-32M objects
[20:56] <diq> I wouldn't choose that SKU again, but it works well for DB backups
[20:57] <T1> my smallish cluster gives me ~250MB/s (yes MB) with mediocre hardware
[20:57] <T1> for 4k writes
[20:57] <T1> I'm getting around 400MB/s for 4k reads
[20:57] <diq> 3 of those chassis, 6 nodes total. We backup MySQL directly to it, then we read from it to send tertiary backups to Googles
[20:58] <T1> I could probably get higher if all clients were 10gbit
[21:00] <diq> Heh. I need a PNI with Google Cloud to really let our cluster loose. That colo is reaching them over transit, and I don't want to break the bank for backups (yet I still need them to complete in a day)
[21:01] <diq> we throttle our ceph traffic with a tc scheduler
[21:01] <cetex> So, yeah..
[21:02] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) Quit (Quit: Leaving.)
[21:02] <cetex> We had an issue earlier with slowness (was since we migrated from gluster -> ceph and had raid-6 over 16hdd's per node)
[21:02] <cetex> fixed now
[21:02] <cetex> took a month to migrate all data back and forth
[21:02] <cetex> 96 osd's, controllers set to jbod
[21:03] <cetex> recovery speed increased to pretty nice levels (3-5GB/s recovery when we upped the max backfills a bit)
[21:03] <cetex> but now i see other issues..
[21:03] <cetex> :>
[21:04] <cetex> latency is insane, ceph status takes from 500ms to multiple (4-5) seconds to give output
[21:04] <cetex> I'm guessing that depends on monitors only?
[21:04] <T1> most likely yes
[21:05] * fandi (~fandi@ has joined #ceph
[21:05] <cetex> cool. when running rados bench, does that communicate with the monitors continuously, or are the osd's sending events to monitors continuously? or is that communication only happening once in a while?
[21:05] <T1> if your MONs are running on the same hardware as some of the OSDs that is just the way it is..
[21:05] <cetex> because rados bench over these 96 drives gives 240MB/s for first second, then drops off to 0
[21:05] <cetex> right..
[21:05] <cetex> now, why is that? :)
[21:06] <T1> because the MONs are busy monitoring the cluster
[21:06] <T1> and the OSDs are busy handling I/O
[21:06] <T1> and your status request is having to wait for other more important requests or map updates
[21:07] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) has joined #ceph
[21:07] <T1> and the MONs and OSDs communicate on a continual basis
[21:07] <cetex> right. but nothing there is relevant to monitors on same node as osd's
[21:07] <cetex> if i'm not missing something
[21:07] <cetex> we have separate harddrives for the mons, and they only use about 8% cpu at most
[21:07] <T1> there is documentation available on how it works, so I suggest a rtfm'ing a bit.. :)
[21:08] <diq> mon is all about CPU and RAM
[21:08] <diq> at least from what I could tell
[21:08] <cetex> yeah. so colocating monitors and osd's isn't an issue unless you saturate cpu or memory
[21:08] <T1> MONs do require a small amount of disk I/O for writing the map
[21:08] <diq> I would agree with that cetex
[21:08] <T1> but it's nowhere near what an OSD requires
[21:09] * hbogert (~Adium@ip54541f88.adsl-surfen.hetnet.nl) has joined #ceph
[21:09] <cetex> yeah. separate disk there, iops not an issue
[21:09] <T1> but if your nodes with MONs are IO saturated (both network and CPU) you can see issues
[21:09] <cetex> 10Gbit networking, never gets past 2.5Gbit in writes (for 1 second) then performance dies.
[21:09] <T1> sounds like network issues
[21:10] <cetex> could've been, but since we've been pushing 3-5GB/s until 12h ago in recovery it seems weird.
[21:10] <diq> make sure your host and switch aren't dropping frames
[21:11] <T1> firewall issues
[21:11] <diq> your host can drop frames at 2gbits easily
[21:11] <T1> anything is possible
[21:11] <T1> I've heard of these kinds of problems before from others
[21:12] <diq> getting good network performance out of a linux server is much more difficult than other OS's
[21:12] <cetex> yeah. i'm pretty good at networking actually. :>
[21:12] <T1> recheck, double check and triple check everything related to network
[21:12] <diq> so many knobs to turn and settings to fiddle with
[21:12] <diq> not just the OS
[21:12] <diq> but the drivers
[21:12] <T1> switches, cables etc etc etc..
[21:12] <diq> and how many receive queues to run
[21:12] <cetex> working on CDN as a side project, got 60Gbit/s of https from a single node :D
[21:12] <diq> and pinning them to the right numa core etc etc
[21:12] <cetex> it's pretty straight forward. this is almost the same stuff
[21:12] <diq> CDN? Run BSD. Save yourself the headache
[21:13] <cetex> well, what headache? :o
[21:13] * wjw-freebsd (~wjw@smtp.digiware.nl) has joined #ceph
[21:13] <cetex> ran out of network interfaces at 60Gbit/s, still had performance left in the host. :>
[21:13] <diq> really? On Linux?
[21:14] <diq> You can't really tell without pushing it more
[21:14] <T1> buuut if your cluster slows down to 0 then something is wrong
[21:14] <cetex> yes.
[21:14] <cetex> pretty sure it's not networking here actually
[21:14] <cetex> since it's worked quite well in recovery
[21:14] <cetex> but client requests are slow..
[21:14] <diq> You can have hardware resources left over, but the Linux IP stack can block on serialization early
[21:14] <T1> half-established TCP connections
[21:15] <T1> missed heartbeats causing OSDs to see other OSDs as down..
[21:15] <cetex> yeah, that would show in ceph -w
[21:15] <cetex> and we're not seeing any heartbeat issues or anything like that..
[21:15] <cetex> it's just really slow. :<
[21:15] <T1> OSD suicide if it cannot reach other OSDs etc etc etc..
[21:15] <thoht> Important You must deploy at least one metadata server to use CephFS. There is experimental support for running multiple metadata servers. Do not run multiple metadata servers in production.
[21:16] <thoht> isn't it piossiuble to have cephFS with HA MDS ?
[21:16] <T1> that - and "its just really slow" always comes back down to network-related issues
[21:16] <T1> trust us on that
[21:16] <diq> thoht only 1 will be primary, but you can run multiple
[21:16] <T1> even though you think of yourself as capable..
[21:16] <cetex> T1: yeah. i'm rechecking as we speak
[21:16] <thoht> diq: if premary goes down; what happens ?
[21:16] <diq> they store data inside ceph, so there's nothing local to the host. it handles failover decently well
[21:17] <diq> thoht, anything inflight is hosed, but new requests will work if the client gets mapped to the new MDS
[21:17] <T1> thoht: while multiple active MDSs are not supported, you can have 1 active and several in stand by
[21:17] <T1> if/when the active goes away a standby takes over almost at once
[21:17] <thoht> T1: does the standby MDS start automatically when primary dies ?
[21:17] <diq> yeah there's no operator intervention
[21:17] <thoht> or does it require a human operation
[21:17] <diq> nope
[21:17] <diq> it just goes
[21:17] <thoht> or pacemaker
[21:17] <diq> it's always running, it's just waiting
[21:18] <T1> IO to existing open files will be unaffected, but requests to open or create new files will see a small wait
[21:18] <thoht> so it is possible to have cephFS as an HA service !!!!
[21:18] <diq> sorta
[21:18] <T1> (cephfs IO runs directly between clients and OSDs - only metadata goes over the MDS)
[21:19] <T1> be warned.. you can still see poor performance
[21:19] <T1> hot zones etc
[21:19] <thoht> so cephFS won't have good perf like 120MB/s ?
[21:19] <T1> no.. poor performance such as slow dir listings if a directory contains many files
[21:19] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) has joined #ceph
[21:20] <diq> or big files
[21:20] <thoht> so if i want to build 2 VMs as MX it is a bad idea to store the emails in cephFS
[21:20] <diq> file size matters for some reason
[21:20] <thoht> oh mails are small it should fit
[21:21] <T1> I have not seen release notes that said "good things" about lots of files in the same directory..
[21:21] <diq> we have a few multi-TB files in cephFS and it does OK.
[21:21] <diq> have to bump up the default limit
[21:21] <T1> old (around Hammer) rules of thumb said that you should probably not have more than 1000 files in a single directory at any time unless you wanted poor performance
[21:22] <T1> it could have changed with Jewel, but..
[21:22] <thoht> huh ? my mail server has 10 000 files in some folders
[21:23] <thoht> so it is a bad idea ...
[21:24] <T1> Maildir and cephfs with loads of files would probably be problematic yes..
[21:24] * salwasser (~Adium@ Quit (Quit: Leaving.)
[21:32] * Skaag (~lunix@ Quit (Quit: Leaving.)
[21:32] <btaylor> i guess if i'm seeing 67.707Mb/sec in a fio seq wr test when directly mapping a rbd on bare metal, then that's pretty much the top of the line. anyone getting better write stats?
[21:32] <btaylor> i got 5gbps in sew rd
[21:32] <btaylor> seq rd.
[21:32] <btaylor> seems odd.
[21:35] * Skaag (~lunix@ has joined #ceph
[21:35] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) Quit (Quit: valeech)
[21:38] * wes_dillingham (~wes_dilli@ Quit (Quit: wes_dillingham)
[21:41] <mistur> Hello
[21:42] <mistur> I have sent a mail on the ML a couple of hour ago about "Loop in radosgw-admin orphan find"
[21:42] <mistur> does anyone had the say issue ?
[21:42] <mistur> same*
[21:44] <mistur> did anyone have the same issue ?** :)
[21:45] <mistur> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013656.html
[21:47] * bniver (~bniver@71-9-144-29.static.oxfr.ma.charter.com) Quit (Remote host closed the connection)
[21:53] * Hemanth (~hkumar_@ Quit (Quit: Leaving)
[21:57] * Discovery (~Discovery@ Quit (Ping timeout: 480 seconds)
[22:04] * dis (~dis@00018d20.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:05] * Kingrat (~shiny@cpe-76-187-192-172.tx.res.rr.com) Quit (Remote host closed the connection)
[22:06] * Kingrat (~shiny@2605:6000:1526:4063:190a:a0ba:e4ee:c595) has joined #ceph
[22:10] <doppelgrau> btaylor: just to be sure, are your journals fast with sync-direct io?
[22:10] <cetex> ceph.com down again?
[22:10] <cetex> pretty unstable? :>
[22:10] <btaylor> micron m500's. not sure if they have that...
[22:10] * bene2 (~bene@nat-pool-bos-t.redhat.com) has joined #ceph
[22:11] * oem (~oftc-webi@c-68-83-233-239.hsd1.pa.comcast.net) Quit (Quit: Page closed)
[22:13] <doppelgrau> btaylor: https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
[22:13] <btaylor> ok thanks, taking a look
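The linked post's test boils down to timing small writes that must each be durable before the next is issued (dd with `oflag=direct,dsync`, or fio with sync writes). A rough Python equivalent of that idea, assuming a throwaway file path rather than the raw device the post benchmarks:

```python
import os
import time

def dsync_write_iops(path: str, block: int = 4096, seconds: float = 1.0) -> float:
    """Measure synchronous-write IOPS the way a Ceph journal stresses a
    device: the file is opened O_DSYNC, so each write must reach stable
    storage before the next one is issued."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    buf = b"\0" * block
    count, deadline = 0, time.monotonic() + seconds
    try:
        while time.monotonic() < deadline:
            os.write(fd, buf)
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite in place, journal-style
            count += 1
    finally:
        os.close(fd)
    return count / seconds

print(f"{dsync_write_iops('/tmp/journal-probe.bin'):.0f} sync write IOPS")
```

Consumer SSDs often collapse to a few hundred sync IOPS under this pattern even when their cached write numbers look fine, which is what makes them poor journal devices.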
[22:14] * bvi (~Bastiaan@102-117-145-85.ftth.glasoperator.nl) has joined #ceph
[22:14] * winmutt (~rolfmarti@ Quit (Ping timeout: 480 seconds)
[22:14] * bene3 (~bene@nat-pool-bos-t.redhat.com) Quit (Ping timeout: 480 seconds)
[22:14] * diver_ (~diver@ has joined #ceph
[22:17] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) Quit (Quit: Leaving.)
[22:19] <btaylor> i see they have the micron m500dc on there, that's what i've got
[22:19] <btaylor> so probably the SSD is just not good for this
[22:20] <doppelgrau> how is your setup (how many nodes, ssds, how many Journals on each m500)
[22:20] <diq> that post is over 2 years old
[22:21] <diq> and I would take an Intel 750 NVME drive over almost all of the "enterprise" drives listed
[22:21] <T1> that post is the defacto reference and is updated to the latest models someone was kind to write about
[22:21] * diver (~diver@ Quit (Ping timeout: 480 seconds)
[22:22] <diq> that's some terrible numbers for the P3700
[22:22] <diq> also, remember not to try and use hdparm for SAS SSD's ;)
[22:22] <diq> hdparm is only for SATA
[22:23] <doppelgrau> diq: the most intresting thing is the benachmark => can test own devices
[22:23] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) has joined #ceph
[22:25] <btaylor> doppelgrau: 170 OSDs, 5 journals on most of the journal drives. some with 4. 19 OSDs per box. 9 hosts
[22:26] * georgem (~Adium@ Quit (Ping timeout: 480 seconds)
[22:27] <SamYaple> can image flattening from parent to child be done while the child is in use?
[22:35] <dillaman> SamYaple: yes, it's safe
[22:35] <doppelgrau> btaylor: I'd run a benchmark myself in your place, to be sure that they haven't been "improved" at some point
[22:35] <btaylor> yeah i am
[22:35] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) has joined #ceph
[22:36] * wwalker (~wwalker@ has left #ceph
[22:37] * bene2 (~bene@nat-pool-bos-t.redhat.com) Quit (Quit: Konversation terminated!)
[22:38] <btaylor> WRITE: io=3763.2MB, aggrb=64224KB/s, minb=64224KB/s, maxb=64224KB/s, mint=60000msec, maxt=60000msec i just got for 1 job.
[22:38] <btaylor> doing 5 now
[22:38] <btaylor> doppelgrau: i guess i could have done 2
[22:41] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:41] * raphaelsc (~raphaelsc@ Quit (Remote host closed the connection)
[22:42] * TMM (~hp@dhcp-077-248-009-229.chello.nl) has joined #ceph
[22:42] <cetex> so, trying to figure out why the last recovery basically stalled
[22:42] <cetex> debug osd 20: promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 5242880 obj/sec or 25 bytes/sec
[22:43] <cetex> 25bytes / sec? :>
[22:43] * magicrobot (~oftc-webi@d-65-175-145-252.cpe.metrocast.net) Quit (Quit: Page closed)
[22:43] <Unai> that's pretty awful… have you tried rebooting the OSD process?
[22:43] <cetex> yes. multiple times
[22:43] <cetex> but i think it's a config issue :>
[22:44] <cetex> i mean, it seems like the OSD is throttling the recovery (which makes sense) but it's doing almost nothing at all, 1MB/s every 5 seconds or so
[22:45] <treypalmer> hi -- i have a couple of questions about rgw multisite config.
[22:46] * dis (~dis@00018d20.user.oftc.net) has joined #ceph
[22:47] * vbellur (~vijay@nat-pool-bos-t.redhat.com) Quit (Quit: Leaving.)
[22:47] <SamYaple> dillaman: danke
[22:48] <treypalmer> first of all -- what happens if i have multiple identically configure RGW instances fronting one ceph cluster in one DC, mirroring to the same type of setup in another DC? Is RGW designed to handle that or will the multiple instances step on each other?
[22:48] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[22:49] <treypalmer> (at present in my test cluster I have two RGW's atatched to a test cluster in one DC, mirrored to a single-node test cluster with a single RGW in the other DC)
[22:49] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) has joined #ceph
[22:50] * rwheeler (~rwheeler@174-23-105-60.slkc.qwest.net) has joined #ceph
[22:52] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) has joined #ceph
[22:54] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:55] * bvi (~Bastiaan@102-117-145-85.ftth.glasoperator.nl) Quit (Quit: Leaving)
[22:55] <cetex> soo
[22:55] <cetex> suddenly it just went "boom", last few percent done :o
[22:55] <cetex> no idea
[22:56] <doppelgrau> btaylor: better than the results in the blogpost, but if the blocksize of the kernel rbd-client is sufficiently small, that would explain the 64MB/s sequential IO
[22:56] <treypalmer> what do you call the opposite of tail-latency? tail-snap?
[22:57] <cetex> and boom, 10Gbit maxed out.
[22:57] <cetex> very interesting. we have a random issue occurring somewhere >_<
[22:57] * kristen (~kristen@ Quit (Remote host closed the connection)
[22:57] <btaylor> i topped out at ~180MB/s at around 4 jobs
[22:57] <btaylor> 3 was 177
[22:58] <btaylor> WRITE: io=10403MB, aggrb=177537KB/s
[22:58] <diq> treypalmer, no idea about RGW. I really wish there was a good way of doing multi-site without RGW.
[22:59] <btaylor> 4:: WRITE: io=10520MB, aggrb=179532KB/s
[22:59] <diq> for those of us that don't use S3 API
[22:59] * derjohn_mob (~aj@x4db0f8e6.dyn.telefonica.de) has joined #ceph
[22:59] <treypalmer> @diq thanks.
[23:00] <treypalmer> are you using librados directly or something?
[23:00] <diq> plan to
[23:00] <treypalmer> very intersting.
[23:00] * kristen (~kristen@ has joined #ceph
[23:00] <diq> we don't want to go through a central dispatcher/gateway
[23:00] * ledgr (~ledgr@88-222-11-185.meganet.lt) Quit (Remote host closed the connection)
[23:01] * kristen (~kristen@ Quit ()
[23:01] * fandi (~fandi@ Quit (Quit: Leaving)
[23:01] * kristen (~kristen@ has joined #ceph
[23:01] <treypalmer> makes sense.
[23:02] <treypalmer> if you're using librados directly, it seems like you could code the replication piece too... eventually....
[23:02] <treypalmer> multi-site, I should say.
[23:03] <treypalmer> anyway, what I'm chasing is a really nasty memory leak in radosgw.
[23:03] <diq> if every client had access to all clusters, yes
[23:04] <cetex> We've investigated librados directly as well
[23:04] <cetex> just http api -> store objects in pool
[23:04] <cetex> no abstraction layers or anything
[23:04] <diq> that's what we've got in the works
[23:04] <cetex> since rgw seems to have performance issues when you write more than a couple million objects
[23:05] <treypalmer> I loaded in 8M 64K objects on a single-node cluster to test performance and replication. box has 15 x 10T disks, 2 Intel P3700's as journals. about 2.8M objects replicated, then it all died. then I noticed memory was filling up really fast and the RGW was being killed and restarted again. this was on 10.2.3, I rolled back to 10.2.2 and same problem.
[23:05] <diq> oof
[23:05] <treypalmer> radosgw was using 70-80GB memory before OOM'd.
[23:05] <treypalmer> the box has 128GB
[23:06] <T1> sounds yucky
[23:07] <treypalmer> the other side of the multisite is a more normal cluster, with 5 OSD nodes and 3 mon's, 34 OSD's all with journals on NVMe's.
[23:08] <treypalmer> I was a little surprised no one else has run into this on the lists that I've found. But this is fairly new code and maybe no one is really doing RGW multisite at scale on Jewel yet?
[23:09] * bauruine (~bauruine@2a01:4f8:130:8285:fefe::36) Quit (Quit: ZNC - http://znc.in)
[23:09] * georgem (~Adium@2605:8d80:683:461:f113:4d7c:9668:134e) has joined #ceph
[23:09] * bauruine (~bauruine@mail.tuxli.ch) has joined #ceph
[23:10] <treypalmer> cetex -- what's this about radosgw having performance issues when you write more than a couple million objects?
[23:10] * newbie (~kvirc@host217-114-156-249.pppoe.mark-itt.net) Quit (Ping timeout: 480 seconds)
[23:10] <treypalmer> like, if you have more than that it just can't deal?
[23:10] * bauruine (~bauruine@mail.tuxli.ch) Quit ()
[23:12] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) Quit (Quit: valeech)
[23:12] * vata1 (~vata@ has joined #ceph
[23:13] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) has joined #ceph
[23:13] <cetex> since radosgw supports authentication, multiple buckets and stuff while still only using a couple of pools it has an abstraction layer
[23:13] <cetex> or something. :D
[23:13] <treypalmer> yes. true.
[23:13] <cetex> it basically has a database mapping bucket/file <-> object in other pool
[23:14] <treypalmer> though you can configure it to use alternate pools ofr different keys/users/buckets whatever. awkward though.
[23:14] <cetex> yeah.
[23:14] <cetex> but if you dump 300M objects into that it's gonna have issues it seems
[23:14] * kristen (~kristen@ Quit (Quit: Leaving)
[23:14] <cetex> or even 30M objects
[23:14] <cetex> maybe even less, it depends.
[23:15] <treypalmer> i'm wondering if it would help to put the non-data pools on SSD's. they take next to zero space.
[23:15] <cetex> besides that, it's a bit overcomplicated for what we actually need so that's why we've decided that when that day comes we'll write a simple http api <-> librados that just dumps stuff into a pool directly.
[23:15] <treypalmer> and in my testing they get a fair amount of I/O.
[23:16] <treypalmer> in particular I'm hoping it could help interactive type uses, like getting a bucket listing from the API
[23:16] * kristen (~kristen@ has joined #ceph
[23:16] <treypalmer> a lot of other things we use support S3 conveniently. ES snapshots, artifactory, etc.
[23:17] <treypalmer> from a pure self-written application point of view, I could see just using librados directly as you say. :-)
[23:17] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) Quit (Quit: Leaving.)
[23:17] <cetex> yeah. it's nice to be able to just to rados pool ls and see the filenames as well :)
[23:18] <treypalmer> that's crazy talk! :-)
[23:19] * kristen (~kristen@ Quit ()
[23:19] <cetex> :P
[23:19] <cetex> soo
[23:20] <cetex> any ideas why reads from cephfs seems capped at 128MB/s while writes caps at 1.2GB/s?
[23:20] * kristen (~kristen@ has joined #ceph
[23:20] <treypalmer> haven't set up mds and done any cephfs testing yet.
[23:20] <treypalmer> but that seems backwards of what you'd need/want in most real world use cases.
[23:21] <cetex> yeah. it's just a huge library of rarely used files
[23:22] <cetex> but once you want to read you want to read it fast since the files are roughly 5GB on average
[23:22] * bauruine (~bauruine@2a01:4f8:130:8285:fefe::36) has joined #ceph
[23:22] <treypalmer> it can happen for small I/O's when your writes are buffered and/or COW but your reads have to spin to wherever the data is.
[23:22] * diver (~diver@ has joined #ceph
[23:22] <treypalmer> but i don't get it for large sequential.
[23:22] * davidzlap (~Adium@2605:e000:1313:8003:6066:d3ef:ff1c:3b58) has joined #ceph
[23:23] <cetex> yeah.
[23:23] * haplo37 (~haplo37@ Quit (Remote host closed the connection)
[23:23] * bniver (~bniver@pool-71-174-250-171.bstnma.fios.verizon.net) has joined #ceph
[23:24] * squizzi (~squizzi@2001:420:2240:1268:3944:b0b0:1c70:4014) Quit (Quit: bye)
[23:27] <kjetijor> cetex: guessing - if cephfs stripes or chunks (large) files into multiple objects, you're probably getting better io-concurrency for the writes (if you're not doing anything like fsync), whereas for the read(s) it's more-or-less serial.
[23:28] <kjetijor> see - http://tracker.ceph.com/projects/ceph/wiki/Kernel_client_read_ahead_optimization
[23:29] * diver_ (~diver@ Quit (Ping timeout: 480 seconds)
[23:30] <cetex> aah, right
[23:30] <cetex> tried setting read ahead to 128MB
[23:30] <cetex> but didn't change much
[23:30] * valeech (~valeech@50-242-253-97-static.hfc.comcastbusiness.net) Quit (Quit: valeech)
[23:30] <cetex> maybe increase it to 1GB instead :>
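kjetijor's concurrency point can be put in numbers with a crude model; the 4 MB object size matches CephFS's default stripe unit, but the 30 ms per-object fetch is a made-up placeholder for one platter round trip, and the model ignores disk contention entirely:

```python
import math

def read_time_s(file_gb: float, object_mb: float = 4.0,
                per_object_s: float = 0.03, in_flight: int = 1) -> float:
    """Time to read a file striped across objects when `in_flight`
    object fetches overlap."""
    objects = file_gb * 1024.0 / object_mb
    return math.ceil(objects / in_flight) * per_object_s

# A 5 GB file read serially vs. with 16 object fetches in flight:
print(f"{read_time_s(5):.1f} s serial")          # ~38 s  -> ~130 MB/s
print(f"{read_time_s(5, in_flight=16):.1f} s")   # ~2.4 s -> ~2 GB/s
```

The serial case lands right around the ~128 MB/s reads observed here, while writes naturally overlap across many objects, which is the asymmetry the tracker page above addresses.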
[23:30] * diver (~diver@ Quit (Ping timeout: 480 seconds)
[23:31] * georgem (~Adium@2605:8d80:683:461:f113:4d7c:9668:134e) Quit (Read error: Connection reset by peer)
[23:31] * mtanski (~mtanski@ has joined #ceph
[23:33] * newdave (~newdave@36-209-181-180.cpe.skymesh.net.au) Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[23:35] * fsimonce (~simon@ Quit (Quit: Coyote finally caught me)
[23:36] * Jeffrey4l_ (~Jeffrey@ has joined #ceph
[23:36] * newdave (~newdave@36-209-181-180.cpe.skymesh.net.au) has joined #ceph
[23:38] * Jeffrey4l__ (~Jeffrey@ Quit (Ping timeout: 480 seconds)
[23:39] <cetex> didn't really help much
[23:39] <cetex> or, well, somewhat
[23:39] <cetex> got 230MB/s sometimes
[23:39] <cetex> and 90MB/s sometimes
[23:40] <cetex> so varies a lot
[23:41] <cetex> writes are insane though. i like it.
[23:42] <doppelgrau> cetex: OSDs = Platter with SSD journal?
[23:43] * atod (~atod@cpe-74-73-129-35.nyc.res.rr.com) has joined #ceph
[23:43] <cetex> platter only
[23:43] * xinli (~charleyst@ Quit (Ping timeout: 480 seconds)
[23:43] <doppelgrau> strange, are you sure you're not just testing the speed of your memory?
[23:44] <cetex> what do you mean? :)
[23:44] <cetex> dd if=/ceph/<file> of=/dev/null = ~90-230MB/s
[23:44] <cetex> dd if=/dev/zero of=/ceph/<file> = 1.1GB/s
[23:45] <cetex> or, right..
[23:45] <cetex> need bs=...
[23:45] <cetex> one sec.
[23:46] <cetex> # dd if=/dev/zero of=/ceph/tst bs=102400k count=100
[23:46] <T1> and..
[23:46] <T1> oflag=direct,dsync
[23:46] <cetex> 10485760000 bytes (10 GB) copied, 8.6356 s, 1.2 GB/s
[23:46] <doppelgrau> cetex: do a "sync; date; dd if=/dev/zero of=/ceph/somewhere bs=1M count=5000; sync; date"
[23:46] <cetex> right right
[23:46] <doppelgrau> forcing the cache to be flushed
[23:46] <cetex> well
[23:47] <cetex> i see network out being maxed out
[23:47] <cetex> 10Gbit
[23:47] <doppelgrau> ok, that sounds good
[23:47] <cetex> sync; date; dd if=/dev/zero of=/ceph/somewhere bs=100M count=100;sync; date
[23:47] <cetex> Thu Oct 13 23:46:47 CEST 2016
[23:47] <cetex> 10485760000 bytes (10 GB) copied, 8.55445 s, 1.2 GB/s
[23:47] <cetex> Thu Oct 13 23:47:00 CEST 2016
[23:48] <doppelgrau> "only" 760 MB/s, but still very good
[23:48] <cetex> yeah. pretty happy with that
[23:48] <cetex> 96 hgst 8TB osd's
[23:49] <cetex> reads with same parameters:
[23:49] * kristen (~kristen@ Quit (Remote host closed the connection)
[23:49] <cetex> sync; date; dd of=/dev/null if=/ceph/<another huge file not read for quite some time> bs=100M count=100;sync; date
[23:50] <cetex> Thu Oct 13 23:48:05 CEST 2016
[23:50] <cetex> 10485760000 bytes (10 GB) copied, 60.3514 s, 174 MB/s
[23:50] <cetex> Thu Oct 13 23:49:05 CEST 2016
[23:50] * kristen (~kristen@ has joined #ceph
[23:50] <cetex> sync there doesn't do much but just to show the difference..
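The sync-bracketed `dd` pattern used above can be wrapped in a small helper so the reported rate includes the final flush rather than just the page-cache copy. A sketch only; the path and sizes in the usage are placeholders:

```shell
# Sketch: write-benchmark helper that brackets dd with sync, so the
# final cache flush is counted in the elapsed time (same idea as the
# "sync; date; dd ...; sync; date" pattern in the channel).
bench_write() {
    target=$1   # file to write (placeholder path)
    mb=$2       # amount to write, in MB
    sync
    start=$(date +%s)
    dd if=/dev/zero of="$target" bs=1M count="$mb" 2>/dev/null
    sync
    end=$(date +%s)
    elapsed=$(( end - start ))
    [ "$elapsed" -lt 1 ] && elapsed=1   # guard against sub-second runs
    echo $(( mb / elapsed ))            # effective MB/s, flush included
}
```

With the numbers shown in the log (10 GB between 23:46:47 and 23:47:00, i.e. about 13 s wall-clock), this works out to roughly 10240/13 ≈ 787 MB/s, in line with doppelgrau's ~760 MB/s estimate.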
[23:51] <doppelgrau> strange, seems the read is only happening at one, max two drives at the same time
[23:51] <cetex> yeah. it's a bit weird
[23:51] * treypalmer (~treypalme@ Quit (Ping timeout: 480 seconds)
[23:52] <cetex> iostat shows multiple drives being hit at least the same second with 4MB reads, but that's kinda expected. no way to tell from that if they're being hit simultaneously or not
[23:52] * mhack (~mhack@nat-pool-bos-t.redhat.com) Quit (Remote host closed the connection)
[23:52] <cetex> have rsize=1073741824 on the ceph mount now
[23:52] <cetex> didn't do much difference compared to rsize=128M or unsetting rsize
[23:54] <cetex> doing a rados bench write now
[23:54] <cetex> 300seconds
[23:55] <doppelgrau> cetex: kernel or fuse-cephfs?
[23:56] <cetex> kernel
[23:56] <cetex> 4.4.0-36-generic #55~14.04.1-Ubuntu
[23:58] <doppelgrau> cetex: might try fuse or a different kernel. IIRC there were one or two kernel versions where the read ahead was more or less ignored after switching to "blk-elevator" in the kernel (but not 100% sure if that affected kernel-cephfs or kernel-rbd or both)
[23:58] <cetex> aah, right
[23:58] <cetex> will see about that.
[23:59] <cetex> Bandwidth (MB/sec): 1046.19
[23:59] <cetex> Stddev Bandwidth: 52.7252
[23:59] <cetex> Max bandwidth (MB/sec): 1128
[23:59] <cetex> Min bandwidth (MB/sec): 740
[23:59] <cetex> Average Latency(s): 0.366655
[23:59] <cetex> Stddev Latency(s): 0.0610722
[23:59] <cetex> Max latency(s): 1.53167
[23:59] <cetex> Min latency(s): 0.175884
[23:59] <cetex> pretty good there.
[23:59] <cetex> sorry for spamming, i'm just happy about the writes finally being awesome
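Since the write bench above looks healthy, a follow-up sequential read bench against the same pool would show whether the slow reads live in RADOS itself or in the CephFS client: `rados bench ... seq` re-reads the objects left behind by a write run started with `--no-cleanup`. A sketch, with a placeholder pool name:

```shell
# Sketch: isolate RADOS read throughput from the CephFS client.
# "bench" is a placeholder pool name; requires a reachable cluster.
rados bench -p bench 60 write --no-cleanup   # leave objects behind for the read test
rados bench -p bench 60 seq                  # sequential reads of those objects
rados -p bench cleanup                       # remove the benchmark objects afterwards
```

If the `seq` run also tops 1 GB/s, the read bottleneck is on the client side (readahead, kernel version), not in the OSDs.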

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.