#ceph IRC Log

IRC Log for 2016-10-07

Timestamps are in GMT/BST.

[0:04] * j3roen (~j3roen@93.188.248.149) Quit (Ping timeout: 480 seconds)
[0:06] * bene2 (~bene@nat-pool-bos-t.redhat.com) Quit (Quit: Konversation terminated!)
[0:07] * j3roen (~j3roen@93.188.248.149) has joined #ceph
[0:09] * Raipin (~Raipin@206-15-88-26.static.twtelecom.net) has joined #ceph
[0:10] * jermudgeon (~jermudgeo@southend.mdu.whitestone.link) Quit (Quit: jermudgeon)
[0:10] * minnesotags (~herbgarci@c-50-137-242-97.hsd1.mn.comcast.net) has joined #ceph
[0:10] <Raipin> Hey everyone. I'm a fairly new sysadmin and I was hoping someone could spare a couple minutes to help me clarify some fairly basic questions.
[0:11] <minnesotags> Sure
[0:11] <minnesotags> Ask
[0:11] * dneary (~dneary@rrcs-24-103-206-82.nys.biz.rr.com) Quit (Ping timeout: 480 seconds)
[0:12] <Raipin> Alright, so I manage a company of about 100 employees. I'm looking into High Availability setups for their data. So first question would be: is my use case appropriate for Ceph?
[0:12] <Raipin> I have never heard of Ceph until a week ago and it seems interesting.
[0:12] <minnesotags> Yes.
[0:13] <diq> I think it depends upon your use case.
[0:13] <minnesotags> ;-)
[0:13] <Raipin> 20 or so Mac clients, the rest are Windows based PCs.
[0:13] <minnesotags> Virtual machines?
[0:13] <diq> are you looking to use it as a drop-in replacement for something like NFS? Or are you looking to deploy it purely as a key/value object store?
[0:13] <diq> or RBD and VM's?
[0:13] <Raipin> Drop in replacement for AFP/SMB ideally
[0:14] <Raipin> We currently have a NetApp for our VMs
[0:15] <diq> the POSIX support isn't 100% solid yet. It's improving with every release, but it's not up to the NetApp level yet.
[0:15] <Raipin> I don't believe we're planning on changing that anytime soon, definitely not for this project. We currently have 10TB FreeNAS that we're replacing. We're just looking to include High Availability as well which FreeNAS doesn't look like it really supports.
[0:15] <minnesotags> A common use case is datastorage for virtual machines and servers.
[0:15] <Raipin> Something similar to DFS
[0:16] <diq> ceph is a good use case for your VM's, but not general purpose fileserver (at this time)
[0:16] <diq> IMHO
[0:16] <minnesotags> Unless you use a virtual machine as general purpose fileserver.
[0:16] <Raipin> Alright then, I think you just saved me hours and hours of research + testing, lol
[0:16] <diq> heh true
[0:17] <diq> you could use a VM as your fileserver, and put your VM's storage on ceph
[0:17] <minnesotags> Yes.
[0:17] <Raipin> I guess that's an idea. Our big problem with FreeNAS (other than the HA) is our Mac clients seem to constantly have file permissions/locking issues.
[0:17] <minnesotags> It scales really, really well.
[0:18] <Raipin> I wonder if it's possible to virtualize OSX Server via VBox, because I think that might resolve a lot of the problems
[0:18] <fusl> i think i f***** up my ceph cluster by running `ceph osd crush tunables optimal`, all cephfs clients now complain about "feature set mismatch, my 2b84a042aca < server's 40102b84a042aca, missing 401000000000000" - anyone knows how to revert back (jewel)?
[0:18] * johnavp1989 (~jpetrini@pool-100-14-10-2.phlapa.fios.verizon.net) has joined #ceph
[0:21] <minnesotags> Hmm. I've never needed an OSX server virtual machine, but that would indeed solve a lot of cross-platform problems.
[0:21] <minnesotags> In your setup.
[0:25] <minnesotags> The reason why it scales really well is that, unlike a raid array, you can replace individual OSDs with larger OSDs in a piecemeal fashion while maintaining uptime.
[0:26] * xinli (~charleyst@32.97.110.51) Quit (Ping timeout: 480 seconds)
[0:28] * jeh (~jeh@76.16.206.198) Quit ()
[0:33] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[0:45] * evelu (~erwan@37.163.186.61) has joined #ceph
[0:45] <SamYaple> fusl: you can set `ceph osd crush tunables legacy`, but that will likely take you back farther than you were
[0:45] <SamYaple> you need to set your previous tunables to the best of your knowledge, probably `ceph osd crush tunables hammer`, but that's just a guess
[0:46] <SamYaple> the real key here is to just turn off the tunables your clients don't support, but that requires crush map editing
[0:47] * johnavp1989 (~jpetrini@pool-100-14-10-2.phlapa.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[0:50] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) Quit (Ping timeout: 480 seconds)
[0:52] * Kingrat (~shiny@2605:6000:1526:4063:79cd:be53:c453:1c6) has joined #ceph
[0:52] * andreww (~xarses@64.124.158.3) Quit (Ping timeout: 480 seconds)
[0:55] <fusl> SamYaple: the tunables are stored in the crush map?
[0:57] <blizzow> Will I get speed gains if I set "rbd cache writethrough until flush = false" on my clients?
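
(A minimal sketch of where that option lives, for context: rbd_cache_writethrough_until_flush is a client-side librbd setting, so it belongs in the [client] section of ceph.conf on the hypervisors. Setting it to false makes the cache behave as writeback immediately instead of staying writethrough until the guest sends its first flush, which can speed up early writes but removes a safety net for guests that never flush. Paths shown are the usual defaults.)

    # append to the client-side ceph.conf (restart the guests / librbd clients to apply)
    printf '[client]\n\trbd cache = true\n\trbd cache writethrough until flush = false\n' >> /etc/ceph/ceph.conf
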
[0:59] <SamYaple> fusl: yes
[0:59] <SamYaple> fusl: you should read http://docs.ceph.com/docs/master/rados/operations/crush-map/#tunables
[1:01] * sudocat (~dibarra@192.185.1.20) Quit (Ping timeout: 480 seconds)
[1:01] <fusl> SamYaple: i have checked a backup of my crush map and the current crush map and the only difference is "tunable chooseleaf_stable 1" and "tunable allowed_bucket_algs 54" added
[1:04] * Raipin (~Raipin@206-15-88-26.static.twtelecom.net) Quit ()
[1:06] * davidzlap (~Adium@2605:e000:1313:8003:9c2b:4dee:7d49:9248) Quit (Quit: Leaving.)
[1:08] * andreww (~xarses@c-73-202-191-48.hsd1.ca.comcast.net) has joined #ceph
[1:13] * davidzlap (~Adium@2605:e000:1313:8003:9c2b:4dee:7d49:9248) has joined #ceph
[1:23] * oms101 (~oms101@p20030057EA3E1F00C6D987FFFE4339A1.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[1:24] * ira (~ira@12.118.3.106) Quit (Quit: Leaving)
[1:25] * vata (~vata@207.96.182.162) Quit (Quit: Leaving.)
[1:31] * oms101 (~oms101@2003:57:ea42:5a00:c6d9:87ff:fe43:39a1) has joined #ceph
[1:35] * richardus1 (~Scaevolus@anonymous.sec.nl) has joined #ceph
[1:43] * andrei__1 (~andrei@host81-151-140-236.range81-151.btcentralplus.com) has joined #ceph
[1:47] * andrei__1 (~andrei@host81-151-140-236.range81-151.btcentralplus.com) Quit ()
[1:49] * georgem (~Adium@69-165-135-139.dsl.teksavvy.com) has joined #ceph
[1:51] * georgem (~Adium@69-165-135-139.dsl.teksavvy.com) Quit ()
[1:52] * georgem (~Adium@206.108.127.16) has joined #ceph
[1:52] * scuttlemonkey is now known as scuttle|afk
[1:52] * scuttle|afk is now known as scuttlemonkey
[1:53] * mattbenjamin (~mbenjamin@12.118.3.106) Quit (Quit: Leaving.)
[1:53] <fusl> SamYaple: got it fixed. somehow the newest osds i added have straw2 instead of straw as the bucket algorithm
[1:53] <fusl> i changed that, set the crush map and was able to mount
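
(For reference, a minimal sketch of the decompile/edit/recompile cycle fusl describes, using the standard crushtool workflow; file names are placeholders.)

    # export and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt, e.g. change "alg straw2" back to "alg straw" on the newly
    # added buckets, or remove tunables the older kernel clients don't support
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
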
[1:54] * mhack (~mhack@24-151-36-149.dhcp.nwtn.ct.charter.com) Quit (Remote host closed the connection)
[1:57] * kristen (~kristen@134.134.139.78) Quit (Remote host closed the connection)
[1:58] * dis (~dis@00018d20.user.oftc.net) Quit (Ping timeout: 480 seconds)
[1:59] * lkoranda (~lkoranda@nat-pool-brq-t.redhat.com) Quit (Ping timeout: 480 seconds)
[1:59] * kuku (~kuku@119.93.91.136) has joined #ceph
[2:05] * richardus1 (~Scaevolus@anonymous.sec.nl) Quit ()
[2:06] * lkoranda (~lkoranda@nat-pool-brq-t.redhat.com) has joined #ceph
[2:08] * Concubidated (~cube@68.140.239.164) Quit (Quit: Leaving.)
[2:09] * dis (~dis@00018d20.user.oftc.net) has joined #ceph
[2:11] * scuttlemonkey is now known as scuttle|afk
[2:18] * axion_joey (~oftc-webi@108.47.170.18) Quit (Remote host closed the connection)
[2:19] * sudocat (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) has joined #ceph
[2:27] * sudocat (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[2:28] * jermudgeon (~jermudgeo@tab.biz.whitestone.link) has joined #ceph
[2:30] * salwasser (~Adium@2601:197:101:5cc1:cae0:ebff:fe18:8237) has joined #ceph
[2:31] * cyphase (~cyphase@000134f2.user.oftc.net) Quit (Ping timeout: 480 seconds)
[2:33] * salwasser1 (~Adium@2601:197:101:5cc1:19ce:64ad:a8e0:2b42) has joined #ceph
[2:33] * salwasser (~Adium@2601:197:101:5cc1:cae0:ebff:fe18:8237) Quit (Read error: Connection reset by peer)
[2:33] * vasu (~vasu@c-73-231-60-138.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[2:35] * rdas (~rdas@121.244.87.113) Quit (Ping timeout: 480 seconds)
[2:39] * sudocat (~dibarra@2602:306:8bc7:4c50:f913:a406:3cba:59e1) has joined #ceph
[2:40] * Concubidated (~cube@h4.246.129.40.static.ip.windstream.net) has joined #ceph
[2:57] * cyphase (~cyphase@000134f2.user.oftc.net) has joined #ceph
[2:57] * Zeis (~Guest1390@185.65.134.75) has joined #ceph
[3:03] * EinstCrazy (~EinstCraz@58.246.118.135) has joined #ceph
[3:09] * johnavp1989 (~jpetrini@pool-100-34-191-134.phlapa.fios.verizon.net) has joined #ceph
[3:23] * salwasser1 (~Adium@2601:197:101:5cc1:19ce:64ad:a8e0:2b42) Quit (Quit: Leaving.)
[3:27] * Zeis (~Guest1390@185.65.134.75) Quit ()
[3:32] * Racpatel (~Racpatel@2601:87:3:31e3::34db) has joined #ceph
[3:43] * Racpatel (~Racpatel@2601:87:3:31e3::34db) Quit (Quit: Leaving)
[3:44] * yanzheng1 (~zhyan@125.70.23.12) has joined #ceph
[3:49] * kuku (~kuku@119.93.91.136) Quit (Remote host closed the connection)
[4:00] * jermudgeon (~jermudgeo@tab.biz.whitestone.link) Quit (Quit: jermudgeon)
[4:01] * holocron (~oftc-webi@47.185.49.208) has joined #ceph
[4:01] * wkennington (~wak@0001bde8.user.oftc.net) has joined #ceph
[4:02] <holocron> greets, i've no ceph experience and just hit on this error (new installation): *** Caught signal (Bus error) ** in thread 3ff83dff910 thread_name:ceph-mon
[4:02] <holocron> this is in ceph-mon log, it's followed by a stack trace
[4:02] <holocron> bug worthy?
[4:05] * jfaj (~jan@p20030084AD3146006AF728FFFE6777FF.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[4:15] * jfaj (~jan@2003:84:ad33:2900:6af7:28ff:fe67:77ff) has joined #ceph
[4:17] * shaunm (~shaunm@ms-208-102-105-216.gsm.cbwireless.com) Quit (Ping timeout: 480 seconds)
[4:18] * x041 (175c8396@78.129.202.38) has joined #ceph
[4:20] * holocron (~oftc-webi@47.185.49.208) Quit (Quit: Page closed)
[4:24] <x041> hello
[4:24] <x041> I have an interesting situation I came upon the other day
[4:24] * blizzow (~jburns@c-50-152-51-96.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[4:24] <x041> I run a small cluster with 3 hosts and 18 osds
[4:25] <x041> 1536 pg/pgps
[4:25] <x041> 3 pools
[4:25] <x041> 2 pools with replication 3, min 2
[4:25] <x041> and 1 pool with rep 2, min 2
[4:26] <x041> 3 mons spread over the same 3 hosts
[4:26] <x041> one host died
[4:27] <x041> the pool, lets call it pool c, with min 2 and replica 2, the pgs associated with that pool obviously quickly went into stuck inactive
[4:28] <x041> as some of its replicas were on the dead host
[4:28] <x041> and with min2, the minimum was violated
[4:28] * davidzlap (~Adium@2605:e000:1313:8003:9c2b:4dee:7d49:9248) Quit (Quit: Leaving.)
[4:28] <x041> now, the cluster was in noout state, as the host was going to be repaired shortly
[4:29] <x041> but here's the problem
[4:29] * kuku (~kuku@203.177.235.23) has joined #ceph
[4:29] <x041> ios to that pool, pool c, got blocked, okay makes sense
[4:29] <x041> but ios to the other two pools
[4:29] <x041> started timing out
[4:30] <x041> I thought that since the minimum wasn't violated on them
[4:30] <x041> that poolc shouldn't prevent the other two pools from working properly
[4:30] <x041> but yet, it did!
[4:31] <x041> reducing the min on that pool to 1 allowed IOs to all 3 pools to return to normal performance
[4:31] <x041> any thoughts?
[4:35] <SamYaple> x041: you set noout so there was no recovery
[4:36] <SamYaple> x041: you could have recovered all of the 2 replica pools onto the two active nodes and been able to use them
[4:36] * kuku_ (~kuku@119.93.91.136) has joined #ceph
[4:36] <SamYaple> you seem to think the node that went down had none of the data for the 2 replica pools
[4:37] * rdave (~rdave@203.109.64.203) has joined #ceph
[4:38] <SamYaple> let me rephrase, min2 means it _must_ write to 2, but with noout set it thinks there are 3 available
[4:38] <SamYaple> so until the down osds are marked out, it will block while trying to write to all 3, since ceph is strongly consistent
[4:39] <SamYaple> i suppose you could just set norecovery
[4:40] <SamYaple> that might be ok. but ill advised. it just depends on how quick the other node is back up
[4:40] * rdave (~rdave@203.109.64.203) Quit ()
[4:43] * kuku (~kuku@203.177.235.23) Quit (Ping timeout: 480 seconds)
[4:45] * vicente (~~vicente@125-227-238-55.HINET-IP.hinet.net) has joined #ceph
[4:51] * georgem (~Adium@206.108.127.16) Quit (Quit: Leaving.)
[4:52] * Nicho1as (~nicho1as@00022427.user.oftc.net) has joined #ceph
[5:05] * jermudgeon (~jermudgeo@tab.biz.whitestone.link) has joined #ceph
[5:30] * EinstCrazy (~EinstCraz@58.246.118.135) Quit (Remote host closed the connection)
[5:30] * rotbeard (~redbeard@aftr-109-90-233-215.unity-media.net) has joined #ceph
[5:34] * vimal (~vikumar@114.143.165.30) has joined #ceph
[5:46] <x041> No, I understand why pool c did not work
[5:46] <x041> that makes full sense
[5:46] <x041> also
[5:46] <x041> like I said
[5:46] <x041> when I set pool c to min 1
[5:46] <x041> pools a and b started working, exactly as expected
[5:47] <x041> the 6 osds were down
[5:47] <x041> just no out
[5:47] <x041> so no recovery
[5:47] <x041> what I don't understand
[5:47] <x041> is why the failure of meeting replica policy on pool c
[5:47] <x041> broke the other pools
[5:48] <x041> which were still within their replica policy
[5:50] <x041> so I take it noone here online had any ideas then
[5:50] <m0zes> you set noout. ceph wouldn't let the osds get marked down. since they were "up" ceph required the write to hit them.
[5:50] <x041> no
[5:50] <x041> the osds were down
[5:50] <x041> but in
[5:50] <x041> it disallows recovery
[5:50] <x041> which means
[5:50] <x041> the pg in pool c
[5:50] <x041> could never recover
[5:50] <x041> to meet
[5:51] <x041> the min of 2
[5:51] <x041> since they were down, but in, they can still meet the min 2 of pools a and pool b
[5:51] <x041> and like I said, it worked correctly
[5:51] <x041> as soon as I set pool c to min 1
[5:51] <x041> pool a and b were not touched
[5:53] <m0zes> which version of ceph?
[5:53] <x041> I honestly wonder if it's just an edge case that no one bothered to think of testing for, just surprised to run into it at this maturity
[5:53] <x041> Infernalis
[5:55] <x041> the writes don't "technically" block at ceph level to pool a and b, no debug info shows that pgs associated with pool a and b are blocking
[5:55] <x041> only pg blocks showing in health detail are for pool c
[5:55] <x041> which makes perfect sense
[5:56] <x041> but yet, ios to pool a and b did not start returning until I "fixed" pool c
[5:56] * Vacuum_ (~Vacuum@i59F79C7C.versanet.de) has joined #ceph
[5:58] * johnavp1989 (~jpetrini@pool-100-34-191-134.phlapa.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[5:58] <x041> seems like a single broken pool is enough to lock up an entire cluster
[5:58] <x041> rather not broken
[5:58] <x041> in this case
[5:58] <x041> but
[6:00] <x041> whole point was it was a maintenance that went longer than expected, and since pool c has data that didn't need to be available, we were okay with it blocking
[6:00] <x041> what we were surprised about
[6:00] <x041> was that the other two pools stopped returning data, despite not blocking
[6:01] <x041> I think I'll hit up the mailing list - at this point I'm thinking more and more it's an edge case bug I'm running into
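
(The knobs discussed in this thread, collected as a sketch; "poolc" is a placeholder, and whether blocked PGs in one pool should stall IO to the others is exactly the question x041 is raising.)

    # inspect and lower min_size on a single pool (what x041 did for pool c)
    ceph osd pool get poolc min_size
    ceph osd pool set poolc min_size 1

    # maintenance flags used here
    ceph osd set noout       # down OSDs are not marked out, so no rebalancing starts
    ceph osd unset noout     # restore normal behaviour afterwards

    # see which PGs / requests are actually reported as blocked
    ceph health detail
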
[6:03] * Vacuum__ (~Vacuum@88.130.193.8) Quit (Ping timeout: 480 seconds)
[6:08] * johnavp1989 (~jpetrini@8.39.115.8) has joined #ceph
[6:11] * walcubi (~walcubi@p5797AEFD.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[6:11] * walcubi (~walcubi@p5795BD4C.dip0.t-ipconnect.de) has joined #ceph
[6:17] * EinstCrazy (~EinstCraz@58.246.118.135) has joined #ceph
[6:19] * wiebalck_ (~wiebalck@AAnnecy-653-1-50-224.w90-41.abo.wanadoo.fr) has joined #ceph
[6:26] * ivve (~zed@cust-gw-11.se.zetup.net) has joined #ceph
[6:26] * sudocat (~dibarra@2602:306:8bc7:4c50:f913:a406:3cba:59e1) Quit (Ping timeout: 480 seconds)
[6:35] * vimal (~vikumar@114.143.165.30) Quit (Quit: Leaving)
[6:39] * wjw-freebsd (~wjw@smtp.digiware.nl) Quit (Ping timeout: 480 seconds)
[6:40] * jermudgeon (~jermudgeo@tab.biz.whitestone.link) Quit (Quit: jermudgeon)
[6:51] * TomasCZ (~TomasCZ@yes.tenlab.net) Quit (Quit: Leaving)
[6:54] * vimal (~vikumar@121.244.87.116) has joined #ceph
[6:55] * Nicho1as (~nicho1as@00022427.user.oftc.net) Quit (Ping timeout: 480 seconds)
[7:03] * dneary (~dneary@rrcs-24-103-206-82.nys.biz.rr.com) has joined #ceph
[7:05] * rdas (~rdas@121.244.87.116) has joined #ceph
[7:06] * wiebalck_ (~wiebalck@AAnnecy-653-1-50-224.w90-41.abo.wanadoo.fr) Quit (Quit: wiebalck_)
[7:10] * morse_ (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[7:21] * bvi (~Bastiaan@185.56.32.1) has joined #ceph
[7:34] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[7:53] * x041 (175c8396@78.129.202.38) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[7:53] * krypto (~krypto@59.89.227.7) has joined #ceph
[8:00] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) has joined #ceph
[8:04] * krypto (~krypto@59.89.227.7) Quit (Ping timeout: 480 seconds)
[8:08] * Ivan1 (~ipencak@213.151.95.130) has joined #ceph
[8:16] * Be-El (~blinke@nat-router.computational.bio.uni-giessen.de) has joined #ceph
[8:17] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:24] * shubjero (~shubjero@107.155.107.246) Quit (Ping timeout: 480 seconds)
[8:25] * shubjero (~shubjero@107.155.107.246) has joined #ceph
[8:31] * lmb (~Lars@ip5b404bab.dynamic.kabel-deutschland.de) Quit (Ping timeout: 480 seconds)
[8:33] * EinstCrazy (~EinstCraz@58.246.118.135) Quit (Remote host closed the connection)
[8:36] * sardonyx (~pico@108.61.123.83) has joined #ceph
[8:42] * branto (~branto@178.253.171.244) has joined #ceph
[8:42] * EinstCrazy (~EinstCraz@2001:da8:8001:822:9d3:f7ad:52b5:a9) has joined #ceph
[8:43] * dgurtner (~dgurtner@178.197.232.110) has joined #ceph
[8:43] * EinstCra_ (~EinstCraz@2001:da8:8001:822:d854:3708:f784:7455) has joined #ceph
[8:50] * EinstCrazy (~EinstCraz@2001:da8:8001:822:9d3:f7ad:52b5:a9) Quit (Ping timeout: 480 seconds)
[8:54] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) Quit (Quit: doppelgrau)
[8:56] * dgurtner (~dgurtner@178.197.232.110) Quit (Remote host closed the connection)
[8:56] * dgurtner (~dgurtner@178.197.232.110) has joined #ceph
[8:57] * branto (~branto@178.253.171.244) Quit (Quit: Leaving.)
[8:59] * ade_b (~abradshaw@p4FF78AE7.dip0.t-ipconnect.de) has joined #ceph
[9:02] * Kurt (~Adium@193-83-29-65.adsl.highway.telekom.at) has joined #ceph
[9:06] * sardonyx (~pico@108.61.123.83) Quit ()
[9:10] * karnan (~karnan@125.16.34.66) has joined #ceph
[9:11] * analbeard (~shw@support.memset.com) has joined #ceph
[9:13] * Alexey_Abashkin (~AlexeyAba@91.207.132.76) has joined #ceph
[9:17] * derjohn_mob (~aj@169.red-176-83-69.dynamicip.rima-tde.net) has joined #ceph
[9:19] * AlexeyAbashkin (~AlexeyAba@91.207.132.76) Quit (Ping timeout: 480 seconds)
[9:22] * wjw-freebsd (~wjw@smtp.digiware.nl) has joined #ceph
[9:23] * peetaur2 (~peter@i4DF67CD2.pool.tripleplugandplay.com) has joined #ceph
[9:24] * lmb (~Lars@62.214.2.210) has joined #ceph
[9:24] * dgurtner (~dgurtner@178.197.232.110) Quit (Read error: No route to host)
[9:24] * fsimonce (~simon@95.239.69.67) has joined #ceph
[9:27] * dgurtner (~dgurtner@195.238.25.37) has joined #ceph
[9:31] * JANorman (~JANorman@81.137.246.31) has joined #ceph
[9:32] * JANorman (~JANorman@81.137.246.31) Quit (Remote host closed the connection)
[9:33] * JANorman (~JANorman@81.137.246.31) has joined #ceph
[9:33] * doppelgrau (~doppelgra@132.252.235.172) has joined #ceph
[9:40] * gucore (~fridim@56-198-190-109.dsl.ovh.fr) has joined #ceph
[9:41] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[9:44] * doppelgrau (~doppelgra@132.252.235.172) Quit (Quit: Leaving.)
[9:45] * Hemanth (~hkumar_@125.16.34.66) has joined #ceph
[9:48] * efirs (~firs@98.207.153.155) Quit (Quit: Leaving.)
[9:48] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[9:49] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit ()
[9:50] * rotbeard (~redbeard@aftr-109-90-233-215.unity-media.net) Quit (Quit: Leaving)
[9:56] * evelu (~erwan@37.163.186.61) Quit (Ping timeout: 480 seconds)
[9:58] * Kurt1 (~Adium@193-83-29-65.adsl.highway.telekom.at) has joined #ceph
[10:00] * TMM (~hp@dhcp-077-248-009-229.chello.nl) Quit (Ping timeout: 480 seconds)
[10:01] * JANorman_ (~JANorman@81.137.246.31) has joined #ceph
[10:03] * T1w (~jens@node3.survey-it.dk) has joined #ceph
[10:03] * Kurt (~Adium@193-83-29-65.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[10:04] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[10:06] * vicente (~~vicente@125-227-238-55.HINET-IP.hinet.net) Quit (Quit: Leaving)
[10:09] * JANorman (~JANorman@81.137.246.31) Quit (Ping timeout: 480 seconds)
[10:11] * kuku_ (~kuku@119.93.91.136) Quit (Remote host closed the connection)
[10:12] * sleinen (~Adium@2001:620:0:2d:a65e:60ff:fedb:f305) has joined #ceph
[10:17] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[10:18] * lmb (~Lars@62.214.2.210) Quit (Ping timeout: 480 seconds)
[10:19] * lmb (~Lars@62.214.2.210) has joined #ceph
[10:21] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[10:21] * rraja (~rraja@125.16.34.66) has joined #ceph
[10:24] * fdmanana (~fdmanana@2001:8a0:6e0c:6601:2038:c77a:4100:5349) has joined #ceph
[10:25] * Skaag (~lunix@cpe-172-91-77-84.socal.res.rr.com) has joined #ceph
[10:31] * lmb (~Lars@62.214.2.210) Quit (Ping timeout: 480 seconds)
[10:34] * JANorman_ (~JANorman@81.137.246.31) Quit (Remote host closed the connection)
[10:35] * JANorman (~JANorman@81.137.246.31) has joined #ceph
[10:47] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[10:50] * TMM (~hp@185.5.121.201) has joined #ceph
[10:50] * Anticimex (anticimex@netforce.sth.millnert.se) Quit (Ping timeout: 480 seconds)
[10:57] * vimal (~vikumar@121.244.87.116) Quit (Ping timeout: 480 seconds)
[11:09] * vimal (~vikumar@121.244.87.124) has joined #ceph
[11:12] * rotbeard (~redbeard@185.32.80.238) has joined #ceph
[11:27] * karnan (~karnan@125.16.34.66) Quit (Remote host closed the connection)
[11:32] * dgurtner_ (~dgurtner@178.197.232.110) has joined #ceph
[11:33] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[11:33] * dgurtner (~dgurtner@195.238.25.37) Quit (Ping timeout: 480 seconds)
[11:34] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[11:43] * ivve (~zed@cust-gw-11.se.zetup.net) Quit (Ping timeout: 480 seconds)
[11:45] * cyphase (~cyphase@000134f2.user.oftc.net) Quit (Ping timeout: 480 seconds)
[11:48] * Skaag (~lunix@cpe-172-91-77-84.socal.res.rr.com) Quit (Quit: Leaving.)
[11:50] * lmb (~Lars@2a02:8109:8100:1d2c:2ad2:44ff:fedf:3318) has joined #ceph
[11:55] * vikumar (~vikumar@121.244.87.116) has joined #ceph
[11:55] * ivve (~zed@cust-gw-11.se.zetup.net) has joined #ceph
[11:56] * dgurtner_ (~dgurtner@178.197.232.110) Quit (Ping timeout: 480 seconds)
[11:59] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[12:00] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[12:02] * vimal (~vikumar@121.244.87.124) Quit (Ping timeout: 480 seconds)
[12:06] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[12:14] * wjw-freebsd (~wjw@smtp.digiware.nl) Quit (Ping timeout: 480 seconds)
[12:15] * BrianA (~BrianA@c-73-189-153-151.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[12:20] * slowriot (~FierceFor@tsn109-201-154-145.dyn.nltelcom.net) has joined #ceph
[12:22] <Jeeves_> RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable".
[12:23] <Jeeves_> How can I see which features I do support?
[12:23] <peetaur2> Jeeves_: rbd info pool/image
[12:23] <peetaur2> is which are enabled on that image
[12:23] <peetaur2> and a man page lists them all
[12:23] <Jeeves_> Yes, but which ones do I support?
[12:24] <peetaur2> you mean the kernel?
[12:25] <Jeeves_> Yes
[12:25] <peetaur2> maybe it's in here http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client
[12:25] <peetaur2> this page is about tunables... not sure about rbd disable/enable
[12:26] <peetaur2> I guess it's the wrong page
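
(A sketch of the usual workaround when a kernel RBD client is older than the image's feature set; pool and image names are placeholders, and which features a given kernel accepts has to be checked against that kernel's rbd documentation, since `rbd info` only shows what the image has enabled.)

    # show the features enabled on the image
    rbd info mypool/myimage

    # disable the features the kernel client rejects (a typical set for older kernels)
    rbd feature disable mypool/myimage deep-flatten fast-diff object-map exclusive-lock
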
[12:34] * dgurtner (~dgurtner@178.197.234.97) has joined #ceph
[12:37] * lmb (~Lars@2a02:8109:8100:1d2c:2ad2:44ff:fedf:3318) Quit (Ping timeout: 480 seconds)
[12:38] * rdas (~rdas@121.244.87.116) Quit (Quit: Leaving)
[12:40] * rdas (~rdas@121.244.87.116) has joined #ceph
[12:40] * [0x4A6F] (~ident@0x4a6f.user.oftc.net) Quit (Ping timeout: 480 seconds)
[12:41] * [0x4A6F] (~ident@p508CD43B.dip0.t-ipconnect.de) has joined #ceph
[12:48] * lmb (~Lars@ip5b404bab.dynamic.kabel-deutschland.de) has joined #ceph
[12:50] * EinstCra_ (~EinstCraz@2001:da8:8001:822:d854:3708:f784:7455) Quit (Remote host closed the connection)
[12:50] * slowriot (~FierceFor@tsn109-201-154-145.dyn.nltelcom.net) Quit ()
[13:02] <JANorman> I have a number of pgs that are peering and in an inactive state. I followed the instructions to query the specific placement group, and looked under "recovery_state", but it didn't seem to have any detail as to why they're inactive. Any suggestions?
[13:04] * salwasser (~Adium@2601:197:101:5cc1:29da:73fe:9157:5555) has joined #ceph
[13:06] <JANorman> This is what I'm seeing through the query https://gist.github.com/JANorman/fb00c2f185bede9db4d5ab846d3c496e
[13:07] * kuku (~kuku@112.203.59.175) has joined #ceph
[13:08] * kuku (~kuku@112.203.59.175) Quit (Remote host closed the connection)
[13:09] * salwasser (~Adium@2601:197:101:5cc1:29da:73fe:9157:5555) Quit (Quit: Leaving.)
[13:11] * nardial (~ls@p5DC06A4A.dip0.t-ipconnect.de) has joined #ceph
[13:16] <peetaur2> JANorman: only time I had that, it was because an osd other than the one I purposely killed to test was also down
[13:16] <peetaur2> and ceph -s annoyingly doesn't tell you this for a long time
[13:17] <JANorman> Ah! Dug through the logs, seems it was an unreadable osd keyring
[13:22] * kuku (~kuku@112.203.59.175) has joined #ceph
[13:22] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[13:25] * b0e (~aledermue@213.95.25.82) has joined #ceph
[13:26] * doppelgrau (~doppelgra@132.252.235.172) has joined #ceph
[13:27] <peetaur2> JANorman: meaning another osd was down, right?
[13:29] * kuku (~kuku@112.203.59.175) Quit (Remote host closed the connection)
[13:33] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[13:33] * bniver (~bniver@71-9-144-29.static.oxfr.ma.charter.com) has joined #ceph
[13:37] * georgem (~Adium@24.114.50.175) has joined #ceph
[13:40] <JANorman> Yeah
[13:45] <peetaur2> and do you have a small number of osds, like 3?
[13:46] <peetaur2> my test was with 3 nodes with 1 osd each....killed one machine and another died after that
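
(A sketch of the usual first checks for PGs stuck peering/inactive, as discussed above; the PG id is a placeholder.)

    ceph osd tree                    # which OSDs are down or out
    ceph health detail               # which PGs are stuck, and any blocked requests
    ceph pg dump_stuck inactive      # list the inactive PGs
    ceph pg 1.2f query               # inspect one PG: recovery_state, blocked_by, peers
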
[13:46] * scg (~zscg@181.122.0.60) has joined #ceph
[13:47] * rdas (~rdas@121.244.87.116) Quit (Quit: Leaving)
[13:53] * georgem (~Adium@24.114.50.175) Quit (Quit: Leaving.)
[13:53] * georgem (~Adium@206.108.127.16) has joined #ceph
[13:57] * scg (~zscg@181.122.0.60) Quit (Quit: Ex-Chat)
[14:00] * JANorman (~JANorman@81.137.246.31) Quit (Remote host closed the connection)
[14:00] * JANorman (~JANorman@81.137.246.31) has joined #ceph
[14:04] * bitserker (~toni@88.87.194.130) has joined #ceph
[14:05] * georgem (~Adium@206.108.127.16) Quit (Quit: Leaving.)
[14:05] <wiebalck> rraja jcsp : as expected, skipping the manila authID when evicting clients in the CephFS Native driver allows multiple m-shr servers to create/delete shares (at the expense of potential race conditions as we know now).
[14:08] * JANorman (~JANorman@81.137.246.31) Quit (Ping timeout: 480 seconds)
[14:08] * mitchty_ (~quassel@130-245-47-212.rev.cloud.scaleway.com) has joined #ceph
[14:12] * mitchty (~quassel@130-245-47-212.rev.cloud.scaleway.com) Quit (Read error: Connection reset by peer)
[14:12] * xophe (~xophe@62-210-69-147.rev.poneytelecom.eu) Quit (Read error: Connection reset by peer)
[14:12] * xophe (~xophe@62-210-69-147.rev.poneytelecom.eu) has joined #ceph
[14:13] * Kurt1 (~Adium@193-83-29-65.adsl.highway.telekom.at) Quit (Quit: Leaving.)
[14:13] * Kurt (~Adium@193-83-29-65.adsl.highway.telekom.at) has joined #ceph
[14:15] <rraja> wiebalck: good to know. thanks!
[14:18] * mattbenjamin (~mbenjamin@76-206-42-50.lightspeed.livnmi.sbcglobal.net) has joined #ceph
[14:26] * georgem (~Adium@206.108.127.16) has joined #ceph
[14:36] * dgurtner (~dgurtner@178.197.234.97) Quit (Ping timeout: 480 seconds)
[14:42] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[14:45] * Kurt (~Adium@193-83-29-65.adsl.highway.telekom.at) Quit (Quit: Leaving.)
[14:45] * libracious (~libraciou@catchpenny.cf) has joined #ceph
[14:45] * zioproto (~proto@macsp.switch.ch) has joined #ceph
[14:46] <zioproto> hello
[14:47] * Kurt (~Adium@193-83-29-65.adsl.highway.telekom.at) has joined #ceph
[14:47] <zioproto> I am here because we have a 320 OSD cluster unusable since yesterday because of slow requests :) the slow requests come from almost all the OSDs (1/3 according to log file analysis).
[14:47] <zioproto> we are using latest Hammer
[14:47] * jcsp (~jspray@82-71-16-249.dsl.in-addr.zen.co.uk) has joined #ceph
[14:47] * jcsp (~jspray@82-71-16-249.dsl.in-addr.zen.co.uk) Quit (Remote host closed the connection)
[14:47] <zioproto> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
[14:47] <TMM> zioproto, how's the load on your cluster?
[14:47] <TMM> zioproto, like, the individual nodes
[14:47] <TMM> zioproto, also, do you use cache tiering?
[14:47] <zioproto> the individual nodes look fine
[14:47] <zioproto> no cache tiering
[14:48] <zioproto> network has been double checked
[14:48] <zioproto> 10Gbps between servers and to client nodes
[14:48] <zioproto> workload is mostly openstack cinder volumes
[14:48] <TMM> no high loads on individual servers?
[14:48] <TMM> or high cpu use?
[14:48] <zioproto> no, we did a reboot of all the 320 OSDs already
[14:49] <zioproto> now we are doing a reboot of all servers 1 at the time
[14:49] <TMM> what's the output of ceph -s
[14:49] <zioproto> we did set noscrub and nodeep-scrub
[14:49] <zioproto> paste here or pastebin.com ?
[14:50] <zioproto> ~10 lines
[14:50] <TMM> paste.fedoraproject.org would be best
[14:50] <zioproto> http://paste.fedoraproject.org/445550/84463514/
[14:50] <zioproto> that is basically OK
[14:50] <zioproto> it would be HEALTH_OK
[14:50] <zioproto> if the flags were not set
[14:50] <zioproto> but as soon as we start machines in Openstack
[14:51] <zioproto> everything hangs with slow requests
[14:51] <zioproto> we noticed write IOPS increased. But we can't track it down to which ceph client.
[14:55] <zioproto> we get stuff in logs like: http://paste.fedoraproject.org/445554/84494714/
[14:57] * Kurt (~Adium@193-83-29-65.adsl.highway.telekom.at) Quit (Quit: Leaving.)
[14:58] * Kurt (~Adium@193-83-29-65.adsl.highway.telekom.at) has joined #ceph
[14:58] * Kurt (~Adium@193-83-29-65.adsl.highway.telekom.at) Quit ()
[15:05] * mhack (~mhack@24-151-36-149.dhcp.nwtn.ct.charter.com) has joined #ceph
[15:07] <zioproto> we are now rebooting all servers (32) 1 by 1, lets see if we fix it
[15:09] * gucore (~fridim@56-198-190-109.dsl.ovh.fr) Quit (Ping timeout: 480 seconds)
[15:13] <mistur> zioproto: nothing change on the network ?
[15:13] <mistur> jumbo frame ?
[15:14] * jcsp (~jspray@82-71-16-249.dsl.in-addr.zen.co.uk) has joined #ceph
[15:15] <mistur> no switch has been rebooted in the past few days ?
[15:16] * mattbenjamin (~mbenjamin@76-206-42-50.lightspeed.livnmi.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[15:17] <Be-El> zioproto: you can try to access rbd images from one of the client hosts using the rbd command to have a simple test case instead of openstack
[15:18] <Be-El> there's also a bench-write subcommand for rbd to create some load
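
(A sketch of that suggestion: exercising an image directly with the rbd CLI from one client host, to separate a Ceph-side problem from an OpenStack-side one. Pool, image name and size are placeholders.)

    rbd create mypool/bench-test --size 10240          # 10 GB scratch image
    rbd bench-write mypool/bench-test --io-size 4096 --io-threads 16
    rbd rm mypool/bench-test
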
[15:20] * gucore (~fridim@56-198-190-109.dsl.ovh.fr) has joined #ceph
[15:21] * evelu (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[15:25] * bvi (~Bastiaan@185.56.32.1) Quit (Quit: Leaving)
[15:25] * T1w (~jens@node3.survey-it.dk) Quit (Ping timeout: 480 seconds)
[15:28] * jeh (~jeh@76.16.206.198) has joined #ceph
[15:29] <btaylor> weird. used the pg calc tool to figure out the PGs for my cluster, and got "error E2BIG"
[15:30] <zioproto> mistur: no, network looks ok. Interface error counters are clean
[15:30] * mhack (~mhack@24-151-36-149.dhcp.nwtn.ct.charter.com) Quit (Quit: I'm outta here!)
[15:31] <btaylor> i have 170 OSDs, did 3 replicas, only 1 pool (testing), 200 PGs per OSD. pgnum = 16384
[15:32] * Hemanth (~hkumar_@125.16.34.66) Quit (Quit: Leaving)
[15:33] <btaylor> i guess i'll update the size to 5, keep pgnum at 8192
[15:33] <mistur> zioproto: journal on SSD or disk ?
[15:33] * evelu (~erwan@37.161.38.138) has joined #ceph
[15:33] <zioproto> mistur: yes
[15:33] <mistur> zioproto: if journal on SSD, are SSD health ok ?
[15:34] <zioproto> mistur: how to check that ?
[15:34] <zioproto> mistur: with SMART check ?
[15:34] <mistur> I guess
[15:41] <mistur> zioproto: smartctl -A /dev/sdX | grep Total_LBAs_Written
[15:41] * mattbenjamin (~mbenjamin@12.118.3.106) has joined #ceph
[15:42] * Racpatel (~Racpatel@2601:87:3:31e3::34db) has joined #ceph
[15:43] <zioproto> mistur: and what is a good value for that ?
[15:43] <mistur> Raw value: reports the total number of sectors written by the host system. The raw value is increased by 1 for every 65,536 sectors (32MB) written by the host.
[15:43] <mistur> Normalized value
[15:43] * yanzheng1 (~zhyan@125.70.23.12) Quit (Quit: This computer has gone to sleep)
[15:43] <zioproto> I mean what is a good value for that ?
[15:44] * dyasny (~dyasny@cable-192.222.152.136.electronicbox.net) Quit (Ping timeout: 480 seconds)
[15:45] * sudocat (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) has joined #ceph
[15:45] <mistur> it's a counter
[15:45] <mistur> depends on your ssd endurance
[15:47] <mistur> zioproto: what are your SSDs ?
[15:48] <zioproto> so if I have 6125158
[15:48] <zioproto> it means I wrote 32 MB * 6125158 over the life of the drive ?
[15:49] <mistur> I think so
[15:49] <zioproto> so that is like 200 TB
[15:50] * dyasny (~dyasny@cable-192.222.152.136.electronicbox.net) has joined #ceph
[15:50] <mistur> zioproto: on Intel S3500DC : 240GB:
[15:50] <mistur> 140 TBW
[15:50] <mistur> http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3500-spec.pdf
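
(The arithmetic being done above, as a one-liner; the 32 MB-per-count conversion comes from the spec text mistur quoted, so it only applies to drives that report Total_LBAs_Written that way. The device name is a placeholder.)

    # convert the Total_LBAs_Written raw counter to terabytes written
    smartctl -A /dev/sdX | awk '/Total_LBAs_Written/ {printf "%.1f TB written\n", $10 * 32 / 1024 / 1024}'
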
[15:50] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[15:50] <peetaur2> how much disk space do you need for mon and mds?
[15:51] <peetaur2> nevermind...I guess I found it in the docs
[15:57] * salwasser (~Adium@72.246.0.14) has joined #ceph
[15:57] * derjohn_mob (~aj@169.red-176-83-69.dynamicip.rima-tde.net) Quit (Read error: No route to host)
[15:57] * sudocat1 (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) has joined #ceph
[15:57] * sudocat (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) Quit (Read error: Connection reset by peer)
[15:59] * Malcovent (~Kizzi@tor2r.ins.tor.net.eu.org) has joined #ceph
[16:03] * neurodrone (~neurodron@158.106.193.162) Quit (Quit: neurodrone)
[16:05] * sudocat1 (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[16:07] <zioproto> mistur: oh ok so I should be below 140 TB written, but it looks like I have a higher value, right ?
[16:07] * Ivan1 (~ipencak@213.151.95.130) Quit (Quit: Leaving.)
[16:07] * EinstCrazy (~EinstCraz@116.238.122.20) has joined #ceph
[16:08] <mistur> zioproto: you have Intel S3500 DC ?
[16:08] <zioproto> INTEL SSDSC2BW240A4
[16:08] <mistur> for S3700 TBW it is 7PBW for 400G
[16:08] <mistur> (the TBW depends on the size of the SSDs)
[16:11] * b0e (~aledermue@213.95.25.82) Quit (Quit: Leaving.)
[16:11] * rotbeard (~redbeard@185.32.80.238) Quit (Quit: Leaving)
[16:16] * ira (~ira@12.118.3.106) has joined #ceph
[16:16] <mistur> zioproto: The SSD will have a minimum useful life based on a typical client workload assuming up to 20 GB of host writes per day.
[16:17] <mistur> for 5 years
[16:17] <mistur> http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-530-sata-specification.pdf
[16:17] <mistur> less than 40TB
[16:17] <mistur> I don't know if it's the origin of the problem, but you should take care of your SSDs
[16:24] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[16:25] * Hemanth (~hkumar_@125.16.34.66) has joined #ceph
[16:29] * nardial (~ls@p5DC06A4A.dip0.t-ipconnect.de) Quit (Quit: Leaving)
[16:29] * Malcovent (~Kizzi@tor2r.ins.tor.net.eu.org) Quit ()
[16:30] * EinstCrazy (~EinstCraz@116.238.122.20) Quit (Remote host closed the connection)
[16:30] <zioproto> if I buy better SSDs, what is the good procedure to move the journal to the new disk?
[16:31] <TMM> zioproto, I'd just set the osd weight to 0, remove it from the cluster, then create a new OSD
[16:32] <TMM> zioproto, I wouldn't try to move anything
[16:32] <mistur> I think you can flush the journal then change the ssd and recreate it
[16:32] <mistur> but I'm not sure about the process
[16:32] <zioproto> TMM: sure, but I have to change 64 SSD drives. It is 32 servers, each one with 2 journal drives and 10 OSDs
[16:33] <zioproto> if I have to reweight 320 OSDs 1 at the time it will take forever
[16:33] <TMM> hmm
[16:33] <TMM> I dunno, dd then? :P
[16:34] <TMM> I have found thus far that the only time I've ever had any problems with ceph have been when I was trying to do things faster
[16:34] <zioproto> Yes I think that stopping the OSD and flushing the journal and replacing the disk sounds good
[16:35] <mistur> https://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
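
(Roughly the procedure from that post, per OSD; the OSD id is a placeholder and the start/stop commands differ between sysvinit, upstart and systemd, so treat this as a sketch rather than a substitute for the linked write-up.)

    ceph osd set noout                 # avoid rebalancing while the OSD is briefly down
    stop ceph-osd id=12                # or: systemctl stop ceph-osd@12
    ceph-osd -i 12 --flush-journal     # drain everything still sitting in the old journal
    # ... swap in the new SSD, recreate the journal partition / symlink ...
    ceph-osd -i 12 --mkjournal         # initialize the journal on the new device
    start ceph-osd id=12               # or: systemctl start ceph-osd@12
    ceph osd unset noout
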
[16:36] <zioproto> http://paste.fedoraproject.org/445610/47585096/
[16:36] <mistur> zioproto: it's still a hypothesis for the SSD performance
[16:38] <mistur> I mean, it's something important for write performance, but it might be something else
[16:38] * jermudgeon (~jermudgeo@tab.biz.whitestone.link) has joined #ceph
[16:38] <zioproto> mistur: now that we noticed, we will change all SSDs because our TBW is too high
[16:38] <mistur> zioproto: ok
[16:39] <zioproto> looks like rebooting the servers 1 by 1 fixed the performance issue. Slow requests are gone for now
[16:39] <mistur> zioproto: not necessarily related to the performance issue ?
[16:39] <mistur> ok good
[16:39] <zioproto> we still are not totally sure about what happened
[16:40] <TMM> we've had similar issues with osds using transparent huge pages with tcmalloc
[16:40] <TMM> setting THP from 'always' to 'madvise' fixed that issue
[16:40] <TMM> also increasing the thread cache size from 128MB to 256MB
[16:41] <zioproto> we don't have huge pages and we don't have tcmalloc :(
[16:41] <mistur> zioproto: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009549.html
[16:42] <zioproto> you have to explicitly install and configure these two things I guess, right ?
[16:45] * kristen (~kristen@jfdmzpr01-ext.jf.intel.com) has joined #ceph
[16:47] <TMM> zioproto, THP is a default setting on almost all distros, and tcmalloc is the default allocator for hammer packages.
[16:48] <TMM> zioproto, unless you build from source you're using these
[16:49] <TMM> zioproto, check the file /sys/kernel/mm/transparent_hugepage/enabled on one of your OSDs
[16:49] <TMM> if it says [always] you may be hitting the same problem we were hitting
[16:49] * vikumar (~vikumar@121.244.87.116) Quit (Quit: Leaving)
[16:49] * Rickus_ (~Rickus@office.protected.ca) has joined #ceph
[16:49] <TMM> as for the thread cache check /etc/sysconfig/ceph
[16:50] <TMM> we have TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=256M there
[16:54] <TMM> oh also if you have broadcom nics you may just want to preventatively burn them on a pyre and get some intel nics
[16:55] * jermudgeon (~jermudgeo@tab.biz.whitestone.link) Quit (Quit: jermudgeon)
[16:55] <TMM> always good advice for many connections and high packet rates
[16:55] <TMM> broadcom firmwares are teh suck, not even once (r)
[16:55] * Rickus (~Rickus@office.protected.ca) Quit (Ping timeout: 480 seconds)
[16:56] <TMM> if a reboot solved your issue I'd bet it was the same memory fragmentation issue we had, but shoddy broadcom nic firmwares are ALWAYS an option ;)
[16:56] * mhack (~mhack@24-151-36-149.dhcp.nwtn.ct.charter.com) has joined #ceph
[16:57] * Concubidated (~cube@h4.246.129.40.static.ip.windstream.net) Quit (Quit: Leaving.)
[17:00] <TMM> (That's a PSA ;))
[17:00] * analbeard (~shw@support.memset.com) Quit (Quit: Leaving.)
[17:00] <zioproto> TMM: it is [always] madvise never
[17:00] <zioproto> in my ubuntu system /etc/sysconfig/ceph does not exist
[17:00] <TMM> zioproto, ok, you want to set that to [madvise] everywhere and reboot all your boxes again
[17:01] <TMM> zioproto, try looking in /etc/default/ceph
[17:01] <zioproto> what I am looking for exactly ? also that file does not exist
[17:02] <TMM> you're looking for a file that has TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in it
[17:02] * ircolle (~Adium@2601:285:201:633a:1496:70a8:f562:73f9) has joined #ceph
[17:02] <TMM> if it doesn't you should check out how you're supposed to set environment variables in ubuntu services
[17:02] <TMM> I actually don't know how that works
[17:02] <TMM> for the hugepages stuff I think ubuntu has something like /etc/sysfs.conf for that
[17:02] <TMM> try adding kernel/mm/transparent_hugepage/enabled = madvise to that file
[17:03] <TMM> you should have something like 'sysutils' or something installed
[17:03] <TMM> I actually don't know if that's still how that works on ubuntu
[17:05] * Hemanth (~hkumar_@125.16.34.66) Quit (Ping timeout: 480 seconds)
[17:05] * dmanchad (~dmanchad@nat-pool-bos-t.redhat.com) Quit (Quit: ZNC 1.6.2 - http://znc.in)
[17:05] <TMM> you can also disable thp by modifying your grub config btw
[17:05] * ivve (~zed@cust-gw-11.se.zetup.net) Quit (Ping timeout: 480 seconds)
[17:06] <TMM> you may just want to google 'disable thp' for your distro :)
[17:06] * sudocat (~dibarra@45-17-188-191.lightspeed.hstntx.sbcglobal.net) has joined #ceph
[17:06] <TMM> but I'd advise instead of 'never' to go for 'madvise'
[17:06] <zioproto> TMM: http://tracker.ceph.com/issues/15588
[17:06] <zioproto> looks like it is a bug that I dont have this value
[17:06] <TMM> zioproto, I recommend adding it :)
[17:07] <TMM> zioproto, and the thp settings
[17:07] <zioproto> what would be the default value ?
[17:07] <TMM> then reboot all your nodes again
[17:07] <zioproto> I have Hammer
[17:07] * dmanchad (~dmanchad@nat-pool-bos-t.redhat.com) has joined #ceph
[17:07] <TMM> We've had problems with 128M but that's the default setting
[17:07] <TMM> maybe try that first
[17:07] <TMM> err 128MB
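
(What TMM is describing, collected in one place; the THP path is standard, while the environment file location and how it gets picked up by the OSD init scripts varies by distro and package version, so verify it on your own systems.)

    # transparent huge pages: switch the running kernel from "always" to "madvise"
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    # persist it via /etc/sysfs.conf, rc.local, or transparent_hugepage=madvise on the
    # kernel command line, depending on the distro

    # tcmalloc thread cache for the OSD daemons (value is in bytes; 134217728 = 128 MB,
    # TMM's boxes use the 256 MB equivalent). Restart the OSDs afterwards.
    echo 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728' >> /etc/default/ceph
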
[17:07] * erwan_taf (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[17:10] <zioproto> but how can I get a hint that it's actually this huge pages setting causing the problem ? because I have plenty of free RAM and nothing weird in the log files
[17:11] <TMM> if you have perf installed you should be able to see your kernel spending tons of time trying to find memory
[17:14] * sudocat (~dibarra@45-17-188-191.lightspeed.hstntx.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[17:14] <TMM> seeing khugepaged appearing in top often is also a good indicator
[17:14] <zioproto> this perf ? http://www.brendangregg.com/perf.html
[17:15] <TMM> yes, that perf
[17:15] * evelu (~erwan@37.161.38.138) Quit (Ping timeout: 480 seconds)
[17:17] <TMM> anyway, good luck with this, I'm going home to get some weekend. Your cluster is working OK right now, right?
[17:17] <zioproto> yes it is
[17:17] <zioproto> we are still rebooting a few servers
[17:17] * bitserker1 (~toni@88.87.194.130) has joined #ceph
[17:17] <zioproto> thanks for tips
[17:18] <zioproto> I am now looking where to install perf
[17:18] <TMM> alright, I'd expect it to keep running alright for at least several days
[17:18] <TMM> but really, implement the TCMALLOC setting asap
[17:18] <TMM> there's a reason it's the default now
[17:18] * dmanchad (~dmanchad@nat-pool-bos-t.redhat.com) Quit (Ping timeout: 480 seconds)
[17:18] * dmanchad (~dmanchad@nat-pool-bos-t.redhat.com) has joined #ceph
[17:18] <TMM> good luck!
[17:18] * Concubidated (~cube@68.140.239.164) has joined #ceph
[17:18] * TMM (~hp@185.5.121.201) Quit (Quit: Ex-Chat)
[17:19] <zioproto> thanks !
[17:19] * nilez (~nilez@104.129.29.42) Quit (Ping timeout: 480 seconds)
[17:24] * Be-El (~blinke@nat-router.computational.bio.uni-giessen.de) Quit (Quit: Leaving.)
[17:24] * bitserker (~toni@88.87.194.130) Quit (Ping timeout: 480 seconds)
[17:25] * sudocat (~dibarra@192.185.1.20) has joined #ceph
[17:26] * nilez (~nilez@104.129.29.42) has joined #ceph
[17:28] * valeech (~valeech@pool-96-247-203-33.clppva.fios.verizon.net) Quit (Quit: valeech)
[17:32] * bitserker1 (~toni@88.87.194.130) Quit (Quit: Leaving.)
[17:35] * doppelgrau (~doppelgra@132.252.235.172) Quit (Quit: Leaving.)
[17:39] * jeh (~jeh@76.16.206.198) Quit (Ping timeout: 480 seconds)
[17:45] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) has joined #ceph
[17:49] * sudocat (~dibarra@192.185.1.20) Quit (Ping timeout: 480 seconds)
[17:50] <zioproto> have a good weekend
[17:51] * zioproto (~proto@macsp.switch.ch) Quit (Quit: WeeChat 1.5)
[17:55] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[17:56] * Skaag (~lunix@65.200.54.234) has joined #ceph
[17:58] * vbellur (~vijay@71.234.224.255) Quit (Ping timeout: 480 seconds)
[18:00] * TomasCZ (~TomasCZ@yes.tenlab.net) has joined #ceph
[18:04] * Skaag (~lunix@65.200.54.234) Quit (Ping timeout: 480 seconds)
[18:04] * Skaag (~lunix@65.200.54.234) has joined #ceph
[18:09] * salwasser (~Adium@72.246.0.14) Quit (Quit: Leaving.)
[18:13] * davidzlap (~Adium@2605:e000:1313:8003:fddb:dd57:2a2f:8ab4) has joined #ceph
[18:14] * TMM (~hp@dhcp-077-248-009-229.chello.nl) has joined #ceph
[18:16] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) has joined #ceph
[18:18] * dneary (~dneary@rrcs-24-103-206-82.nys.biz.rr.com) Quit (Ping timeout: 480 seconds)
[18:20] <limebyte> well the 100Mbit cluster seems to run fine
[18:21] <limebyte> if you're not doing lots of access at the same time
[18:21] * rraja (~rraja@125.16.34.66) Quit (Quit: Leaving)
[18:21] * xinli (~charleyst@32.97.110.52) has joined #ceph
[18:22] * vasu (~vasu@c-73-231-60-138.hsd1.ca.comcast.net) has joined #ceph
[18:26] * andreww (~xarses@c-73-202-191-48.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[18:30] * jermudgeon (~jermudgeo@gw1.ttp.biz.whitestone.link) has joined #ceph
[18:33] * Nicho1as (~nicho1as@00022427.user.oftc.net) has joined #ceph
[18:39] <minnesotags> I just installed a new ceph on Debian (wheezy), when I reboot the machine it creates the /var/run/ceph and var/lib/run/ceph directories with root as owner, which breaks permissions. Is there a script which can be adjusted to create these directories with ceph as owner?
[18:41] * jermudgeon (~jermudgeo@gw1.ttp.biz.whitestone.link) Quit (Quit: jermudgeon)
[18:41] * jermudgeon (~jermudgeo@gw1.ttp.biz.whitestone.link) has joined #ceph
[18:43] * sleinen1 (~Adium@2001:620:0:82::105) has joined #ceph
[18:45] <blizzow> minnesotags: why aren't you using ceph-deploy?
[18:46] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:47] * ade_b (~abradshaw@p4FF78AE7.dip0.t-ipconnect.de) Quit (Quit: Too sexy for his shirt)
[18:48] * TomasCZ (~TomasCZ@yes.tenlab.net) Quit (Quit: Leaving)
[18:49] * TomasCZ (~TomasCZ@yes.tenlab.net) has joined #ceph
[18:50] * sleinen (~Adium@2001:620:0:2d:a65e:60ff:fedb:f305) Quit (Ping timeout: 480 seconds)
[18:50] * efirs (~firs@98.207.153.155) has joined #ceph
[18:53] * xinli (~charleyst@32.97.110.52) Quit (Remote host closed the connection)
[18:53] * xinli (~charleyst@32.97.110.52) has joined #ceph
[18:55] * mykola (~Mikolaj@91.245.75.214) has joined #ceph
[18:57] * efirs (~firs@98.207.153.155) Quit (Quit: Leaving.)
[19:00] * andreww (~xarses@64.124.158.3) has joined #ceph
[19:00] * efirs (~firs@98.207.153.155) has joined #ceph
[19:05] * johnavp1989 (~jpetrini@8.39.115.8) Quit (Ping timeout: 480 seconds)
[19:06] * sleinen1 (~Adium@2001:620:0:82::105) Quit (Read error: Connection reset by peer)
[19:10] * Unai (~Adium@50-115-70-150.static-ip.telepacific.net) has joined #ceph
[19:11] * Unai (~Adium@50-115-70-150.static-ip.telepacific.net) Quit ()
[19:11] * Unai (~Adium@50-115-70-150.static-ip.telepacific.net) has joined #ceph
[19:13] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[19:15] * Skaag (~lunix@65.200.54.234) Quit (Read error: Connection reset by peer)
[19:15] <Unai> Hello... can I do anything to reinsert into the cluster an OSD that was erroneously marked as lost?
[19:15] * Skaag (~lunix@162.211.147.250) has joined #ceph
[19:17] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[19:22] * linuxkidd (~linuxkidd@ip70-189-214-97.lv.lv.cox.net) Quit (Ping timeout: 480 seconds)
[19:22] * linuxkidd (~linuxkidd@ip70-189-214-97.lv.lv.cox.net) has joined #ceph
[19:22] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) has joined #ceph
[19:24] * bstillwell (~bryan@bokeoa.com) has joined #ceph
[19:37] <davidzlap> Unai: Not that I know of. The cluster should be cleaning up or has finished that process. You should add back the disk as a new empty OSD and let the cluster re-distribute the data. You can do that off hours if necessary.
[19:41] * valeech (~valeech@173-14-113-41-richmond.hfc.comcastbusiness.net) has joined #ceph
[19:41] * Nicho1as (~nicho1as@00022427.user.oftc.net) Quit (Quit: A man from the Far East; using WeeChat 1.5)
[19:42] * Hemanth (~hkumar_@103.228.221.149) has joined #ceph
[19:43] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) Quit (Ping timeout: 480 seconds)
[19:48] * debian112 (~bcolbert@c-73-184-103-26.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[19:54] * vbellur (~vijay@nat-pool-bos-t.redhat.com) has joined #ceph
[19:59] * debian112 (~bcolbert@c-73-184-103-26.hsd1.ga.comcast.net) has joined #ceph
[20:10] * fdmanana (~fdmanana@2001:8a0:6e0c:6601:2038:c77a:4100:5349) Quit (Ping timeout: 480 seconds)
[20:12] * debian112 (~bcolbert@c-73-184-103-26.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[20:14] * sudocat (~dibarra@74-196-187-158.onalcmtk01.res.dyn.suddenlink.net) has joined #ceph
[20:19] * ffilzwin (~ffilz@c-67-170-185-135.hsd1.or.comcast.net) Quit (Quit: Leaving)
[20:19] * debian112 (~bcolbert@c-73-184-103-26.hsd1.ga.comcast.net) has joined #ceph
[20:22] * sudocat (~dibarra@74-196-187-158.onalcmtk01.res.dyn.suddenlink.net) Quit (Ping timeout: 480 seconds)
[20:27] * debian112 (~bcolbert@c-73-184-103-26.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[20:31] * linuxkidd (~linuxkidd@ip70-189-214-97.lv.lv.cox.net) Quit (Ping timeout: 480 seconds)
[20:32] * sudocat (~dibarra@192.185.1.20) has joined #ceph
[20:36] <minnesotags> blizzow, I am using ceph-deploy.
[20:37] <minnesotags> Or rather, I did use ceph-deploy.
[20:37] * cyphase (~cyphase@000134f2.user.oftc.net) has joined #ceph
[20:38] <limebyte> well I use clinto-deploy
[20:38] * ffilzwin (~ffilz@c-67-170-185-135.hsd1.or.comcast.net) has joined #ceph
[20:39] <blizzow> minnesotags: hrm, when I install using ceph-deploy /var/run/ceph is owned by ceph:ceph.
[20:39] * debian112 (~bcolbert@c-73-184-103-26.hsd1.ga.comcast.net) has joined #ceph
[20:39] <blizzow> same with /var/run/ceph
[20:42] * Rickus_ (~Rickus@office.protected.ca) Quit (Read error: Connection reset by peer)
[20:42] * Rickus_ (~Rickus@office.protected.ca) has joined #ceph
[20:42] <minnesotags> blizzow, and you installed jewel on wheezy?
[20:43] * linuxkidd (~linuxkidd@ip70-189-214-97.lv.lv.cox.net) has joined #ceph
[20:45] <blizzow> oh, whoops, didn't notice that you're on wheezy, I'm using ubuntu 16.04
[20:46] <minnesotags> Big difference.
[20:47] <minnesotags> Thanks anyway.
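
(For completeness: on systemd-based installs the jewel packages normally recreate /run/ceph at boot via tmpfiles.d, while on a sysvinit system such as wheezy the equivalent is to create the directory early in the init script or an rc.local-style hook. Ownership below mirrors the ceph:ceph user the packages use; treat it as a sketch.)

    # systemd: recreate /run/ceph with ceph ownership at every boot
    echo 'd /run/ceph 0770 ceph ceph -' > /etc/tmpfiles.d/ceph.conf
    systemd-tmpfiles --create /etc/tmpfiles.d/ceph.conf

    # sysvinit (e.g. wheezy): do the same from an init script or rc.local
    install -d -o ceph -g ceph -m 0770 /var/run/ceph
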
[20:48] * newbie (~kvirc@host217-114-156-249.pppoe.mark-itt.net) has joined #ceph
[20:52] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[20:53] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:56] <hoonetorg> what's the status of bitrot detection/autorepair with bluestore?
[21:08] * linuxkidd (~linuxkidd@ip70-189-214-97.lv.lv.cox.net) Quit (Ping timeout: 480 seconds)
[21:08] * linuxkidd (~linuxkidd@ip70-189-214-97.lv.lv.cox.net) has joined #ceph
[21:13] * shaunm (~shaunm@mf05a36d0.tmodns.net) has joined #ceph
[21:17] <blizzow> Anybody here seen kernel panics in VMs that use an RBD image to store the VM OS?
[21:18] <blizzow> I converted some of my VMs to use virtio-scsi instead of virtio and I'm getting stuff like this: http://i.imgur.com/cdF4ZcQ.png http://i.imgur.com/0tBcF4o.png
[21:19] <bstillwell> blizzow: No, we have hundreds (thousands?) of VMs using RBD without a problem.
[21:19] <blizzow> ugh.
[21:19] <bstillwell> We're running trusty though
[21:19] <bstillwell> Looks like you're on vivid?
[21:20] <blizzow> xenial.
[21:21] <bstillwell> Your screenshot says linux-lts-vivid-3.19.0
[21:21] * FierceForm (~Moriarty@tor-exit.squirrel.theremailer.net) has joined #ceph
[21:22] <blizzow> I swear it's running xenial.
[21:22] <blizzow> Installed fresh a few days ago.
[21:22] * sleinen (~Adium@2001:620:1000:3:a65e:60ff:fedb:f305) has joined #ceph
[21:24] * valeech (~valeech@173-14-113-41-richmond.hfc.comcastbusiness.net) Quit (Quit: valeech)
[21:24] <bstillwell> Strange, I wouldn't expect linux-lts-vivid-3.19.0 to be in the panic message unless you were running 15.04
[21:25] <kysse> I think you can run old kernel with new xenial also ;-D
[21:25] <bstillwell> Are you using librbd, or the kernel rbd?
[21:26] <blizzow> bstillwell: on my hypervisor or on the VM?
[21:27] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[21:27] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:27] <blizzow> The vm is xenial, and has no librbd installed on it.
[21:28] <bstillwell> blizzow: The vm shouldn't know it's running on rbd.
[21:28] <bstillwell> Are both the hypervisor and the VM running xenial?
[21:28] * Hemanth (~hkumar_@103.228.221.149) Quit (Quit: Leaving)
[21:33] <blizzow> bstillwell: yes.
[21:33] <blizzow> Both installed within the last week.
[21:34] <bstillwell> Which kernels? (uname -rv)
[21:35] <blizzow> hypervisor is running "ceph-common - Version: 10.2.3-1xenial" "librbd1 - Version: 10.2.3-1xenial" kernel is: 4.4.0-38-generic #57-Ubuntu SMP Tue Sep 6 15:42:33 UTC 2016
[21:35] * valeech (~valeech@173-14-113-41-richmond.hfc.comcastbusiness.net) has joined #ceph
[21:37] <bstillwell> Is this OpenStack, or are you mapping the drive with the rbd command first?
[21:40] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[21:43] <blizzow> I'm not using openstack, and I'm not mapping the drive with rbd first. I created the vm using virt-manager and added the ceph drive.
[21:44] <blizzow> So however libvirt is doing the rbd mapping I guess.
[21:44] <blizzow> The command it uses is this... http://pastebin.com/pL0R2E8z
[21:45] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:45] <blizzow> Or rather, that's what the process looks like for a running VM.
[21:45] * sleinen (~Adium@2001:620:1000:3:a65e:60ff:fedb:f305) Quit (Quit: Leaving.)
[21:47] * mykola (~Mikolaj@91.245.75.214) Quit (Quit: away)
[21:47] <bstillwell> Yeah, looks like librbd using scsi instead of virtio
[21:48] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit ()
[21:48] <bstillwell> We use virtio, so I'm not sure if the problem is with the scsi part or not.
[21:49] <SamYaple> virtio-scsi is the best one
[21:49] <SamYaple> it is the fastest, but more importantly it supports discard
[21:49] <blizzow> virtio-scsi has caused 4 VMs to kernel panic on me in the last 24 hours :(
[21:49] <SamYaple> blizzow: esh. kernel version? qemu version?
[21:50] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:50] <bstillwell> SamYaple: Read scrollback
[21:50] <SamYaple> bstillwell: no qemu there
[21:50] * scubacuda (sid109325@0001fbab.user.oftc.net) Quit (Read error: Connection reset by peer)
[21:50] * bassam (sid154933@id-154933.brockwell.irccloud.com) Quit (Remote host closed the connection)
[21:50] * braderhart (sid124863@braderhart.user.oftc.net) Quit (Read error: Connection reset by peer)
[21:50] * zeestrat (sid176159@id-176159.brockwell.irccloud.com) Quit (Remote host closed the connection)
[21:50] <bstillwell> true
[21:50] <bstillwell> kernel is there though, and the panic seems to imply the VM is running vivid 3.19.0 to me.
[21:51] <bstillwell> http://i.imgur.com/cdF4ZcQ.png
[21:51] <blizzow> SamYaple: QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.5), Copyright (c) 2003-2008 Fabrice Bellard
[21:51] * FierceForm (~Moriarty@tor-exit.squirrel.theremailer.net) Quit ()
[21:53] <blizzow> kernel is: 4.4.0-38-generic #57-Ubuntu SMP Tue Sep 6 15:42:33 UTC 2016
[21:54] <blizzow> It's happening on the VMs where the OS is on an rbd image and set to use virtio-scsi as the driver, writeback caching, and native threads.
[21:54] <SamYaple> blizzow: i agree with bstillwell's assessment here. I am also running QEMU 2.5, though it is the ubuntu build
[21:54] <SamYaple> blizzow: where did you get that version of qemu....?
[21:54] <blizzow> This is a straight ubuntu xenial installation.
[21:55] <SamYaple> oh right. sorry. i thought we were talking about vivid, but vivid is the guest
[21:55] * huats (~quassel@stuart.objectif-libre.com) Quit (Ping timeout: 480 seconds)
[21:55] <blizzow> SamYaple: nope, they're both xenial.
[21:55] <SamYaple> 3.19 is not xenial
[21:55] <blizzow> I have no idea why the panic is showing that.
[21:55] <SamYaple> there is no 3.19 kernel for xenial
[21:55] <SamYaple> you should check into that
[21:55] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[21:55] <kysse> SamYaple: it doesn't prevent you using 3.19 kernel :)
[21:56] <SamYaple> kysse: i mean im not going to troubleshoot someone's custom-built kernel or non-standard kernel
[21:56] <blizzow> Both the hypervisor and VM are fresh installations of 16.04 in the last week straight from the ubuntu-server iso.
[21:56] <SamYaple> thats more of my point
[21:56] <blizzow> They're using the standard kernel.
[21:56] <kysse> yeh
[21:56] <SamYaple> 3.19 on xenial isn't standard
[21:57] <blizzow> I understand that and can't explain to you why the panic says that.
[21:58] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:58] * GooseYArd (~GooseYArd@ec2-52-5-245-183.compute-1.amazonaws.com) Quit (Quit: leaving)
[21:59] <SamYaple> blizzow: i bet when you figure that out the issue goes away
[21:59] <blizzow> uname -a from the hypervisor: Linux myhypervisor 4.4.0-38-generic #57-Ubuntu SMP Tue Sep 6 15:42:33 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[21:59] <blizzow> uname -a from the vm: Linux myVM.mydomain.com 4.4.0-38-generic #57-Ubuntu SMP Tue Sep 6 15:42:33 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[21:59] <SamYaple> lookup the module loaded for the scsi driver
[22:00] <SamYaple> make sure its the right one and not some old module for 3.19
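A generic way to run that check inside the guest (nothing here is specific to this cluster):

    # confirm the loaded virtio_scsi module belongs to the running kernel
    uname -r
    modinfo virtio_scsi | grep -E '^(filename|vermagic)'
    lsmod | grep -E 'virtio_scsi|sym53c8xx'
    # the filename/vermagic lines should name the same 4.4.0-xx-generic kernel that uname reports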
[22:00] * huats (~quassel@stuart.objectif-libre.com) has joined #ceph
[22:01] <SamYaple> blizzow: http://pastebin.com/nzK6d0xA
[22:01] * GooseYArd (~GooseYArd@ec2-52-5-245-183.compute-1.amazonaws.com) has joined #ceph
[22:01] <bstillwell> Found another occurrence of that panic:
[22:01] <bstillwell> https://git.stoo.org/infra/nixpkgs/commit/c06f066f220e32986f17219b2cf0c1a2b0fe2627?view=parallel
[22:01] <SamYaple> though the kernel is a bit outdated, that is what it should say
[22:01] <bstillwell> not really helpful
[22:01] * jarrpa (~jarrpa@63.225.131.166) has joined #ceph
[22:03] <bstillwell> blizzow: On your vm, run 'modinfo sym53c8xx' as well
[22:04] * vbellur (~vijay@nat-pool-bos-t.redhat.com) Quit (Quit: Leaving.)
[22:04] <bstillwell> Although, if you're using virtio-scsi, shouldn't the vm be using virtio_scsi?
[22:04] * sudocat (~dibarra@192.185.1.20) Quit (Ping timeout: 480 seconds)
[22:04] <blizzow> modinfo for virtio-scsi on both the hypervisor and VM... http://pastebin.com/mh1EGnar
[22:05] <SamYaple> blizzow: exactly
[22:06] <SamYaple> bstillwell: ^^
[22:06] <SamYaple> blizzow: idk. seems strange. where did you download this image again?
[22:06] <SamYaple> i want to poke it
[22:06] <blizzow> I'm doing no special module loading or any kernel modifications of any sort on either the hypervisor or the vm
[22:06] <SamYaple> i believe you
[22:07] <blizzow> The image is the straight ubuntu 16.04.1 install md5sum'ed to verify.
[22:07] <SamYaple> humor me and link me to it please. also post a `virsh dumpxml` of your instance (or the qemu cmdline if you arent using libvirt)
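Two ways to grab that, assuming a hypothetical domain name of "myVM":

    # with libvirt: dump the full domain definition
    virsh dumpxml myVM
    # without libvirt: capture the live qemu command line instead
    ps -eo args | grep '[q]emu-system'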
[22:08] <blizzow> The md5sum of ~/ubuntu-16.04.1-server-amd64.iso is d2d939ca0e65816790375f6826e4032f
[22:08] * davidzlap (~Adium@2605:e000:1313:8003:fddb:dd57:2a2f:8ab4) Quit (Quit: Leaving.)
[22:08] <SamYaple> got it
[22:09] <SamYaple> you should probably use the cloud images, though that still doesnt explain this
[22:09] <blizzow> Downloaded straight from: http://releases.ubuntu.com/16.04/
[22:09] <SamYaple> yea i got it
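For the cloud-image route SamYaple mentions, the usual pattern is roughly this (the pool and image names are placeholders, and qemu-img must be built with rbd support):

    wget http://cloud-images.ubuntu.com/xenial/current/xenial-server-cloudimg-amd64-disk1.img
    qemu-img convert -f qcow2 -O raw xenial-server-cloudimg-amd64-disk1.img rbd:libvirt-pool/xenial-base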
[22:09] * valeech (~valeech@173-14-113-41-richmond.hfc.comcastbusiness.net) Quit (Quit: valeech)
[22:11] <blizzow> The xml for the vm... http://pastebin.ca/3726228
[22:14] * valeech (~valeech@173-14-113-41-richmond.hfc.comcastbusiness.net) has joined #ceph
[22:14] <SamYaple> well dude i am at a loss. I just strings'd my entire install for the guest built the same way as yours and have just a few files that have 'vivid' in them, none of them important
[22:14] * gucore (~fridim@56-198-190-109.dsl.ovh.fr) Quit (Ping timeout: 480 seconds)
[22:14] <SamYaple> i know that string isn't baked into the kernel either
[22:16] <blizzow> modinfo for the VM and its detected scsi driver... http://pastebin.ca/3726230
[22:17] <blizzow> I have no idea either. :/
[22:17] <SamYaple> blizzow: the VM is saying that's its scsi driver...?
[22:18] <SamYaple> can you `lspci -k` in the vm?
[22:18] <blizzow> lspci -k gives this: http://pastebin.ca/index.php
[22:19] <bstillwell> Could the first screenshot be an old screenshot maybe?
[22:19] <blizzow> bstillwell: nope, took those less than a couple hours ago.
[22:19] <SamYaple> blizzow: lol thats an index page
[22:19] <bstillwell> well, the second screenshot says sym53c8xx as well
[22:20] <bstillwell> I don't even know why that module would be loaded...
[22:20] <SamYaple> blizzow: can you post a new pastebin url :)
[22:20] <blizzow> whoops...
[22:20] <blizzow> http://pastebin.ca/3726231
[22:21] * scubacuda (sid109325@0001fbab.user.oftc.net) has joined #ceph
[22:21] <SamYaple> where is that coming from
[22:21] <SamYaple> are you doing pci passthrough of some kind?
[22:22] <SamYaple> oh. wait. /dev/VolGroupSSD/LogVolMYVM
[22:22] <SamYaple> that
[22:23] * Concubidated (~cube@68.140.239.164) Quit (Quit: Leaving.)
[22:23] <bstillwell> So the problem isn't with Ceph, but with the sym53c8xx driver you're using for accessing /dev/VolGroupSSD/LogVolMYVM?
[22:23] * dontron (~Peaced@tor-exit.dhalgren.org) has joined #ceph
[22:23] <SamYaple> yea this isnt ceph related for sure
[22:23] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) has joined #ceph
[22:23] <SamYaple> possibly virtio-related
[22:25] * georgem (~Adium@206.108.127.16) Quit (Ping timeout: 480 seconds)
[22:25] * xinli (~charleyst@32.97.110.52) Quit (Ping timeout: 480 seconds)
[22:25] <bstillwell> blizzow: could you provide the output of: ls -l /dev/disk/by-path/
[22:25] <bstillwell> on the vm
[22:26] <SamYaple> bstillwell: no it's pretty clear he is just using the normal scsi controller
[22:26] <SamYaple> one of those scsi controllers is not like the other
[22:27] <blizzow> that's weird. The logical volume is storage for elasticsearch.
[22:27] <blizzow> ls -l output from the vm.. http://pastebin.ca/3726235
[22:27] <SamYaple> the one at 0:04 is virtio-scsi in libvirt
[22:27] <SamYaple> the other 0:07 is not virtio-scsi in that xml
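Roughly what that mismatch looks like in `lspci -k` inside the guest (the slot numbers are the ones from the discussion above; the rest is illustrative):

    00:04.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
            Kernel driver in use: virtio-pci
    00:07.0 SCSI storage controller: LSI Logic / Symbios Logic 53c895a
            Kernel driver in use: sym53c8xx
            Kernel modules: sym53c8xx

One controller is paravirtual, the other is QEMU's emulated LSI part, which is where the sym53c8xx driver comes from.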
[22:27] <SamYaple> blizzow: how did you create this thing?
[22:27] * sudocat (~dibarra@192.185.1.20) has joined #ceph
[22:27] <SamYaple> you need to fix your libvirt xml to have both controllers as scsi
[22:27] <SamYaple> or rather virtio-scsi
[22:27] * valeech (~valeech@173-14-113-41-richmond.hfc.comcastbusiness.net) Quit (Quit: valeech)
[22:28] <SamYaple> http://pastebin.ca/3726228
[22:28] <SamYaple> line 62 compared to 66
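Paraphrasing those two lines rather than quoting the pastebin, the controller section presumably looks something like:

    <controller type='scsi' index='0' model='virtio-scsi'/>   <!-- the rbd disk sits here -->
    <controller type='scsi' index='1'/>                       <!-- no model, so QEMU falls back to its emulated LSI controller (sym53c8xx) -->

Giving the second controller model='virtio-scsi' as well, or putting both disks on controller 0, is the shape of the fix being suggested.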
[22:28] <bstillwell> Ahh, that makes sense
[22:30] <blizzow> SamYaple: created via virt-manager. I made the changes to the rbd-based disk per this: http://docs.ceph.com/docs/master/rbd/libvirt/
[22:30] * vbellur (~vijay@71.234.224.255) has joined #ceph
[22:30] <SamYaple> blizzow: in virt-manager, go double check your attached disks are attached via the virtio-scsi controller
[22:31] <SamYaple> i wasnt actually aware virt-manager could do virtio-scsi yet
[22:31] <SamYaple> even though it's just letters, it still has to be built into the generation through virt-manager
[22:31] <SamYaple> man im tired. im going to go take a nap
[22:34] * erwan_taf (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) Quit (Ping timeout: 480 seconds)
[22:35] * wwalker (~wwalker@68.168.119.154) has joined #ceph
[22:35] * erwan_taf (~erwan@2a01:e34:eecb:7400:4eeb:42ff:fedc:8ac) has joined #ceph
[22:37] <blizzow> hrm. What I did in virt-manager was change the disk bus on the OS from virtio to SCSI. Virt-manager added two scsi controllers on its own. A) Controller Virtio SCSI and B) Controller SCSI.
[22:38] <wwalker> on Jewel, I'm getting "Cannot find zone id=default (name=default) \n couldn't init storage provider" when trying to create a user. "radosgw zone list" shows a single zone named "default". Any pointers?
[22:39] <wwalker> I didn't set this up. I inherited it in my current gig, so I can't answer any "how did it get that way" questions :-(
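A read-only starting point for that kind of zone error on Jewel, for comparison (these commands only inspect the zone/zonegroup config, they change nothing):

    radosgw-admin zone list
    radosgw-admin zonegroup list
    radosgw-admin zone get --rgw-zone=default
    radosgw-admin zonegroup get --rgw-zonegroup=default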
[22:39] * ira (~ira@12.118.3.106) Quit (Ping timeout: 480 seconds)
[22:39] * jermudgeon (~jermudgeo@gw1.ttp.biz.whitestone.link) Quit (Quit: jermudgeon)
[22:40] * Miouge (~Miouge@208.143-65-87.adsl-dyn.isp.belgacom.be) Quit (Quit: Miouge)
[22:41] * davidzlap (~Adium@cpe-172-91-154-245.socal.res.rr.com) has joined #ceph
[22:43] * bniver (~bniver@71-9-144-29.static.oxfr.ma.charter.com) Quit (Remote host closed the connection)
[22:47] <SamYaple> blizzow: thats.... strange
[22:47] <SamYaple> blizzow: i wonder if virtio-scsi can't do local /dev passthrough
[22:47] * Jeffrey4l__ (~Jeffrey@110.252.62.76) Quit (Ping timeout: 480 seconds)
[22:47] <SamYaple> blizzow: change the LVM volume back to virtio and pretend like none of this ever happened
[22:48] <blizzow> I was hoping for some noticeable speed boost out of using virtio-scsi, but this headache has me re-thinking.
[22:48] * Jeffrey4l__ (~Jeffrey@110.252.62.76) has joined #ceph
[22:48] <blizzow> I might just go back to virtio like you say.
[22:48] <SamYaple> no no
[22:49] <SamYaple> leave the ceph one as virtio-scsi
[22:49] <SamYaple> switch your LVM one to virtio
[22:49] <SamYaple> virtio-scsi is not causing this headache
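Concretely, that would mean the LVM-backed disk going back to something like this (the target dev name is just illustrative):

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/VolGroupSSD/LogVolMYVM'/>
      <target dev='vdb' bus='virtio'/>
    </disk>

while the rbd disk keeps its bus='scsi' target on the virtio-scsi controller.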
[22:51] <blizzow> I can try that.
[22:53] * dontron (~Peaced@tor-exit.dhalgren.org) Quit ()
[22:58] * Kingrat (~shiny@2605:6000:1526:4063:79cd:be53:c453:1c6) Quit (Ping timeout: 480 seconds)
[23:04] * Unai (~Adium@50-115-70-150.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[23:07] * mattbenjamin (~mbenjamin@12.118.3.106) Quit (Ping timeout: 480 seconds)
[23:08] * georgem (~Adium@24.114.51.213) has joined #ceph
[23:08] * georgem (~Adium@24.114.51.213) Quit ()
[23:08] * georgem (~Adium@206.108.127.16) has joined #ceph
[23:08] * Kingrat (~shiny@cpe-76-187-192-172.tx.res.rr.com) has joined #ceph
[23:11] * shaunm (~shaunm@mf05a36d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[23:12] * jermudgeon (~jermudgeo@gw1.ttp.biz.whitestone.link) has joined #ceph
[23:16] * sudocat (~dibarra@192.185.1.20) Quit (Ping timeout: 480 seconds)
[23:19] * Jeffrey4l__ (~Jeffrey@110.252.62.76) Quit (Ping timeout: 480 seconds)
[23:20] * valeech (~valeech@pool-96-247-203-33.clppva.fios.verizon.net) has joined #ceph
[23:28] * newbie (~kvirc@host217-114-156-249.pppoe.mark-itt.net) Quit (Ping timeout: 480 seconds)
[23:31] * braderhart (sid124863@braderhart.user.oftc.net) has joined #ceph
[23:35] * davidzlap (~Adium@cpe-172-91-154-245.socal.res.rr.com) Quit (Quit: Leaving.)
[23:36] * wjw-freebsd (~wjw@smtp.digiware.nl) has joined #ceph
[23:39] * kristen (~kristen@jfdmzpr01-ext.jf.intel.com) Quit (Quit: Leaving)
[23:40] * georgem (~Adium@206.108.127.16) Quit (Ping timeout: 480 seconds)
[23:40] * Unai (~Adium@192.77.237.216) has joined #ceph
[23:44] * davidzlap (~Adium@2605:e000:1313:8003:b11f:ca14:b4d7:9e9a) has joined #ceph
[23:47] * verbalins (~Zeis@108.61.166.139) has joined #ceph
[23:49] * Racpatel (~Racpatel@2601:87:3:31e3::34db) Quit (Ping timeout: 480 seconds)
[23:52] * scubacuda (sid109325@0001fbab.user.oftc.net) Quit (Read error: Connection reset by peer)
[23:52] * braderhart (sid124863@braderhart.user.oftc.net) Quit (Read error: Connection reset by peer)
[23:56] * vasu (~vasu@c-73-231-60-138.hsd1.ca.comcast.net) Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.