#ceph IRC Log


IRC Log for 2016-09-29

Timestamps are in GMT/BST.

[0:00] * davidzlap1 (~Adium@2605:e000:1313:8003:513a:75bc:802b:182b) Quit (Quit: Leaving.)
[0:06] * dneary (~dneary@nat-pool-bos-u.redhat.com) Quit (Ping timeout: 480 seconds)
[0:13] * davidzlap (~Adium@2605:e000:1313:8003:7467:6dea:3b5a:59e8) has joined #ceph
[0:30] * Neon (~Aramande_@exit0.liskov.tor-relays.net) has joined #ceph
[0:31] * brians__ (~brian@ has joined #ceph
[0:37] * brians (~brian@ Quit (Ping timeout: 480 seconds)
[0:38] * ntpttr_ (~ntpttr@ Quit (Remote host closed the connection)
[0:44] * ntpttr_ (~ntpttr@ has joined #ceph
[0:49] * vata (~vata@ Quit (Quit: Leaving.)
[0:50] * jermudgeon (~jhaustin@ has joined #ceph
[0:54] * davidzlap (~Adium@2605:e000:1313:8003:7467:6dea:3b5a:59e8) Quit (Quit: Leaving.)
[0:57] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) Quit (Ping timeout: 480 seconds)
[1:00] * kuku (~kuku@ has joined #ceph
[1:00] * Neon (~Aramande_@exit0.liskov.tor-relays.net) Quit ()
[1:02] * kuku_ (~kuku@ has joined #ceph
[1:03] * davidzlap (~Adium@cpe-172-91-154-245.socal.res.rr.com) has joined #ceph
[1:08] * jfaj_ (~jan@p4FD26CF1.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[1:09] * kuku (~kuku@ Quit (Ping timeout: 480 seconds)
[1:15] <rkeene> How can I see how much space a snapshot is using ?
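rkeene's question never gets an answer in the channel. For reference, per-snapshot space consumption can be read from `rbd du <pool>/<image> --format json`; a minimal sketch of parsing that output follows. The JSON field names are an assumption based on recent rbd releases, and the sample data is invented for illustration.

```python
import json

# Hypothetical `rbd du --format json` output. The field names follow
# recent rbd releases but are an assumption here, not taken from the log.
sample = json.loads("""
{"images": [
  {"name": "vm1", "snapshot": "snap1", "provisioned_size": 10737418240, "used_size": 1073741824},
  {"name": "vm1", "snapshot": null, "provisioned_size": 10737418240, "used_size": 2147483648}
]}
""")

def snapshot_usage(du):
    # Map (image, snapshot) -> bytes actually consumed per `rbd du` row;
    # snapshot is None for the head image itself.
    return {(i["name"], i["snapshot"]): i["used_size"] for i in du["images"]}

usage = snapshot_usage(sample)
```

With a live cluster the same dict would come from `json.loads(subprocess.check_output(["rbd", "du", "pool/image", "--format", "json"]))`.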
[1:17] * vicente (~vicente@1-161-184-59.dynamic.hinet.net) has joined #ceph
[1:20] * vata (~vata@ has joined #ceph
[1:23] * jfaj_ (~jan@p20030084AD01A9005EC5D4FFFEBB68A4.dip0.t-ipconnect.de) has joined #ceph
[1:24] * Racpatel (~Racpatel@2601:87:3:31e3::4d2a) Quit (Ping timeout: 480 seconds)
[1:26] * vicente (~vicente@1-161-184-59.dynamic.hinet.net) Quit (Ping timeout: 480 seconds)
[1:30] * ntpttr__ (~ntpttr@ has joined #ceph
[1:30] * ntpttr_ (~ntpttr@ Quit (Remote host closed the connection)
[1:30] * [0x4A6F]_ (~ident@p4FC27A7C.dip0.t-ipconnect.de) has joined #ceph
[1:32] * oms101 (~oms101@p20030057EA009900C6D987FFFE4339A1.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[1:32] * [0x4A6F] (~ident@0x4a6f.user.oftc.net) Quit (Ping timeout: 480 seconds)
[1:32] * [0x4A6F]_ is now known as [0x4A6F]
[1:35] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[1:41] * oms101 (~oms101@p20030057EA008100C6D987FFFE4339A1.dip0.t-ipconnect.de) has joined #ceph
[1:42] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) Quit (Quit: doppelgrau)
[1:43] * ntpttr__ (~ntpttr@ Quit (Remote host closed the connection)
[1:53] * wushudoin (~wushudoin@ Quit (Ping timeout: 480 seconds)
[1:55] * andreww (~xarses@ Quit (Ping timeout: 480 seconds)
[1:56] * vicente (~vicente@1-161-184-59.dynamic.hinet.net) has joined #ceph
[1:59] * salwasser (~Adium@2601:197:101:5cc1:5457:1d6a:2956:e456) has joined #ceph
[1:59] * Frostshifter (~Rens2Sea@exit0.radia.tor-relays.net) has joined #ceph
[2:05] * Concubidated (~cube@ Quit (Quit: Leaving.)
[2:09] * kristen (~kristen@ Quit (Quit: Leaving)
[2:26] * salwasser (~Adium@2601:197:101:5cc1:5457:1d6a:2956:e456) Quit (Quit: Leaving.)
[2:29] * Frostshifter (~Rens2Sea@exit0.radia.tor-relays.net) Quit ()
[2:31] * Concubidated (~cube@ has joined #ceph
[2:33] * andreww (~xarses@c-73-202-191-48.hsd1.ca.comcast.net) has joined #ceph
[2:37] * ntpttr_ (~ntpttr@ has joined #ceph
[2:42] * gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[2:54] * wjw-freebsd (~wjw@smtp.digiware.nl) Quit (Ping timeout: 480 seconds)
[3:03] * rakeshgm (~rakesh@ Quit (Ping timeout: 480 seconds)
[3:07] * lincolnb (~lincoln@c-71-57-68-189.hsd1.il.comcast.net) has joined #ceph
[3:10] * derjohn_mobi (~aj@x590cda80.dyn.telefonica.de) has joined #ceph
[3:11] * Green (~Green@ has joined #ceph
[3:14] * vicente (~vicente@1-161-184-59.dynamic.hinet.net) Quit (Ping timeout: 480 seconds)
[3:14] * davidzlap (~Adium@cpe-172-91-154-245.socal.res.rr.com) Quit (Quit: Leaving.)
[3:16] * jfaj_ (~jan@p20030084AD01A9005EC5D4FFFEBB68A4.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[3:17] * rakeshgm (~rakesh@ has joined #ceph
[3:17] * derjohn_mob (~aj@x4db06885.dyn.telefonica.de) Quit (Ping timeout: 480 seconds)
[3:23] * gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) has joined #ceph
[3:25] * jfaj_ (~jan@p20030084AD19E6005EC5D4FFFEBB68A4.dip0.t-ipconnect.de) has joined #ceph
[3:26] * Jeffrey4l__ (~Jeffrey@ has joined #ceph
[3:28] * Peltzi (peltzi@peltzi.fi) Quit (Read error: Connection reset by peer)
[3:28] * Peltzi (peltzi@peltzi.fi) has joined #ceph
[3:29] * liiwi (liiwi@idle.fi) Quit (Read error: Connection reset by peer)
[3:33] * liiwi (liiwi@idle.fi) has joined #ceph
[3:36] * Diablodoct0r (~KeeperOfT@ has joined #ceph
[3:37] * yanzheng (~zhyan@ has joined #ceph
[3:45] * jermudgeon (~jhaustin@ Quit (Quit: jermudgeon)
[3:45] * jermudgeon (~jhaustin@ has joined #ceph
[3:47] * Concubidated1 (~cube@66-87-118-60.pools.spcsdns.net) has joined #ceph
[3:48] * ntpttr_ (~ntpttr@ Quit (Ping timeout: 480 seconds)
[3:55] * Concubidated (~cube@ Quit (Ping timeout: 480 seconds)
[4:03] * scuttle|afk (~scuttle@nat-pool-rdu-t.redhat.com) Quit (Ping timeout: 480 seconds)
[4:06] * Diablodoct0r (~KeeperOfT@ Quit ()
[4:10] * rakeshgm (~rakesh@ Quit (Quit: Peace :))
[4:11] * Concubidated (~cube@ has joined #ceph
[4:11] * Concubidated1 (~cube@66-87-118-60.pools.spcsdns.net) Quit (Read error: Connection reset by peer)
[4:16] * scuttle|afk (~scuttle@nat-pool-rdu-t.redhat.com) has joined #ceph
[4:20] * baotiao (~baotiao@ has joined #ceph
[4:25] * haplo37 (~haplo37@ Quit (Remote host closed the connection)
[4:30] * ira (~ira@c-24-34-255-34.hsd1.ma.comcast.net) Quit (Ping timeout: 480 seconds)
[4:40] * sudocat (~dibarra@2602:306:8bc7:4c50:2d7d:a0ed:89ab:8683) Quit (Ping timeout: 480 seconds)
[4:45] * vicente (~~vicente@125-227-238-55.HINET-IP.hinet.net) has joined #ceph
[4:55] * ntpttr_ (~ntpttr@ has joined #ceph
[5:01] * shengping (~shengping@ has joined #ceph
[5:01] * ntpttr_ (~ntpttr@ Quit (Remote host closed the connection)
[5:06] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) has joined #ceph
[5:06] * Vacuum_ (~Vacuum@ has joined #ceph
[5:06] <shengping> hello, friends, I am using the ceph python rados bindings these days. I have hit a problem that has puzzled me for quite a long time: when I call "ioctx.set_xattr(measures, name, ...)" it takes a long time. Why does this happen? Is it because there are so many xattrs on the measures object? I noticed that "rados -p test-pool listxattr measure > measure.txt" also takes quite a long time.
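No one answers shengping in the log. One way to test the suspicion (that latency scales with how many xattrs the object already carries) is to time `set_xattr` as attributes accumulate. A self-contained sketch, with a stub standing in for a live `rados.Ioctx`; the stub, pool, and object names are illustrative assumptions:

```python
import time

# Stub standing in for a live `rados.Ioctx`; on a real cluster you would
# open one via rados.Rados(conffile=...).open_ioctx("test-pool") instead.
class FakeIoctx:
    def __init__(self):
        self.xattrs = {}

    def set_xattr(self, obj, name, value):
        self.xattrs[(obj, name)] = value

def timed(fn, *args):
    # Wall-clock a single call; returns seconds elapsed.
    t0 = time.perf_counter()
    fn(*args)
    return time.perf_counter() - t0

ioctx = FakeIoctx()
# Time set_xattr as attributes pile up; against a real cluster, a steady
# upward trend in these samples would confirm xattr count as the culprit.
samples = [timed(ioctx.set_xattr, "measures", "attr-%d" % i, b"v")
           for i in range(1000)]
```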
[5:09] * sudocat (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) has joined #ceph
[5:12] * kuku_ (~kuku@ Quit (Remote host closed the connection)
[5:13] * Vacuum__ (~Vacuum@ Quit (Ping timeout: 480 seconds)
[5:15] * jermudgeon (~jhaustin@ Quit (Quit: jermudgeon)
[5:15] * jermudgeon (~jhaustin@ has joined #ceph
[5:20] * jarrpa (~jarrpa@ Quit (Ping timeout: 480 seconds)
[5:20] * jarrpa (~jarrpa@ has joined #ceph
[5:22] * overclk (~quassel@ Quit (Remote host closed the connection)
[5:23] * overclk (~quassel@2400:6180:100:d0::54:1) has joined #ceph
[5:26] * niknakpa1dywak (~xander.ni@outbound.lax.demandmedia.com) Quit (Remote host closed the connection)
[5:26] * niknakpaddywak (~xander.ni@outbound.lax.demandmedia.com) has joined #ceph
[5:32] * jowilkin (~jowilkin@184-23-213-254.fiber.dynamic.sonic.net) Quit (Quit: Leaving)
[5:33] * vimal (~vikumar@ has joined #ceph
[5:34] * owasserm (~owasserm@2001:984:d3f7:1:5ec5:d4ff:fee0:f6dc) Quit (Ping timeout: 480 seconds)
[5:44] * wgao (~wgao@ Quit (Read error: Connection timed out)
[5:45] * niknakpaddywak (~xander.ni@outbound.lax.demandmedia.com) Quit (Ping timeout: 480 seconds)
[5:46] * owasserm (~owasserm@a212-238-239-152.adsl.xs4all.nl) has joined #ceph
[5:49] * niknakpaddywak (~xander.ni@outbound.lax.demandmedia.com) has joined #ceph
[5:51] * sudocat (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[5:57] * ntpttr_ (~ntpttr@fmdmzpr02-ext.fm.intel.com) has joined #ceph
[5:59] * jermudgeon (~jhaustin@ Quit (Quit: jermudgeon)
[6:00] * sudocat (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) has joined #ceph
[6:01] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) Quit (Quit: wes_dillingham)
[6:01] * vimal (~vikumar@ Quit (Quit: Leaving)
[6:03] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) has joined #ceph
[6:05] * ivve (~zed@cust-gw-11.se.zetup.net) has joined #ceph
[6:11] * walcubi (~walcubi@p5797A30D.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[6:12] * walcubi (~walcubi@p5795BD11.dip0.t-ipconnect.de) has joined #ceph
[6:17] * jarrpa (~jarrpa@ Quit (Remote host closed the connection)
[6:28] * vimal (~vikumar@ has joined #ceph
[6:35] * jermudgeon (~jhaustin@ has joined #ceph
[6:38] * ntpttr_ (~ntpttr@fmdmzpr02-ext.fm.intel.com) Quit (Remote host closed the connection)
[6:46] * shengping (~shengping@ Quit (Quit: shengping)
[6:53] * kefu_ (~kefu@ has joined #ceph
[6:53] * vikhyat (~vumrao@ has joined #ceph
[6:58] * vata (~vata@ Quit (Quit: Leaving.)
[6:59] * Skaag (~lunix@ has joined #ceph
[7:01] * Skaag1 (~lunix@ has joined #ceph
[7:03] * TomasCZ (~TomasCZ@yes.tenlab.net) Quit (Quit: Leaving)
[7:07] * Skaag (~lunix@ Quit (Ping timeout: 480 seconds)
[7:11] <ivve> anyone here?
[7:11] <jermudgeon> probably not
[7:11] <jermudgeon> seldom is
[7:11] <ivve> :)
[7:12] <ivve> im about to move a journal from ssd back to the same sata source disk, my conclusion is that i have to migrate off data and zap it and recreate
[7:13] <ivve> moving the journal to an ssd is easy, however back, not so much as it requires a partition
[7:13] <jermudgeon> you can resize partitions, although not live
[7:13] <ivve> well i would shutdown the osd of course
[7:13] <jermudgeon> yes, and there's always some risk in resizing
[7:13] <jermudgeon> how much data is on it / how many osds total?
[7:14] <ivve> quarter peta :P
[7:14] <jermudgeon> I mean on that specific osd
[7:14] <ivve> 4tbb ~70% allocation
[7:14] <ivve> tb*
[7:14] <ivve> and x 100
[7:14] <jermudgeon> if you think of it in terms of total rebuild/rebalance time,
[7:15] <ivve> yea it will take ages
[7:15] <jermudgeon> it seems like it's not going to be much slower just to wipe and recreate that one
[7:15] <jermudgeon> because of all the downtime while you resize etc
[7:15] <ivve> i have to do it with all of them
[7:15] <jermudgeon> Oh.
[7:15] <jermudgeon> All 100?!
[7:15] <ivve> yea
[7:15] <jermudgeon> I'm curious, why?
[7:15] <ivve> i want to use ssds as cache and ec on the cold disks
[7:16] <jermudgeon> got it
[7:16] <ivve> for better performance and allocation
[7:16] <ivve> so tiering
[7:16] <jermudgeon> there are some tradeoffs with that though, isn't it not a clear-cut case of always better performance?
[7:16] <ivve> well backups
[7:16] <ivve> and readforward
[7:16] <ivve> or readproxy
[7:16] <ivve> i know when data comes in and is read
[7:16] <ivve> or when it happens at the same time
[7:17] <jermudgeon> seems to me like you should just create some new ssd-only nodes, given the time and expense of switching
[7:17] <jermudgeon> but I am not an expert!
[7:18] <ivve> ill have a few ssds on each node as they are quite fat
[7:18] <jermudgeon> how big are the journals?
[7:18] <ivve> 5gb atm, so default
[7:18] <ivve> although partitions on ssds are larger
[7:18] <jermudgeon> can you resize the journals and reclaim enough space to do the caching tier? performance would be not as good, of course
[7:19] <ivve> yea i dont wanna do that
[7:19] <jermudgeon> yeah
[7:19] <ivve> thought of it though
[7:19] <ivve> feels so messy
[7:20] <ivve> especially when scaling
[7:20] <jermudgeon> a creeping migration is going to take a long time, but it's theoretically feasible
[7:20] <ivve> and scaling will happen
[7:20] <ivve> yea well i did a legacy to hammer tunables
[7:20] <ivve> on that quarter peta
[7:20] <ivve> :P
[7:20] <jermudgeon> I bow in the presence
[7:20] <jermudgeon> srsly
[7:21] <ivve> :D
[7:21] <jermudgeon> I just tracked down a bad 10G DAC today that was causing slow performance... frustrated the heck out of me
[7:21] <jermudgeon> blocking writes
[7:22] <ivve> it kinda sucks
[7:22] <ivve> but the best solution in ceph is to just kill off whatever is not working
[7:22] <jermudgeon> true!
[7:22] <ivve> and replace/scratch
[7:22] <jermudgeon> max replicas on your 100?
[7:22] <ivve> 2 atm
[7:23] <ivve> shoulda been 3
[7:23] <ivve> but as you can understand, this cluster needs love
[7:23] <jermudgeon> you don't have space for 3!
[7:24] <ivve> legacy -> hammer, upgrading to 10.2.3, moving to tiering with ec 2:1 to start with
[7:24] <ivve> it could when it started out
[7:24] <ivve> but now you're right
[7:24] <jermudgeon> can I ask what kind of data set/workload? I don't often get to talk to someone running at your scale
[7:25] <ivve> right now not much, its running backups
[7:25] <ivve> and logging
[7:25] <jermudgeon> do you run any kind of deduplication frontend to backups, or is it unique data?
[7:26] <ivve> yes it is deduped before it lands in ceph
[7:26] <jermudgeon> what do you dedupe with?
[7:26] <ivve> veeam
[7:26] <jermudgeon> gotcha
[7:26] <jermudgeon> I use ceph exclusively for rbd, but haven't settled on backup method yet
[7:27] <ivve> my personal preference would be something else
[7:27] <ivve> but it is what it is
[7:27] * gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) Quit (Quit: Textual IRC Client: www.textualapp.com)
[7:27] * Goodi (~Hannu@85-76-98-131-nat.elisa-mobile.fi) Quit (Quit: This computer has gone to sleep)
[7:28] <ivve> elk stack is the logging
[7:28] <jermudgeon> have you tried cephfs yet?
[7:28] <ivve> nah
[7:28] <jermudgeon> I did for like half an hour but only got as far as setting up the fs... I don't have a use case for it yet
[7:28] <ivve> i think i rather wait for iscsi
[7:29] <ivve> i've just test nfs over rbd
[7:29] <jermudgeon> I thought we already had rbd+iscsi?
[7:29] <ivve> however i didn't get it to work as i wanted, apparently it works well
[7:29] <jermudgeon> cifs gateway?
[7:29] * karnan (~karnan@ has joined #ceph
[7:30] <ivve> tbh i haven't tried rgw too much, got any good examples i can look at
[7:30] <jermudgeon> not me
[7:31] <ivve> would be really interesting to try on vmware (not that i like the hypervisor, but everyone and their mom runs it)
[7:31] <jermudgeon> have you thought of enhanceio?
[7:31] <ivve> to be rbd + kvm is enough
[7:31] <ivve> to me*
[7:31] <ivve> nope
[7:32] <jermudgeon> you can enable/disable while the source is in use
[7:32] <jermudgeon> https://wiki.archlinux.org/index.php/EnhanceIO
[7:32] <ivve> oh
[7:33] <jermudgeon> I'm not sure how it would scale in your application, but one of the devs did suggest something like that
[7:33] <jermudgeon> ssd caching rather than tiering
[7:33] * raphaelsc (~raphaelsc@ has joined #ceph
[7:33] <ivve> have you tried any tiering?
[7:34] <jermudgeon> no
[7:34] <jermudgeon> I looked at the tradeoffs and it didn't seem the right choice
[7:34] <jermudgeon> for backup workloads I think it would work well, as you have a lot of cold data
[7:34] <ivve> readforward/readproxy is cool, if you have a test cluster i highly recommend testing it
[7:34] <jermudgeon> for hot loads the difference can be small or even a reduction
[7:35] <ivve> especially when working with sata
[7:35] <ivve> you can take really high load spikes no worries
[7:35] <ivve> and then you just scale your "hot tier" accordingly with size and cache-settings
[7:36] <ivve> because backup usually isn't one constant flow of data
[7:37] <jermudgeon> yeah
[7:37] <ivve> its really cool because you can do a max write and max read at the same time
[7:37] <ivve> with no penalty
[7:38] <jermudgeon> because you're writing to different tiers
[7:38] <ivve> yea since reading happens of the ecpool and writing happens to the ssdpool
[7:38] <ivve> reading from ecpool is usually really fast, even from sata
[7:39] <ivve> spread out over the cluster
[7:39] <ivve> its easy to max out the network
[7:39] <jermudgeon> what do you get on rados bench? what's your total cluster bandwidth?
[7:40] <ivve> so the problem is rather the backup is slow or network
[7:40] <ivve> "slow" :P
[7:40] <ivve> ill try again when im done with this thing
[7:40] <ivve> which will take a little while :D
[7:40] <jermudgeon> you'd definitely benefit from EC
[7:40] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) Quit (Quit: wes_dillingham)
[7:40] <jermudgeon> it's like a 25% savings, right? 2x = 1.5x?
[7:41] <ivve> well yea and since ec won't allow rbds directly
[7:41] <ivve> yeah thats right
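jermudgeon's arithmetic checks out and is easy to verify: replica-2 stores 2.0x the raw data, erasure coding with k data chunks and m coding chunks stores (k + m) / k, so the 2:1 profile discussed here stores 1.5x, a 25% raw-capacity saving versus two replicas.

```python
# Raw-capacity multiplier for an erasure-coded pool: k data chunks plus
# m coding chunks, all the same size, so stored bytes = (k + m) / k.
def storage_multiplier_ec(k, m):
    return (k + m) / k

replica2 = 2.0                        # two full copies
ec21 = storage_multiplier_ec(2, 1)    # EC 2:1 -> 1.5x
savings = 1 - ec21 / replica2         # fraction of raw capacity saved
```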
[7:41] <jermudgeon> so not only will you have to change the osds, set up new pools and tiering, you'll also have to move the data?
[7:41] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) has joined #ceph
[7:41] <ivve> one funny thing is
[7:42] <ivve> going from legacy -> hammer was ~20% performance
[7:42] <jermudgeon> I started at hammer
[7:42] <ivve> from being able to write 1mb blocks ~550mb/sec
[7:42] <ivve> to 750mb/sec
[7:42] <jermudgeon> mb or mB?
[7:42] <ivve> megabytes
[7:42] * Schaap (~biGGer@ has joined #ceph
[7:42] <jermudgeon> nice!
[7:43] <ivve> and thats sata!
[7:43] <jermudgeon> whee
[7:43] <ivve> :)
[7:43] <jermudgeon> I still don't envy you the move
[7:43] <ivve> well since its backups
[7:43] <ivve> there are really 2 options
[7:43] <ivve> or well 3
[7:43] <jermudgeon> new backups -> new pools
[7:43] <jermudgeon> old backups age out
[7:43] <ivve> exactly
[7:44] <jermudgeon> that'll work
[7:44] <ivve> but this cluster has 1 mega image
[7:44] <ivve> i don't like that
[7:44] <jermudgeon> what do you use for monitoring your osd apart from the obvious smartd?
[7:44] <ivve> so i want many smaller
[7:44] <ivve> like 10tb or so
[7:44] <jermudgeon> aggregating your mon/osd logs into something like splunk?
[7:44] <jermudgeon> yeah, that should be more manageable
[7:45] <ivve> so that i can do snaps and export image with incrementals
[7:45] <ivve> basically i have to move the data three times in this cluster
[7:45] <ivve> tunables-pool/images-journals
[7:46] <ivve> in any order
[7:46] <ivve> :)
[7:46] <ivve> monitoring is quite limited
[7:46] <jermudgeon> you must have a fairly steady churn of replacing drives
[7:47] <ivve> but im thinking of something that uses smartctl and then some nagios specific for ceph
[7:47] <ivve> actually nothing yet :)
[7:47] <jermudgeon> how do you detect drive failure then?
[7:47] <jermudgeon> bitrot?
[7:47] <ivve> right now ceph does its magic if something fails
[7:48] <ivve> but that has to be improved upon
[7:48] <jermudgeon> cool, it must mark the osd as out
[7:48] <ivve> yea and ill notice the health_warn
[7:48] <ivve> and "lost" it
[7:48] <jermudgeon> yep
[7:48] <ivve> if its broken
[7:48] <jermudgeon> makes sense
[7:48] <John341> I'm CentOS7/Hammer and I have radosgw multipart and shadow objects in .rgw.buckets even though I have deleted all buckets 2weeks ago, can anybody advice on how to prune or garbage collect the orphan objects?
[7:49] <jermudgeon> ivve: how many osds per node?
[7:50] * Skaag (~lunix@ has joined #ceph
[7:51] * gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) has joined #ceph
[7:56] * Skaag1 (~lunix@ Quit (Ping timeout: 480 seconds)
[8:00] * Goodi (~Hannu@ has joined #ceph
[8:01] * garphy is now known as garphy`aw
[8:04] * garphy`aw is now known as garphy
[8:06] * shengping (~shengping@ has joined #ceph
[8:11] <ivve> jermudgeon: plan is 4ssd and rest sata, which is around 30-34
[8:12] <jermudgeon> do you have a separate 10g osd network? I tried running separate public and private networks, and thought I had screwed things up royally... but it probably was the bad cable, so I might switch back
[8:12] <jermudgeon> currently I'm nowhere near maxing the 10g
[8:12] * Schaap (~biGGer@ Quit ()
[8:13] <ivve> currently one 10gb with separate vlan, not how i want it though
[8:14] <jermudgeon> I put dual on all nodes, but only using one so far as I need to buy another 10g switch
[8:14] <jermudgeon> or move some ports around
[8:14] <jermudgeon> then I can have separate phy layer instead of vlans
[8:14] <ivve> best would be 2 nics with 2 ports and lacp
[8:14] <jermudgeon> true
[8:14] <ivve> since its only 4 nodes
[8:14] <ivve> but once nodes go up
[8:15] <ivve> lacp becomes too expensive
[8:15] <ivve> so just a single nic with two switches might be enough
[8:15] <jermudgeon> have to balance aggregate osd throughput with node density
[8:15] <ivve> guess there is a hotspot after a while
[8:15] <ivve> same with os disk in raid1
[8:16] <ivve> we have it now, but later on it becomes redundant
[8:16] <jermudgeon> one 10 gig link isn't a whole lot per osd at your density
[8:16] <ivve> guess the hotspot is ~8-10 nodes
[8:16] <ivve> since one node is ~130TB atm
[8:16] <jermudgeon> yeah
[8:17] * Be-El (~blinke@nat-router.computational.bio.uni-giessen.de) has joined #ceph
[8:18] <jermudgeon> Be-El: thanks for your help the other day, things are running smoother now
[8:18] <Be-El> great to know
[8:18] <jermudgeon> had a bad DAC
[8:20] <jermudgeon> ivve: Be-El would know if there's a shortcut to the architecture you want
[8:23] <Be-El> jermudgeon: i remember discussing about something with you, but i do not remember the exact topic anymore (too much different stuff on a day-to-day base is not good for poor little Be-El ;-).
[8:26] * hgjhgjh (~Scymex@ has joined #ceph
[8:26] * valeech (~valeech@pool-96-247-203-33.clppva.fios.verizon.net) Quit (Quit: valeech)
[8:28] * Skaag (~lunix@ Quit (Ping timeout: 480 seconds)
[8:29] <Be-El> ivve: caffeine levels have been stabilized, feel free to ask
[8:31] <jermudgeon> Be-El: none taken
[8:32] * jermudgeon (~jhaustin@ Quit (Quit: jermudgeon)
[8:32] * shengping (~shengping@ has left #ceph
[8:33] <ivve> ouch
[8:35] <ivve> journal read_header error decoding journal header :(
[8:38] <ivve> seems journals got trashed after boot :P
[8:40] <Be-El> sounds like a hardware problem
[8:40] <ivve> all osds on a node?
[8:40] <ivve> after a reboot after patching
[8:40] <ivve> seems unlikely, but not impossible
[8:41] <Be-El> do you use one ssd for the journals, multiple ssds or colocated journals on hdds?
[8:41] <ivve> multiple journals on one ssd
[8:42] * garphy is now known as garphy`aw
[8:42] <ivve> hmm maybe it's udev
[8:43] <Be-El> if you run 'ceph-disk list' on the host it should list all osd partitions and journal partitions
[8:43] <ivve> it does
[8:44] <Be-El> so the partition type uuid are ok
[8:44] <ivve> blkid shows different names on the disks
[8:44] <Be-El> if the journals got mixed up between osds there should be a different error message
[8:44] <ivve> sdb & sdp instead
[8:44] <ivve> of sdah & sdam
[8:45] <ivve> thats the issue
[8:45] <Be-El> devices names are not fixed in linux and may change during reboots. that's why ceph usually uses partition uuid to refers to journals
[8:45] <ivve> yeah exactly
[8:45] <Be-El> what's the target of the journal symlink in the /var/lib/ceph/osd/XYZ/ directory?
[8:45] <ivve> however the link doesn't point to uuid
[8:46] <ivve> it points to /dev/sdaX
[8:46] <ivve> it used to before
[8:46] <ivve> but now it doesn't
[8:46] <ivve> journal -> /dev/sdah1
[8:46] <ivve> so it got swapped around, how does ceph identify it and create the link?
[8:48] <Be-El> on our hosts the symlinks also point to the devices itself instead of the uuid entries
[8:49] <ivve> no udev rules
[8:49] <ivve> shouldn't there be one there?
[8:49] * wes_dillingham (~wes_dilli@209-6-222-74.c3-0.hdp-ubr1.sbo-hdp.ma.cable.rcn.com) Quit (Quit: wes_dillingham)
[8:49] <Be-El> not sure whether this is done by udev rules or ceph-disk
[8:49] <ivve> well udevrules is empty
[8:49] <Be-El> all current releases shoud use udev rules
[8:49] <ivve> no ceph related rules anyway
[8:49] <ivve> maybe something went wrong with the upgrade
[8:49] <Be-El> upgrade to jewel?
[8:50] <ivve> ye
[8:50] <ivve> the others went fine
[8:50] <Be-El> on our centos hosts we have /lib/udev/rules.d/60-ceph-by-parttypeuuid.rules and /lib/udev/rules.d/95-ceph-osd.rules
[8:50] <ivve> however i can see that they too have lazy /dev/sdaX journal link
[8:51] <ivve> yea
[8:51] <ivve> i was looking in etc
[8:51] <ivve> they are there
[8:51] <ivve> 60-ceph-by-parttypeuuid.rules
[8:51] <ivve> 95-ceph-osd.rules
[8:52] <Be-El> do you use a systemd based startup system or upstart?
[8:52] <ivve> cent
[8:52] <ivve> os systemd
[8:52] <ivve> so*
[8:52] <Be-El> ok, the same setup we have
[8:52] <Be-El> let's find out who's starting the osd processes
[8:52] <Be-El> and which parameters are used
[8:52] * derjohn_mobi (~aj@x590cda80.dyn.telefonica.de) Quit (Ping timeout: 480 seconds)
[8:54] <Be-El> ceph-osd is started by /lib/systemd/system/ceph-osd@.service, but it does not mount the osd partition or handles the journal symlink
[8:54] <ivve> aye, got that far
[8:54] * branto (~branto@nat-pool-brq-t.redhat.com) has joined #ceph
[8:55] <Be-El> and /lib/udev/rules.d/60-ceph-by-parttypeuuid.rules only handles the parttypeuuid links in /dev
[8:55] <ivve> ceph-disk@.service
[8:55] <Be-El> good catch, missed that one
[8:56] <ivve> it does the activation
[8:56] <ivve> in other words mounting?
[8:56] * hgjhgjh (~Scymex@ Quit ()
[8:56] <Be-El> mounting and maybe handling of the journal
[8:56] <ivve> and starting the ceph-osd@$id ?
[8:56] * fridim (~fridim@56-198-190-109.dsl.ovh.fr) has joined #ceph
[8:57] <Be-El> i think the osd process itself is started by ceph-osd@.services, since their show up at ceph-osd@.services instances in systemctl
[8:57] <ivve> seems it uses ceph-disk to trigger from udev rules
[8:57] <Be-El> that's what the second udev rule is for
[8:58] <Be-El> so you either trigger ceph-disk by udev, or use the ceph-disk@.service systemd service
[8:58] * Lokta (~Lokta@carbon.coe.int) has joined #ceph
[8:58] <Be-El> in both cases ceph-disk trigger $dev is called
[8:58] <ivve> trigger Trigger an event (called by udev)
[8:58] <ivve> hard to read that man :P
[8:59] <Be-El> ceph-disk is just python code, so let's dig into it
[8:59] * malevolent (~quassel@ Quit (Ping timeout: 480 seconds)
[8:59] <ivve> aye
[8:59] * igoryonya (~kvirc@ has joined #ceph
[9:00] <Be-El> packages are in /usr/lib/python2.7/site-packages/ceph_disk/
[9:00] <ivve> alright
[9:01] * malevolent (~quassel@ has joined #ceph
[9:03] * rdas (~rdas@ has joined #ceph
[9:03] <Be-El> there's a main_trigger function which is probably the code path if trigger option is given
[9:04] <ivve> what exactly does it look for?
[9:05] <Be-El> it just invokes ceph-disk active or ceph-disk active-journal, depending on the partition type
[9:05] <ivve> when checking with blkid all are there
[9:05] * niknakpa1dywak (~xander.ni@outbound.lax.demandmedia.com) has joined #ceph
[9:06] <ivve> i tried invoking activate manually, its a no-go
[9:06] <ivve> thats more or less the first thing i tried
[9:06] <Be-El> it should work. what's the error you have encountered?
[9:06] <ivve> it succeeds in the mounting of the osd part, but the link fails (it links to some part that doesn't exist)
[9:07] <ivve> that is the journal
[9:07] <ivve> subcommand activate
[9:07] * niknakpaddywak (~xander.ni@outbound.lax.demandmedia.com) Quit (Ping timeout: 480 seconds)
[9:08] <ivve> main_activate
[9:08] <ivve> one sec
[9:08] <ivve> might find it here
[9:08] <ivve> so it locates the correct OSD part
[9:09] <ivve> with main_activate, get_dm_uuid im guessing
[9:09] <ivve> nothing about the journal
[9:10] * ade (~abradshaw@2a02:810d:a4c0:5cd:9001:2bba:886a:5200) has joined #ceph
[9:11] * TMM (~hp@dhcp-077-248-009-229.chello.nl) Quit (Quit: Ex-Chat)
[9:12] <ivve> this is weird
[9:12] <ivve> how would the osd know which journal is "its" journal
[9:12] <ivve> i have a healthy osd
[9:12] <Be-El> that's exactly the command i'm currently looking for
[9:13] <Be-El> both osds and journals have id, and they need to match
[9:13] <ivve> it has /var/lib/ceph/osd/ceph-102/journal_uuid
[9:13] <ivve> however this osd has none
[9:13] <Be-El> and there's a command to tell the uuid of journal partition
[9:13] <ivve> ah
[9:15] <Be-El> well, it used to be there in hammer release...jewel has changed a lot under the hood for bluestore support
[9:16] <ivve> --check-needs-journal for ceph-osd?
[9:17] <ivve> or wants
[9:19] <ivve> just thinking out loud here, the osds were shut down properly before the update
[9:19] <ivve> should be possible to just recreate them
[9:19] <ivve> however thats a workaround
[9:19] <ivve> would be nice to know why this happened in the first place
[9:20] <Be-El> i don't know whether a shutdown flushes the journal
[9:20] * T1w (~jens@node3.survey-it.dk) has joined #ceph
[9:20] <ivve> ah maybe it doesn't
[9:22] <Be-El> ok, the only command to query information from a journal i've found so far is ceph-osd --get-device-fsid
[9:22] <ivve> yea
[9:23] <peetaur2_> Be-El: also there's ceph-objectstore-tool but not sure what it does :)
[9:23] * analbeard (~shw@support.memset.com) has joined #ceph
[9:24] <ivve> same error i get in messages when it tried to mount it
[9:24] <Be-El> the id returned by --get-device-fsid and the id store in /var/lib/ceph/osd/XYZ/fsid should be the same
[9:24] <ivve> 2016-09-29 10:23:34.657691 7f341f80b800 -1 bluestore(/dev/sde1) _read_bdev_label unable to decode label at offset 66: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
[9:24] <ivve> 2016-09-29 10:23:34.664451 7f341f80b800 -1 journal read_header error decoding journal header
[9:24] <ivve> failed to get device fsid for /dev/sde1: (22) Invalid argument
[9:25] <Be-El> the first message is just bluestore support and can be ignored (got the same message here)
[9:25] <Be-El> the second one is troubling. your journals are indeed invalid
[9:25] <ivve> they are here but udev gave them new names
[9:26] <Be-El> and /dev/sde is the journal device, with /dev/sde1 being one of the journal partitions?
[9:26] <ivve> and the link is pointing to old name
[9:27] <ivve> ./dev/sde1 is ceph data partition
[9:27] <Be-El> try with one of the journal partitions instead
[9:27] <ivve> k
[9:27] <ivve> bueno!
[9:28] <ivve> 2016-09-29 10:27:44.281922 7f80b6099800 -1 bluestore(/dev/sdb1) _read_bdev_label unable to decode label at offset 66: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
[9:28] <ivve> 4486b952-5305-4b27-bdb1-d5da92e5f425
[9:28] * rraja (~rraja@ has joined #ceph
[9:28] <Be-El> ok, that's the uuid of the osd that journal belongs to
[9:28] <ivve> ./dev/sdb1 belongs to /dev/sde1 as i thought, fsid is a match
[9:28] <Be-El> 'cat /var/lib/ceph/osd/*/fsid' should print it for one of the osds
[9:28] <ivve> yeah it all matches
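The matching ivve and Be-El do by hand here can be scripted: read every `/var/lib/ceph/osd/*/fsid` and compare it against the uuid that `ceph-osd --get-device-fsid <journal-partition>` printed. A sketch, demoed against a throwaway directory rather than a live host; the uuid is the one from the log above:

```python
import glob
import os
import tempfile

def find_osd_for_journal(journal_fsid, osd_root="/var/lib/ceph/osd"):
    # Compare the uuid from `ceph-osd --get-device-fsid <journal part>`
    # against every osd's fsid file; return the matching osd directory.
    for path in glob.glob(os.path.join(osd_root, "*", "fsid")):
        with open(path) as f:
            if f.read().strip().lower() == journal_fsid.strip().lower():
                return os.path.dirname(path)
    return None

# Demo layout standing in for /var/lib/ceph/osd on a real host.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "ceph-102"))
with open(os.path.join(root, "ceph-102", "fsid"), "w") as f:
    f.write("4486b952-5305-4b27-bdb1-d5da92e5f425\n")
match = find_osd_for_journal("4486B952-5305-4B27-BDB1-D5DA92E5F425", root)
```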
[9:29] <ivve> so now is the question
[9:29] <ivve> if i re-link /var/lib/ceph/osd/ceph-XX/journal
[9:29] <ivve> is it persistent or is that link something that ceph manipulates on start/stop
[9:29] <Be-El> so the reboot has probably changed the device order (e.g. pci-e nvme ssd before sata/sas hdds), and osd activation system did not update the journal symlinks
[9:30] <Be-El> i would propose the check that osds are ok by changing the symlink manually for one osd and start it (ceph-disk activate...).
[9:30] <ivve> aye
[9:30] <Be-El> if that works you might need to change all the symlinks
[9:31] <ivve> yeah thats no problem
[9:31] <ivve> but yeah, lets test
[9:31] <ivve> worst case scenario i just wipe these osds and let it rebuild
[9:31] <ivve> thing is i want to move journals from ssd back to sata(same as cephdata block dev)
[9:31] <peetaur2_> ivve: FYI you can also put the journal path in ceph.conf, eg. osd journal = /dev/disk/by-partlabel/osd.$id.journal
[9:31] <Be-El> and maybe also send a mail to the mailing list and ask whether this is intended behaviour. afaik the journals should use persistent device names for exactly the reason you've just encountered
[9:32] <ivve> yeah well it was hammer before
[9:32] <ivve> 0.94.7?
[9:32] <ivve> i think
[9:32] <ivve> and it didn't use partbyuuid
[9:33] <ivve> although 0.94.9 does
[9:33] <ivve> or maybe not
[9:33] * wgao (~wgao@ has joined #ceph
[9:36] <peetaur2_> I just partitioned myself and set a partlabel...or you can use partlabel too
[9:36] <peetaur2_> er uuid too
[9:37] <ivve> which is default when using prepare via ceph-deploy?
[9:37] <ivve> although that is linked to ceph-depoly version
[9:37] <ivve> i guess
[9:37] <peetaur2_> when I created my osds, I specifically gave it that partlabel, and either manually set a symlink (which worked but not during upgrade from hammer to jewel...) or with the ceph.conf
[9:38] <ivve> yeah thats what i did
[9:38] <ivve> 2 machines worked, 1 not so much
[9:38] <peetaur2_> I don't know what ceph-deploy does, but if you gave it /dev/sdc1 maybe it would keep that instead of looking up the uuid/partuuid/partlabel... dunno
[9:38] <ivve> but im gonna set bypartuuid on all of them now
[9:38] <ivve> latest ceph-deploy used uuid
[9:38] * Goodi (~Hannu@ Quit (Quit: This computer has gone to sleep)
[9:38] <ivve> with 10.2.3
[9:39] <ivve> well these errors are good, you always learn something
[9:39] <ivve> thanks a bunch, as usual Be-El :)
[9:39] <Be-El> yeah, digging through the guts of ceph ;-)
[9:39] <ivve> hehe indeed
[9:39] <ivve> but its good fun
[9:39] <Be-El> -> open source \o/
[9:39] <ivve> o7
[9:40] <ivve> just have to convince the rest of the guys at the office
[9:40] * derjohn_mobi (~aj@b2b-94-79-172-98.unitymedia.biz) has joined #ceph
[9:41] <ivve> btw, switching journal from ssd back to same data blockdevice, any tips other than migrating, zapping and recreating
[9:41] <ivve> its a tedious job with 0.5P
[9:42] <ivve> resizing partitions.. scary but if it fails i can always just lose that failed osd. however verification of data? deep scrub it?
[9:42] <ivve> resizing will be much faster
[9:42] <ivve> and its just 5gb
[9:43] * Sue__ (~sue@2601:204:c600:d638:6600:6aff:fe4e:4542) Quit (Ping timeout: 480 seconds)
[9:45] * nathani1 (~nathani@2607:f2f8:ac88::) has joined #ceph
[9:46] * nathani (~nathani@2607:f2f8:ac88::) Quit (Read error: Connection reset by peer)
[9:47] <T1w> hm..
[9:47] <T1w> does anyone know what this means?
[9:47] <T1w> libceph: osd3 socket closed (con state OPEN)
[9:47] <T1w> the OSD changes a bit at times, but they appear regularly on a client that has several RBD mapped via kernel
[9:50] * Mika_c (~Mika@ has joined #ceph
[9:52] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) has joined #ceph
[9:55] <ben1> T1w: google said before that it was from idle connections and harmless
[9:55] <T1w> ah
[9:55] <T1w> ok
[9:56] <ben1> it looks really sus though
[9:56] <T1w> yeah
[9:56] <T1w> that's what had me worried a bit
[9:56] <Be-El> ivve: flush the journal, stop the osd, resize the filesystem, create new journal partition on same device (i won't use journal files), update journal symlink, use ceph-osd --mkjournal, start osd
[9:57] <T1w> on the other hand, my cluster performs admirably, so
[9:57] <T1w> it's probably working as intended
[9:57] <ben1> heh
[9:57] <ivve> im guessing flush after stop?
[9:57] <ivve> or flush before stop
[9:57] <Be-El> ivve: ah, sure
[9:57] <Be-El> stop, then flush
[9:57] <ivve> makes sense :)
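A condensed sketch of the journal-move procedure Be-El describes, with the corrected ordering (stop first, then flush). The OSD id 7 and the partuuid are placeholders; verify the device layout before running any of this:

```shell
systemctl stop ceph-osd@7.service      # stop the OSD first
ceph-osd -i 7 --flush-journal          # then flush the old journal
# create the new journal partition on the data device (e.g. with sgdisk),
# then repoint the symlink at a persistent name:
ln -sf /dev/disk/by-partuuid/<new-partuuid> /var/lib/ceph/osd/ceph-7/journal
ceph-osd -i 7 --mkjournal              # initialize the new journal
systemctl start ceph-osd@7.service
```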
[9:58] <ivve> yeah im thinking this will be ok
[9:58] <ben1> i'm surprised more people don't complain about performance of ceph tbh
[9:58] <ivve> saves time
[9:58] <ben1> over the years, storage performance seems to be a never ending problem
[9:58] <ben1> but it seems when people go for ceph they generally go for "big enough" systems
[9:58] * TMM (~hp@ has joined #ceph
[9:59] <ivve> i think it performs quite well
[10:00] * garphy`aw is now known as garphy
[10:00] <Be-El> ben1: that's a fundamental problem with all 'scale out' storage solutions. they need to be big enough (enough hosts, enough bandwidth, enough spindles) to perform well
[10:00] <T1w> and it's hard to get kickass performance from a small test-setup
[10:01] <ben1> Be-El: yeh, but it doesn't seem that common for people to undersize
[10:01] <ben1> hey i got pretty good performance from my small test setup :)
[10:01] <ben1> i had misbalanced disks though . ceph ain't so great with that
[10:01] <ivve> its quite important to have enough cpu per osd
[10:01] * raphaelsc (~raphaelsc@ Quit (Ping timeout: 480 seconds)
[10:02] <ben1> heh i ran into slow ssd problems with my test setup
[10:02] <ivve> following the 1core/2gb per osd is quite good + some extra for air
[10:02] <ben1> cos i was using cheap and nasty ssd's
[10:02] <ben1> ending up having to go ssd per spindle pretty much
[10:02] <ivve> especially when using ec and tiering
[10:02] <ben1> ivve: 2gb per osd?
[10:02] * garphy is now known as garphy`aw
[10:03] <ben1> i thought the general recommendation was 1gb per ssd?
[10:03] <ben1> per osd
[10:03] * Sue_ (~sue@2601:204:c600:d638:6600:6aff:fe4e:4542) has joined #ceph
[10:03] <Be-El> the more the better
[10:04] <Be-El> ceph vs other storage is similar to linux vs. mac osx
[10:04] <ivve> well what Be-El said
[10:04] <ivve> :]
[10:04] <Be-El> mac osx runs on a defined hardware platform without much variation. it is optimized for exactly that case
[10:05] <ivve> but it comes down to be quite cpu intensive as well
[10:05] <ivve> when it comes to rebuilding
[10:05] <Be-El> linux on the other hand....well, we all know our hardware zoos
[10:05] <ivve> so, as with building any production environment
[10:05] <ivve> you need air
[10:06] <Be-El> any extra ram you can squeeze into a ceph box can be used as page cache -> less disk i/o, more serving from ram
[10:06] <ben1> i found ceph had bad seek rates
[10:06] <ben1> and yeh is a bit cpu hungry
[10:08] * DanFoster (~Daniel@office.34sp.com) has joined #ceph
[10:08] <ivve> agreed
[10:08] <Be-El> seek times are especially bad if you colocate journal and data. it's trashing the performance completely
[10:08] <ivve> however, if you underequip with cpu, you will have issues
[10:08] * wjw-freebsd (~wjw@smtp.digiware.nl) has joined #ceph
[10:08] <ivve> never less than 1/osd
[10:09] <ben1> oh even cached in ram data seek speeds are way worse than ssd ime
[10:09] <ivve> i have two machines that have problems due to that
[10:09] <ben1> i mean they're not unusably bad
[10:09] <ben1> but you really need parallel loads to get good random performance :)
[10:09] <ivve> osd's going suicide due to heartbeats not responding due to wait
[10:09] <ben1> i really wish more software was optimised for parallel disk accesses
[10:09] <ivve> during high loads
[10:09] <ben1> linux has buffer bloat issues with disks by default too
[10:10] <peetaur2_> ben1: are you using rbd or what to test performance? and what do you compare it to? Only free comparable thing I see out there tested well against it is glusterfs, which sucks at block devices and snapshots (what they do is block all IO and use lvm snapshots...a new feature :D), and people say it crawls in recovery situations unlike ceph.
[10:10] <Be-El> ben1: yeah, the single thread performance is comparably low
[10:10] <ben1> rbd yeah
[10:10] * Goodi (~Hannu@office.proact.fi) has joined #ceph
[10:10] <ben1> peetaur2_: compared to ping, and ssd
[10:10] <ivve> now i have to fix my journals, bbl
[10:10] <peetaur2_> (and people say that ceph scales out better... still fast or faster when huge, but I never saw any huge scale test data)
[10:11] <ben1> peetaur2_: i'm comparing to traditional storage yes
[10:12] <ben1> i suspect part of the single thread performance is due to cpu
[10:12] * raphaelsc (~raphaelsc@ has joined #ceph
[10:12] <Be-El> cpu, that fact that you talk to a single hdd in the worst case, encoding overhead, network overhead etc.
[10:12] <ben1> Be-El: well even if you write and read back and read back and read back and read back
[10:12] <ben1> to make sure it's in cache it'll still stay slow
[10:13] <ben1> i mean compared to ssd etc
[10:13] * karnan (~karnan@ Quit (Ping timeout: 480 seconds)
[10:14] <ben1> i think it is mostly cpu
[10:15] <ben1> now that all ssd storage etc is becoming more common it sticks out a bit more
[10:16] <ben1> but it may get drowned out on 7.2k disks
[10:17] <peetaur2_> ben1: compare it to something more similar...like put your ssd on a network server; but even then that's not similar enough... it isn't making 2 copies before returning from the sync write to keep it HA
[10:17] <Be-El> i hope that bluestore will help with the ssd performance. combined with a cache tier setup (ssd for performance, hdd for capacity)
[10:17] <ben1> peetaur2_: i'm looking at read speed rather than writ
[10:17] <ben1> write
[10:17] <ben1> write days can often be hidden
[10:17] <peetaur2_> with sync, write isn't hidden much
[10:17] <ben1> i haven't experimented with bluestore yet
[10:17] <Be-El> peetaur2_: i would prefer to compare ceph to emc or netapp storage solutions instead of local disks
[10:17] <ben1> delays argh
[10:18] <peetaur2_> (some examples change that, like virtualbox which doesn't sync 100% safely)
[10:18] <peetaur2_> Be-El: yes I agree
[10:19] <Be-El> we have one application that is trashing cephfs by doing lots and lots of 4k read requests (mmap'ed binary file with > 70GB size, binary search on file)
[10:19] * bara (~bara@nat-pool-brq-t.redhat.com) has joined #ceph
[10:19] <Be-El> compared to local disk cephfs is waaaay too slow, but thinking about what actually happens under the hood it is ok
[10:20] <Be-El> (and a EMC Isilon acting as NFS server is not significantly faster than our cephfs setup)
[10:21] <ben1> Be-El: did you experiment with cstates at all?
[10:21] <peetaur2_> Be-El: how much faster is the NFS thing, and did you keep rsize to a reasonably low level?
[10:21] <peetaur2_> (if you set rsize too high, the readahead kills random performance)
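One way to apply peetaur2_'s rsize advice on a kernel-client mount; the monitor address, secret file path, and 128 KiB value are illustrative, not from the channel:

```shell
# a modest rsize keeps readahead from penalizing small random reads
mount -t ceph mon1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret,rsize=131072
```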
[10:22] <Be-El> peetaur2_: both setups are not really comparable, and the nfs tests were done in a different environment. client connectivity should be 1 gbe in both cases, so the network is the bottleneck
[10:22] <ben1> i noticed about half the latency with 10gbe vs gbe
[10:22] <Be-El> peetaur2_: rsize for our cephfs setup is set to 8 mb (two chunks with default encoding)
[10:23] <ben1> for small non bw heavy requests
[10:24] <Be-El> ben1: and no, no specific cstate setup yet
[10:24] <ben1> 10gbe is probably a better first step
[10:24] <ben1> if possible :)
[10:25] <Be-El> we are about to purchase our own isilon (*sigh*), and i'll definitely do some benchmarks with it
[10:25] * karnan (~karnan@ has joined #ceph
[10:29] <arthur> i have my test ceph cluster running with 8 ssd's spread out over 6 nodes and get about 22k iops (4k randwrite), cpu's are really busy
[10:30] <arthur> inside a vm with rbd storage
[10:30] * dgurtner (~dgurtner@84-73-130-19.dclient.hispeed.ch) has joined #ceph
[10:30] * mattch (~mattch@w5430.see.ed.ac.uk) has joined #ceph
[10:32] <peetaur2_> arthur: anything special in the setup? I have seen benchmarks where the ssd-only pools didn't look that good
[10:32] <peetaur2_> maybe you didn't test sync writes?
[10:33] <Be-El> for ssd based osd i would like to have a mode without journal (and enforced sync on the filesystem)
[10:33] <arthur> peetaur2_: i use fio with --direct=1 (O_DIRET)
[10:33] <arthur> O_DIRECT
[10:34] <Be-El> arthur: that's only skipping the page cache in the vm, but librbd may also be caching data (depending on the configuration)
[10:35] <peetaur2_> yep, so be sure to test with sync=1 also
[10:37] * fdmanana (~fdmanana@2001:8a0:6e0c:6601:4418:f044:c01f:37d4) has joined #ceph
[10:37] <Be-El> also performance related: what's necessary to switch to jemalloc for osd? setting the option in /etc/sysconfig/ceph does not seem to be sufficient (centos 7.2, jewel release)
[10:37] <arthur> peetaur2_: sync=1 halves the no of iops i get
[10:39] * karnan (~karnan@ Quit (Ping timeout: 480 seconds)
[10:39] <peetaur2_> ok sounds closer to the expected amount then
[10:40] <arthur> running 0.94.9 atm
[10:43] * shengping (~shengping@ has joined #ceph
[10:43] * shengping (~shengping@ has left #ceph
[10:45] * rotbeard (~redbeard@ has joined #ceph
[10:45] * raphaelsc (~raphaelsc@ Quit (Read error: Connection reset by peer)
[10:48] <peetaur2_> hmm a test that had 805-925 iops with hammer now has 468-518 iops with jewel
[10:48] * bara (~bara@nat-pool-brq-t.redhat.com) Quit (Read error: Connection reset by peer)
[10:50] * Tusker (~tusker@CPE-124-190-175-165.snzm1.lon.bigpond.net.au) has joined #ceph
[10:50] <Tusker> heya guys
[10:50] * karnan (~karnan@ has joined #ceph
[10:52] <ivve> Be-El: resizing the partition if it's XFS? can't shrink it, no?
[10:52] <peetaur2_> last I checked, you could not shrink XFS
[10:52] * bara (~bara@nat-pool-brq-t.redhat.com) has joined #ceph
[10:53] <ivve> exactly my problem :)
[10:53] <ivve> so moving the journal to cephdata disk won't be possible
[10:53] <ivve> since cephdata resides in xfs
[10:54] <peetaur2_> so copy it somewhere else, mkfs, cp back
[10:54] <peetaur2_> which I'm sure will take really long
[10:55] <ivve> then might as well go to plan A
[10:55] <Be-El> ivve: or use a journal file instead of a partition
[10:55] <Be-El> but performance with files will be lower due to filesystem overhead
[10:56] <Be-El> you should test it with one osd and do some benchmarking (ceph tell osd.X bench) to get an overview of the osd performance afterwards
[10:56] <Tusker> i am having an issue with ceph health_err... 194 pgs degraded, 1 pgs inconsistent, 1 pgs recovering, 193 pgs recovery_wait, 192 pgs stuck unclean
[10:56] <Tusker> this is a production system, with 6 OSD, 3 node ceph cluster
[10:57] <Mika_c> Hi all, I have a small ceph cluster with version 0.94.3 (1 mon and 3 osd servers). Because of a power outage, I found the monitor can not execute ceph commands anymore.
[10:57] <ivve> yeah ill just migrate off all data with crush instead and zap the disks and do it from scratch
[10:57] <Mika_c> And there is a process "/usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i mona", so it looks like the monitor is not in quorum. Any idea how to fix this?
[10:57] <Be-El> Mika_c: have a look at the mon logs first
[10:58] <Be-El> Tusker: any recent problems with osd?
[10:59] <Tusker> Be-El: there was a server that was hard rebooted
[10:59] <Kdecherf> hello there
[10:59] <Be-El> Tusker: and all osds are back?
[10:59] <Kdecherf> is there a way to have a kind-of iotop for rbd volumes?
[10:59] <Tusker> yeah: osdmap e624: 6 osds: 6 up, 6 in; 1 remapped pgs
[11:00] <Tusker> ceph pg <id> query hangs
[11:00] <Be-El> Kdecherf: there's 'ceph osd pool stat', but i'll only give you i/o information on pool level
[11:00] <Mika_c> Be-El:Yes, so many duplicated messages like " 7f9f08246700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished"
[11:01] * Danjul (~Danjul@ has joined #ceph
[11:02] * Animazing (~Wut@ Quit (Ping timeout: 480 seconds)
[11:02] <Be-El> Mika_c: and the ceph-mon process is running?
[11:03] <Be-El> Tusker: seems to be problem with some of the osds handling that pg
[11:04] <Be-El> Tusker: can you put the complete output of 'ceph -s' to a pastebin?
[11:04] <Mika_c> Be-El: Yes. But something strange. Using strace to check the process then return "lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)"
[11:05] * Concubidated (~cube@ Quit (Quit: Leaving.)
[11:05] * karnan (~karnan@ Quit (Ping timeout: 480 seconds)
[11:05] * ivve (~zed@cust-gw-11.se.zetup.net) Quit (Ping timeout: 480 seconds)
[11:05] <Be-El> Mika_c: if the process is running you can check its state using the local daemon socket
[11:05] <Be-El> Mika_c: ceph daemon mon.XYZ mon_stat
[11:05] * ivve (~zed@cust-gw-11.se.zetup.net) has joined #ceph
[11:07] <peetaur2_> and also from upgrading hammer->jewel, my cephfs test went from 712 iops to 1952 (which I think my storage is incapable of... cluster is 3 vms with one disk each, plus sharing the same ssd as a journal)
[11:07] <peetaur2_> fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randwrite -size=1G -filename=ssd.test.file -name="test" -iodepth=64 -runtime=30 -sync=1
[11:11] <Tusker> be-el: http://www.pastebin.ca/3723303
[11:12] <Mika_c> Be-El: This monitor state should be "leader", but right now is "probing". And already out of quorum.
[11:14] * Animazing (~Wut@ has joined #ceph
[11:15] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) Quit (Quit: doppelgrau)
[11:16] <Tusker> http://pastebin.ca/3723304 is probably helpful too (ceph pg dump_stuck unclean)
[11:16] * karnan (~karnan@ has joined #ceph
[11:18] * Green (~Green@ Quit (Ping timeout: 480 seconds)
[11:21] * b0e (~aledermue@ has joined #ceph
[11:21] * lkoranda (~lkoranda@nat-pool-brq-t.redhat.com) Quit (Quit: Splunk> Be an IT superhero. Go home early.)
[11:22] * walcubi (~walcubi@p5795BD11.dip0.t-ipconnect.de) Quit (Quit: Leaving)
[11:22] * walcubi (~walcubi@p5795BD11.dip0.t-ipconnect.de) has joined #ceph
[11:22] * lkoranda (~lkoranda@nat-pool-brq-t.redhat.com) has joined #ceph
[11:25] * Green (~Green@ has joined #ceph
[11:29] * tserong (~tserong@203-214-92-220.dyn.iinet.net.au) Quit (Quit: Leaving)
[11:32] * kefu_ (~kefu@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[11:36] * bara (~bara@nat-pool-brq-t.redhat.com) Quit (Quit: Bye guys!)
[11:39] * bara (~bara@ has joined #ceph
[11:41] <Tusker> be-el: any idea ?
[11:41] * karnan (~karnan@ Quit (Ping timeout: 480 seconds)
[11:42] <Be-El> Tusker: did the running recovery finish in the mean time?
[11:43] <Be-El> Mika_c: can you put the complete output to some pastebin?
[11:50] * karnan (~karnan@ has joined #ceph
[11:51] * doppelgrau (~doppelgra@ has joined #ceph
[11:54] * rmart04 (~rmart04@support.memset.com) has joined #ceph
[12:06] * vicente (~~vicente@125-227-238-55.HINET-IP.hinet.net) Quit (Quit: Leaving)
[12:07] * fridim (~fridim@56-198-190-109.dsl.ovh.fr) Quit (Ping timeout: 480 seconds)
[12:10] <s3an2> When upgrading from 10.2.X to 10.2.X I typically update all the packages, restart the mon's then the OSD's and RGW's and it works well. In a cluster here I now have MDS servers I assume these are just restarted after the mon's and OSD's - is there anything else I should really consider?
[12:11] <peetaur2_> if there's more to consider, it's in the release notes
[12:12] <peetaur2_> and also it says when to restart mds in this numbered list here http://docs.ceph.com/docs/master/install/upgrading-ceph/
[12:19] * fridim (~fridim@56-198-190-109.dsl.ovh.fr) has joined #ceph
[12:33] * tserong (~tserong@203-214-92-220.dyn.iinet.net.au) has joined #ceph
[12:35] * bara (~bara@ Quit (Ping timeout: 480 seconds)
[12:46] * bara (~bara@nat-pool-brq-t.redhat.com) has joined #ceph
[12:51] * Mika_c (~Mika@ Quit (Remote host closed the connection)
[12:57] <Tusker> be-el: http://pastebin.ca/3723330
[13:02] <Be-El> Tusker: i would propose to restart one of the affected osds and check their log for errors
[13:03] <Tusker> from what I can see, all osd's are affected...
[13:03] <Tusker> how can I restart one osd at a time ?
[13:03] * shengping (~shengping@ has joined #ceph
[13:03] <Be-El> osd 7-12 are shown in the ceph pg dump
[13:04] <Tusker> that's all that exist as far as I can see
[13:04] * salwasser (~Adium@2601:197:101:5cc1:190d:bbb1:7a9a:1ec5) has joined #ceph
[13:05] <peetaur2_> I would use ceph pg dump_stuck to see the id that is "inconsistent" and then do ceph pg repair <pg_id> (which will possibly choose a bad copy of it that failed scrub to fix the missing copy... possibly normal in a size=2 pool where 1 is missing, so there's no 2/3 consensus to check)
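The inspect-then-repair flow peetaur2_ outlines, as concrete commands (the pg id is a placeholder; on a size=2 pool, check which copy repair will keep before trusting the result):

```shell
ceph pg dump_stuck unclean     # list stuck PGs and their states
ceph health detail             # names the PG that is 'inconsistent'
ceph pg repair <pg_id>         # substitute the inconsistent PG's id
```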
[13:05] * goretoxo (~psilva@ has joined #ceph
[13:05] <Be-El> Tusker: ah ok, so there are no osd 1-6?
[13:06] <Tusker> correct
[13:06] * salwasser (~Adium@2601:197:101:5cc1:190d:bbb1:7a9a:1ec5) Quit ()
[13:06] * goretoxo (~psilva@ Quit ()
[13:06] <Be-El> Tusker: which operation system do you use?
[13:06] <Tusker> peetaur2_: already tried that
[13:07] <Tusker> Be-El: operation system ?
[13:07] * bara (~bara@nat-pool-brq-t.redhat.com) Quit (Ping timeout: 480 seconds)
[13:09] <Be-El> Tusker: yes, since the commands for restarting osds differs between systemd and upstart init systems
[13:11] <Tusker> systemd
[13:12] <Be-El> in that case the command for restarting should be systemctl restart ceph-osd@7.service
[13:13] <Tusker> does it matter which node it is run on ?
[13:16] * shengping (~shengping@ has left #ceph
[13:16] <Be-El> it should be run on the node the osd is located
[13:17] <Tusker> OK
[13:17] <Tusker> I did that
[13:17] <Tusker> do it one by one ?
[13:17] <Be-El> the log is located in /var/log/ceph/
[13:18] <Be-El> first have a look at the log and check that the osd comes back and is operational
[13:18] <Tusker> well, ceph osd tree shows all 6 up
[13:19] <Be-El> does recovery start of some of the pgs?
[13:21] <Tusker> hmm... that didn't seem to work
[13:21] <Be-El> does the log indicate that the osd is operational?
[13:22] <Tusker> systemd complained it couldn't restart
[13:22] <Tusker> and didn't see anything in the logs that indicated that it tried
[13:23] <Be-El> what's the reason systemd is giving?
[13:24] <Tusker> Sep 29 19:20:51 stor01 systemd[1]: ceph-osd@7.service start request repeated too quickly, refusing to start.
[13:27] <Be-El> that's not the reason the osd didn't start. there may be more information in the system log
[13:27] <Tusker> Sep 29 18:03:36 stor01 ceph[20074]: Running as unit ceph-osd.7.1475143415.259280386.service.
[13:27] <Tusker> does it matter that the service name is long ?
[13:28] <Tusker> Sep 29 19:13:00 stor01 ceph-osd[23872]: 2016-09-29 19:13:00.563175 7f190fbff800 -1 osd.7 0 OSD::pre_init: object store '/var/lib/ceph/osd/ceph-7' is currently in use. (Is ceph-osd already running?)
[13:28] <Be-El> ok the osd is not stopped yet
[13:29] <Tusker> http://pastebin.ca/3723339
[13:29] <Be-El> so you need to stop the osd daemon first
[13:29] <Tusker> systemctl stop ceph-osd@7.service ?
[13:30] <Be-El> nope, the instance is from a former setup. you need the long name
[13:30] <Tusker> ceph-osd.7.1475143415.259280386.service ?
[13:30] * dgurtner (~dgurtner@84-73-130-19.dclient.hispeed.ch) Quit (Quit: leaving)
[13:30] * dgurtner (~dgurtner@84-73-130-19.dclient.hispeed.ch) has joined #ceph
[13:31] <Be-El> yes
[13:31] <Tusker> Sep 29 19:31:11 stor01 bash[20335]: 2016-09-29 19:31:11.848260 7fafa1971700 -1 osd.7 648 *** Got signal Terminated ***
[13:32] <Tusker> and start it with the short name ?
[13:32] <Be-El> yes
[13:33] <Tusker> "start too quickly" error again
[13:33] <Tusker> probably have to start with the long name then...
[13:33] <Be-El> and the ceph log?
[13:33] <Be-El> nope
[13:34] <peetaur2_> I would start it manually in foreground... ceph-osd -f -i {id} [--setuser ceph --setgroup ceph]
[13:34] <Be-El> just ensure that the process is terminated
[13:34] <Tusker> 2016-09-29 19:31:11.848427 mon.0 2440 : cluster [INF] osd.7 marked itself down
[13:34] <peetaur2_> to get a full command example, us eps -ef and look at a good osd, and take that, change id, and add -f
[13:34] <Be-El> Tusker: before you restart it....what's the current state of the ceph -s?
[13:34] <peetaur2_> (systemd is the most dreadful thing to deal with when debugging)
[13:35] <Be-El> systemd is a piece of crap and poettering should avoid dark alleys....
[13:35] <Tusker> http://pastebin.ca/3723340
[13:36] <Be-El> does ceph -w show any recovery i/o?
[13:36] <Tusker> showing more and more objects degraded
[13:37] <Be-El> no recovery i/o at all?
[13:37] <Tusker> can't see any
[13:37] * Lokta (~Lokta@carbon.coe.int) Quit (Ping timeout: 480 seconds)
[13:38] <Be-El> any suspicious messages in the osd logs?
[13:38] <Tusker> recovery_wait
[13:38] <Tusker> just lots of slow requests
[13:39] * bniver (~bniver@71-9-144-29.static.oxfr.ma.charter.com) has joined #ceph
[13:39] <Tusker> 2016-09-29 19:39:28.250435 osd.12 [WRN] slow request 495.292281 seconds old, received at 2016-09-29 19:31:12.958083: osd_op(client.1900519.0:1946452 14.a7dcdb6f rbd_data.266c492ae8944a.000000000000013c [set-alloc-hint object_size 4194304 write_size 4194304,write 2265088~8192] snapc 0=[] RETRY=40 ack+ondisk+retry+write e649) currently waiting for missing object
[13:40] <Tusker> lots of things like that, waiting for missing objects
[13:40] <Be-El> that's because the recovery does not start
[13:40] <Be-El> try to restart osd 7 now and watch the state of the cluster with ceph -w
[13:41] <Tusker> doesn't want to start with the long command...
[13:41] <Tusker> should I do a service ceph restart on it ?
[13:42] <Tusker> Unit ceph-osd.7.1475143415.259280386.service failed to load: No such file or directory.
[13:43] <Tusker> "1 pgs recovering" in ceph -s now
[13:43] <Tusker> but the osd is down
[13:44] <Tusker> should I do "/usr/bin/ceph-osd -i 7 --pid-file /var/run/ceph/osd.7.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f" ?
[13:45] <Be-El> do not use the long service names
[13:45] <Be-El> you can try to run it in the foreground, too
[13:46] <Tusker> running in the foreground now
[13:47] * ira (~ira@c-24-34-255-34.hsd1.ma.comcast.net) has joined #ceph
[13:50] <Be-El> any changes to the cluster state?
[13:51] <Tusker> no
[13:51] <Be-El> anything in the ceph osd logs?
[13:52] <Tusker> just "waiting for object" etc
[13:52] * peetaur2 (~peter@i4DF67CD2.pool.tripleplugandplay.com) has joined #ceph
[13:53] <peetaur2> sigh...why does ceph in vms make my machine hang
[13:54] <ira> peetaur2: Go back a step. Why does ANY vm make your machine hang...
[13:54] <Be-El> Tusker: ok, last idea is restarting all osds, since for some reason the backfill/recovery does not kick in
[13:54] <peetaur2> only ceph vms can do it
[13:54] * Hatsjoe (~Hatsjoe@2a01:7c8:aaba:4f8:5054:ff:fe6e:d37d) Quit (Quit: ZNC 1.6.3 - http://znc.in)
[13:54] <ira> ... Why can any VM, no matter how badly behaved, do it.
[13:54] <ira> ceph is a symptom here :)
[13:54] <peetaur2> sometimes it's possible due to bugs...probably a cpu, kernel, or qemu bug
[13:54] * Hatsjoe (~Hatsjoe@2a01:7c8:aaba:4f8:5054:ff:fe6e:d37d) has joined #ceph
[13:54] <Be-El> peetaur2: problems with the amount of sync calls on the block devices?
[13:55] <peetaur2> such a thing when properly discovered would be classified as a DoS security bug
[13:55] <ira> I run a full ceph cluster on my machine.
[13:55] <ira> via vagrant.
[13:55] <peetaur2> Be-El: heavy IO seems not to matter...best is when I kill some ceph-osd processes or play with min_size
[13:55] <Tusker> be-el: i tried restarting ceph fully on nodes, is that the same as just restarting each osd ?
[13:55] <ira> peetaur2: Are you giving it enough memory/swap
[13:55] <ira> ?
[13:55] <Tusker> or, is it important that the mon is running while i restart each osd ?
[13:56] <peetaur2> ira: if I didn't give it enough swap, oom killer will just kill vms... and yeah I have 16GB and gave each VM 2GB
[13:56] <peetaur2> s/swap/ram/
[13:56] <ira> peetaur2: How many VMs?
[13:57] <ira> What else are you doing...
[13:57] <peetaur2> I was trying to reproduce what the guy on the ML said with his cephfs and web servers... I verified it works gracefully with min_size 1 on rbd, and then when I did ls in my cephfs, that command hung, then 10s or so later, whole machine hung
[13:58] <ira> peetaur2: When I setup LARGE clusters for Gluster, I can make my machine "appear" to hang. Via pushing it into swap hard.
[13:58] <peetaur2> ira: 3 kvm vms, with 1 osd each, and sharing an ssd for journals...so 3osd per journal hw... 3x mon, 3x osd, 2x mds (one standby); I had another qemu running with an rbd disk, and cephfs was mounted
[13:58] * peetaur2_ (~peter@i4DF67CD2.pool.tripleplugandplay.com) Quit (Ping timeout: 480 seconds)
[13:58] <ira> peetaur2: Cut your VM size down to 1GB each.
[13:58] <ira> For a 16GB machine that's way too much memory pressyre.
[13:59] <ira> pressure.
[13:59] <ira> (Especially if you do things like talk to me on irc, use firefox... etc.)
[13:59] <ira> It all adds up.
[14:00] <Be-El> Tusker: mons should be ok and running during the restart, otherwise the osds won't be able to contact the cluster
[14:00] * valeech (~valeech@pool-96-247-203-33.clppva.fios.verizon.net) has joined #ceph
[14:00] <ira> (I run a full setup like what you describe + 3 samba VMs... on a 32 GB machine, and I can feel the memory pressure.)
[14:01] <Be-El> peetaur2: cephfs mount via ceph-fuse?
[14:01] <peetaur2> Be-El: no, kernel
[14:01] <peetaur2> but the cephfs mount is not required... it can hang without it
[14:01] <Be-El> peetaur2: bad idea if osd are colocated on the same machine
[14:02] <peetaur2> and btw now it can't start mons... open: failed to open pid file '/var/run/ceph/mon.ceph1.pid': (13) Permission denied
[14:02] <peetaur2> it's first reboot since upgrading to jewel, but restarting all services worked before
[14:02] <peetaur2> Be-El: bad performance I'm sure, but hang? never
[14:02] * Lokta (~Lokta@Link01.WAN.ILL1FR.e-supinfo.net) has joined #ceph
[14:03] <peetaur2> it's for testing... so I wanted 3 nodes; so I have 3 hdds for 3 nodes
[14:03] <peetaur2> maybe I could add 3 more disks too
[14:03] <Be-El> peetaur2: the kernel client and the osds are fighting for ram (page cache for cephfs vs. page cache for osds).
[14:04] <peetaur2> I have 3 x 2U machines with 12 x 3.5" + 2 x 2.5" hot swap bay chassis on order... should be here next week. :)
[14:04] <peetaur2> I'm pretty sure there's plenty of ram
[14:04] <peetaur2> and oom killer should just start killing, not hanging things
[14:04] <Be-El> nope, oom killer does not even notice it
[14:05] <Be-El> cephfs is blocked in requesting data from the osds, which in turn needs to request memory for the file system, which tries to release page cache elements, which needs to be acknowledged by cephfs, which...oh, deadlock
[14:06] <Be-El> we have 256 gb of ram in our osd hosts. it takes 5 minutes of cephfs activity on these hosts to make them unusable -> hard reboot
[14:07] <peetaur2> well I also said that I don't need to mount cephfs to cause it
[14:07] * T1w (~jens@node3.survey-it.dk) Quit (Ping timeout: 480 seconds)
[14:08] * jonas1 (~jonas@ has joined #ceph
[14:08] <Be-El> it was a more general remark. mixing osds and kernel based ceph client (cephfs/rbd) on the same host calls for trouble
[14:08] <Be-El> user space + osd is ok
[14:08] <peetaur2> and in my case, the client is the host, and the osds are guest vms, so they aren't in the same kernel
[14:08] <peetaur2> and now this mysterious problem again when starting an osd... filestore(/var/lib/ceph/osd/ceph-0) mount failed to open journal /dev/disk/by-partlabel/osd.0.journal: (13) Permission denied
[14:09] <peetaur2> also still the /var/run/ceph owned by root problem still exists.... I just did chown to temporarily solve it
[14:09] <peetaur2> chown on journals would be a pretty hacky fix. The ceph-osd process should just open the journal before dropping privs....why doesn't it?
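One possible cause of the journal permission error, offered as an assumption rather than a confirmed diagnosis: Jewel's packaged udev rules chown journal partitions to ceph:ceph only when the partition carries Ceph's journal GPT type GUID, so a hand-partitioned journal keeps root ownership after every reboot. Tagging the partition fixes ownership persistently (partition number and device are placeholders):

```shell
# 45b0969e-... is Ceph's "journal" GPT partition type GUID,
# which the shipped udev rules match to set ceph:ceph ownership
sgdisk --typecode=1:45b0969e-9b03-4f30-b4c4-b4b80ceff106 /dev/sdX
partprobe /dev/sdX   # re-read the partition table
```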
[14:09] <Tusker> be-el: http://pastebin.ca/3723348 that's after all osd are stopped and started
[14:11] <Be-El> Tusker: which ceph release do you use?
[14:11] <Tusker> 10.2.1-1~bpo80+1
[14:12] <Be-El> and why do you use the ancient tunables?
[14:12] <Be-El> part of the problem seems to be crush related, with being unable to find enough osds for some of the pgs
[14:13] <Tusker> this system was set up ages ago... I can't tell you the answer regarding the tunables
[14:14] <peetaur2> one reason may be that they don't mention the fix "ceph osd crush tunables optimal" anywhere in upgrade docs, release notes, etc. and it's hard to find online
[14:14] <Tusker> just type that ?
[14:14] <peetaur2> that is what you should have typed long ago before the problem.. I don't know if now is a good time
[14:14] <Be-El> Tusker: does ceph pg query for one of the pg gives you any output?
[14:14] <Be-El> i would do that right now, since it may make it worse
[14:15] <Tusker> be-el: it hung last time
[14:15] <Tusker> wouldn't ? would ?
[14:15] <Be-El> wouldn't ;-)
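(For anyone following along, a sketch of how to inspect the crush tunables before changing anything; switching profiles on a live cluster triggers a large rebalance, so it is not something to run casually.)

```shell
# show the profile and individual tunable values currently in effect
ceph osd crush show-tunables

# only after confirming all clients and kernels are new enough:
# move to the recommended profile for the running release.
# expect significant data movement afterwards.
ceph osd crush tunables optimal
```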
[14:15] <Be-El> and does it still hang?
[14:15] <Tusker> let me see
[14:16] <Tusker> on one of the degraded ones ?
[14:16] * Racpatel (~Racpatel@2601:87:3:31e3::4d2a) has joined #ceph
[14:16] <Be-El> for example
[14:16] <Be-El> maybe also on one of the active+clean ones
[14:16] * Racpatel (~Racpatel@2601:87:3:31e3::4d2a) Quit ()
[14:16] * Racpatel (~Racpatel@2601:87:3:31e3::4d2a) has joined #ceph
[14:17] <Tusker> not hanging now
[14:17] <Tusker> want the output ?
[14:17] <Be-El> of one of the degraded ones, yes
[14:17] <Tusker> http://pastebin.ca/3723350
[14:20] <Be-El> have a look at the peer_info section. the stats listed for each of the replicates in that section should be identical
[14:20] <Be-El> but there are no values for osd 8
[14:20] <peetaur2> so....now to hang it again without cephfs. :)
[14:21] <Be-El> Tusker: that's why the pg is degraded; but it should be backfilling or backfill_wait
[14:21] <Tusker> so, how to coax it into action? :)
[14:22] * bara (~bara@ip4-83-240-10-82.cust.nbox.cz) has joined #ceph
[14:22] <peetaur2> why should "might_have_unfound" have 7,8,11, but the rest says 7,9,11 ? :/
[14:23] * dneary (~dneary@nat-pool-bos-u.redhat.com) has joined #ceph
[14:23] <Be-El> oh, i mixed up osd 8 and 9
[14:23] <Be-El> peetaur2: maybe the unfound objects are on osd 8 (e.g. if one of the replicates was on that osd before)
[14:24] <Be-El> Tusker: one possible attempt would be increasing the number of parallel backfills, since it might trigger recovery/backfilling
[14:25] <Tusker> OK, how? :)
[14:25] <Be-El> Tusker: "ceph tell osd.* injectargs '--osd-max-backfills=2'"
[14:26] <Tusker> that's taking time to come back...
[14:26] <Be-El> it should print a message for each osd
[14:27] <Be-El> are you sure the network connectivity is ok?
[14:27] <Tusker> yeah, ping is no problem to each node
[14:28] <Be-El> and direct connections to the osds and between the osds (on the right ports of course....)
[14:30] <Be-El> also test with larger packet sizes. the whole problem smells like a mtu / connectivity problem on the cluster network
[14:30] * mattbenjamin1 (~mbenjamin@76-206-42-50.lightspeed.livnmi.sbcglobal.net) has joined #ceph
[14:30] <Be-El> did you change anything on the network setup?
[14:31] <Tusker> not recently
[14:31] <Tusker> but, the guys who own the hardware, moved it between data centres
[14:32] <Be-El> so a different switch with a different configuration
[14:33] <Tusker> ping -s 9000 < seems fine
[14:33] <Be-El> the basic connectivity between the nodes seems to be ok (osd heartbeats are being received), but there seems to be little to no data traffic
[14:33] <Tusker> yeah, i was watching iftop before, and it's hardly doing any traffic
[14:34] <Be-El> do you use different networks for public/cluster network?
[14:35] <Tusker> yes
[14:36] <Be-El> can you verify that you are using the cluster network if you use ping with a large packet size between two osd nodes?
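(Worth noting: a plain `ping -s 9000` can look fine even when jumbo frames are broken, because the packets are silently fragmented. A stricter check sets the don't-fragment bit and subtracts the 28 bytes of IP+ICMP headers; this sketch assumes Linux iputils ping, a 9000-byte MTU, and a placeholder peer address.)

```shell
# -M do: set the DF bit and fail instead of fragmenting
# payload = 9000 (MTU) - 20 (IP header) - 8 (ICMP header) = 8972
ping -c 3 -M do -s 8972 CLUSTER_IP_OF_PEER

# if this fails while plain ping succeeds, something in the path
# (switch port, VLAN) is not passing jumbo frames
```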
[14:38] * wjw-freebsd (~wjw@smtp.digiware.nl) Quit (Ping timeout: 480 seconds)
[14:40] <Tusker> interesting, no activity on the 10.1. network
[14:42] * mhack (~mhack@24-151-36-149.dhcp.nwtn.ct.charter.com) has joined #ceph
[14:43] <Tusker> but, the network still performs well as far as I can see
[14:43] <Be-El> tusker: is the public network the same as the cluster one?
[14:43] <Tusker> as far as I can see, yes
[14:43] <Tusker> the load balancer is on
[14:44] <Be-El> recovery/backfilling/almost-all-osd-osd-interaction only take place on the cluster network
[14:45] <Tusker> hmm...
[14:45] <Tusker> how do we check that ?
[14:45] <Be-El> check with standard ping sizes, since osds are not reported as down -> osd heartbeats are ok
[14:45] <Tusker> all ping seems fine
[14:45] <arthur> given the choice of using a multiple hba vs single hba connected to a sas expander backplane, which configuration would be preferred when deploying ceph?
[14:46] <Be-El> Tusker: also with larger sizes (-> mtu)?
[14:46] <Be-El> arthur: is the hba able to handle the i/o of all connected drives (including the pci-e link to the cpu)?
[14:46] <arthur> i plan on connecting 2/3 ssd and 9/10 spinning disks
[14:47] <Be-El> Tusker: and check firewall settings/acls on the switch
[14:47] <Tusker> be-el: yeah, 9000 ping size on both networks responds fine
[14:50] <Be-El> Tusker: you can also check connectivity with nc, e.g. pasting a large file / running dd with output to nc between two nodes
[14:50] <Be-El> binding to the cluster network is the important part of this test
[14:50] <Tusker> using iperf enough ?
[14:51] <Be-El> iperf should also be ok
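(To be sure iperf really crosses the cluster network rather than the public one, bind both ends to cluster-network addresses; the 10.1.0.x addresses below are placeholders.)

```shell
# on the receiving node: bind the server to its cluster-network address
iperf -s -B

# on the sending node: connect to the peer's cluster address and also
# bind the local end to this node's cluster address
iperf -c -B
```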
[14:51] <arthur> Be-El: I'm looking at a LSI SAS 9300-4i atm which supports 4x 12Gb/s and eight-lane full-duplex pcie 3.0 performance (64Gb/s)
[14:52] <Tusker> [ 4] 0.0-10.0 sec 1.10 GBytes 941 Mbits/sec
[14:52] <Be-El> arthur: and the expander is connected to one 12Gb/s connector?
[14:52] <Tusker> same performance on both networks
[14:53] <arthur> Be-El: that would be a multi-lane connector, so 4x12Gb/s I guess
[14:53] <Be-El> Tusker: did you check that the traffic was sent via the correct interfaces (nmon / bmon/ iptraf) ?
[14:53] <Tusker> i bound to the interface
[14:54] * igoryonya (~kvirc@ Quit (Read error: Connection reset by peer)
[14:55] <Tusker> 60 bytes from 0c:c4:7a:46:1e:2f ( index=0 time=25.696 msec
[14:55] <Tusker> 60 bytes from 0c:c4:7a:46:1e:2e ( index=1 time=25.727 msec
[14:55] <Tusker> interesting, they've not configured the VLAN properly
[14:55] <Tusker> both networks are on the same VLAN, and hence responding to ARP from both MAC addresses
[14:56] <Be-El> arthur: 4x 12Gbit/s => 6 Gbyte/s... seems ok to me
[14:57] <Tusker> regardless, the transfer speed between the nodes is very fast
[14:58] <Be-El> Tusker: you need to fix that if both ethernet interfaces are bound to different ip addresses (no trunk etc.)
[14:58] <Tusker> yes, correct
[14:58] <Be-El> in case of lacp trunks there should only be one reply
[14:59] <Tusker> agree
[14:59] <Be-El> arthur: do you plan multiple sockets?
[15:01] * topro (~prousa@ Quit (Remote host closed the connection)
[15:02] * topro (~prousa@p578af414.dip0.t-ipconnect.de) has joined #ceph
[15:03] * Lokta (~Lokta@Link01.WAN.ILL1FR.e-supinfo.net) Quit (Ping timeout: 480 seconds)
[15:03] <arthur> Be-El: Yes
[15:05] <Be-El> arthur: not completely sure about this, but you also might want to consider distributing i/o load over multiple cpu (e.g. one cpu handles disks, one cpu networks etc.)
[15:06] <Be-El> arthur: i've put the nvme pci-e ssds on the second cpu in our setup, and raid controller + network on the first one. but i'm not sure whether this is a good setup (too much traffic between cpus?)
[15:07] <arthur> Be-El: Good point about traffic/latency between cpu's
[15:09] <Be-El> unfortunately there's no perfect solution except a verrry large cpu with enough cores to handle i/o for networks, raids, pci-e and all osd threads locally ;-)
[15:09] <arthur> this video has good info on input/output on x86 servers https://www.youtube.com/watch?v=s9CUDhE19v0
[15:09] <Be-El> (ceph on knights landing.... *g* )
[15:10] <TMM> quickpath should still be faster than any network or nvme device combined
[15:10] <TMM> I wouldn't be too worried about it :)
[15:10] <TMM> Or stop futzing around with x86 if your i/o requirements really are that high
[15:10] <TMM> get a nice POWER8 box from IBM
[15:13] <TMM> Most likely just keeping those tasks pinned to the same numa node will help you more than enough to offset any quickpath slowness
[15:16] * vimal (~vikumar@ Quit (Quit: Leaving)
[15:17] * ivve (~zed@cust-gw-11.se.zetup.net) Quit (Ping timeout: 480 seconds)
[15:18] * brad_mssw (~brad@ has joined #ceph
[15:21] * mattbenjamin1 (~mbenjamin@76-206-42-50.lightspeed.livnmi.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[15:23] * derjohn_mobi (~aj@b2b-94-79-172-98.unitymedia.biz) Quit (Ping timeout: 480 seconds)
[15:33] * rwheeler (~rwheeler@pool-108-7-196-31.bstnma.fios.verizon.net) Quit (Quit: Leaving)
[15:35] * salwasser (~Adium@ has joined #ceph
[15:37] * wes_dillingham (~wes_dilli@ has joined #ceph
[15:39] * yanzheng (~zhyan@ Quit (Quit: This computer has gone to sleep)
[15:39] * shaunm (~shaunm@ has joined #ceph
[15:39] * rdas (~rdas@ Quit (Quit: Leaving)
[15:41] * derjohn_mobi (~aj@b2b-94-79-172-98.unitymedia.biz) has joined #ceph
[15:42] * Tusker (~tusker@CPE-124-190-175-165.snzm1.lon.bigpond.net.au) Quit (Ping timeout: 480 seconds)
[15:45] * doppelgrau1 (~doppelgra@ has joined #ceph
[15:48] * doppelgrau (~doppelgra@ Quit (Ping timeout: 480 seconds)
[15:54] * Shadow386 (~Rens2Sea@tor-exit.squirrel.theremailer.net) has joined #ceph
[16:00] * mattbenjamin1 (~mbenjamin@ has joined #ceph
[16:02] * theancient (~jasonj@173-165-224-105-minnesota.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[16:02] * Hemanth (~hkumar_@ has joined #ceph
[16:04] * owasserm (~owasserm@a212-238-239-152.adsl.xs4all.nl) Quit (Remote host closed the connection)
[16:04] * owasserm (~owasserm@2001:984:d3f7:1:5ec5:d4ff:fee0:f6dc) has joined #ceph
[16:05] * Mibka (~andy@ has joined #ceph
[16:07] * Green (~Green@ Quit (Ping timeout: 480 seconds)
[16:07] * theancient (~jasonj@173-165-224-105-minnesota.hfc.comcastbusiness.net) has joined #ceph
[16:07] * ron-slc_ (~Ron@173-165-129-118-utah.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[16:09] <Be-El> does radosgw federation require radosgw on all sites, or does it only require s3 access?
[16:14] * vata (~vata@ has joined #ceph
[16:17] * kristen (~kristen@ has joined #ceph
[16:19] * Goodi (~Hannu@office.proact.fi) Quit (Quit: This computer has gone to sleep)
[16:24] * Shadow386 (~Rens2Sea@tor-exit.squirrel.theremailer.net) Quit ()
[16:27] <Mibka> If I have a 3 node cluster, each node having the same amount of disk space and I tell Ceph to store only one additional copy (size=2?). And I'm using about 85% of space for example. What happens if one node dies? There won't be enough space on the two other nodes for rebalancing with size=2.. Will Ceph just not start rebalancing or will it be a problem and fail?
[16:29] <llua> it should rebalance until it hits the full ratio, then stop writes to the cluster
[16:29] <m0zes> 85% full? I think it will start rebalancing until it hits 90% and stop. anything more than 75% full means you need more machines. and size 2 is potentially dangerous...
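(The arithmetic behind m0zes's warning, as a small sketch with a hypothetical helper; it assumes equal-sized nodes and that the cluster tries to restore full redundancy after the loss.)

```python
def utilization_after_node_loss(n_nodes: int, used_fraction: float) -> float:
    """Fraction of raw capacity needed once one of n equal nodes is gone
    and the cluster re-replicates to restore the original redundancy."""
    # the amount of raw data to store is unchanged, but raw capacity
    # shrinks from n nodes to n - 1
    return used_fraction * n_nodes / (n_nodes - 1)

# 3 nodes at 85% full, one node dies:
u = utilization_after_node_loss(3, 0.85)
print(f"{u:.1%}")  # 127.5% -- impossible, so rebalancing stalls at the full ratio
```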
[16:31] <Mibka> why is size 2 dangerous? It's storing a copy of the data on another machine, right? so it's more safe than just raid1 in a single system ? .. The same data now resides on a single machine running raid10 :)
[16:33] <m0zes> the likelihood that 2 disks fail at any point in time, or even in the time that it takes to rebalance, is unacceptably high for some people. I can't remember the charts, at the moment, but iirc it is something like ~90% safe.
[16:33] * Defaultti1 (~AotC@ has joined #ceph
[16:34] <Be-El> the probability should be the same as in the raid 5 case
[16:36] <m0zes> in the situations that require a size 2 for me, I make sure the min_size is 2 as well...
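(A toy model of why size=2 worries people, with made-up numbers: it assumes independent, constant-rate disk failures and ignores correlated failures, which make reality worse. After one disk dies, data is lost if any disk holding the only surviving copy fails during the rebuild window.)

```python
def second_failure_prob(n_other_disks: int, annual_failure_rate: float,
                        rebuild_hours: float) -> float:
    """P(at least one of n_other_disks fails during the rebuild window),
    assuming independent failures at a constant annual rate."""
    p_window = annual_failure_rate * rebuild_hours / (24 * 365)
    return 1 - (1 - p_window) ** n_other_disks

# toy numbers: 23 other disks, 4% annual failure rate, 8 h rebuild
p = second_failure_prob(23, 0.04, 8)
print(f"{p:.4%} chance of a second loss during rebuild")
```

With size=3 a *third* concurrent failure is needed before data is lost, which is why people call size=2 risky rather than unusable.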
[16:37] * Animazing (~Wut@ Quit (Ping timeout: 480 seconds)
[16:40] * branto (~branto@nat-pool-brq-t.redhat.com) Quit (Quit: Leaving.)
[16:41] * Kurt1 (~Adium@2001:628:1:5:f0a2:333c:c761:8c90) Quit (Quit: Leaving.)
[16:45] * rraja (~rraja@ Quit (Quit: Leaving)
[16:45] <Mibka> Okay. Well, I was going to setup 3 nodes only. Ceph + Proxmox and I don't actually need more nodes since I can add more disks to each node and there's plenty of cpu and ram in these 3 nodes.
[16:45] <Mibka> Is there any big advantage for Ceph if I add a 4th node?
[16:47] * kefu (~kefu@ has joined #ceph
[16:48] * b0e (~aledermue@ Quit (Quit: Leaving.)
[16:50] * rraja (~rraja@ has joined #ceph
[16:52] * Green (~Green@ has joined #ceph
[16:55] * vicente (~vicente@1-161-184-59.dynamic.hinet.net) has joined #ceph
[16:55] * scuttle|afk is now known as scuttlemonkey
[16:57] * Animazing (~Wut@ has joined #ceph
[16:57] * karnan (~karnan@ Quit (Quit: Leaving)
[16:57] * wushudoin (~wushudoin@2601:646:8200:c9f0:2ab2:bdff:fe0b:a6ee) has joined #ceph
[16:58] * mykola (~Mikolaj@ has joined #ceph
[16:59] <Be-El> Mibka: more accumulated bandwidth
[16:59] <Be-El> Mibka: and you can start to think about erasure coding pools
[17:02] <Mibka> I haven't read into that yet. Not sure what exactly an erasure coding pool is :)
[17:02] * analbeard (~shw@support.memset.com) Quit (Quit: Leaving.)
[17:03] <Be-El> Mibka: replicated pool behave like raid1, erasure coding pools like raid5/raid6
[17:03] * Defaultti1 (~AotC@ Quit ()
[17:03] * ircolle (~ircolle@nat-pool-bos-u.redhat.com) has joined #ceph
[17:04] * haplo37 (~haplo37@ has joined #ceph
[17:04] <SamYaple> Be-El: i mean 3 nodes would really be enough for EC. You just can't lose two nodes and still maintain your data unlike replicated
[17:05] <Be-El> SamYaple: 3 nodes is the minimum, yes
[17:06] <SamYaple> Be-El: i think 2 nodes is the minimum, but thats not highly available
[17:06] <SamYaple> nothing to stop k=1,m=1
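(The space trade-off being discussed, as a sketch: a k+m erasure-coded pool stores k data chunks plus m coding chunks on different hosts, so k=1,m=1 degenerates into plain 2x replication.)

```python
def raw_per_logical(k: int, m: int) -> float:
    """Raw bytes stored per logical byte in a k+m erasure-coded pool."""
    return (k + m) / k

print(raw_per_logical(1, 1))  # 2.0 -- same overhead as size=2 replication
print(raw_per_logical(2, 1))  # 1.5 -- needs >= 3 hosts, survives one loss
print(raw_per_logical(4, 2))  # 1.5 -- needs >= 6 hosts, survives two losses
```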
[17:09] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) has joined #ceph
[17:09] * vbellur (~vijay@ Quit (Ping timeout: 480 seconds)
[17:09] <Mibka> When running 3 nodes ceph+proxmox and ssd's only. Would you think erasure coding pools are a good idea? It's probably better to just stick to replicated pools
[17:10] <peetaur2> if you want IOPS, probably replicated is best
[17:11] <Be-El> if you just want to use rbd, you need to use replicated pools (or a cache tier on top of an erasure coding pool)
[17:11] <peetaur2> also I think EC is just object storage; with some extra work (like a cache tier) you get RBD too
[17:12] * CephFan1 (~textual@68-233-224-175.static.hvvc.us) has joined #ceph
[17:16] * Concubidated (~cube@ has joined #ceph
[17:21] * sudocat (~dibarra@104-188-116-197.lightspeed.hstntx.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[17:29] * andreww (~xarses@c-73-202-191-48.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[17:29] * ade (~abradshaw@2a02:810d:a4c0:5cd:9001:2bba:886a:5200) Quit (Ping timeout: 480 seconds)
[17:30] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[17:38] * Danjul (~Danjul@ Quit (Ping timeout: 480 seconds)
[17:39] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) has joined #ceph
[17:42] * ggarg (~ggarg@host-82-135-29-34.customer.m-online.net) Quit (Remote host closed the connection)
[17:43] * andreww (~xarses@ has joined #ceph
[17:45] * newbie (~kvirc@host217-114-156-249.pppoe.mark-itt.net) has joined #ceph
[17:45] * rmart04 (~rmart04@support.memset.com) Quit (Quit: rmart04)
[17:46] * jowilkin (~jowilkin@184-23-213-254.fiber.dynamic.sonic.net) has joined #ceph
[17:48] * bara (~bara@ip4-83-240-10-82.cust.nbox.cz) Quit (Quit: Bye guys!)
[17:50] * sudocat (~dibarra@ has joined #ceph
[17:51] * baotiao (~baotiao@ Quit (Quit: baotiao)
[17:52] * doppelgrau1 (~doppelgra@ Quit (Quit: Leaving.)
[17:55] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[17:55] * rraja (~rraja@ Quit (Quit: Leaving)
[17:56] * TMM (~hp@ Quit (Quit: Ex-Chat)
[17:56] * branto (~branto@ip-78-102-208-181.net.upcbroadband.cz) has joined #ceph
[17:57] * branto (~branto@ip-78-102-208-181.net.upcbroadband.cz) Quit ()
[17:58] * vikhyat (~vumrao@ Quit (Quit: Leaving)
[18:00] * sudocat1 (~dibarra@ has joined #ceph
[18:01] * vbellur (~vijay@nat-pool-bos-t.redhat.com) has joined #ceph
[18:05] * Be-El (~blinke@nat-router.computational.bio.uni-giessen.de) Quit (Quit: Leaving.)
[18:05] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) has joined #ceph
[18:05] * sudocat1 (~dibarra@ Quit (Read error: Connection reset by peer)
[18:05] * madkiss (~madkiss@2a02:8109:8680:2000:9589:79df:471:def7) has joined #ceph
[18:05] * sudocat (~dibarra@ Quit (Ping timeout: 480 seconds)
[18:05] * sudocat (~dibarra@ has joined #ceph
[18:07] * kefu (~kefu@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[18:07] * davidzlap (~Adium@2605:e000:1313:8003:b4ef:7b3d:6320:971b) has joined #ceph
[18:13] * kefu (~kefu@ has joined #ceph
[18:13] * vimal (~vikumar@ has joined #ceph
[18:15] * davidzlap (~Adium@2605:e000:1313:8003:b4ef:7b3d:6320:971b) Quit (Quit: Leaving.)
[18:17] * Green (~Green@ Quit (Ping timeout: 480 seconds)
[18:18] * mattch (~mattch@w5430.see.ed.ac.uk) Quit (Remote host closed the connection)
[18:18] * dgurtner (~dgurtner@84-73-130-19.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[18:21] * ggarg (~ggarg@host-82-135-29-34.customer.m-online.net) has joined #ceph
[18:23] * kefu (~kefu@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[18:24] * jprins_ (~jprins@bbnat.betterbe.com) Quit (Ping timeout: 480 seconds)
[18:24] * ffilzwin (~ffilz@c-67-170-185-135.hsd1.or.comcast.net) Quit (Quit: Leaving)
[18:27] * davidzlap (~Adium@2605:e000:1313:8003:61bc:80cc:ada:6a34) has joined #ceph
[18:27] * jprins (~jprins@bbnat.betterbe.com) has joined #ceph
[18:29] * kefu (~kefu@ has joined #ceph
[18:29] * tries (~tries__@2a01:2a8:2000:ffff:1260:4bff:fe6f:af91) Quit (Ping timeout: 480 seconds)
[18:30] * ffilzwin (~ffilz@c-67-170-185-135.hsd1.or.comcast.net) has joined #ceph
[18:34] * fridim (~fridim@56-198-190-109.dsl.ovh.fr) Quit (Ping timeout: 480 seconds)
[18:41] * kefu (~kefu@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[18:41] * rakeshgm (~rakesh@ has joined #ceph
[18:48] * kefu (~kefu@ has joined #ceph
[18:49] * shyu_ (~shyu@ has joined #ceph
[18:53] * jermudgeon (~jhaustin@ has joined #ceph
[18:55] * rotbeard (~redbeard@ Quit (Quit: Leaving)
[19:01] * doppelgrau (~doppelgra@dslb-088-072-094-200.088.072.pools.vodafone-ip.de) has joined #ceph
[19:03] * DanFoster (~Daniel@office.34sp.com) Quit (Quit: Leaving)
[19:06] * tries (~tries__@2a01:2a8:2000:ffff:1260:4bff:fe6f:af91) has joined #ceph
[19:10] * vZerberus (~dogtail@00021993.user.oftc.net) has joined #ceph
[19:12] * stefano (~stefano@ has joined #ceph
[19:12] * stefano is now known as stefan0
[19:13] <stefan0> Hi guys!
[19:13] <stefan0> I'm having a sizing issue for Tyco DVR VideoEdge system, for 200 Tbs liquid
[19:14] * shyu_ (~shyu@ Quit (Remote host closed the connection)
[19:14] <stefan0> they do have a sizing guide and connectors for EMC Isilon and VSS
[19:15] <stefan0> the system shall support 200 HD cameras writing at the same time
[19:15] <jermudgeon> how many megabits per camera?
[19:16] * lae (~lae@ has joined #ceph
[19:16] <lae> Is there much benefit to having a ceph journal on a separate sata drive than on the osd sata drive itself?
[19:16] <stefan0> the VSS sizing guide says that 1 GbE NICs are good to go
[19:17] <stefan0> jermudgeon, 380 Mbps for the entire environment
[19:17] <jermudgeon> that sounds doable
[19:17] * fdmanana (~fdmanana@2001:8a0:6e0c:6601:4418:f044:c01f:37d4) Quit (Ping timeout: 480 seconds)
[19:19] <stefan0> lae, imho, depends of how many concurrent writes/read your system will handle.. SATA is not SAS/SAS-NL when we focus queueing, if you have a SATA drive more idle than the OSD that sounds a good catch!
[19:22] <stefan0> jermudgeon, do you think using a 1 Gbps LAN we won't get any writing bottlenecks?
[19:23] <jermudgeon> Nominally you shouldn't, however, if you can do a LAG that would no doubt help. Are you talking about 1 Gbps for the client side, or for the OSD side? You'll need more bandwidth for replication on the OSD side
[19:24] <stefan0> usually I don't do ceph sizing without a 10 Gb switch/networking... the rebuild will make the system bleed :\
[19:24] <jermudgeon> yep
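(The back-of-envelope behind that caution, as a hypothetical helper: for each client write, the primary OSD forwards size-1 replica copies over the cluster network, so replication traffic alone can approach a 1 GbE link.)

```python
def replication_traffic_mbps(client_write_mbps: float, pool_size: int) -> float:
    """Cluster-network traffic generated by replication: the primary osd
    forwards (size - 1) copies of every client write."""
    return client_write_mbps * (pool_size - 1)

# 200 HD cameras totalling 380 Mbps into a size=3 pool:
print(replication_traffic_mbps(380.0, 3))  # 760.0 Mbps on the cluster network
```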
[19:25] * TMM (~hp@dhcp-077-248-009-229.chello.nl) has joined #ceph
[19:27] * squizzi (~squizzi@ Quit (Quit: bye)
[19:30] * Vacuum__ (~Vacuum@i59F790B3.versanet.de) has joined #ceph
[19:30] * Vacuum_ (~Vacuum@ Quit (Read error: Connection reset by peer)
[19:32] * kefu (~kefu@ Quit (Quit: My Mac has gone to sleep. ZZZzzz…)
[19:32] * vimal (~vikumar@ Quit (Quit: Leaving)
[19:33] * jermudgeon (~jhaustin@ Quit (Quit: jermudgeon)
[19:33] * jermudgeon (~jhaustin@ has joined #ceph
[19:34] * karnan (~karnan@ has joined #ceph
[19:35] * squizzi (~squizzi@2001:420:2240:1268:ad85:b28:ee1c:890) has joined #ceph
[19:36] * Teddybareman (~Maariu5_@tsn109-201-154-143.dyn.nltelcom.net) has joined #ceph
[19:37] * stiopa (~stiopa@cpc73832-dals21-2-0-cust453.20-2.cable.virginm.net) has joined #ceph
[19:39] * Green (~Green@ has joined #ceph
[19:56] <stefan0> Regarding the Data Striping with no replica set (min_size 1), if the system lost one OSD, does the data get lost? I'm reading http://docs.ceph.com/docs/jewel/architecture/ and the data striping section says: "Ceph's striping offers the throughput of RAID 0 striping, the reliability of n-way RAID mirroring and faster recovery."
[19:57] <darkfader> yes, one replica always means that
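(A terminology note, since size and min_size get conflated above: size is how many copies are stored, min_size is the fewest copies that must be available before the pool still accepts I/O. With size=1 any lost OSD loses data. A sketch with a placeholder pool name:)

```shell
# how many copies are stored, and the floor below which i/o stops
ceph osd pool get mypool size
ceph osd pool get mypool min_size

# store two copies of everything in this pool
ceph osd pool set mypool size 2
```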
[20:06] * Teddybareman (~Maariu5_@tsn109-201-154-143.dyn.nltelcom.net) Quit ()
[20:15] * Vacuum_ (~Vacuum@ has joined #ceph
[20:18] * andreww (~xarses@ Quit (Remote host closed the connection)
[20:19] * Green (~Green@ Quit (Read error: Connection reset by peer)
[20:20] * xarses (~xarses@ has joined #ceph
[20:20] <mnaser> Is there a way of determining the version that ceph clients are running (ex: ceph tell client.* version or something) to know if it's safe to change tunables from an upgrade?
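(One hedged way to approach this on a Jewel-era cluster, assuming admin-socket access on a monitor host and that the mon is named after the short hostname: dump the monitor's sessions and inspect each connected client's reported feature bitmask; clients advertising old feature sets may not cope with newer crush tunables.)

```shell
# run on a monitor host; the mon name is an assumption about your naming
ceph daemon mon.$(hostname -s) sessions

# each session entry lists the peer address and a "features" bitmask;
# unusually low masks point at old kernels or old librados clients
```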
[20:22] * Vacuum__ (~Vacuum@i59F790B3.versanet.de) Quit (Ping timeout: 480 seconds)
[20:25] * derjohn_mobi (~aj@b2b-94-79-172-98.unitymedia.biz) Quit (Remote host closed the connection)
[20:26] * TomasCZ (~TomasCZ@yes.tenlab.net) has joined #ceph
[20:27] * Vacuum__ (~Vacuum@ has joined #ceph
[20:34] * Vacuum_ (~Vacuum@ Quit (Ping timeout: 480 seconds)
[20:36] * xarses (~xarses@ Quit (Remote host closed the connection)
[20:38] * xarses (~xarses@ has joined #ceph
[20:40] * xarses (~xarses@ Quit (Remote host closed the connection)
[20:42] * xarses (~xarses@ has joined #ceph
[20:46] * karnan (~karnan@ Quit (Remote host closed the connection)
[20:52] * darkid (~Fapiko@ has joined #ceph
[21:09] * Mibka (~andy@ Quit (Quit: bbl)
[21:13] * cathode (~cathode@50-232-215-114-static.hfc.comcastbusiness.net) has joined #ceph
[21:14] * peetaur (~peter@p200300E10BC04E00667002FFFE2E10FC.dip0.t-ipconnect.de) has joined #ceph
[21:15] <peetaur> howdy. Where are the docs for trying BlueStore?
[21:16] <peetaur> or if someone just tells me enough, I'll write the docs
[21:21] * darkid (~Fapiko@ Quit ()
[21:26] * ircolle (~ircolle@nat-pool-bos-u.redhat.com) Quit (Quit: Leaving.)
[21:26] <rkeene> Every release of Ceph has more and more build bugs :-(
[21:27] <mnaser> rkeene, what issues you're having?
[21:43] * Hemanth (~hkumar_@ Quit (Quit: Leaving)
[21:46] <rkeene> mnaser, #17438 -- compiling Ceph v10.2.3 with --with-radosgw --without-openldap fails
[21:47] <rkeene> Since librgw.c references a function called parse_rgw_ldap_bindpw() that lives in the source file rgw_ldap.cc, which is only compiled and linked when you enable OpenLDAP
[21:47] <rkeene> As of commit fe57aceeb02ad9163feb2d196589b5927cedfa0f
[21:50] <lurbs> Are there plans for an official jemalloc build?
[21:50] * rdas (~rdas@ has joined #ceph
[21:50] * rdas (~rdas@ Quit (Remote host closed the connection)
[21:51] * Hemanth (~hkumar_@ has joined #ceph
[21:51] * salwasser (~Adium@ Quit (Quit: Leaving.)
[21:52] <rkeene> Tangentially related, has anyone tried Ceph with SuperMalloc? I fixed up a bug in SuperMalloc, and Ceph with SuperMalloc seems to work, but I haven't heavily tested it
[21:57] * joshd1 (~jdurgin@2602:30a:c089:2b0:d811:6c84:cc46:adea) has joined #ceph
[22:12] * mykola (~Mikolaj@ Quit (Quit: away)
[22:15] * Jyron1 (~Silentkil@mtest1.im-in-the-tor-network.xyz) has joined #ceph
[22:16] * vbellur (~vijay@nat-pool-bos-t.redhat.com) Quit (Quit: Leaving.)
[22:22] * davidzlap (~Adium@2605:e000:1313:8003:61bc:80cc:ada:6a34) Quit (Quit: Leaving.)
[22:27] * davidzlap (~Adium@2605:e000:1313:8003:61bc:80cc:ada:6a34) has joined #ceph
[22:28] * derjohn_mob (~aj@x590cda80.dyn.telefonica.de) has joined #ceph
[22:30] * KindOne (kindone@0001a7db.user.oftc.net) Quit (Remote host closed the connection)
[22:32] * KindOne (kindone@h252.172.16.98.dynamic.ip.windstream.net) has joined #ceph
[22:35] * Jeffrey4l_ (~Jeffrey@ has joined #ceph
[22:38] * rwheeler (~rwheeler@pool-108-7-196-31.bstnma.fios.verizon.net) has joined #ceph
[22:39] * Jeffrey4l__ (~Jeffrey@ Quit (Ping timeout: 480 seconds)
[22:42] * davidzlap (~Adium@2605:e000:1313:8003:61bc:80cc:ada:6a34) Quit (Quit: Leaving.)
[22:43] * wes_dillingham (~wes_dilli@ Quit (Quit: wes_dillingham)
[22:45] * Jyron1 (~Silentkil@mtest1.im-in-the-tor-network.xyz) Quit ()
[22:47] * derjohn_mob (~aj@x590cda80.dyn.telefonica.de) Quit (Ping timeout: 480 seconds)
[22:49] * jarrpa (~jarrpa@ has joined #ceph
[22:52] * bniver (~bniver@71-9-144-29.static.oxfr.ma.charter.com) Quit (Remote host closed the connection)
[22:52] * wgao (~wgao@ Quit (Read error: Connection timed out)
[22:56] * newbie (~kvirc@host217-114-156-249.pppoe.mark-itt.net) Quit (Ping timeout: 480 seconds)
[22:58] * derjohn_mob (~aj@x590cda80.dyn.telefonica.de) has joined #ceph
[23:08] * jstrassburg (~oftc-webi@ has joined #ceph
[23:11] * Hemanth (~hkumar_@ Quit (Ping timeout: 480 seconds)
[23:16] <jstrassburg> Hello all, looking for some guidance. We're running Firefly and are having an issue during recovery. We'll have an OSD process spin up a CPU and then we get blocked / slow requests. If we restart that OSD, another one will spin and block. Our recovery (disk replacement) was going well for a while and then we got to this spot. Does anyone have any other ideas for allowing the recovery to continue without blocking requests? I can provide any other informa
[23:17] <jermudgeon> jstrassburg: I'm not an expert… how many replicas are you running for max=?
[23:17] * brad_mssw (~brad@ Quit (Quit: Leaving)
[23:17] <jstrassburg> 3 replicas
[23:17] <jstrassburg> 3 nodes, 24 OSDs
[23:19] <jstrassburg> We've been doing work to reduce our pgs/OSD (we had far too many from previous work) and have reduced it to around 600 / OSD (which is still very high, I know)
[23:19] <jermudgeon> are you reweighting or changing crush weight on an osd before you restart it?
[23:20] <jstrassburg> No, do you think that would help?
[23:20] <jermudgeon> sec…
[23:21] <jermudgeon> on reading this, that might be a red herring http://ceph.com/planet/difference-between-ceph-osd-reweight-and-ceph-osd-crush-reweight/
[23:21] <jermudgeon> are you marking the OSDs out or down before restarting?
[23:21] <jstrassburg> If it matters, currently our weights match for all our OSDs and we have 8 OSDs per node.
[23:22] <jstrassburg> I'll ask the person that restarted it if he marked it down
[23:23] <jstrassburg> he says he did not
[23:23] <jstrassburg> mark them down, would that be of help perhaps?
[23:27] <doppelgrau> jstrassburg: can you paste somewhere ceph -s and ceph health detail output?
[23:28] <jstrassburg> yeah, just a min
[23:29] <jstrassburg> cluster 91eda869-9a34-4f74-8189-7d5fb6952f4a health HEALTH_WARN 76 requests are blocked > 32 sec monmap e15: 5 mons at {DCFS-MON05=,DCFS-MON06=,DCFS-MON07=,DCFS-MON11=,DCFS-MON12=}, election epoch 534, quorum 0,1,2,3,4 DCFS-MON05,DCFS-MON06,DCFS-MON07,DCFS-MON11,DCFS-MON12 mdsmap e36: 0/1/1 up osdmap e158518: 24 osds: 24 up, 24 in
[23:30] <doppelgrau> jstrassburg:
[23:30] <jstrassburg> pgmap v60387737: 5782 pgs, 23 pools, 3903 GB data, 996 kobjects 13068 GB used, 118 TB / 130 TB avail 5781 active+clean 1 active+clean+scrubbing client io 20226 B/s rd, 4107 kB/s wr, 333 op/s
[23:30] * nathani1 (~nathani@2607:f2f8:ac88::) Quit (Quit: WeeChat 1.4)
[23:30] <doppelgrau> somewhere on pastebin or something else, with linebreaks
[23:30] * vbellur (~vijay@ has joined #ceph
[23:30] <jstrassburg> HEALTH_WARN 114 requests are blocked > 32 sec; 1 osds have slow requests 6 ops are blocked > 524.288 sec 43 ops are blocked > 262.144 sec 2 ops are blocked > 131.072 sec 25 ops are blocked > 65.536 sec 38 ops are blocked > 32.768 sec
[23:30] <jstrassburg> 6 ops are blocked > 524.288 sec on osd.20 43 ops are blocked > 262.144 sec on osd.20 2 ops are blocked > 131.072 sec on osd.20 25 ops are blocked > 65.536 sec on osd.20 38 ops are blocked > 32.768 sec on osd.20 1 osds have slow requests
[23:31] <jstrassburg> Here is better formatting: http://pastebin.com/chXXsxCf
[23:32] * nathani (~nathani@2607:f2f8:ac88::) has joined #ceph
[23:32] <doppelgrau> all right, besides the slow requests, it looks healthy
[23:32] <doppelgrau> are there more information on the osd-log?
[23:33] <doppelgrau> if not, try increasing debug-level
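(For the stuck osd specifically, the admin socket on that host is usually more telling than the cluster log; osd.20 comes from the health output pasted above.)

```shell
# which requests are currently stuck, and in which state
ceph daemon osd.20 dump_ops_in_flight

# recently completed slow ops with per-phase timings
ceph daemon osd.20 dump_historic_ops

# temporarily raise verbosity for that osd
ceph tell osd.20 injectargs '--debug-osd 10 --debug-filestore 10'
```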
[23:33] <jstrassburg> Yeah, our recovery finally completed as I was chatting
[23:33] * valeech (~valeech@pool-96-247-203-33.clppva.fios.verizon.net) Quit (Quit: valeech)
[23:35] <jstrassburg> ok, looking at logs
[23:35] <jstrassburg> do you think taking the OSD out before restart would help? I didn't see a response on that
[23:35] <jstrassburg> I suppose we could try
[23:35] <doppelgrau> out?
[23:36] <doppelgrau> out = rebalancing the data = higher load if it comes back
[23:36] <jstrassburg> only if it is out for like 5 minutes correct?
[23:36] <jstrassburg> or we set noout
[23:37] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:37] <jstrassburg> This is all we have in the osd log other than the notes of slow requests:
[23:37] <jstrassburg> 2016-09-29 16:05:41.574894 7f42cef72700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f42cbf6c700' had timed out after 15
[23:37] <doppelgrau> jstrassburg: down = no ping , down => out after a configured time in state down (or manually)
[23:38] <doppelgrau> increasing verbosity?
[23:38] <jstrassburg> Yeah, I'll try that. thx
[23:42] * mattbenjamin1 (~mbenjamin@ Quit (Ping timeout: 480 seconds)
[23:44] * squizzi (~squizzi@2001:420:2240:1268:ad85:b28:ee1c:890) Quit (Ping timeout: 480 seconds)
[23:46] * blizzow (~jburns@50-243-148-102-static.hfc.comcastbusiness.net) has joined #ceph
[23:48] * Chaos_Llama (~K3NT1S_aw@ has joined #ceph
[23:55] * CephFan1 (~textual@68-233-224-175.static.hvvc.us) Quit (Quit: My MacBook Pro has gone to sleep. ZZZzzz…)
[23:59] * mhack (~mhack@24-151-36-149.dhcp.nwtn.ct.charter.com) Quit (Remote host closed the connection)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.