#ceph IRC Log

IRC Log for 2013-08-14

Timestamps are in GMT/BST.

[0:04] * sprachgenerator (~sprachgen@130.202.135.224) Quit (Quit: sprachgenerator)
[0:09] * mschiff (~mschiff@85.182.236.82) Quit (Remote host closed the connection)
[0:10] <sagewk> joelio: the vm talks directly to librbd; libvirt's only role is in starting the vm with the right config and args.
[0:10] <sagewk> i'm not a big qemu expert, but I think the concurrency is mostly a function of the VM OS and driver (virtio or scsi or whatever) that is being used.
[0:11] * mschiff (~mschiff@85.182.236.82) has joined #ceph
[0:17] <sagewk> joao: btw on that 220gb mon thing, they probably also need to copy the other files in the dir, since the tool only copies the store.db content?
[0:18] <joelio> sagewk: right, I see, that makes much more sense thanks.
[0:25] * mschiff (~mschiff@85.182.236.82) Quit (Remote host closed the connection)
[0:28] <joao> sagewk, I don't think that should be needed
[0:28] <joao> unless the monitor crashes
[0:28] <sagewk> there's the keyring file, i think 'whoami', and some other stuff... ?
[0:28] <joao> ah
[0:28] <joao> that
[0:28] <joao> that's true
[0:28] <joao> totally forgot about that
[0:28] <sagewk> just cp * should do it
[0:28] <joao> yeah
[0:29] <joao> oh, actually, that may not be needed at all
[0:29] <joao> I think the monitor will write the keyring file if not present
[0:29] <joao> let me check
[0:30] <joao> (and we don't keep anything else besides the keyring on the mon data dir, apart from store.db)
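A minimal sketch of what "cp *" amounts to here, with made-up ids and paths; the assumption is that the store rebuild tool has already produced a fresh store.db in a new directory, so only the small files sitting next to it (keyring, whoami, etc.) still need to come across:

    # monitor must be stopped first; the id and paths are illustrative
    service ceph stop mon.a
    cd /var/lib/ceph/mon/ceph-a
    # store.db is a leveldb directory, so "-type f" copies only the small
    # metadata files beside it and leaves the rebuilt store.db untouched
    find . -maxdepth 1 -type f -exec cp -a {} /path/to/rebuilt-mon-dir/ \;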
[0:30] * alram (~alram@208.86.100.62) Quit (Read error: Operation timed out)
[0:31] <Kioob> zjohnson: I think Intel 530 are not "enterprise level" SSD, only old 320 and new DC S3500/S3700 are
[0:34] <Kioob> mm I can't find information about that, but I think the 530 series can lose its write cache in case of power failure... but I could be wrong
[0:34] <zjohnson> oh, yeah I heard the 320 has some supercaps on it
[0:34] <Kioob> (and the 320 series is slow)
[0:35] <nhm_> S3700 looks fantastic
[0:35] <joao> sagewk, indeed, the keyring is written out when the monitor starts
[0:35] <joao> so no need to copy it :)
[0:35] <sagewk> from what though?
[0:35] <sagewk> that is only helpful when doing a --mkfs or bootstrap
[0:35] <Kioob> Yes, S3700 looks pretty good :) I only have S3500, and I'm happy
[0:35] <sagewk> on an existing cluster that is where the mon. key is stored.. needs to come from somewhere
[0:37] <joao> sagewk, I should check the cuttlefish branch, but on next it tries to first load the keyring from mon_data/keyring; if it fails, it attempts to grab the keys from the keyserver, and will then write them out
[0:38] <joao> yeah, cuttlefish does the same; during Monitor::preinit()
[0:39] <sagewk> the mon. key isn't stored in the keyserver tho
[0:39] <zjohnson> S3700 400GB is $946
[0:39] <sagewk> that code is partially there for the transition from when it was
[0:39] * alram (~alram@199.66.166.55) has joined #ceph
[0:39] <sagewk> (ceph auth list won't include mon.)
[0:41] <dmick> sagewk: oh?
[0:41] <joao> alright, I'll instruct Nelson of that
[0:41] <dmick> it does when I run it...is that peculiar to vstart?
[0:41] <sagewk> partly security, partly makes bootstrap and mon/mon authentication less awkward
[0:41] <clayb> zjohnson I'm quite happy with the Crucial M500 in my current clusters FWIW
[0:41] <sagewk> dmick: yeah
[0:41] <dmick> k
[0:42] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[0:42] <dmick> probably because someone did an auth import -i keyring?
[0:42] <sagewk> dmick: actually, maybe.. my vstart doesn't show it
[0:42] <dmick> hm
[0:44] <zjohnson> clayb: cool. what size did you end up using. and are you using it for boot/os/joural/meta/swap?
[0:45] <dmick> weird. maybe it's a side effect of one of the cephtool or rest tests
[0:45] * AfC (~andrew@2407:7800:200:1011:a17b:e651:3b26:70d2) has joined #ceph
[0:45] <dmick> just rebuilt, not there for me now either
[0:46] <clayb> Actually my apologies, we are evaling the M500; we're running their last gen (I guess the M4). We use them for everything; usually 8x500GB per machine.
[0:48] <zjohnson> clayb: so just doing to full SSD for storage?
[0:49] <clayb> zjohnson: yup full SSD for our VM hosts
[0:58] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) Quit (Quit: ...)
[0:59] * devoid (~devoid@130.202.135.227) Quit (Quit: Leaving.)
[1:05] * alram (~alram@199.66.166.55) Quit (Read error: Connection reset by peer)
[1:11] * clayb (~kvirc@69.191.241.59) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[1:19] * mozg (~andrei@host109-151-35-94.range109-151.btcentralplus.com) Quit (Quit: Ex-Chat)
[1:26] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Read error: Connection reset by peer)
[1:26] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[1:27] * rturk is now known as rturk-away
[1:29] * nhm_ (~nhm@65-128-168-138.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[1:42] * nhm (~nhm@65-128-168-138.mpls.qwest.net) has joined #ceph
[1:46] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[1:47] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[1:50] <sagewk> RFC: http://fpaste.org/31939/43784113/
[1:52] <sjust> line 11: recovery load ...
[1:52] <sjust> line 12: client load ...
[1:52] <sjust> ?
[1:52] <sagewk> yeah
[1:52] <sjust> s/load/io?
[1:53] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[1:53] <dmick> I don't understand the numbers of degraded and states
[1:54] <dmick> would expect to see an explanation of the 57 there, I guess
[1:55] * AfC (~andrew@2407:7800:200:1011:a17b:e651:3b26:70d2) Quit (Quit: Leaving.)
[1:55] * AfC (~andrew@2407:7800:200:1011:a17b:e651:3b26:70d2) has joined #ceph
[1:58] <sagewk> http://fpaste.org/31940/43829113/
[1:58] <sagewk> better
[1:59] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[2:00] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[2:01] <sjust> how would it look with peering/inactive?
[2:01] <sjust> oh, just another line
[2:01] <sjust> looks good
[2:06] * AfC (~andrew@2407:7800:200:1011:a17b:e651:3b26:70d2) Quit (Quit: Leaving.)
[2:09] * yehudasa_ (~yehudasa@2602:306:330b:1410:61f5:8419:7d78:c4ae) has joined #ceph
[2:09] * Qu310 (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[2:14] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[2:16] * DarkAce-Z (~BillyMays@50.107.55.36) has joined #ceph
[2:17] * huangjun (~kvirc@221.234.157.115) has joined #ceph
[2:18] * DarkAceZ (~BillyMays@50.107.55.36) Quit (Ping timeout: 480 seconds)
[2:19] <Kioob> recovering 363 o/s, 1407MB/s
[2:19] <Kioob> :)
[2:21] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[2:23] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[2:30] * sagelap (~sage@2600:1012:b02c:6ed8:cd0d:7d6:865:6b7e) has joined #ceph
[2:31] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[2:33] <huangjun> if i want to deploy multi-mon on the same host,what should i do?
[2:34] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[2:36] <Kioob> at least changing the default port
[2:36] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:42] <dmick> huangjun: nothing special
[2:43] <huangjun> but right now we can't make it work.
[2:43] <huangjun> we have only one NIC and socket.gethostname() returns the same hostname
[2:44] <dmick> what are you trying, and what is it saying?
[2:44] * AfC (~andrew@2407:7800:200:1011:8c69:bbc4:9be:f05e) has joined #ceph
[2:45] <huangjun> i'm trying to deploy 3 mons on one host machine that has only one eth interface,
[2:48] * tserong_ is now known as tserong
[2:48] <dmick> what command are you typing, and what is the response
[2:49] * LeaChim (~LeaChim@176.248.81.121) Quit (Ping timeout: 480 seconds)
[2:52] <huangjun> dmick: sorry, i think i need to check the problem again.
[2:52] <dmick> there's no fundamental problem with having multiple monitors on one machine. They're just daemons.
[2:53] <sagelap> ceph-deploy doesn't support it (well or at all, i forget) though
[2:54] * yanzheng (~zhyan@jfdmzpr01-ext.jf.intel.com) has joined #ceph
[2:56] <dmick> sagelap: ah
[2:56] <dmick> well *that's* probably true, thinking about it
[2:58] <huangjun> will it be supported in the last release?
[2:58] <huangjun> next release
[3:02] * rudolfsteiner (~federicon@190.220.6.51) has joined #ceph
[3:02] <huangjun> so i need to turn to mkcephfs if i want to deploy like this
[3:03] <alfredo|afk> huangjun: I would not mind a ticket as a feature request to work on it :)
[3:03] * alfredo|afk is now known as alfredodeza
[3:04] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[3:04] <huangjun> how?
[3:05] <dmick> adding monitors manually isn't very hard, and in production (or even nontrivial experiments) you probably really don't want more than one monitor on a single host
[3:05] <dmick> monitors are very important to keep separated for fault-tolerance. I don't think it would be a very useful feature
[3:06] * rudolfsteiner (~federicon@190.220.6.51) Quit ()
[3:08] <dmick> http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
[3:08] * sagelap (~sage@2600:1012:b02c:6ed8:cd0d:7d6:865:6b7e) Quit (Read error: No route to host)
[3:08] <dmick> although step 6 will need to be modified for the ceph-deploy-style deployment
[3:08] * rudolfsteiner (~federicon@190.220.6.51) has joined #ceph
[3:09] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Remote host closed the connection)
[3:10] <huangjun> change the mon addr port?
[3:10] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[3:10] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[3:11] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[3:12] <dmick> that, and you don't use a section in ceph.conf to start the mon, but rather the right sort of files in /var/lib/ceph
[3:14] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Remote host closed the connection)
[3:19] <huangjun> i see, the sysvinit file will be found in /var/lib/ceph, so we don't need to write the mon section in ceph.conf
[3:19] <dmick> right, or upstart, depending on your distro
[3:19] <dmick> we should probably update that document; I think someone mentioned it earlier this week
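For reference, a rough sketch of adding a second monitor by hand on the same host, following the add-or-rm-mons doc linked above; the mon id, IP and port here are made up, and as Kioob noted the extra mon has to listen on a non-default port:

    # fetch the existing mon keyring and monmap (ids/paths are illustrative)
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    # build the new monitor's data dir and register it on a different port
    ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add b 192.168.1.10:6790
    # ceph-deploy-style setups start daemons from marker files, not ceph.conf sections
    touch /var/lib/ceph/mon/ceph-b/sysvinit     # or 'upstart', depending on distro
    service ceph start mon.b                    # or: start ceph-mon id=b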
[3:19] * Cube (~Cube@c-98-208-30-2.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:20] * mjevans (~mje@209.141.34.79) Quit (Remote host closed the connection)
[3:20] * mjevans- (~mje@209.141.34.79) has joined #ceph
[3:20] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[3:22] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[3:28] * rudolfsteiner (~federicon@190.220.6.51) Quit (Ping timeout: 480 seconds)
[3:34] * mikedawson_ (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[3:39] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[3:39] * mikedawson_ is now known as mikedawson
[3:42] * john_barbee_ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) has joined #ceph
[3:48] * joao (~JL@2607:f298:a:607:9eeb:e8ff:fe0f:c9a6) Quit (Quit: Leaving)
[3:49] * Cube (~Cube@c-98-208-30-2.hsd1.ca.comcast.net) has joined #ceph
[3:50] * nerdtron (~kenneth@202.60.8.252) has joined #ceph
[3:50] * nerdtron is now known as kenneth
[3:50] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[3:50] <kenneth> when i try ceph -w what does scrub mean?
[3:50] <kenneth> deep scrub
[3:52] * gregaf (~Adium@2607:f298:a:607:5c72:b976:4e18:287d) has joined #ceph
[3:53] <bandrus> scrubbing maintains data integrity at the cost of performance: http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing
[3:53] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Remote host closed the connection)
[3:54] * yy (~michealyx@122.233.46.26) has joined #ceph
[3:57] <huangjun> deep scrub means checking the consistency of the replica and primary objects
[3:58] <kenneth> ok scrub scrub scrub.. haha i'm just amused it's named scrub rather than file check
[4:00] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[4:00] * Cube (~Cube@c-98-208-30-2.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[4:01] <absynth> that's because it is not checking any files
[4:01] <absynth> it could be called "consistency check", that would probably be more consistent
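For reference, scrubs run automatically on a schedule, but they can also be kicked off by hand; a quick sketch (the pg id and osd number are placeholders):

    ceph pg scrub 2.1f          # light scrub: compare object metadata/sizes across replicas
    ceph pg deep-scrub 2.1f     # deep scrub: also read and checksum the object data
    ceph osd deep-scrub 3       # ask osd.3 to deep-scrub all of its PGs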
[4:01] <huangjun> dmick: sorry to disturb you, but if we have two hosts to deploy mons on, what should i do?
[4:01] <absynth> get a third?
[4:05] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[4:13] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[4:14] * yy (~michealyx@122.233.46.26) has left #ceph
[4:15] * yy (~michealyx@122.233.46.26) has joined #ceph
[4:19] <kenneth> 2 mons is not quite good; if you lose one, there will be no quorum
[4:19] <kenneth> you should have at least three.. even an extra pc with no OSDs attached can participate as a monitor
[4:25] * Dark-Ace-Z (~BillyMays@50.107.55.36) has joined #ceph
[4:26] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:28] * DarkAce-Z (~BillyMays@50.107.55.36) Quit (Ping timeout: 480 seconds)
[4:29] * Dark-Ace-Z is now known as DarkAceZ
[4:35] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[4:36] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[4:37] <Psi-Jack> Heh, I'm certainly glad I have zabbix monitoring my ceph cluster extensively, because when an OSD just stops running out of the blue, it gets kinda frustrating. :0
[4:38] <Psi-Jack> A ceph-osd process, that is.
[4:41] <mikedawson> Psi-Jack: did you get a backtrace in the osd log?
[4:41] <Psi-Jack> Not sure, but it just happened again.
[4:43] <Psi-Jack> http://paste.hostdruids.com/view/5c5431fa
[4:43] <Psi-Jack> That?
[4:45] <sjustlaptop> Psi-Jack: that could be a bad disk
[4:45] <sjustlaptop> dmesg have anything good?
[4:45] <Psi-Jack> dmesg shows nothing wrong.
[4:45] <sjustlaptop> same osd process?
[4:45] <Psi-Jack> Same osd process, yes.
[4:46] <Psi-Jack> It was up for 5~6 minutes before dying.
[4:46] <Psi-Jack> So, I'll give it that time again, and see.
[4:47] <Psi-Jack> diag on the disk shows nothing wrong.
[4:47] <mikedawson> Psi-Jack: anything like this? http://www.spinics.net/lists/ceph-users/msg00357.html
[4:48] <sjustlaptop> Psi-Jack: that method doesn't even have an assert
[4:48] <sjustlaptop> wtf
[4:49] <sjustlaptop> the scroll back probably has the assert line number?
[4:49] <Psi-Jack> its down again/
[4:49] <sjustlaptop> sorry, the log probably has the assert line number?
[4:49] <sjustlaptop> can you paste that line?
[4:49] <mikedawson> sjustlaptop: I can't seem to get verbose rbd client logging started with injectargs. Do I need to restart the volume in question (or the mons/osds)?
[4:49] <sjustlaptop> possibly
[4:50] <sjustlaptop> I'm not sure you can use injectargs with a client?
[4:50] <Psi-Jack> http://paste.hostdruids.com/view/c87be7ec
[4:50] <sjustlaptop> you'd have to use the client admin socket?
[4:50] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[4:50] <sjustlaptop> Psi-Jack: that's a corrupt leveldb store
[4:50] <sjustlaptop> can the cluster recover?
[4:51] <Psi-Jack> sjustlaptop: Recover how? This keeps happening on osd.7.
[4:51] <Psi-Jack> The cluster does fully recover, but then osd drops again.
[4:51] <sjustlaptop> yes. will the cluster survive without osd 7?
[4:51] <Psi-Jack> osd.7 specifically every time.
[4:51] <Psi-Jack> Eh.. it's degraded.
[4:51] <sjustlaptop> ok, mark osd 7 out and let it recover
[4:51] <sjustlaptop> that disk may be toast
[4:51] <sjustlaptop> at the least, the leveldb instance is toast
[4:51] <Psi-Jack> The disk is fine. All diagnostics shows it's in good shape.
[4:52] <sjustlaptop> ok, that assert is saying that the leveldb store is inconsistent
[4:52] <sjustlaptop> the easiest way to recover is to let the cluster rebuild
[4:52] <Psi-Jack> Okay? I don't know what that means.
[4:52] <Psi-Jack> I see. That'll take a while. heh.
[4:52] <Psi-Jack> like, hours.. Many many hours.
[4:53] <sjustlaptop> don't do anything drastic to the disk though, if it doesn't recover you can try something more drastic with osd.7
[4:53] <sjustlaptop> once the cluster heals, you can reintroduce osd7
[4:53] <sjustlaptop> that is, wipe the disk and introduce a new replacement osd
[4:53] <sjustlaptop> but not until the cluster heals
[4:53] <Psi-Jack> okay, so How do I kick it out for the time being? It's almost already to the point of auto-kicking it out though.
[4:54] <sjustlaptop> ceph osd out 7
[4:54] <sjustlaptop> and don't let the process restart
[4:54] <Psi-Jack> OKay, yeah, osd.7 is already out. heh
[4:54] <Psi-Jack> it's starting recovery now.
[4:55] <sjustlaptop> Psi-Jack: rbd?
[4:55] <Psi-Jack> it's primarily used for RBD for virtual machines. As well as cephfs volumes for webservers, mail servers, storage, etc
[4:55] <sjustlaptop> ok
[4:55] <sjustlaptop> just curious
[4:55] <sjustlaptop> how many osds?
[4:56] <sjustlaptop> nodes?
[4:56] <Psi-Jack> total of 9 osds.
[4:56] <sjustlaptop> ok
[4:56] <sjustlaptop> how full is your cluser?
[4:56] <sjustlaptop> *cluster?
[4:56] <Psi-Jack> heh. not even close to full. :)
[4:56] <sjustlaptop> good
[4:56] <sjustlaptop> .
[4:56] <Psi-Jack> 1TB out of 5 TB total.
[4:56] <sjustlaptop> the results would be uncomfortable if losing osd 7 runs you over 85% or so
[4:57] <Psi-Jack> heh yeah, that's unlikely. it'll potentially strain the smaller OSDs but I crush-mapped it to put less weight on the smaller disks anyway.
[4:58] <sjustlaptop> ok
[4:58] <Psi-Jack> yeah, this'll take several hours for sure. :(
[4:58] <Psi-Jack> heh
[4:59] <sjustlaptop> do you have non-default settings for osd_recovery_max_active, osd_max_backfills, osd_recovery_op_priority, or osd_client_op_priority?
[4:59] <Psi-Jack> When I noticed it, it was already 1 hour degraded and recovering. At 21:39 it was around 17% degraded. By 22:34 it was down to around 13%
[4:59] <Psi-Jack> No.
[5:00] <sjustlaptop> you may be able to speed it up by raising osd_recovery_max_active to 15 (default 10)
[5:00] <sjustlaptop> at the cost possibly of increased (possibly severely so) client io latency
[5:00] <sjustlaptop> dumpling has some improvements which should improve recovery speed :P
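A sketch of the runtime bump being suggested; the value is just the example number from above, and the exact injectargs spelling has varied a little between releases:

    # raise recovery concurrency on all OSDs at runtime
    ceph tell osd.\* injectargs '--osd-recovery-max-active 15'
    # (older releases used: ceph osd tell \* injectargs '--osd-recovery-max-active 15')
    # to make it persistent, set it in the [osd] section of ceph.conf instead:
    #   osd recovery max active = 15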
[5:00] <Psi-Jack> Eh, I can wait.
[5:00] <sjustlaptop> k
[5:00] <Vjarjadian> so ceph doesnt like having an OSD added while it's rebalancing the cluster? even if it's at 0 weight?
[5:01] <Psi-Jack> I'll just let it run through the night and check it in the morn. :)
[5:01] <sjustlaptop> Vjarjadian: that's generally fine
[5:01] <sjustlaptop> why?
[5:01] <Psi-Jack> sjustlaptop: So, when this is recovered.. What's next?
[5:01] <sjustlaptop> Psi-Jack: you can wipe the osd 7 disk and create a new osd and add it in
[5:02] <sjustlaptop> or better yet, use a fresh disk
[5:02] <Psi-Jack> So, just delete /srv/ceph/osd/7/* and restart the process? :p
[5:02] <Vjarjadian> was just reading up... someone recommended waiting for the cluster to heal first...
[5:02] <sjustlaptop> you'd have to do --mkcephfs or something
[5:02] <sjustlaptop> oops
[5:02] <sjustlaptop> --mkfs or something
[5:02] <sjustlaptop> the docs have that part covered pretty well
[5:02] <sjustlaptop> Vjarjadian: if another failure or two occurs during the cluster rebuild he might need the data on osd 7
[5:02] <sjustlaptop> that's all
[5:02] <Psi-Jack> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
[5:03] <Psi-Jack> That, I presume?
[5:03] <sjustlaptop> looks right
[5:03] <sjustlaptop> Vjarjadian: he'd be wiping the disk and readding it
[5:03] <Vjarjadian> one reason to have an appropriate number of replicas. to withstand multiple failures
[5:03] <Psi-Jack> ceph-osd -i {osd-num} --mkfs --mkkey it seems, mostly.
[5:03] <sjustlaptop> Vjarjadian: true, but little reason to take chances
[5:03] <sjustlaptop> looks right
[5:04] <Psi-Jack> So, steps 7, and 8, then 9's already taken care of in the existing crush map.
[5:05] <sjustlaptop> you first need to fully remove the osd
[5:05] <sjustlaptop> and then fully add it back in
[5:05] <Psi-Jack> hmmm, i see.
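Roughly, the remove-then-re-add cycle being described here, following the add-or-rm-osds doc linked above; the osd id, weight, host name and paths are illustrative, and the crush add step can be skipped if the entry already exists in the map:

    # 1. fully remove the dead osd (it is already marked out)
    ceph osd crush remove osd.7
    ceph auth del osd.7
    ceph osd rm 7
    # 2. wipe/re-mkfs the disk, mount it, then create a fresh osd
    ceph osd create                       # should hand id 7 back if it is free
    ceph-osd -i 7 --mkfs --mkkey
    ceph auth add osd.7 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-7/keyring
    ceph osd crush add osd.7 1.0 host=node3
    # 3. start it and let backfill repopulate it
    service ceph start osd.7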
[5:06] * fireD (~fireD@93-139-170-207.adsl.net.t-com.hr) has joined #ceph
[5:08] * fireD_ (~fireD@93-142-199-217.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:08] <sjustlaptop> Psi-Jack: actually, can you tar/gzip up the current/omap directory on osd.7 and attach it to bug #5958
[5:08] <sjustlaptop> or upload it to cephdrop@ceph.com and mention the filename on that bug?
[5:08] <sjustlaptop> (sftp)
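In other words, something along these lines; the tarball name is made up and the paths are illustrative (Psi-Jack's osds live under /srv/ceph rather than the default):

    tar czf osd7-omap.tar.gz -C /var/lib/ceph/osd/ceph-7/current omap
    sftp cephdrop@ceph.com
    sftp> put osd7-omap.tar.gz
    sftp> bye
    # then mention the filename on bug #5958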
[5:09] <sjustlaptop> Psi-Jack: filesystem on osd 7?
[5:09] * julian (~julianwa@125.70.132.20) has joined #ceph
[5:09] <sjustlaptop> btrfs?
[5:12] * athrift (~nz_monkey@203.86.205.13) Quit (Remote host closed the connection)
[5:13] <sjustlaptop> Psi-Jack: ^
[5:13] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[5:14] * athrift (~nz_monkey@203.86.205.13) has joined #ceph
[5:16] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:18] * yy (~michealyx@122.233.46.26) has left #ceph
[5:24] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[5:27] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[5:32] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[5:36] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[5:40] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: REALITY.SYS Corrupted: Re-boot universe? (Y/N/Q))
[5:52] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:53] * Cube (~Cube@c-98-208-30-2.hsd1.ca.comcast.net) has joined #ceph
[6:05] * Cube (~Cube@c-98-208-30-2.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[6:10] * shoosah (~ssha637@en-279303.engad.foe.auckland.ac.nz) has joined #ceph
[6:12] <shoosah> I'm trying to install ceph, but when I want to modify the type of osd crush chooseleaf, I receive this error message: osd: command not found
[6:13] * shoosah (~ssha637@en-279303.engad.foe.auckland.ac.nz) Quit ()
[6:18] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[6:21] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[6:21] * john_barbee__ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) has joined #ceph
[6:22] * john_barbee___ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) has joined #ceph
[6:23] * shoosah (~ssha637@en-279303.engad.foe.auckland.ac.nz) has joined #ceph
[6:23] <shoosah> hello
[6:23] <sage> shoosah: what command did you type?
[6:24] <shoosah> sage I could solve the previous problem
[6:24] <shoosah> thanks btw
[6:25] * john_barbee____ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) has joined #ceph
[6:25] <shoosah> now my problem is that when I try to run this command: ceph-deploy mon create ceph-node, I receive pushy.protocol.proxy.ExceptionProxy: [Errno 2] No such file or directory: '/var/lib/ceph/mon/cep-node'
[6:26] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[6:26] * john_barbee_ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[6:27] <shoosah> and then I created the "mon" directory in /var/lib/ceph manually! and it works, but it doesn't create bootstrap-mds and bootstrap-osd as mentioned in http://ceph.com/docs/master/start/quick-ceph-deploy/
[6:27] * john_barbee_____ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) has joined #ceph
[6:27] <sage> cep-node or ceph-node?
[6:27] <sage> if you have to create the 'mon' directory it sounds like the ceph package wasn't installed? what distro?
[6:28] <sage> (or you ran ceph-deploy purgedata at some point)
[6:28] <shoosah> ceph-node, that's fine, I just substitute that with my server name, which is en-439--215-005
[6:28] <shoosah> I already purged it, and this is the second time that I've gone through the whole thing!
[6:29] <shoosah> but why does it work when I create the "mon" directory manually?!
[6:29] <Qu310> hi all, when setting up the journal for an osd, how do you go about using a device instead of a file?
[6:29] * john_barbee__ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[6:29] <Qu310> is it just "osd journal = /dev/sda1/"?
[6:31] <shoosah> sage : when I install ceph > ceph-deploy install --stable cuttlefish ceph-node, it doesn't echo "ok"! it actually doesn't print anything!
[6:31] <Qu310> or do you need to mount the disk as it can't be used "raw"?
[6:31] * john_barbee___ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[6:32] <sage> apt-get --reinstall install ceph and the dir will appear on its own
[6:32] <sage> Qu310: yes, without the trailing /
[6:32] <sage> Qu310: or, with a ceph-deploy-based cluster, ceph-deploy osd create HOST:DATADEV:JOURNALDEV
[6:33] <Qu310> sage: thanks, /facepalm
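So the two forms sage mentions, sketched with made-up host and device names:

    # mkcephfs / ceph.conf style: point the journal straight at a raw partition
    [osd.0]
        host = node1
        osd journal = /dev/sda1     # no trailing slash, no filesystem or mount needed

    # ceph-deploy style: data device and journal device in one step
    ceph-deploy osd create node1:/dev/sdb:/dev/sda1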
[6:33] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[6:33] * john_barbee____ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[6:33] <sage> shoosah: older ceph-deploy didn't print anything, but the newly released version prints lots of helpful stuff.. get the latest package or pip install ceph-deploy or git pull, whichever you prefer
[6:36] <yanzheng> sage, I think it's ok to access dentry->d_parent->d_inode as long as dentry->d_lock is locked
[6:36] <shoosah> sage : great ;) apt-get --reinstall install ceph actually created mon, bootrstrap-mds and bootstrap-osd
[6:37] <yanzheng> vfs never changes dentry->d_inode while the dentry is referenced
[6:37] <sage> yanzheng: a directory dentry won't get unlinked from the inode generally?
[6:37] <sage> ah, cool.
[6:37] <yanzheng> yes
[6:37] <sage> great. have you done any testing with that patch yet?
[6:38] <yanzheng> I'm still running the test
[6:38] <sage> k
[6:39] <yanzheng> did you see my wip-zfs update?
[6:40] * Rom (~Rom@c-107-3-156-152.hsd1.ca.comcast.net) has joined #ceph
[6:40] <shoosah> sage : but unfortunately bootstrap-mds and bootstrap-osd dont contain keyrings after entering this command : ceph-deploy mon create ceph-node
[6:41] <sage> shoosah: mon isn't forming quorum. do ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node.asok mon_status to see what it is doing
[6:41] <sage> yanzheng: not yet... wip-zfs in your tree on github?
[6:41] <Qu310> i'm using mkcephfs, i got this error , few lines of SG_IO: bad/missing sense data, sb[]: 70 00 blah blah is this anything to worry about? looksk like the osd was created ok
[6:42] <sage> Qu310: sounds like a bad disk to me
[6:43] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[6:44] <yanzheng> yes, https://github.com/ukernel/ceph.git wip-zfs
[6:46] <yanzheng> the main issue I encountered is that zfs' sync time becomes longer and longer over time.
[6:46] <yanzheng> eventually cause sync timeout and make osd kill itself
[6:46] <shoosah> sage what is "ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node.asok mon_status" all about?!
[6:47] <sage> yanzheng: weird.. are the old snapshots getting trimmed?
[6:47] <yanzheng> yes
[6:48] <sage> does the ceph_test_filestore_workloadgen thing trigger it? if so that's an easy thing to tell the zfs devs to run in their environment
[6:48] <shoosah> sage : I'm following the steps, so based on those the keyrings are supposed to be in bootstrap-mds and bootstrap-osd
[6:48] <sage> shoosah: pastebin teh output?
[6:48] <yanzheng> will try
[6:48] <shoosah> { "name": "en-439--215-005",
[6:48] <shoosah> "rank": 0,
[6:48] <shoosah> "state": "leader",
[6:48] <shoosah> "election_epoch": 2,
[6:48] <shoosah> "quorum": [
[6:48] <shoosah> 0],
[6:48] <shoosah> "outside_quorum": [],
[6:48] <shoosah> "extra_probe_peers": [],
[6:49] <shoosah> "monmap": { "epoch": 1,
[6:49] <shoosah> "fsid": "c7b4d58c-378f-4fae-a984-02aa4f0bc3a5",
[6:49] <shoosah> "modified": "0.000000",
[6:49] <shoosah> "created": "0.000000",
[6:49] <sage> shoosah: hmm so mon is okay. is there a ceph-create-keys process running?
[6:49] <shoosah> "mons": [
[6:49] <shoosah> { "rank": 0,
[6:49] <shoosah> "name": "en-439--215-005",
[6:49] <sage> what distro is this?
[6:49] <shoosah> "addr": "130.216.217.46:6789\/0"}]}}
[6:49] <shoosah> I havent tried that before
[6:49] <sage> oh wait.. did you say you did your install with mkcephfs?
[6:50] <Rom> I had the same issue... On 2 of my nodes those bootstrap keys didn't appear, but for some reason they were there on my 3rd node - so I just copied them to the other two.
[6:50] <sage> nm, that was Qu310
[6:50] <sage> if you run ceph-create-keys -n `hostname` it should create them. that is supposed to run when the mons start, though.. so restarting ceph-mon should also do the trick
[6:51] <shoosah> how to restart ceph-mon then?!
[6:51] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[6:51] <sage> service ceph restart mon, or restart ceph-mon-all (for non-ubuntu or ubuntu)
[6:52] <shoosah> Rom : that's not the right way I guess, coz I tried that before and got stuck somewhere else!
[6:52] <Qu310> sage: its a 100GB SSD behind a Perc 710p in raid0 :\
[6:53] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[6:54] <sage> yanzheng: that branch looks good. i'll pull it into ceph.git so that we can run the rados test suite against it
[6:54] <yanzheng> thanks
[6:55] <sage> it makes me curious how well nilfs2 would hold up
[6:55] <Rom> shoosah: I never had any issues. I used ceph-deploy create mon node1 node2 node3, then copied the keys from /var/lib/ceph/bootstrap-mds and /var/lib/ceph/bootstrap-osd to node1 and node2 from node3. Then I was able to do a ceph-deploy gatherkeys and create the osd's
[6:55] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[6:56] <sage> rom: that's a workaround, but if restarting the mon isn't creating the keys i'd like to figure out why
[6:56] <shoosah> Rom : my problem is that the keys are not in /var/lib/ceph/bootstrap-mds and /var/lib/ceph/bootstrap-osd
[6:56] <Qu310> sage: yep with mkcephfs
[6:57] <Rom> shoosah: I know, I was just commenting that there was an inconsistency. For me node1 and node2 didn't get them, but node3 did.
[6:57] * AfC (~andrew@2407:7800:200:1011:8c69:bbc4:9be:f05e) Quit (Quit: Leaving.)
[6:57] <sage> shoosah: did you try restarting the mon? does the ceph-create-keys process start? did the keys appear after that?
[6:57] * AfC (~andrew@2407:7800:200:1011:8c69:bbc4:9be:f05e) has joined #ceph
[6:57] <Rom> sage: agreed. It should get auto-created. A question - if ceph-create-keys runs on each monitor, don't they end up with different keys in bootstrap-mds and bootstrap-osd?
[6:58] <yanzheng> nilfs2 completely does not support xattr
[6:58] <yanzheng> I think ceph can't live with that
[6:58] <sage> the keys are created via the monitor, which forms a quorum first. it will either create a new key or (if present) get the existing one
[6:58] <sage> yanzheng: oh well
[6:59] <Rom> sage: ok, thought it must have tried something like that. thanks!
[7:01] <shoosah> sage: shall I try this command > ceph-create-keys -n 'hostname' ?
[7:02] <sage> shoosah: what happens when you restart the mon?
[7:02] <shoosah> ceph-mon-all start/running
[7:03] <shoosah> shall I try this command > ceph-create-keys -n 'hostname' ?
[7:03] <shoosah> after restarting?
[7:03] <sage> initctl list | grep ceph-create-keys
[7:04] <sage> shoosah: please do not run ceph-create-keys manually until we figure out why it isn't happening automatically
[7:04] <shoosah> alright
[7:05] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[7:05] <shoosah> alright, I just tried what u said and I got > ceph-create-keys stop/waiting
[7:06] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[7:08] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[7:08] <sage> are the keys there?
[7:08] <sage> if not, tail /var/log/upstart/ceph-create-keys.log to see what happened
[7:11] <madkiss> g'day sage
[7:12] <shoosah> INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
[7:12] <shoosah> INFO:ceph-create-keys:Key exists already: /etc/ceph/ceph.client.admin.keyring
[7:12] <shoosah> INFO:ceph-create-keys:Talking to monitor...
[7:12] <shoosah> 2013-08-14 16:54:04.254010 7fd2a08f7780 -1 unable to authenticate as client.admin
[7:12] <shoosah> 2013-08-14 16:54:04.254805 7fd2a08f7780 -1 ceph_tool_common_init failed.
[7:12] <shoosah> INFO:ceph-create-keys:Cannot get or create bootstrap key for osd, permission denied
[7:12] <shoosah> INFO:ceph-create-keys:Talking to monitor...
[7:12] <shoosah> 2013-08-14 16:54:04.263342 7fdb852f1780 -1 unable to authenticate as client.admin
[7:12] <shoosah> 2013-08-14 16:54:04.263782 7fdb852f1780 -1 ceph_tool_common_init failed.
[7:12] <shoosah> INFO:ceph-create-keys:Cannot get or create bootstrap key for mds, permission denied
[7:13] <shoosah> basically the permission is denied!
[7:13] <sage> cat /var/lib/ceph/mon/ceph-*/keyring
[7:14] <sage> oh
[7:14] <sage> try rm /etc/ceph/ceph.client.admin.keyring... i bet that is from an earlier attempt
[7:14] * KrisK (~krzysztof@213.17.226.11) has joined #ceph
[7:14] <sage> madkiss: hi!
[7:16] <shoosah> I just removed that
[7:16] <sage> try restart ceph-mon-all again?
[7:16] <shoosah> and this is the content of keyring :
[7:16] <shoosah> [mon.]
[7:16] <shoosah> key = AQCe/QpSAAAAABAAxt+WfCu9+WHb05fEw0tCxg==
[7:16] <shoosah> caps mon = "allow *"
[7:18] <sage> looks right. try restart ceph-mon-all and see if the keys appear
[7:18] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[7:18] <shoosah> yup, I just tried, it didnt work out :(
[7:18] <sage> what is in the upstart log now?
[7:19] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[7:19] <shoosah> INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
[7:19] <shoosah> INFO:ceph-create-keys:Key exists already: /etc/ceph/ceph.client.admin.keyring
[7:19] <shoosah> INFO:ceph-create-keys:Talking to monitor...
[7:19] <shoosah> 2013-08-14 16:54:04.254010 7fd2a08f7780 -1 unable to authenticate as client.admin
[7:19] <shoosah> 2013-08-14 16:54:04.254805 7fd2a08f7780 -1 ceph_tool_common_init failed.
[7:19] <shoosah> INFO:ceph-create-keys:Cannot get or create bootstrap key for osd, permission denied
[7:19] <shoosah> INFO:ceph-create-keys:Talking to monitor...
[7:19] <shoosah> 2013-08-14 16:54:04.263342 7fdb852f1780 -1 unable to authenticate as client.admin
[7:19] <shoosah> 2013-08-14 16:54:04.263782 7fdb852f1780 -1 ceph_tool_common_init failed.
[7:19] <shoosah> INFO:ceph-create-keys:Cannot get or create bootstrap key for mds, permission denied
[7:20] <sage> are you sure you removed /etc/ceph/ceph.client.admin.keyring ?
[7:20] <sage> it says it was already there
[7:20] <shoosah> let me check it again
[7:21] <sage> maybe remove that log file too so we can tell it's the output of the *new* attempt to start it that we're looking at
[7:21] <sage> then restart ceph-mon-all
[7:21] <sage> and make sure it doesn't still say the file already exists :)
[7:22] <shoosah> ssha637@en-439--215-005:/etc/ceph$ sudo rm ceph.client.admin.keyring
[7:22] <shoosah> rm: cannot remove `ceph.client.admin.keyring': No such file or directory
[7:23] <shoosah> I swear that I already removed it!
[7:28] <sage> ah. remove the upstart log, /var/log/upstart/ceph-create-keys*
[7:28] <sage> and then 'restart ceph-all'
[7:28] <sage> and see if ceph-create-keys runs (that log should reappear in /var/log/upstart)
[7:28] <shoosah> alright a moment
[7:32] <shoosah> after I removed ceph-create-keys.log, I restarted the mon, but ceph-create-keys.log is not in /var/log/upstart anymore!
[7:32] <sage> is the job running?
[7:33] <sage> initctl list | grep create
[7:33] <sage> this sounds a bit like upstart isn't starting the task
[7:33] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[7:33] <sage> ok i give up.. ceph-create-keys --id `hostname` should do the trick
[7:34] <shoosah> yesssss
[7:34] <shoosah> thanks :D
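Pulling the debugging steps from this exchange into one place (Ubuntu/upstart paths, as used above):

    # is upstart managing the key-creation task, and what happened on the last run?
    initctl list | grep ceph-create-keys
    tail /var/log/upstart/ceph-create-keys.log
    # a stale admin keyring from an earlier attempt blocks it; remove it and retry
    rm /etc/ceph/ceph.client.admin.keyring
    restart ceph-mon-all
    # last resort: run it by hand against the local monitor
    ceph-create-keys --id `hostname`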
[7:37] <shoosah> so where do u think the problem comes from?! is it because I already installed and purged ceph?
[7:38] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Quit: sprachgenerator)
[7:44] * yy-nm (~chatzilla@122.233.46.26) has joined #ceph
[7:45] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[7:47] * shoosah (~ssha637@en-279303.engad.foe.auckland.ac.nz) Quit (Quit: Konversation terminated!)
[7:57] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[7:57] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[8:01] <sage> dumpling is out!
[8:03] * hugokuo (~hugokuo@210.65.146.4) has joined #ceph
[8:03] <hugokuo> morning
[8:06] <hugokuo> I found that more concurrency reduces the performance of RadosGW. for 1KB objects with 100 concurrency: RadosGW can handle 1500 reqs/sec. But when the concurrency is 200, the result is 400 reqs/sec. Is there a way to optimize RadosGW for higher concurrency ?
[8:07] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[8:07] * ChanServ sets mode +o scuttlemonkey
[8:11] <yanzheng> how many osds
[8:12] <yy-nm> hey, all. i have a question about bufferlist: is it a class? a struct? or a template? i can't find the file that defines it
[8:12] <lxo> hey, the ceph-release fc18 rpm for dumpling points to the cuttlefish repos. that can't be right
[8:14] <yanzheng> yy-nm, src/common/buffer.cc
[8:14] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Read error: Operation timed out)
[8:14] <lxo> congrats on the new stable release, while at that. just in time! I'm just about done verifying that my cluster recovered correctly from a double-disk failure
[8:15] <lxo> to fix the ceph-release error, I've run this command: sed -i 's,cuttlefish,dumpling,;s,18,$releasever,' /etc/yum.repos.d/ceph.repo
[8:15] * davidzlap1 (~Adium@ip68-5-239-214.oc.oc.cox.net) has joined #ceph
[8:15] <lxo> it is also future-proof, for I'm planning on upgrading to fc19 in the near future
[8:19] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[8:19] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[8:20] <yy-nm> yanzheng, but i can't find the bufferlist keyword in src/common/buffer.cc
[8:21] <yanzheng> src/include/buffer.h
[8:21] * davidzlap (~Adium@ip68-5-239-214.oc.oc.cox.net) Quit (Read error: Operation timed out)
[8:24] <yy-nm> yanzheng, thank you, i just found i was using the wrong parameter with grep
[8:28] * odyssey4me (~odyssey4m@41.13.224.216) has joined #ceph
[8:28] <sage> yy-nm: there's a typedef buffer::list bufferlist; somewhere
[8:28] <yy-nm> sage, again: i found it with yanzheng's help
[8:29] <yy-nm> thanks again
[8:31] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[8:32] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[8:45] * AfC (~andrew@2407:7800:200:1011:8c69:bbc4:9be:f05e) Quit (Ping timeout: 480 seconds)
[8:47] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:48] * odyssey4me (~odyssey4m@41.13.224.216) Quit (Ping timeout: 480 seconds)
[8:48] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[8:52] * yy-nm (~chatzilla@122.233.46.26) Quit (Read error: Connection reset by peer)
[8:53] * yy-nm (~chatzilla@122.233.46.26) has joined #ceph
[8:54] * john_barbee_____ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[8:55] * capri (~capri@212.218.127.222) has joined #ceph
[8:57] * john_barbee_ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) has joined #ceph
[8:57] * JM (~oftc-webi@193.252.138.241) has joined #ceph
[9:01] * phoenix (~phoenix@vpn1.safedata.ru) has joined #ceph
[9:01] <phoenix> hello all
[9:02] <phoenix> i have a little problem, maybe someone knows the answer
[9:02] <hugokuo> yanzheng, 2 nodes, each with 10 OSDs. Each osd maps to one drive. The journals are isolated on a single SSD
[9:03] <hugokuo> yanzheng, s/2 nodes/3 nodes/
[9:03] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Quit: Leaving.)
[9:04] <phoenix> I set up a ceph cluster and ran into a 1TB file size limitation. what config variable do I set to work around this limitation?
[9:05] <phoenix> setting mds max file size = in the mds section doesn't work
[9:06] * mschiff (~mschiff@pD9510E87.dip0.t-ipconnect.de) has joined #ceph
[9:11] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[9:14] <phoenix> does anyone here know how to solve this problem?
[9:18] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[9:20] * john_barbee_ (~jbarbee@c-98-220-74-174.hsd1.in.comcast.net) Quit (Remote host closed the connection)
[9:20] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[9:25] * hugo_kuo (~hugokuo@75-101-56-159.dsl.static.sonic.net) has joined #ceph
[9:28] * kenneth (~kenneth@202.60.8.252) Quit (Remote host closed the connection)
[9:32] * hugokuo (~hugokuo@210.65.146.4) Quit (Ping timeout: 480 seconds)
[9:35] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[9:38] * tnt (~tnt@91.176.3.64) Quit (Ping timeout: 480 seconds)
[9:38] <phoenix> anybody here?
[9:41] <liiwi> might be bit quiet at this hour
[9:41] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[9:42] <yanzheng> limitation of file size in cephfs?
[9:43] <phoenix> yep
[9:43] <phoenix> 1tb
[9:44] <phoenix> what do I set in the config to work around this limitation
[9:49] * dpippenger1 (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[9:51] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:55] <yanzheng> phoenix, set mds max_file_size in mds section of ceph.conf
[9:55] <yanzheng> then recreate the fs
[9:55] <phoenix> i set
[9:55] <Kioob> hi
[9:55] <Kioob> Aug 14 09:55:10 alg kernel: [ 5289.787424] libceph: osd66 10.0.0.10:6809 socket error on read
[9:55] <Kioob> is there a way to get more information about that network error ?
[9:56] <yanzheng> phoenix, you need to recreate the fs
[9:56] <yanzheng> ceph mds newfs ...
[9:57] <yanzheng> the limitation is stored in the monitor, and so far there is no command to change it
[9:58] <phoenix> i tried: [mds.a]
[9:58] <phoenix> host = namenode1
[9:58] <phoenix> mds max file size = 1ULL << 44, and also tried mds max file size = 16000000, and then re-created the file system: mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.keyring
[9:58] <phoenix> and still have the 1tb limitation
[9:58] * odyssey4me (~odyssey4m@41.4.103.70) has joined #ceph
[9:58] * yy-nm (~chatzilla@122.233.46.26) Quit (Read error: Connection reset by peer)
[9:59] <phoenix> can you give an example for a 255 terabyte limit? Thank you and sorry for my English.
[10:00] * yy-nm (~chatzilla@122.233.46.26) has joined #ceph
[10:04] * odyssey4me2 (~odyssey4m@165.233.71.2) has joined #ceph
[10:06] * odyssey4me (~odyssey4m@41.4.103.70) Quit (Ping timeout: 480 seconds)
[10:06] <huangjun> yanzheng: what boost version does ceph require?
[10:07] <yanzheng> AC_MSG_FAILURE(["Can't find boost statechart headers; need 1.34 or later"]))
[10:07] <huangjun> thanks
[10:10] <phoenix> huangjun: can you give an example of a 255 terabyte limit in the config file? Thank you and sorry for my English.
[10:11] <indego> I just installed Cuttlefish yesterday for my first ceph experience (Stock Debian 7). Went OKish. Just upgraded to Dumpling and had the following error performing the debian package update: Compiling /usr/lib/pymodules/python2.6/ceph_deploy/test/test_cli.py ...
[10:11] <indego> SyntaxError: ('invalid syntax', ('/usr/lib/pymodules/python2.6/ceph_deploy/test/test_cli.py', 44, 26, ' assert {p.basename for p in tmpdir.listdir()} == set()\n'))
[10:11] <yanzheng> mds max_file_size = 280375465082880
[10:11] <yanzheng> put above line to mds section of ceph.conf
[10:11] <yanzheng> restart monitor
[10:11] <yanzheng> recreate the fs
[10:12] <phoenix> thx
[10:12] <phoenix> i ty
[10:12] <phoenix> i try
[10:15] * dosaboy (~dosaboy@faun.canonical.com) has joined #ceph
[10:16] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[10:16] * ChanServ sets mode +v andreask
[10:18] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit ()
[10:18] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[10:18] * ChanServ sets mode +v andreask
[10:22] <Kioob> mmm, if "rbd map XXX" hangs: no response, no logs in ceph/client.$HOSTNAME.log, nor in kernel.log, what can the problem be?
[10:26] <odyssey4me2> Dumpling is released?
[10:26] <odyssey4me2> I see that packages are available - with a date stamp of yesterday.
[10:29] <Kioob> udevadm settle - timeout of 120 seconds reached, the event queue contains:
[10:29] <Kioob> /sys/devices/virtual/block/rbd1 (3298)
[10:30] <yanzheng> phoenix, sorry, you should put that line to the mon section
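Putting yanzheng's answer together in one sketch; the size is the ~255TB example from above, the pool ids are placeholders, and mds newfs wipes the existing cephfs metadata, so treat this as illustrative only:

    # ceph.conf, read by the monitors (per the correction above), then restart them:
    [mon]
        mds max file size = 280375465082880
    # recreate the filesystem so the new limit is picked up (destructive!):
    ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it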
[10:30] <absynth> odyssey4me2: yeah, the announcement hit the mailing list already
[10:30] <indego> odyssey4me2, it is on the homepage
[10:31] <indego> I guess it is a good time to start experimenting with ceph ;)
[10:34] * bergerx_ (~bekir@78.188.101.175) has joined #ceph
[10:39] <niklas> Hi. Why do my osds shutdown randomly?
[10:39] <niklas> I just noticed three of them being down and out. Their log says:
[10:39] <niklas> 2013-08-13 19:10:02.050134 7fb570ae6700 -1 osd.85 611 *** Got signal Terminated ***
[10:40] <niklas> 2013-08-13 19:10:02.123989 7fb57bb96700 0 -- 192.168.181.27:6849/7251 >> 192.168.181.25:6833/21506 pipe(0x1dc82a00 sd=92 :6849 s=0 pgs=0 cs=0 l=0 c=0x13316c60).accept connect_seq 1240 vs existing 1239 state standby
[10:40] <niklas> 2013-08-13 19:10:02.877100 7fb570ae6700 -1 osd.85 611 shutdown
[10:40] <niklas> 2013-08-13 19:10:02.878886 7fb570ae6700 20 osd.85 611 kicking pg 5.144c
[10:40] <niklas> After that it kicks pgs for 2 minutes and then shuts down…
[10:40] <absynth> SIGTERM could be a segmentation fault
[10:40] <absynth> look on the osd box and check for corefiles or something
[10:41] <niklas> where would I search?
[10:42] <absynth> good question, dunno where the core files would usually be
[10:42] <niklas> those three osds are on different hosts
[10:42] <absynth> either in /
[10:42] <absynth> or the osd dir
[10:42] <absynth> you should look on osd.85
[10:45] * LeaChim (~LeaChim@176.248.81.121) has joined #ceph
[10:45] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[10:46] * raso (~raso@deb-multimedia.org) Quit (Quit: WeeChat 0.3.8)
[10:47] <niklas> also I get loads of these in my kernel.log:
[10:47] <niklas> http://pastebin.com/UGawLSYy
[10:48] <niklas> Always in the evening over the last few days. The three osds shut down in the evening (18:10, 18:30 and 19:10), too
[10:48] <niklas> but they did so on the 12th and the 23th
[10:48] <niklas> *13th
[10:50] * raso (~raso@deb-multimedia.org) has joined #ceph
[10:53] <Rom> Any idea what the system resources looked like at those times? memory and swap usage especially..
[10:55] <yanzheng> fixed by http://oss.sgi.com/archives/xfs/2013-03/msg00409.html
[10:55] <absynth> you have an issue with xfs and ceph-osd
[10:56] <absynth> exactly, what yanzheng said
[11:00] <odyssey4me2> absynth, indego - I see so. Awesome. :)
[11:03] <niklas> yanzheng: that was meant for me?
[11:04] <niklas> This patch is from Mar, 14. how come it is not fixed in testing?
[11:04] * yanzheng (~zhyan@jfdmzpr01-ext.jf.intel.com) Quit (Quit: Leaving)
[11:05] <shimo> niklas: looks like the patch is for xfs
[11:06] <shimo> not ceph
[11:06] <niklas> oh
[11:06] <niklas> well
[11:08] <niklas> thanky
[11:08] <niklas> Could that be the reason for my osds shutting down?
[11:10] <shimo> well others who know much more than i do seem to think so
[11:11] <niklas> ok…
[11:17] * odyssey4me2 (~odyssey4m@165.233.71.2) Quit (Ping timeout: 480 seconds)
[11:28] * allsystemsarego (~allsystem@5-12-241-157.residential.rdsnet.ro) has joined #ceph
[11:33] * saabylaptop (~saabylapt@2a02:2350:18:1010:bc4f:516d:8111:16d7) has joined #ceph
[11:35] * Rom (~Rom@c-107-3-156-152.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[11:45] <jtang> good morning
[11:51] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[11:56] * infinitytrapdoor (~infinityt@134.95.27.132) has joined #ceph
[11:59] * yy-nm (~chatzilla@122.233.46.26) Quit (Quit: ChatZilla 0.9.90.1 [Firefox 23.0/20130730113002])
[12:05] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:13] * nhm (~nhm@65-128-168-138.mpls.qwest.net) Quit (Read error: Operation timed out)
[12:14] * nhm (~nhm@67-220-20-222.usiwireless.com) has joined #ceph
[12:14] * kyann (~kyann@did75-15-88-160-187-237.fbx.proxad.net) Quit (Quit: HydraIRC -> http://www.hydrairc.com <- *I* use it, so it must be good!)
[12:15] * hugo__kuo (~hugokuo@210.65.146.4) has joined #ceph
[12:15] * huangjun (~kvirc@221.234.157.115) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[12:18] * Tamil (~tamil@38.122.20.226) Quit (Read error: Connection reset by peer)
[12:21] * hugo_kuo (~hugokuo@75-101-56-159.dsl.static.sonic.net) Quit (Ping timeout: 480 seconds)
[12:24] <loicd> morning
[12:29] <loicd> what does (PENDING 9) means in this context : http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-precise-amd64-gcov/#origin/wip-5510 ?
[12:35] <paravoid> sage: note for next time: Debian version should have been 0.67~rc3-1 to ease upgrades (1 > 1~1)
[12:36] * hugo__kuo (~hugokuo@210.65.146.4) Quit (Quit: ??)
[12:37] <Psi-Jack> Fun fun..
[12:38] <Psi-Jack> sjust: And no.. Not btrfs. XFS
[12:40] <loicd> paravoid: man deb-version is an interesting read, indeed :-)
[12:41] * odyssey4me (~odyssey4m@165.233.71.2) has joined #ceph
[12:55] * yanzheng (~zhyan@101.82.237.93) has joined #ceph
[12:59] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[12:59] * ChanServ sets mode +v andreask
[13:03] <Psi-Jack> sjust: Hmmm.. that omap is like 441M...
[13:08] <Psi-Jack> OKay. How am I supposed to drop a file for a bug to cephdrop@ceph,com?
[13:09] <dwm> paravoid: Ah, was about to report that myself.
[13:19] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[13:22] <Psi-Jack> Oi... Rebuilding osd.7 now, but I did take that current/omap tarball for the bug report. Just can't put it anywhere.
[13:25] * skatteola_2 (~david@c-0784e455.16-0154-74657210.cust.bredbandsbolaget.se) has joined #ceph
[13:26] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[13:27] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[13:27] <jtang> just upgraded a test cluster to dumpling from cuttlefish
[13:27] <jtang> so far nothing has exploded
[13:29] <paravoid> I've been using rc2/rc3 for a while, I upgraded to final 0.67 today
[13:30] <paravoid> everything's okay
[13:32] * skatteola (~david@c-0784e455.16-0154-74657210.cust.bredbandsbolaget.se) Quit (Ping timeout: 480 seconds)
[13:33] * Josh_ (~IceChat9@rrcs-74-218-204-10.central.biz.rr.com) has left #ceph
[13:33] * Josh_ (~IceChat9@rrcs-74-218-204-10.central.biz.rr.com) has joined #ceph
[13:33] <Josh_> if the osd does not appear in initctl list | grep ceph, yet ceph -s says it is up and in, is it really up or down?
[13:34] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[13:35] <Psi-Jack> Hmmm, okaaay.... So I wiped my osd.7 that was failing due to a likely bug in ceph, I removed the osd from the crushmap and out of the ceph cluster entirely, now re-adding it back in, I'm failing on "ceph auth add osd.7 osd 'allow *' mon 'allow rwx' -i /srv/ceph/osd/ceph-7/keyring" because the keyring doesn't exist.
[13:35] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[13:35] <Psi-Jack> Oh d'uh. I see where it is.
[13:37] <loicd> ccourtaut: do you know of a tool that would allow me to match code coverage information with a diff ? The general idea being that I'd like to make sure all the lines modified by the diff were run at least once.
[13:37] * sha (~kvirc@81.17.168.194) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[13:38] <ccourtaut> loicd: no i don't, sorry
[13:39] <Josh_> Every morning it seems something turns off all of my OSDs and the logs show this..... ** ERROR: error converting store /var/lib/ceph/osd/ceph-6: (16) Device or resource busy
[13:39] <loicd> ccourtaut: I suspect tools around clang might do something smart like that ;-)
[13:40] * yanzheng (~zhyan@101.82.237.93) Quit (Ping timeout: 480 seconds)
[13:41] <Psi-Jack> Well.... This is cool. So far, got the osd.7 back in place, and it's rebuilding.
[13:41] * yanzheng (~zhyan@101.83.112.13) has joined #ceph
[13:52] * zhyan_ (~zhyan@101.82.117.163) has joined #ceph
[13:55] * Psi-Jack (~Psi-Jack@psi-jack.user.oftc.net) Quit (Ping timeout: 480 seconds)
[13:56] <Josh_> How can I clear an OSD completely so that I can re add it?
[13:58] * yanzheng (~zhyan@101.83.112.13) Quit (Ping timeout: 480 seconds)
[14:00] <loicd> ccourtaut: valgrind does not complain about wip-5510 but about tcmalloc ;-) using flavor notcmalloc now
[14:01] <ccourtaut> :)
[14:04] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[14:05] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:09] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Read error: Connection reset by peer)
[14:09] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[14:09] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[14:09] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:10] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:11] * Josh_ (~IceChat9@rrcs-74-218-204-10.central.biz.rr.com) Quit (Quit: Don't push the red button!)
[14:12] * Psi-Jack (~psi-jack@psi-jack.user.oftc.net) has joined #ceph
[14:13] <Psi-Jack> Blah.
[14:14] <Psi-Jack> So, my osd is rebuilding and took down my whole network in the process, so far, so I have no further access to it, till I get back home. :/
[14:14] * Josh (~IceChat9@rrcs-74-218-204-10.central.biz.rr.com) has joined #ceph
[14:14] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[14:16] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[14:17] * Josh (~IceChat9@rrcs-74-218-204-10.central.biz.rr.com) Quit ()
[14:19] * ChoppingBrocoli (~quassel@rrcs-74-218-204-10.central.biz.rr.com) has joined #ceph
[14:20] <indego> Has anyone heard of or used these (journal)?: ACARD Dynamic SSD RAM Disk made from DRAM memory modules - ANS-9010 - http://www.acard.com/english/fb01-product.jsp?idno_no=270&prod_no=ANS-9010&type1_idno=5&ino=28
[14:21] <loicd> {duration: 354.1877238750458, flavor: notcmalloc, owner: loic@dachary.org, success: true}
[14:21] <loicd> INFO:teuthology.run:pass
[14:21] <loicd> ccourtaut: ^ \o/
[14:21] <ccourtaut> \o/
[14:21] * loicd trying again with more iterations to increase the odds of a problem occuring
[14:21] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[14:22] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[14:24] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:27] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[14:27] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:27] * dosaboy (~dosaboy@faun.canonical.com) Quit (Read error: Connection reset by peer)
[14:29] <Kioob> Which version of Linux is recommended for use with Kernel RBD client ?
[14:29] <Psi-Jack> The latest, usually. :)
[14:29] <Kioob> (here both 3.10.* and 3.9.* throw some panics)
[14:29] <kraken> http://i.imgur.com/H7PXV.gif
[14:31] <ChoppingBrocoli> Kioob: How do you 100% remove an osd from the cluster?
[14:32] <Psi-Jack> ChoppingBrocoli: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
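For reference, the procedure on that page boils down to roughly the following; osd.N is a placeholder, and this is only a sketch of the documented steps, not a replacement for them:

    ceph osd out N                  # stop placing new data on it and let the cluster rebalance
    sudo service ceph stop osd.N    # once rebalanced, stop the daemon on its host
    ceph osd crush remove osd.N     # drop it from the CRUSH map
    ceph auth del osd.N             # delete its authentication key
    ceph osd rm N                   # remove it from the OSD map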
[14:32] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[14:34] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[14:36] <ChoppingBrocoli> Would it be dumb to make a new cluster using mkcephfs? Using ceph-deploy my osds just won't stay up
[14:36] <alfredodeza> mkcephfs is very old and does not have all the fixes of ceph-deploy
[14:37] <alfredodeza> ChoppingBrocoli: what version of ceph-deploy are you using?
[14:37] <ChoppingBrocoli> The one before last Friday, waiting for a deb
[14:37] <alfredodeza> ChoppingBrocoli: debs and RPMs are out
[14:37] <alfredodeza> and the Python package too
[14:38] <alfredodeza> ChoppingBrocoli: do you need links for the repos?
[14:38] <ChoppingBrocoli> please
[14:39] <alfredodeza> ceph-deploy repos are http://ceph.com/packages/ceph-extras/
[14:39] <ChoppingBrocoli> I have no idea what is going on, but osds turn off every night (the process just stops running), with the log claiming a locking error. Will this fix that?
[14:42] <ChoppingBrocoli> version 1.0-1?
[14:42] <alfredodeza> no, 1.2
[14:42] <alfredodeza> I am not sure if this will fix that ChoppingBrocoli, but for sure, we have a *wealth* of bug fixes in it
[14:43] <ChoppingBrocoli> how can I tell what version I have?
[14:43] <alfredodeza> great question
[14:43] <ChoppingBrocoli> there is no command?
[14:44] <alfredodeza> I am pretty sure you cannot tell from ceph-deploy itself as of the last release; I just fixed that :)
[14:44] <alfredodeza> however
[14:44] <alfredodeza> you can certainly tell the version from your package manager
[14:44] * dosaboy (~dosaboy@faun.canonical.com) has joined #ceph
[14:44] <alfredodeza> yum/apt
[14:44] <loicd> alfredodeza: good morning :-)
[14:44] <alfredodeza> morning loicd
[14:45] <ChoppingBrocoli> ubuntu
[14:47] <ChoppingBrocoli> my source looks like this deb http://ceph.com/packages/ceph-extras/debian/ raring main
[14:47] * loicd wrote down notes about using teuthology & valgrind : http://dachary.org/?p=2223 . Nothing too complicated but not easy to guess.
[14:47] <ChoppingBrocoli> package manager says 1.0-1 even though it did update...
[14:47] <alfredodeza> ChoppingBrocoli: dpkg -l | grep ceph-deploy
[14:48] <ChoppingBrocoli> ii ceph-deploy 1.0-1 all Ceph-deploy is an easy to use configuration tool
[14:48] <alfredodeza> try removing+purging?
[14:48] <alfredodeza> right
[14:48] <alfredodeza> so 1.0-1 seems old to me
[14:48] <ChoppingBrocoli> it won't take down the cluster, correct?
[14:48] <loicd> ChoppingBrocoli: apt-cache policy ceph-deploy ?
[14:50] <loicd> http://ceph.com/packages/ceph-extras/debian/pool/main/c/ceph-deploy/ alfredodeza only has one .deb
[14:51] <alfredodeza> welp
[14:51] <alfredodeza> that doesn't look good
[14:51] * alfredodeza opens a ticket
[14:51] <alfredodeza> ChoppingBrocoli: what do you mean? by removing it?
[14:51] <alfredodeza> ceph-deploy is action driven
[14:52] <loicd> $ apt-cache policy ceph-deploy
[14:52] <loicd> ceph-deploy:
[14:52] <loicd> Installed: (none)
[14:52] <loicd> Candidate: 1.0-1
[14:52] <loicd> Version table:
[14:52] <loicd> 1.0-1 0
[14:52] <loicd> 500 http://ceph.com/packages/ceph-extras/debian/ raring/main amd64 Packages
[14:53] <ChoppingBrocoli> ceph-deploy:
[14:53] <ChoppingBrocoli> Installed: 1.0-1
[14:53] <ChoppingBrocoli> Candidate: 1.0-1
[14:53] <ChoppingBrocoli> Version table:
[14:53] <ChoppingBrocoli> *** 1.0-1 0
[14:53] <ChoppingBrocoli> 500 http://ceph.com/packages/ceph-extras/debian/ raring/main amd64 Packages
[14:53] <ChoppingBrocoli> 100 /var/lib/dpkg/status
[14:53] <alfredodeza> ChoppingBrocoli: yep, don't do anything just yet, I am going to create a ticket for the lack of the 1.2 deb
[14:53] <ChoppingBrocoli> ok
[14:53] <alfredodeza> *or* if you are familiar with Python install tools, you could go that way too
[14:53] <ChoppingBrocoli> once I get the new one, should I delete and re add all of my osds?
[14:54] <ChoppingBrocoli> never tried python before
[15:02] <ChoppingBrocoli> also if I have 4 disks per host, should I raid them and add 1 osd or should I leave them as is and add 4 osds?
[15:03] * berant (~blemmenes@gw01.ussignalcom.com) has joined #ceph
[15:04] <indego> OK, I have a 2 node ceph cluster. Been messing with it a little and now want to try to do something approaching what I intend, virtualization. I have the cluster replication via a cross-cable, and I have installed the DEB ceph-common on a 3rd host. If I try to perform a "rbd map" from that host I get "rbd: add failed: (5) Input/output error". What do I need to access a RBD/OSD from a non-cluster member (or do all hosts need to be in the cluster)?
[15:05] <ChoppingBrocoli> indego: make sure the client is on the same network as the mon, give client key and ceph.conf I believe
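A minimal sketch of what the client side needs, assuming the admin keyring for simplicity and an existing pool/image (host and image names are placeholders):

    # copy the cluster config and a keyring from a mon host to the client
    scp mon-host:/etc/ceph/ceph.conf /etc/ceph/
    scp mon-host:/etc/ceph/ceph.client.admin.keyring /etc/ceph/
    rbd map rbd/test-image          # should now reach the mons over the public network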
[15:06] * dosaboy_ (~dosaboy@faun.canonical.com) has joined #ceph
[15:07] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[15:07] <indego> ChoppingBrocoli, OK, thanks. I can separate the OSD network from the RBD network however, correct? I see the mon only on my 'OSD' network...
[15:08] <ChoppingBrocoli> yes cluster network and public network
[15:08] * dosaboy (~dosaboy@faun.canonical.com) Quit (Ping timeout: 480 seconds)
[15:08] <ChoppingBrocoli> indego: Are you sure you want to use crossover? You will limit future growth
[15:09] <indego> That is what I am trying to get right in my head. Is the 'public network' my virt servers and the cluster network just for OSD replication? Then I will have my 'real' public network that is the clients talking to the virt servers.
[15:10] <indego> ChoppingBrocoli, the x-over is just for a quick test. Will be a switch soon.
[15:11] <ChoppingBrocoli> indego: sounds like you want 3 networks? 1 for osd replication, 1 for Virt --to--> cluster public network 1 for clients ---to--> virt hosts
[15:12] <indego> ... or is the cluster network for *all* cluster traffic, including RBD mounts and the public network for everything else.
[15:14] <indego> If the latter, and I export OSDs from a virt machine, then I have more write overhead, no? I thought the idea was to split the OSD writing from the RBD access.
[15:14] <alfredodeza> ChoppingBrocoli: I've opened ticket 5960
[15:14] <ChoppingBrocoli> alfredodeza: thank you
[15:15] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[15:15] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[15:15] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[15:16] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[15:16] <alfredodeza> ah, the bot died
[15:16] <alfredodeza> opened ticket 5960
[15:16] <kraken> alfredodeza might be talking about: http://tracker.ceph.com/issues/5960
[15:16] <alfredodeza> there ^ ^
[15:17] <ChoppingBrocoli> indego: Me personally (not sure if this is best or not) I took 2 switches and stacked them using a 40GB interface, then every server hosting osds I bonded mode 4 (as many interfaces as you can). Then I took my hosts and bonded 2 interfaces mode 4. That was my storage network, all like that. Then for clients I used 2 other adapters on my virt host and bonded to my client vlan. This is tolerant against switch failures and, thanks to
[15:17] <ChoppingBrocoli> ceph, server failures
[15:19] <Psi-Jack> Well, this is really frustrating. my servers are all still offline. Seems like lunch time will be a time to drop by the house to see what's up.
[15:19] <ChoppingBrocoli> Remember to configure switches for mode 4 also, that way you can split the bond between both switches without worrying about STP
[15:19] * skatteola_2 is now known as skatteola
[15:20] <ChoppingBrocoli> For dell switches (don't laugh yes dell) console(config-if)#channel-group 1 mode auto
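For the host side of such a mode 4 (802.3ad/LACP) bond, a minimal Ubuntu ifenslave sketch, with placeholder interfaces and address:

    auto bond0
    iface bond0 inet static
        address 10.0.0.11
        netmask 255.255.255.0
        bond-mode 802.3ad
        bond-miimon 100
        bond-slaves eth0 eth1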
[15:21] <indego> ChoppingBrocoli, OK, thanks for the input. Your osds are on the same vlan as the virt servers?
[15:21] <ChoppingBrocoli> yes
[15:21] <indego> can you have a monitor listening on multiple IPs?
[15:22] <ChoppingBrocoli> Not sure I think no, anyone else know on this one?
[15:22] <indego> if not, I will need to re-jig things.
[15:22] <indego> this seems to reference 3 independent networks: http://ceph.com/docs/master/install/hardware-recommendations/#networks
[15:23] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Quit: sprachgenerator)
[15:24] <ChoppingBrocoli> osd virt client
[15:25] <ChoppingBrocoli> ?
[15:25] <indego> I was wanting the OSD network to be the x-over for the moment and then the other 1G NIC for the 'client'. I was trying to access a RBD device via my desktop system as a test and stick a KVM image on the rbd.
[15:26] * jeff-YF (~jeffyf@67.23.117.122) has joined #ceph
[15:26] <ChoppingBrocoli> and just to be sure, you have the ceph.conf and key in /etc/ceph?
[15:28] <indego> not yet, will copy but I see "libceph: mon0 10.0.0.133:6789 connection failed" and my OSD network is a different subnet. The MON is listening only on the other subnet address...
[15:28] <ChoppingBrocoli> In your ceph.conf what ip is listed for mon_host?
[15:28] <indego> mon_host = 10.110.0.12
[15:29] <ChoppingBrocoli> what is your desktop ip?
[15:29] <indego> different subnet
[15:29] <ChoppingBrocoli> then you will not be able to connect
[15:30] * deadsimple (~infinityt@134.95.27.132) has joined #ceph
[15:30] * KrisK (~krzysztof@213.17.226.11) Quit (Quit: KrisK)
[15:30] <indego> hence the question of the network topology. My understanding, from the docs, was that it was recommended to split the OSD replication network from the rest for performance.
[15:31] <indego> If the mon is listening only on the OSD network then how does this work...
[15:31] <ChoppingBrocoli> correct, but you still need a public network for clients
[15:31] <Gugge-47527> mon should listen on public network only
[15:31] <Gugge-47527> and osd's should listen on both public and private network
[15:31] <Gugge-47527> the private is only used for osd to osd communication
[15:32] <indego> Gugge-47527, OK, got it. Is that specified anywhere, which IPs the OSDs communicate on?
[15:32] * nhm_ (~nhm@174-20-36-25.mpls.qwest.net) has joined #ceph
[15:32] <Gugge-47527> yes, it is the private network :)
[15:33] <loicd> alfredodeza: have you tried extracting code coverage from teuthology? I'm puzzled and starting to obsess about this although I know I should not ;-)
[15:33] <Gugge-47527> "cluster network = "
[15:33] <indego> OK, but *how* do the OSDs know which network is which?
[15:33] <Gugge-47527> http://ceph.com/docs/master/rados/configuration/network-config-ref/
[15:33] <indego> Gugge-47527, OK, missed that one.
[15:33] <Gugge-47527> by the "public network = " and the "cluster network = "
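A minimal ceph.conf sketch of that split, with placeholder subnets:

    [global]
        public network  = 10.110.0.0/24   # mons, clients and the OSD front side
        cluster network = 10.0.0.0/24     # OSD-to-OSD replication only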
[15:33] <alfredodeza> loicd: I have not
[15:34] <loicd> alfredodeza: do you know someone who heard about someone who saw someone do it ? ;-)
[15:35] * zhyan_ (~zhyan@101.82.117.163) Quit (Ping timeout: 480 seconds)
[15:35] <indego> Time for a little reading and re-config - thanks all
[15:37] * infinitytrapdoor (~infinityt@134.95.27.132) Quit (Ping timeout: 480 seconds)
[15:37] * nhm (~nhm@67-220-20-222.usiwireless.com) Quit (Read error: Operation timed out)
[15:41] <alfredodeza> loicd: I will ping someone today and let you know
[15:41] <alfredodeza> I also saw your email :)
[15:41] <loicd> alfredodeza: thanks :-)
[15:45] * alram (~alram@208.86.100.62) has joined #ceph
[15:47] * jeff-YF_ (~jeffyf@67.23.117.122) has joined #ceph
[15:51] * jeff-YF (~jeffyf@67.23.117.122) Quit (Ping timeout: 480 seconds)
[15:51] * jeff-YF_ is now known as jeff-YF
[15:52] * zhyan_ (~zhyan@101.82.225.145) has joined #ceph
[15:56] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) has joined #ceph
[16:00] * markbby (~Adium@168.94.245.2) has joined #ceph
[16:03] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[16:04] <ofu_> Dumpling!
[16:04] <absynth> http://blogs.villagevoice.com/food/IMG_1239.JPG
[16:04] <absynth> nom nom!
[16:06] <indego> how can I change the mon address specified in ceph-deploy?
[16:06] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[16:07] <alfredodeza> indego: what do you mean by mon address?
[16:07] <indego> the/a monitor address
[16:07] <alfredodeza> you can 'map' a name to a FQDN
[16:08] <indego> yes, sorry, was blind in the config. mon_host AND mon_initial_members
[16:14] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[16:18] <loicd> ccourtaut: in the context of http://tracker.ceph.com/issues/5878 do you have suggestions regarding the simplest way to handle dynamical load of plugins ?
[16:18] * clayb (~kvirc@199.172.169.79) has joined #ceph
[16:18] <loicd> I don't think ceph already has code for that. Not sure though.
[16:19] * dosaboy (~gizmo@faun.canonical.com) has joined #ceph
[16:22] * dosaboy (~gizmo@faun.canonical.com) Quit ()
[16:22] * BillK (~BillK-OFT@124-169-72-15.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[16:25] * sjm (~oftc-webi@c73-103.rim.net) has joined #ceph
[16:29] <Kdecherf> hey world
[16:29] <Kdecherf> what is the unit of latency lines in rados bench?
[16:33] * capri (~capri@212.218.127.222) Quit (Quit: Verlassend)
[16:35] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[16:36] * L2SHO_ is now known as L2SHO
[16:38] <niklas> what are object namespaces within pools?
[16:39] * saabylaptop (~saabylapt@2a02:2350:18:1010:bc4f:516d:8111:16d7) has left #ceph
[16:39] <niklas> They seem to be introduced with the new version, but I have no idea what they are
[16:40] <niklas> btw: Channel Topic is not up to date ( andreask leseb rturk-away )
[16:44] * deadsimple (~infinityt@134.95.27.132) Quit ()
[16:46] * ccourtaut looking at #5878
[16:46] <alfredodeza> issue 5878
[16:46] <kraken> alfredodeza might be talking about: http://tracker.ceph.com/issues/5878
[16:46] <ccourtaut> loicd: i don't know if ceph already has code to dynamically load/unload plugins
[16:50] <ccourtaut> loicd: btw plugin interfaces are mostly written in C
[16:51] <ccourtaut> loicd: and they use our best friends dlopen/dlsym/dlclose
[16:52] <ccourtaut> plugins usually have an init function, and you might want a registry to allow plugins to register into various hooks of the pipeline
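A bare-bones sketch of that pattern; the "plugin_init" symbol name and the registry type are illustrative, not Ceph's actual plugin API:

    #include <dlfcn.h>
    #include <cstdio>

    typedef int (*plugin_init_fn)(void *registry);

    int load_plugin(const char *path, void *registry) {
      void *handle = dlopen(path, RTLD_NOW);                  // load the shared object
      if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return -1; }
      void *sym = dlsym(handle, "plugin_init");               // look up the init entry point
      if (!sym) { std::fprintf(stderr, "%s\n", dlerror()); dlclose(handle); return -1; }
      return ((plugin_init_fn)sym)(registry);                 // plugin registers its hooks
    }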
[16:53] * bergerx_ (~bekir@78.188.101.175) Quit (Remote host closed the connection)
[16:54] * bergerx_ (~bekir@78.188.204.182) has joined #ceph
[16:57] <loicd> ccourtaut: you are correct. Since ceph relies on automake and it has good support for libtool, I'll just use this.
[16:59] * frank9999 (~frank@kantoor.transip.nl) Quit ()
[17:07] <ChoppingBrocoli> Can you upgrade to dumpling with ceph-deploy?
[17:07] <alfredodeza> you should be able to do that
[17:12] <ChoppingBrocoli> not ready yet, broken link W: Failed to fetch http://ceph.com/debian-dumplilng/dists/raring/main/binary-amd64/Packages 404 Not Found
[17:13] <alfredodeza> ah, correct, issue 5960 should fix that
[17:13] <kraken> alfredodeza might be talking about: http://tracker.ceph.com/issues/5960
[17:13] * KindTwo (KindOne@198.14.201.126) has joined #ceph
[17:18] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:18] * KindTwo is now known as KindOne
[17:25] * dosaboy_ (~dosaboy@faun.canonical.com) Quit (Read error: Connection reset by peer)
[17:26] * dosaboy (~dosaboy@faun.canonical.com) has joined #ceph
[17:27] * sprachgenerator (~sprachgen@130.202.135.222) has joined #ceph
[17:28] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[17:36] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[17:37] * glowell1 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:39] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:40] * berant (~blemmenes@gw01.ussignalcom.com) Quit (Quit: berant)
[17:41] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[17:43] * jcfischer (~fischer@peta-dhcp-3.switch.ch) Quit (Ping timeout: 480 seconds)
[17:45] * sagelap (~sage@76.89.177.113) has joined #ceph
[17:46] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:49] * devoid (~devoid@130.202.135.235) has joined #ceph
[17:54] <indego> ChoppingBrocoli, Gugge-47527 - Got it running. KVM VM on desktop to 2 node cluster. Thanks for the pointers.
[17:54] * sagelap (~sage@76.89.177.113) Quit (Ping timeout: 480 seconds)
[17:55] * tnt (~tnt@91.176.3.64) has joined #ceph
[17:56] * indego (~indego@91.232.88.10) Quit (Quit: long weekend)
[17:56] * bergerx_ (~bekir@78.188.204.182) Quit (Quit: Leaving.)
[17:56] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Quit: jlogan1)
[17:57] * davidzlap1 (~Adium@ip68-5-239-214.oc.oc.cox.net) Quit (Quit: Leaving.)
[17:58] * sagelap (~sage@2600:1012:b026:dd31:cd0d:7d6:865:6b7e) has joined #ceph
[17:59] * davidzlap (~Adium@ip68-5-239-214.oc.oc.cox.net) has joined #ceph
[18:04] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[18:05] * davidzlap1 (~Adium@ip68-5-239-214.oc.oc.cox.net) has joined #ceph
[18:05] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[18:06] * davidzlap (~Adium@ip68-5-239-214.oc.oc.cox.net) Quit (Read error: Connection reset by peer)
[18:09] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[18:09] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[18:10] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[18:24] * sagelap1 (~sage@2607:f298:a:607:cd0d:7d6:865:6b7e) has joined #ceph
[18:25] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[18:26] * berant (~blemmenes@gw01.ussignalcom.com) has joined #ceph
[18:27] * sagelap (~sage@2600:1012:b026:dd31:cd0d:7d6:865:6b7e) Quit (Ping timeout: 480 seconds)
[18:32] * odyssey4me (~odyssey4m@165.233.71.2) Quit (Ping timeout: 480 seconds)
[18:33] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[18:35] * dumbdumbgj (~bonghisss@dsl-209-55-75-223.centex.net) has joined #ceph
[18:36] <dumbdumbgj> im glad its logged
[18:36] <dumbdumbgj> cause i gw.scriptkitties.com/tclplugin/
[18:36] <dumbdumbgj> Tcl interface loaded
[18:36] <dumbdumbgj> * Looking up irc.quakenet.org
[18:36] <dumbdumbgj> * Connecting to irc.quakenet.org (208.64.121.85) port 6667...
[18:38] * joao (~JL@2607:f298:a:607:9eeb:e8ff:fe0f:c9a6) has joined #ceph
[18:38] * ChanServ sets mode +o joao
[18:38] * dumbdumbgj (~bonghisss@dsl-209-55-75-223.centex.net) Quit (Max SendQ exceeded)
[18:38] <cjh_> congrats on the dumpling release!
[18:38] <alfredodeza> loicd: FYI, I just updated the teuthology readme to include instructions for non-debian OSs
[18:38] <alfredodeza> loicd: https://github.com/ceph/teuthology#build
[18:41] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[18:46] * gregaf (~Adium@2607:f298:a:607:5c72:b976:4e18:287d) Quit (Quit: Leaving.)
[18:46] * frank9999 (~Frank@kantoor.transip.nl) has joined #ceph
[18:48] * nwf_ (~nwf@67.62.51.95) Quit (Ping timeout: 480 seconds)
[18:52] * zhyan__ (~zhyan@101.83.207.14) has joined #ceph
[18:53] * KindTwo (KindOne@50.96.226.208) has joined #ceph
[18:53] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Quit: Nettalk6 - www.ntalk.de)
[18:54] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[18:54] * Psi-Jack_ (~Psi-Jack@psi-jack.user.oftc.net) has joined #ceph
[18:54] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[18:54] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[18:54] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[18:54] * KindTwo is now known as KindOne
[18:55] <xdeller> hey, an observation about osd rss memory consumption - when I stop an osd, wait a bit and then start it, the memory consumption of exactly this process is about 20% higher than the rest when recovery starts, and goes back to normal in a day or two
[18:56] <loicd> alfredodeza: what does "repo collab" mean next to the nick name in https://github.com/ceph/teuthology/pull/31/files#r5768112 ?
[18:56] <xdeller> also the recovery process on neighbours has almost the same effect, except the rise in rss is a bit lower, about 5%, and it goes back down to the 'normal' state over a similarly long period
[18:56] <alfredodeza> loicd: it means I have permissions to push to that repo directly
[18:58] <loicd> :-)
[18:59] * zhyan_ (~zhyan@101.82.225.145) Quit (Ping timeout: 480 seconds)
[19:03] * julian (~julianwa@125.70.132.20) Quit (Quit: afk)
[19:04] * gregaf (~Adium@2607:f298:a:607:fd96:c553:1184:eb0a) has joined #ceph
[19:04] <paravoid> gregaf: it's in the release notes already.
[19:06] <gregaf> yeah, saw that immediately afterwards
[19:11] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[19:16] * Tamil (~tamil@38.122.20.226) has joined #ceph
[19:17] * berant (~blemmenes@gw01.ussignalcom.com) Quit (Quit: berant)
[19:19] * berant (~blemmenes@vpn-main.ussignalcom.com) has joined #ceph
[19:19] * berant (~blemmenes@vpn-main.ussignalcom.com) Quit ()
[19:20] * berant (~blemmenes@vpn-main.ussignalcom.net) has joined #ceph
[19:29] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[19:31] * sagelap1 (~sage@2607:f298:a:607:cd0d:7d6:865:6b7e) Quit (Ping timeout: 480 seconds)
[19:33] <sjusthm> next is now decoupled from dumpling?
[19:34] <loicd> sjust: https://github.com/ceph/ceph/pull/414 being valgrind clean after running the rados teuthology tasks, is it worth it for me to try to extract the coverage information and check that the code is covered ? Or would you advise me to run other teuthology tasks that you know to be more thorough ? I'm not sure how long it will take me to get code coverage :-)
[19:35] <sjusthm> loicd: I wouldn't worry the coverage
[19:35] <sjusthm> the next step would be to run the same test without valgrind
[19:35] <Pauline> Hmmz. I rebooted one of my hosts with a mon on it, NTP active. However, as ntp takes a little time to ping the hosts it wants to talk to a few times, the mon was up way earlier than NTP, and marked my cluster unhealthy... Will it recover on its own? ntp has adjusted the time by about 47ms by now.
[19:35] <sjusthm> then schedule a suite run
[19:35] <sjusthm> the rados suite specifically
[19:35] <loicd> ok, will do
[19:36] <loicd> sjust: https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados ?
[19:36] <sjusthm> yeah
[19:36] <loicd> cool :-)
[19:37] <sjusthm> ./schedule_suite.sh rados wip-4982-4983-oloc-rebase testing sam.just@inktank.com basic master mira
[19:37] <sjusthm> something like that with your branch instead of wip-4982.*oloc
[19:37] <sjusthm> oops
[19:38] <sjusthm> something like that with s/wip-4982-4983-oloc-rebase/<your branch>
[19:41] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[19:44] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit ()
[19:48] * schelluri (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[19:50] <sagewk> mikedawson: ping
[19:52] <paravoid> sagewk: upgrade to 0.67 was seamless, first upgrade with no real downtime
[19:52] <sagewk> \o/
[19:53] <paravoid> (from 0.67-rc3)
[19:53] <cjh_> paravoid: awesome :D
[19:53] * janos hopes to do bobtail to cuttlefish this weekend
[19:53] <sagewk> that's a short hop i guess
[19:53] <paravoid> only "issue" was the message I left you a few hours ago
[19:53] <paravoid> rcs should have been versioned 0.67~rc3-1
[19:54] <paravoid> 0.67 > 0.67~rc3 while 0.67 < 0.67-rc3
[19:54] <paravoid> not a real issue obviously :)
[19:55] <cjh_> i was going to wait for the point release before going cuttlefish -> bobtail but your upgrade is encouraging
[19:55] <sagewk> yeah... we'll do that next time.
[19:55] <paravoid> oh the other thing is
[19:55] <sagewk> doesn't work for rpms, but we have to mangle those versions already anyway
[19:56] <paravoid> "ceph pg dump pgs_brief" seems to be the same as "ceph pg dump" to me
[19:57] <paravoid> same as "ceph pg dump sum" for that matter
[19:57] <paravoid> I'll file a bug
[19:58] <sagewk> only works with --format=json (not plain)
[19:58] * sagelap (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) has joined #ceph
[19:58] <sagewk> the plain is always a full dump
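In other words, the brief variants presumably only take effect with JSON output, e.g.:

    ceph pg dump pgs_brief --format=json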
[19:59] <paravoid> ah!
[19:59] <sagewk> go ahead and open a bug though, would be nice to fix that
[19:59] <paravoid> or at least error instead of happily printing the same output :)
[19:59] <sagewk> yeah
[20:01] <sagewk> the idea came up this week to add a ceph-submit-file or ceph-post-file or something similar that would do the equivalent of sending a file to cephdrop, but in a much easier way. something like 'ceph-post-file -d paravoid_peering_problem /var/log/ceph/ceph-osd.*'
[20:01] <paravoid> nice
[20:01] <paravoid> I actually had an additional idea that might also be relevant here
[20:01] <sagewk> but it would be (actually) one-way (other users can read it), and it would output some tag or something that references that particular post
[20:02] <sagewk> *can't
[20:02] <paravoid> a command something like 'ceph log "starting ceph peering test 1"'
[20:02] <paravoid> oh wow, that exists
[20:02] <sagewk> heh, that command already exists :)
[20:02] <paravoid> hahaha
[20:02] <paravoid> I just thought of just running that
[20:02] * Schelluri (~Sriram@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[20:03] <paravoid> that's so funny
[20:04] <paravoid> ceph-post-file sounds great
[20:04] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[20:06] <joelio> sagewk: cephdrop?
[20:07] <joelio> sounds interesting!
[20:07] <sagewk> (ceph drop is a lightly-protected sftp account where users can upload logs for devs to look at)
[20:08] * dosaboy (~dosaboy@faun.canonical.com) Quit (Quit: leaving)
[20:10] <joelio> ahh, thought it may have been some object storing tool akin to dropbox
[20:12] <joelio> https://github.com/gmarik/gist.sh is handy
[20:22] * fridudad (~oftc-webi@p4FC2C8FE.dip0.t-ipconnect.de) has joined #ceph
[20:22] * devoid (~devoid@130.202.135.235) has left #ceph
[20:23] * dosaboy (~dosaboy@host109-154-149-172.range109-154.btcentralplus.com) has joined #ceph
[20:23] * dosaboy__ (~dosaboy@host109-157-181-219.range109-157.btcentralplus.com) Quit (Read error: Operation timed out)
[20:23] * Schelluri (~Sriram@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Read error: Operation timed out)
[20:39] <Kioob> what is the process to fix "deep-scrub stat mismatch" ?
[20:39] <Kioob> (ceph pg repair doesn't fix that)
[20:43] <sjusthm> mm, it's supposed to
[20:44] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[20:44] * markbby (~Adium@168.94.245.2) has joined #ceph
[20:45] <Kioob> Then I will retry
[20:49] * markbby (~Adium@168.94.245.2) Quit ()
[20:49] * markbby (~Adium@168.94.245.2) has joined #ceph
[20:50] <ChoppingBrocoli> Is anyone using flashcache for their virt machines or is direct rbd better? (Production servers)
[20:54] <Gugge-47527> flashcache in writeback is a bad idea :)
[20:54] <Gugge-47527> in writethrough im sure it would be fine
[20:58] <loicd> sjusthm: if I'm not mistaken you suggested that src/cls/* is a source of inspiration regarding dynamically loaded plugins. It looks like it's designed to expose functions from ceph libraries so that they can be loaded by third party software / used for language bindings. Am I missing something ?
[20:59] <sjusthm> they are osd plugins
[20:59] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[20:59] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[21:01] <sjusthm> you can invoke them with IoCtx::exec()
[21:02] <sjusthm> looks like src/cls is the collection of plugins and src/objclass is more of the implementation
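A hedged librados (C++) sketch of invoking an object class method that way; the object, class and method names below are placeholders for whatever the plugin registers:

    #include <rados/librados.hpp>

    int call_cls_method(librados::IoCtx& ioctx) {
      librados::bufferlist in, out;
      // calls <class>.<method> on the OSD that hosts "some-object"
      return ioctx.exec("some-object", "my_class", "my_method", in, out);
    }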
[21:03] * loicd looking
[21:05] <mikedawson> sagewk: pong
[21:07] <fridudad> sjusthm: did you had the chance to look at the ceph.log i sent you this morning regarding cuttlefish recovery problem?
[21:08] <sjusthm> no, what was the subject line?
[21:09] <fridudad> sjusthm Re: still recovery issues with cuttlefish
[21:10] <fridudad> ceph.log.gz was attached
[21:11] <sjusthm> ah, there it is
[21:12] <fridudad> i know these are not all the logs from the osds, but it gives you an idea about the problem
[21:13] * nwl (~levine@atticus.yoyo.org) Quit (Quit: leaving)
[21:15] * sjm (~oftc-webi@c73-103.rim.net) Quit (Quit: Page closed)
[21:16] * nwl (~levine@atticus.yoyo.org) has joined #ceph
[21:21] <sjusthm> fridudad: you upgraded all of the osds?
[21:23] <fridudad> sjusthm: yes - first thing i did - upgrade to latest origin/cuttlefish - then restarted all mons then ALL osds.
[21:24] <fridudad> then osd.X stop; sleep 30; osd.X start (haven't looked in the log again to see which OSD it was)
[21:25] * mschiff (~mschiff@pD9510E87.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[21:30] * mtl1 (~Adium@c-67-176-54-246.hsd1.co.comcast.net) has joined #ceph
[21:30] * mtl (~Adium@c-67-176-54-246.hsd1.co.comcast.net) Quit (Read error: Connection reset by peer)
[21:33] <sjusthm> fridudad: ok, we'll need osd logs again
[21:34] <fridudad> sjusthm: i expected that.
[21:34] * jmlowe (~Adium@149.160.194.126) has joined #ceph
[21:34] <fridudad> so complicated to get them all without crashing the production thing ;-) do you think updating to dumpling will help?
[21:35] <sjusthm> fridudad: it might
[21:38] <fridudad> mhm i think i don't risk this right now...
[21:40] * scuttlemonkey (~scuttlemo@107.16.78.55) has joined #ceph
[21:40] * ChanServ sets mode +o scuttlemonkey
[21:41] <fridudad> sjusthm> ok thanks i'll have a look when i'm able to gather complete logs again
[21:43] * ishkabob (~c7a82cc0@webuser.thegrebs.com) has joined #ceph
[21:43] <ishkabob> hey Ceph devs, I was wondering if ceph-deploy will be available as an RPM in the dumpling repo?
[21:43] <ishkabob> it is currently not there
[21:44] <alfredodeza> ishkabob: where exactly are you looking
[21:44] <ishkabob> alfredodeza: it used to be here - http://ceph.com/rpm/el6/noarch/
[21:44] <alfredodeza> nope, it is here: http://ceph.com/packages/ceph-extras/rpm/
[21:44] <ishkabob> h cool, thanks
[21:45] <ishkabob> i notice there is no scientific linux repo
[21:45] <ishkabob> i could just use the cent one i guess
[21:45] <ishkabob> will there be an SL repo?
[21:46] <alfredodeza> I don't think so
[21:46] <alfredodeza> you probably want the CentOS one
[21:46] <ishkabob> ok, thanks
[21:52] * zhyan_ (~zhyan@101.82.180.20) has joined #ceph
[21:54] <ishkabob> alfredodeza: so this version of Ceph says that it requires pushy-0.5.2, but 0.5.1 is what's in the repo
[21:54] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[21:54] * Schelluri (~Sriram@50-197-184-177-static.hfc.comcastbusiness.net) has joined #ceph
[21:54] <alfredodeza> sorry ishkabob, thought that was fixed
[21:54] <ishkabob> i'm sorry, this version of ceph-deploy
[21:54] <ishkabob> no problem, were you going to relax the restriction or update pushy in the repo?
[21:55] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[21:56] <alfredodeza> no, we are working to get a build that will get the new version of pushy out
[21:56] <alfredodeza> it has some upstream changes that fix some important bugs, hence the hard requirement
[21:56] <ishkabob> cool, if i download the ceph-deploy source from git, is this where you are building this package?
[21:57] <ishkabob> i can fix it and submit a pull request
[21:57] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has left #ceph
[21:57] <alfredodeza> oh, wait, if you are familiar with Python install tools, you can install from the Python package Index
[21:57] <alfredodeza> `pip install ceph-deploy` should work correctly
[21:59] * zhyan__ (~zhyan@101.83.207.14) Quit (Ping timeout: 480 seconds)
[22:01] <ishkabob> i think pip only works with python2.7 right?
[22:02] <alfredodeza> oh no
[22:02] <alfredodeza> 2.6 too
[22:02] <alfredodeza> if you prefer some other installation tool, that should work as well
[22:04] <ishkabob> nah, that worked, we'll most likely depend on the RPMs in production, but this ought to get me going
[22:04] <ishkabob> cheers :)
[22:06] * Schelluri (~Sriram@50-197-184-177-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[22:08] <alfredodeza> excellent !
[22:10] * allsystemsarego (~allsystem@5-12-241-157.residential.rdsnet.ro) Quit (Quit: Leaving)
[22:10] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[22:11] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[22:17] <sjusthm> sagewk, gregaf: http://pastebin.com/v4h2JNFi
[22:17] <sjusthm> for perf information?
[22:18] <gregaf> *confused* is this one of the papercut things?
[22:20] <mikedawson> Is there any way to ask rbd to trim its client.volumes.log without restarting the qemu process? Running out of space with ridiculous logging levels waiting for an error condition.
[22:20] * skatteola (~david@c-0784e455.16-0154-74657210.cust.bredbandsbolaget.se) Quit (Quit: Paus!)
[22:21] * sprachgenerator (~sprachgen@130.202.135.222) Quit (Quit: sprachgenerator)
[22:21] * berant (~blemmenes@vpn-main.ussignalcom.net) Quit (Quit: berant)
[22:22] * sprachgenerator (~sprachgen@130.202.135.222) has joined #ceph
[22:23] <sjusthm> gregaf: yeah, it's a quick way to get an overview of which osds are heavily loaded
[22:23] <gregaf> is that command sending off a quick op to every osd?
[22:23] <sjusthm> no
[22:24] <gregaf> so where's it getting the stats?
[22:24] <gregaf> in particular I'm confused about the 0-ms commits
[22:24] <sjusthm> a summary of the perf counters is attached to the osd_stat_t which the osds send
[22:25] <sjusthm> currently, the summary is those two things
[22:25] <sjusthm> it does lag, but if you have a massively slow disk, this should make it pretty obvious at a glance
[22:25] <sjusthm> which is its main purpose
[22:25] <gregaf> okay, that's what I thought
[22:25] <sjusthm> 0ms latency here means that nothing happened since the last stat was sent
[22:25] <sjusthm> the queue is empty
[22:25] <gregaf> ah
[22:26] <sjusthm> it's not perfect, but it does fit with the lower-is-better intuition
[22:26] <gregaf> well, in that case we need to document very carefully…but it'd be nice if we didn't need documentation
[22:26] <sjusthm> 0ms could also mean <ms
[22:26] <gregaf> because somebody is going to use it wrong
[22:26] <sjusthm> how are you afraid they will use it?
[22:26] <joshd> mikedawson: if you have an admin socket for the process there's a log reopen command command you can use after (re)moving the current log file
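A sketch of that approach, assuming the qemu process was started with an admin socket configured for client.volumes (the .asok path is a placeholder and depends on your configuration):

    mv /var/log/ceph/client.volumes.log /var/log/ceph/client.volumes.log.old
    ceph --admin-daemon /var/run/ceph/ceph-client.volumes.asok log reopen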
[22:27] <gregaf> I dunno, but they're going to look at it and say "oh, I have 8 disks with fast commits and one disk that's taking 12ms to commit; I'd better replace that disk" or something
[22:27] <sjusthm> uh
[22:27] <gregaf> or like the people who look at the pg stat summary of IO
[22:27] <gregaf> and say "my idle cluster is doing 40MB/s of IO!"
[22:27] <sjusthm> gregaf: I don't understand
[22:28] <gregaf> sjusthm: these statistics are using 0 to mean "no activity" and somebody is going to misinterpret that and do something bad
[22:28] <gregaf> we've seen it every time we output some developer-focused stats to solve a specific problem
[22:28] <wrencsok1> i have a 0.61.7 mon node that doesn't want to rejoin. i put some load on the cluster and eventually consumed all free swap space, ran out of order 0 memory and oom killer came along and killed the ceph-mon that won't restart properly or rejoin. i see these http://pastebin.com/RfAwAP4K in the logs. any pointers on resolving that?
[22:29] <mikedawson> joshd: thx. 'truncate -s 0 way-too-big-log' seems to work too
[22:29] <sjusthm> gregaf: no, 0 means io is not backed up
[22:29] <gregaf> for instance by getting approximate cluster throughput stats to users via the pgmap (which lags a fair bit and doesn't update if there's no traffic)
[22:29] <sjusthm> I could have it default to the average since the osd started
[22:29] <sjusthm> ok, that would be better
[22:29] <gregaf> you just said it meant there hadn't been any commits
[22:29] <joshd> mikedawson: good to know, wasn't sure about that myself
[22:30] <sjusthm> gregaf: I'll have it use the average over all time in that case
[22:30] <gregaf> sjusthm: yeah, it might be better to use averages than most-recent, or perhaps include both and that way it's obvious that the 0 millisecond entry probably doesn't mean instantaneous commits
[22:30] <sjusthm> gregaf: it is using averages, it's just the average since the last report
[22:30] <sjusthm> if no IO happened
[22:30] <sjusthm> then there were no commits over which to average
[22:30] <sjusthm> so it reports 0
[22:31] <gregaf> yeah, and I get that but I'm telling you with that interface a lot of users are going to be confused
[22:31] <gregaf> *shrug*
[22:31] <sjusthm> instead, I'll have it just report the average over all time
[22:31] <sjusthm> that way, it's more of an estimate for the commit and apply latency
[22:31] <sjusthm> which is reasonable
[22:31] <gregaf> if the real purpose is not to expose latency data but to expose slow OSDs perhaps we should output that data
[22:31] <sjusthm> what data?
[22:31] <gregaf> "5 OSDs are consistently slower than the others by <threshold>; there they are"
[22:32] <sjusthm> too hard to get right
[22:32] <gregaf> s/there/here/
[22:37] * rudolfsteiner (~federicon@200.68.116.185) Quit (Quit: rudolfsteiner)
[22:38] <ishkabob> is there any way that I can get some extra logging for ceph-deploy? I'm using ceph-deploy -v, but it seems to be hanging on ceph-create-keys and I don't know why
[22:38] <ishkabob> this is for bootstrapping monitors
[22:38] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[22:38] * markbby (~Adium@168.94.245.2) has joined #ceph
[22:38] <alfredodeza> what version of ceph-deploy are you using? the latest one (1.2) has verbosity turned on at its highest
[22:39] * fridudad (~oftc-webi@p4FC2C8FE.dip0.t-ipconnect.de) Quit (Remote host closed the connection)
[22:39] <ishkabob> how can i tell which version i'm running?
[22:39] <ishkabob> i installed it with easy_install (pip)
[22:40] <ishkabob> ceph version is 0.67
[22:42] <alfredodeza> pip didn't tell you?
[22:42] <alfredodeza> I just added a --version flag yesterday :/
[22:42] <alfredodeza> so if you see a bunch of DEBUG/INFO log output, you are already maxed out on logging level
[22:42] <ishkabob> ;ppls ;ole ots ceph-deploy 1.2.1
[22:42] <ishkabob> hah
[22:42] <ishkabob> looks like its ceph-deploy 1.2.1
[22:42] <alfredodeza> ok
[22:43] <ishkabob> i just ran the create keys command manually
[22:43] <ishkabob> so i get this
[22:43] <alfredodeza> you should check your monitor logs and see what is going on
[22:43] <ishkabob> # /usr/bin/python /usr/sbin/ceph-create-keys -i camelot
[22:43] <ishkabob> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
[22:45] <sagewk> http://fpaste.org/32186/37651310/
[22:45] <sagewk> ceph-post-file ^^
[22:46] <ishkabob> alfredodeza: i wonder why it's trying to connect to a ceph-mon, i believe the monitor it should be connecting to is "ceph-camelot" not "ceph-mon"
[22:47] <ishkabob> alfredodeza: i do see this in the ceph-deploy output - ceph-mon: mon.noname-a 10.198.1.3:6789/0 is local, renaming to mon.camelot
[22:47] <ishkabob> maybe its a problem with that?
[22:48] <alfredodeza> that seems like a good clue
[22:50] * zhyan_ (~zhyan@101.82.180.20) Quit (Read error: Operation timed out)
[22:52] * jmlowe (~Adium@149.160.194.126) Quit (Quit: Leaving.)
[22:54] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[22:56] <ishkabob> alfredodeza: when you run ceph-deploy mon create, does it try to bring up all the monitors before it creates the client keyring?
[22:56] <ishkabob> i think maybe my commands aren't being pushed to the other hosts
[22:56] * Schelluri (~Sriram@50-197-184-177-static.hfc.comcastbusiness.net) has joined #ceph
[22:56] <alfredodeza> I can look
[22:56] <alfredodeza> I am not sure
[22:57] <ishkabob> yeah, that's totally what's happening, because if i bootstrap with just the local monitor, everything works fine
[22:58] <ishkabob> is there a good way to test those pushy connections?
[23:00] <Kioob> sjusthm: you were right, "pg repair" fixes "stat mismatch errors", I suppose I made a mistake. This time it's ok: [ERR] : 3.b repair 1 errors, 1 fixed
[23:00] <sjustlaptop> Kioob: cool
[23:01] <Kioob> So, now all is fine. :)
[23:02] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[23:02] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[23:04] * markbby (~Adium@168.94.245.2) has joined #ceph
[23:05] <alfredodeza> ishkabob: pushy is the most problematic library I've ever worked with :(
[23:10] <sagewk> mikedawson: ping' :)
[23:10] <mikedawson> here
[23:10] <sagewk> did you have a chance to try that modified librbd?
[23:13] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[23:13] <mikedawson> sagewk: 18 hours in, and frustratingly we haven't seen the hangup. We did switch from virtio to e1000 on these 11 Windows guests at the same time. Wonder if that has any bearing here
[23:14] <sagewk> hrm
[23:15] <mikedawson> sagewk: I did generate about 1TB of logs showing it working properly, in case you need some light reading
[23:15] <sagewk> fwiw it looks like either a very strange timing thing, or a stray memory write or something. the log did not make much sense
[23:15] <sagewk> :)
[23:17] * __jt___ (~james@rhyolite.bx.mathcs.emory.edu) Quit (Ping timeout: 480 seconds)
[23:18] <mikedawson> sagewk: I'll keep trying. I am seeing lots of small dips that have me concerned, but they pick back up right away. Before, when it hung, it was hung up until we intervened with a 'virsh screenshot' or any NoVNC console input. Never understood why.
[23:24] * sprachgenerator (~sprachgen@130.202.135.222) Quit (Quit: sprachgenerator)
[23:25] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[23:29] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[23:29] * scuttlemonkey (~scuttlemo@107.16.78.55) Quit (Ping timeout: 480 seconds)
[23:31] <mikedawson> sagewk: are there any strings in the log that I can grep for that would be red flags?
[23:32] <sjusthm> gregaf: review on wip-5910 (the papercut thing)
[23:32] <sjusthm> ?
[23:32] <sagewk> not really...
[23:32] <sagewk> i'd wait for the hang. :(
[23:32] * Schelluri (~Sriram@50-197-184-177-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:32] <sjusthm> gregaf: I adjusted it to return the since-boot average if there hasn't been a change
[23:33] <sagewk> i found it by looking for the approx timestamp and doing a bunch of sort|uniq on completion pointers to find the mismatch
[23:41] * BillK (~BillK-OFT@124-169-72-15.dyn.iinet.net.au) has joined #ceph
[23:47] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[23:49] * yanzheng (~zhyan@101.82.180.20) has joined #ceph
[23:50] <sagewk> anybody know if there is a special place i should install the public ssh key for drop.ceph.com?
[23:50] <sagewk> effectively the global known_hosts?
[23:50] <sagewk> /usr/share/ceph/...?
[23:51] <dmick> /etc/ssh/ssh_known_hosts?
[23:51] <dmick> although I imagine you'd want to give the sysadmin the "press to fire" button
[23:51] <joao> sagewk, ping
[23:51] <sagewk> this is just 1 key though.. there's no /etc/ssh/known_hosts.d
[23:52] <sagewk> joao: hey
[23:52] <joao> sagewk, wip-4635
[23:52] * jmlowe (~Adium@2601:d:a800:97:3d71:6a87:eafe:ac4e) has joined #ceph
[23:52] <joao> two topmost commits
[23:53] <joao> working on 'osd crush add' now
[23:54] <sagewk> k
[23:54] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Read error: Connection reset by peer)
[23:54] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.