#ceph IRC Log

Index

IRC Log for 2013-08-28

Timestamps are in GMT/BST.

[0:00] <jhujhiti> hmm i'll read it again. unless i missed something the first time, it only said "none" and "cephx"
[0:00] <bstillwell> gregaf1: I was following this:
[0:00] <bstillwell> http://ceph.com/docs/next/start/quick-ceph-deploy/#list-disks
[0:00] <lxo> sjust, I am going to do that. I just wanted to use that same data to speed up the replication to the new disk
[0:00] <sjust> lxo: not easily
[0:01] <lxo> other than adjusting the osd number in the superblock, what else comes to mind?
[0:01] <itatar> xarses: http://ceph.com/docs/master/start/quick-ceph-deploy/ has a section called 'Single Node Quick Start' that instructs to add 'osd crush chooseleaf type = 0' to ceph.conf. I have two questions
[0:01] <itatar> 1. while I have an admin node (cephadmin) and a server node (cephserver), this is a one node cluster that consists only of cephserver, right?
[0:01] <itatar> 2. assuming the above is correct, do I add this line to cephadmin's ceph.conf or to cephserver's ceph.conf?
[0:01] <lxo> duplicating info about one of the osds in the osdmap so that it applies to the other, maybe?
[0:02] <sjust> you'd have to be sure to copy over the appropriate data from leveldb as well...
[0:02] * rudolfsteiner (~federicon@200.41.133.175) Quit (Ping timeout: 480 seconds)
[0:02] <xarses> itatar: if you have two osd's you dont need to do that
[0:02] <sjust> I don't really know off hand
[0:02] <lxo> I'm taking an exact replica of the osd; rsync -aAX
[0:02] <sjust> well, you could try just changing the osd id
[0:02] <lxo> so meta/, omap/, and all xattrs will go along
[0:03] <sjust> you'd need to change the keys too?
[0:03] <sjust> let us know if it works
[0:03] * jcfischer (~fischer@port-212-202-245-234.static.qsc.de) Quit (Quit: jcfischer)
[0:03] <jhujhiti> gregaf1: no, i don't see anything about it being optional in the cephx guide
[0:03] <lxo> you mean the key the osd uses to access the cluster, which is external to the snap_* dirs, or are there copies of keys in there somewhere?
[0:04] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) has joined #ceph
[0:04] <itatar> xarses: so I can have two OSDs (and other processes like the monitor) on cephserver, right?
[0:04] <xarses> yes
[0:05] <lxo> I remember long ago, when I had fewer larger multi-disk filesystems, each holding one complete copy of the cluster, it was possible to compare them bit by bit, and the only difference was the osd number in the superblock
[0:05] <xarses> as long as you have 1 monitor and two osd's you dont need to do that crushmap
[0:05] <xarses> keep in mind that it's not considered a production deployment though
[0:05] <gregaf1> bstillwell: I have no idea then, sorry :( try Tamil or jluis, maybe?
[0:05] <xarses> but it will come up active+clean as long as there are two or more osd's
[0:06] <lxo> iirc at some point I could no longer do that, but I don't recall whether it had to do with the introduction of leveldb or something else, or just my changing the filesystem arrangements to one filesystem per disk
[0:06] <bstillwell> gregaf1: ok, thanks
[0:06] <bstillwell> Tamil: jluis: Any ideas?
[0:06] <gregaf1> jhujhiti: you would change the "required" options mentioned in http://ceph.com/docs/next/rados/operations/authentication/#enabling-cephx to "none", but then if somebody tried to connect with cephx it would allow them in
[0:07] <lxo> anyway, if there's hope that just tweaking the superblock still works, I'll give that a try. if it works, I'll let you know. if it fails, I'll probably dig up a bit and see what else needs changing. hopefully then I'll come back with a recipe for that, or a request for help ;-)
[0:07] <jhujhiti> gregaf1: ahh, it wasn't clear but i kind of had a feeling it would work like that
[0:08] <jhujhiti> gregaf1: as always, testing this in production. fingers crossed ;)
[0:09] <itatar> xarses: ok. the next step is to ceph-deploy install cephserver. the tutorial says that the command should echo 'OK'. I don't see that but I don't see any error messages.. the last couple of lines were as foollows. does it mean that the command failed or succeeded?:
[0:09] <itatar> [cephserver][INFO ] Processing triggers for man-db ...
[0:09] <itatar> [cephserver][INFO ] Processing triggers for ureadahead ...
[0:09] <itatar> [cephserver][INFO ] Setting up ceph-common (0.67.2-1raring) ...
[0:09] <itatar> [cephserver][INFO ] Setting up ceph (0.67.2-1raring) ...
[0:09] <itatar> [cephserver][INFO ] ceph-all start/running
[0:09] <itatar> [cephserver][INFO ] Setting up ceph-fs-common (0.67.2-1raring) ...
[0:09] <itatar> [cephserver][INFO ] Processing triggers for ureadahead ...
[0:09] <itatar> [cephserver][INFO ] Setting up ceph-mds (0.67.2-1raring) ...
[0:09] <itatar> [cephserver][INFO ] ceph-mds-all start/running
[0:09] <itatar> [cephserver][INFO ] Processing triggers for ureadahead ...
[0:09] <itatar> [cephserver][INFO ] Running command: ceph --version
[0:09] <itatar> [cephserver][INFO ] ceph version 0.67.2 (eb4380dd036a0b644c6283869911d615ed729ac8)
[0:09] <dmick> please don't spam irc. if you have multiline output use pastebin or the like
[0:10] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) Quit (Quit: Leaving.)
[0:10] <lxo> FWIW, the reason I moved to one filesystem per disk was that mon and metadata performance sucked when they had to sync multiple disks. I'm now avoiding metadata and mons on larger disks, keeping them on smaller disks that sync faster and get much less bigger data I/O
[0:10] <xarses> itatar: you can ceph ---verson on cephserver
[0:10] <xarses> but it looks like ceph-deploy install was sucessful
[0:11] <itatar> dmick: ok, sorry
[0:11] <jhujhiti> gregaf1: 2013-08-27 18:11:03.498157 7f89a1984780 -1 unable to authenticate as client.admin after changing them to 'none' on the mon
[0:12] * sprachgenerator (~sprachgen@130.202.135.222) Quit (Quit: sprachgenerator)
[0:12] <gregaf1> you may need to restart them; I'm really not sure
[0:12] * BillK (~BillK-OFT@220-253-162-118.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[0:12] <jhujhiti> i did restart them
[0:12] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) has joined #ceph
[0:12] <gregaf1> and did you make sure to adjust the client options so they'll accept cluster connections which don't require cephx?
[0:14] <jhujhiti> gregaf1: not sure what you mean, sorry. my colleague used a stupid deployment tool for this, so of coruse i have no idea how it works now. i set 'auth [client|service|cluster] required = none' in ceph.conf and restarted all the ceph services on that machine. i did the same on the client and it cannot connect
[0:15] <dmick> jhujhiti: might try ceph --debug-auth=30 --debug-ms=10 -s
[0:15] <dmick> see what is actually failing
[0:15] <nhm> what was the other performance issue with kernel 3.5 we saw besides the zone locking issues?
[0:16] <itatar> xarses: actually ceph --version failed on cephserver (the output is here: http://pastebin.com/download.php?i=RAyJ6tWR)
[0:16] <gregaf1> oh, try setting "auth client required = cephx, none"
[0:16] <itatar> oh sorry, I misspelled version in the command
[0:16] <jhujhiti> gregaf1: that looks like it worked
[0:17] <xarses> itatar: is there no /etc/ceph/ceph.conf?
[0:17] <Tamil1> bstillwell: you can use -v option in ceph-deploy to get more info , which version of ceph-deploy are you using?
[0:17] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Quit: Leaving.)
[0:18] <itatar> carses: I just did a 'sudo find / -name ceph.conf' and it didn't find anything
[0:18] <itatar> (on cephserver)
[0:18] <bstillwell> Tamil1: 1.2.2
[0:19] <bstillwell> Tamil1: Adding -v doesn't appear to change the output
[0:20] <jhujhiti> and it solved my downstream issue! thanks gregaf1
[0:20] <gregaf1> :)
[0:20] <jhujhiti> can't believe that took me three days to find
[0:20] <bstillwell> Tamil1: It seems to be an issue with the sudo/pushy stuff to me
[0:20] <itatar> carses: and ceph.conf should be there after 'ceph-deploy install cephserver', right?
[0:20] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) has joined #ceph
[0:21] <Tamil1> bstillwell: you dont see much info on ceph.log? or int he command line output?
[0:21] <xarses> i dont think it should be there after install
[0:21] <itatar> ah ok :)
[0:21] <xarses> only after mon create does it have to be there
[0:21] <xarses> so that might be OK
[0:21] <jhujhiti> awesome product you guys have got here (and i don't say that often). can't wait for the freebsd port so i can use it in more important environments
[0:21] <bstillwell> Tamil1: the command line output
[0:21] <xarses> new might create it also
[0:22] <xarses> i just didn't pay that much attention
[0:22] <bstillwell> Tamil1: I may be on to something. This fails as well:
[0:22] <bstillwell> ssh den2ceph001 "sudo ceph-disk list"
[0:22] <bstillwell> path issue is my guess
[0:22] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[0:25] <gregaf1> glad you enjoy it jhujhiti; good luck going forward :)
[0:26] * doxavore (~doug@99-89-22-187.lightspeed.rcsntx.sbcglobal.net) Quit (Quit: :qa!)
[0:26] * yanzheng (~zhyan@101.82.58.61) Quit (Ping timeout: 480 seconds)
[0:27] <sjust> sagewk: 1 comment, looks good otherwise
[0:27] <sagewk> sjust: stripped out the dead code and repushed
[0:27] <sjust> cool
[0:27] <sjust> looked good other than that, passed tests?
[0:27] * BillK (~BillK-OFT@220-253-162-118.dyn.iinet.net.au) has joined #ceph
[0:27] <sjust> though I don't think you touched any tested code
[0:27] <sagewk> going to schedule now.
[0:27] <sjust> k
[0:27] <sagewk> oh, i forgot to add the txn clear in execute_ctx()
[0:28] <sagewk> will od that and run the tests
[0:28] <sjust> ah, right
[0:28] <sjust> k
[0:28] <Tamil1> bstillwell: maybe, am not sure why you are seeing that, how many nodes in the cluster?
[0:29] <bstillwell> Tamil1: I'm installing the cluster right now. It'll be 8-nodes running CentOS 6.4
[0:29] <bstillwell> Tamil1: so running the following doesn't have /usr/sbin in the path:
[0:29] <bstillwell> ssh den2ceph001 'echo $PATH'
[0:30] <bstillwell> /usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin
[0:30] <bstillwell> but when I login as the ceph user and display the path it does have /usr/sbin
[0:30] <Tamil1> bstillwell: oh ok, and were you trying disk list command without installing ceph on the nodes?
[0:31] <bstillwell> Tamil1: It's installed on the nodes.
[0:31] <bstillwell> Tamil1: That's why I can run ceph-disk list manually
[0:31] <Tamil1> bstillwell: sorry, i thought you just said you were installing the cluster?
[0:32] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[0:32] <bstillwell> Tamil1: I'm part way through the install. Past the 'ceph-deploy install' step
[0:33] <Tamil1> bstillwell: oh ok
[0:34] <itatar> xarses: this time it looks better. this time I used /tm/osd{0,1} for osd backing store. 1. I see ceph-osd processes on cephserver 2. ceph -s still fails unless I use the -k option. 3. but the output of ceph -s shows that 50% of storage is degraded (here is its output: http://pastebin.com/raw.php?i=JFpdxatp). any idea why that is?
[0:37] * clayb (~kvirc@proxy-ny1.bloomberg.com) has joined #ceph
[0:38] <xarses> itatar: 3) give it some time to balance
[0:38] <xarses> if you run it again it it should be more than 50%
[0:39] <xarses> 2) is that from cephserver or cephadmin?; 1) your ceph -s shows both osd's online
[0:39] <gregaf1> joshd: sjust: https://github.com/ceph/ceph/pull/549 if you're interested in reviewing the user_version stuff before it goes in
[0:40] <sjust> looking
[0:40] <gregaf1> the doc is a very basic .rst for you now, Josh, and actually makes sense instead of being a straight glob out of my Evernote ;)
[0:41] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[0:42] <itatar> xarses: 3) it keeps reporting 50% and I didn't write any data into the cluster yet.. 2) from cephadmin
[0:43] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Read error: Connection reset by peer)
[0:44] <gregaf1> sagewk: when we discussed directing objecter requests, I guess in ticket terms it was #6032?
[0:44] <gregaf1> (issue #6032)
[0:44] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[0:44] <sagewk> yeah
[0:45] <gregaf1> (oh right, we lost kraken — gotta get that moved to something more permanent)
[0:45] <sagewk> can update to be in terms of the pg_pool_t::overlay field or whatever
[0:45] <bstillwell> Tamil1: looks like ssh keeps its own path, because this works:
[0:45] <bstillwell> ssh ceph@den2ceph001 'PATH=$PATH:/usr/sbin:/sbin; ceph-disk list'
[0:46] <gregaf1> sagewk: is your branch (forget the number) in a shape I can branch off from?
[0:47] <gregaf1> (and presumably review)
[0:47] <sagewk> sort of.. let me rebase it so it's a bit more stable.
[0:47] <gregaf1> k
[0:47] <sagewk> may still shift around a bit, but shouldn't be too bad.
[0:48] <bstillwell> Tamil1: Seems like more people running CentOS 6.x would be seeing this issue too
[0:49] <sagewk> gregaf1: wip-tier
[0:49] <itatar> xarses: which brings me to the next question. what's the easiest way to write and read some data to/from the cluster (should I try cephfs or something else)
[0:51] <joshd> gregaf1: thanks, it makes more sense now
[0:53] * LeaChim (~LeaChim@176.24.168.228) Quit (Ping timeout: 480 seconds)
[0:53] <Tamil1> bstillwell: i never hit this issue, please file a bug with your logs and we can have someone take a look at it
[0:54] <bstillwell> Tamil1: ok, will do
[0:54] <hug> I'm tring to remove cephfs/mds from my ceph cluster: http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/
[0:54] <hug> but my ceph version doesn't support 'ceph mds newfs'
[0:54] <gregaf1> joshd: yep. should I add your reviewed-by on merge, or were you just checking the doc? :)
[0:55] <joshd> mainly just the doc
[0:55] <Tamil1> bstillwell: thanks
[0:56] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) Quit (Quit: Leaving.)
[0:56] <gregaf1> k
[0:57] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) has joined #ceph
[0:58] <xarses> itatar: i use glance and cinder, but thats because i use openstack
[0:58] <xarses> suposedly you can can create a mount point using mds ... map ... something
[0:59] <xarses> cephfs might be easiest, but I've never used it
[0:59] <xarses> keep in mind that metadata (mds) can be quite heavy on resources
[1:00] <itatar> ok, any idea how to find out why I am at 50%?
[1:00] <xarses> are you still at 50%?
[1:00] <itatar> yes
[1:01] <joshd> sagewk: https://github.com/ceph/teuthology/pull/55
[1:01] <hug> xarses: is there any way to use ceph for instance storage? in my setup, openstack downloads the image from cinder to the local disk of the compute node and keeps the COW image on the local disk too.
[1:01] * piti (~piti@82.246.190.142) Quit (Ping timeout: 480 seconds)
[1:01] <clayb> I'm trying to understand how best to replace a disk when a disk fails (completely ex-parrot -- never coming back) and one wishes to replace the disk on the machine it seems there's two schools of thought but I don't understand why.
[1:02] <clayb> hug You'd like "Boot from Volume" support
[1:02] <sagewk> joshd: looks good
[1:02] <xarses> hug: my understanding is, yes
[1:02] <sagewk> well, no idea about he orchestra one ;)
[1:02] <xarses> i've not done it
[1:02] <xarses> (yet)
[1:02] <sagewk> joshd: yeah, that one looks good too.
[1:02] <xarses> itatar: try a ceph osd tree
[1:03] <joshd> sagewk: hit the first because the workers would fail to start, pyflakes found the rest
[1:03] <clayb> For replacing a disk there seem two options: 1) remove the old OSD ID from the crush map, auth rules, etc. and add a new OSD (what I think is advocated in http://tracker.ceph.com/issues/4032). 2) Or, one can add the new disk in and try to load the old OSD ID (what was explained at http://ceph.com/w/index.php?title=Replacing_a_failed_disk/OSD)
[1:03] <hug> clayb: well, boot from volume works if you convert the image to a volume, create a snapshot, and then create a volume based on this volume snapshot to boot. which is a bit of a hassle
[1:03] * piti (~piti@82.246.190.142) has joined #ceph
[1:04] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[1:04] <itatar> hm, interesting: http://pastebin.com/raw.php?i=t5YaFWMF
[1:04] <clayb> Why would one not want to preserve the OSD ID if the disk is dead as it seems one would have to edit the crush map to add in the neew OSD ID where the old OSD ID was?
[1:04] * tnt (~tnt@91.177.230.140) Quit (Read error: Operation timed out)
[1:05] <clayb> hug Yes, it is a bit of a pain that it's not automagic in that regard
[1:05] <itatar> xarses: not sure where ids -1 and -2 came from and what that means
[1:05] * ScOut3R (~scout3r@54026B73.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[1:05] <hug> so there's no automagic way yet?
[1:07] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) Quit (Quit: Leaving.)
[1:07] <clayb> hug I haven't seen one but I only inherited a cluster with it enabled so far (so I haven't investigated deeply yet)
[1:07] <clayb> hug I don't think you have to go through all those steps on my cluster
[1:08] <hug> clayb: what do you mean by 'it enabled' ? is it patched to do it automatically?
[1:08] * Karcaw_ (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) has joined #ceph
[1:08] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) Quit (Read error: Connection reset by peer)
[1:08] <hug> it's probably a simple patch to do that..
[1:08] <xarses> itatar: odd, you are about at the limit of my understanding. it should balance out at this point maybe some one else can help. The negative numbers appear to be some magic that correspond to the hosts themselves and the root not the osd's themselves.
[1:08] <clayb> hug I make a RAW image (from a QCOW2) using qemu-img and then can use it as an "image" but it gets loaded onto a volume which is my instance's root volume.
[1:09] <clayb> hug It is patched and I'm not sure how it was patched...
[1:09] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) has joined #ceph
[1:10] <hug> clayb: ah, ok. so you just specify an image and it's automatically converted into a volume on startup.. I'll probably patch my cluster to do the same...
[1:10] <xarses> itatar: also if you copy cephserver:/etc/ceph/ceph.conf to the cephadmin it might let you drop the -k syntax
[1:11] <clayb> Exactly, so I do have what you were talking about roughly. I just wish it was smart enough to automagically convert a qcow2 to a raw image for me.
[1:11] <xarses> itatar: also put ceph.client.admin.keyring in /etc/ceph/
[1:11] <itatar> on cephserver?
[1:12] * jwilliams (~jwilliams@72.5.59.176) Quit (Ping timeout: 480 seconds)
[1:12] <mtanski> the two fscache changes are now on the mailing list
[1:12] <hug> clayb: well,converting qcow2 would require an additional step instead of just cloning the image.
[1:12] <xarses> from cephadmin ~/ceph-try2 or cephserver:/etc/ceph/
[1:12] <hug> and it would be less efficient
[1:14] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:14] <clayb> hug Yes, I would be fine with glance doing the conversion (so it'd only be done once) but right now I have a number of "unusable" qcows someone's uploaded which simply need to be converted (admittedly I don't know what process they're following to get the images in so perhaps that's in place and they're defeating it somehow).
[1:15] <xarses> joshd: I'd like to submit a patch for ceph-deploy. Should i create a pull request on github or create a tracker?
[1:17] <joshd> xarses: either is fine - pull requests are a little easier
[1:19] <hug> clayb: right, doing this in glance would be useful. but it's difficult to define the correct logic for conversion.
[1:20] <hug> clayb: I guess you can't easily find the patch, right? so I'll try it myself..
[1:20] <itatar> xarses: I moved cephadmin:~/ceph-try2 to cephserver:/etc/ceph/ and cephadmin's ceph.client.admin.keyring into cephserver:/etc/ceph/ but the -k is still required. I can live with that for now. Thank you for trying. I will now move on to getting writing/reading to work
[1:20] * indeed (~indeed@206.124.126.33) Quit (Remote host closed the connection)
[1:20] <clayb> hug ENOTIME -- yet...
[1:25] <xarses> joshd: https://github.com/ceph/ceph-deploy/pull/54
[1:26] <xarses> courtesy of rmoe sitting behind me
[1:29] <joshd> xarses: thanks! it seems fine to me, but I'm not too familiar with ceph-deploy code. alfredodeza, dmick, or sagewk should be able to merge it
[1:30] <sagewk> xarses: can you add a Signed-off-by: line to the commit message?
[1:31] <sagewk> otherwise, looks good to me!
[1:33] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[1:34] * indeed (~indeed@206.124.126.33) has joined #ceph
[1:34] <xarses> sagewk: how so?
[1:35] <xarses> you want me to comment on the pull request?
[1:35] <MACscr> does ceph care if storage nodes have different numbers of osds?
[1:35] <MACscr> so could i do 12 on two and 6 on two?
[1:35] <joshd> xarses: git commit --amend -s
[1:35] <bstillwell> Is there some kind of approval process for new accounts on tracker.ceph.com?
[1:36] <sagewk> and then 'git push -f origin branchname' will update teh pull req
[1:36] <bstillwell> I just registered a new account to file a bug, but now it won't let me login.
[1:36] <sagewk> bstillwell: what user?
[1:36] <bstillwell> bstillwell
[1:36] <sagewk> fixed
[1:36] <bstillwell> thanks!
[1:36] <sagewk> sometimes it locks the account; not sure why.
[1:37] <bstillwell> it's working now. :)
[1:38] * loicd ENIGHT
[1:38] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:40] <dmick> bstillwell: sagewk: it always tries to send a confirmation email
[1:41] <dmick> it seems particularly prone to being caught in spam traps
[1:41] <dmick> acct is locked until you hit the URL in the email
[1:42] <bstillwell> dmick: I haven't seen an email yet, but sometimes messages marked as spam take a while to show up for some reason.
[1:42] <xarses> sagewk: done
[1:43] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[1:43] <rturk> bstillwell: I've also noticed a 5m delay on the emails from tracker.ceph.com
[1:43] <rturk> although it's now been > 5m :)
[1:44] <bstillwell> It would be nice if the tracker mentioned that you must verify your email address before you can login.
[1:44] <rturk> agree, I'll see if there's a template we can change
[1:44] <bstillwell> :)
[1:45] <xarses> bstillwell i got the impression that i needed to verify from the messages i saw
[1:46] <rturk> ah, yes - it does say "to activate, click on the link"
[1:46] <bstillwell> xarses: Heh, I probably skipped over it then
[1:46] <dmick> I know it says something; don't remember what
[1:47] <xarses> it was well, somewhat vauge
[1:47] <xarses> but ya
[1:47] <rturk> I'll experiment and see if there's a way we can make it more obvious
[1:48] <MACscr> any takers on my number of osd's variance per node?
[1:48] <rturk> the "click to verify" is displayed on the same line as "account creation successful" - easy to miss
[1:49] <gregaf1> should be fine, MACscr
[1:49] <MACscr> gregaf1: my only way to create a third storage node to help with keeping things online
[1:49] <MACscr> well, really 4 total, but you get the point
[1:50] <gregaf1> I believe it will all be automatic, but if you're paranoid you can look at "ceph osd tree" once it's all on and make sure the weights are set correctly
[1:50] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) Quit (Ping timeout: 480 seconds)
[1:51] <itatar> I am trying to mount cephfs (with 'sudo mount -t ceph cephserver:/ /mnt/mycephfs/') and get 'mount: No such process' error. any idea what is going on?:
[1:52] * mtanski (~mtanski@69.193.178.202) Quit (Ping timeout: 480 seconds)
[1:52] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) has joined #ceph
[1:54] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has left #ceph
[1:58] <gregaf1> at a guess the MDS isn't running, but I'm not actually seeing where that error code would come from in our code base
[1:59] <itatar> /usr/bin/ceph-mds process is running
[1:59] <itatar> on the server node (cephserver)
[2:01] <itatar> I executed the mount command on the admin node (cephadmin). is that correct?
[2:01] <dmick> itatar: you should be able to mount it from whatever server you like
[2:01] <itatar> ok, so cephadmin should be good enough
[2:03] <itatar> in parallel I am also trying to enable the S3 gateway by following http://ceph.com/docs/master/start/quick-rgw/. In the 'Create a Gateway Configuration File' section there are directions to save the given sample file in /etc/apache2/sites-available directory. It is not clear what the file name should be.. rgw.conf?
[2:04] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) Quit (Quit: Leaving.)
[2:06] <xarses> does ceph-deploy not work with lvm volume parts?
[2:06] <MACscr> gregaf1: if i dont need the space, think its worth the extra investment adding the two sets of 6 drives to the two sets of 12 just for the sake of stability and recovery time?
[2:06] <MACscr> would cost me about $1k to do =/
[2:06] <gregaf1> I have no idea?
[2:07] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) has joined #ceph
[2:08] <gregaf1> depends on how you balance things, and I'm not sure what config you have versus are proposing
[2:09] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) Quit ()
[2:11] <MACscr> well i literally have nothing setup right now. This is my current design i plan to implement, though i will be testing if im going to use ssd for journals or not and based on advice i have been given, if i do, i will use two of them in each storage node
[2:11] <MACscr> http://www.screencast.com/t/AJUfS2og
[2:18] * Cube (~Cube@12.248.40.138) Quit (Read error: Operation timed out)
[2:19] <gregaf1> so what, those chassis hold 18 disks but only 12 come pre-filled?
[2:20] <MACscr> no, they only hold 12. I would put 6 drives each in two compute nodes
[2:20] <gregaf1> ah, so as to have more than 2 nodes involved in the cluster
[2:21] <MACscr> correct
[2:21] <gregaf1> at that scale I don't think I'd bother, but you'd have to look at the costs and benefits yourself ;)
[2:21] * bandrus (~Adium@12.248.40.138) Quit (Quit: Leaving.)
[2:21] * alram (~alram@38.122.20.226) Quit (Quit: leaving)
[2:23] <MACscr> yeah, i would have to buy 12 more drives, so $600, then another $80 in caddies, another $240 in 10gbE nics, plus another $100 for another sas mezz card =/
[2:24] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) Quit (Quit: Leaving.)
[2:27] * sglwlb (~sglwlb@221.12.27.202) has joined #ceph
[2:29] <MACscr> if i wasnt hurting on power and space, id just buy another storage node =/
[2:29] <MACscr> well, i have the space, but not enough amps
[2:30] <MACscr> im already cutting it pretty close
[2:30] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[2:30] * tserong_ (~tserong@124-168-231-241.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[2:32] * tserong_ (~tserong@124-168-231-241.dyn.iinet.net.au) has joined #ceph
[2:44] * xarses (~andreww@204.11.231.50.static.etheric.net) Quit (Ping timeout: 480 seconds)
[2:45] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) has joined #ceph
[2:47] * indeed (~indeed@206.124.126.33) Quit (Remote host closed the connection)
[2:51] * mschiff_ (~mschiff@port-50007.pppoe.wtnet.de) has joined #ceph
[2:52] * jmlowe1 (~Adium@c-98-223-198-138.hsd1.in.comcast.net) has joined #ceph
[2:52] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[2:53] * clayb (~kvirc@proxy-ny1.bloomberg.com) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[2:55] * jmlowe (~Adium@2601:d:a800:97:e008:28ad:b281:62bc) Quit (Read error: Connection reset by peer)
[2:55] * markbby (~Adium@168.94.245.2) Quit (Remote host closed the connection)
[2:59] * mschiff (~mschiff@46.59.142.56) Quit (Ping timeout: 480 seconds)
[3:04] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[3:05] * rturk is now known as rturk-away
[3:09] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) Quit (Quit: Leaving.)
[3:11] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) has joined #ceph
[3:22] * hugokuo (~hugokuo@118.233.227.15) has joined #ceph
[3:22] <hugokuo> Hi
[3:22] <hugokuo> How to setup the replicas number of a pool ?
[3:25] <lurbs> hugokuo: http://ceph.com/docs/master/rados/operations/pools/#set-the-number-of-object-replicas
[3:29] <hugokuo> iurbs thanks
[3:31] * jaydee (~jeandanie@124x35x46x4.ap124.ftth.ucom.ne.jp) has joined #ceph
[3:36] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[3:40] * hugo_kuo (~hugokuo@50-197-147-249-static.hfc.comcastbusiness.net) has joined #ceph
[3:41] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[3:42] * jaydee (~jeandanie@124x35x46x4.ap124.ftth.ucom.ne.jp) Quit (Read error: Operation timed out)
[3:43] * jaydee (~jeandanie@124x35x46x15.ap124.ftth.ucom.ne.jp) has joined #ceph
[3:44] * hugokuo (~hugokuo@118.233.227.15) Quit (Ping timeout: 480 seconds)
[3:45] * yy-nm (~Thunderbi@122.233.46.4) has joined #ceph
[3:47] * silversurfer (~jeandanie@124x35x46x4.ap124.ftth.ucom.ne.jp) has joined #ceph
[3:51] * jaydee (~jeandanie@124x35x46x15.ap124.ftth.ucom.ne.jp) Quit (Ping timeout: 480 seconds)
[3:56] * sherry (~sherry@wireless-nat-10.auckland.ac.nz) has joined #ceph
[3:59] <sglwlb> hi, what is pgp meam in a pool?
[4:00] <lurbs> sglwlb: http://ceph.com/docs/master/rados/operations/pools/#create-a-pool
[4:01] <dmick> lurbs: are you a bot? :)
[4:01] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Read error: Operation timed out)
[4:01] <lurbs> All I know is that I keep passing the Turing test.
[4:01] <sherry> hi, Im new to Ceph and I want to start developing in ceph with c++, is there any procedure to show how to work with that in Ubuntu?
[4:02] <sglwlb> lurbs:pgp_num: placement groups for placement for a pool. I can't totally understand it
[4:02] <dmick> lurbs - Turing (tm) certified
[4:07] <lurbs> sglwlb: I don't quite understand the difference myself, you may need a developer. If you don't specify a value for pgp_num when creating a pool it will be set to the same value as pg_num.
[4:08] <yy-nm> hay, all. i put ceph's log to syslog. and how can i use to control the ceph log level to syslog?
[4:08] <yy-nm> i mean severity level
[4:09] <lurbs> sglwlb: Also, another part of the docs claim the pgp_num is "The effective number of placement groups to use when calculating data placement."
[4:09] <sglwlb> lurbs: Thanks, has a link?
[4:09] <lurbs> http://ceph.com/docs/master/rados/operations/pools/#set-pool-values
[4:11] * torment2 (~torment@pool-96-228-147-151.tampfl.fios.verizon.net) has joined #ceph
[4:11] <dmick> sglwlb: the ceph.com/docs site is searchable
[4:13] <sglwlb> dmick:
[4:13] <sglwlb> Thank you ,it's good
[4:15] <sherry> i, Im new to Ceph and I want to start developing in ceph with c++, is there any procedure to show how to work with that in Ubuntu?
[4:15] * sherry is now known as s
[4:16] * s is now known as sherry
[4:17] <dmick> sherry: as has been a theme here in the last 30 min, there is information on ceph.com: http://ceph.com/resources/development/
[4:19] <dmick> yy-nm: syslog always confuses me
[4:19] <dmick> facility and priority are encoded in the same bitmask, and it's never clear to me which is which in the calls
[4:20] <dmick> you can see what the ceph source code is doing; clone it (see link above) and git grep syslog
[4:22] <sherry> dmick I clone the codes already by following the steps in github which was mentioned in https://github.com/ceph/ceph, but what would be the next steps!
[4:26] <dmick> well, what are you trying to do next?
[4:27] <sherry> do I need to install any IDE in order to open and make changes in the source code?
[4:28] <yy-nm> dmick: hm...ok, now the ceph log is pouring into syslog...
[4:29] <dmick> sherry: it's just a tree of source code; you can do whatever you like.
[4:29] <dmick> yy-nm: yes. You can set what gets logged or not with the normal "debug_*" configuration
[4:29] <yanzheng> sherry, you can use eclipse to browser the code
[4:29] <dmick> or vi. or emacs. or kdevel. or ed. or gedit. or slickedit. or...
[4:30] <sherry> cn I run the codes?
[4:31] <dmick> can I suggest that this is not specific to Ceph, and that you probably want to learn how to develop in C++ on Ubuntu on something simpler?
[4:32] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:32] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[4:32] * diegows (~diegows@190.190.11.42) Quit (Ping timeout: 480 seconds)
[4:33] * imjustmatthew_ (~imjustmat@pool-72-84-255-225.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[4:34] <yanzheng> sherry, ceph is too complex for c++ newbie
[4:35] <sherry> yanzehng, I know c++ bt Im not sure how does ceph use it! what is ur suggestion?
[4:36] <yanzheng> reading the code
[4:37] <yanzheng> code in src/crush is relative standalone
[4:39] <sherry> so I have to start from src/crush, thanks
[4:39] <dmick> ceph doesn't "use" c++. Ceph is written *in* C++ (and C, and Python, and bits of assembly)
[4:39] <sherry> smick sorry my bad
[4:39] <sherry> dmick*
[4:50] * glowell1 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[4:50] * glowell1 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit ()
[4:50] * glowell1 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[4:51] * glowell1 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit ()
[4:54] * jlhawn (~jlhawn@208-90-212-77.PUBLIC.monkeybrains.net) Quit (Quit: jlhawn)
[4:57] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) has joined #ceph
[4:57] * KindTwo (~KindOne@50.96.225.173) has joined #ceph
[4:58] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[4:58] * KindTwo is now known as KindOne
[5:07] * fireD (~fireD@93-142-212-187.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:22] * Dark-Ace-Z (~BillyMays@50.107.55.36) has joined #ceph
[5:25] * DarkAce-Z (~BillyMays@50.107.55.36) Quit (Ping timeout: 480 seconds)
[5:27] * hugo__kuo (~hugokuo@118.233.227.15) has joined #ceph
[5:29] * hugo_kuo (~hugokuo@50-197-147-249-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[5:32] * Dark-Ace-Z is now known as DarkAceZ
[5:39] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[5:42] * rudolfsteiner (~federicon@190.244.11.181) has joined #ceph
[5:42] * yehudasa_ (~yehudasa@2602:306:330b:1410:ea03:9aff:fe98:e8ff) Quit (Ping timeout: 480 seconds)
[5:43] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) Quit (Quit: Leaving.)
[5:46] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[5:53] * rudolfsteiner (~federicon@190.244.11.181) Quit (Quit: rudolfsteiner)
[5:56] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) has left #ceph
[6:04] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:04] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) has joined #ceph
[6:06] * Tamil1 (~Adium@cpe-108-184-66-69.socal.res.rr.com) has left #ceph
[6:25] * yy-nm (~Thunderbi@122.233.46.4) Quit (Remote host closed the connection)
[6:26] * yy-nm (~Thunderbi@122.233.46.4) has joined #ceph
[6:28] * yy-nm (~Thunderbi@122.233.46.4) Quit ()
[6:31] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[6:32] * yehudasa_ (~yehudasa@me12736d0.tmodns.net) has joined #ceph
[6:33] * BillK (~BillK-OFT@220-253-162-118.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[6:35] * BillK (~BillK-OFT@124-169-255-116.dyn.iinet.net.au) has joined #ceph
[6:35] * sagelap (~sage@2600:1012:b006:254a:f945:a530:9596:aa59) has joined #ceph
[6:35] <sagelap> yanzheng: any suggestions on fixing the btrfs imbalance? i haven't had time to look closely
[6:36] <sagelap> i think the problem has actually been there for a while (i.e. was in 3.10 too). the kern.log check in teuthology broke a while back so we didn't notice
[6:38] <yanzheng> #define TRANS_ATTACH (__TRANS_ATTACH | __TRANS_FREEZABLE)
[6:47] * sagelap (~sage@2600:1012:b006:254a:f945:a530:9596:aa59) Quit (Read error: Connection reset by peer)
[6:51] * yy-nm (~Thunderbi@122.233.46.4) has joined #ceph
[6:52] * yehudasa_ (~yehudasa@me12736d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[6:53] * doubleg (~doubleg@69.167.130.11) Quit (Ping timeout: 480 seconds)
[6:56] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) has joined #ceph
[7:00] * yehudasa_ (~yehudasa@2602:306:330b:1410:ea03:9aff:fe98:e8ff) has joined #ceph
[7:18] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[7:18] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Quit: sprachgenerator)
[7:18] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[7:24] * ssejour (~sebastien@lif35-1-78-232-187-11.fbx.proxad.net) has joined #ceph
[7:25] * fireD (~fireD@93-139-165-73.adsl.net.t-com.hr) has joined #ceph
[7:26] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[7:30] * MooingLemur (~troy@phx-pnap.pinchaser.com) Quit (Remote host closed the connection)
[7:30] * MooingLemur (~troy@phx-pnap.pinchaser.com) has joined #ceph
[7:44] * gucki (~smuxi@84-73-190-65.dclient.hispeed.ch) has joined #ceph
[7:46] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[7:56] * ssejour (~sebastien@lif35-1-78-232-187-11.fbx.proxad.net) Quit (Quit: Leaving.)
[7:58] * aardvark (~Warren@2607:f298:a:607:ccbb:b6b0:2d7c:a034) has joined #ceph
[7:59] * odyssey4me (~odyssey4m@165.233.71.2) has joined #ceph
[8:05] * sleinen1 (~Adium@2001:620:0:25:294b:d7ab:1be3:9a7f) Quit (Quit: Leaving.)
[8:05] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[8:08] * sleinen1 (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[8:08] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Read error: Connection reset by peer)
[8:08] * foosinn (~stefan@office.unitedcolo.de) has joined #ceph
[8:09] * sleinen (~Adium@2001:620:0:26:18c:dba9:d45e:7501) has joined #ceph
[8:11] * tnt (~tnt@91.177.230.140) has joined #ceph
[8:16] * sleinen1 (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[8:19] <phantomcircuit> gentoo seems to be installing ceph to the wrong place
[8:22] * [caveman] (~quassel@boxacle.net) has joined #ceph
[8:22] * [cave] (~quassel@boxacle.net) Quit (Read error: Connection reset by peer)
[8:28] <yy-nm> hay, i have a question about clog in http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
[8:30] * sleinen (~Adium@2001:620:0:26:18c:dba9:d45e:7501) Quit (Quit: Leaving.)
[8:37] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[8:43] <MACscr> yy-nm: well get on with the actual question, duh
[8:46] * adam3 (~adam@46-65-111-12.zone16.bethere.co.uk) has joined #ceph
[8:52] * adam2 (~adam@46-65-111-12.zone16.bethere.co.uk) Quit (Ping timeout: 480 seconds)
[8:53] * ismell (~ismell@host-24-56-171-198.beyondbb.com) Quit (Read error: Operation timed out)
[8:55] <yy-nm> MACscr: i just wanderring the usage of clog
[8:55] * ismell (~ismell@host-24-56-171-198.beyondbb.com) has joined #ceph
[9:01] * ssejour (~sebastien@out-chantepie.fr.clara.net) has joined #ceph
[9:02] * thorus (~jonas@212.114.160.100) has left #ceph
[9:08] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:17] * vipr (~vipr@frederik.pw) has joined #ceph
[9:17] <dlan_> phantomcircuit: what's the problem?
[9:17] * dlan_ is now known as dlan
[9:19] * tnt (~tnt@91.177.230.140) Quit (Ping timeout: 480 seconds)
[9:27] * Bada (~Bada@195.65.225.142) has joined #ceph
[9:27] * AfC (~andrew@2407:7800:200:1011:f946:9508:d31e:c9fa) Quit (Quit: Leaving.)
[9:28] <phantomcircuit> dlan, /usr/usr/bin/ceph-disk
[9:28] <phantomcircuit> :/
[9:29] * sleinen (~Adium@2001:620:0:46:6169:839:6fe4:9528) has joined #ceph
[9:30] * sleinen (~Adium@2001:620:0:46:6169:839:6fe4:9528) Quit ()
[9:35] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:36] <dlan> phantomcircuit: do you use version 0.67?
[9:37] <dlan> I think that's a bug, and should fixed,
[9:37] <dlan> btw, I reported that bug, and fixed int version 0.61
[9:43] * jbd_ (~jbd_@2001:41d0:52:a00::77) has joined #ceph
[9:45] <dlan> https://bugs.gentoo.org/show_bug.cgi?id=481250
[9:46] <dlan> alexxy: can you take care of bug 481250?
[9:50] * artwork_lv (~artwork_l@adsl.office.mediapeers.com) has joined #ceph
[10:00] * loicd reviewing https://github.com/ceph/ceph/pull/550
[10:04] <gucki> hey. anybody upgraded from cuttlefish to dumpling? everything fine so far? can it be recommended for a production system or are there still any blockers?
[10:04] <wogri_risc> gucki, there seems to be a slowdown on osd's if you have a LOT of i/o on your infrastructure.
[10:05] <wogri_risc> other than that it seems to be painless.
[10:05] <gucki> wogri_risc: is the cause already known? i saw a ticket which was caused by some debug leftover code, but it's been fixed in 67.2 afaik
[10:06] <wogri_risc> according to the ML it's not fixed yet.
[10:06] <gucki> wogri_risc: ah ok, i'm not subscribed. do you know the issue id?
[10:06] <wogri_risc> nope. sorry.
[10:07] <wogri_risc> search in the archives for: Significant slowdown of osds since v0.67 Dumpling
[10:08] <gucki> wogri_risc: this one? http://www.spinics.net/lists/ceph-users/msg03510.html
[10:08] <gucki> wogri_risc: seems to be fixed http://tracker.ceph.com/issues/6040
[10:08] <wogri_risc> yes.
[10:08] <gucki> wogri_risc: that was the bug i mentioned
[10:08] <wogri_risc> ok.
[10:09] <wogri_risc> thing I read yesterday:
[10:09] <wogri_risc> Oliver,
[10:09] <wogri_risc> This patch isn't in dumpling head yet, you may want to wait on a
[10:09] <wogri_risc> dumpling point release.
[10:09] <wogri_risc> -Sam
[10:09] <wogri_risc> (last msg in ths thread)
[10:11] * mozg (~andrei@host86-185-78-26.range86-185.btcentralplus.com) has joined #ceph
[10:11] <mozg> wido, hello mate
[10:11] <mozg> are you online?
[10:12] <gucki> wogri_risc: ok, you are right. sorry, i mixed it up with a quite similar bug...
[10:12] <wogri_risc> gucki: good to know.
[10:12] <mozg> i've got an issue with 0.67.2 and was wondering if you've come across it before?
[10:12] <mozg> i am using rbd + qemu for storing vms in ceph cluster
[10:13] <mozg> i've noticed that after the upgrade to dumpling my vms are kernel panicing when I am using disk benchmark tools
[10:13] <mozg> i've ran phoronix test suite with pts/disk set of benchmarks
[10:14] <mozg> and i am frequently seeing kernel panic on the guest vms
[10:14] * Bada (~Bada@195.65.225.142) Quit (Ping timeout: 480 seconds)
[10:16] <gucki> wogri_risc: thanks for the hint. so i'm waiting for another point release before upgrading, and i'll probably subscribe to the ml :)
[10:17] <mozg> that's happened 4 times out of 5 runs
[10:17] <mozg> so I would say pretty consistent crashes
[10:19] <wogri_risc> gucki: the ML is really worth the read.
[10:19] <wogri_risc> ceph-users btw.
[10:19] <wogri_risc> ceph-devel is less interesting.
[10:20] <dlan> mozg: I'm thinking about trying rbd + qemu, but haven't done yet
[10:20] <dlan> you'd better open a issue for tracking this, probably more info needed (like kernel dmesg log, config ..etc)
[10:20] <mozg> dlan: i've not had many issues with performance apart from the benchmark which tends to kill the os
[10:21] <mozg> yeah
[10:22] <mozg> where do i submit the bug report?
[10:23] <wogri_risc> mozg: probably here: http://tracker.ceph.com/projects/rbd
[10:24] <mozg> thanks
[10:24] <dlan> mozg: I could try to see if I can reproduce your problem, if you can provide enough info..
[10:25] <mozg> dlan:thanks
[10:25] <mozg> i've downloaded the phoronix test suite
[10:25] <mozg> let me get the link
[10:25] <mozg> http://www.phoronix-test-suite.com/?k=downloads
[10:26] <mozg> i am using ubuntu 12.04 guest vm with the latest updates
[10:26] <mozg> install the debian/ubuntu package
[10:27] <mozg> and run phoronix-test-suite benchmark pts/disk
[10:27] <mozg> it will initially do the first setup steps and download the necessary benchmark utils like iozone, fio, etc, but it will do it within it's own environment and not system wide
[10:28] <mozg> after that it will start various benchmarks
[10:28] <mozg> it should run about 19 or 20 benchmarks
[10:28] * indego (~indego@91.232.88.10) has joined #ceph
[10:29] <mozg> and it takes about 7-8 hours for me to complete the run
[10:29] <mozg> I used to see hang tasks occasionally when I was using 0.61.7
[10:29] <mozg> but after the upgrade the benchmarks do not complete,
[10:30] <mozg> i just see a kernel panic
[10:30] * rudolfsteiner (~federicon@190.244.11.181) has joined #ceph
[10:31] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[10:33] * gucki (~smuxi@84-73-190-65.dclient.hispeed.ch) Quit (Remote host closed the connection)
[10:33] <wogri_risc> in the VM?
[10:36] * Vjarjadian (~IceChat77@176.254.37.210) has joined #ceph
[10:38] * rudolfsteiner (~federicon@190.244.11.181) Quit (Ping timeout: 480 seconds)
[10:42] * hugo__kuo (~hugokuo@118.233.227.15) Quit (Quit: ??)
[10:43] * allsystemsarego (~allsystem@5-12-37-127.residential.rdsnet.ro) has joined #ceph
[10:49] * Bada (~Bada@195.65.225.142) has joined #ceph
[10:50] <wido> mozg: I'm here now
[10:59] * Bada (~Bada@195.65.225.142) Quit (Ping timeout: 480 seconds)
[11:00] * mozg (~andrei@host86-185-78-26.range86-185.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[11:07] * Bada (~Bada@195.65.225.142) has joined #ceph
[11:15] * sleinen (~Adium@2001:620:0:46:3833:204f:349e:6b8e) has joined #ceph
[11:25] * sleinen (~Adium@2001:620:0:46:3833:204f:349e:6b8e) Quit (Quit: Leaving.)
[11:29] * sleinen (~Adium@2001:620:0:46:3833:204f:349e:6b8e) has joined #ceph
[11:31] * silversurfer (~jeandanie@124x35x46x4.ap124.ftth.ucom.ne.jp) Quit (Read error: Operation timed out)
[11:52] * yy-nm (~Thunderbi@122.233.46.4) Quit (Quit: yy-nm)
[11:52] * LeaChim (~LeaChim@176.24.168.228) has joined #ceph
[12:03] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[12:04] * sleinen (~Adium@2001:620:0:46:3833:204f:349e:6b8e) Quit (Quit: Leaving.)
[12:08] * odyssey4me (~odyssey4m@165.233.71.2) Quit (Ping timeout: 480 seconds)
[12:11] * odyssey4me (~odyssey4m@165.233.205.190) has joined #ceph
[12:15] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:16] <cfreak200> can I upgrade from 0.67.1 to 0.67.2 without any special order? (1 node at a time)
[12:17] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[12:19] * odyssey4me (~odyssey4m@165.233.205.190) Quit (Ping timeout: 480 seconds)
[12:22] * odyssey4me (~odyssey4m@165.233.71.2) has joined #ceph
[12:26] <mozg> wido, sorry mate, I had to go to a meeting
[12:26] <mozg> are you still here?
[12:27] <mozg> anyone from Ceph technical team here?
[12:27] <mozg> as i've mentioned earlier, I am having some vm stability issues after upgrading to 0.67.2 from 0.61.7
[12:30] <wido> mozg: still here
[12:30] <mozg> dlan, hi. are you going to run the tests to see if you also get the crash?
[12:30] <mozg> wido, thanks
[12:30] <wido> what are you seeing?
[12:30] <mozg> wido, basically, i've upgraded from 0.61.7 to 0.67.2 over the weekend
[12:30] <dlan> mozg: it may take me a few time..
[12:30] <mozg> and started phoronix test suite benchmarks to check the performance and stability of the storage system
[12:31] <mozg> i was running the tests while I was on 0.61.7 and previous 0.61 releases
[12:31] <dlan> does the crash on vm side?
[12:31] <mozg> dlan: yeah, it doesn't kill the cluster
[12:31] <mozg> at least with my setup
[12:31] <mozg> i have a kernel panic on vms
[12:31] * rongze_ (~quassel@117.79.232.249) has joined #ceph
[12:32] <mozg> wido, so, i am noticing pretty consistent kernel panics on vms
[12:32] <mozg> which i've not seen with 0.61 branch
[12:32] <dlan> do u have the kernel log or screen shot (picture)
[12:32] <mozg> i mean i've got 4 kernel panics out of 5 runs
[12:32] <mozg> dlan, i've got a screenshot
[12:32] <mozg> should i send it to you?
[12:32] <wido> mozg: Running Qemu with librbd I think?
[12:33] <wido> What version of Qemu and what version of librbd?
[12:33] <mozg> wido, yes, as per yoru guide
[12:33] <mozg> let me check just now
[12:33] <mozg> one sec
[12:33] <dlan> mozg: I'd suggest u open a bug for tracking this, so people who have knowledge about this can help ..
[12:34] <mozg> wido, qemu-common 1.5.0+dfsg-3ubuntu2
[12:34] <mozg> and librbd1 0.67.2-1precise
[12:35] <mozg> wido, i've compiled qemu from backport sources
[12:35] <mozg> from 13.04 i believe
[12:35] * sleinen (~Adium@2001:620:0:46:9c49:e31:ee73:fc75) has joined #ceph
[12:35] <wido> mozg: What is the kernel panic inside the VM?
[12:35] <wido> pastebin or something?
[12:36] <mozg> let me check if it is in the logs
[12:36] <mozg> i've got a screenshot
[12:36] <mozg> but it is not complete
[12:37] <mozg> wido, i've also noticed a far larger number of slow requests with 0.67.2. Have you noticed anything like that?
[12:37] <wido> mozg: not really. But I've been mainly working on the RGW and doing internal CloudStack development
[12:38] * rongze (~quassel@106.120.176.78) Quit (Read error: Operation timed out)
[12:40] <mozg> wido, are you planning to integrate ceph with cloudstack using the gateway instead of rbd?
[12:40] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Remote host closed the connection)
[12:40] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[12:40] <wido> mozg: No
[12:41] * mschiff (~mschiff@port-50007.pppoe.wtnet.de) has joined #ceph
[12:41] * mschiff_ (~mschiff@port-50007.pppoe.wtnet.de) Quit (Remote host closed the connection)
[12:43] <wido> mozg: The RGW can be used as Secondary Storage, but I'm working on something different
[12:43] * morse (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[12:43] <mozg> wido, dlan, I am also seeing a lot of these entries in syslogs of the vms: http://ur1.ca/f9fzd
[12:44] <mozg> still searching for the kernel panic message
[12:44] <mozg> sounds interesting
[12:44] <wido> mozg: That is odd, that seems like a problem in Qemu
[12:44] <wido> I'd normally expect a hang, this is weird. I haven't seen that before
[12:45] <mozg> wido, I can't find the kernel panic in the logs
[12:45] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[12:45] <mozg> it seems like it is not recording it
[12:46] * sleinen1 (~Adium@2001:620:0:46:2c15:e8f8:5378:e523) has joined #ceph
[12:46] * sleinen (~Adium@2001:620:0:46:9c49:e31:ee73:fc75) Quit (Ping timeout: 480 seconds)
[12:46] <wido> mozg: problably, since it has disk issues
[12:47] <wido> mozg: Can you try with a different Qemu version?
[12:47] <wido> I'm not sure if this is a librbd issue
[12:48] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[12:49] <mozg> i've found the kernel log entries
[12:49] <mozg> let me fpaste it
[12:53] * BillK (~BillK-OFT@124-169-255-116.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[12:54] <mozg> wido, http://fpaste.org/35387/13776872/
[12:54] <mozg> here you go
[12:54] <mozg> the kernel log
[12:54] <mozg> do you know if this is something that has been discovered yet?
[12:54] <mozg> or is it a new issue?
[12:55] * BillK (~BillK-OFT@58-7-145-55.dyn.iinet.net.au) has joined #ceph
[12:55] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[12:58] * dmsimard (~Adium@108.163.152.2) has joined #ceph
[13:00] * sherry (~sherry@wireless-nat-10.auckland.ac.nz) Quit (Quit: Konversation terminated!)
[13:06] <dlan> mozg: this is wierd , INFO: task jbd2/vda1-8:238 blocked for more than 120 seconds.
[13:06] <dlan> also ata2: lost interrupt (Status 0x58)
[13:07] <dlan> what's the test command ? the iozone
[13:11] <mozg> dlan, this is the part of the phoronix test suite
[13:11] <mozg> it runs a bunch of different tests including iozone
[13:11] <mozg> not sure what command it uses
[13:11] <mozg> but it's from a default set
[13:12] <mozg> dlan, what i've noticed is that i used to get occasional hang tasks or even kernel panics on the early releases of 0.61 branch while running fio with direct 4k random writes using 4 files with 16 threads each
[13:13] <mozg> but these were not as frequent as 4 times of out 5
[13:13] * artwork_lv (~artwork_l@adsl.office.mediapeers.com) Quit (Quit: artwork_lv)
[13:13] <mozg> i probably had issues with every 4th or 5th run if I keep it running for an hour
[13:13] <mozg> i've not left it for longer
[13:14] * Han (~han@boetes.org) has joined #ceph
[13:15] <mozg> dlan, I see the "ata2: lost interrupt (Status 0x58)" on a regular basis from pretty much all linux vms
[13:18] <mozg> actually, i've got several vms where I see hang task errors without the lost interrupt issues
[13:18] <mozg> should I file 2 separate bugs with ceph?
[13:21] <wogri_risc> ata2? mozg, aren't you using virtio drivers?
[13:21] * sleinen1 (~Adium@2001:620:0:46:2c15:e8f8:5378:e523) Quit (Quit: Leaving.)
[13:21] <mozg> wogri_risc, yes I am
[13:22] <wogri_risc> interesing, then.
[13:23] * Han (~han@boetes.org) has left #ceph
[13:25] <mozg> wogri_risc, this is how the vm is started: http://ur1.ca/f9gm6
[13:25] <mozg> i can see virtio drivers in the cmd line
[13:27] <wogri_risc> yes. looks good. you don't happen to have a libvirt-config, do you?
[13:27] * smiley_ (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[13:29] * artwork_lv (~artwork_l@adsl.office.mediapeers.com) has joined #ceph
[13:34] <mozg> wogri_risc, let me check
[13:34] <mozg> i do have libvirt installed
[13:34] <mozg> as cloudstack uses it
[13:35] <wogri_risc> virsh edit 'instancename'
[13:35] * tryggvil (~tryggvil@178.19.53.254) has joined #ceph
[13:35] <mozg> wogri_risc, i don't have libvirt-config
[13:35] <mozg> i do have virsh
[13:35] <mozg> would you like me to try something?
[13:35] * tryggvil (~tryggvil@178.19.53.254) Quit ()
[13:36] <wogri_risc> type virsh edit 'name of your instance'
[13:36] * tryggvil (~tryggvil@178.19.53.254) has joined #ceph
[13:36] <wogri_risc> and paste the disk section - somewhere ;)
[13:36] <wogri_risc> just to be sure.
[13:37] <mozg> okay
[13:37] <mozg> one sec
[13:40] <mozg> wogri_risc, done
[13:40] <mozg> what should I try?
[13:43] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[13:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:45] * BManojlovic (~steki@91.195.39.5) Quit (Read error: Connection reset by peer)
[13:46] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[13:47] <wogri_risc> I'd like to see the <disk > section
[13:47] <mozg> one sec
[13:48] <mozg> wogri_risc, here you go: http://ur1.ca/f9gzt
[13:50] * steki (~steki@198.199.65.141) has joined #ceph
[13:54] * BManojlovic (~steki@91.195.39.5) Quit (Ping timeout: 480 seconds)
[13:55] * yanzheng (~zhyan@101.82.119.98) has joined #ceph
[13:56] * steki is now known as BManojlovic
[13:56] * BManojlovic (~steki@198.199.65.141) Quit (Quit: Ja odoh a vi sta 'ocete...)
[13:56] * BManojlovic (~steki@198.199.65.141) has joined #ceph
[13:57] <mozg> dlan, to shorten the time of running phoronix test you could try to run just the iozone test from the phoronix by running phoronix-test-suite benchmark pts/iozone
[13:57] <mozg> there is a set of 24 different iozone benchmarks that it runs with different record and file sizes
[13:58] <mozg> i am now trying to figure out if there is a particular test the kills the kernel
[13:58] <wogri_risc> mozg, I basically have the same virtio config as you have, it's not an virtio issue then.
[13:59] <mozg> wogri_risc, okay
[13:59] <mozg> are you also using cloudstack?
[13:59] <wogri_risc> no.
[13:59] <mozg> wogri_risc, have you tried running the same benchmarks to see if it causes the issue?
[14:00] <wogri_risc> I don't have dumping yet.
[14:00] <mozg> i c
[14:03] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:04] * zhyan_ (~zhyan@101.82.119.98) has joined #ceph
[14:04] * yanzheng (~zhyan@101.82.119.98) Quit (Read error: Connection reset by peer)
[14:05] * zhyan_ is now known as yanzheng
[14:08] * BillK (~BillK-OFT@58-7-145-55.dyn.iinet.net.au) Quit (Read error: Operation timed out)
[14:08] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[14:10] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:12] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:12] * BillK (~BillK-OFT@203-59-133-124.dyn.iinet.net.au) has joined #ceph
[14:12] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:13] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[14:13] <mozg> does anyone know if ceph -w reports accurate throughput statistics?
[14:13] <mozg> i am seeing differences between what the benchmark tool is showing me and what ceph -w reports
[14:13] <mozg> ceph -w reports speeds which vary from around 60mb/s to about 200mb/s
[14:14] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:14] <mozg> whereas the benchmark tool shows me a variation of around 110mb/s to 125mb/s
[14:18] <yanzheng> is cache enabled?
[14:18] <niklas> Hi there. Is this line in the osd log a good or a bad thing:
[14:18] <niklas> 2013-08-28 14:13:37.788733 7fc375f32700 0 -- 192.168.181.25:6893/19460 >> 192.168.181.25:6808/14035 pipe(0x1645e500 sd=23 :59403 s=2 pgs=186718 cs=29607 l=0 c=0x16540b00).fault, initiating reconnect
[14:20] <mozg> yanzheng, yeah, rbd cache is enabled
[14:20] <niklas> And why do I have it about 100,000 times in my log since I started the osd 10 minutes ago?
[14:21] <yanzheng> mozg, the cache can explain the difference
[14:21] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[14:21] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:21] <mozg> yanzheng, for writes, i suppose it can. but i do not think rbd caching covers read part, does it?
[14:22] <niklas> ("cat ceph-osd.99.log | wc -l" gives me 110691, most of that is the line with different ip addresses and ports, etc)
[14:23] <yanzheng> mozg, why not
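For reference, rbd caching is a client-side setting, so it lives in the [client] section of ceph.conf on the hypervisor; this is only an illustrative sketch and the sizes are example values, not recommendations:

    [client]
        rbd cache = true
        rbd cache size = 33554432          # 32 MB of cache per RBD client
        rbd cache max dirty = 25165824     # writeback starts above this many dirty bytes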
[14:29] * tziOm (~bjornar@194.19.106.242) has joined #ceph
[14:32] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[14:32] <loicd> would someone be interested in reviewing the erasure code abstract interface documentation https://github.com/ceph/ceph/pull/518/files#L2R20 ?
[14:40] <mozg> if anyone is interested, I've opened a new bug http://tracker.ceph.com/issues/6139 relating to the vm kernel panics that I am seeing
[14:40] <kraken> http://i.imgur.com/rhNOy3I.gif
[14:43] <loicd> kraken: http://dachary.org/wp-uploads/2013/07/kraken.jpg
[14:44] <ofu_> and the s release is going to be squid? This will cause some confusion...
[14:45] <yanzheng> mozg, doesn't look like rbd related
[14:47] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[14:48] * yy-nm (~Thunderbi@211.140.5.112) has joined #ceph
[14:51] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:52] * steki (~steki@91.195.39.5) has joined #ceph
[14:52] <loicd> ofu_: :-)
[14:56] * BManojlovic (~steki@198.199.65.141) Quit (Ping timeout: 480 seconds)
[14:59] * t0rn (~ssullivan@2607:fad0:32:a02:d227:88ff:fe02:9896) has joined #ceph
[15:01] * thorus (~jonas@212.114.160.100) has joined #ceph
[15:02] <thorus> I reweighted one osd because it was by far more full than the others, but then I got many pgs in state active+remapped so I reverted my change but still getting: 2013-08-28 15:00:30.457714 mon.0 [INF] pgmap v6228696: 1050 pgs: 147 active+clean, 903 active+remapped; 1479 GB data, 4701 GB used, 23912 GB / 28613 GB avail; 3018KB/s rd, 520KB/s wr, 112op/s; 8468/1185477 degraded (0.714%)
[15:04] <mozg> yanzheng, what do you think this issue relates to?
[15:04] <thorus> how to resolve the state remapped?
[15:04] * vata (~vata@2607:fad8:4:6:ce:36f0:efff:4474) has joined #ceph
[15:05] <yanzheng> thorus, wait until remap finishes
[15:06] <absynth> yep
[15:06] <yanzheng> mozg, ata disk driver bug
[15:07] <absynth> it's normal that after a reverted remap/reweight you will have new reweighting
[15:07] <absynth> because the data that was already moved off will have to be moved back onto the OSD
[15:07] <thorus> yanzheng: remap is finished, ceph health detail gives me stuck unclean back
[15:07] <absynth> all OSDs up&in?
[15:07] <absynth> all mons cool?
[15:08] <mozg> yanzheng, I am wondering why this issue would suddenly appear after i've upgraded to dumpling?
[15:08] <mozg> i was running the same benchmarks on 0.61.x without these issues
[15:08] <mozg> everything else is the same
[15:09] <mozg> the only thing changed was the upgrade to Dumpling
[15:09] <janos> mozg: just a shot in the dark here, but did you have any tunables set prior to the upgrade?
[15:11] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[15:11] <thorus> absynth: ceph -s says everything ok
[15:11] <thorus> except for pgmap degraded
[15:13] <absynth> i keep forgetting how to fix "stuck unclean" errors, sorry
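A few read-only commands that usually help narrow down stuck/unclean PGs; a rough sketch, with the pg id as a placeholder:

    ceph health detail                 # lists the stuck pgs by id
    ceph pg dump_stuck unclean         # shows which osds each stuck pg maps to
    ceph pg 3.45 query                 # 3.45 is a placeholder pg id; shows why it is not clean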
[15:15] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:18] * yy-nm (~Thunderbi@211.140.5.112) Quit (Quit: yy-nm)
[15:20] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:22] * sleinen (~Adium@macsl.switch.ch) has joined #ceph
[15:25] * X3NQ (~X3NQ@195.191.107.205) has joined #ceph
[15:26] * hybrid5121 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[15:30] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Quit: Leaving.)
[15:31] * sleinen (~Adium@macsl.switch.ch) Quit (Ping timeout: 480 seconds)
[15:32] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[15:32] <mozg> janos, not really. I've had rbd caching enabled
[15:32] <mozg> apart from that I do not think so
[15:33] <thorus> absynth: thats unfortunate :/
[15:40] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) Quit (Remote host closed the connection)
[15:40] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Quit: sprachgenerator)
[15:46] * BillK (~BillK-OFT@203-59-133-124.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[15:50] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[15:52] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:52] * alfredod_ (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[15:53] * zhyan_ (~zhyan@101.83.100.127) has joined #ceph
[15:58] * thanasisk (~akostopou@p5480A5A1.dip0.t-ipconnect.de) has joined #ceph
[15:58] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) has joined #ceph
[15:58] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[15:59] <thanasisk> failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i calypso --pid-file /var/run/ceph/mon.calypso.pid -c /etc/ceph/ceph.conf ' - any ideas how to fix that?
[16:00] * yanzheng (~zhyan@101.82.119.98) Quit (Ping timeout: 480 seconds)
[16:02] <alfredod_> thanasisk: what happens when you run that command in the failing host/
[16:02] * alfredod_ is now known as alfredodeza
[16:02] <thanasisk> alfredodeza, it runs normally
[16:02] <thanasisk> $? is 1
[16:03] <thorus> I guess our problem is that in ceph osd tree: http://paste.ubuntu.com/6036703/ the reweight is not 1 for some osds, how to change that?
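If an osd's reweight column really is the culprit, it can be pushed back to 1 with ceph osd reweight; a minimal sketch, with the osd id as a placeholder:

    ceph osd tree                # the REWEIGHT column shows the override (1 means none)
    ceph osd reweight 7 1.0      # osd.7 is a placeholder; the weight is in the range 0.0 - 1.0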
[16:03] <thanasisk> sorry it does not run normally, $? is 1
[16:04] <thanasisk> alfredodeza, http://pastebin.com/21wEZi43
[16:05] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[16:06] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:10] * berant (~blemmenes@gw01.ussignalcom.com) has joined #ceph
[16:11] * ishkabob (~c7a82cc0@webuser.thegrebs.com) has joined #ceph
[16:13] <mattch> Given redhat's tendency to patch kernels like mad and not increment version numbers, does anyone know if the recommendation to turn off write cache for ceph journals on raw disks (http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/) still applies to RHEL6 and derivative OSes (which claim to be 2.6.32)?
[16:17] <joelio> I thought it was usually just stability/bug fixes backported, not extending functionality? May be wrong though
[16:19] <alfredodeza> thanasisk: so ceph-create-keys is stuck
[16:19] <alfredodeza> what are your mon logs saying?
[16:21] * zhyan_ (~zhyan@101.83.100.127) Quit (Ping timeout: 480 seconds)
[16:23] * sleinen (~Adium@2001:620:0:46:1cde:90bd:9b0e:3af7) has joined #ceph
[16:24] <ishkabob> hey Ceph devs, I have a 2-node cluster (not my decision) where each box has 36 osds. We recently tried to optimize our placement groups (about an hour ago), and now the cluster is in a degraded state (~40% degraded). The load on both boxes is well into the 50s. We don't appear to be pinning the network at all, which is reasonable considering there are only 2 nodes. Is there anything that I can do to speed up this recovery?
[16:24] * tziOm (~bjornar@194.19.106.242) Quit (Remote host closed the connection)
[16:24] <mattch> joelio: Maybe... I'd go hunting in the kernel but I'm not sure what particular bit of 2.6.33 the docs are referring to :)
[16:25] <mattch> will just turn it off and leave it for now - probably not going to be a massive performance bottleneck
[16:25] * zhyan_ (~zhyan@101.82.56.121) has joined #ceph
[16:27] <thanasisk> alfredodeza, 2013-08-28 16:15:14.799019 7fb0ded9b780 0 mon.calypso does not exist in monmap, will attempt to join an existing cluster
[16:27] <thanasisk> 2013-08-28 16:15:14.799902 7fb0ded9b780 -1 no public_addr or public_network specified, and mon.calypso not present in monmap or ceph.conf
[16:27] <alfredodeza> that looks like a configuration issue, no?
[16:27] <joelio> mattch: afaik the 2.6.32 limitation was for async flushing to disk in rbd (on rbd backed VMs) - that can be partially fixed with forcing flushes on rbd. If this is for OSDs on the cluster, then I'm not sure, I'm afraid
[16:27] * raso (~raso@deb-multimedia.org) Quit (Quit: WeeChat 0.3.8)
[16:27] <thanasisk> alfredodeza, i fairly new to this, where should i look into?
[16:28] <thanasisk> e.g. first time touching ceph
[16:28] <mattch> joelio: thanks - will dig a bit more, but probably just turn it off and be done with it
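For what it's worth, the usual way to check and disable the volatile write cache on a raw journal disk is hdparm; a sketch only, with /dev/sdb standing in for the journal device:

    hdparm -W /dev/sdb       # report the current write-cache setting
    hdparm -W 0 /dev/sdb     # turn the drive's volatile write cache off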
[16:28] <alfredodeza> thanasisk: your config file should live in /etc/ceph/ceph.conf in that host
[16:29] <alfredodeza> make sure the monitor is properly mapped to an address and then restart your monitors
[16:29] * raso (~raso@deb-multimedia.org) has joined #ceph
[16:29] <thanasisk> it is using IPv6, how can I force it to use IPv4?
[16:29] <alfredodeza> your ceph.conf file should have a line that says: mon initial members = calypso
[16:30] <alfredodeza> and another one that says: mon host = {SOME IP ADDR}
[16:30] <alfredodeza> it *should* support IPV6
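Pulling those hints together, a minimal [global] section for a single monitor might look like the sketch below; the address and network are placeholders, and an IPv6 address can go in mon host just as well:

    [global]
        mon initial members = calypso
        mon host = 192.0.2.10
        public network = 192.0.2.0/24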
[16:30] <thanasisk> it actually says brizo, which is supposed to be my admin host
[16:31] * sleinen (~Adium@2001:620:0:46:1cde:90bd:9b0e:3af7) Quit (Ping timeout: 480 seconds)
[16:37] * sleinen (~Adium@macsl.switch.ch) has joined #ceph
[16:48] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Remote host closed the connection)
[16:48] * aliguori (~anthony@cpe-70-112-153-179.austin.res.rr.com) has joined #ceph
[16:49] * giancarloubuntu (~giancarlo@dynamic-adsl-94-34-230-248.clienti.tiscali.it) has joined #ceph
[16:51] <giancarloubuntu> ciao
[16:55] <giancarloubuntu> !list
[16:56] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[16:57] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[16:57] * ChanServ sets mode +o elder
[16:59] * zhyan_ (~zhyan@101.82.56.121) Quit (Ping timeout: 480 seconds)
[17:03] * steki (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:06] * artwork_lv (~artwork_l@adsl.office.mediapeers.com) Quit (Quit: artwork_lv)
[17:08] * sleinen (~Adium@macsl.switch.ch) Quit (Quit: Leaving.)
[17:09] * foosinn (~stefan@office.unitedcolo.de) Quit (Quit: Leaving)
[17:11] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Quit: Konversation terminated!)
[17:16] * sprachgenerator (~sprachgen@130.202.135.215) has joined #ceph
[17:18] * Karcaw_ (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) Quit (Read error: Operation timed out)
[17:18] <thanasisk> ceph-deploy new {mon-server-name}
[17:18] <thanasisk> ceph-deploy new mon-ceph-node
[17:18] <thanasisk> what is the difference between the two?
[17:18] <alfredodeza> no difference iirc
[17:19] * giancarloubuntu (~giancarlo@dynamic-adsl-94-34-230-248.clienti.tiscali.it) Quit (Quit: Sto andando via)
[17:19] <thanasisk> so it is just 1 command, correct?
[17:19] <alfredodeza> right
[17:19] <alfredodeza> what is the name of your node
[17:20] <thanasisk> deimos and calypso
[17:20] <thanasisk> im passing the FQDNs
[17:20] * sleinen (~Adium@2001:620:0:46:c0fe:371e:4fd8:beff) has joined #ceph
[17:21] <alfredodeza> so `ceph-deploy new deimos`
[17:22] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[17:24] * jmlowe1 (~Adium@c-98-223-198-138.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[17:25] <thanasisk> done
[17:25] * Bada (~Bada@195.65.225.142) Quit (Ping timeout: 480 seconds)
[17:25] <thanasisk> its chef-deploy mon that gives me errors
[17:26] * doxavore (~doug@99-7-52-88.lightspeed.rcsntx.sbcglobal.net) has joined #ceph
[17:37] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[17:37] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) has joined #ceph
[17:45] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[17:46] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:48] * thanasisk (~akostopou@p5480A5A1.dip0.t-ipconnect.de) Quit (Quit: Leaving)
[17:50] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[17:50] * ChanServ sets mode +o ircolle
[17:53] * jmlowe (~Adium@149.160.192.135) has joined #ceph
[17:54] * tnt (~tnt@91.177.230.140) has joined #ceph
[17:56] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[17:58] * zhyan_ (~zhyan@101.82.56.121) has joined #ceph
[17:59] * sagelap (~sage@2600:1012:b004:dbcf:f945:a530:9596:aa59) has joined #ceph
[17:59] * jmlowe (~Adium@149.160.192.135) Quit (Quit: Leaving.)
[18:04] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has joined #ceph
[18:06] * mozg (~andrei@212.183.128.48) has joined #ceph
[18:08] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[18:08] <n1md4> hi. I've changed the IP of the machine mon runs on, and have updated /etc/ceph/ceph.conf to reflect the change, but still not working. is there something else I should be doing..?
[18:09] <n1md4> using cuttlefish
[18:10] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[18:11] * artwork_lv (~artwork_l@adsl.office.mediapeers.com) has joined #ceph
[18:11] <mattch> n1md4: It used to be that you had to change each mon's data with monmaptool - don't know if that's changed
[18:14] * devoid (~devoid@130.202.135.221) has joined #ceph
[18:15] <jluis> mattch is right
[18:15] <jluis> n1md4, http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address
[18:16] <n1md4> where can I find the filemapname ...
[18:16] <n1md4> ah, thanks :)
[18:16] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[18:17] * mozg (~andrei@212.183.128.48) Quit (Ping timeout: 480 seconds)
[18:18] <niklas> At the end of stacktraces ceph prints "NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this."
[18:18] <niklas> What would I put as executable?
[18:20] <n1md4> jluis: I don't suppose there's a way to update the IP address of mon after the address has been changed?
[18:20] * xarses (~andreww@204.11.231.50.static.etheric.net) has joined #ceph
[18:20] <n1md4> the docs require ceph-mon to be running in order to maintain it; it would seem
[18:21] <jluis> niklas, whatever daemon generated that trace
[18:21] <jluis> ceph-mon, ceph-osd, ...
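Concretely, for a crashed osd that would be along these lines (the binary path is the usual Debian/Ubuntu location, and the output is large):

    # disassemble the daemon that produced the trace
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump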
[18:21] <jluis> n1md4, have you read the whole thing?
[18:21] <jluis> including the "messy way"?
[18:22] <jluis> if for some reason you didn't obtain a monmap *prior* to changing the ips, then you can obtain it with 'ceph-mon -i FOO --extract-monmap /tmp/monmap.FOO', available in recent versions
[18:22] <niklas> jluis: is it correct that I get 1067834 lines of memory from that?
[18:22] <jluis> cuttlefish may have it, not sure
[18:23] <jluis> niklas, no idea what led you to it, but I would assume so, if in fact that's what you get
[18:23] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[18:23] * jluis is now known as joao
[18:24] <niklas> joao: well one of my osds crashed… But there are still 14 running so…
[18:24] <niklas> I'll just attach it to my bug report
[18:24] <sagelap> alfredodeza: ceph-disk already hard-codes a bunch of /sbin and /usr/sbin paths; last time i looked at it i couldn't make the PATH work right, probably for some stupid reason
[18:24] <n1md4> jluis: I scanned the page, the first command of the 'messy way' is "ceph .... " but, because ceph-mon is broken, this instantly breaks.
[18:24] <sagelap> (i think i set os.environ['PATH'] but no dice).
[18:25] <n1md4> This isn't production, and I've been here before, maybe best to scrub and start again.
[18:25] <joao> n1md4, use 'ceph-mon -i FOO --extract-monmap /tmp/monmap'
[18:25] <joao> FOO being the mon's id
[18:26] <alfredodeza> sagelap: I've already fixed a couple of things for the paths, I can attempt to do that for this as well
[18:26] <joao> you should only need to run that once, on any monitor server
[18:26] <n1md4> joao: thanks, how do i find the id?
[18:26] <joao> cat /etc/ceph/ceph.conf | grep '\[mon\.'
[18:26] <joao> would be the thing after the dot
[18:27] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[18:27] <joao> alternatively, ls /var/lib/ceph/mon
[18:27] <joao> the thing after the dash on 'ceph-foo'
[18:27] <n1md4> the latter, as cuttlefish doesn't define mons/mds/etc
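For the record, the offline "messy way" boils down to editing the monmap while the monitor is stopped; this is only a rough sketch of the steps in the docs, with FOO and the address as placeholders:

    # with the monitor daemon stopped:
    ceph-mon -i FOO --extract-monmap /tmp/monmap
    monmaptool --print /tmp/monmap               # check the current entries
    monmaptool --rm FOO /tmp/monmap              # drop the stale address
    monmaptool --add FOO 192.0.2.20:6789 /tmp/monmap
    ceph-mon -i FOO --inject-monmap /tmp/monmap
    # then start the monitor again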
[18:27] <alfredodeza> sagelap: https://github.com/ceph/ceph-deploy/blob/master/ceph_deploy/hosts/common.py#L15-27
[18:28] <alfredodeza> I am working on making that helper a bit more generalized so I can re-use it
[18:28] <sagelap> excellent
[18:29] * artwork_lv (~artwork_l@adsl.office.mediapeers.com) Quit (Quit: artwork_lv)
[18:29] * jmlowe (~Adium@149.160.195.49) has joined #ceph
[18:30] * diegows (~diegows@200.68.116.185) has joined #ceph
[18:31] <n1md4> joao: thanks for the assist, getting there now.
[18:33] * sagelap (~sage@2600:1012:b004:dbcf:f945:a530:9596:aa59) Quit (Read error: No route to host)
[18:37] * alram (~alram@38.122.20.226) has joined #ceph
[18:44] * ScOut3R (~scout3r@54026B73.dsl.pool.telekom.hu) has joined #ceph
[18:45] * DarkAce-Z (~BillyMays@50.107.55.36) has joined #ceph
[18:47] * aciancaglini (~quassel@79.59.209.97) has joined #ceph
[18:47] * DarkAceZ (~BillyMays@50.107.55.36) Quit (Ping timeout: 480 seconds)
[18:48] <aciancaglini> hi, we are in trouble
[18:48] <n1md4> could anyone comment here http://pastebin.com/raw.php?i=pV8RjHPd mon won't start :\
[18:48] <aciancaglini> i've got a ceph cluster of 3 nodes
[18:49] <aciancaglini> 2 monitors have gone down
[18:49] <aciancaglini> and one has its FS in read only
[18:49] <aciancaglini> now i was able to start the monitors which were down
[18:50] <aciancaglini> but they both say : "sending client elsewhere" and "mon.st102-int@1 won leader election with quorum 1,2 - mon.st102-int calling new monitor election"
[18:50] <aciancaglini> could someone help us?
[18:50] <aciancaglini> it is a prod environment
[18:50] <sagewk> joao: can you look at https://github.com/ceph/ceph/pull/552 ?
[18:51] <joao> looking
[18:51] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[18:51] * KindTwo (~KindOne@h88.32.28.71.dynamic.ip.windstream.net) has joined #ceph
[18:52] * KindTwo is now known as KindOne
[18:52] <gregaf1> think that's usually permissions n1md4, but joao can probably say real quick
[18:52] <jmlowe> aciancaglini: what led up to the monitors going down?
[18:53] <jmlowe> aciancaglini: what version are you running, what does ceph health say?
[18:53] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[18:53] <joao> gregaf1, n1md4, my guess is that you don't have a monitor with id = 'admin', and ceph defaults to 'admin' when no id is provided
[18:53] <joao> try -i FOO
[18:54] <joao> (FOO obviously being whatever id that monitor is supposed to have)
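In other words, something along these lines, with FOO as the placeholder id; -d keeps the daemon in the foreground and logs to stderr, which helps while debugging:

    ceph-mon -i FOO -c /etc/ceph/ceph.conf    # normal start with an explicit id
    ceph-mon -i FOO -d                        # foreground, log to stderr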
[18:54] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[18:54] * odyssey4me (~odyssey4m@165.233.71.2) Quit (Ping timeout: 480 seconds)
[18:54] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[18:55] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[18:55] <aciancaglini> jmlowe: version 0.61.4-1raring - ceph health hangs - i don't know what brings the mon daemon down
[18:55] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[18:56] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[18:56] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[18:57] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[18:57] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[18:58] <n1md4> joao: Nice, that worked! I'm, learning :)
[18:59] <jmlowe> aciancaglini: um, I don't know how to help you but those questions are probably what somebody that could help you would ask
[18:59] * Vjarjadian (~IceChat77@176.254.37.210) Quit (Quit: Oops. My brain just hit a bad sector)
[19:00] * jbd_ (~jbd_@2001:41d0:52:a00::77) has left #ceph
[19:00] <jmlowe> aciancaglini: they would also tell you to turn up logging for the mons
[19:02] <aciancaglini> jmlowe: the logging is on both for mon & osd
[19:02] <gregaf1> and ask if the times are okay
[19:02] <aciancaglini> time is ok
[19:03] <gregaf1> ie, ntp has them synced correctly and they're within the specified tolerances?
[19:03] <gregaf1> it could be many things but that's the guess I have from the very little you're telling us
[19:03] <joao> define "brings the mon daemon down"?
[19:03] <aciancaglini> the cluster has been in production for 3 months
[19:03] <aciancaglini> sorry, I'm just a bit worried...
[19:04] <aciancaglini> the cloudstack console tells me that the storage was in an unavailable state
[19:04] <aciancaglini> so i logged in on the nodes and found the monitors down
[19:04] <aciancaglini> i restarted them as usual and they came up
[19:05] <aciancaglini> they seem to have simply died
[19:05] * ishkabob (~c7a82cc0@webuser.thegrebs.com) Quit (Quit: TheGrebs.com CGI:IRC)
[19:06] <gregaf1> and yet your other monitor has read-only FS and these two won't turn on, so I'm thinking something went wrong with your hardware, or your config changed somewhere, or you have some other warning flag you've missed (out of space?)
[19:06] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[19:07] <jmlowe> what do the monitor logs say? pastebin please
[19:07] <joao> sagewk, pull request 552 looks okay
[19:07] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[19:07] <sagewk> great, can you merge?
[19:07] <joao> sure
[19:08] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[19:08] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Leaving...)
[19:08] <joao> sagewk, merged
[19:08] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[19:08] <joao> sagewk, want me to backport it too?
[19:11] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[19:11] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[19:11] <sagewk> cherry-pick -x to dumpling :)
[19:12] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[19:12] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[19:12] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[19:13] <aciancaglini> this is the log from one monitor i restarted : http://pastebin.com/LkFq9903
[19:15] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit ()
[19:15] <jmlowe> aciancaglini: well, I'm guessing the relevant error is: mon/Paxos.cc: 554: FAILED assert(begin->last_committed == last_committed)
[19:16] <aciancaglini> you got it
[19:16] <aciancaglini> it is at the beginning and i lost it...
[19:16] <jmlowe> joao: is that an easy one?
[19:16] <aciancaglini> can one of you tell me the real risk of losing my data?
[19:17] <gregaf1> that's v0.61.4, so maybe there are some relevant fixed bugs, joao? although I don't remember what that assert actually implies about the system state
[19:17] * devoid (~devoid@130.202.135.221) Quit (Ping timeout: 480 seconds)
[19:18] <joao> that's an old version to be running, monitor-wise
[19:18] <joao> be back in a few minutes
[19:21] <joao> sagewk, https://github.com/ceph/ceph/pull/553 whenever you have the chance
[19:23] * zhyan__ (~zhyan@101.83.113.109) has joined #ceph
[19:24] <joao> sagewk, forgot a commit to that branch with the test that reproduces it; will push it shortly
[19:25] <aciancaglini> another piece of info... we put the OS on a USB stick; that's why the first node's FS is read only, but upgrading in this condition
[19:25] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[19:25] <aciancaglini> i think it is dangerous, no?
[19:26] * berant (~blemmenes@gw01.ussignalcom.com) Quit (Quit: berant)
[19:27] * markbby (~Adium@168.94.245.2) has joined #ceph
[19:27] * rturk-away is now known as rturk
[19:28] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[19:29] * jhujhiti (~jhujhiti@00012a8b.user.oftc.net) has left #ceph
[19:31] * zhyan_ (~zhyan@101.82.56.121) Quit (Ping timeout: 480 seconds)
[19:32] * markbby (~Adium@168.94.245.2) Quit (Remote host closed the connection)
[19:33] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[19:33] * markbby (~Adium@168.94.245.2) has joined #ceph
[19:33] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[19:35] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[19:37] <aciancaglini> is there someone who has a good suggestion?
[19:40] <gregaf1> joao wants you to upgrade your monitors to v0.61.7 and see what they do then
[19:41] <aciancaglini> sorry, just to be clear... the risk is that I'm losing my data?
[19:41] <aciancaglini> and i can upgrade just two monitors (n° 2 & n° 3, not n° 1 because of the FS failure....)
[19:43] * alfredodeza is now known as alfredo|noms
[19:44] * ssejour (~sebastien@out-chantepie.fr.clara.net) Quit (Quit: Leaving.)
[19:44] <gregaf1> well the one with a read-only FS you can't use for anything anyway so it's out; you can upgrade the other two and it won't make anything worse, yes
[19:46] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[19:48] <aciancaglini> we are very tired.. :)
[19:48] <aciancaglini> and forgot that the system log was redirected to a log server
[19:48] * tnt_ (~tnt@109.130.110.3) has joined #ceph
[19:50] <aciancaglini> so i discovered that the mon on node 1 crashed because of the FS error : http://pastebin.com/wDmrktm7
[19:51] <aciancaglini> the other system logs seem clean
[19:54] <joao> read-only fs will certainly cause the monitors to misbehave and crash
[19:54] <aciancaglini> so really the only thing you suggest is to upgrade?
[19:54] * tnt (~tnt@91.177.230.140) Quit (Ping timeout: 480 seconds)
[19:55] <aciancaglini> i fear losing data... it would be a disaster
[19:55] <joao> I'd need to see some logs with debug mon = 10 on the crashing monitors, but upgrading the monitors, as greg said, would do no harm
[19:55] <joao> on the contrary, there's a bunch of bugs that were fixed since 0.61.4
[19:56] <joao> upgrade to 0.61.7 (or is it .8 by now?)
[19:56] <joao> that should help out with all the other bugs that were fixed; after that we can focus our efforts on what's causing your issues
[19:57] <joao> no use trying to figure out what's going on there if you can very well just hit the next one in line, right? :)
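And for when those logs are needed, raising the mon debug level is a ceph.conf change plus a monitor restart; just an illustrative sketch (other subsystems such as paxos or ms can be raised the same way):

    [mon]
        debug mon = 10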
[19:58] * roald (~oftc-webi@87.209.150.214) has joined #ceph
[20:00] <sagewk> joao: reviewed 553.. one nit otherwise looks good
[20:05] * themgt (~themgt@201-223-239-26.baf.movistar.cl) has joined #ceph
[20:06] <aciancaglini> log with debug mon = 10 :) http://pastebin.com/twQTZTwE
[20:06] <aciancaglini> (but these are from the mon that died & was restarted)
[20:06] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:08] * doxavore (~doug@99-7-52-88.lightspeed.rcsntx.sbcglobal.net) Quit (Quit: :qa!)
[20:08] <n1md4> any ideas .. http://pastebin.com/raw.php?i=WQMrxeuG
[20:08] <aciancaglini> would it be useful to add a new node and create a new mon?
[20:08] <sagewk> dmick: around?
[20:09] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[20:10] * tizack (~steve@wraith.wireless.rit.edu) has joined #ceph
[20:24] * markbby (~Adium@168.94.245.2) Quit (Remote host closed the connection)
[20:25] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[20:28] <gregaf1> sagewk: pulled out your tiering interface patches on top of master in wip-tier-interface and commented at https://github.com/ceph/ceph/pull/554
[20:28] <gregaf1> I want to talk about it a bit more :)
[20:29] <sagewk> k
[20:35] * houkouonchi-work (~linux@12.248.40.138) has joined #ceph
[20:35] <joao> sagewk, is wip-6047 to go into master? pull request was made against next
[20:35] <joao> close and reopen?
[20:35] * roald (~oftc-webi@87.209.150.214) Quit (Remote host closed the connection)
[20:36] <joao> s/reopen/open another/
[20:37] <sagewk> joao: next is ok
[20:37] <joao> okay
[20:37] <joao> will do after dinner then
[20:38] <joao> just got ready; brb
[20:40] <sagewk> alfredo|noms: zackc: Tamil: joshd: https://github.com/ceph/teuthology/pull/59
[20:43] * alfredo|noms is now known as alfredodeza
[20:49] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[20:50] * tizack (~steve@wraith.wireless.rit.edu) Quit (Quit: Leaving)
[20:51] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:52] * Cube (~Cube@12.248.40.138) has joined #ceph
[20:52] * tryggvil (~tryggvil@178.19.53.254) Quit (Quit: tryggvil)
[20:52] * devoid (~devoid@130.202.135.221) has joined #ceph
[20:53] * rturk is now known as rturk-away
[21:02] * markbby (~Adium@168.94.245.4) has joined #ceph
[21:10] * campr (~campr@53545693.cm-6-5b.dynamic.ziggo.nl) has joined #ceph
[21:11] <campr> Hey guys, does anyone know if there is any video of the following presentations?: http://ceph.com/presentations/20121102-ceph-day/
[21:11] * dmick1 (~dmick@cpe-76-87-42-76.socal.res.rr.com) has joined #ceph
[21:21] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[21:25] * dmsimard (~Adium@108.163.152.2) Quit (Quit: Leaving.)
[21:27] * allsystemsarego (~allsystem@5-12-37-127.residential.rdsnet.ro) Quit (Quit: Leaving)
[21:29] * zhyan__ (~zhyan@101.83.113.109) Quit (Ping timeout: 480 seconds)
[21:29] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[21:34] * dmick1 (~dmick@cpe-76-87-42-76.socal.res.rr.com) has left #ceph
[21:36] * jmlowe (~Adium@149.160.195.49) Quit (Quit: Leaving.)
[21:41] * rturk-away is now known as rturk
[21:51] * vata (~vata@2607:fad8:4:6:ce:36f0:efff:4474) Quit (Quit: Leaving.)
[21:53] * markbby (~Adium@168.94.245.4) Quit (Quit: Leaving.)
[21:53] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:55] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:56] * mjblw (~mbaysek@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[21:58] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[21:58] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[21:59] <gregaf1> afraid not, campr
[22:00] * imjustmatthew (~imjustmat@c-24-127-107-51.hsd1.va.comcast.net) Quit (Remote host closed the connection)
[22:00] * markbby (~Adium@168.94.245.2) has joined #ceph
[22:01] * markbby (~Adium@168.94.245.2) Quit ()
[22:03] * markbby (~Adium@168.94.245.2) has joined #ceph
[22:05] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[22:05] * ninkotech_ (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[22:09] <sagewk> glowell: ping
[22:09] <glowell> hello
[22:09] <sagewk> glowell: do you mind looking at the packaging changes in d08e05e463f1f7106a1f719d81b849435790a3b9 ?
[22:09] <glowell> ok
[22:09] <sagewk> it went into master a while ago, but a extra check would be nice before i backport it to dumpling
[22:12] * sel (~sel@212.62.233.233) Quit (Quit: Leaving)
[22:14] * campr (~campr@53545693.cm-6-5b.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[22:14] <glowell> A quick read through didn't pop up any flags. I'll do a test install to double check that the files are getting installed in all the right places.
[22:16] * sleinen (~Adium@2001:620:0:46:c0fe:371e:4fd8:beff) Quit (Ping timeout: 480 seconds)
[22:25] <gregaf1> from the mailing list: "I didn't set pgp_num. It came by default with 2 in my case."
[22:25] <sagewk> alfredodeza: zackc: repushed https://github.com/ceph/teuthology/pull/59
[22:25] <gregaf1> I'm not seeing how that could happen in the monitor; any ideas joao, sagewk, joshd?
[22:26] <joshd> he created a pool with 2 pgs originally maybe?
[22:27] <gregaf1> it had 16 by the time he copied the output to the list or that would be my guess, though I suppose it's possible :/
[22:27] <joshd> then he increased pg_num without changing pgp_num
[22:29] <gregaf1> woah, suite results now include timings on the passed tests! and links!
[22:29] <gregaf1> thanks alfredodeza? or zackc?
[22:30] <alfredodeza> zackc for coming up with the "we need to fix this now", I helped a bit with the formatting
[22:30] <alfredodeza> but for sure, more kudos for zackc
[22:30] * alfredodeza stands up, starts clapping for zackc
[22:31] <alfredodeza> well done
[22:31] <kraken> http://i.imgur.com/wSvsV.gif
[22:31] <alfredodeza> exactly
[22:32] <joao> gregaf1, that shouldn't happen; iirc that's one of the mandatory parameters isn't it?
[22:32] <gregaf1> pg_num is required but pgp_num isn't
[22:32] <gregaf1> I think joshd is probably right and he started the pool out with 2 PGs and didn't bump up pgp_num when doing the splits
[22:33] <gregaf1> just didn't think of that diagnosis myself :)
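For anyone hitting the same thing, the fix is simply to raise pgp_num to match pg_num; a sketch with a placeholder pool name and the pool sizes from this thread:

    ceph osd pool get mypool pg_num       # compare the two values
    ceph osd pool get mypool pgp_num
    ceph osd pool set mypool pgp_num 16   # bump pgp_num up to pg_num so the split pgs actually get remapped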
[22:35] * zackc glances at irc and sees applause
[22:36] <zackc> yay! i mean you're welcome!
[22:37] * vipr (~vipr@frederik.pw) Quit (Ping timeout: 480 seconds)
[22:37] * zackc was really sick of getting headaches from reading teuthworker emails
[22:38] <zackc> note that because teuthology's architecture is really special, some emails won't have the sexy new formatting yet
[22:38] * sagelap (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) has joined #ceph
[22:41] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[22:43] * ShaunR- (~ShaunR@staff.ndchost.com) Quit (Read error: Connection reset by peer)
[22:45] * rturk is now known as rturk-away
[22:46] * Cube (~Cube@12.248.40.138) Quit (Quit: Leaving.)
[22:46] * Cube (~Cube@12.248.40.138) has joined #ceph
[22:47] * aardvark (~Warren@2607:f298:a:607:ccbb:b6b0:2d7c:a034) Quit (Ping timeout: 480 seconds)
[22:47] * wusui (~Warren@2607:f298:a:607:ccbb:b6b0:2d7c:a034) has joined #ceph
[22:49] * jmlowe (~Adium@c-98-223-198-138.hsd1.in.comcast.net) has joined #ceph
[22:50] * ishkabob (~c7a82cc0@webuser.thegrebs.com) has joined #ceph
[22:50] <ishkabob> hey guys, can i set "osd max backfills" on a live system?
[22:52] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[22:54] <ishkabob> or any osd options for that matter?
[22:55] * mozg (~andrei@host86-185-78-26.range86-185.btcentralplus.com) has joined #ceph
[23:00] * vata (~vata@2607:fad8:4:6:2dcd:32ef:bfab:c53c) has joined #ceph
[23:04] * jantje (~jan@paranoid.nl) Quit (Read error: Connection reset by peer)
[23:06] * markbby (~Adium@168.94.245.1) has joined #ceph
[23:07] * markbby (~Adium@168.94.245.1) Quit ()
[23:07] <sagewk> zackc: btw, when cherry-picking to non-master always use -x
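i.e. something like this, so the backported commit records where it came from (the sha is a placeholder):

    git checkout dumpling
    git cherry-pick -x <sha-from-master>   # -x appends "(cherry picked from commit ...)" to the message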
[23:10] * DarkAce-Z is now known as DarkAceZ
[23:10] * jantje (~jan@paranoid.nl) has joined #ceph
[23:13] <zackc> sagewk: ah, ok. oops.
[23:14] <aciancaglini> does someone know how I can get in touch with inktank now?
[23:14] <sagewk> zackc: for the future :)
[23:14] <aciancaglini> it is very urgent
[23:15] <sagewk> aciancaglini: yeah msg me
[23:19] <sjustlaptop> ishkabob: that should be fine to set via injectargs
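A sketch of how that looks on a live cluster; the osd id and the value are only examples:

    ceph tell osd.0 injectargs '--osd-max-backfills 1'
    # or apply it to every osd in the cluster:
    for id in $(ceph osd ls); do
        ceph tell osd.$id injectargs '--osd-max-backfills 1'
    done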
[23:21] <wschulze> aciancaglini: we can help you
[23:21] <wschulze> Where are you based?
[23:21] <wschulze> time zone wise?
[23:21] <aciancaglini> we are based in Italy
[23:21] <aciancaglini> TZ CET
[23:27] * LeaChim (~LeaChim@176.24.168.228) Quit (Read error: Operation timed out)
[23:43] * carif (~mcarifio@64.119.130.114) has joined #ceph
[23:43] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) Quit (Quit: ...)
[23:43] <mozg> sagewk, hello. are you online?
[23:44] <mozg> i am having a bit of an issue with my virtual vms under high io
[23:44] <mozg> and filed a bug report
[23:44] <mozg> i was wondering if there are any steps that I should take before running the benchmark tests which cause the kernel panic in my vms?
[23:44] <kraken> http://i.imgur.com/fH9e2.gif
[23:45] <mozg> i mean to help debugging the issue?
[23:45] * mschiff (~mschiff@port-50007.pppoe.wtnet.de) Quit (Remote host closed the connection)
[23:46] <mozg> i am also having issues with hang tasks
[23:46] <mozg> that's running the latest Dumpling release
[23:47] <sjusthm> gregaf1: uh
[23:47] <sjusthm> hmm
[23:48] <sjusthm> ids I would think
[23:48] <gregaf1> sorry, let me copy that over
[23:48] <gregaf1> "do we really want pg_pool_t to be holding strings of pool names or should it track integer IDs?"
[23:48] * Vjarjadian (~IceChat77@176.254.37.210) has joined #ceph
[23:48] <gregaf1> and now I'm checking adn we do have a pool rename command, so it has to be IDs
[23:48] <gregaf1> that answers that question
[23:48] <gregaf1> ;)
[23:49] <sagewk> gregaf1: oh.. yeah, int64_t
[23:49] <gregaf1> s/int64_t/uint64_t/
[23:50] <sjusthm> whatever type the pools already are, right?
[23:50] <gregaf1> the place I was looking was uint64_t, although this might be one of the ones we're schizophrenic about in different places
[23:50] * Cube (~Cube@12.248.40.138) Quit (Remote host closed the connection)
[23:50] <mozg> is there anyone who could help me to setup my ceph env for debugging before I start the benchmarking?
[23:51] * Cube (~Cube@12.248.40.138) has joined #ceph
[23:51] <mozg> i would like to be able to provide some meaningful info for debugging
[23:51] <mozg> when I'll have a kernel panic
[23:51] <kraken> http://i.imgur.com/fH9e2.gif
[23:51] <gregaf1> what interface are you using?
[23:51] <gregaf1> kraken: help
[23:51] * kraken whispers to gregaf1
[23:52] <gregaf1> kraken: panic disable
[23:52] <kraken> http://i.imgur.com/fH9e2.gif
[23:52] <gregaf1> dammit
[23:52] <joshd> mozg: it'd be good to rule out guest driver bugs by trying a different version of the kernel in the guest
[23:52] <gregaf1> alfredodeza, how do we turn that off?
[23:53] <mozg> joshd, I am already on the latest ubuntu kernel with the 3.5 backport
[23:53] <joshd> mozg: hung tasks can still occur if your cluster is overloaded of course, but the lost interrupts seem fishy
[23:53] <mozg> i've now started the benchmark and I am already seeing a hang tasks on the client side
[23:53] <mozg> and a bunch of slow requests on the osd servers
[23:53] <mozg> so something odd is going on
[23:53] <mozg> for sure
[23:54] <joshd> mozg: that just sounds like your osds are overloaded
[23:55] <joshd> mozg: check the disk queues on the osds, but slow requests probably just means the benchmarks are trying to do more iops than your osds can handle in their current configuration
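A quick way to watch those disk queues while the benchmark runs (needs the sysstat package; device names will differ per host):

    iostat -x 2    # watch avgqu-sz, await and %util on the osd data and journal disks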
[23:58] <mozg> from what i can see one of the osd servers has a load of 16, the other load of around 8
[23:58] <mozg> and iowait between 10 and 20% at the moment
[23:59] <mozg> not sure if this is any help at all
[23:59] <kraken> ≖_≖

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.