#ceph IRC Log


IRC Log for 2013-05-21

Timestamps are in GMT/BST.

[0:00] <andrei> it has been running for ages now
[0:00] <dmick> this one what?
[0:00] <dmick> that's a monitor
[0:00] <andrei> ah, sorry, wrong line
[0:00] <andrei> /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i arh-ibstorage1
[0:00] <andrei> this process seems to be stuck
[0:01] <dmick> ceph --admin-daemon=/var/run/ceph/ceph-mon.arh-ibstorage1.asok mon_status has what as a returncode?
[0:02] <gregaf> sagewk: there's still a difference in retry behavior between chooseleaf and doing a choose down the levels, right? (cc sjust)
[0:02] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[0:02] <gregaf> oh, I guess he left
[0:03] <andrei> dmick: here is what i have: http://fpaste.org/13296/13690873/
[0:03] <dmick> yeah, that's the output; the return code would have been in $?
[0:03] <dmick> but that shows the mons aren't in quorum
[0:03] * dcasier (~dcasier@ has joined #ceph
[0:04] <dmick> that's why create keys is stuck
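A minimal sketch of the check dmick is doing by eye on the `mon_status` output. It assumes the admin-socket output is JSON with `rank`, `state`, and `quorum` fields; the sample string is a hypothetical, trimmed-down stand-in (the real paste is not reproduced in the log), modelled on the `@-1(probing)` state shown in the mon logs later on.

```python
import json


def in_quorum(mon_status_json: str) -> bool:
    """Return True if the monitor's own rank appears in its quorum list."""
    status = json.loads(mon_status_json)
    # A monitor that is still probing has rank -1 and an empty quorum list,
    # so the membership test below is False for it.
    return status.get("rank", -1) in status.get("quorum", [])


# Hypothetical sample, not the actual paste from the log.
sample = '{"name": "arh-ibstorage1", "rank": -1, "state": "probing", "quorum": []}'
print(in_quorum(sample))
```

While this returns False on every monitor, ceph-create-keys has no quorum to talk to and keeps waiting.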
[0:07] <andrei> okay, but what do I do to continue? why is it not in quorum? how do I check what it needs?
[0:07] <andrei> i am doing clean install
[0:10] <andrei> i've got 3 monitor servers
[0:13] <andrei> i think it's a fw issue, one of the monitors is not accessible
[0:13] <Tamil> andrei: only client.admin key is not created? is that what gatherkeys report?
[0:16] <andrei> nope, all keys apart from one is not gatheres
[0:16] <andrei> gathered
[0:17] <andrei> it can't find admin and bootstrap-* keys
[0:17] <Tamil> andrei: i hope your firewall is turned off "sudo service iptables off"
[0:17] <andrei> the fw issue is solved, i can now access port 6789 from any of the 3 mons
[0:17] <andrei> but the issue with keys is still there
[0:17] <andrei> or do I need to open any other ports?
[0:18] <Tamil> andrei: I dont think so
[0:18] <dmick> well
[0:19] <dmick> http://ceph.com/docs/master/rados/configuration/network-config-ref/?highlight=iptables#ip-tables
[0:20] <tchmnkyz> hey guys, will try again today. I am having problems starting up MDS services on 2 of the nodes that previously worked fine. The log file gives the following error: mds.-1.0 handle_mds_map mdsmap compatset
[0:20] <tchmnkyz> compat={},rocompat={},incompat={1=base v0.20,2=cient writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding} not writeable with daemon fealtures compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}, killing myself
[0:20] * dcasier (~dcasier@ Quit (Ping timeout: 480 seconds)
[0:22] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[0:24] <andrei> the firewall rules are in place
[0:24] <andrei> and i can telnet to port 6789 on all 3 hosts and from all of them
[0:24] <dmick> I'm just saying there's more ports used than 6789
[0:24] <dmick> maybe not for mons
[0:25] <dmick> but certainly for other daemons
[0:25] <andrei> true
[0:25] <andrei> thanks for the info
[0:25] <andrei> any other idea why it's not in quorum?
[0:29] <dmick> no. check mon logs?
[0:31] <andrei> 2013-05-20 23:30:55.680705 7f1dd927e700 1 mon.arh-cloud1@-1(probing) e0 adding peer to list of hints
[0:31] <andrei> a bunch of these messages
[0:31] <andrei> with different ips of the mon servers
[0:32] <andrei> i also see these coming up as well:
[0:32] <andrei> 2013-05-20 23:30:18.039193 7f77a26e3700 1 mon.arh-ibstorage1@-1(probing) e0 discarding message auth(proto 0 34 bytes epoch 0) and sending client elsewhere
[0:32] <andrei> 2013-05-20 23:30:18.039250 7f77a26e3700 1 mon.arh-ibstorage1@-1(probing) e0 discarding message mon_subscribe({monmap=0+,osdmap=13}) and sending client elsewhere
[0:33] <andrei> any idea?
[0:36] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[0:41] * MarkN (~nathan@ has left #ceph
[0:42] * BManojlovic (~steki@fo-d- Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:42] * Tamil (~tamil@ Quit (Quit: Leaving.)
[0:44] <BillK> does ceph handle sparse files - and can you sparsify in-situ (i.e. dd if=/dev/zero of=/a; rm /a)
[0:44] <BillK> or do you need to export/re-import?
[0:45] <loicd> sjust: I'm here if you have questions regarding https://github.com/dachary/ceph/commit/7267244bf0775442d445407e55b7fbb9935a404a
[0:46] <sjust> loicd: hey
[0:46] <sjust> reading your patch set, I've noticed some dead code I'm fixing in master
[0:46] <sjust> you'll need to rebase and remove the methods
[0:46] <loicd> ok
[0:47] <sjust> I'm leaving comments on the patch on github
[0:47] <loicd> I overlooked the 80 col limit, I'll fix this
[0:47] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:47] * loicd reading the comments as they get added ;-)
[0:47] <sjust> loicd: thanks! I have tiny code windows
[0:48] <sjust> loicd: you probably want to give me a bit though, I'm not quite done pruning dead code
[0:49] <loicd> Ok, I'll wait for it. I'm not blocked.
[0:49] <loicd> s/a bit though/a bit of time though/ ?
[0:50] <sjust> loicd: yeap
[0:50] <sjust> *yep
[0:57] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:59] * Tamil (~tamil@ has joined #ceph
[1:02] <loicd> sjust: I'm out for the night, thanks a lot for taking the time to review the patch :-)
[1:02] <sjust> loicd: certainly
[1:02] <sjust> see you tomorrow
[1:02] <loicd> have a nice day :-)
[1:03] <andrei> so, i guess ceph-deploy is broken somewhere as it can't install the initial cluster using 3 clean servers
[1:04] <dmick> andrei: I don't have time to diagnose what's wrong with your monitors, but we'd need to understand that before we knew it was broken
[1:10] <andrei> dmick: true, however, i've not got a clue how to do that as I am new to ceph. I can say that earlier on today I did use the old guide with mkcephfs on the same hardware and that didn't give me any problems
[1:10] <andrei> i did clean the system afterwards
[1:10] <andrei> before running the ceph-deploy
[1:11] <andrei> so something odd is going on with it
[1:11] <andrei> at least with my simple setup
[1:11] * MK_FG (~MK_FG@00018720.user.oftc.net) Quit (Ping timeout: 480 seconds)
[1:14] * infernix (nix@5ED33947.cm-7-4a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[1:19] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[1:23] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) has joined #ceph
[1:25] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:26] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:27] <andrei> how does ceph handle device letter changes?
[1:27] <andrei> i did notice a few times linux tends to rearrange drive letters
[1:27] <andrei> doesn't happen all the time, but i've seen this occasionally
[1:29] <lurbs> You could always use the /dev/disk/by-uuid/ paths instead, if it handles it poorly.
[1:34] <dmick> andrei: not at all. ceph does what you've instructed it to do. If you're not initializing devices, it should complain if it doesn't have a proper filestore for the daemon in question
[1:35] <dmick> andrei: are your monitors still apparently not talking?
[1:36] <andrei> dmick: thanks
[1:36] <andrei> they were not talking properly
[1:36] <andrei> but i've removed the installation
[1:36] <dmick> ok
[1:36] <andrei> and went back to the 5 minutes quick guide
[1:37] <andrei> as I need to start testing things tomorrow with a working cluster
[1:37] <andrei> and it is almost 1am now (((
[1:37] * MK_FG (~MK_FG@00018720.user.oftc.net) has joined #ceph
[1:37] <andrei> need to wrap it up
[1:38] <dmick> tchmnkyz: still problems with mds?
[1:41] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[1:47] * themgt (~themgt@24-177-232-33.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[1:47] * themgt (~themgt@24-177-232-33.dhcp.gnvl.sc.charter.com) has joined #ceph
[1:48] <phantomcircuit> why does it take so long to delete an rbd volume?
[1:49] <phantomcircuit> oh right
[1:49] <phantomcircuit> it's deleting all the objects
[1:52] <dmick> or even just checking that they exist
[1:52] * mikedawson (~chatzilla@ has joined #ceph
[1:52] <lurbs> Takes ages even if the objects don't exist. Try creating a 16 TB volume and then deleting it. Or rather, don't. It'll take longer than you'd like.
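A rough back-of-the-envelope sketch of why the delete is slow, assuming the image uses the default 4 MiB RBD object size (the log does not say which object size was in use). Each possible object has to be removed, or at least checked, one by one.

```python
DEFAULT_OBJECT_BYTES = 4 * 2**20  # 4 MiB; assumed default RBD object size


def object_count(image_bytes: int, object_bytes: int = DEFAULT_OBJECT_BYTES) -> int:
    """Number of backing objects an image of this size can span (rounded up)."""
    return -(-image_bytes // object_bytes)  # ceiling division


# 16 TB volume from lurbs' example, taken here as 16 TiB.
print(object_count(16 * 2**40))  # 4194304
```

That is roughly four million per-object operations for a 16 TiB image, whether or not the objects were ever written.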
[1:52] * MrNPP (~MrNPP@0001b097.user.oftc.net) has joined #ceph
[1:53] <MrNPP> I have the strangest problem, I upgraded to 0.61.2 and one of my 3 monitors won't come back online. When i try to start it I get failed to bind to an address error, the weird thing is, the address its trying to bind to is a different monitor's ip.
[1:54] <dmick> MrNPP: one would immediately suspect ceph.conf and/or /etc/hosts
[1:54] <MrNPP> i thought so too, but both are correct, and hasn't changed
[1:55] * esammy (~esamuels@host-2-102-69-49.as13285.net) Quit (Quit: esammy)
[1:55] <dmick> can you paste the mon startup log
[1:56] <MrNPP> 2013-05-20 16:55:52.264706 3452c363780 0 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 43879
[1:56] <MrNPP> 2013-05-20 16:55:52.585535 3452c363780 -1 accepter.accepter.bind unable to bind to Cannot assign requested address
[1:56] <MrNPP> [mon.1] host = hyp01 mon addr =
[1:58] <MrNPP> nano /etchttps://gist.github.com/anonymous/bc7acd6f81dee3316bdd
[1:58] <MrNPP> er
[1:58] <MrNPP> https://gist.github.com/anonymous/bc7acd6f81dee3316bdd
[1:59] <MrNPP> i'm stuck at this point, makes no sense
[2:00] <dmick> it's a bit odd to have the short name on two different addresses, but I assume hosts file processing does something normal with that
[2:01] <MrNPP> well the lower one was just added
[2:01] <MrNPP> i thought maybe that was the reason
[2:01] <MrNPP> but it didn't make a difference
[2:01] <dmick> what happens if you take it out of 127?
[2:02] <MrNPP> doesn't make a difference
[2:02] <dmick> is it possible DNS or some other nameservice is returning different values?
[2:03] <MrNPP> its not in dns currently
[2:04] <dmick> have these hosts always been these addresses?
[2:04] <MrNPP> yeah
[2:04] <MrNPP> nothing changed
[2:04] <MrNPP> just upgraded
[2:04] <dmick> mon will try to do stuff with the monmap, and warn you if the IP-for-this-ID is different than the local IP
[2:05] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:05] <MrNPP> is there a way to reset the monitor files?
[2:05] <dmick> before we go there
[2:05] <dmick> try starting the mon from the cmdline so you see errors
[2:05] <MrNPP> i did
[2:05] <MrNPP> i'm running it from command line
[2:06] <dmick> and add --debug-mon=20?
[2:06] <MrNPP> /usr/bin/ceph-mon -i 1 --pid-file /var/run/ceph/mon.1.pid -c /etc/ceph/ceph.conf
[2:06] <MrNPP> 2013-05-20 17:06:21.039615 322e9ada780 0 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 44136
[2:06] <MrNPP> 2013-05-20 17:06:21.040651 322e9ada780 10 needs_conversion
[2:06] <MrNPP> 2013-05-20 17:06:21.329225 322e9ada780 10 obtain_monmap
[2:06] <andrei> dmick: you were right, there is an issue with the key creation that is likely nothing to do with ceph-deploy
[2:06] <MrNPP> 2013-05-20 17:06:21.354277 322e9ada780 10 obtain_monmap detected aborted sync
[2:06] <MrNPP> 2013-05-20 17:06:21.354340 322e9ada780 10 obtain_monmap read backup monmap
[2:06] <MrNPP> 2013-05-20 17:06:21.354545 322e9ada780 -1 accepter.accepter.bind unable to bind to Cannot as
[2:07] <andrei> as i've now done the 5 minute guide and i am having same issues if I am using 3 monitors and not one shown in the guide
[2:07] * mtanski (~mtanski@ Quit (Ping timeout: 480 seconds)
[2:08] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[2:08] <dmick> MrNPP: let's see what that monmap actually says
[2:09] <dmick> monmaptool will dump it
[2:09] <dmick> it's in the mon data dir
[2:09] <dmick> uh
[2:09] <dmick> well it was.. doh
[2:09] <MrNPP> yeah
[2:09] <MrNPP> i can't seem to find it anymore
[2:10] <MrNPP> i tried to use the remove a dirty monitor, but the instructions are outdated
[2:13] <phantomcircuit> hmm guess i should make deleting disk instances asynchronous...
[2:15] <dmick> MrNPP: monitor data store has changed to leveldb; not sure what the "look at monmap" procedure is now; hunting
[2:15] <phantomcircuit> hmm
[2:16] <phantomcircuit> i wonder with libvirtd will it let one connection create a vm with a disk that has a pending delete operation
[2:16] <phantomcircuit> i dont think so but i better try it
[2:17] * arye (~arye@pool-96-248-99-133.cmdnnj.fios.verizon.net) has joined #ceph
[2:17] <MrNPP> dmick, thank you
[2:23] <joshd> phantomcircuit: rbd deletes can be faster in the next version https://github.com/ceph/ceph/commit/40956410169709c32a282d9b872cb5f618a48926
[2:24] <phantomcircuit> joshd, im still getting any emails over a 0.56.3 and 0.56.6 compatibility issue
[2:24] <phantomcircuit> version bumps scare me
[2:27] * themgt (~themgt@24-177-232-33.dhcp.gnvl.sc.charter.com) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[2:29] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) has joined #ceph
[2:32] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:40] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:40] * Dieter_be (~Dieterbe@dieter2.plaetinck.be) has left #ceph
[2:48] * mikedawson_ (~chatzilla@ has joined #ceph
[2:52] * Tamil (~tamil@ Quit (Quit: Leaving.)
[2:53] <dmick> MrNPP: there isn't a good one.
[2:53] * mikedawson (~chatzilla@ Quit (Ping timeout: 480 seconds)
[2:53] * mikedawson_ is now known as mikedawson
[2:55] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[2:58] <MrNPP> hmmm
[2:58] <MrNPP> so i wonder how i can wipe out this monitor and reinitialize it
[2:58] <via> has anyone experimented with ceph on VMs or with a vps provider? and if so did it suck
[3:02] <dmick> MrNPP, joao: http://tracker.ceph.com/issues/5125
[3:02] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) has joined #ceph
[3:03] <dmick> MrNPP: if the cluster was healthy before, I think it should be safe to blow away the monstore and start over on this monitor
[3:03] <dmick> I'm nervous that we don't know what went wrong
[3:03] <dmick> and "obtain_monmap detected aborted sync" makes me think it was trying to convert and failed
[3:04] <joao> hmm
[3:04] <joao> no
[3:04] <joao> that means that there was a sync in progress at some point
[3:04] <joao> and the monitor either crashed or was killed
[3:04] <joao> mid-sync
[3:05] <joao> some safeguards against that screwing up the monmap were introduced for cuttlefish I think
[3:05] <joao> anyway
[3:06] <joao> if you have a cluster with more than one monitor, it should be safe to blow away that monitor's store and create a fresh new one, it will sync from the other monitors
[3:06] <joao> if you have just the one, then that's a really bad idea
[3:08] <joao> this is assuming I am understanding the whole scope of the issue at hand
[3:09] <dmick> there are three, right MrNPP?
[3:09] <dmick> https://gist.github.com/anonymous/bc7acd6f81dee3316bdd
[3:09] <joao> btw, I should make this line more obvious:
[3:09] <joao> <MrNPP> 2013-05-20 17:06:21.040651 322e9ada780 10 needs_conversion
[3:09] <joao> it doesn't mean it needs conversion
[3:09] <joao> it's just checking if it does
[3:09] <dmick> i agree it should be more obvious :)
[3:10] <joao> ah, here are those safeguards I mentioned before:
[3:10] <joao> <MrNPP> 2013-05-20 17:06:21.354277 322e9ada780 10 obtain_monmap detected aborted sync
[3:10] <joao> oops
[3:10] <joao> wrong line
[3:10] <joao> <MrNPP> 2013-05-20 17:06:21.354340 322e9ada780 10 obtain_monmap read backup monmap
[3:10] <joao> this ^
[3:10] <MrNPP> dmick, yeah, three
[3:11] <MrNPP> however i don't think it was healthy
[3:12] <joao> MrNPP, from the looks of that 'unable to bind to ...' it would appear as if you had a monitor already running?
[3:12] <joao> either that or you have moved the monitor to a different location
[3:12] <joao> MrNPP, have you tried obtaining the monmap from the cluster?
[3:12] <joao> I am assuming you have the other 2 monitors up and running, and with a formed quorum?
[3:13] <MrNPP> currently they are all offline
[3:13] <MrNPP> but i can bring them up
[3:13] <joao> ah
[3:13] <MrNPP> scratch that, i have 2 up
[3:13] <MrNPP> but no quorum
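As a sanity check on the numbers: monitors need a strict majority of the monmap to form a quorum. The sketch below is only the arithmetic, with hypothetical function names; it says nothing about why these two monitors fail to elect.

```python
def quorum_size(monmap_size: int) -> int:
    """Smallest number of monitors that is a strict majority of the monmap."""
    return monmap_size // 2 + 1


def has_majority(monitors_up: int, monmap_size: int) -> bool:
    """True if enough monitors are up for a quorum to be possible at all."""
    return monitors_up >= quorum_size(monmap_size)


print(quorum_size(3))      # 2
print(has_majority(2, 3))  # True
print(has_majority(1, 3))  # False
```

With three monitors in the map, two up is numerically enough, so the missing quorum here points at the monitors not reaching or not accepting each other rather than at the count.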
[3:14] <joao> well, you should be able to obtain the monmap after they are up with 'ceph getmap' (or is it 'ceph mon getmap'?)
[3:14] <joao> oh, okay
[3:14] <MrNPP> they just hang
[3:14] <joao> you should then check why you don't have a quorum
[3:14] <MrNPP> not sure how to recover at this point
[3:14] <joao> MrNPP, are all the monitors running on the same version?
[3:14] <MrNPP> yeah
[3:15] <MrNPP> ceph-0.61.2
[3:15] <MrNPP> they have been up for the last week or so, but unable to connect to them
[3:15] <joao> then crank up debugging ('debug mon = 20', 'debug auth = 20' and 'debug ms = 1') and check the logs
[3:15] <MrNPP> ok
[3:16] <joao> that might be some auth issue, for instance
[3:16] <joao> if they come up it's unlikely they have monmap issues, unless they're all sending messages to the wrong places
[3:16] <joao> which should be obvious after you start 'mon_probe' messages flying around
[3:17] <joao> *start seeing
[3:18] <joao> I'd love to stay around and help you with that, but I might just end up asking you to remove your mon stores
[3:18] <joao> being sleepy makes me do all sorts of crazy things
[3:19] <dmick> joao: the reason we were looking at monmap was that the monitor that won't start, won't start because it wants to bind to the wrong address
[3:19] <joao> ah
[3:19] <dmick> so I was thinking somehow the mapping of id <-> IP had gotten borked
[3:19] <joao> was the monitor moved for some reason?
[3:19] <dmick> since the mon apparently tries to rely on the monmap over the conf, although should warn
[3:19] <dmick> apparently not
[3:19] <MrNPP> i never moved the mon
[3:20] <MrNPP> 2013-05-20 18:19:50.813648 294326df700 1 -- mark_down 0x2942800a780 -- 0x2942800a510
[3:20] <MrNPP> 2013-05-20 18:19:50.813733 294326df700 1 -- --> -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x2942800aa30 con 0x29428008e60
[3:20] <joao> are you trying to fire up the correct mon?
[3:20] <MrNPP> 2013-05-20 18:19:53.813898 294326df700 1 -- mark_down 0x29428008e60 -- 0x294280088c0
[3:20] <MrNPP> 2013-05-20 18:19:53.814009 294326df700 1 -- --> -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x2942800b790 con 0x2942800a680
[3:20] <MrNPP> i can't seem to see the monitor
[3:20] <MrNPP> this is on a good monitor
[3:20] <MrNPP> i started it with debugging
[3:20] <joao> MrNPP, are you running without cephx?
[3:21] <MrNPP> no
[3:21] <MrNPP> ok, so i've shut down everything
[3:21] <MrNPP> i currently have one monitor on
[3:22] <MrNPP> and i'm getting the error above
[3:22] <joao> well, you could regenerate the monmap and just inject it
[3:23] <MrNPP> you mean on the unhealthy monitor?
[3:23] <MrNPP> since i know everything went to leveldb, how do i do that?
[3:23] <joao> well, if you are to inject it you must do it on all monitors I think
[3:23] * dmick listens closely too :)
[3:24] * The_Bishop__ (~bishop@f052101105.adsl.alicedsl.de) has joined #ceph
[3:24] <joao> MrNPP, 'ceph-mon -i <id> --inject-monmap </foo/monmap>'
[3:24] <MrNPP> there is no monmap
[3:24] <joao> shutdown the monitors first, inject and then start them up
[3:24] <dmick> you'd create it with monmaptool
[3:25] <joao> MrNPP, do you have the osds/mds's up?
[3:26] <MrNPP> no mds, and i've shut down all the osds
[3:26] <joao> okay
[3:26] <joao> so those are auth messages coming from the monitors
[3:26] <MrNPP> yeah
[3:26] <joao> can you confirm those are the correct ips as expected to be on the monmap?
[3:26] <MrNPP> yes, they are
[3:26] <joao> okay
[3:26] <MrNPP> 1, 4, and 252
[3:29] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[3:30] * The_Bishop_ (~bishop@f052098107.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[3:31] <MrNPP> so ok, so i generate the monmap, and i inject it to all 3 correct?
[3:33] <dmick> yeah; I'd probably inject it to one, start it, then continue like that
[3:34] <MrNPP> do i have to purge the old mon dir?
[3:34] <dmick> no
[3:34] <dmick> and it wouldn't hurt to doublecheck with monmaptool --print that it looks the way you expect
[3:35] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) has joined #ceph
[3:39] <MrNPP> arggg
[3:39] <MrNPP> 2013-05-20 18:38:58.032282 2670ea48780 0 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 20747
[3:39] <MrNPP> 2013-05-20 18:38:58.319668 2670ea48780 0 mon.6 does not exist in monmap, will attempt to join an existing cluster
[3:39] <MrNPP> 2013-05-20 18:38:58.321084 2670ea48780 -1 no public_addr or public_network specified, and mon.6 not present in monmap or ceph.conf
[3:39] <MrNPP> 0: mon.1
[3:39] <MrNPP> 1: mon.4
[3:39] <MrNPP> 2: mon.6
[3:39] <MrNPP> thats from monmaptool print
[3:44] * mikedawson_ (~chatzilla@ has joined #ceph
[3:45] * rturk is now known as rturk-away
[3:51] * mikedawson (~chatzilla@ Quit (Ping timeout: 480 seconds)
[3:51] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) has joined #ceph
[3:51] * SvenPHX (~scarter@wsip-174-79-34-244.ph.ph.cox.net) has left #ceph
[3:53] <dmick> MrNPP: hum.
[3:53] <dmick> was that the first one you tried to start?
[3:56] <MrNPP> yeah
[3:57] <MrNPP> i can't seem to get this to work at all
[3:57] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[3:59] <dmick> and you started that one with --inject-monmap? This is confusing.
[4:01] <MrNPP> yeah
[4:01] <MrNPP> so how do i clear it out and start new monitors?
[4:02] * treaki_ (2cf4f0764e@p4FDF7C7E.dip0.t-ipconnect.de) has joined #ceph
[4:02] <dmick> so just to be clear: all the mons were stopped
[4:02] <dmick> and you started just mon.6 with --inject-monmap.
[4:03] <MrNPP> yeah
[4:04] <MrNPP> and it died right away
[4:04] <MrNPP> 2013-05-20 19:03:40.943375 3226f4d8780 -1 mon.6@-1(probing) e20 error: cluster_uuid file exists with value '0ef5bfbd-7d8a-4853-954c-bdc8d4228896', != our uuid 17246c2e-9076-438f-add1-f5898c19cc19
[4:04] <dmick> oh
[4:04] <MrNPP> that's after i tried to start it again
[4:05] <MrNPP> using: /usr/bin/ceph-mon -i 6 --pid-file /var/run/ceph/mon.6.pid -c /etc/ceph/ceph.conf
[4:05] <MrNPP> at this point, i just want one monitor up and running
[4:06] <dmick> so you should have seen messages like
[4:06] * treaki (48d29226a7@p4FF4ABBD.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[4:06] <dmick> last committed monmap epoch is X, injected map will be Y
[4:07] <dmick> and maybe
[4:07] <dmick> changing monmap epoch from X to Y
[4:07] <dmick> yes? no?
[4:07] <MrNPP> you mean when i inject it?
[4:07] <dmick> yes
[4:07] <MrNPP> nothing
[4:07] <MrNPP> just back to shell
[4:10] <MrNPP> the log just has the version and the pid
[4:10] <MrNPP> but the process died
[4:10] <dmick> trying to debug
[4:11] * The_Bishop_ (~bishop@f052101139.adsl.alicedsl.de) has joined #ceph
[4:11] <MrNPP> i did
[4:11] <MrNPP> nothing
[4:12] <dmick> no I'm running ceph-mon under gdb to see wth it's doing
[4:12] <dmick> clearly I'm missing something
[4:14] <MrNPP> you and i both
[4:14] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Quit: Ex-Chat)
[4:14] <dmick> ah, it's daemonized by the time it gets to monmap injection, so the messages are lost
[4:14] <dmick> doh
[4:14] <dmick> (written to /dev/null)
[4:15] <MrNPP> root at fs02 ➜ ceph ceph-mon -f -i 6 --inject-monmap /etc/ceph/monmap
[4:15] <MrNPP> last committed monmap epoch is 24, injected map will be 25
[4:15] <MrNPP> changing monmap epoch from 0 to 25
[4:15] <MrNPP> done.
[4:15] <dmick> where did you get those?
[4:15] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) Quit (Quit: mtanski)
[4:15] <MrNPP> ran in foreground
[4:15] <dmick> oh -f
[4:15] <dmick> gotcha
[4:15] * The_Bishop__ (~bishop@f052101105.adsl.alicedsl.de) Quit (Read error: Operation timed out)
[4:16] <dmick> ok, so, --inject-monmap just injects the map, and then you start the monitor as usual, supposedly
[4:16] <dmick> but you're saying it dies if you do
[4:16] <MrNPP> i get the uuid error
[4:17] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[4:17] <dmick> when did/do you get the error about mon.6 not in the map?
[4:18] <MrNPP> i dunno, it might have been because when i created it i specified mon.6 instead of just 6
[4:18] <MrNPP> and it auto appends mon.
[4:18] <dmick> oh you mean you tried to start it with --id mon.6?
[4:18] <dmick> yeah that'd do it
[4:19] <MrNPP> no
[4:19] <MrNPP> i mean monmaptool
[4:19] <MrNPP> monmaptool --create --add mon.a --add mon.b \ --add mon.c --clobber monmap
[4:19] <MrNPP> those instructions are incorrect
[4:19] <MrNPP> specifying mon.b it makes it mon.mon.b
[4:19] <dmick> ok. so that error was a bad monmap we think?
[4:19] <MrNPP> i think
[4:20] <MrNPP> its hard to tell really
[4:20] <dmick> because my monmaptool --print with a known good map prints mon.a
[4:20] <MrNPP> yeah mine too
[4:20] <dmick> but ok, let's assume our problem is the fiid
[4:20] <dmick> *fsid
[4:20] <dmick> that's also in the monmap
[4:20] <MrNPP> monmaptool: monmap file monmap
[4:20] <MrNPP> epoch 0
[4:20] <MrNPP> fsid 17246c2e-9076-438f-add1-f5898c19cc19
[4:20] <MrNPP> last_changed 2013-05-20 19:03:08.369046
[4:20] <MrNPP> created 2013-05-20 19:03:08.369046
[4:20] <MrNPP> 0: mon.6
[4:20] <mikedawson_> dmick: starting a monitor... "node2: [11526]: (33) Numerical argument out of domain" do you know what that is?
[4:21] <dmick> mikedawson_: no, other than EDOM, but I don't know why
[4:21] <mikedawson_> EDOM?
[4:21] <MrNPP> mike, did you mess with the crushmap recently?
[4:21] <dmick> there are a few scattered in the source
[4:21] <mikedawson_> MrNPP: yes, I removed a failed OSD
[4:22] <MrNPP> i've seen that before when i've changed the crushmap, it caused all my monitors to crash
[4:22] <MrNPP> not sure why
[4:22] <dmick> mikedawson_: more context for that message?
[4:22] * NXCZ (~chatzilla@ip72-199-155-185.sd.sd.cox.net) has joined #ceph
[4:23] <dmick> MrNPP: if you haven't installed ceph-test, do so
[4:25] <dmick> ceph_test_store_tool <path-to-mon-store.db> get monitor cluster_uuid
[4:25] <dmick> for the mon you injected
[4:26] <dmick> and compare that to what monmaptool --print shows for fsid
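The comparison dmick describes can be sketched as below. It assumes the `monmaptool --print` text has the shape MrNPP pasted at [4:20] and that the store's cluster_uuid is already known as a string (here the value from the error at [4:04]); the function names are made up for illustration.

```python
import re
from typing import Optional


def fsid_from_monmap_print(text: str) -> Optional[str]:
    """Pull the fsid line out of `monmaptool --print` style output."""
    match = re.search(r"^fsid\s+([0-9a-f-]{36})\s*$", text, re.MULTILINE)
    return match.group(1) if match else None


# Output as pasted in the log for the freshly generated monmap.
printed = """monmaptool: monmap file monmap
epoch 0
fsid 17246c2e-9076-438f-add1-f5898c19cc19
last_changed 2013-05-20 19:03:08.369046
created 2013-05-20 19:03:08.369046
0: mon.6
"""

# cluster_uuid reported by the monitor store in the startup error.
store_uuid = "0ef5bfbd-7d8a-4853-954c-bdc8d4228896"

print(fsid_from_monmap_print(printed) == store_uuid)  # False
```

A False here matches the "cluster_uuid file exists with value ... != our uuid" error: a monmap created without an explicit fsid got a new one that the existing store rejects.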
[4:26] <MrNPP> i got it up
[4:26] <MrNPP> i changed the uuid to the one that was existing
[4:26] <NXCZ> I got an issue I do not know how to solve. I have installed ceph and had it running smoothly. I then upgraded to 0.56.6. It erred on ceph-mds. I fixed the error and now nothing will mount. all I get is ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory. Is my data gone?
[4:27] <dmick> NXCZ: hard to say. What does "It erred on ceph-mds. I fixed the error" mean, specifically?
[4:27] <dmick> MrNPP: in the --generate?
[4:27] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) Quit (Quit: Leaving.)
[4:28] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) Quit (Quit: leaving)
[4:28] <MrNPP> yeah, in the create
[4:28] <MrNPP> monmaptool --create --add 6 --fsid 0ef5bfbd-7d8a-4853-954c-bdc8d4228896 --clobber monmap
[4:28] <MrNPP> so now its recovering
[4:28] <dmick> ok
[4:28] <dmick> whem
[4:28] <NXCZ> it errored out with a dpkg error. I did a apt-get remove ceph-mds then did a install. Stopped the dpkg error but service start errors out.
[4:28] <dmick> *whew I mean
[4:28] <MrNPP> hopefully
[4:28] <MrNPP> a ton of stale
[4:28] <dmick> NXCZ: ok, so nothing to do with the code
[4:28] <dmick> fine
[4:29] <dmick> is /var/lib/ceph/osd/ceph-0 present, and is that where your data was configured?
[4:29] <NXCZ> Doubt it was the code. I shouldn't have done the upgrade :-)
[4:29] <NXCZ> yes it is present but empty
[4:29] <dmick> how was osd.0 configured? directory, partition, or device?
[4:30] <NXCZ> device
[4:30] <dmick> did you reboot? Is that device still present and accessible?
[4:31] <NXCZ> I did reboot, the device is present. havent tried to access it yet.
[4:34] <mikedawson_> dmick: http://pastebin.com/raw.php?i=p3jgnc1H
[4:36] <dmick> am I missing the Numerical argument out of domain?
[4:36] <dmick> or is this a different problem?
[4:36] <dmick> NXCZ: can you pastebin your ceph.conf?
[4:36] <NXCZ> yes
[4:37] <NXCZ> I mounted one of the drives and can read the conrtents
[4:37] <NXCZ> contents
[4:37] <mikedawson_> dmick: same... http://pastebin.com/raw.php?i=CyhTyvTU
[4:37] <dmick> so it's an issue of getting them mounted right
[4:38] <dmick> mikedawson_: ok, that might be a stale pidfile
[4:38] <dmick> pidfile_remove
[4:38] <dmick> if (a != getpid())
[4:38] <dmick> return -EDOM;
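The check dmick quotes can be mirrored in a few lines of Python to show where the "Numerical argument out of domain" text comes from. This is a sketch of the idea only, not Ceph's code: the function name and the temporary pidfile are hypothetical, and the exact strerror wording depends on the platform's C library.

```python
import errno
import os
import tempfile


def pidfile_owner_check(path: str) -> int:
    """Return 0 if the pidfile records our own pid, -EDOM otherwise."""
    with open(path) as handle:
        recorded_pid = int(handle.read().strip())
    if recorded_pid != os.getpid():
        return -errno.EDOM  # same error code the quoted snippet returns
    return 0


# Simulate a stale pidfile left behind by some other process.
with tempfile.NamedTemporaryFile("w", suffix=".pid", delete=False) as stale:
    stale.write(str(os.getpid() + 1))

result = pidfile_owner_check(stale.name)
print(result, os.strerror(-result))
os.unlink(stale.name)
```

On Linux the printed message is the same string mikedawson_ saw, which is why a leftover pidfile under /var/run/ceph is a plausible suspect.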
[4:39] <mikedawson_> dmick: I don't have a ceph mon pid, but there is an asok in there
[4:39] <dmick> I think the pidfile is somewhere like /var/run/ceph?..
[4:40] <NXCZ> dmick: http://pastebin.com/aUfVEvGN
[4:40] <dmick> yeah, /var/run/ceph/$name.pid I bet
[4:40] <mikedawson_> dmick: you are busy tonight -> http://pastebin.com/raw.php?i=SibKLCq5
[4:41] <dmick> NXCZ: and are you using upstart or sysvinit?
[4:42] <dmick> mikedawson_: hm
[4:42] <NXCZ> upstart
[4:43] <NXCZ> part of the upgrade error was that it could not write /etc/init/ceph-mds.conf
[4:43] <NXCZ> sorry /etc/init/ceph-mds.conf
[4:44] <dmick> NXCZ: I think udev was supposed to have mounted those for you
[4:44] <dmick> with 95-ceph-osd.rules
[4:45] <dmick> are all the osd partitions the same (and, really, you have 7 on one host, all using different devices?)
[4:45] <NXCZ> ya lol
[4:46] <dmick> so none of them got mounted
[4:46] <NXCZ> it started as a testbed then moved to pre-production as more storage servers are put in
[4:46] <NXCZ> none of them
[4:46] <NXCZ> I have the rules file in /lib/udev/rules.d/95-ceph-osd.rules
[4:47] <mikedawson_> dmick: root@node2:/var/lib/ceph/mon/ceph-a# du --si
[4:47] <mikedawson_> 271k ./store.db
[4:47] <dmick> try udevadm trigger --subsystem-match=block --action=add
[4:47] <mikedawson_> That's not good
[4:48] <dmick> for a mon?
[4:48] <mikedawson_> yeah
[4:48] <dmick> oh yeah, you're Mr. HugeMonDir :)
[4:48] <dmick> hm
[4:49] <mikedawson_> yep, my others are closer to 1GB
[4:49] <mikedawson_> (and growing)
[4:49] <mikedawson_> dmick: remove, then re-add this mon?
[4:50] <dmick> I guess. I don't know what's going on with the EDOM
[4:50] <NXCZ> dmick: no go. where can I check to see why the rule isn't working?
[4:50] <dmick> add --debug, maybe?
[4:51] <dmick> there's also udevadm info
[4:51] <dmick> and tewt
[4:51] <dmick> *test
[4:51] <dmick> you can also try running what it runs
[4:52] <dmick> (namely /usr/sbin/ceph-disk-activate --mount /dev/$name)
[4:53] <NXCZ> WARNING:ceph-disk:No fsid defined in /etc/ceph/ceph.conf; using anyway 7f9362db1780 -1 read 56 bytes from /var/lib/ceph/tmp/mnt.jjb3NW/keyring added key for osd.0
[4:53] <dmick> what caused that?
[4:54] <NXCZ> sorry, the command you posted
[4:54] <NXCZ> they show up now
[4:54] <dmick> I posted two udevadm options and ceph-disk-activate, but I guess you mean the latter
[4:55] <NXCZ> at least in a df
[4:55] <dmick> ok, so, one wonders why your udev didn't work
[4:55] <NXCZ> yes sorry )
[4:55] <NXCZ> no idea, rebooted twice
[4:55] <dmick> that should get everything mounted on startup as each device is discovered
[4:56] <NXCZ> will have to look into that. Still have tons to learn
[4:57] <dmick> neet, udevadm's behavior doesn't match its manpage. wee.
[4:59] <dmick> mikedawson_: any luck?
[5:00] <mikedawson_> dmick: just killed mon.a, then re-added it. Same thing. Can you sanity check? http://pastebin.com/raw.php?i=WBni1Rib
[5:01] <dmick> gah.
[5:02] <dmick> I would now try starting ceph-mon manually with --debug-mon=30
[5:02] <dmick> (and all the other switches)
[5:02] <MrNPP> and -f
[5:02] <MrNPP> :)
[5:03] <dmick> vsm
[5:03] <dmick> yes, can't hurt
[5:03] <mikedawson_> dmick: here is the log (with debug mon = 20)
[5:03] <mikedawson_> http://pastebin.com/raw.php?i=SdCfK4aM
[5:03] <MrNPP> oh failed assert, gotta love that one
[5:03] <NXCZ> Still dont know why it is not running the udev. Think if I rename it to a lower number?
[5:04] <dmick> NXCZ: no
[5:04] <NXCZ> sorry rule that is
[5:04] <dmick> start a udevadm monitor while you execute the udevadm trigger above, maybe
[5:05] <mikedawson_> dmick: ceph-mon -i a --public-addr seems like it is going to work. Wonder why?
[5:05] <dmick> wouldn't hurt to make sure the partition guids are correct too
[5:07] <dmick> mikedawson_: {public,cluster}_{network,addr} set in ceph.conf?
[5:08] <NXCZ> Thanks dmick
[5:08] <dmick> NXCZ: no worries, but, realize that if you don't get to the bottom of this, a reboot's gonna leave you hanging again.
[5:08] <NXCZ> ya I know
[5:08] <dmick> k
[5:08] <NXCZ> having an issue creating volumes now too
[5:09] <mikedawson_> dmick: http://pastebin.com/igZcQze2
[5:10] <dmick> effing annoying that those last two routines aren't shown in the backtrace
[5:10] <dmick> I don't suppose you can get it to fail under gdb?
[5:12] <mikedawson_> dmick: ^ me
[5:12] <mikedawson_> ?
[5:12] <dmick> yes
[5:12] <dmick> I can't see what config value it's fetching and failing
[5:13] <dmick> (or setting, actually)
[5:13] <mikedawson_> dmick: now it's starting with 'service ceph start mon', so maybe I'm past the issue. No changes to ceph.conf
[5:13] * dmick shakes head
[5:15] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[5:15] <mikedawson_> dmick: ceph >= 0.58 monitors are my nemesis!
[5:23] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:32] * dcasier (~dcasier@ has joined #ceph
[5:45] <mikedawson_> Thanks dmick for your help (once again)! Mons are back to normal. OSDs are backfilling properly.
[5:46] * coyo (~unf@00017955.user.oftc.net) Quit (Ping timeout: 480 seconds)
[5:46] * dcasier (~dcasier@ Quit (Ping timeout: 480 seconds)
[6:01] * arye (~arye@pool-96-248-99-133.cmdnnj.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[6:06] <dmick> good. Not sure I helped much (sigh) but glad you got going
[6:13] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:30] * mikedawson_ (~chatzilla@ Quit (Read error: Connection reset by peer)
[6:31] * jtaguinerd (~jtaguiner@ has joined #ceph
[6:35] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[6:42] * Vjarjadian (~IceChat77@ Quit (Quit: If you can't laugh at yourself, make fun of other people.)
[6:43] * mikedawson (~chatzilla@ has joined #ceph
[6:56] * coyo (~unf@pool-71-164-242-68.dllstx.fios.verizon.net) has joined #ceph
[6:56] * mikedawson (~chatzilla@ Quit (Ping timeout: 480 seconds)
[7:13] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[7:14] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[7:28] * madkiss (~madkiss@2001:6f8:12c3:f00f:4df4:543e:9789:981e) Quit (Ping timeout: 480 seconds)
[7:32] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[7:36] * bergerx_ (~bekir@ has joined #ceph
[7:56] * esammy (~esamuels@host-2-102-69-49.as13285.net) has joined #ceph
[8:02] * tnt (~tnt@ has joined #ceph
[8:07] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[8:14] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:21] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[8:21] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[8:21] * ChanServ sets mode +o scuttlemonkey
[8:23] * iggy (~iggy@theiggy.com) Quit (Read error: Operation timed out)
[8:23] * iggy (~iggy@theiggy.com) has joined #ceph
[8:32] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[8:36] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[8:40] * bergerx_ (~bekir@ Quit (Ping timeout: 480 seconds)
[8:41] <tchmnkyz> dmick: yes i am still having the same issue with my mds
[8:41] <tchmnkyz> sorry for the late response, been dealing with IO issues on my ceph cluster
[8:46] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[8:49] * loicd (~loic@ has joined #ceph
[8:50] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[8:52] * bergerx_ (~bekir@ has joined #ceph
[8:53] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) has joined #ceph
[9:05] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[9:06] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[9:10] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[9:11] * Fetch__ (fetch@gimel.cepheid.org) has joined #ceph
[9:12] * Fetch (fetch@gimel.cepheid.org) Quit (Read error: Connection reset by peer)
[9:14] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) Quit (Ping timeout: 480 seconds)
[9:16] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[9:17] * eschnou (~eschnou@ has joined #ceph
[9:18] * mrjack (mrjack@office.smart-weblications.net) Quit (Ping timeout: 480 seconds)
[9:18] * mrjack (mrjack@office.smart-weblications.net) has joined #ceph
[9:20] * illuminatis (~illuminat@0001adba.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:20] * MK_FG (~MK_FG@00018720.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:20] * via (~via@smtp2.matthewvia.info) Quit (Ping timeout: 480 seconds)
[9:20] * illuminatis (~illuminat@0001adba.user.oftc.net) has joined #ceph
[9:20] * lightspeed (~lightspee@i01m-62-35-37-66.d4.club-internet.fr) Quit (Ping timeout: 480 seconds)
[9:22] * MK_FG (~MK_FG@00018720.user.oftc.net) has joined #ceph
[9:22] * lightspeed (~lightspee@i01m-62-35-37-66.d4.club-internet.fr) has joined #ceph
[9:23] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:25] * via (~via@smtp2.matthewvia.info) has joined #ceph
[9:26] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:30] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[9:31] * loicd (~loic@soleillescowork-4p-55-10.cnt.nerim.net) has joined #ceph
[9:33] * ScOut3R (~ScOut3R@ has joined #ceph
[9:36] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) Quit (Remote host closed the connection)
[9:46] * BManojlovic (~steki@ has joined #ceph
[9:47] * humbolt1 (~elias@80-121-52-87.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[9:49] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) has left #ceph
[9:49] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) has joined #ceph
[9:53] * leseb (~Adium@ has joined #ceph
[9:58] * humbolt (~elias@213-33-1-180.adsl.highway.telekom.at) has joined #ceph
[10:08] * capri_wk (~capri@ Quit (Ping timeout: 480 seconds)
[10:08] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[10:09] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[10:09] * capri (~capri@ has joined #ceph
[10:15] * LeaChim (~LeaChim@ has joined #ceph
[10:15] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) has joined #ceph
[10:23] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[10:25] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[10:25] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit ()
[10:27] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[10:30] * KindTwo (KindOne@h183.187.130.174.dynamic.ip.windstream.net) has joined #ceph
[10:32] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) has joined #ceph
[10:37] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[10:37] * KindTwo is now known as KindOne
[10:37] * tziOm (~bjornar@ has joined #ceph
[10:50] * nhm (~nhm@174-20-107-121.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[10:50] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[11:02] * andrei (~andrei@host86-155-31-94.range86-155.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[11:07] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) has joined #ceph
[11:12] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[11:20] * capri_on (~capri@ has joined #ceph
[11:21] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) has left #ceph
[11:25] * capri (~capri@ Quit (Ping timeout: 480 seconds)
[11:28] * nhm (~nhm@65-128-142-169.mpls.qwest.net) has joined #ceph
[11:45] <Azrael> hey folks, i have 1 mon node and i am going to add 2 more
[11:45] <Azrael> and of course i'll be adding the 2 additional mon's in a serial fashion
[11:45] <Azrael> so at one point, i'll only have 2 mon nodes in my cluster
[11:46] <Azrael> when i do this, won't the cluster then shut down, because there is no quorum for the mons?
[11:46] <joao> the cluster won't shut down
[11:46] <joao> it will just stop working until the second monitor comes up and forms a quorum with the already existing one
[11:47] <Azrael> ahh ok
[11:47] <Azrael> so what will stop working
[11:47] <Azrael> all operations?
[11:47] <Azrael> but the mon daemon itself will still continue to run etc, right?
[11:47] <Azrael> (waiting for that 3rd guy to join the party)
[11:48] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[11:48] * dcasier (~dcasier@LVelizy-156-44-40-164.w217-128.abo.wanadoo.fr) has joined #ceph
[11:51] <Azrael> what i remember happening last time i tried this was the original mon + the new mon actually shut down (or crashed) because there was no quorum
[11:51] <Azrael> and i couldn't start them back up.... because there was no quorum.... so i couldn't add the 3rd mon heh
[11:52] <joao> Azrael, you might have hit a bug
[11:52] <joao> probably a sync bug
[11:52] <joao> there's still one to be fixed
[11:53] <joao> if that happens, please set higher debug levels
[11:57] * sh_t (~sht@lu.privatevpn.com) Quit (Read error: Connection reset by peer)
[11:58] * sh_t (~sht@lu.privatevpn.com) has joined #ceph
[12:02] * uli (~uli@mail1.ksfh-bb.de) Quit (Quit: Verlassend)
[12:11] * janisg (~troll@ Quit (Remote host closed the connection)
[12:12] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[12:12] * ScOut3R (~ScOut3R@ has joined #ceph
[12:22] <Gugge-47527> Azrael: 2 mons can form a quorum fine
[12:23] <wogri_risc> yeah, I had a setup with 2 mons, too.
[12:23] <wogri_risc> no problem.
[12:23] <wogri_risc> but with bobtail.
[12:23] <Gugge-47527> only "problem" is that it is no safer than a 1 mon setup
[12:23] <wogri_risc> true. :)
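A minimal sketch of the majority arithmetic behind the remark that two mons are no safer than one. It assumes only that a quorum is a strict majority of the monitors; the function names are made up for illustration and are not part of any ceph tool.

```python
def quorum_size(num_mons):
    """Smallest strict majority of num_mons monitors."""
    return num_mons // 2 + 1


def tolerated_failures(num_mons):
    """How many monitors can be down while a majority can still form."""
    return num_mons - quorum_size(num_mons)


if __name__ == "__main__":
    # Columns: monitors, quorum size, failures tolerated.
    for n in (1, 2, 3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

With one or two monitors no failure is tolerated, while three monitors tolerate one, which is the usual argument for running an odd number of them.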
[12:25] * capri_on (~capri@ Quit (Quit: Verlassend)
[12:26] * capri (~capri@ has joined #ceph
[12:33] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[12:33] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[12:36] * loicd (~loic@soleillescowork-4p-55-10.cnt.nerim.net) Quit (Quit: Leaving.)
[12:55] * diegows (~diegows@ has joined #ceph
[12:57] <Azrael> Gugge-47527: oh really, ok cool.
[13:01] <Gugge-47527> Azrael: if everything work as designed, your only problem is the period between adding mon2 and starting it
[13:07] * loicd (~loic@ has joined #ceph
[13:10] * janisg (~troll@ has joined #ceph
[13:10] <Esmil> Does anyone have access to the ceph-devel mailing list? I've tried sending from two different email addresses to ceph-devel@vger.kernel.org, but none of the mails seem to get there..
[13:11] <joao> Esmil, are you sending plain-text emails?
[13:11] <joao> vger filters any email containing html or the like
[13:11] <Esmil> joao: Yes, in the 2nd try I convinced my email clients to only send plain text
[13:13] <joao> Esmil, http://vger.kernel.org/majordomo-info.html
[13:13] <joao> You can test email delivery between you, and VGER by sending an empty test letter to: <autoanswer@vger.kernel.org>
[13:14] <joao> Esmil, fyi, "Message size exceeding 100 000 characters causes blocking."
[13:15] <joao> just in case you're attaching some huge log or something :)
[13:16] <Esmil> Right, that's probably it. It's just weird that you're warned about the html, but such messages are silently ignored.
[13:18] <joao> eh, I suppose they may assume that html messages are simply a mistake worth warning the user about :)
[13:18] <joao> besides, I learned the hard way that android's gmail client doesn't even send plain-text emails
[13:18] <joao> which is annoying
[13:19] <Esmil> Yeah, I thought my clients only sent plain text already, but it seems they send both html and plain text versions
[13:21] * john_barbee_ (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[13:22] * mikedawson_ (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[13:22] <Esmil> Alright, this time it got through. Thanks
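A rough pre-send check for the two vger pitfalls mentioned above (HTML parts and the 100 000 character limit), sketched with Python's standard `email` package. How vger actually measures message size is not stated in the channel, so counting the whole serialized message is an assumption, and `vger_problems` is a hypothetical helper name.

```python
from email.message import EmailMessage

# Limit quoted in the channel from the majordomo info page.
VGER_MAX_CHARS = 100000


def vger_problems(msg):
    """Return a list of reasons this message might be dropped."""
    problems = []
    # walk() visits the message itself and every nested MIME part.
    if any(part.get_content_type() == "text/html" for part in msg.walk()):
        problems.append("contains an HTML part")
    # Assumption: the limit applies to the full serialized message.
    if len(msg.as_string()) > VGER_MAX_CHARS:
        problems.append("longer than 100 000 characters")
    return problems


if __name__ == "__main__":
    mail = EmailMessage()
    mail["To"] = "ceph-devel@vger.kernel.org"
    mail["Subject"] = "test"
    mail.set_content("plain text body\n")
    print(vger_problems(mail))
```

A client that sends both plain-text and HTML versions, as described above, would produce a multipart message whose HTML part is flagged by the first check.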
[13:24] <Azrael> Esmil: pebkac :-P
[13:24] * Azrael and nyerup wave to Esmil
[13:25] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[13:25] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[13:25] * john_barbee_ is now known as john_barbee
[13:25] * mikedawson_ is now known as mikedawson
[13:49] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 21.0/20130511120803])
[13:54] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[13:58] <tnt> Mmm, the mons are generating much more IO than I would expect. ~ 75 iops / 4 MBytes/s
[13:58] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[13:58] <tnt> is that expected ?
[14:02] * andrei (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[14:02] <andrei> hello guys
[14:02] <andrei> i am new to ceph. created a new ceph cluster made of 2 servers
[14:03] <andrei> the initial testing is working okay
[14:03] <andrei> i've now tried to restart one of the storage servers
[14:03] <wogri_risc> and it stopped working
[14:03] <wogri_risc> because - there's a rule: you need an odd number of mons
[14:03] <andrei> the ceph -s is showing health-warn, which is expected
[14:03] <wogri_risc> 2 servers are just not enough. you should use three.
[14:03] <andrei> wogri_risc: sorry, no, it seems to work
[14:04] <andrei> as i only had 1 monitor
[14:04] <andrei> for now
[14:04] <wogri_risc> ok ;)
[14:04] <andrei> i will have 3
[14:04] <andrei> but for the time being i've got 2 osd servers, 1 mon and 1 mds
[14:04] <andrei> i've restarted the osd server
[14:04] <wogri_risc> sorry, so what's your problem then :)
[14:04] <andrei> so, mon and mds were running
[14:04] <andrei> what i wanted to ask is this
[14:04] <andrei> when the server rebooted it didn't automatically start ceph. I wanted to find out what is a proper procedure?
[14:05] <andrei> should I start ceph from the start scripts
[14:05] <wogri_risc> depends on your distribution
[14:05] <andrei> or do i need to do anything prior to that
[14:05] <andrei> i am on ubuntu 12.04 with the latest ceph from a repo
[14:05] <wogri_risc> in ubuntu or debian the init-scripts take care of the dependencies (networking up, and so on)
[14:05] <andrei> and also, do i need to manually mount all osds from fstab
[14:06] <andrei> or is this automatically done by ceph?
[14:06] <wogri_risc> no, it should do it for you, given you have the right configuration syntax
[14:06] <wogri_risc> in ceph.conf you can do sth like automatically mount before running the osd.
[14:06] <wogri_risc> or you use fstab for it.
[14:06] <andrei> so, i should simply do service ceph start
[14:06] <andrei> and all should work, right?
[14:08] <wogri_risc> given you have the right stuff in the ceph.conf or fstab
[14:08] <wogri_risc> I defined the device per osd in the ceph-config-file:
[14:08] <wogri_risc> [osd.0]
[14:08] <wogri_risc> host = rd-c2
[14:08] <wogri_risc> devs = /dev/sdb
[14:08] <andrei> yeah, i've done the config as per the 5 minute quick start )))
[14:08] <wogri_risc> there you go. so a service ceph start should do the trick. hopefully.
[14:09] <wogri_risc> in debian you would type insserv ceph to make it run at boot-time. I don't know about ubuntu by heart, though.
[14:09] <andrei> seems to work like a charm )))
[14:09] <andrei> doing recovery now
[14:10] <wogri_risc> don't forget about the mon rule. you want to have a mon on each of your three servers. without the mon it's done.
[14:10] <wogri_risc> gone.
[14:10] <wogri_risc> away.
[14:10] <wogri_risc> i mean dead.
[14:10] <wogri_risc> :)
[14:10] <andrei> so, if I now restart the server which runs the mon and mds, the clients' mountpoint will fail
[14:10] <andrei> right?
[14:11] <andrei> even if the 2nd osd server is up
[14:11] <wogri_risc> yes.
[14:11] <wogri_risc> it will sit there and wait.
[14:11] <wogri_risc> rbd images can't do IO.
[14:11] <wogri_risc> don't know what cephfs does.
[14:11] <wogri_risc> without the mon ceph doesn't know about the status of the cluster and therefore has no idea where to read and write data.
[14:11] <andrei> i can see a great deal of activity on the storage servers
[14:12] <wogri_risc> pg replication.
[14:12] <andrei> even though i've only added a single 4G file to the fs while one of the osd servers was rebooting
[14:12] <andrei> is this the right behaviour?
[14:12] <wogri_risc> seems so.
[14:12] <wogri_risc> it will copy the 4G.
[14:12] <wogri_risc> and everything else that would have changed (fs-attrs, etc)
[14:13] <andrei> i think it's doing more
[14:13] <wogri_risc> how would you know?
[14:13] <andrei> coz it's a poc and i've not got any clients apart from one
[14:13] <andrei> and i've only done 1 4g write
[14:13] <andrei> but it has been fixing things for about 5 mins already
[14:14] <andrei> i can see that both servers are writing about 250mb/s
[14:14] <wogri_risc> ceph health
[14:14] <andrei> HEALTH_WARN 138 pgs degraded; 65 pgs recovering; 529 pgs recovery_wait; 834 pgs stuck unclean; recovery 23698/91424 degraded (25.921%); recovering 56 o/s, 224MB/s
[14:15] <andrei> do you know if adding ssd disk will speed up the recovery?
[14:15] <wogri_risc> so 25% of your overall ceph storage is up for recovery.
[14:15] <andrei> or is it purely spinning disk activity?
[14:16] <wogri_risc> a journal on a ssd won't help you much here.
[14:16] <wogri_risc> but this should not be your main concern.
[14:16] <wogri_risc> recovery is a background task
[14:16] <wogri_risc> I can't tell you why it copies so much right now.
[14:16] <andrei> coz I am seeing very poor performance on the ceph mount from the client
[14:16] <wogri_risc> you can throttle recovery
[14:16] <andrei> nice!!!
[14:18] <andrei> ceph is awesome!
[14:21] <andrei> i can see that the recovery has finished
[14:21] <andrei> but I still have health_warn message
[14:22] <andrei> ceph health
[14:22] <andrei> HEALTH_WARN 138 pgs degraded; 240 pgs stuck unclean; recovery 1880/91424 degraded (2.056%)
[14:22] <andrei> should I worry about this?
[14:22] <tnt> probably
[14:22] <tnt> see why they're unclean.
[14:22] <andrei> tnt: sorry, new to ceph, don't know where to look for
[14:23] <tnt> pastebin the complete ceph -s
[14:23] * jamespage (~jamespage@culvain.gromper.net) Quit (Quit: Coyote finally caught me)
[14:23] * jamespage (~jamespage@culvain.gromper.net) has joined #ceph
[14:23] <andrei> http://fpaste.org/13392/91390081/
[14:23] <andrei> here you go
[14:25] <tnt> and it's not moving ?
[14:25] <andrei> nope
[14:25] <andrei> it's been like that for several minutes now
[14:26] <tnt> try ceph pg dump_stuck unclean
[14:27] <andrei> actually, it is changing
[14:27] <andrei> i've just checked
[14:27] <andrei> and even though the degraded percentage hasn't changed
[14:27] <andrei> the pgmap info from the ceph -s is changing
[14:27] <tnt> ok, then just wait more :p
[14:28] <tnt> percentage might not come down until the very end. it actually can even rise if some write hit unclean PGs
[14:29] <andrei> i see
[14:29] <andrei> thanks
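To watch whether recovery is really progressing, the degraded ratio can be pulled out of the `ceph health` line and compared between runs. The sketch below parses the two lines pasted above; the regular expression matches only this particular wording of the output, which may differ in other ceph versions, and `degraded_percent` is a hypothetical helper.

```python
import re

# Matches e.g. "recovery 23698/91424 degraded" as pasted in the channel.
RECOVERY_RE = re.compile(r"recovery (\d+)/(\d+) degraded")


def degraded_percent(health_line):
    """Return the degraded share in percent, or None if not reported."""
    match = RECOVERY_RE.search(health_line)
    if match is None:
        return None
    degraded, total = int(match.group(1)), int(match.group(2))
    return 100.0 * degraded / total


FIRST = ("HEALTH_WARN 138 pgs degraded; 65 pgs recovering; "
         "529 pgs recovery_wait; 834 pgs stuck unclean; "
         "recovery 23698/91424 degraded (25.921%); "
         "recovering 56 o/s, 224MB/s")
SECOND = ("HEALTH_WARN 138 pgs degraded; 240 pgs stuck unclean; "
          "recovery 1880/91424 degraded (2.056%)")

if __name__ == "__main__":
    for line in (FIRST, SECOND):
        print("%.3f%%" % degraded_percent(line))
```

As tnt notes, the percentage may stay flat or even rise for a while, so a single unchanged reading is not proof that recovery is stuck.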
[14:29] * blue (~blue@irc.mmh.dk) has joined #ceph
[14:33] * brother (~brother@vps1.hacking.dk) Quit (Ping timeout: 480 seconds)
[14:36] * brother (foobaz@vps1.hacking.dk) has joined #ceph
[14:41] * eschnou (~eschnou@ Quit (Ping timeout: 482 seconds)
[14:43] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) has joined #ceph
[14:51] * med (~medberry@ec2-50-17-21-207.compute-1.amazonaws.com) Quit (Quit: Coyote finally caught me)
[15:01] <mikedawson> jamespage: Is there a plan to get a Qemu with joshd's async patch released for Raring soon (as a PPA)? 1.5.0 is out and 1.4.2 should be released Thursday? http://lists.nongnu.org/archive/html/qemu-stable/2013-05/msg00066.html
[15:03] <jamespage> mikedawson, not sure - lemme ask the guys who touch that stuff
[15:03] <mikedawson> jamespage: thanks. It seems to be really critical during heavy writes
[15:10] <tnt> mikedawson: why do you think the issue has been present since 0.48 ? Personally I'm just wondering why the monitor needs to write that much stuff on disk.
[15:14] <jamespage> mikedawson, hallyn is currently working on 1.5.0 for saucy
[15:15] <jamespage> he's queued that patch up for a 1.4.0 update he's going to work on next week
[15:19] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Ping timeout: 480 seconds)
[15:21] <mikedawson> tnt: It has been a problem since the monitor was rewritten http://ceph.com/dev-notes/cephs-new-monitor-changes/ (now that I think of it, I think those changes hit in 0.48, not 0.49), so the issue started in 0.49
[15:22] <mikedawson> jamespage: excellent! 1.4.0 with backports (including joshd's async patch linked above) would be perfect!
[15:22] <tnt> mikedawson: oh, you mean 0.58
[15:22] <tnt> 0.48 was argonaut.
[15:23] <tnt> 0.56 was bobtail
[15:23] <mikedawson> tnt: yes. sorry.
[15:23] <tnt> mikedawson: how many PGs do you have ?
[15:24] * dcasier (~dcasier@LVelizy-156-44-40-164.w217-128.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[15:25] <jamespage> mikedawson, any chance I can persuade you to submit a bug report for that for raring?
[15:25] <mikedawson> jamespage: sure. could you point me to the right spot?
[15:26] <mikedawson> tnt: 20672 PGs
[15:26] <jamespage> mikedawson, if you have a raring box with network connectivity 'ubuntu-bug qemu-kvm' would do the trick
[15:26] <mikedawson> jamespage: will do
[15:27] <jamespage> mikedawson, thanks!
[15:28] <tnt> mikedawson: I'm wondering if before the rewrite, the old code maybe only wrote incremental changes of the pg map to disk while the new one writes complete pgmap (which is fairly large, several megs).
[15:28] * jahkeup (~jahkeup@ has joined #ceph
[15:29] <mikedawson> tnt: not sure. joao is the man to ask.
[15:29] <jahkeup> hey all, I'm having an issue getting glance to talk to ceph. I've double checked all authentication via all users and clients involved and just can't seem to pinpoint the problem :/ help?
[15:33] <mikedawson> jahkeup: look at 'ceph auth list'. Do you have a key for Glance?
[15:34] <jahkeup> mikedawson: yes, I do. I have glance client as 'images'
[15:34] * aliguori (~anthony@ has joined #ceph
[15:34] <jahkeup> mikedawson: I can get rbd ls images --id images to work but glance won't connect
[15:34] <mikedawson> does glance have ownership of /etc/ceph/ceph.client.images.keyring? And do the keys match
[15:35] <jahkeup> mikedawson keys match, not chown'd but readable
[15:36] <mikedawson> jahkeup: glance:glance owns mine, not sure if that is the problem though
[15:36] <mikedawson> jahkeup: can you paste your client.images from ceph auth list?
[15:36] * Wolff_John (~jwolff@ has joined #ceph
[15:37] <jahkeup> mikedawson client.images
[15:37] <jahkeup> key: AQBgPpVRwIbjGxAAxfG1Ntq0kQyP0czsrkLgbw==
[15:37] <jahkeup> caps: [mon] allow r
[15:37] <jahkeup> caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=images
[15:38] <mikedawson> seems reasonable
[15:38] <mikedawson> mine is " caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool="
[15:39] <jahkeup> mikedawson the most aggravating part is that it works via the ceph client under glance user but not when I try to upload an image
[15:39] <mikedawson> jahkeup: now look at the glance-api.conf
[15:41] <jahkeup> mikedawson I followed the documentation to a 't' for connecting openstack and ceph
[15:41] <jahkeup> mikedawson what am I looking for? :)
[15:41] <mikedawson> cat glance-api.conf | grep rbd
[15:43] * Jahkeup_ (~jahkeup@ has joined #ceph
[15:43] <Jahkeup_> mikedawson sorry bout that connection dropped
[15:49] * jahkeup (~jahkeup@ Quit (Ping timeout: 480 seconds)
[15:50] * aliguori (~anthony@ Quit (Remote host closed the connection)
[15:52] <andrei> has anyone integrated ceph with cloudstack?
[15:54] <jamespage> sagewk, I commented on that commit you pointed me to yesterday
[15:54] <jamespage> sagewk, see also https://github.com/ceph/ceph/pull/304
[15:56] * portante (~user@ has joined #ceph
[15:56] <Jahkeup_> mikedawson help please?
[15:56] <john_barbee> Jahkeup_: are you trying to upload an image to glance using the --copy-from?
[15:57] <Jahkeup_> john_barbee using the glance client, no
[15:57] <Jahkeup_> john_barbee its from a raw image on disk
[15:57] * aliguori (~anthony@ has joined #ceph
[15:57] <mikedawson> Jahkeup_: can you paste cat glance-api.conf | grep rbd
[15:58] <Jahkeup_> mikedawson I found it..
[15:58] <Jahkeup_> default_store = rbd
[15:58] <Jahkeup_> # glance.store.rbd.Store,
[15:58] <Jahkeup_> rbd_store_user=images
[15:58] <Jahkeup_> rbd_store_pool=images
[15:58] <Jahkeup_> rbd_store_ceph_conf = /etc/ceph/ceph.conf
[15:58] <Jahkeup_> rbd_store_user = glance
[15:58] <Jahkeup_> rbd_store_pool = images
[15:58] <Jahkeup_> rbd_store_chunk_size = 8
[15:58] <Jahkeup_> mikedawson stupid glance has it listed twice for user..
[15:58] <Jahkeup_> mikedawson let me change that
[15:59] <tnt> Interestingly the leveldb site says that "Synchronous Writes" (i.e. the type used by ceph) perform pretty badly on ext4 vs ext3. I'm wondering if there is the same effect with xfs.
[16:00] <Jahkeup_> mikedawson glance is so much happier. I didn't think glance-api.conf had a section already for rbd. tyvm for that.
[16:01] <mikedawson> Jahkeup_: yw
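The problem found above was an option set twice in glance-api.conf, with the later value presumably overriding the earlier one. A small sketch that flags repeated keys in such a file follows; it is a plain line scanner rather than a real parser of Glance's config format, it ignores section boundaries, and `duplicate_options` is a hypothetical name.

```python
from collections import defaultdict


def duplicate_options(conf_text):
    """Map each option that appears more than once to its values, in order."""
    values = defaultdict(list)
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, section headers and lines without '='.
        if not line or line[0] in "#[" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()].append(value.strip())
    return {k: v for k, v in values.items() if len(v) > 1}


# The lines pasted in the channel, unchanged.
PASTED = """\
default_store = rbd
# glance.store.rbd.Store,
rbd_store_user=images
rbd_store_pool=images
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = glance
rbd_store_pool = images
rbd_store_chunk_size = 8
"""

if __name__ == "__main__":
    for key, vals in duplicate_options(PASTED).items():
        print(key, vals)
```

On the pasted lines this reports `rbd_store_user` with two conflicting values (`images`, then `glance`) and `rbd_store_pool` with the same value twice.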
[16:04] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[16:05] <mikedawson> tnt: my mon leveldb are backed by ext4 on SSD, whereas my OSD leveldb are backed by XFS on SATA.
[16:07] * coyo|2 (~unf@pool-71-164-242-68.dllstx.fios.verizon.net) has joined #ceph
[16:07] <tnt> mikedawson: in any case I think writing the full pgmap all the time is a waste of IO since it doesn't change all that much between two iterations.
[16:09] * dcasier (~dcasier@ has joined #ceph
[16:10] <tnt> I'm wondering if enabling mon_leveldb_compression would help.
[16:11] <andrei> guys, would you recommend using xfs or ext4 for the osds?
[16:11] <mikedawson> andrei: xfs
[16:12] <jtang> is rbd-fuse recommended for production usage?
[16:12] <andrei> is there a particular reason for choosing xfs?
[16:12] <andrei> performance wise ext4 seems to be a bit faster
[16:13] <nhm> andrei: xfs is getting more testing and is in wider use
[16:13] <nhm> andrei: though ext4 seems to more or less work fine.
[16:14] * coyo (~unf@00017955.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:14] <jtang> im still reluctant to install a kernel that's outside of rhel/centos/sl's base repos
[16:15] <andrei> thanks!
[16:18] * vata (~vata@2607:fad8:4:6:398f:3791:ed80:4c53) has joined #ceph
[16:18] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) has joined #ceph
[16:21] <nhm> jtang: I've been doing a lot of performance testing lately. At 10GbE+ speeds, the kernel can have a huge performance impact.
[16:22] <nhm> jtang: it'd be interesting to know how well tuned the stock RHEL kernels are.
[16:22] <jtang> nhm: if at all i will just have gigabit on the machines that i want to deploy on
[16:23] <jtang> so if i can saturate a 1gbit link via the fuse client im happy
[16:23] <jtang> or even 2gbit (as i have 6 interfaces to play with)
[16:23] <jtang> 2 out, 2 for storage, 2 for internal networking
[16:24] <jtang> nhm: have you played with cgroups to pin the ceph-osd/mon/mds processes?
[16:24] <nhm> jtang: right now I'm maxing out at about 1.3GB/s per client node, but I've done 1.6GB/s in the past.
[16:24] <nhm> jtang: with RBD that is
[16:24] <jtang> ah okay
[16:24] <nhm> jtang: not yet, though it's on the list, especially for folks that want to put OSDs and VMs on the same nodes.
[16:24] * dcasier (~dcasier@ Quit (Ping timeout: 480 seconds)
[16:25] <jtang> well im going to be forced to run the mon/osd on the same machines that i will run vms on
[16:25] <jtang> i see its possible to limit the number of threads that the ceph daemons use
[16:25] <nhm> jtang: yeah, seems that is becoming more and more common
[16:27] <tnt> Yup, I'm running OSDs as VM on the same Xen Dom0 as the RBD vms. Works fine if you do things properly.
[16:27] <jtang> well i have enough compute and ram to do so
[16:27] <jtang> so i should be okay
[16:28] <jtang> im kinda just thinking maybe i should have ordered a third machine so i can be more flexible
[16:28] <jtang> but that eats into my longer term budget where i can buy ~200tb of disk next year
[16:30] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[16:31] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[16:31] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[16:32] <andrei> hi guys
[16:33] <andrei> i am having a bit of a problem with ceph performance
[16:33] <andrei> just wondering if this is a normal behaviour
[16:33] <andrei> i am reading a large file from 2 ceph servers
[16:33] <andrei> each ceph server has 8 osds
[16:34] <andrei> each osd can do seq. read at about 150mb/s
[16:34] <andrei> i am reading this large file from 2 clients using ceph-fuse mount
[16:34] <andrei> i use 4 dd processes on each client
[16:34] <andrei> so, 8 threads in total from 2 clients
[16:35] <andrei> i am only seeing about 200mb/s cumulative read speed
[16:36] <andrei> how do I go about in determining what is causing this?
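For scale, a back-of-the-envelope comparison of the numbers andrei gives. This is only a naive ceiling, assuming every OSD streams at its sequential rate simultaneously; it ignores the network, the client, replication, and how a single file is striped, so it says nothing about where the bottleneck actually is.

```python
def naive_read_ceiling_mb_s(servers, osds_per_server, mb_s_per_osd):
    """Sum of per-disk sequential read rates, ignoring every other limit."""
    return servers * osds_per_server * mb_s_per_osd


if __name__ == "__main__":
    ceiling = naive_read_ceiling_mb_s(2, 8, 150)  # figures from the channel
    observed = 200  # cumulative MB/s reported with 8 dd readers
    print("ceiling %d MB/s, observed %d MB/s (%.1f%% of ceiling)"
          % (ceiling, observed, 100.0 * observed / ceiling))
```

The gap between 200 MB/s and the 2400 MB/s disk total is large enough that the request pattern is worth checking before the disks themselves.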
[16:37] <tnt> How can I ask a daemon for its current running config ?
[16:41] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) has joined #ceph
[16:49] * oliver1 (~oliver@p4FECEC7E.dip0.t-ipconnect.de) has joined #ceph
[16:50] * Cube (~Cube@66-87-112-70.pools.spcsdns.net) has joined #ceph
[16:52] <mikedawson> tnt: try something like ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config show
[16:53] <tnt> mikedawson: that works. Any idea if you can do it remotely ? (i.e. like you can do injectargs from a remote ceph admin)
[16:54] <mikedawson> tnt: ssh hostname ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config show
[16:54] <tnt> yeah sure, but I meant like 'ceph tell mon.a xxx' or something :p
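A small wrapper around the two commands mikedawson gives, for scripting the same check across hosts. The command line is taken from the channel; treating the output as JSON is an assumption about this ceph version, and the function names are hypothetical.

```python
import json
import subprocess


def config_show_command(asok_path, ssh_host=None):
    """Build the argv for 'config show', optionally run through ssh."""
    cmd = ["ceph", "--admin-daemon", asok_path, "config", "show"]
    if ssh_host is not None:
        cmd = ["ssh", ssh_host] + cmd
    return cmd


def running_config(asok_path, ssh_host=None):
    """Run the command and decode its output (assumed to be JSON)."""
    out = subprocess.check_output(config_show_command(asok_path, ssh_host))
    return json.loads(out.decode("utf-8"))


if __name__ == "__main__":
    print(" ".join(config_show_command("/var/run/ceph/ceph-mon.a.asok")))
```

`running_config` needs a live daemon socket, so only the command builder is exercised when the file is run directly.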
[16:55] * Volture (~Volture@office.meganet.ru) Quit (Remote host closed the connection)
[16:55] * dcasier (~dcasier@ has joined #ceph
[16:56] <tnt> meh, I wanted to try leveldb compression ... but the mon_leveldb_compression option is just ignored in the code.
[16:56] <nhm> andrei: what is the dd line you are using?
[16:57] <andrei> dd if=<file> of=/dev/null bs=1M count=10000 iflag=direct
[16:57] <nhm> andrei: how much does it change if you try bs=4M?
[16:58] <andrei> i've not tried that yet
[16:58] <nhm> andrei: also, you may see an improvement in sequential read speeds if you change the read_ahead_kb on the OSDs from 128 up to like 1-4M.
[16:58] <andrei> it should change a bit. is data divided into 4mb chunks ?
[16:59] <nhm> so 1024 to 4096
[16:59] <nhm> andrei: ideally yes
[16:59] <nhm> andrei: also, I don't actually know that much about performance with the fuse client.
[17:00] <nhm> andrei: something we probably need to test in a bit more detail, but right now our focus is much more on RBD and RGW.
[17:00] <andrei> nhm: do you know about ceph performance in small random reads?
[17:00] <nhm> andrei: there are some known cephfs performance issues in general.
[17:00] <andrei> like if I store database for instance?
[17:01] <tnt> you want to store a db on cephfs ?
[17:01] <tnt> why would you do that ...
[17:01] <nhm> andrei: it depends on a lot of factors ranging from the HW to how ceph and the underlying OS are tuned.
[17:01] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Quit: Leaving.)
[17:02] * portante (~user@ Quit (Read error: Connection reset by peer)
[17:02] <nhm> andrei: as tnt says though, if you want a database, cephfs may not be the right way to go. In fact, for some things, using the object store directly via librados can have very good performance.
[17:02] <andrei> tnt: no, not really, but i am going to use vm images
[17:02] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[17:02] <andrei> and some vms might have db running
[17:02] <andrei> which use large number of small bs reads/writes
[17:03] <nhm> andrei: you may want to test the QEMU/KVM interface.
[17:03] * bergerx_ (~bekir@ Quit (Quit: Leaving.)
[17:03] <nhm> andrei: make sure to enable rbd write back cache.
[17:03] <andrei> that is what I am testing right now
[17:03] <andrei> cloudstack + kvm + ceph
[17:03] <tnt> andrei: well, again ... why would you use cephfs for that vs RBD
[17:04] <andrei> sorry, by ceph i didn't mean cephfs. I've meant ceph in general
[17:06] <tnt> andrei: well, testing perf of cephfs is very different from testing perf of rbd.
[17:08] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) has joined #ceph
[17:08] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:10] * PerlStalker (~PerlStalk@ has joined #ceph
[17:12] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) Quit ()
[17:12] * dcasier (~dcasier@ Quit (Ping timeout: 480 seconds)
[17:21] <andrei> tnt: by the way, if i set dd to bs=4M the speed increases a lot!!!
[17:21] <andrei> i am having about 205mb/s per single thread with 4 concurrent threads ))
[17:23] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[17:23] <andrei> the trouble is i don't think it's coming from disks though
[17:23] <andrei> as I do not see any disk activity on the servers while i run dd
[17:23] <andrei> i think it's coming from cache
[17:29] * capri (~capri@ Quit (Quit: Verlassend)
[17:29] <nhm> andrei: you may want to run a sync and drop_caches on the OSD nodes before you run the read test.
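A hedged sketch of the "sync and drop_caches" step, assuming a Linux OSD node where `/proc/sys/vm/drop_caches` is available and the script runs as root. The function name and the `drop_caches_path` parameter are hypothetical (the parameter exists so the logic can be exercised against an ordinary file).

```python
import os


def sync_and_drop_caches(drop_caches_path="/proc/sys/vm/drop_caches", level=3):
    """Flush dirty pages to disk, then ask the kernel to drop clean caches.

    level: 1 = page cache, 2 = dentries and inodes, 3 = both.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    os.sync()  # write out dirty data first so the caches can actually be freed
    with open(drop_caches_path, "w") as f:
        f.write(f"{level}\n")
```

Running this on every OSD node before the dd read test should make the reads come from disk rather than from the page cache.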
[17:35] * eschnou (~eschnou@203.39-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:36] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) has joined #ceph
[17:40] * alram (~alram@ has joined #ceph
[17:41] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) Quit ()
[17:43] * portante (~user@ has joined #ceph
[17:45] <Azrael> filestore(/var/lib/ceph/osd/ceph-6) Error initializing leveldb: Corruption: missing start of fragmented record(2)
[17:45] <Azrael> have you guys seen that before with ceph osd's?
[17:45] <Azrael> or do you have any insight as how to begin debugging
[17:45] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:46] <tnt> I'd say the leveldb got corrupted somehow.
[17:47] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[17:48] <Azrael> which file(s) exactly are the leveldb
[17:48] <Azrael> in omap?
[17:49] <Azrael> or the journal?
[17:49] <tnt> omap yes
[17:50] <tnt> (no idea what info is stored in it though)
[17:51] * eschnou (~eschnou@203.39-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[17:57] <Azrael> its perhaps the 5th or 6th corruption case we've had in the past week
[17:57] <Azrael> so i want to start narrowing this down
[18:00] <tnt> are you using unsafe options ? (like enabled drive write cache without battery backup, or unsafe xfs/ext4 options) ?
[18:04] <Azrael> afaik nope
[18:05] <Azrael> xfs (rw,noatime,attr2,noquota)
[18:05] * Wolff_John (~jwolff@ Quit (Ping timeout: 480 seconds)
[18:06] <tnt> what's "attr2" btw ?
[18:07] <Azrael> The options enable/disable (default is enabled) an "opportunistic" improvement to be made in the way inline extended attributes are stored on-disk. When the new form is used for the first time
[18:07] <Azrael> (by setting or removing extended attributes) the on-disk superblock feature bit field will be updated to reflect this format being in use.
[18:07] <Azrael> its the default mount option i believe
[18:07] <Azrael> either that or its what 'ceph-disk activate' uses
[18:10] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Quit: Leaving)
[18:11] * leseb (~Adium@ Quit (Quit: Leaving.)
[18:11] * oliver1 (~oliver@p4FECEC7E.dip0.t-ipconnect.de) has left #ceph
[18:13] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[18:15] * jtaguinerd (~jtaguiner@ Quit (Quit: jtaguinerd)
[18:16] * mikedawson (~chatzilla@mobile-166-147-100-146.mycingular.net) has joined #ceph
[18:19] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:30] * tnt (~tnt@ has joined #ceph
[18:34] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[18:36] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) has joined #ceph
[18:38] * DarkAceZ (~BillyMays@ Quit (Ping timeout: 480 seconds)
[18:41] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[18:42] * Tamil (~tamil@ has joined #ceph
[18:43] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[18:45] * KindTwo (~KindOne@ has joined #ceph
[18:52] * KindTwo is now known as KindOne
[18:59] * sjustlaptop (~sam@ has joined #ceph
[18:59] * mikedawson (~chatzilla@mobile-166-147-100-146.mycingular.net) Quit (Read error: No route to host)
[19:02] * loicd (~loic@magenta.dachary.org) has joined #ceph
[19:05] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) has joined #ceph
[19:06] * tziOm (~bjornar@ti0099a340-dhcp0870.bb.online.no) has joined #ceph
[19:08] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[19:14] * BillK (~BillK@124-169-186-145.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[19:22] * Volture (~quassel@office.meganet.ru) has joined #ceph
[19:22] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[19:25] * diegows (~diegows@ has joined #ceph
[19:25] * andrei (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[19:26] * Wolff_John (~jwolff@ has joined #ceph
[19:29] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[19:36] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[19:37] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[19:39] * SvenPHX (~scarter@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[19:39] * SvenPHX (~scarter@wsip-174-79-34-244.ph.ph.cox.net) has left #ceph
[19:41] <Volture> Hi all
[19:43] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[19:45] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[19:45] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[19:45] * tkensiski (~tkensiski@ has joined #ceph
[19:45] * tkensiski (~tkensiski@ has left #ceph
[19:46] * loicd (~loic@magenta.dachary.org) has joined #ceph
[19:53] * Volture (~quassel@office.meganet.ru) Quit (Remote host closed the connection)
[19:54] * Volture (~quassel@office.meganet.ru) has joined #ceph
[20:00] <sagewk> mikedawson: around?
[20:10] * Volture (~quassel@office.meganet.ru) Quit (Remote host closed the connection)
[20:11] * Volture (~quassel@office.meganet.ru) has joined #ceph
[20:11] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[20:12] <sagewk> wido: https://github.com/ceph/ceph/pull/303
[20:12] <sagewk> just need to verify the qa runs and we'll pull it in and backport to cuttlefish
[20:13] * rtek_ is now known as rtek
[20:14] * jtang1 (~jtang@ has joined #ceph
[20:15] * DarkAceZ (~BillyMays@ has joined #ceph
[20:15] * jtang1 (~jtang@ Quit (Read error: Connection reset by peer)
[20:15] * jtang1 (~jtang@ has joined #ceph
[20:16] <tnt> sagewk: wrt to the logs for the monitor write rate, do you need the quorum leader or will any mon do ? (the peons seems to have a bit more IO actually)
[20:16] <sagewk> g
[20:16] <sagewk> any mon will do... the txn stream is identical
[20:17] <sagewk> or should be
[20:17] <wido> sagewk: Ah, cool!
[20:18] * fghaas (~florian@91-119-68-174.dynamic.xdsl-line.inode.at) has joined #ceph
[20:19] * BManojlovic (~steki@fo-d- has joined #ceph
[20:21] * chutz (~chutz@rygel.linuxfreak.ca) Quit (Quit: brb)
[20:23] * eschnou (~eschnou@203.39-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:25] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[20:25] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:26] <MrNPP> so i let an osd fill up for some stupid reason and now it crashes, it was marked as out, long before it crashed, but the data wasn't migrated from it, how can i keep the osd up so it can be cleaned up
[20:33] <MrNPP> nm, found the answer
[20:44] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:45] * Tamil (~tamil@ Quit (Quit: Leaving.)
[20:46] * Cube (~Cube@66-87-112-70.pools.spcsdns.net) Quit (Quit: Leaving.)
[20:47] <MrNPP> I went to bring a vm online after the failure this last week, and it just got stuck, wouldn't boot, it wasn't until i removed an old downed (and marked out) osd, and now its booting normally
[20:48] * chutz (~chutz@rygel.linuxfreak.ca) has joined #ceph
[20:51] * Tamil (~tamil@ has joined #ceph
[21:09] * jtang2 (~jtang@ has joined #ceph
[21:09] * jtang1 (~jtang@ Quit (Read error: Connection reset by peer)
[21:10] * jtang1 (~jtang@ has joined #ceph
[21:10] * jtang2 (~jtang@ Quit (Read error: Connection reset by peer)
[21:11] <sagewk> mikedawson: ping!
[21:11] * coyo (~unf@pool-71-164-242-68.dllstx.fios.verizon.net) has joined #ceph
[21:13] * Tamil (~tamil@ Quit (Quit: Leaving.)
[21:15] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[21:16] <tnt> sagewk: good news about 4895 that you're eager to speak to mikedawson ? :)
[21:16] <sagewk> it doesn't look like paxos is trimming on his mon.. may not be leveldb at all
[21:17] <sagewk> but i need logs to diagnose :)
[21:17] * coyo|2 (~unf@pool-71-164-242-68.dllstx.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[21:20] * Tamil (~tamil@ has joined #ceph
[21:22] <tnt> sagewk: in case it happens to me again, what should I look for in the logs ? A non-incrementing 'first_commit' number ?
[21:23] <sagewk> i want to look for paxos (not paxosservice) trim, and first_committed
[21:23] <sagewk> a log would be ideal
[21:25] * LeaChim (~LeaChim@ has joined #ceph
[21:32] <MrNPP> hmmm, randomly my ceph install gives me 2013-05-21 12:32:08.222196 7f20e10ec700 0 monclient: hunting for new mon
[21:34] <tziOm> How are things moving along with cephfs?
[21:34] <tziOm> Hi btw!
[21:55] <saaby> hi, I think I see osds timeout on I/O with: "OSD::op_tp thread 0x...' had timed out after 30" messages because of scrubbing happening on large (25GB'ish) objects
[21:57] <saaby> I have tried changing the osd filestore and scrub timeouts to 600 seconds, which looks to have avoided those "down" markings.
[21:57] <saaby> but... is that the correct way of handling scrubbing of large objects..?
[21:58] * frank9999 (~frank@kantoor.transip.nl) Quit (Read error: Operation timed out)
[21:59] <saaby> my point being 10 minutes is a long time for I/O timeout, and this would probably not be enough for even larger objects
[21:59] * frank9999 (~frank@kantoor.transip.nl) has joined #ceph
[21:59] * JohansGlock_ (~quassel@kantoor.transip.nl) has joined #ceph
[22:00] <tnt> saaby: I think you're supposed to avoid large objects ...
[22:01] <saaby> ok? I didn't know that - can you point me in the direction of some info on that?
[22:03] <janos> i pretty much only store large objects
[22:03] * JohansGlock (~quassel@kantoor.transip.nl) Quit (Ping timeout: 480 seconds)
[22:04] <tnt> saaby: well, it's mostly inferred from the way rgw and rbd store data in rados, they stripe it in chunks.
[22:04] * rturk-away is now known as rturk
[22:04] <dmick> tnt: that's more about spreading the load around, not really a limitation
[22:04] <dmick> just for higher throughput
[22:05] <saaby> I know that striping files and rbd devices across many objects is good for performance etc, but I haven't heard about large objects being a problem for rados itself
[22:05] <tnt> there are other downside to storing large objects AFAICT: like http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/6177 or the one you're seeing right now.
[22:06] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:06] <tnt> (defining large as more like > 10G or so)
[22:08] <saaby> right.. interesting
[22:08] <tnt> I'm definitely no expert on the matter but from the various things I read over the last year about ceph, that's always the impression I had : you're better off distributing the load !
[22:09] * Cube (~Cube@66-87-112-70.pools.spcsdns.net) has joined #ceph
[22:10] <saaby> janos: large objects? how large is large? and have you had any problems with e.g. scrubbing timing out?
[22:12] <janos> i haven't forced scrubs, but usually 2-14GB in size
[22:12] <janos> i'm on .56.6
[22:12] <cjh_> would it be possible to use the Ceph Classes to extend ceph so that a client connects to data that is physically closer instead of the same data that is further away?
[22:13] <saaby> ok, I'm on cuttlefish with scrubs running auto as default
[22:15] * hufman (~hufman@rrcs-67-52-43-146.west.biz.rr.com) has joined #ceph
[22:15] <hufman> hello!
[22:15] <hufman> guess who flubbed a ceph upgrade :D
[22:16] <hufman> apparently when i upgraded one of my nodes from argonaut to bobtail, it never restarted and upgraded its data
[22:16] <hufman> so today when i upgraded everything to cuttlefish, that node refuses to start because it skipped that upgrade
[22:17] <hufman> is there a way to upgrade its data without being in a quorum?
[22:17] <tnt> which node ? (mon / osd / ...)
[22:17] <tnt> ok, I guess mon ...
[22:17] <tnt> how many mon do you have and how many are up ?
[22:17] <hufman> indeed, mon
[22:18] <hufman> i nominally have 3
[22:18] <hufman> then i dropped to two in quorum, because of the skipped node
[22:18] * drokita1 (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[22:18] <hufman> then i tried removing that mon from the cluster, except i typed the wrong number
[22:18] <tnt> then, ... just wipe the data from the third, redo a mkfs ...
[22:18] <tnt> oh boy ..
[22:19] * lando` (~lando@login.lando.us) Quit (Remote host closed the connection)
[22:19] <hufman> so now i have one good node in the cluster, one node that skipped the upgrade, and one node that has good data but isn't in the cluster
[22:19] <hufman> soooo 1 working node out of 2 configured, should still work right?
[22:19] <tnt> no
[22:19] <hufman> but all of the ceph commands show a pipe fault
[22:19] <tnt> you need floor(N/2) + 1 up ...
[22:19] <tnt> so if N=2 you need both up.
[22:19] <tnt> (which is why having 2 mons is not a good idea)
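The quorum rule tnt states can be sketched in a few lines of Python. The function names are hypothetical; the formula floor(N/2) + 1 is the one from the chat.

```python
def quorum_size(n_mons):
    """Smallest number of monitors that forms a strict majority of n_mons."""
    if n_mons < 1:
        raise ValueError("need at least one monitor")
    return n_mons // 2 + 1


def has_quorum(n_mons, n_up):
    """True if n_up running monitors out of n_mons configured form a quorum."""
    return n_up >= quorum_size(n_mons)
```

With 3 configured mons, 2 up is enough; with 2 configured, both must be up, so a 2-mon cluster tolerates no more failures than a 1-mon cluster.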
[22:20] <hufman> agreed
[22:20] <hufman> wishful thinking heh
[22:20] <hufman> so then!
[22:20] <hufman> how do i rip the other node out of the cluster, if all management commands are stuck? :)
[22:21] <tnt> I solved that same issue back in argonaut ... but back then the mons were storing things as files.
[22:21] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Remote host closed the connection)
[22:21] <tnt> Now with level db, it's more obscure and the procedure here : http://ceph.com/docs/master/rados/operations/add-or-rm-mons/ is not applicable.
[22:21] <hufman> also, what bad things will happen to my rbd things if i turn off the only working mon node?
[22:22] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[22:23] * Vjarjadian (~IceChat77@ has joined #ceph
[22:23] <tnt> if it's not in quorum, it's not doing anything anyway.
[22:23] <hufman> but strangely my rbd access continues to appear to work
[22:24] <sagewk> tnt, hufman: that process is mostly still applicable.. just need a ceph-monstore-tool getmonmap command. i'll add one now.
[22:24] <hufman> does librbd lock onto one node when it opens a file?
[22:24] <hufman> ! i don't mean to bother you sage :x
[22:24] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[22:25] <tnt> there is a ceph-monstore-tool ?
[22:25] <hufman> what is ceph_mon_store_converter?
[22:25] <sagewk> part of ceph-tests i think
[22:26] <tnt> interesting, I hadn't noticed before
[22:27] <sagewk> but i'm adding the command now
[22:28] <sagewk> pushed to master, will take a bit to build a package for it. once you have that, you can do
[22:28] <sagewk> ceph-monstore-tool getmonmap 0 --mon-store-path /var/lib/ceph/mon/whatever --out /tmp/surviving_map to replace steps 3-5
[22:29] <tnt> so, he can remove the bad mon and will end up with a monmap with 1 single mon.
[22:29] <hufman> "ceph mon remove 1" right?
[22:29] <hufman> that command also is stuck
[22:30] <tnt> hufman: no, that can't work, you need to do it manually.
[22:30] <hufman> oh, that command would be used offline, got it
[22:30] * lando` (~lando@login.lando.us) has joined #ceph
[22:30] <tnt> login to your only working mon and do the command sagewk pasted above.
[22:31] <hufman> i eagerly await the new package :)
[22:32] * rturk is now known as rturk-away
[22:32] <sagewk> actually any mon data dir will work ok
[22:32] <sagewk> to get the monmap
[22:33] <tnt> actually, would it work to take his old monmap from the "old format" and re-inject it in the new mon ?
[22:33] <tnt> that monmap had 3 monitors ...
[22:34] * mtanski (~mtanski@ has joined #ceph
[22:35] <tnt> since mon.a is OK (but has a monmap with only mon.a & mon.c), mon.b has been erroneously removed but is otherwise still good, mon.c is hosed/lost. Reinjecting the monmap from before the cuttlefish upgrade, which had the three mons, could allow them to form a quorum.
[22:35] <tnt> or is the 'epoch' embedded in the monmap and restoring an old one bad ?
[22:37] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[22:38] <dmick> sagewk: so you fixed 5125
[22:38] <sagewk> dmick: yeah, i'll close the bug
[22:38] <dmick> should get docs updated too
[22:41] <mtanski> I might be a bit slow… where on the bug page is the button for filing a bug
[22:41] <mtanski> I spent like 5 minutes looking for it
[22:41] <mtanski> I already registered
[22:42] <dmick> "New issue"; a tab at the top
[22:48] <saaby> so, do any of you have any good/best practices for max rados object sizes? We are not really concerned with performance of the single object, but rather having sane object sizes, so we don't hit any (scrub) timeouts.
[22:49] * lando` (~lando@login.lando.us) Quit (Quit: leaving)
[22:49] <saaby> looks like ~25GB object demands scrubbing timeouts of >300 secs, which looks like an uncomfortable long timeout
[22:51] * mjblw1 (~mbaysek@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[22:51] <saaby> same goes for osd_op_thread_timeout btw.
[22:51] * SvenPHX1 (~scarter@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[22:54] * rahmu (~rahmu@ip-147.net-81-220-131.standre.rev.numericable.fr) has joined #ceph
[22:54] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[22:55] <dmick> sjust: ^ see saaby's questions
[22:55] * mjblw (~mbaysek@wsip-174-79-34-244.ph.ph.cox.net) Quit (Ping timeout: 480 seconds)
[22:55] <sjust> saaby: 25gb is rather on the large side
[22:56] <sjust> we usually do chunks of around 4MB
[22:57] <saaby> ok, well that probably explains why we are struggling with i/o op timeouts then
[22:57] <sjust> saaby: I don't think you can do writes that large anyway
[22:58] <sjust> (by default)
[22:58] <saaby> I can see objects that large in our test environment already, so I think so
[22:58] <saaby> written with librados
[22:58] <sjust> well, you can do partial writes to any offset, so it might have happened from multiple writes
[22:58] <saaby> it's just 1:1 mappings of incoming files. most are small, but the odd one is very large
[22:59] <saaby> right, that makes sense
[22:59] <sjust> you don't want unbounded size rados objects
[22:59] <saaby> got it
[22:59] <saaby> so, what would a sane striping size be?
[22:59] <sjust> we don't have a hard limit because such a limit would depend on the details of the network, disk, etc
[23:00] <sjust> but 4-16MB seems like a large enough size to get good streaming performance but not large enough to cause trouble
[23:01] <sjust> if for no other reason than that a single rados object lives in a single PG
[23:01] <saaby> ok. and the scaling of the number of files shouldn't have any significant impact?
[23:01] <sjust> no, it's designed for that
[23:01] <saaby> ok
[23:01] <nhm> consider that Lustre (unless it's changed recently) has been more or less designed around 1MB stripes.
[23:01] <nhm> 4MB should do well.
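A minimal sketch of what sjust's advice implies for a 1:1 file-to-object writer: split each incoming file into fixed-size chunks so no single RADOS object grows unbounded. The `name.NNNNNNNN` naming scheme and the function name are hypothetical (this is not how rbd or rgw name their stripes), and the code only computes the layout; it does not call librados.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, the low end of the 4-16MB range from the chat


def chunk_layout(name, total_size, chunk_size=CHUNK_SIZE):
    """Yield (object_name, offset, length) for each chunk of a file.

    Every chunk is chunk_size bytes except possibly the last one.
    """
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size > 0")
    for index, offset in enumerate(range(0, total_size, chunk_size)):
        length = min(chunk_size, total_size - offset)
        yield (f"{name}.{index:08d}", offset, length)
```

Under this scheme a 25 GiB file becomes 6400 objects of 4 MiB each, which are spread across many PGs instead of living in one.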
[23:01] <saaby> yeah, thats right.. I remember that
[23:02] <saaby> ok
[23:02] <saaby> all right, thanks for your help!
[23:02] <sjust> sure
[23:05] <jtang> nhm: i thought lustre did 8mb stripes these days
[23:05] <jtang> gpfs do 4 or 8 i think
[23:06] <nhm> jtang: Might be in one of the new 2.X releases.
[23:06] <jtang> well blocksize in gpfs's case
[23:06] <jtang> the last time i talked to a gpfs engineer they wanted 16mb blocksizes
[23:06] <nhm> jtang: I exited the Lustre scene back just as 2.0 and 1.8.6 were coming out.
[23:06] <jtang> but they lacked a big enough system to test on
[23:07] * Wolff_John (~jwolff@ Quit (Quit: ChatZilla 0.9.90 [Firefox 20.0.1/20130409194949])
[23:07] <jtang> nhm: yea me too, but i still keep an eye on it, just in case
[23:07] <jtang> especially now that it seems they are getting closer to userland ost/oss's
[23:07] <jtang> given how zfs on linux panned out
[23:07] <jtang> though i dont do hpc these days
[23:07] * jtang frowns on that
[23:08] <nhm> jtang: daos will certainly be interesting
[23:09] <jtang> im kinda not sure of lustre these days, its definitely getting better
[23:10] <jtang> feature wise i still feel gpfs is better than lustre
[23:10] <jtang> btw our bluegene died
[23:10] <jtang> :(
[23:10] <nhm> jtang: doh
[23:10] <nhm> out of warranty?
[23:10] <jtang> yea
[23:11] <janos> dang!
[23:11] <nhm> can you resurrect any of it?
[23:11] <jtang> out of warranty and we had no spares for the force10 switch
[23:11] <jtang> it'd cost too much to fix it
[23:11] <jtang> i was hoping to spend some effort at making ceph work on it too
[23:11] <saaby> sjust: could these (very) large objects be the source of the segfaults we reported earlier today too, you think?
[23:12] <jtang> well at least to
[23:12] <nhm> jtang: no salvaging or hackery possible?
[23:12] <gregaf> saaby: he stepped out, but maybe — was that you on the mailing list with the 370MB allocation?
[23:12] <jtang> compile the daemons on ppc
[23:12] <jtang> nhm: no unfortunately, its hard to come by force10 spares for the 10gb equipment
[23:12] * Vjarjadian (~IceChat77@ Quit (Quit: If at first you don't succeed, skydiving is not for you)
[23:13] <jtang> we tried calling in favours from dell and hp and we had no luck
[23:13] <saaby> gregaf: yeah, that was us with the leveldb 370MB allocation
[23:13] <nhm> jtang: S4810?
[23:14] <jtang> nhm: i need to check, i wasnt responsible for the networking
[23:14] <jtang> nhm: eitherway its too late to salvage it
[23:14] <jtang> we had to scrap it
[23:14] <nhm> too bad
[23:14] <jtang> to make space for new kit
[23:14] <nhm> Ah, that's always the case though
[23:14] <janos> i'm sure the new kit will be fun ;)
[23:14] <jtang> it was either fix it or make space for new kit
[23:15] <nhm> Gotta get rid of the old & broken for the new shiny
[23:15] <jtang> janos: yea hopefully it will be a new cluster
[23:15] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[23:15] <jtang> im not in the discussions for new kit for hpc right now
[23:16] <jtang> so i have no idea what we'll get
[23:16] <jtang> we did salvage the ibm san though
[23:16] <mikedawson> sagewk, sage: I'm back. Do you want mon-before, mon-after, tdump and logs or just a log?
[23:16] <jtang> we're probably gonna put ceph on it
[23:16] <jtang> to supplement our messed up pod ceph setup
[23:17] * jtang still isnt happy with them backblaze pods
[23:17] <sagewk> any log over a period of growth. i want to see why paxos states aren't trimming.
[23:17] <sagewk> mikedawson: ^
[23:17] <sagewk> thanks!
[23:18] <mikedawson> sagewk: will do. How much growth? Hours, GBs?
[23:18] <sagewk> probably ~10-20 minutes of log is enough
[23:18] <sagewk> debug mon = 20, debug paxos = 20, debug ms = 1
[23:19] <nhm> jtang: oh interesting, you guys did backblaze pods for ceph?
[23:19] * The_Bishop_ (~bishop@f052101139.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[23:20] <jtang> nhm: yea
[23:20] <jtang> pods == bad
[23:20] * eschnou (~eschnou@203.39-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:20] <jtang> i thought i mentioned it a few times on the channel
[23:21] <nhm> jtang: I probably just forgot. :)
[23:21] <nhm> jtang: what's causing the most trouble?
[23:21] <jtang> between the pod hardware being too top heavy in disk and the lack of syncfs in rhel it was pain
[23:22] <nhm> yeah, I believe that.
[23:22] <jtang> 45x3tb disk in a box with no redundancy and 10gb or higher bandwidth it was pain
[23:22] <jtang> btrfs isnt ready on rhel as well
[23:22] <jtang> even with a mainline kernel
[23:22] <jtang> there were numerous reasons
[23:23] <jtang> in hindsight half filled pods and lots of them would have been better
[23:23] <jtang> we learnt a lot about them
[23:23] <jtang> the side effect is, we have about 50x3tb disks sitting on a shelf right now
[23:24] <nhm> jtang: I like the 36 drive supermicro chassis, but I think it's only really viable since it has dual power supplies and you can put dual CPU in with lots of SAS controllers.
[23:24] <janos> jtang, hang on, i'm setting up a shipping label to my address for you
[23:24] <janos> :)
[23:24] <jtang> we ended up downsizing the pods to about ~25tb of disk each
[23:24] <nhm> jtang: and losing a motherboard would still really suck. It's probably best used for very large installations.
[23:25] <jtang> nhm: yea again in hindsight, i would get a back blaze pod unless i can get 10 of them
[23:25] <jtang> or something more than say 5
[23:25] <jtang> the rebuild process would be painful
[23:26] <jtang> wouldnt get that is
[23:27] <nhm> jtang: I'm excited that multiple vendors are sticking multiple nodes in 4U chassis and still getting high storage density.
[23:28] <jtang> nhm: yea, agreed
[23:28] <jtang> im excited too ;)
[23:28] <jtang> especially them supermicro fat-twins
[23:28] <jtang> the ones that are disk heavy
[23:28] <jtang> i saw them at sc2012
[23:28] <jtang> they look cool!
[23:29] <nhm> Sanmina has 60 drives in 4U with 2 nodes, HP has the SL4540, and Dell is doing the C8000.
[23:29] <jtang> yes
[23:29] <jtang> !
[23:30] <jtang> mind you the dell dx-storage systems look good too
[23:30] <jtang> though they do seem to be in the same space as ceph's radosgw
[23:30] <nhm> ooh, those fattwins look really interesting.
[23:31] <janos> dang just looked those up
[23:31] <janos> i like
[23:31] * Vjarjadian (~IceChat77@ has joined #ceph
[23:31] <mikedawson> sagewk: I should have injected the logging, but I restarted mon.a with 'mon compact on start = true'. Now mons are compacted from 14GB to ~325MB. I'm pretty sure I can trigger it by re-adding an OSD. I'll try that tonight.
[23:31] <hufman> we have some sort of supermicro thing at work, it's great
[23:31] <jtang> if it werent for government procurement policies, i'd get a few of them to solve my storage/deployment problems
[23:31] <sagewk> k
[23:32] <jtang> hufman: yea, supermicro is great
[23:32] <jtang> they are an underdog of the hpc market
[23:32] <jtang> they make great kit, but they lack a name
[23:32] <hufman> i wish i could figure out their remote vnc protocol and make a standalone client for it
[23:32] <jtang> we have two clusters which are rebranded supermicros
[23:32] <jtang> and they are rock solid
[23:33] <jtang> cheap and cheerful
[23:35] <nhm> Man, I wish they could do the 4-node version of that fat twin with 6 3.5" bays and 4 2.5" bays.
[23:37] <nhm> though I shouldn't be complaining, that's a pretty impressive setup anyway.
[23:38] <sagewk> dmick: http://tracker.ceph.com/issues/5125
[23:38] <sagewk> dmick: https://github.com/ceph/ceph/commit/c0268e27497a4d8228ef54da9d4ca12f3ac1f1bf
[23:38] * vata (~vata@2607:fad8:4:6:398f:3791:ed80:4c53) Quit (Quit: Leaving.)
[23:40] <dmick> does this make a best effort if the rest of the store is borked?
[23:44] <tnt> hufman: so did you fix your mon ?
[23:45] <hufman> i haven't seen a new deb come down yet
[23:45] <tnt> are deb built continuously ?
[23:46] <sagewk> depends on what leveldb does. but you can do this on the surviving mon.. so if it is also borked, you are screwed anyway.
[23:46] <sagewk> tnt: yes, http://ceph.com/gitbuilder.cgi, http://gitbuilder.ceph.com for output
[23:46] <sagewk> dmick: ^^
[23:47] <hufman> so, i have several virtual machines running off of rbd, how are they still running if the mons are all down?
[23:47] <dmick> ok
[23:48] <tnt> hufman: AFAICT, as long as OSDs don't go down and they don't get disconnected, or need to renew the auth token, it'll continue to work ... but it's bound to stop at some point.
[23:48] <hufman> how do the OSDs know to distribute their changes to their peers?
[23:49] * sagewk (~sage@2607:f298:a:607:1010:231c:3d5f:f266) Quit (Remote host closed the connection)
[23:49] <tnt> hufman: and the debs seem to have been built AFAICT from http://gitbuilder.sepia.ceph.com/gitbuilder-precise-deb-amd64/#origin/master
[23:49] <tnt> hufman: I guess they have the latest osdmap / pgmap ...
[23:49] <hufman> ah ha, i probably didn't see it because i'm on lts, just a sec
[23:50] * jtang1 (~jtang@ Quit (Quit: Leaving.)
[23:50] <hufman> oh, so they cache those files and talk that way? neat!
[23:51] * Tamil (~tamil@ Quit (Quit: Leaving.)
[23:51] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:53] * Jahkeup_ (~jahkeup@ Quit (Quit: My MacBook Pro has gone to sleep. ZZZzzz…)
[23:54] * sagewk (~sage@2607:f298:a:607:5835:d7f1:dcc4:ed57) has joined #ceph
[23:56] * BillK (~BillK@124-169-186-145.dyn.iinet.net.au) has joined #ceph
[23:57] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:57] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:59] * Tamil (~tamil@ has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.