#ceph IRC Log


IRC Log for 2012-07-17

Timestamps are in GMT/BST.

[0:04] <joshd> andrewbogott: you will need to use mkcephfs unless you want to learn how to do what it does manually
[0:04] <joshd> what error did mkcephfs give you?
[0:07] <andrewbogott> joshd: Won't I need to learn all that when it comes time to build a production cluster anyway?
[0:07] <dmick> well, there's the chef setup method as well
[0:07] <andrewbogott> Is puppet an option?
[0:08] <andrewbogott> (In any case, I will rewind and try mkcephfs; stay tuned.)
[0:08] <dmick> I don't know of any puppet config methods currently
[0:08] <dmick> nothing's preventing it; we've just done work around chef
[0:09] <dmick> http://ceph.com/docs is a more-current reference; not sure what exactly you meant by "the wiki" but there's an older set of stuff in a wiki that you should probably steer clear of
[0:10] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has left #ceph
[0:10] <andrewbogott> ceph.com/wiki is what I mean. But, ok, thanks for the warning.
[0:11] <andrewbogott> And, huh, mkcephfs is playing nice now.
[0:11] <andrewbogott> So, never mind, for now :) I predict I'll be back with more questions in ~10 minutes.
[0:11] <dmick> bring 'em on
[0:43] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:48] * LarsFronius (~LarsFroni@2a02:8108:3c0:24:69bf:5e36:57f3:f253) Quit (Quit: LarsFronius)
[0:54] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:11] * lightspeed (~lightspee@2001:8b0:16e:1:216:eaff:fe59:4a3c) has joined #ceph
[1:20] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Operation timed out)
[1:30] <andrewbogott> ok, now mkcephfs seems happy, but mounting just gets me a long pause and a failure. Any debugging suggestions?
[1:30] <joshd> can you run ceph -s successfully?
[1:31] <andrewbogott> that's the same as 'service ceph start' right?
[1:31] <dmick> no, that's "status"
[1:32] <andrewbogott> oh, ok. Trying...
[1:32] * dmick (~dmick@ has left #ceph
[1:33] <andrewbogott> Looks like a 'no'... still waiting for ceph -s to return.
[1:35] <joshd> sounds like you can't connect to your ceph-mon daemons
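
The check joshd describes (does `ceph -s` return at all?) can be scripted with a timeout instead of waiting on a hung terminal. A minimal Python sketch, assuming the `ceph` CLI and a working ceph.conf are present on the host; the helper name `runs_within` and the 10-second limit are illustrative choices, not Ceph defaults:

```python
import subprocess

def runs_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it hangs.

    The exit status is ignored: this only answers "did it return?".
    """
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return True

# Usage (hypothetical threshold; needs the ceph CLI installed):
#   if not runs_within(["ceph", "-s"], 10):
#       print("ceph -s did not return; check that ceph-mon is reachable")
```

A False result only says the command did not finish in time; it does not say which monitor is unreachable or why.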
[1:36] * lightspeed (~lightspee@2001:8b0:16e:1:216:eaff:fe59:4a3c) Quit (Ping timeout: 480 seconds)
[1:36] * lightspeed (~lightspee@ has joined #ceph
[1:38] <andrewbogott> Maybe so, but one of them is running locally, or should be.
[1:44] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:45] * Tv_ (~tv@2607:f298:a:607:394a:5e1a:feb6:b166) Quit (Ping timeout: 480 seconds)
[1:46] * dmick (~dmick@2607:f298:a:607:adef:2e71:e864:1fc3) has joined #ceph
[1:56] * JJ1 (~JJ@ Quit (Quit: Leaving.)
[2:07] * dmick (~dmick@2607:f298:a:607:adef:2e71:e864:1fc3) has left #ceph
[2:10] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:11] * sagelap (~sage@53.sub-166-250-67.myvzw.com) has joined #ceph
[2:11] * dmick (~dmick@2607:f298:a:607:7578:af16:6927:7001) has joined #ceph
[2:19] * sagelap (~sage@53.sub-166-250-67.myvzw.com) Quit (Ping timeout: 480 seconds)
[2:26] * loicd (~loic@ Quit (Read error: Operation timed out)
[2:30] * sagelap (~sage@2600:1012:b007:c63c:8a53:2eff:fecb:3261) has joined #ceph
[2:43] * sagelap (~sage@2600:1012:b007:c63c:8a53:2eff:fecb:3261) Quit (Read error: Operation timed out)
[2:51] * JJ (~JJ@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[2:54] * andrewbogott (~andrewbog@c-76-113-214-220.hsd1.mn.comcast.net) Quit (Quit: andrewbogott)
[2:57] * sagelap (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[3:08] * lofejndif (~lsqavnbok@09GAAGRLM.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[3:12] * loicd (~loic@ has joined #ceph
[3:15] * andrewbogott (~andrewbog@50-93-251-66.fttp.usinternet.com) has joined #ceph
[3:15] * loicd (~loic@ Quit (Remote host closed the connection)
[3:18] * loicd (~loic@ has joined #ceph
[3:27] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[3:32] * loicd (~loic@ Quit (Quit: Leaving.)
[3:37] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[3:37] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[3:37] * ferai (~quassel@quassel.jefferai.org) Quit (Remote host closed the connection)
[3:37] * gregorg_taf (~Greg@ has joined #ceph
[3:37] * James_259 (~James259@ has joined #ceph
[3:37] * morse_ (~morse@supercomputing.univpm.it) has joined #ceph
[3:38] * jefferai (~quassel@quassel.jefferai.org) has joined #ceph
[3:38] * renzhi (~renzhi@ has joined #ceph
[3:39] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (Remote host closed the connection)
[3:39] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[3:42] * James259 (~James259@ Quit (Ping timeout: 480 seconds)
[3:44] * Cube (~Adium@ Quit (Ping timeout: 480 seconds)
[3:53] * loicd (~loic@ has joined #ceph
[3:53] * loicd (~loic@ Quit ()
[4:02] <renzhi> hijacker, on production cluster (still going through the testing), after restarting the server, I got stuck with stale pg, what is wrong?
[4:02] <renzhi> morning
[4:04] <renzhi> I do ceph -w to watch, and don't see any activity.
[4:06] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:11] * andrewbogott (~andrewbog@50-93-251-66.fttp.usinternet.com) Quit (Quit: andrewbogott)
[4:14] <renzhi> we tested restart many times with our test cluster, never had such an issue
[4:26] <renzhi> funny thing, it seems every pg is stuck stale
[4:36] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[4:59] * nIMBVS (~nIMBVS@ Quit (Ping timeout: 480 seconds)
[5:21] * andrewbogott (~andrewbog@50-93-251-66.fttp.usinternet.com) has joined #ceph
[5:22] * andrewbogott (~andrewbog@50-93-251-66.fttp.usinternet.com) Quit ()
[5:27] <renzhi> all osd are up, all nodes are connected (pingable, and no iptables rules, so the port should be all open), but there seems to be no connection activity, as all pgs are stale.
[5:27] <renzhi> anyone knows why?
[5:39] * JJ (~JJ@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[5:46] * renzhi (~renzhi@ Quit (Quit: Leaving)
[5:47] * renzhi (~renzhi@ has joined #ceph
[6:04] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[6:05] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[6:14] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:22] * JJ (~JJ@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[6:30] * loicd (~loic@ has joined #ceph
[6:38] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[6:57] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has left #ceph
[7:14] * JJ (~JJ@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[7:46] * dmick (~dmick@2607:f298:a:607:7578:af16:6927:7001) Quit (Quit: Leaving.)
[7:47] <renzhi> when I change the crushmap, how long does it take to take effect on the whole cluster?
[7:50] * deepsa_ (~deepsa@ has joined #ceph
[7:52] * deepsa (~deepsa@ Quit (Remote host closed the connection)
[7:52] * deepsa_ is now known as deepsa
[8:04] <ajm-> renzhi: if it has to move data around, it could be hours, or days
[8:05] <renzhi> ok, but I tested with an empty cluster, no data yet, I don't see any activity going
[8:43] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[9:01] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:03] * BManojlovic (~steki@ has joined #ceph
[9:32] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:32] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Ping timeout: 480 seconds)
[9:33] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:33] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[9:34] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[9:38] * LarsFronius (~LarsFroni@2a02:8108:3c0:24:311a:64:aa0f:6111) has joined #ceph
[9:39] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:20] * nhm (~nh@65-128-158-48.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[10:27] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:54] * deepsa (~deepsa@ has joined #ceph
[11:06] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[11:06] * tnt is now known as tnt_
[11:08] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Read error: Operation timed out)
[11:25] <tnt_> Hi everyone. When creating a bucket with radosgw, I'd like to control which ceph pool it ends up in. Currently I see that it's random, and I'm wondering what would be the best way to carry the info.
[11:41] * deepsa_ (~deepsa@ has joined #ceph
[11:43] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[11:43] * deepsa_ is now known as deepsa
[11:51] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[12:19] * tnt_ (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[12:22] * tnt_ (~tnt@202.181-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[12:42] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[13:06] * Dr_O (~owen@heppc049.ph.qmul.ac.uk) has joined #ceph
[13:10] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[13:15] * morse_ is now known as morse
[13:18] * deepsa (~deepsa@ has joined #ceph
[13:21] * Dr_O (~owen@heppc049.ph.qmul.ac.uk) Quit (Remote host closed the connection)
[14:08] * renzhi (~renzhi@ Quit (Quit: Leaving)
[14:19] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[14:27] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:29] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:02] * steki-BLAH (~steki@ has joined #ceph
[15:05] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[15:18] * deepsa (~deepsa@ Quit (Remote host closed the connection)
[15:20] * deepsa (~deepsa@ has joined #ceph
[15:29] * lofejndif (~lsqavnbok@83TAAHHB7.tor-irc.dnsbl.oftc.net) has joined #ceph
[15:41] * RupS (~rups@panoramix.m0z.net) Quit (Quit: leaving)
[15:49] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[15:50] * deepsa (~deepsa@ has joined #ceph
[16:06] * lofejndif (~lsqavnbok@83TAAHHB7.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[16:07] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Read error: Operation timed out)
[16:13] * sagelap (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:14] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[16:22] * LarsFronius_ (~LarsFroni@95-91-243-240-dynip.superkabel.de) has joined #ceph
[16:23] * sagelap (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[16:24] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[16:25] * LarsFronius (~LarsFroni@2a02:8108:3c0:24:311a:64:aa0f:6111) Quit (Ping timeout: 480 seconds)
[16:25] * LarsFronius_ is now known as LarsFronius
[16:26] * nhm (~nh@184-97-254-223.mpls.qwest.net) has joined #ceph
[16:44] * s[X] (~sX]@ has joined #ceph
[16:46] * JJ (~JJ@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[16:52] * ninkotech (~duplo@ Quit (Remote host closed the connection)
[16:59] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:03] * sagelap1 (~sage@142.sub-166-250-73.myvzw.com) has joined #ceph
[17:03] * sagelap (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Read error: Operation timed out)
[17:03] * s[X] (~sX]@ Quit (Read error: Connection reset by peer)
[17:04] <elder> Turns out I dislike this commit: vstart: use absolute path for keyring
[17:08] <nhm> whoa, Intel is buying whamcloud.
[17:08] <elder> That's interesting.
[17:08] <elder> How much?
[17:08] <nhm> don't know yet.
[17:08] <nhm> http://www.whamcloud.com/2012/07/whamcloud-intel-announcement/
[17:09] <nhm> So we've got redhat+gluster and intel+whamcloud
[17:09] <elder> And Inktank+ceph
[17:10] <elder> Let those other titans clash.
[17:10] <elder> I should have said New Dream Network+Inktank
[17:12] <nhm> Wow, so for that fast forward grant, so far the winners have been: AMD, IBM, Nvidia, Whamcloud (Intel), Intel, and Intel.
[17:14] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[17:31] * steki-BLAH (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:33] * JJ (~JJ@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[17:33] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[17:47] * sagelap1 (~sage@142.sub-166-250-73.myvzw.com) Quit (Ping timeout: 480 seconds)
[17:48] * deepsa (~deepsa@ has joined #ceph
[17:49] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[17:57] * nymous (~darthampe@93-181-201-83.pppoe.yaroslavl.ru) has joined #ceph
[17:57] <nymous> heya
[17:57] <nymous> anyone non-sleeping here?
[17:58] <nymous> i have a question on cluster configuration
[17:58] <nymous> with xfs parts
[17:59] <nymous> i have 4 node cluster, each node have 2 disks formated with xfs. i've set all daemons on all nodes (mon, mds, 2osds)
[17:59] <nymous> didnt found any special config options for xfs
[17:59] <nymous> i've set 3 replicas per pool
[18:00] <nymous> i've mounted fs on node by themselves
[18:00] <nymous> and here comes the bad thing
[18:00] <nymous> if i reboot a node, i have whole cluster hangup for up to 20 minutes
[18:01] <nymous> i thought it should take place, because i have enough replicas etc
[18:01] <nymous> any suggestions?
[18:01] <nymous> *shoudn't
[18:04] <nymous> i'm on centos 6.3, using ceph 0.48
[18:05] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[18:14] * Tv_ (~tv@2607:f298:a:607:b435:f9f6:cf25:1ca2) has joined #ceph
[18:30] * Cube (~Adium@ has joined #ceph
[18:30] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[18:40] * gregaf (~Adium@2607:f298:a:607:b98c:614a:1d58:34a1) Quit (Quit: Leaving.)
[18:41] * gregaf (~Adium@2607:f298:a:607:e987:dc8f:9b8a:10bd) has joined #ceph
[18:41] <gregaf> nymous: what's your ceph.conf look like? have you set it up with the host information correctly?
[18:41] <gregaf> what's the output of ceph -w during that time period?
[18:43] <nymous> http://pastebin.com/hfAZwHXX
[18:43] <nymous> my ceph.conf
[18:44] * aliguori (~anthony@ has joined #ceph
[18:44] <nymous> . /srv/ceph/data{0,1} is where my xfs volumes are mounted
[18:44] <nymous> not sure about ceph -w output
[18:44] <nymous> it becomes unresponsive
[18:45] <nymous> ceph -s say health warn, degraded (~24%)
[18:46] <nymous> actually i have frequent freezes of ceph command
[18:46] <gregaf> that looks right, what's the other output? ceph -s should have 5+ lines
[18:47] <nymous> can i paste it here?
[18:47] <gregaf> sure, or in pastebin
[18:47] <nymous> health HEALTH_WARN mds 2 is laggy
[18:47] <nymous> monmap e2: 4 mons at {0=,1=,2=,3=}, election epoch 52, quorum 0,1,2,3 0,1,2,3
[18:47] <nymous> osdmap e121: 8 osds: 8 up, 8 in
[18:47] <nymous> pgmap v7510: 6160 pgs: 6160 active+clean; 17827 MB data, 58118 MB used, 13840 GB / 13897 GB avail
[18:47] <nymous> mdsmap e187: 1/1/1 up {0=2=up:active(laggy or crashed)}
[18:47] <nymous> my current output
[18:48] <gregaf> okay, so there are a number of things I see here, not sure which is the cause of your freeze
[18:49] <gregaf> 1) the way you're naming your OSDs is dangerous, although it seems to be working; internally it stores a lot of stuff in vectors and arrays, so it's not required but you're better off with sequential naming
[18:49] <gregaf> and "osd.00" might be confusing things; do just osd.0
[18:50] <nymous> now it says health HEALTH_OK
[18:50] <gregaf> 2) having 4 monitors is a bad idea; you want odd numbers because the monitors require a strict majority in order to make any progress (so with 4 monitors you can withstand one failure, but not two; which is the same as with 3 monitors)
[18:50] <gregaf> 3) it looks like the MDS has crashed
[18:50] <gregaf> you should see if you actually have any MDS processes still running
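
The OSD naming advice in point 1 can be checked mechanically. A minimal Python sketch, assuming the OSD section names have already been collected from ceph.conf; the helper `check_osd_names` is hypothetical and not part of any Ceph tool, and starting the numbering at 0 is an assumption based on the "do just osd.0" remark:

```python
def check_osd_names(sections):
    """Return a list of problems with OSD section names such as 'osd.0'."""
    problems = []
    ids = []
    for name in sections:
        prefix, _, suffix = name.partition(".")
        if prefix != "osd" or not (suffix.isascii() and suffix.isdigit()):
            problems.append(f"{name}: not of the form osd.<number>")
            continue
        if suffix != str(int(suffix)):
            # e.g. 'osd.00' should be written 'osd.0'
            problems.append(f"{name}: leading zeros, use osd.{int(suffix)}")
        ids.append(int(suffix))
    # Sequential naming: ids should be exactly 0, 1, ..., n-1 with no gaps.
    if sorted(ids) != list(range(len(ids))):
        problems.append(f"ids {sorted(ids)} are not sequential from 0")
    return problems

print(check_osd_names(["osd.0", "osd.1", "osd.2"]))  # no problems
print(check_osd_names(["osd.00", "osd.10"]))         # leading zero and a gap
```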
[18:51] <nymous> yes, it's running, turned to HEALTH_OK now
[18:51] <gregaf> my guess is you've found some bug and killed them all off but you have some monitoring software turning them back on?
[18:51] <nymous> no
[18:51] <gregaf> anyway, I have to run and focus on 17 other things; I'm sure that Tv_ can help you
[18:52] <nymous> i had to restart a node... i did it sequentially... and i had whole cluster hang
[18:54] <nymous> now i've rebooted everything in parallel and it became OK
[18:55] <Tv_> nymous: is there an ongoing problem still remaining?
[18:55] <nymous> with hangs of the cluster? yes
[18:56] <nymous> sometimes it just delays any ops by few seconds, sometimes it won't respond at all
[18:56] <nymous> node reboot made whole cluster unresponsive
[18:56] <Tv_> nymous: and you are talking about filesystem operations on a mounted cephfs, right?
[18:56] <nymous> yes, filesystem operations, and even ceph commands like ceph -s or ceph health
[18:57] <Tv_> nymous: well ceph -s not being quick means it has problems talking to one of the monitors
[18:57] <nymous> can it be related with even number of mons?
[18:58] <Tv_> nymous: which comes down, is the mon process running, what is it logging, what's the health of the server overall (not in swap hell etc)
[18:58] <Tv_> nymous: the even number just means that if you lose 2 mons, you're out of service
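
The strict-majority rule behind the "odd number of mons" advice is plain arithmetic. A short Python sketch of it; the function names are illustrative only:

```python
def quorum_size(num_mons):
    """Smallest strict majority of num_mons monitors."""
    return num_mons // 2 + 1

def tolerated_failures(num_mons):
    """How many monitors can be down while a majority still exists."""
    return num_mons - quorum_size(num_mons)

for n in (3, 4, 5):
    print(n, "mons: quorum", quorum_size(n),
          "tolerates", tolerated_failures(n), "down")
```

With 4 monitors the quorum is 3, so only one failure is tolerated, the same as with 3 monitors; a fifth monitor is what raises the tolerance to two.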
[18:58] <nymous> they aren't going down at all
[19:02] <nymous> ok, tell me, do i need to edit something in my conf?
[19:02] <nymous> for xfs
[19:03] <nymous> gregaf says i need to renames OSDs
[19:03] <nymous> *rename
[19:03] <elder> gregaf, or someone... Can you confirm for me that CEPH_OSD_OP_WATCH will never return ERANGE?
[19:04] <nymous> Tv_: do you need any logs?
[19:04] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[19:05] * sileht (~sileht@sileht.net) Quit (Ping timeout: 480 seconds)
[19:06] <elder> That is, it looks to me like it will never supply a result of -ERANGE to the client, but I'm looking for confirmation.
[19:06] <gregaf> elder: ask sjust, I'm already in too many places at once
[19:06] <Tv_> nymous: are your mon servers running out of cpu, disk bandwidth, are they swapping, ....
[19:06] <elder> sjust, what do you say?
[19:06] <sjust> here
[19:07] <elder> See above
[19:07] <nymous> Tv_: no, they run on same hosts and currently they aren't busy
[19:07] <sjust> yeah, was reading, now checking code, one moment
[19:07] <nymous> # cat /proc/loadavg
[19:07] <nymous> 0.38 0.51 0.88 1/653 5120
[19:08] * sileht (~sileht@sileht.net) has joined #ceph
[19:08] <Tv_> nymous: what do you mean "run on the same hosts"?
[19:08] <nymous> Tv_: all daemons are running on all hosts
[19:08] <Tv_> nymous: oh yes but not all on a single host, good
[19:09] <Tv_> nymous: so then the question is does something suspicious show up at a mon log at the time "ceph -s" is slow
[19:09] <nymous> http://pastebin.com/hfAZwHXX my ceph.conf
[19:10] <sjust> elder: doesn't look like it can fail
[19:10] <elder> That's what I thought.
[19:10] <elder> OK, thanks for checking for me.
[19:10] <sjust> sure
[19:11] <nymous> Tv_: as i'm a newbie to ceph, i can't tell if there is something suspicious or not
[19:11] <nymous> mon.0@0(leader).log v10984 check_sub sub mdsmap not log type
[19:12] <nymous> is it suspicious? most messages are like this
[19:12] <nymous> sometimes i got unsynchronized clock warning, but all hosts are running ntpd, so...
[19:13] <Tv_> nymous: running ntp doesn't mean clocks are synchronized
[19:13] <Tv_> ntp saying clocks are synced means clocks are synced
[19:13] * cking (~king@ has joined #ceph
[19:13] <Tv_> that message should be harmless, as far as i know
[19:14] * chutzpah (~chutz@ has joined #ceph
[19:15] <cking> hi, are there any recommended CEPH test suite(s)?
[19:16] * Ryan_Lane (~Adium@ has joined #ceph
[19:17] <nymous> could osd names be the cause?
[19:17] <nymous> i could reboot a node right now and look for the logs
[19:17] <Tv_> nymous: i'd expect that to be more of a binary works-or-not thing
[19:17] <Tv_> nymous: you're wasting some RAM by having gaps in the numbers, but since you're in the low tens, that's not a real issue
[19:18] <Tv_> nymous: the [osd.00] section might not be found in the config; not sure about that one
[19:18] <nymous> daemon starts...
[19:19] <nymous> at least
[19:19] <nymous> maybe i should specify something xfs related?
[19:19] * BManojlovic (~steki@ has joined #ceph
[19:19] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[19:34] * loicd (~loic@z2-8.pokersource.info) has joined #ceph
[19:36] * dmick (~dmick@2607:f298:a:607:7578:af16:6927:7001) has joined #ceph
[19:38] <dmick> anyone else using Thunderbird with Gmail, and if so, are you having problems with it?
[19:38] <dmick> (I know, off topic; will go off list if anyone has any issue)
[19:40] <joshd> I used to, and always had problems with it
[19:41] <joao> dmick, the only problem I have is that, occasionally, I have to turn it off and on again; otherwise it won't get new mails
[19:41] <elder> dmick, I am using that setup.
[19:41] <elder> I occasionally have to re-try sending mail.
[19:41] <joshd> I forward gmail to a different imap server and have thunderbird read that
[19:41] <elder> What kind of trouble are you seeing?
[19:51] <yehudasa> widodh: around?
[19:51] <joshd> elder: the watch ERANGE check was never implemented, but the idea that it's guarding should be
[19:52] * tnt__ (~tnt@ has joined #ceph
[19:53] <joshd> elder: it can be done with a multi-op transaction of CEPH_OSD_OP_ASSERT_VER and CEPH_OSD_OP_WATCH
[19:54] * tnt_ (~tnt@202.181-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[19:54] <joshd> elder: that protects against a lost update to the header between reading the header and watching it
[19:56] <nymous> Tv_: i'm rebooting a node and having cluster hang
[19:56] <nymous> what should look?
[19:56] <nymous> *should i
[19:57] <Tv_> nymous: what does ceph -s say?
[19:57] <nymous> it doesn't respond
[19:57] <Tv_> nymous: are the mon processes alive?
[19:58] <nymous> other 3 - yes
[19:58] <nymous> ok, it did respond after a while
[19:58] <Tv_> nymous: probably hit the 1 that's down, timed out and moved on to a live one
[19:58] <nymous> health HEALTH_WARN 4571 pgs degraded; 4391 pgs stuck unclean; recovery 3722/14016 degraded (26.555%)
[19:58] <nymous> monmap e2: 4 mons at {0=x.x.x.141:6789/0,1=x.x.x.148:6789/0,2=x.x.x.139:6789/0,3=x.x.x.3:6789/0}, election epoch 62, quorum 0,1,2,3 0,1,2,3
[19:58] <nymous> osdmap e128: 8 osds: 8 up, 8 in
[19:58] <nymous> pgmap v7626: 6160 pgs: 1589 active+clean, 4217 active+degraded, 354 active+replay+degraded; 17828 MB data, 58121 MB used, 13840 GB / 13897 GB avail; 3722/14016 degraded (26.555%)
[19:58] <nymous> mdsmap e193: 1/1/1 up {0=2=up:active}, 2 up:standby
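
The pgmap line in the paste above can be broken into per-state counts to see how much of the cluster is degraded. A minimal Python sketch, assuming the line layout is exactly as pasted here (ceph 0.48); other versions may print it differently, and `pg_state_counts` is a hypothetical helper:

```python
import re

def pg_state_counts(pgmap_line):
    """Parse 'N pgs: <count> <state>, ...;' into (total, {state: count})."""
    m = re.search(r"(\d+) pgs: ([^;]+);", pgmap_line)
    if m is None:
        raise ValueError("no 'N pgs: ...;' segment found")
    total = int(m.group(1))
    counts = {}
    for item in m.group(2).split(","):
        count, state = item.split()
        counts[state] = int(count)
    return total, counts

line = ("pgmap v7626: 6160 pgs: 1589 active+clean, 4217 active+degraded, "
        "354 active+replay+degraded; 17828 MB data, 58121 MB used, "
        "13840 GB / 13897 GB avail; 3722/14016 degraded (26.555%)")
total, counts = pg_state_counts(line)
degraded = sum(c for s, c in counts.items() if "degraded" in s)
print(total, degraded)
```

The degraded count obtained this way (4217 + 354) matches the "4571 pgs degraded" figure in the health line of the same paste.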
[19:58] <Tv_> nymous: that says 4 mons; what did you mean with "other 3"?
[19:59] <nymous> i did reboot 1 node
[19:59] <nymous> it was unavailable for a while
[19:59] <nymous> now it says health HEALTH_WARN 786 pgs stuck unclean; 1 mons down, quorum 0,2,3 0,2,
[19:59] <Tv_> nymous: anything trying to talk to that mon will need to timeout and connect to a different mon; that takes i recall 30 seconds with the default settings
[20:00] <nymous> i tried to ls my mounted cephfs and ls just stuck
[20:00] <nymous> ls is still stuck
[20:01] <nymous> it's rather strange, i did rebooted 3rd mon, and it says it's in quorum
[20:01] <nymous> ls still stuck :(
[20:03] <Tv_> nymous: the default timeouts are in the tens of seconds range; stop worrying about hangs less than that
[20:03] <Tv_> nymous: what's happening with ceph -s?
[20:04] <nymous> i'm worrying that if one node stops working at all, i will get a non-responding cluster forever
[20:04] <nymous> i wish to have minimum to no delay on node fail
[20:04] <Tv_> and i wish for a pony
[20:04] <Tv_> nymous: the timeouts are adjustable; the risk of low timeouts is flapping
[20:05] <Tv_> nymous: what's happening with ceph -s?
[20:05] <nymous> now it says OK
[20:05] <Tv_> well, that's the amount of time it takes for an osd with your hw and data set to come up properly
[20:06] <Tv_> if it hadn't recovered, the clients would have shifted to using the replicas
[20:07] <Tv_> seems like everything is working, it's just not tuned for rapid recovery; which is certainly true
[20:07] <nymous> oh, i found one of my mds has crashed
[20:07] <Tv_> we're more worried about not upsetting a 500-node cluster than fast recovery of 4-node clusters, at this point; we should be able to tune it later
[20:08] <nymous> how can i tune it for fast recovery in my case?
[20:08] <Tv_> what do you mean "one of"? the ceph -s output only talks about one mds in the first place
[20:08] <nymous> ok, i can live with 30 seconds hang, but not 20 minute
[20:08] <Tv_> i see your ceph.conf talks about 4
[20:08] <Tv_> that doesn't seem to have ever worked right
[20:09] <Tv_> and running 4 mdses in active mode is probably not stable yet
[20:09] <nymous> ceph -s says it's OK, but i can't see mds process on node
[20:09] <Tv_> http://pastebin.com/hfAZwHXX
[20:09] <Tv_> err
[20:10] <Tv_> nymous: mdsmap e193: 1/1/1 up {0=2=up:active}, 2 up:standby
[20:10] <Tv_> you only had one mds at that point
[20:10] <nymous> hmmm
[20:11] <nymous> i have mds 1-3 running and 0 crashed
[20:11] <nymous> mds.0
[20:12] <Tv_> i wish that output was documented
[20:12] <nymous> i have set to have 3 replicas per pool...
[20:14] <nymous> i think i should rebuild my conf and cluster... make osd sequential... everything more?
[20:14] <nymous> *anything
[20:14] <Tv_> i think that says you have #0 = mds.2 is active, and you have two other mdses standby
[20:14] <Tv_> so they're not actually all active, which is good
[20:14] <Tv_> but i frankly can't remember how to read that output, and i may be misinterpreting the source right now
[20:15] <Tv_> nymous: you do realize the distributed file system is not stable yet, right?
[20:15] <nymous> i wish ceph had some best practices config in docs >_<
[20:15] <tnt__> Tv_: how unstable is it ?
[20:16] <Tv_> tnt__: it passes our nightly tests, but we aren't aggressively expanding tests for the filesystem part yet
[20:16] <nymous> Tv_: i wish to try rbd with openstack... filesystem could be good too
[20:16] <Tv_> nymous: so use rbd! you don't need mdses for that
[20:17] <nymous> i want to expand my cluster to 15 nodes... i have gigabit ethernet between node... i know, weird config, but it's all i have for now
[20:17] <Tv_> tnt__: http://ceph.com/docs/master/faq/#is-ceph-production-quality
[20:18] <tnt__> Tv_: thanks.
[20:19] <nymous> Tv_: filesystem would be nice to have. i've used to use ibm gpfs for my tasks, but for this one i do not have such budget
[20:21] <nymous> Tv_: so your suggestions? odd number of mons, odd number of mdses? which delays can i adjust without messing it up?
[20:22] <Tv_> nymous: odd number of mons. number of mdses is not as relevant but make sure just one is active (like you had). there are no easy answers for the tuning.
[20:27] <nymous> Tv_: should i set devs entry in osd sections?
[20:36] * tnt__ (~tnt@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * cking (~king@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * aliguori (~anthony@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * deepsa (~deepsa@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * lxo (~aoliva@lxo.user.oftc.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * lightspeed (~lightspee@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * nolan (~nolan@2001:470:1:41:20c:29ff:fe9a:60be) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * jamespage (~jamespage@tobermory.gromper.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * Qu310 (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * gregorg_taf (~Greg@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * fc (~fc@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * MK_FG (~MK_FG@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * markl (~mark@tpsit.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * themgt (~themgt@24-181-215-214.dhcp.hckr.nc.charter.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * sjust (~sam@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * ivan` (~ivan`@li125-242.members.linode.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * iggy (~iggy@theiggy.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * ajm- (~ajm@adam.gs) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * __jt__ (~james@jamestaylor.org) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * eternaleye (~eternaley@tchaikovsky.exherbo.org) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * Meths (rift@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * eightyeight (~atoponce@pinyin.ae7.st) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * todin (tuxadero@kudu.in-berlin.de) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * Azrael (~azrael@terra.negativeblue.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * thafreak (~thafreak@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * Solver (~robert@atlas.opentrend.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * sdouglas (~sdouglas@c-24-6-44-231.hsd1.ca.comcast.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * darkfader (~floh@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * acaos (~zac@209-99-103-42.fwd.datafoundry.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * jantje (~jan@paranoid.nl) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * chutzpah (~chutz@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * gregaf (~Adium@2607:f298:a:607:e987:dc8f:9b8a:10bd) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * Tv_ (~tv@2607:f298:a:607:b435:f9f6:cf25:1ca2) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * rosco (~r.nap@ Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * chuanyu (chuanyu@linux3.cs.nctu.edu.tw) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * yehudasa (~yehudasa@2607:f298:a:607:18ee:8529:6607:79ec) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * mkampe (~markk@2607:f298:a:607:222:19ff:fe31:b5d3) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * cattelan (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * cephalobot (~ceph@ps94005.dreamhost.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * rturk (~rturk@ps94005.dreamhost.com) Quit (reticulum.oftc.net synthon.oftc.net)
[20:36] * Qu310 (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[20:36] * nolan (~nolan@2001:470:1:41:20c:29ff:fe9a:60be) has joined #ceph
[20:36] * jamespage (~jamespage@tobermory.gromper.net) has joined #ceph
[20:36] * lightspeed (~lightspee@ has joined #ceph
[20:36] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[20:36] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:36] * deepsa (~deepsa@ has joined #ceph
[20:36] * aliguori (~anthony@ has joined #ceph
[20:36] * cking (~king@ has joined #ceph
[20:36] * tnt__ (~tnt@ has joined #ceph
[20:36] * ajm- (~ajm@adam.gs) has joined #ceph
[20:36] * eternaleye (~eternaley@tchaikovsky.exherbo.org) has joined #ceph
[20:36] * __jt__ (~james@jamestaylor.org) has joined #ceph
[20:36] * iggy (~iggy@theiggy.com) has joined #ceph
[20:36] * ivan` (~ivan`@li125-242.members.linode.com) has joined #ceph
[20:36] * sjust (~sam@ has joined #ceph
[20:36] * themgt (~themgt@24-181-215-214.dhcp.hckr.nc.charter.com) has joined #ceph
[20:36] * markl (~mark@tpsit.com) has joined #ceph
[20:36] * MK_FG (~MK_FG@ has joined #ceph
[20:36] * fc (~fc@ has joined #ceph
[20:36] * gregorg_taf (~Greg@ has joined #ceph
[20:36] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[20:36] * Solver (~robert@atlas.opentrend.net) has joined #ceph
[20:36] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[20:36] * sdouglas (~sdouglas@c-24-6-44-231.hsd1.ca.comcast.net) has joined #ceph
[20:36] * Meths (rift@ has joined #ceph
[20:36] * eightyeight (~atoponce@pinyin.ae7.st) has joined #ceph
[20:36] * todin (tuxadero@kudu.in-berlin.de) has joined #ceph
[20:36] * darkfader (~floh@ has joined #ceph
[20:36] * acaos (~zac@209-99-103-42.fwd.datafoundry.com) has joined #ceph
[20:36] * Azrael (~azrael@terra.negativeblue.com) has joined #ceph
[20:36] * jantje (~jan@paranoid.nl) has joined #ceph
[20:36] * thafreak (~thafreak@ has joined #ceph
[20:36] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) has joined #ceph
[20:37] * chutzpah (~chutz@ has joined #ceph
[20:37] * gregaf (~Adium@2607:f298:a:607:e987:dc8f:9b8a:10bd) has joined #ceph
[20:37] * Tv_ (~tv@2607:f298:a:607:b435:f9f6:cf25:1ca2) has joined #ceph
[20:37] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[20:37] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[20:37] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[20:37] * rosco (~r.nap@ has joined #ceph
[20:37] * chuanyu (chuanyu@linux3.cs.nctu.edu.tw) has joined #ceph
[20:37] * yehudasa (~yehudasa@2607:f298:a:607:18ee:8529:6607:79ec) has joined #ceph
[20:37] * mkampe (~markk@2607:f298:a:607:222:19ff:fe31:b5d3) has joined #ceph
[20:37] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) has joined #ceph
[20:37] * cattelan (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[20:37] * cephalobot (~ceph@ps94005.dreamhost.com) has joined #ceph
[20:37] * rturk (~rturk@ps94005.dreamhost.com) has joined #ceph
[20:37] <elder> Whew, that was cathartic.
[20:38] <sagewk> :)
[20:41] <dmick> hm. precise-x86_64 gitbuilder out of space or something? no pkgs there.
[20:42] <dmick> ah no, master failed
[20:42] <dmick> rsync failure
[20:46] <nymous> Tv_: if you have access to sources, please add ssh port selection for creation scripts >_<
[20:46] * yehudasa (~yehudasa@2607:f298:a:607:18ee:8529:6607:79ec) Quit (Ping timeout: 480 seconds)
[20:48] <gregaf> dmick: master failed to build and so we no longer have packages from the last working commit?
[20:49] <dmick> yeah, Sage just fixed it
[20:49] <dmick> rsync/fail-marking confusion
[20:49] <gregaf> bah humbug
[20:49] <gregaf> that is inconvenient when I'm working on installation and need to use the master packages ;)
[20:49] <dmick> they'll be there shortl
[20:49] <dmick> y
[20:49] <gregaf> thanks :)
[20:55] * yehudasa (~yehudasa@2607:f298:a:607:c475:173a:4a0b:a6) has joined #ceph
[20:57] * loicd (~loic@z2-8.pokersource.info) Quit (Ping timeout: 480 seconds)
[21:03] * andrew (~andrewbog@h66-173-127-171.mntimn.dedicated.static.tds.net) has joined #ceph
[21:03] * andrew is now known as andrewbogott
[21:06] <nymous> is metadata pool somehow critical to the cluster?
[21:07] * andrewbogott is back and just as stumped as yesterday.
[21:08] * cking (~king@ Quit (Ping timeout: 480 seconds)
[21:08] <andrewbogott> So, mkcephfs seems happy, but immediately after I can't run ceph -s; it hangs.
[21:08] <andrewbogott> What should I look for? ps -ef | grep ceph makes it look like nothing is actually running.
[21:10] <dmick> mkcephfs initializes the daemons etc.
[21:11] <dmick> you still need to start the service
[21:11] <dmick> service ceph start -a
[21:11] <andrewbogott> ok, so, 'service ceph -a start'
[21:11] <dmick> yah, maybe I reversed those
[21:12] <andrewbogott> hm... I can ssh w/out a password to the other hosts. But ceph cannot. Does that mean it's using some user other than 'root'?
[21:12] <dmick> are you running 'service' as root?
[21:12] <andrewbogott> yep.
[21:13] <dmick> no, it doesn't change uid that I'm aware of
[21:13] <andrewbogott> And it prompts me for root@hostname despite the fact that I just now ssh'd with no trouble.
[21:13] <andrewbogott> Anyway, I'll just do 'service ceph start' on each host. Same thing, right?
[21:13] <nhm> andrewbogott: same user on the client side?
[21:14] <dmick> you really want that passwordless root ssh to work right
[21:14] <andrewbogott> nhm: For various reasons, I /only/ have root login (no user login) on all three servers.
[21:14] <andrewbogott> So, everything is root root root
[21:15] <dmick> so the ~root/.ssh/id*pub key from the originating machine is present in all the target machines' ~root/.ssh/authorized_keys files, right?
[21:16] <andrewbogott> behold, the transcript: http://pastebin.com/FT2unqJi
[21:16] <dmick> (sorry to belabor, I've screwed this up so many times myself)
[21:16] <andrewbogott> As you see, I can ssh directly without a prompt. But ceph prompts me anyway.
[21:16] <nhm> andrewbogott: crazy. I don't think that should screw anything up though....
[21:16] <andrewbogott> dmick: All of my keys are managed by puppet.
[21:17] * LarsFronius (~LarsFroni@95-91-243-240-dynip.superkabel.de) Quit (Read error: Connection reset by peer)
[21:17] <dmick> and "ssh virt1002" works without asking for a password too?
[21:17] <dmick> (i.e. no root@)
[21:17] <andrewbogott> dmick: Yeps.
[21:18] * LarsFronius (~LarsFroni@95-91-243-240-dynip.superkabel.de) has joined #ceph
[21:18] <dmick> well, that's impossible :)
[21:18] <andrewbogott> I agree.
[21:18] <andrewbogott> Although sometimes ssh is clever and refuses logins due to subtle things like .ssh permissions and such.
[21:18] <andrewbogott> Can I get a verbose log from what service ceph -a start is doing?
[21:19] <dmick> it's the same as /etc/init.d/ceph, which you can run with -x anyway
[21:19] <dmick> just looking at what that is
[21:20] <dmick> and yes you can give it -v
[21:20] <andrewbogott> I was thinking more of getting verbose output from ssh.
[21:20] <dmick> which might help
[21:20] <Tv_> nymous: metadata pool is used to store the cephfs metadata, by the mdses
[21:20] <nymous> ok
[21:21] <nymous> i've recreated the cluster, 3 mons, sequential osd names, set max mds to 3
[21:22] <andrewbogott> grrr, it reports running a particular command (ssh virt1002 "cd / ; ulimit -c unlimited ; mkdir -p /var/run/ceph") which I can run from the command line, no problems, no password prompt
[21:22] <dmick> could you have two different ssh commands in different paths?
[21:22] <dmick> or an alias in your interactive rootshell?
[21:24] <andrewbogott> dmick: A colleague suggests that my success in the shell is due to agent forwarding, and that the ceph script is breaking that somehow. Stay tuned...
[21:25] <dmick> ah
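One way to check the agent-forwarding theory is to repeat the login with the agent hidden, which approximates what a script sees when no agent socket is passed through. This is only a sketch under that assumption, not part of the ceph init script. The function name `ssh_without_agent` is hypothetical; `virt1002` is the host named in the discussion.

```shell
# Hypothetical helper: try a key-based login with no ssh-agent available.
# BatchMode=yes makes ssh fail instead of prompting for a password.
ssh_without_agent() {
    host="$1"
    # The subshell keeps the caller's SSH_AUTH_SOCK untouched.
    if (unset SSH_AUTH_SOCK; ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true); then
        echo "ok: $host accepts a key stored on this machine"
    else
        echo "fail: $host login needs the agent (or is otherwise refused)"
    fi
}

# Example (host name taken from the discussion above):
# ssh_without_agent root@virt1002
```

If this reports a failure while an interactive `ssh` succeeds, the interactive login is probably relying on a forwarded agent rather than on a key stored on the originating host.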
[21:25] <Ryan_Lane> it's annoying to have to manage a cluster by running a shell script
[21:26] <Ryan_Lane> can this be done via puppet?
[21:26] <dmick> Ryan_Lane: there's chef
[21:26] <Ryan_Lane> we use puppet...
[21:26] <Ryan_Lane> does the script do anything other than install config files?
[21:26] <dmick> init.d ceph? yes
[21:26] <Ryan_Lane> I'd be way happier implementing this in puppet, than using the shell script
[21:26] <Tv_> Ryan_Lane: if you're building from scratch with puppet, you probably don't want to imitate mkcephfs
[21:27] <Tv_> Ryan_Lane: look at ceph-cookbooks.git instead, it's way way more modern in its approach
[21:27] <andrewbogott> dmick: Ryan_Lane is the aforementioned colleague
[21:27] <nymous> it might be self-suggestion, but looks like it works faster now
[21:27] <nymous> i mean reporting ceph -s etc
[21:28] <Ryan_Lane> are the basics that we need to install the keychain, the config file, and then reload ceph when that occurs?
[21:28] <nhm> Ryan_Lane: Do you hang out in #openstack? I think we've spoken before about some of the work you did at Mediawiki...
[21:28] <Tv_> Ryan_Lane: actual chef is a few hundred lines (and decreasing!), you should be able to redo that with puppet fairly easily; same thing is also being done with Juju charms
[21:28] <Ryan_Lane> nhm: yep. I run Wikimedia Labs
[21:28] <Ryan_Lane> and am an openstack contributor
[21:28] <Ryan_Lane> we're looking at switching from gluster to ceph
[21:28] <nhm> Ryan_Lane: Neat, that was back when I was at the Minnesota Supercomputing Institute. I work for Inktank now. :)
[21:29] <Ryan_Lane> assuming ceph testing shows it's scalable :)
[21:29] <Ryan_Lane> nhm: ah. nice to talk to you again
[21:29] <andrewbogott> Hrm. I was really just trying to get a cursory 3-node install up in order to poke around and do some performance testing. Is it really appropriate to puppetize before I even know if I want ceph and/or what I want from it?
[21:29] <Ryan_Lane> andrewbogott: hm. true.
[21:29] <nhm> Ryan_Lane: Performance testing is my department. Feel free to email me and I'll try to lend a hand if you have problems. mark.nelson@inktank.com
[21:29] <andrewbogott> Ryan_Lane: I can just copy keys around between the hosts, if that won't keep you up at night.
[21:30] <Ryan_Lane> andrewbogott: we manage the root authorized keys globally, unfortunately, so it's not easy to change it
[21:30] <andrewbogott> Ryan_Lane: I can just turn puppet /off/ on these hosts.
[21:30] <Ryan_Lane> andrewbogott: if you disable puppet, install the keys, then run mkcephfs, it'll work
[21:30] <andrewbogott> They're totally disposable.
[21:30] * Ryan_Lane nods
[21:30] <Ryan_Lane> works for me
[21:30] <Ryan_Lane> nhm: cool. will do
[21:31] <nhm> Ryan_Lane: Also, I don't know if we have the resources to do it any time soon, but it wouldn't hurt to put a feature request in for puppet testing/support
[21:31] <Ryan_Lane> nhm: gluster has basically been killing us
[21:31] <andrewbogott> OK. The angel on my shoulder gave me a dirty look when I considered setting up passwordless keys between production servers. I will ignore the angel for now.
[21:31] <Ryan_Lane> so, I'm hoping ceph will not
[21:32] <nhm> Ryan_Lane: What kind of problems have you been having with gluster?
[21:32] <Ryan_Lane> performance issues
[21:32] <Ryan_Lane> data corruption
[21:32] <Ryan_Lane> split brains
[21:32] <Ryan_Lane> basically every bad thing you can think of
[21:33] <Ryan_Lane> and now they've made a backwards compatible change in the new version, requiring complete downtime for my cluster, which is the last straw
[21:33] <Ryan_Lane> *backwards incompatible
[21:33] <nhm> Ryan_Lane: We ran gluster at MSI briefly on an 8000 core cluster. We had some data corruption issues too. It just couldn't handle that many clients.
[21:33] <Ryan_Lane> I have a 5 node cluster and it can't handle it
[21:33] <nhm> Ryan_Lane: performance was pretty good, though the testing we had done at the time was limited.
[21:33] <Ryan_Lane> well, 5 nodes and 180 instances
[21:34] <Ryan_Lane> I can't replace a brick without the entire cluster crashing
[21:34] <nhm> Ryan_Lane: For ceph, how many drives/node and what kind of interconnect?
[21:34] <Ryan_Lane> well, we've split our IO into two clusters
[21:34] <Ryan_Lane> one for instance storage, and another for project data access
[21:35] <Ryan_Lane> the instance storage is on the compute nodes. we'll have 7, with SAS raid 10
[21:35] <nhm> 7 nodes?
[21:36] <Ryan_Lane> the project storage is 4 nodes, each with 24 SATA drives in a raid 10
[21:36] <Ryan_Lane> hm. wait. no. not a raid 10. two raid 6
[21:36] <Ryan_Lane> (I'm still recovering from travel ;) )
[21:36] <nhm> Ryan_Lane: totally understand, I just got back from LA last week and am still grouchy in the mornings.
[21:37] <nhm> Ryan_Lane: what kind of controllers?
[21:37] <Ryan_Lane> the amount of storage on the 7 nodes should be rather small
[21:37] <Ryan_Lane> LSI, using software raid
[21:38] <Ryan_Lane> which is less than ideal, but our hardware was donated. I can't be too picky :)
[21:39] <Fruit> hmm, I wonder how ceph would do with zfs as a backend
[21:39] <andrewbogott> dmick: Do I want each host to have passphraseless access to every other host? Or just access from the one I'm typing shell commands on to the others?
[21:40] <nhm> Ok. Sadly we're having some trouble with the controllers on our internal testing nodes (Dell H700s, which are basically 2960s) which is causing no end of aggravation. We don't have a good sense yet if for large numbers of disks you are better off with a couple of raids and fewer OSDs or going with an IT firmware and doing 24 OSDs per node. Hopefully I'll have some new test equipment soon and I can help answer that question. ;)
[21:40] <Ryan_Lane> heh. well, for the instance storage, it's 8 drives in a raid 10
[21:41] <Ryan_Lane> for the project storage it's 24 in two raid 6
[21:41] <dmick> andrewbogott: it's only necessary from the controller to the others I believe
[21:41] <Ryan_Lane> we're probably going to need to rely mostly on cephfs, as well
[21:42] <dmick> which one is the controller is arbitrary
[21:42] <nhm> Ryan_Lane: Ok. It's possible that you might need to do some tweaking to get good performance out of it. There are a lot of queues/thread pools/etc that we haven't quite figured out optimal settings for yet in high performance environments. Having said that, one of the guys out at Sandia is getting like 2GB/s per node, so it's entirely possible. ;)
[21:42] <andrewbogott> dmick: Cool
[21:42] <Ryan_Lane> ah, that's good performance indeed
[21:43] <Ryan_Lane> nhm: we're very volunteer based, if you're interested in helping us out :)
[21:43] <nhm> Ryan_Lane: he's done a lot of tweaking though, so it may take a bit of playing to make things work right.
[21:44] <dmick> andrewbogott: there's an "ssh" subcommand to init.d/ceph that will ssh to each in turn to test
[21:44] <Ryan_Lane> if we decide to go with ceph, we'll definitely be contributing back, at minimum for puppet modules
[21:45] <nymous> can i set interface for daemons to communicate?
[21:45] <iggy> nymous: yes
[21:45] <nymous> how?
[21:45] <nhm> Ryan_Lane: I'll certainly be around to answer questions. If it looks like it's going to take a ton of work we can cross that bridge when we get to it. :)
[21:45] <iggy> well, you can set the net it uses
[21:45] <Ryan_Lane> heh
[21:46] <nymous> how?
[21:46] <Ryan_Lane> nhm: most likely answering questions will be all we need. we're usually pretty self-sufficient.
[21:47] <andrewbogott> Hey, now ceph -s does something!
[21:47] <dmick> andrewbogott: yay.
[21:48] <andrewbogott> It says it's unable to authenticate :)
[21:48] <nhm> Ryan_Lane: One thing you should know is that ceph writes data out to both the journal and the data disks. If you put them on the same drive you can only expect half the throughput at best, and possibly increased seeks.
[21:48] <Ryan_Lane> ah
[21:49] <Ryan_Lane> that likely changes how we should create our raidsets
[21:49] <nymous> iggy: it should be cluster_addr and cluster_network, but are there any usage examples?
[21:50] <iggy> not that I know of... I was looking in the docs and couldn't even find that (although I knew they existed)
[21:50] <iggy> might try searching the list archives
[21:51] <Ryan_Lane> we can probably split the disks into two raid sets, and have the journal write to one, and have the data write to the other
[21:52] <nhm> Ryan_Lane: Yeah, one of the things that may make sense is to put journals for a couple of OSDs on an SSD (depending on the OSD throughput you expect). If you use say a 128-256GB SSD and only use a <40GB portion of it for the journal, the wear levelling will hopefully be enough to deal with the lowish MTTF.
[21:53] <Ryan_Lane> yeah, we can likely replace two disks per node for SSDs, and stick the rest into the raid 10
[21:53] <nhm> Ryan_Lane: A lot of the testing I was doing earlier was actually with data and journals in separate raid sets. It works fine, assuming your controls don't misbehave (glares at H700s)
[21:54] <nymous> iggy: ok, found on archives
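For the network question above, a minimal sketch of what such a setting could look like, assuming the `public network` and `cluster network` options in the `[global]` section of ceph.conf. The helper name `print_network_fragment` and both subnets are placeholders, not values from this discussion.

```shell
# Hypothetical helper: print a [global] fragment that separates client
# traffic from OSD replication traffic. Both subnets are placeholders.
print_network_fragment() {
    public_net="$1"
    cluster_net="$2"
    cat <<EOF
[global]
    public network = $public_net
    cluster network = $cluster_net
EOF
}

# Example: review the output, then merge it into ceph.conf by hand.
print_network_fragment 192.168.0.0/24 10.0.0.0/24
```

The function only prints text, so nothing is changed on the cluster until the fragment is merged into the real configuration and the daemons are restarted.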
[21:54] <nhm> s/controls/controllers
[21:54] <Ryan_Lane> heh
[21:54] <Ryan_Lane> well, we're using software raid, so the controllers shouldn't be an issue there
[21:55] <nhm> Yeah. Are you guys on lsi2008ish series chips?
[21:56] <Ryan_Lane> I'd need to check. we're using donated cisco UCS systems
[21:57] <nhm> Ryan_Lane: ok. If the raid testing isn't working out, definitely try just straight up jbod too.
[21:58] <Ryan_Lane> ceph can handle the disks itself?
[21:58] <Ryan_Lane> or are you recommending btrfs?
[21:58] <andrewbogott> I'm using ext3 right now, although now I can't find the doc that recommended that.
[21:59] <Ryan_Lane> andrewbogott: probably want to use xfs
[21:59] <andrewbogott> Um... I think I had xfs previously and mkcephfs told me to switch. But perhaps I misunderstood.
[22:00] <andrewbogott> Ryan_Lane: e.g. http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/5048
[22:01] <andrewbogott> So I guess mkcephfs /does/ warn, but it's wrong. Crap.
[22:02] <nhm> Ryan_Lane: You can assign an OSD to each disk. BTRFS and XFS are both options with tradeoffs. I'm seeing higher initial performance with BTRFS, but worse degradation vs XFS. We just started testing Ext4, but don't have a really good feel for it yet. You'll want to make sure that you use "filestore xattr use omap = true" if you go the ext4 route.
[22:02] <nhm> and mount with user_xattrs
[22:02] <nhm> er user_xattr
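The ext4 advice above can be sketched as a ceph.conf fragment plus a mount option. The option text is quoted from nhm; placing it in the `[osd]` section is an assumption, and the helper name, device and mount point are hypothetical placeholders.

```shell
# Hypothetical helper: print the OSD setting nhm recommends for ext4.
# Putting it under [osd] is an assumption; adjust to your ceph.conf layout.
print_ext4_osd_fragment() {
    cat <<'EOF'
[osd]
    filestore xattr use omap = true
EOF
}

print_ext4_osd_fragment

# The data disk would then be mounted with user_xattr, for example:
# mount -o user_xattr /dev/sdX1 /path/to/osd/data   (placeholders)
```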
[22:05] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:08] * Ryan_Lane nods
[22:08] <Ryan_Lane> I think I'm a little too paranoid about btrfs to use it right now
[22:10] <nhm> Ryan_Lane: Understandable. XFS has been pretty stable for us, but you lose some things. One of which is that data has to be written to the journal before requests get acked. For btrfs, it's the journal or data disk, whichever completes first.
[22:11] <Ryan_Lane> we'd need to use a very new kernel for btrfs, though
[22:11] <nhm> Ryan_Lane: Yeah, every kernel there are still fixes being put in place.
[22:11] <nhm> Ryan_Lane: What kernel do you have now?
[22:11] <Ryan_Lane> there's a lot of worries with going with btrfs. I'll stick with a stable filesystem for now :)
[22:11] <Ryan_Lane> we're using stock ubuntu kernel
[22:12] <Ryan_Lane> with precise
[22:12] <Ryan_Lane> and the storage is being dual-purposed with virtualization
[22:12] <nhm> Ryan_Lane: Yeah, I'd stick with xfs with that kernel.
[22:12] <Ryan_Lane> which means every kernel upgrade requires vm migration, or vm downtime
[22:12] <nhm> ugh
[22:13] <Ryan_Lane> the other storage cluster, is thankfully not coupled :)
[22:14] <nhm> Ryan_Lane: well, best of luck with the testing. Definitely let us know how it goes!
[22:14] <Ryan_Lane> thanks. will do :)
[22:15] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:17] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[22:17] * nymous (~darthampe@93-181-201-83.pppoe.yaroslavl.ru) Quit (Quit: RAGING AXE! RAGING AXE!)
[22:18] * sagelap (~sage@ has joined #ceph
[22:26] * JJ (~JJ@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[22:30] * cking (~king@74-95-45-185-Oregon.hfc.comcastbusiness.net) has joined #ceph
[22:39] * danieagle (~Daniel@ has joined #ceph
[23:00] * deepsa_ (~deepsa@ has joined #ceph
[23:00] * andrewbogott_ (~andrewbog@h66-173-127-171.mntimn.dedicated.static.tds.net) has joined #ceph
[23:00] * andrewbogott (~andrewbog@h66-173-127-171.mntimn.dedicated.static.tds.net) Quit (Read error: Connection reset by peer)
[23:00] * andrewbogott_ is now known as andrewbogott
[23:01] * andrewbogott_ (~andrewbog@h66-173-127-171.mntimn.dedicated.static.tds.net) has joined #ceph
[23:01] * andrewbogott (~andrewbog@h66-173-127-171.mntimn.dedicated.static.tds.net) Quit (Read error: Connection reset by peer)
[23:01] * andrewbogott_ is now known as andrewbogott
[23:01] * lofejndif (~lsqavnbok@28IAAF0XI.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:01] * andrewbogott (~andrewbog@h66-173-127-171.mntimn.dedicated.static.tds.net) Quit ()
[23:02] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[23:02] * deepsa_ is now known as deepsa
[23:02] * JJ (~JJ@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[23:07] * loicd (~loic@ has joined #ceph
[23:12] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[23:12] * gregorg_taf (~Greg@ has joined #ceph
[23:23] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[23:36] * Dr_O (~owen@host-78-149-118-190.as13285.net) has joined #ceph
[23:50] * aliguori (~anthony@ Quit (Quit: Ex-Chat)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.