#ceph IRC Log


IRC Log for 2012-08-09

Timestamps are in GMT/BST.

[0:00] <Leseb> darkfader: thank you for pointing the orphan mode out to me, I'll set it up though :)
[0:00] <darkfader> if you want very fast syncup you can also try
[0:01] <darkfader> server iburst burst minpoll 2 maxpoll 4
[0:01] <darkfader> there's some more sick options
[0:01] <darkfader> i love tinker huffpuff
[0:01] <darkfader> (yes, that exists)
[0:01] * maelfius (~Adium@ has joined #ceph
[0:01] * mpw (~mpw@chippewa-nat.cray.com) Quit (Quit: Nettalk6 - www.ntalk.de)
[0:02] <Leseb> it's been a while that I didn't dive into those parameters but thank you :)
[0:02] <darkfader> rather dont ;p
[0:03] <Leseb> haha k
[0:06] * s[X] (~sX]@ppp59-167-154-113.static.internode.on.net) has joined #ceph
[0:09] <nhm> sjust1: any luck with the small IO research?
[0:10] <sjust1> nhm: it appears to be slow apply thread, as always, still exploring
[0:14] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:22] <Leseb> darkfader: can I ask one more question about ceph's mon?
[0:26] * Cube (~Adium@ Quit (Quit: Leaving.)
[0:26] <Leseb> It's recommended to run an odd number of mon servers. 3 MON nodes seem to be good for most clusters. But what happens if I lose one mon? I still have 2, but this could imply wrong decisions and a bad quorum. So what's the best solution in the end? how harmful is the 2 MON situation?
[0:39] <dmick> as I understand it, which is not very well, you can continue to run with 2 monitors, but if one of those remaining monitors fails, the cluster will stop being usable
[0:39] <dmick> the point of having 3 is to be able to tolerate one failure
[0:44] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[0:44] <Leseb> I knew that with 2 mon the cluster works, but if I follow what you said, 4 MON are good too, right?
[0:44] <Leseb> and the ceph wiki says Ceph-MON daemons must be deployed in odd-number sets.
[0:44] <joshd> you need a majority of the monitors to be up for them to function
[0:45] <joshd> so if you have an even number, like 4, you can still only tolerate 1 failure (since 2/4 is not a majority)
[0:45] <Leseb> ok I got that one
[0:46] <Leseb> it brings now the 'real' question
[0:46] <Leseb> how harmful is the 2 MON situation?
[0:47] <Leseb> like a countdown before something really happen?
[0:47] <joshd> you mean running only two?
[0:47] <dmick> it's fine until another monitor goes down
[0:47] <Leseb> johnl: yes
[0:47] <joshd> yeah, what dmick said
[0:48] <dmick> which was the point of having three to start with
[0:51] <Leseb> ok but where is the majority with 2 nodes? or maybe I don't understand it??? :/
[0:51] <gregaf> if the two nodes together agree on something, they are a majority of the nodes in the system
[0:52] <dmick> the cluster knows that it originally had three, even when one has failed
[0:52] <Leseb> and if one agrees and the other one disagrees?
[0:52] <gregaf> they're not adversarial; what actually happens is one of them (the leader) says "we're doing this" and then the rest agree (or, if something has gone wrong, they try to elect a new leader)
[0:53] <maelfius> i think the point of the question was with 2/3 online, you potentially could have a case of election conflict. Node 1 says node 1 is new master, node 2 says node 2 is (if i am interpreting Leseb's question)
[0:54] <maelfius> in the case that the master is the one that failed.
[0:55] <Leseb> ok but if the master fails, and node1 says: "I want to be the new master ok?" and node2 says: "I want to be the new master as well"
[0:55] <Leseb> is that possible?
[0:56] <Leseb> paxos takes care of that consensus?
[0:56] * steki-BLAH (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:56] <gregaf> they have tie-breaking algorithms and will agree on who gets to be leader, assuming there aren't other communication issues
[0:57] <Leseb> ok so conflictual scenarios between the 2 surviving nodes will never happen? (in theory?)
[0:57] <gregaf> correct
[0:58] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[0:59] <Leseb> perfect, thanks for clarifying this :D
[1:00] <maelfius> gregaf: just to clarify then, with a large cluster does it make sense to deploy more than say 3 mon nodes? or it's as effective to run 3 (barring spreading over something akin to PDUs, racks, etc for risk of all mons going down at once) as it would be 5, or 7
[1:01] <gregaf> maelfius: well, with 5 nodes you can survive two monitors failing, with 7 you can survive 3, etc
[1:01] <gregaf> so (if you plan it right) you get more redundancy and availability
[1:01] <maelfius> gregaf: good to know. that makes sense. I just wanted to make sure i understood correctly :)
[1:02] <gregaf> but larger paxos clusters write slower, and right now nobody's cluster is large enough to need the extra read bandwidth, so the answer is just 3 monitors
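The majority rule joshd and gregaf describe above (2 of 3 is a majority, 2 of 4 is not, 5 survive two failures, 7 survive three) is plain integer arithmetic. A minimal sketch, with hypothetical function names that are not part of Ceph:

```python
def quorum_size(total_monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return total_monitors // 2 + 1


def tolerated_failures(total_monitors: int) -> int:
    """How many monitors can fail while a majority is still up."""
    return total_monitors - quorum_size(total_monitors)


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 7):
        print(f"{n} monitors: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

This also shows why an even count buys nothing: going from 3 to 4 monitors raises the quorum size from 2 to 3 but leaves the number of tolerated failures at 1.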
[1:03] <maelfius> gregaf: cool. on another topic, is there an easy way to create an OSDMAP (similar to how I can edit a crushmap text file and import). Or is it more of a use the --createsimple option then modify with ceph osd and ceph mon commands?
[1:04] <gregaf> no, you can't do manual osdmap creation like that right now
[1:05] <maelfius> gregaf: ok, so create a simple osdmap with enough pg_bits and then modify via cmd utils is the "accepted" way.
[1:05] <gregaf> yeah
[1:06] <maelfius> gregaf: thanks for the help! I really appreciate it.
[1:07] <gregaf> np; thanks for looking at Ceph! ;)
[1:08] <Leseb> gregaf: may I ask you a question about ceph's journal?
[1:08] * jefferai (~quassel@quassel.jefferai.org) has joined #ceph
[1:08] <gregaf> what's up?
[1:09] <Leseb> I didn't really get if the journal act as a 'kind of buffer cache' request and also/or act as a common journal like every fs
[1:09] <Leseb> (recovery purpose)
[1:09] <gregaf> the OSD journal serves two purposes
[1:10] <gregaf> 1) it's entirely sequential writes, so it can burst faster than the regular disk can (even if they're the same disk)
[1:10] <gregaf> 2) if the OSD crashes, then on restart it can make sure to get back into a consistent state by replaying the journal
[1:11] <Leseb> perfect clarification
[1:11] <gregaf> the second is the motivating factor, but the potential extra speedup is nice too and something we're working on exploiting more
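The two purposes gregaf lists (sequential journal writes first, then replay after a crash) can be illustrated with a toy write-ahead journal. This is only a sketch under loud assumptions: the class name, the JSON line format, and the in-memory "main store" are all made up for illustration and are not how the Ceph OSD journal is implemented.

```python
import json
import os
import tempfile


class ToyJournaledStore:
    """Toy key/value store: every write is appended to a journal first."""

    def __init__(self, journal_path: str) -> None:
        self.journal_path = journal_path
        self.data: dict[str, str] = {}  # stands in for the main data store

    def write(self, key: str, value: str) -> None:
        # Step 1: sequential append to the journal, flushed to disk.
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps({"key": key, "value": value}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        # Step 2: update the main store.
        self.data[key] = value

    def replay(self) -> None:
        """Get back to a consistent state by re-applying entries in order."""
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, encoding="utf-8") as journal:
            for line in journal:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]


def demo() -> dict[str, str]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journal.log")
        store = ToyJournaledStore(path)
        store.write("a", "1")
        store.write("a", "2")
        store.write("b", "3")
        # Simulate a crash: the in-memory state is gone, the journal survives.
        recovered = ToyJournaledStore(path)
        recovered.replay()
        return recovered.data
```

Calling `demo()` rebuilds the store purely from the journal, which is the recovery property described in point 2.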
[1:11] <Leseb> so I wonder if I lose the journal
[1:12] <Leseb> and if no writes are performed and if every last commit are sync/drop cached into each device
[1:12] <Leseb> does the OSD is down? or simply in a READ-ONLU state?
[1:12] <Leseb> *only
[1:13] <gregaf> if you lose the journal then the OSD is going to fail to write to it and crash
[1:13] <Leseb> but if you don't perform any write
[1:14] <Leseb> you only keep requesting data
[1:14] <Leseb> is that possible?
[1:14] <gregaf> hmm, not sure - sjust?
[1:14] <sjust1> no, the osd frequently needs to persist internal data
[1:14] <Leseb> anyway I will test it soon
[1:14] <gregaf> not that it's particularly helpful though - who wants a storage device you can't write to?
[1:14] <gregaf> if you lose the journal you can issue manual instructions to tell the OSD to create a new one and come back up, but of course there's the potential for data loss
[1:15] <Leseb> it's a different question
[1:15] <Leseb> you still provide a service
[1:15] <Leseb> a read-only one, by a service
[1:15] <sjust1> Leseb: it's a reasonable use case, just not one that we support
[1:16] <Leseb> so basically the default behavior is to crash the OSD
[1:17] <Leseb> so in this case, why does the wiki say that you can run an OSD without a journal?
[1:18] <sjust1> it'll work with btrfs, just not xfs
[1:18] <sjust1> and it can't switch from one to the other while running
[1:19] <Leseb> did you parallel journal write or it's different?
[1:19] <Leseb> *+mean
[1:19] <sjust1> I mean, you can't disable the journal while the osd is still running
[1:20] <sjust1> btrfs has an async snapshot option which lets us handle transactions without a journal, but at a very large cost in commit latency
[1:20] <Leseb> and if I do, that will affect all the OSD running on the storage node too right?
[1:20] <sjust1> it's not a real option anyway, the commit latency cost is huge
[1:21] <Leseb> (I meant changing the option for disabling the journal)
[1:21] <sjust1> you can configure individual osds for different journal configurations
[1:21] <sjust1> but I meant that running without a journal isn't a good idea
[1:22] <Leseb> I can imagine
[1:24] <Leseb> sjust1: last question please
[1:24] <sjust1> Leseb: sure
[1:24] <Leseb> I'm currently checking my MON connection
[1:24] <Leseb> and one seems to be connected to itself
[1:24] <sjust1> I don't quite understand
[1:25] <Leseb> tcp 0 0 ESTABLISHED 10589/ceph-mon
[1:25] <Leseb> actually it appears on 2 MON
[1:26] <sjust1> hmm, gregaf: is that normal?
[1:26] <gregaf> that port number isn't anything of ours
[1:27] <Tv_> the 36454? it's just a dynamic port
[1:27] <Leseb> http://pastebin.com/KT9PN8g7
[1:27] <gregaf> doesn't mean it's not normal though; it might be a syslog daemon, or a client maybe?
[1:27] <Tv_> that could be a "ceph" command line client
[1:28] <Leseb> no ceph command are running
[1:28] <dmick> lsof could find you the process on the other end, right?
[1:28] <Tv_> Leseb: sudo netstat -ntp|grep 36454
[1:29] <Leseb> tcp 0 0 ESTABLISHED 9736/ceph-osd
[1:29] <Leseb> tcp 0 0 ESTABLISHED 10589/ceph-mon
[1:29] <dmick> ...or that :)
[1:29] <Tv_> Leseb: so that's an osd talking to a mon
[1:30] <gregaf> why do you have a ceph-mon running on port 41502 on the same node as another mon?
[1:30] <Leseb> ok, so it's normal to have different connection behavior depending on MON
[1:30] <Leseb> I have no idea
[1:30] <Leseb> It's a fresh cluster
[1:30] <Leseb> it does nothing at the moment
[1:31] <Leseb> no rbd map
[1:31] <dmick> gregaf: that's the peer, right? .12 is local, .6 is remote
[1:31] <gregaf> sorry, I'm confused
[1:33] <Leseb> so?
[1:35] <gregaf> just ignore me :)
[1:35] <Tv_> Leseb: so what's the actual problem?
[1:35] <Leseb> there is no problem
[1:35] <Leseb> I was just wondering why the connection is not identical on each server
[1:35] <Tv_> Leseb: ok so.. "does this look sane?" "it looks believable for a normal system"
[1:36] <Leseb> for me each mon should be connected to all the mon
[1:36] <Tv_> Leseb: because once A connects to B, B can already talk to A
[1:36] <gregaf> the connection probably just died because of no traffic going over it
[1:38] <gregaf> actually, no, they are fully connected
[1:42] * Tv_ (~tv@2607:f298:a:607:d976:71b0:669f:be18) Quit (Quit: Tv_)
[1:42] <Leseb> ok the other connection are probably from OSD then
[1:43] <Leseb> dmick: gregaf :Tv_ many many thanks guys for all your precious clarifications :D
[1:43] <Leseb> truly appreciated
[1:44] <gregaf> welcome!
[1:45] <dmick> yep, no problem Leseb
[1:46] * EmilienM (~EmilienM@arc68-4-88-173-120-14.fbx.proxad.net) has left #ceph
[2:02] * Kioob (~kioob@luuna.daevel.fr) Quit (Ping timeout: 480 seconds)
[2:10] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[2:27] * tnt (~tnt@45.124-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[2:49] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[2:53] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[3:10] * maelfius (~Adium@ Quit (Quit: Leaving.)
[3:47] <elder> Is ceph tracker down, or do I have something misconfigured.
[3:47] <dmick> seems like it has issues
[3:49] <elder> OK, thanks.
[3:50] * chutzpah (~chutz@ Quit (Quit: Leaving)
[3:57] * adjohn (~adjohn@ Quit (Quit: adjohn)
[4:55] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[4:58] * maelfius (~Adium@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[5:13] * deepsa (~deepsa@ has joined #ceph
[6:21] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Connection reset by peer)
[6:34] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:56] * dmick (~dmick@2607:f298:a:607:4cd9:fe1c:42bd:84be) Quit (Quit: Leaving.)
[7:36] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[7:43] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[8:11] * tnt (~tnt@45.124-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:19] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[9:09] * EmilienM (~EmilienM@arc68-4-88-173-120-14.fbx.proxad.net) has joined #ceph
[9:18] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:24] * Leseb (~Leseb@ has joined #ceph
[9:25] * tnt (~tnt@45.124-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:34] * pmjdebruijn (~pmjdebrui@overlord.pcode.nl) has joined #ceph
[9:34] <pmjdebruijn> hi guys... wasn't there an initiative to maintain the ceph bits in the 3.4 kernel?
[9:39] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:40] * s[X] (~sX]@ppp59-167-154-113.static.internode.on.net) Quit (Remote host closed the connection)
[9:57] * BManojlovic (~steki@ has joined #ceph
[10:05] * fiddyspence (~fiddyspen@94-192-234-112.zone6.bethere.co.uk) has joined #ceph
[10:18] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[10:31] * johnl (~johnl@2a02:1348:14c:1720:203a:8363:b2a4:caec) Quit (Read error: Operation timed out)
[10:35] * johnl (~johnl@2a02:1348:14c:1720:1531:7033:ff8a:c766) has joined #ceph
[11:42] * Deuns (~kvirc@169-0-190-109.dsl.ovh.net) has joined #ceph
[11:43] <Deuns> hello all
[12:03] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[12:11] <alexxy[home]> hi all!
[12:11] <alexxy[home]> is there any chance to have ceph work over an infiniband network
[12:11] <alexxy[home]> ?
[12:16] * alexxy[home] (~alexxy@2001:470:1f14:106::2) Quit (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[12:16] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[12:59] <tnt> isn't there ip over infiniband ?
[13:00] <liiwi> yes there is
[13:00] <tnt> so, that should just work right ?
[13:01] <liiwi> have to check if it conflicts with anything
[13:02] <liiwi> and if the other end can handle it
[14:17] <NaioN> tnt: we have it working
[14:17] <NaioN> alexxy: yes works fine
[14:17] <NaioN> but with IPoIB
[14:17] <NaioN> so no RDMA :)
[14:39] <tnt> NaioN: and did you compare perf of ceph between IPoIB and classic ethernet gigabit ?
[14:39] <tnt> My feeling so far is that ceph itself is the bottleneck rather than the network. But hopefully people are working on fixing that :p
[14:54] <tnt> Is there a way to prevent 'recovery' ? I'm going to shut down an osd while moving some data and I don't want the data to be moved around / re-replicated ...
[14:55] <pmjdebruijn> tnt: with GigE it's fairly easy to hit that limit, it's only ~120mb/sec
[14:55] <pmjdebruijn> well or 110mb/sec
[14:56] <pmjdebruijn> tnt: even if ceph is currently the bottleneck, it makes sense to account for future scaling to have more bandwidth than GigE
[14:56] <pmjdebruijn> IB offers a fairly cheap alternative to 10GigE
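The ceiling pmjdebruijn quotes for GigE (roughly 110 to 120 MB/s) follows from framing overhead on a 1 Gbit/s link. A rough sketch of the arithmetic, assuming a standard 1500-byte MTU and IPv4 plus TCP headers without options; none of these parameters are stated in the log, and real throughput also depends on the NIC, the stack, and the workload:

```python
LINE_RATE_BITS_PER_S = 1_000_000_000  # Gigabit Ethernet line rate

MTU = 1500                            # bytes of IP packet per frame (assumed)
IP_TCP_HEADERS = 20 + 20              # IPv4 + TCP headers, no options (assumed)
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12   # preamble, header, FCS, inter-frame gap


def raw_mb_per_s(line_rate_bits: int = LINE_RATE_BITS_PER_S) -> float:
    """Line rate expressed in megabytes per second, ignoring all overhead."""
    return line_rate_bits / 8 / 1e6


def tcp_goodput_mb_per_s(line_rate_bits: int = LINE_RATE_BITS_PER_S) -> float:
    """Best-case TCP payload rate once per-frame overhead is subtracted."""
    payload = MTU - IP_TCP_HEADERS
    on_wire = MTU + ETHERNET_OVERHEAD
    return raw_mb_per_s(line_rate_bits) * payload / on_wire


if __name__ == "__main__":
    print(f"raw line rate:     {raw_mb_per_s():.1f} MB/s")
    print(f"best-case goodput: {tcp_goodput_mb_per_s():.1f} MB/s")
```

Under these assumptions the raw rate is 125 MB/s and the best-case TCP goodput lands a little under 119 MB/s, which is consistent with the figures mentioned in the channel.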
[14:58] <pmjdebruijn> NaioN: what did we push over ipoib using synthetic (non-ceph) benchmarks? 6gb/sec
[15:01] <tnt> pmjdebruijn: oh yes, I'm not saying otherwise, I was just wondering if you had compared the performance impact on ceph. Right now I'm just using dual-gbit.
[15:03] <pmjdebruijn> oh right
[15:03] <pmjdebruijn> I'm not sure, we extensively tested that
[15:31] <tnt> Is there a way to set settings like 'First A query (valid response)
[15:31] <tnt> =============
[15:31] <tnt> argh, wrong cut&paste buffer
[15:32] <tnt> settings like "mon osd down out interval" at run time ?
[16:04] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[16:04] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[16:12] * EmilienM (~EmilienM@arc68-4-88-173-120-14.fbx.proxad.net) has left #ceph
[16:14] * fiddyspence (~fiddyspen@94-192-234-112.zone6.bethere.co.uk) Quit (Quit: Leaving.)
[16:20] <nhm> good morning #ceph
[16:20] <nhm> tnt: yeah, we are indeed working hard on that. :)
[16:21] <tnt> nhm: I have faith :)
[16:21] * jluis is now known as joao
[16:22] <tnt> Btw, would you know if it's possible to change "mon osd down out interval" from the command line using the 'ceph' utility ?
[16:22] <nhm> tnt: hrm, not that I know of, but someone else might.
[16:23] <joao> I believe we are able to inject options onto a live system
[16:23] <nhm> joao: don't you need to use ceph to do that though?
[16:23] <joao> not sure if all the options are eligible though
[16:23] <joao> yes
[16:23] <joao> but that's what he's talking about, I think
[16:23] <nhm> oh, I misread what he said. I thought he said without the ceph utility.
[16:24] <tnt> yes. I'd like something like ceph set mon osd down out interval = 180 or something.
[16:24] <joao> let me take a quick look on the monitor class, given that it's pretty much in front of me ;)
[16:26] <joao> there is an option on the ceph tool that should be 'injectargs' and takes a second argument, which I'm assuming would be something similar to var_name=value
[16:26] <joao> but then again, are you sure you want to mess with that?
[16:26] <joao> I have no idea what the side effects may be
[16:26] <tnt> It's a test cluster, there is no important stuff on it.
[16:26] <nhm> tnt: it'd be something vaguely like "ceph mon tell '(mon name)' injectargs <command> <value>
[16:26] <nhm> "
[16:27] <tnt> But it may be very useful in the future because I definitely see scenarios where I know I have to take down a few osd for 30 min or so and I don't want to trigger recovery ...
[16:28] <joao> tnt, yeah, that makes sense
[16:28] <joao> nhm, I think you are right
[16:29] <nhm> also, you could try: "ceph injectargs \* '--mon_osd_down_out_interval=180'"
[16:29] <nhm> actually, that probably won't work anymore.
[16:31] <nhm> probably "ceph mon tell \* injectargs '--mon_osd_down_out_interval=180'"
[16:31] <tnt> it didn't complain ... but how can I check the current value ?
[16:31] <joao> tnt, I would assume it would pop up on the logs
[16:32] <joao> I don't think there's a command to obtain the current value for an option
[16:32] <nhm> joao: sage wrote something recently to do that afaik, but I can't remember what it's called.
[16:33] <joao> oh, must have missed it on the commits list then
[16:33] <joao> I've been working on an older version of the code, so I'm not that aware of recent changes
[16:34] <joao> btw, nhm, have you read this yet? http://www.wired.com/threatlevel/2012/08/tv-amazon-assault-rifle/
[16:34] <nhm> joao: I think it was a month or two ago.
[16:34] <joao> I found it amazingly amusing :p
[16:34] <joao> let me check then
[16:37] <nhm> joao: wow, crazy
[16:38] <joao> btw, I don't find anything of the sorts, to obtain the value of options
[16:38] <joao> but then again, don't have the time to go on a hunt right now ;)
[16:39] <nhm> Yeah, it's also possible it never made it out of a wip or something.
[16:39] <nhm> I think I maybe used it once a while ago.
[16:40] <tnt> nhm: didn't work. After 300 sec (default value) the 'down' went to 'out'.
[16:41] <nhm> tnt: hrm, which command did you use?
[16:41] <tnt> ceph mon tell \* injectargs '--mon_osd_down_out_interval=180
[16:41] <joao> I have a feeling that the osd needs to be aware of which options to watch out for
[16:41] <tnt> the other one ( ceph injectargs \* '--mon_osd_down_out_interval=180' ) makes an error in the logs
[16:42] <nhm> yeah
[16:42] <nhm> did you have the closing ' on the end?
[16:42] <tnt> yes, it's just a cut&paste error when I put it in IRC
[16:44] <nhm> ok. Not really sure why it didn't work.
[16:45] <joao> well, looks like (using the FileStore as a reference since we use injectargs for debug purposes), the OSDMonitor would need to be on the watch out for configuration changes
[16:46] <joao> oh wait
[16:46] <joao> actually it doesn't
[16:46] <joao> or shouldn't
[16:53] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[16:59] <nhm> for those of you that like NASA stuff: http://www.youtube.com/watch?v=r7UfMq-b0Uo
[17:16] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:28] * mpw (~mpw@chippewa-nat.cray.com) has joined #ceph
[17:28] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[17:44] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:46] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[17:51] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:53] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[17:53] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:57] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[17:57] * aliguori_ (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[17:58] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[18:00] * tnt (~tnt@45.124-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:01] * Deuns (~kvirc@169-0-190-109.dsl.ovh.net) Quit (Quit: KVIrc 4.0.0 Insomnia http://www.kvirc.net/)
[18:05] <Leseb> hi guys
[18:05] <nhm> hello!
[18:06] <Leseb> there are 2 journaling methods
[18:06] <Leseb> according to the wiki: writeahead mode and parallel mode
[18:07] <Leseb> parallel mode seems to be only possible with btrfs
[18:07] <Leseb> so I assume that writeahead mode is the default one, so why this file shows a false boolean for both? https://github.com/ceph/ceph/blob/master/src/common/config_opts.h
[18:07] <nhm> Leseb: that is correct
[18:08] <nhm> Leseb: hrm, maybe just a bug. Want to open a ticket?
[18:09] <Leseb> can't be a bug because the journal is functional one way or another ^^
[18:09] <gregaf> that file is the defaults, and the journaling mode is set when the OSD figures out what kind of filesystem it's running on
[18:10] <Leseb> ok so everything except btrfs gets the writeahead method?
[18:10] * maelfius (~Adium@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[18:10] <gregaf> yep
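The behavior gregaf confirms here (both options default to false, and the mode is picked from the filesystem when the OSD starts) can be sketched as follows. This is a hypothetical illustration only: the function and parameter names are invented, and the handling of explicitly forced modes is an assumption, not something taken from the Ceph source.

```python
def pick_journal_mode(fs_type: str,
                      force_writeahead: bool = False,
                      force_parallel: bool = False) -> str:
    """Return the journal mode for an OSD on the given filesystem.

    Both flags default to False, meaning "not forced": the mode is then
    derived from the filesystem type, as described in the channel.
    """
    if force_writeahead and force_parallel:
        raise ValueError("choose at most one journal mode")
    if force_parallel and fs_type != "btrfs":
        # Assumption: parallel mode is only possible on btrfs.
        raise ValueError("parallel journaling requires btrfs")
    if force_writeahead:
        return "writeahead"
    if force_parallel:
        return "parallel"
    # Nothing forced: btrfs gets parallel, everything else writeahead.
    return "parallel" if fs_type == "btrfs" else "writeahead"
```

With no flags set, `pick_journal_mode("xfs")` selects writeahead and `pick_journal_mode("btrfs")` selects parallel, which is why a default of false for both is not contradictory.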
[18:11] <Leseb> thanks for clarification
[18:15] <nhm> gregaf: it is a bit confusing that both of them are false and then something gets set automagically behind the scenes.
[18:15] <Leseb> nhm: +1
[18:15] <gregaf> tnt: I think the correct format is "ceph injectargs '--mon-osd-down-out-interval=180'"
[18:16] <gregaf> but you'll want to make sure to send it to the leader
[18:16] * Tv_ (~tv@ has joined #ceph
[18:17] <Leseb> how long does the data stay in the journal? before getting written to the OSD?
[18:17] <nhm> Leseb: hence my bug comment. :)
[18:17] <gregaf> nhm: Leseb: better that than both of them being true ;)
[18:18] <Leseb> :)
[18:18] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[18:19] <Leseb> any ideas about the journal question?
[18:19] <gregaf> oh, depends on what's happening in the system at the time
[18:20] <gregaf> the OSD doesn't blank out the journal or anything, it just overwrites it, so however long it takes to get through the rest of the journal space and come back to that point
[18:21] <gregaf> if you mean how long does it take to get into the main data store (in writeahead mode), then that varies but is limited by the filestore_max_sync_interval
[18:25] <Leseb> gregaf: that was the question perfect :D
[18:31] * EmilienM (~EmilienM@ede67-1-81-56-23-241.fbx.proxad.net) has joined #ceph
[18:36] * fiddyspence (~fiddyspen@94-192-234-112.zone6.bethere.co.uk) has joined #ceph
[18:38] * Cube (~Adium@ has joined #ceph
[18:40] <Tv_> FYI *: teuthology the schedule/queue/worker part is broken, looking at it
[18:40] * EmilienM (~EmilienM@ede67-1-81-56-23-241.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[18:41] * BManojlovic (~steki@ has joined #ceph
[18:42] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[18:47] * EmilienM (~EmilienM@ has joined #ceph
[18:50] * adjohn (~adjohn@ has joined #ceph
[18:57] <Tv_> should be better now, let me know if you see trouble
[19:02] * Leseb (~Leseb@ Quit (Quit: Leseb)
[19:03] * glowell2 (~Adium@ Quit (Quit: Leaving.)
[19:07] * maelfius (~Adium@ has joined #ceph
[19:07] * lofejndif (~lsqavnbok@9YYAAILJ3.tor-irc.dnsbl.oftc.net) has joined #ceph
[19:15] * glowell (~Adium@2607:f298:a:607:7982:fdfd:2e6b:ee39) has joined #ceph
[19:20] * chutzpah (~chutz@ has joined #ceph
[19:21] * xander (527c72a1@ircip3.mibbit.com) has joined #ceph
[19:23] <xander> hi all
[19:23] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[19:24] * glowell (~Adium@2607:f298:a:607:7982:fdfd:2e6b:ee39) Quit (Read error: Connection reset by peer)
[19:25] <nhm> sjust1: could you repeat what you were talking about in the meeting? I couldn't understand you well, but it sounded like it was very interesting.
[19:27] * fiddyspence1 (~fiddyspen@94-192-234-112.zone6.bethere.co.uk) has joined #ceph
[19:28] <xander> has anyone already encountered a systematic client kernel crash (on ceph_d_prune) using rsync ?
[19:28] <xander> (ceph 0.49 on 3.2.0-27-generic ubuntu kernel)
[19:29] * glowell (~Adium@2607:f298:a:607:7982:fdfd:2e6b:ee39) has joined #ceph
[19:29] <sjust1> nhm: basically, we reply with safe/complete when the commit happens, but before the apply happens
[19:29] <sjust1> this is correct, but it means the client believes the op to be complete before it actually is
[19:30] <sjust1> the end result is that the op is still in the apply queue which allows the filestore queues to grow beyond what the client considers the max number of outstanding ops
[19:30] <sjust1> shouldn't affect throughput, but it does affect latency
[19:30] <sjust1> mostly, it just adds noise to measurements
[19:31] <gregaf> elder: does that crash from xander sound familiar?
[19:31] <gregaf> it's nothing that I know about, but there were a lot of fixes going on at and after that point
[19:32] <nhm> sjust1: commit is journal and apply is data disk?
[19:32] <sjust1> commit means won't be lost, apply means readable
[19:32] <sjust1> in practice, commit is journal, apply is the filesystem operation
[19:33] * aliguori_ (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Quit: Ex-Chat)
[19:33] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:33] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[19:34] <nhm> sjust1: yeah. IE basically the client is (correctly) informed when the data is written to something on the server and no longer needs to be involved, so it pushes out more data. The journal is keeping up but the filestore is struggling and so latencies grow long?
[19:34] <sjust1> yeah
[19:34] <sjust1> again, shouldn't really affect throughput
[19:34] * fiddyspence (~fiddyspen@94-192-234-112.zone6.bethere.co.uk) Quit (Ping timeout: 480 seconds)
[19:35] * lofejndif (~lsqavnbok@9YYAAILJ3.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[19:35] <nhm> that also lets us overlap writes to the journal while other writes to the filesystem are happening without exceeding our maximum in flight ops.
[19:35] <sjust1> yeah, but that's what the max in-flight-ops concept is for in the first place
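The effect sjust1 describes (the client counts an op as done at commit time, so ops waiting to be applied can pile up beyond the client's in-flight limit) can be shown with a small deterministic simulation. This is a toy model with invented names and made-up rates, not a description of the real OSD or filestore queues:

```python
def peak_unapplied_ops(total_ops: int,
                       max_in_flight: int,
                       commits_per_tick: int,
                       applies_per_tick: int,
                       ack_on: str) -> int:
    """Return the largest number of submitted-but-unapplied ops seen.

    ack_on="commit": the client is acked once the op is journaled.
    ack_on="apply":  the client is acked only once the op is applied.
    """
    if ack_on not in ("commit", "apply"):
        raise ValueError("ack_on must be 'commit' or 'apply'")
    submitted = committed = applied = 0
    peak = 0
    while applied < total_ops:
        acked = committed if ack_on == "commit" else applied
        # The client tops up its window of unacknowledged ops.
        window_free = max_in_flight - (submitted - acked)
        submitted += min(window_free, total_ops - submitted)
        # The journal commits quickly (sequential writes).
        committed = min(submitted, committed + commits_per_tick)
        # The filestore applies committed ops more slowly.
        applied = min(committed, applied + applies_per_tick)
        peak = max(peak, submitted - applied)
    return peak


if __name__ == "__main__":
    for mode in ("commit", "apply"):
        peak = peak_unapplied_ops(100, 4, 4, 1, mode)
        print(f"ack on {mode}: peak unapplied ops = {peak}")
```

With a fast journal and a slow apply stage, acking on commit lets the backlog grow far past `max_in_flight`, while acking on apply keeps it bounded by the window. That matches the point that the difference shows up as latency and measurement noise rather than throughput.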
[19:38] <elder> xander, gregaf it doesn't sound familiar to me.
[19:38] <xander> ok
[19:38] <nhm> fair enough. Are you going to hack it up to only ack on apply?
[19:38] * glowell1 (~Adium@ has joined #ceph
[19:39] <gregaf> sjust1: what's your plan for this hack, btw? IIRC you'll need to muck around in both the OSD and the librados stuff
[19:39] <sjust1> ...just going to delay the commit message until apply time
[19:39] <sjust1> done
[19:40] <nhm> nice
[19:40] <gregaf> sjust1: oh, so *really* hacky then, okay
[19:41] <sjust1> yeah... solving this for real would probably mean adding an additional "we're all done here" reply to rados
[19:41] <sjust1> or allowing the apply to follow the commit
[19:41] * glowell (~Adium@2607:f298:a:607:7982:fdfd:2e6b:ee39) Quit (Ping timeout: 480 seconds)
[19:42] <gregaf> yeah, I thought you were allowing the apply and commit to come in any order
[19:42] <sjust1> that sounds like actual work
[19:43] <sjust1> and would foul up pretty much all current librados users
[19:43] <sjust1> I just want simpler measurements on my hacky test branch :)
[19:43] <xander> alright, i'm gonna fill a bug report about this
[19:44] <nhm> gregaf: shush, it's awesome. ;)
[19:50] * sjust1 (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) Quit (Quit: Leaving.)
[19:54] * sjust (~sam@ has joined #ceph
[19:59] * lofejndif (~lsqavnbok@19NAABMG4.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:02] * danieagle (~Daniel@ has joined #ceph
[20:09] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[20:10] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:10] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[20:12] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:14] * xander (527c72a1@ircip3.mibbit.com) has left #ceph
[20:15] * Leseb_ (~Leseb@ has joined #ceph
[20:16] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[20:21] * dmick (~dmick@ has joined #ceph
[20:23] * Leseb_ (~Leseb@ Quit (Ping timeout: 480 seconds)
[20:26] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:36] <yehudasa> gregaf: can you take a look at wip-2504?
[20:51] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[21:17] <alexxy> NaioN: any plans for rdma?
[21:18] <alexxy> NaioN: if it will be here ceph can be good replacement for lustre
[21:20] * cattelan (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) has joined #ceph
[21:21] <nhm> alexxy: It's something people ask us for on a semi-regular basis. Not sure where (or if) it is on the roadmap yet.
[21:22] <alexxy> nhm: we planning large hpc cluster here
[21:22] <alexxy> and i still haven't decided about the home fs
[21:22] <nhm> alexxy: I used to work for a supercomputing center. How big are you talking?
[21:22] <alexxy> (large means ~5k nodes and about 2-10Pb for home)
[21:23] <nhm> alexxy: IB for home too?
[21:23] <alexxy> in general hw requirements for distributed fs are same
[21:23] <alexxy> nhm: ib for all
[21:23] <darkfader> ib for all is a good motto hehe
[21:24] <alexxy> it will have management 1G interfaces
[21:24] <nhm> When do you plan to deploy?
[21:24] <dmick> ib uber alles
[21:24] <alexxy> but all trafic should go for ib
[21:24] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[21:24] <nhm> alexxy: yeah, that was what we did too.
[21:24] <alexxy> nhm: 13Q4 or 14Q1
[21:25] <nhm> alexxy: hrm, that's probably right around when we will be trying to get cephfs production stable.
[21:25] <alexxy> i still haven't chosen a dfs
[21:25] <alexxy> candidates are lustre ceph and fgfs (fraunhofer fs)
[21:26] <alexxy> ceph has a big advantage that it works with vanilla kernels
[21:26] <nhm> alexxy: If you want to give cephfs a shot, just be aware that it may still be actively getting bug fixes. :)
[21:26] <alexxy> fgfs seems too
[21:26] <alexxy> but it isn't open source
[21:26] <nhm> alexxy: lustre is terrible for /home honestly.
[21:27] <alexxy> nhm: i know that lustre is terrible
[21:27] <nhm> alexxy: it's fast for scratch space, but I'd never want to run /home on it. Expect a decent size deployment to be going down at least once a month.
[21:27] <alexxy> but its like a de facto standard
[21:27] <nhm> alexxy: For scratch certainly. Not sure how many sites are using it for home these days.
[21:28] <alexxy> nhm: also it doesnt have data redundancy
[21:28] <alexxy> cant rebalance osd
[21:28] <alexxy> and so on
[21:28] <nhm> alexxy: Another option might be panasas. Not sure if they have their IB stuff out yet.
[21:28] <alexxy> nhm: i'd say about 80% of top500
[21:28] <alexxy> panasas is closed source
[21:29] <nhm> alexxy: I'd say that's true for scratch partitions.
[21:29] <nhm> alexxy: I doubt it's true for /home
[21:29] <alexxy> some can use gpfs
[21:29] <darkfader> <random mailing list smartass>why can't you use rsync</random mailing list smartass>
[21:29] <alexxy> but its buggy with parallel i/o
[21:30] <nhm> alexxy: what site are you at?
[21:30] <darkfader> alexxy: what's the downside of the fraunhofer fs? I remember looking at it
[21:30] <gregaf> rsync works great for swift! </digs>
[21:30] <darkfader> (thinking it can't be worse than lustre)
[21:30] <dmick> snark level 8 and climbing, captain
[21:30] <alexxy> darkfader: i didn't try it
[21:30] <alexxy> so i dont have any opinions about this fs
[21:31] <nhm> darkfader: I never actually tried it. Someone gave a presentation on it at LUG2011 and was saying it was actually faster than Lustre for their application.
[21:32] <tnt> nd I get the following output...maybe someone can spot something...
[21:33] <tnt> damnit, sorry for the noise, damn touch pad ...
[21:33] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[21:33] <nhm> alexxy: fyi I managed about 900TB of lustre storage and I kept it pretty stable, but apparently once I left it was crashing weekly due to small IO traffic overloading the MDS. Be warned if you have users that will cause a lot of metadata traffic.
[21:35] <alexxy> nhm: well if ceph will use ibverbs for data traffic i think i will choose it over lustre
[21:35] <alexxy> if it will be stable enough
[21:35] <nhm> alexxy: yeah, you should talk to our business folks. :)
[21:37] <alexxy> he he
[21:38] <nhm> alexxy: the more people (paying customers!) who want infiniband, the more likely it is to happen sooner.
[21:38] <alexxy> well i will think about this
[21:38] <alexxy> =)
[21:39] <alexxy> also may be about support
[21:40] <alexxy> nhm: btw is ceph suitable for /home and /scr on a cluster?
[21:42] <nhm> alexxy: ceph is composed of multiple parts: An object storage system, a block device layer, an S3 object storage layer, and a posix layer (kernel client and fuse, with clustered MDS)
[21:43] <alexxy> i mean posix fs with kernel client
[21:43] <nhm> alexxy: of those, we've primarily been focusing on the object storage system and the S3 and block device layers. The posix filesystem layer hasn't been getting as much attention as that's not what our current customers are using.
[21:43] <alexxy> rbd and s3 look more suitable for virtualization
[21:43] <gregaf> our customers are only not using it because it's not done, though!
[21:44] <gregaf> the posix part is harder and built on top of RADOS, so we've been focusing on RADOS and the cheaper extensions to it
[21:44] <alexxy> and we get closed circle =)
[21:44] <nhm> alexxy: I think we plan to start working more heavily on the filesystem layer later this year, but I don't know for sure.
[21:45] <nhm> alexxy: Plans change quickly though if there is the right kind of interest, which is why you should talk to our business guys. ;)
[21:46] <nhm> alexxy: Are you putting the system out for bid?
[21:47] <alexxy> not currently
[21:47] <alexxy> we're at planning stage right now
[21:47] <alexxy> at least until the end of this year
[21:47] <nhm> Ok. I was just curious since a lot of vendors want to sell you the whole package.
[21:49] <nhm> Any idea what OS you'll use?
[21:49] <nhm> (or are planning on using)?
[21:49] <alexxy> well they usually want to sell everything even if i don't need it
[21:49] <alexxy> we're going to use our gentoo build
[21:49] <nhm> yeah, that's definitely true. :)
[21:49] <alexxy> that we use on other clusters here
[21:51] <nhm> Interesting! We were always SLES/CentOS
[21:51] <nhm> Too many old/broken applications the scientists used would only work on (sometimes old) versions of centos.
[21:52] <alexxy> well most of apps works here like a charm
[21:52] <alexxy> like CAD/CAE
[21:52] <alexxy> QM
[21:52] <alexxy> MD
[21:52] <alexxy> and so on
[21:52] <alexxy> even old ones
[21:54] <nhm> alexxy: ANSYS was always a pain in the butt for us on alternate platforms, though it wasn't the worst.
[21:54] <nhm> alexxy: I kept trying to get people to move to OpenFOAM.
[21:56] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[21:57] <nhm> alexxy: we also had a bunch of programs some of the biologists wanted that used old versions of FSL.
[21:57] <jefferai> I'm curious, has anyone here looked at zfs as a backend for ceph rather than btrfs, on Linux?
[21:57] <nhm> jefferai: I've wanted to, but haven't had time yet.
[21:57] <dmick> last I heard zfs doesn't exist on linux?
[21:58] <jefferai> dmick: you heard wrong
[21:58] <jefferai> zfs doesn't exist in the Linux kernel
[21:58] <jefferai> that's all
[21:58] <nhm> dmick: the LLNL guys created a kernel client and there's been a fuse one around.
[21:58] * Fruit uses zfs on linux
[21:58] <jefferai> dmick: they simply can't fold it into the kernel sources
[21:58] <jefferai> because of license incompatibilities
[21:58] <jefferai> but they decided to just make it a module, which gets around that
[21:58] <jefferai> I've been using it for about a week, and color me pleased
[21:59] <dmick> ok
[21:59] <jefferai> and seems far more stable than btrfs, even now
[21:59] <jefferai> plus more features
[21:59] <darkfader> it works quite ok now, last year i still got some crashes and none more since
[21:59] <nhm> jefferai: yeah, I think it has a lot of potential if the patent issues can be worked out.
[21:59] <jefferai> patent issues?
[21:59] <darkfader> billion times more stable-ish than btrfs :)
[21:59] <jefferai> thought it was just licensing issues
[21:59] <Fruit> jefferai: there's a #zfsonlinux channel btw
[21:59] <jefferai> Fruit: I figured, though wasn't sure which network
[22:00] <Fruit> oh freenode
[22:00] <darkfader> there used to be patent issues because zfs was really a sick copy of netapps cow, but i think that's settled already
[22:00] <jefferai> ah, okay
[22:00] <darkfader> (with sun gone... lol)
[22:00] <dmick> "copy of netapps cow"...lol
[22:00] <jefferai> so my hardware is on order and I've been looking forward to ceph but not btrfs
[22:00] <darkfader> dmick: moooo :)
[22:00] <dmick> that was a nice try on netapp's part
[22:01] <Fruit> wafl and all that
[22:01] <jefferai> is it theoretically possible to use zfs as a backing FS rather than btrfs?
[22:01] <darkfader> dmick: sun hated netapp because the netapp guys left sun to do their thing. and so they just had to try to get back at them (imho)
[22:01] <jefferai> AFAIK it supports all the needed xattrs and so on
[22:01] <Fruit> wafl must be the dumbest name for an fs that I ever heard
[22:02] <dmick> there was bad blood btw sun and netapp, but the suit was nonsense
[22:02] <dmick> cow is obvious and prevalent
[22:02] <nhm> jefferai: I think the zfs linux port does support all the xattrs needed.
[22:02] <jefferai> plus it supports snapshotting and the like
[22:02] <jefferai> "
[22:02] <jefferai> Tip
[22:02] <jefferai> We recommend configuring Ceph to use the XFS file system in the near term, and btrfs in the long term once it is stable enough for production.
[22:02] <jefferai> "
[22:02] <jefferai> it seems like zfs is already much more stable than btrfs
[22:02] <dmick> anyway: jefferai: I don't know of any technical reason why it couldn't work. ceph has some btrfs awareness, and so it would need to be tweaked to recognize zfs if similar optimizations exist (which I suspect, but don't know all the details)
[22:02] <jefferai> that's what got me thinking about this question
[22:03] <joshd> I'd guess it wouldn't take much work to use zfs snapshots instead of btrfs snapshots
[22:03] <dmick> basically no one's looked very closely that I've heard of. but then again I wasn't aware zfs was that mature on Linux
[22:03] <Fruit> jefferai: there's some remaining bugs wrt memory management
[22:03] <jefferai> dmick: Fruit clearly knows more than I do
[22:03] <jefferai> but I have looked around and have seen people talking about having been using it for two years with great success
[22:03] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:03] <nhm> jefferai: yeah, I think the primary reason we haven't done a lot of testing with it is simply that it's not included in the kernel.
[22:03] <dmick> that's great news. more the merrier.
[22:04] <jefferai> plus I sure
[22:04] * Fruit even has his rootfs on zfs now
[22:04] <jefferai> nhm: dmick: I'm potentially quite interested in helping/working on getting that working
[22:04] <dmick> let us know how we can help!
[22:04] <jefferai> hah, that's my question to you guys
[22:04] <nhm> jefferai: Well, first thing would be to just point ceph at zfs and see what happens. :_)
[22:04] <jefferai> first I actually have to get Ceph up and running
[22:04] <nhm> :)
[22:04] <dmick> download the ceph source, set up zfs, run it...jinx
[22:05] <jefferai> I've been waiting on my hardware, which has taken > 2 months to get through our purchasing department
[22:05] <jefferai> but I've been thinking that maybe just for testing I could get it running on some VMs
[22:05] <dmick> you can play on a desktop, even, no VMs required
[22:05] <dmick> see vstart.sh
[22:05] <dmick> not "real", but lets you get familiar
[22:05] <jefferai> hm
[22:05] <nhm> jefferai: yeah, I actually just got a SC847a chassis in for testing at inktank with a bunch of different SAS/SATA/RAID controllers.
[22:05] <dmick> and certainly valid for testing/compatibility
[22:05] <dmick> even safer in a vm, of course
[22:05] <jefferai> dmick: OK, will look, good to know
[22:06] <jefferai> nhm: cool -- I bought through Silicon Mechanics (which is a Supermicro shop)
[22:06] <jefferai> but I convinced them to let me buy the hard drives and network cards all on my own
[22:06] <dmick> currently just about all my development testing is on a vstarted local cluster
[22:06] <nhm> jefferai: Having ZIL and L2ARC on SSDs is especially interesting.
[22:06] <jefferai> nhm: yep, will be exploring that
[22:07] <alexxy> nhm: well personally i usually use gromacs and gamess-us for my projects and they run just fine
[22:07] <alexxy> same should go for most biology sw
[22:07] <alexxy> nhm: also ansys works here
[22:07] <alexxy> as well as fluent
[22:08] <dmick> jefferai: we have some SM machines here
[22:08] <alexxy> nhm: btw i can try zfs for ceph backend
[22:08] <nhm> alexxy: we eventually got it working, but only after installing on centos and then manually copying over to SLES.
[22:08] <alexxy> nhm: well we simply edited some startup files on gentoo
[22:08] <jefferai> alexxy: the more the merrier...I can probably work on it tomorrow as I have to leave today and take the wife out for a celebration (new job!)
[22:09] * jefferai is on Gentoo too
[22:09] <alexxy> and now it works with openmpi installed here
[22:09] * alexxy gentoo dev =D
[22:09] <jefferai> hah, thought I remembered your nick
[22:09] <jefferai> not a gentoo dev, but work with some of them as downstreams
[22:09] <alexxy> ahh =)
[22:09] <jefferai> mostly...johu?
[22:09] <jefferai> I think
[22:10] <nhm> jefferai: congrats. :)
[22:10] <jefferai> unless I'm mixing up which guy in our channel is the gentoo dev
[22:10] <jefferai> nhm: ah, her new job, not mine
[22:10] <jefferai> she went through 7 years of hell as a PhD student
[22:10] <darkfader> nhm: if you had zil/l2arc, and also ceph journal, would you turn off zil?
[22:10] <nhm> jefferai: well, congrats to her then. :)
[22:10] <jefferai> finally out!
[22:10] <darkfader> because that's one of the questions i'm still not getting past
[22:11] <alexxy> darkfader: turning off zil isnt good idea
[22:11] <alexxy> =)
[22:11] <alexxy> it may corrupt zfs
[22:11] <jefferai> right, but isn't that the point, that ceph takes care of that?
[22:11] <jefferai> that actually brings me to a question, which is, my understanding is that each drive in a host is a different OSD
[22:11] <darkfader> you see what i'm getting at lol
[22:11] <jefferai> because you don't need to e.g. RAID them since Ceph is handling the redundancy
[22:11] <alexxy> it may render the fs unmountable
[22:12] <darkfader> either you need io bw for two journal devices, or you get rid of one
[22:12] <jefferai> so each drive has its own btrfs filesystem (right?)
[22:12] <jefferai> so if your fs becomes unmountable, then you wipe the fs and let ceph repliate the data back?
[22:12] <darkfader> jefferai: btrfs can span multiple drives, but if you want one osd per drive, then yes
[22:12] <jefferai> *replicate
[22:12] <nhm> darkfader: Honestly I'm not sure. I'd have to think about it more.
[22:12] <darkfader> nhm: np :)
[22:12] <darkfader> i'll keep poking at it
[22:13] <jefferai> darkfader: not sure if I'd want one OSD per drive, I just know that the people in here advised me in the past to not use e.g. RAID
[22:13] <jefferai> because that's the point of ceph redundancy
[22:13] <jefferai> so to me that sounds like you'd then have to use an FS per device
[22:14] <alexxy> well btrfs will have raid5 functionality
[22:14] <nhm> jefferai: In some cases it may make sense to do raid if you have tons of drives per node.
[22:14] <jefferai> it doesn't have to
[22:14] <jefferai> nhm: just for performance reasons?
[22:14] <nhm> yeah, 3.7 right?
[22:14] <alexxy> nhm: yep
[22:14] <nhm> jefferai: And memory usage during recovery
[22:14] <alexxy> nhm: also seems like osd nodes here will have 24 or 32 4T drives per node
[22:15] <alexxy> so there definitely should be some kind of raid
[22:15] <jefferai> so...two-disk stripes?
[22:15] <jefferai> well, hang on
[22:15] <nhm> alexxy: I just got a SC847a chassis in to test with. Don't have it filled with drives yet though.
[22:15] <alexxy> he he
[22:15] <alexxy> i mean this supermicro chassis
[22:15] <alexxy> =)
[22:16] <jefferai> if you lose a disk and it's not raided, then you only need to replace that disk and have ceph replicate it back
[22:16] <alexxy> i run this one for media archive
[22:16] <nhm> alexxy: oh, is that the one you guys will likely be using?
[22:16] <jefferai> so I get the speed argument
[22:16] <alexxy> with 24 disks and zfs
[22:16] <alexxy> on gentoo
[22:16] <jefferai> but then wouldn't you want to e.g. have two-disk stripes, and risk having to replicate two disks' worth of data back?
[22:16] <jefferai> (since you'll have multiple copies in the cluster)
[22:16] <jefferai> IOW go for raw speed
[22:16] <jefferai> and let the cluster be the redundancy
[22:16] <alexxy> also for zfs each vdev has only one i/o thread
[22:17] <alexxy> even if it consists of multiple backend drives
[22:17] <alexxy> nhm: most likely
[22:17] <alexxy> price/performance/capacity factor is good for this storage nodes
[22:17] <alexxy> ~10k$ for 40T node
[22:18] <jefferai> alexxy: I'm doing better than that :-)
[22:18] <alexxy> if you use 2T drives
[22:18] <jefferai> ah
[22:18] <nhm> alexxy: what controllers are you using?
[22:18] <alexxy> even if its sas2 ones
[22:18] <alexxy> i tried different ones
[22:18] <alexxy> from lsi
[22:18] <alexxy> and adaptec
[22:19] <alexxy> lsi seems to work fine
[22:19] <jefferai> good, mine are lsi :-)
[22:19] <alexxy> but it depends on what you need
[22:19] <jefferai> I was a fan of 3ware in the past, and LSI snapped them up
[22:19] <alexxy> lsi gives about ~1.8G rw on 24 drives
[22:20] <nhm> We are going to be testing lsi2008, lsi2308, some marvell based highpoint card, Areca 1680 raid, and we've already got Dell H700s (ie lsi 9260s) in house.
[22:20] <jefferai> alexxy: so if each vdev has one I/O thread it would suggest that you're best off having no zfs raid, and simply creating a zfs pool/vdev per drive?
[22:20] <alexxy> jefferai: it depends
[22:21] <alexxy> you can create mirrors or raidz1 vdevs
[22:21] <jefferai> (looks like I'll have lsi 9211 controllers)
[22:22] <nhm> jefferai: those are 2008 based. I have a feeling those are going to be good for ceph, though haven't tested them yet (give me 2 weeks).
[22:22] <alexxy> nhm: jefferai: https://paste.lugons.org/show/8EfGVxcEasZxIejwRxfN/
[22:22] <alexxy> its from live system
[22:22] <nhm> Probably better than the 9260s in our dells.
[22:23] <alexxy> if you're interested
[22:23] <jefferai> alexxy: I see
[22:23] <jefferai> but you're doing that for redundancy, right?
[22:23] <alexxy> yep
[22:23] <jefferai> you're not doing that with a ceph cluster?
[22:23] <alexxy> its media storage
[22:23] <jefferai> sure
[22:23] <alexxy> for iptv
[22:24] <alexxy> for ceph poll should be simpler
[22:24] <alexxy> *pool
[22:24] <jefferai> alexxy: so that goes back to my question -- if you're letting ceph do redundancy, should you bother with (zfs) RAID (previously in this channel I was told no)
[22:25] <jefferai> I guess the benefit is that if a disk does go down you can still participate in the Ceph cluster while you swap it out
[22:25] <jefferai> without having to rebuild
[22:25] <alexxy> i think no
[22:25] <alexxy> you can just use linear pool
[22:25] <jefferai> hm
[22:25] <jefferai> how do you set up a linear pool?
[22:25] <jefferai> isn't that just one pool per drive?
[22:26] <alexxy> no its pool from all drives
[22:26] <jefferai> ah
[22:26] <nhm> jefferai: I think the answer depends on how many drives you have in one node, whether you want more reliable nodes or more redundancy across multiple nodes, etc.
[22:26] <jefferai> nhm: right, sure
[22:26] <alexxy> but for better performance it's better to use one drive per vdev
[22:26] <jefferai> you can minimize the chance of having to have ceph do the replication
[22:26] <jefferai> by having local raid too
[22:26] <darkfader> i think it also matters how constant bandwidth you need to offer
[22:26] <jefferai> the other benefit of the local raid is that zfs is good at detecting disk errors
[22:26] <jefferai> constant bandwidth, probably not a lot...more bursty
[22:26] <darkfader> if variance is ok, a rebuild from ceph is great
[22:27] <jefferai> gotcha
[22:27] <jefferai> alexxy: okay -- I guess I wasn't clear, I thought vdevs = pools
[22:27] <jefferai> because you create a pool by specifying a vdev and then the drives that belong to it
[22:27] <alexxy> well if you have more than 16 drives it's better to use some kind of raid
[22:27] <jefferai> alexxy: for speed only?
[22:27] <jefferai> or?
[22:27] <nhm> alexxy: zfs is a little different too, vs say xfs ontop of a hardware raid.
[22:27] <alexxy> for better multithreading
[22:27] <jefferai> hm
[22:28] <alexxy> jefferai: no =) vdevs are different things than pools
[22:28] <jefferai> ok, I'll have to look at that again
[22:28] <jefferai> I'm still new to zfs
[22:28] <alexxy> pool may contain multiple vdevs
[22:28] <jefferai> okay
[22:28] <jefferai> oh, I see
[22:28] * jefferai is looking at the docs, misunderstood the output they were showing
[22:28] <jefferai> okay
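The pool/vdev relationship alexxy describes (a pool may contain multiple vdevs, and a vdev may be a single drive, a mirror, or a raidz1 group) can be sketched as a toy data model. This is only an illustration and not any ZFS API: the class names, the `per_disk_pool` and `raidz1_pool` helpers, and the disk names are all hypothetical, and the "one I/O thread per vdev" figure is alexxy's claim from the chat, not something verified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vdev:
    """One virtual device: a single disk, a mirror, or a raidz1 group."""
    kind: str          # "disk", "mirror" or "raidz1" (labels chosen for this sketch)
    disks: List[str]


@dataclass
class Pool:
    """A pool is built from one or more vdevs, not directly from disks."""
    name: str
    vdevs: List[Vdev] = field(default_factory=list)

    def disk_count(self) -> int:
        return sum(len(v.disks) for v in self.vdevs)


def per_disk_pool(name: str, disks: List[str]) -> Pool:
    """'Linear' layout: every disk is its own vdev, no local redundancy."""
    return Pool(name, [Vdev("disk", [d]) for d in disks])


def raidz1_pool(name: str, disks: List[str], width: int) -> Pool:
    """Group disks into raidz1 vdevs of `width` disks each."""
    if width < 2 or len(disks) % width != 0:
        raise ValueError("disk count must be a multiple of width (>= 2)")
    groups = [disks[i:i + width] for i in range(0, len(disks), width)]
    return Pool(name, [Vdev("raidz1", g) for g in groups])


if __name__ == "__main__":
    disks = [f"disk{i}" for i in range(24)]   # hypothetical device names
    for pool in (per_disk_pool("flat", disks), raidz1_pool("rz", disks, 3)):
        print(pool.name, "vdevs:", len(pool.vdevs), "disks:", pool.disk_count())
```

If the one-thread-per-vdev claim holds, the vdev count is the number to watch: the per-disk layout gives 24 vdevs for 24 disks, while 3-wide raidz1 gives 8 vdevs but survives a single disk failure in each group.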
[22:28] <nhm> jefferai: so far we've been targeting smaller configurations (ie 2U boxes with 12 drives each). I think in that case hardware raid is probably not very helpful. For bigger nodes...
[22:28] <alexxy> nhm: yep =) xfs has better multithreading features than zfs
[22:29] <jefferai> nhm: I'll have 2U boxes
[22:29] <jefferai> 12 drives 3.5, some with 20 drives 2.5
[22:30] <nhm> alexxy: amazingly we hit a bug on our Dell H700 controllers where performance tanked as soon as two processes were concurrently writing to the disk.
[22:30] <nhm> alexxy: that was with XFS.
[22:30] <alexxy> heh
[22:30] <jefferai> alexxy: stupid question, but why would having raid help for >16 drives in a multithreading aspect, since you said more vdevs = good because of more i/o threads
[22:30] <nhm> alexxy: dropped from 800MB/s on a 7 drive raid-0 to 95MB/s when going from 1 writer to 2 writers.
[22:31] <alexxy> nhm: omg
[22:31] <jefferai> nhm: yeah, that's the other end of the spectrum -- I could raid-0 a bunch of drives together
[22:31] <jefferai> and if I lose one, I have two redundant copies via ceph, so can afford to wait while it rebuilds over 10GbE
[22:31] <jefferai> even across 5 drives, say
[22:31] <alexxy> jefferai: its for redundancy
[22:31] <jefferai> ah, okay
[22:31] <alexxy> you may not have too many osd nodes
[22:31] <jefferai> oh, there's a limit?
[22:31] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:32] <alexxy> jefferai: no =) there no limit
[22:32] <Tv_> RAM and CPU
[22:32] <jefferai> ah
[22:32] <jefferai> fair enough
[22:32] <nhm> jefferai: It might be good for big configurations. I don't know that we know yet what the per-osd throughput limits are.
[22:32] <jefferai> I see
[22:32] <alexxy> for example if you have 20 osd nodes
[22:32] <alexxy> with 24 drives each
[22:32] <jefferai> that'd be a lot...
[22:32] <alexxy> and one node goes offline if you have single disk failure
[22:32] <alexxy> that will be bad
[22:33] <jefferai> so if I have 12 osd nodes, each with 10 non-system drives, and I put those into two 5-disk RAID0s...
[22:33] <jefferai> and I have two redundant copies via Ceph
[22:33] <alexxy> but in case of some kind of raid you may tolerate single disk failures
[22:33] <jefferai> sure, I know
[22:33] <jefferai> but if Ceph is up to the replication job, I'd rather add some speed
[22:34] <nhm> jefferai: fewer OSDs would mean less memory requirements. Maybe less CPU. I'm not sure if you could get the per-osd throughput high enough for it to be as fast.
[22:34] <jefferai> after all that means that even if you do lose a drive, you can then fill the raid-0 back up with replication much faster
[22:34] <jefferai> ah
[22:35] <jefferai> so it's mostly a comfort-level thing, then
[22:35] <jefferai> I could e.g. turn those 10 drives into 3 3-disk raidzs with a hot spare
[22:35] <jefferai> and thus 3 OSDs per node
[22:35] <jefferai> or I could turn them into three raid0s, or two raid0s
[22:35] <nhm> jefferai: yep, that might be viable. It would be interesting to test anyway!
[22:35] <jefferai> which might increase total throughput but increase the amount of replication that would have to happen, and thus possibility of other drives failing during that time
[22:35] <jefferai> sure
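The trade-off jefferai describes (wider RAID-0 groups mean more data to replicate back after a single drive failure) can be put into rough numbers. The sketch below is a back-of-the-envelope lower bound under stated assumptions: 2 TB drives (jefferai's drive size is not given in the chat), completely full drives, and recovery limited only by one 10GbE link into the rebuilding node. The function name and its parameters are made up for this illustration; real Ceph recovery speed depends on many other factors.

```python
def rereplication_estimate(drive_tb, drives_in_group, link_gbit=10.0,
                           fill=1.0, efficiency=1.0):
    """Return (data to copy back in TB, minimum hours on one link).

    Losing one drive of a RAID-0 group loses the whole group, so the
    data to restore is the full group capacity times the fill level.
    `efficiency` scales the nominal link rate (1.0 = ideal wire speed).
    """
    data_tb = drive_tb * drives_in_group * fill
    bytes_per_second = link_gbit * 1e9 / 8 * efficiency
    hours = data_tb * 1e12 / bytes_per_second / 3600
    return data_tb, hours


if __name__ == "__main__":
    # Assumed 2 TB drives; compare a single-disk OSD with 2- and 5-disk stripes.
    for n in (1, 2, 5):
        data_tb, hours = rereplication_estimate(2.0, n)
        print(f"{n}-disk group: {data_tb:.0f} TB to copy back, "
              f"at least {hours:.1f} h at wire speed")
```

The window during which a second failure could hit another copy grows linearly with the stripe width, which is the risk jefferai is weighing against the extra throughput.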
[22:36] <jefferai> I'll definitely look into that shell script, I'd love to get a feeling for working with ceph while waiting on the hardware
[22:37] <jefferai> thanks for all the info, if one of you plays around with ceph on zfs before I report back about it, please be sure to ping me
[22:37] <jefferai> :-)
[22:37] <jefferai> would want to know what you found
[22:37] <nhm> Will do! If you test it, please send a note to the mailing list. I bet other people would be interested too!
[22:37] <jefferai> sure, I'll have to get on it
[22:38] <jefferai> take care
[22:38] <alexxy> nhm: btw will ceph perform well if it uses the same node for osd and client?
[22:39] <nhm> alexxy: I don't think that's a good idea with the kernel client right now due to some potential deadlock situation. With fuse you can do it.
[22:39] <alexxy> hmmm
[22:39] <jefferai> by client you just mean cephfs, right?
[22:39] <jefferai> not the rados gateway, or rbd...?
[22:39] <nhm> alexxy: for performance, I don't think it will improve throughput in most cases. We're still working on maxing out our 10G links.
[22:40] <nhm> jefferai: yeah, just the kernel client.
[22:40] <jefferai> cool
[22:40] * jefferai goes
[22:40] <alexxy> heh i'm going to try it for some workloads on a small test cluster
[22:44] <alexxy> nhm: well. I dont have 10G hw here
[22:44] <alexxy> only different ib hw
[22:44] <alexxy> SDR DDR QDR
[22:47] <elder> nhm are you going to set up teuthology@home, or are you just going to have some dedicated local hardware at your disposal?
[22:52] <nhm> elder: for the controller testing, just local hardware. We could see about trying to make it available over the network, though I'll be mucking around with it a lot.
[22:53] <nhm> elder: and century link is sucking it up as a service provider.
[22:53] <nhm> alexxy: yeah, so you are even less likely to hit network limitations. ;)
[22:54] <NaioN> alexxy: had the same problem... also wanted to have raid on the nodes so I don't have to reboot a node for a single disk failure
[22:54] <NaioN> but we didn't want to use hardware raid and we didn't get it stable enough with software raid
[22:55] <NaioN> so at the moment we still run a osd per disk with 24 disks per node
[22:56] * gregaf (~Adium@2607:f298:a:607:4cc5:a15:c810:6c20) Quit (Read error: Operation timed out)
[22:56] <nhm> NaioN: Do you guys have any memory consumption issues?
[22:56] <NaioN> we have 24G per node
[22:57] <NaioN> no memory issues as far as we noticed
[22:58] <dmick> NaioN: when you say "reboot", you mean "restart ceph-osd", right?
[22:58] <nhm> That's good to hear. What kind of throughput do you get per node?
[22:58] <NaioN> dmick: no
[22:58] <NaioN> the problem is the mounted fs on a failed disk
[22:58] <dmick> force-unmount doesn't work?
[22:59] <NaioN> you have no way under linux to umount a fs from a failed disk
[22:59] <NaioN> dmick: no
[22:59] <dmick> sadface
[22:59] <NaioN> yeps
[22:59] <darkfader> heh.
[23:00] <NaioN> you also can't kill the processes that are waiting on IO
[23:00] * MK_FG (~MK_FG@ Quit (Quit: o//)
[23:00] <darkfader> umount -l can take care of the procs sometimes
[23:00] <darkfader> as opposed to kill
[23:00] * gregaf (~Adium@ has joined #ceph
[23:00] <NaioN> sometimes indeed :)
[23:01] <NaioN> darkfader: -f you mean
[23:01] <dmick> I assume removable drives handle it better?
[23:01] <darkfader> do i? lazy umount
[23:01] <darkfader> i dont remember
[23:01] <dmick> there are both
[23:01] <dmick> -l is lazy
[23:01] <dmick> -f is force
[23:01] <darkfader> i don't have to do nfs crap any more so i never need it any more
[23:01] <darkfader> -l works for stuck stuff by tearing it down
[23:02] <darkfader> my old team in india used -l after 2 times fuser -ck
[23:02] <darkfader> result: first fuser got torn down, umount of mountpoint, then fuser -k on /
[23:02] <darkfader> then one less guy on team ;)
[23:02] <NaioN> haven't tried that one, but it still doesn't clean all the references
[23:02] <darkfader> yes it's just a "best effort" thing
[23:02] <NaioN> hehe
[23:02] <gregaf> -l doesn't force it to write out all dirty cache; I don't remember what -f does precisely, but they each do something different (I think -f is data-safe if you don't have any current writers; -l is not)
[23:03] <darkfader> from carrier grade linux, the idea was "if the disk is broken we'll never sort this out anyway
[23:03] <darkfader> so lets just wipe it out"
[23:03] * fiddyspence1 (~fiddyspen@94-192-234-112.zone6.bethere.co.uk) Quit (Quit: Leaving.)
[23:03] <darkfader> meh, i want my hpux back
[23:03] <darkfader> lose whole san and when it comes back, give it 2 mins and everything is fine again
[23:04] * MK_FG (~MK_FG@ has joined #ceph
[23:04] <NaioN> well you don't have to umount for that
[23:04] <dmick> I would have thought -f is "screw it, burn it all to the ground"
[23:04] <dmick> but I have no actual info, that's just assumption from the nature of the problem and the flag name
[23:04] <dmick> this sounds like something worth digging into
[23:05] <darkfader> dmick: -f only works if the FS is OK
[23:05] <darkfader> iirc
[23:05] <dmick> then why do you even need -f?
[23:05] <NaioN> dmick: yes I would think so too, but in real life it almost never works :)
[23:05] <darkfader> idk. it never worked when i needed it
[23:05] <NaioN> the man page refers to nfs
[23:05] <Tv_> -f doesn't work well, but it's supposed to make all open files -EIO
[23:05] <darkfader> NaioN: maybe it kills processes that generally have stuff open, but don't have a stuck write
[23:05] <NaioN> Force unmount (in case of an unreachable NFS system). (Requires kernel 2.1.116 or later.)
[23:06] <Tv_> where as -l keeps the files open and unmounts when last one is closed
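The two flags being compared are `umount -f` (force: per Tv_, meant to make all open files return -EIO) and `umount -l` (lazy: detach from the hierarchy now, really unmount when the last open file is closed). A minimal sketch that only builds the command lines, in a plain, force, lazy escalation order, is shown below. The helper names and the example mountpoint are hypothetical. The commands are deliberately not executed: as NaioN notes, a process stuck on I/O to a failed disk cannot be killed, so a wrapper that runs `umount` and waits for it could hang too.

```python
from typing import List


def umount_argv(mountpoint: str, lazy: bool = False,
                force: bool = False) -> List[str]:
    """Build the argv for umount(8) with the flags discussed above."""
    argv = ["umount"]
    if force:
        argv.append("-f")   # force unmount
    if lazy:
        argv.append("-l")   # lazy unmount: detach now, clean up later
    argv.append(mountpoint)
    return argv


def umount_plan(mountpoint: str) -> List[List[str]]:
    """Escalation order assumed for this sketch: plain, then force, then lazy."""
    return [
        umount_argv(mountpoint),
        umount_argv(mountpoint, force=True),
        umount_argv(mountpoint, lazy=True),
    ]


if __name__ == "__main__":
    # "/srv/osd.0" is a made-up mountpoint for illustration only.
    for argv in umount_plan("/srv/osd.0"):
        print(" ".join(argv))
```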
[23:06] <NaioN> darkfader: could be, but that's not really what you want
[23:06] <darkfader> NaioN: hehe no :)
[23:06] <dmick> sounds like umount needs a -F
[23:06] <NaioN> the chance of an open write is big
[23:06] <dmick> as in "-F it"
[23:06] <NaioN> dmick: yeah :)
[23:06] <Tv_> dmick: -f is that, it just doesn't work well
[23:07] <darkfader> for nfs... just don't use hard mounts w/o intr if you don't need/understand it
[23:07] <NaioN> just unmount and kill all processes depending on that fs and clean all references
[23:07] <darkfader> that did it for $oldjob
[23:07] <darkfader> lwn.net has a few good articles on -l
[23:07] <Tv_> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=fs/bad_inode.c;h=b1342ffb3cf6e595125252ac82d5bd54ebbe9544;hb=HEAD
[23:07] <NaioN> darkfader: yes we also used soft mounts
[23:07] <Tv_> ^ that's a fs where everything files
[23:07] <Tv_> umount -f is supposed to switch you into that
[23:08] <gregaf> oh, so -l just clears it out of the hierarchy, and open file descriptors do???something?
[23:08] <darkfader> gregaf: they get parked and the processes are wiped out
[23:08] <Tv_> gregaf: -l just prevents new chdirs/opens from entering the fs
[23:08] <darkfader> i think
[23:09] <Tv_> it's disconnected from the hierarchy, but remains existing
[23:09] <NaioN> Tv_: but you need to do that for every inode?
[23:09] <Tv_> s/everything files/everything fails/
[23:09] <gregaf> I think this has been improperly described to me in the past
[23:09] <Tv_> NaioN: the bad thing? yeah
[23:10] <Tv_> NaioN: well not inode.. fd/dentry
[23:10] <NaioN> so on a fs with a lot of open inodes it can take a while before every inode gets bad
[23:10] <NaioN> Tv_: yeah fd...
[23:10] <Tv_> in-mem not on-disk, but yeah
[23:10] <Tv_> NaioN: oh the umount might sit there for a while, that'd be fine
[23:10] <Tv_> NaioN: if the darn thing worked in the first place
[23:10] <NaioN> with ceph you have a lot of open fd's?
[23:11] <elder> nhm I meant, are you going to set it up as a local teuthology target. I have a couple of machines and have wondered about doing something like that in my own "lab."
[23:11] <Tv_> http://www.selenic.com/pipermail/kernel-mentors/2005-July/000330.html
[23:11] <elder> I guess it's Tv_ I should be talking to.
[23:11] <Tv_> elder: specify explicit targets: in a yaml and avoid locking through some means i am not the expert of
[23:11] <Tv_> elder: joshd knows more about the locking ;)
[23:12] <darkfader> i can't find the original articles
[23:12] <elder> I'll come talk to you about it today or tomorrow, Tv_
[23:12] <darkfader> 2005 sounds good since that was when i played with carrier grade linux, they had it on the specs list
[23:12] <Tv_> elder: i'm gone tomorrow
[23:12] <elder> OK, then not tomorrow.
[23:13] <joshd> elder: check_locks: false in your config iirc
[23:13] <joshd> that's easier than setting up the locking infrastructure
[23:13] <Tv_> ah that was the thing
[23:13] <elder> I think I'm looking for more than this, but I'll talk to you in a little while.
[23:13] <Tv_> i always forget and can't find it fast enough
[23:19] * MK_FG (~MK_FG@ Quit (Quit: o//)
[23:23] * MK_FG (~MK_FG@ has joined #ceph
[23:41] * cpglsn (~ac@host96-174-dynamic.8-87-r.retail.telecomitalia.it) has joined #ceph
[23:54] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.