#ceph IRC Log


IRC Log for 2010-08-04

Timestamps are in GMT/BST.

[0:17] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[0:28] <MarkN> I have access to some machines with 128GB RAM, I was wondering if I could use tmpfs to create in-RAM journals, MDSes etc, has anyone tried this at all?
[0:29] <sagewk> you could make a (fast!) ram journal, but it wouldn't actually do any good without being backed by nvram
[0:30] <sagewk> 128gb will let the mds keep gobs and gobs of metadata in ram, though, making it super fast. that is ideal.
[0:32] <MarkN> yeah that was my thinking RE MDS, however if it is simply a scratch space if the journal went down it would be OK since data will not be guaranteed if a osd goes down.
[0:36] <gregaf> the mds journal isn't scratch space
[0:37] <gregaf> it's a required record of batched changes that are only periodically written to disk in other places
[0:37] <gregaf> this lets the MDS employ streaming writes to a small number of OSDs instead of random writes to any OSD storing metadata, which is a lot slower
[0:39] <MarkN> no, what I am saying is if a journal is in ram and it (the machine holding the journal) dies then the data in the osd is corrupt, is this correct ?
[0:40] <sagewk> you're talking about the osd journal?
[0:40] <MarkN> yes
[0:41] <sagewk> if the osd loses its journal, it basically warps back in time to an older commit state, which violates the "this write is safe" promise it made.
[0:41] <sagewk> that's actually probably ok if it only happens to one replica, but problematic if all replicas similarly fail
[0:42] <sagewk> (which is typically what happens when the whole circuit/building/city/whatever loses power)
[0:49] <MarkN> essentially what I am trying to do is create a small fast file system, with no guarantees to the users that the data is OK. Obviously I would not set up our main data store cluster like this!
[0:51] <MarkN> the users just want to read and write to a filesystem as fast as possible, their applications will take care that the data is OK so I don't really care if state of this particular file system gets corrupted because after a restart it will get cleaned anyway.
[0:52] <sagewk> i see.. and you want to pool lots of these machines into one big temp fs?
[0:52] <MarkN> anyway, this is not my main focus, just having some thoughts on how to create a really fast (non consistent after power loss) file system, since the developers are always asking for more speed.
[0:52] <MarkN> yeah
[0:53] <sagewk> yeah, sure, that'd work. could turn off osd replication even (set pg size to 1x). it'll take down the fs if any of your N nodes fails, but if that's ok, it'll be twice as big and faster
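The trade-off sagewk describes (no replication gives twice the space, but any single node failure takes the fs down) can be put into rough numbers. This is a minimal sketch: the node count, the per-node failure probability, and both function names are illustrative assumptions, not values from the log, and node failures are assumed to be independent.

```python
def usable_capacity_gb(node_count, node_gb, replicas):
    """Usable space when every object is stored `replicas` times."""
    return node_count * node_gb / replicas


def prob_any_node_fails(node_count, p_node_fail):
    """Chance that at least one of `node_count` independent nodes fails."""
    return 1 - (1 - p_node_fail) ** node_count


# Hypothetical cluster: 8 nodes with 128 GB each (illustrative numbers only).
print(usable_capacity_gb(8, 128, replicas=2))  # 2x replication
print(usable_capacity_gb(8, 128, replicas=1))  # no replication: twice the space
print(prob_any_node_fails(8, 0.01))            # risk of losing the unreplicated fs
```

With replication off, the chance of losing the whole fs grows with every node added, which is only acceptable for the scratch-space use MarkN has in mind.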
[1:01] <MarkN> yeah sounds good, as i said it is more the idea stage at the moment, will keep it in mind. thanks
[2:36] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (reticulum.oftc.net charon.oftc.net)
[2:36] * darkfader (~floh@host-82-135-62-109.customer.m-online.net) Quit (reticulum.oftc.net charon.oftc.net)
[2:36] * DLange (~DLange@dlange.user.oftc.net) Quit (reticulum.oftc.net charon.oftc.net)
[2:36] * f4m8 (~drehmomen@lug-owl.de) Quit (reticulum.oftc.net charon.oftc.net)
[2:37] * darkfader (~floh@host-82-135-62-109.customer.m-online.net) has joined #ceph
[2:37] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[2:37] * f4m8 (~drehmomen@lug-owl.de) has joined #ceph
[2:37] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[2:48] * fzylogic (~fzylogic@dsl081-243-128.sfo1.dsl.speakeasy.net) Quit (Quit: DreamHost Web Hosting http://www.dreamhost.com)
[3:01] * akhurana (~ak2@c-98-232-30-233.hsd1.wa.comcast.net) has joined #ceph
[3:05] * akhurana (~ak2@c-98-232-30-233.hsd1.wa.comcast.net) Quit ()
[3:10] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[3:58] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[4:34] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[7:58] * allsystemsarego (~allsystem@ has joined #ceph
[8:06] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[9:20] * Osso (osso@AMontsouris-755-1-10-232.w90-46.abo.wanadoo.fr) Quit (Quit: Osso)
[10:00] * sassyn (~sassyn@ Quit (Quit: I love my HydraIRC -> http://www.hydrairc.com <-)
[10:18] <jantje> Hmm, I was trying the same as MarkN , I like to have my journal on a memory fs /dev/shm , just for testing (It could point out that having a SSD for having the journal on makes sense)
[10:25] <Anticimex> will ceph have any use for the btrfs direct I/O in 2.6.35?
[11:47] <wido> jantje: about the SSD's, for now i can say they make sense
[11:50] <jantje> wido: I have no idea what improvements to expect
[11:51] <jantje> would the write speed equal that of an SSD ?
[11:56] <jantje> 10.08.04_11:56:49.058179 7f448cce7710 -- mark_down -- 0x7f4488001c70
[11:56] <jantje> 10.08.04_11:56:49.058214 7f448cce7710 -- --> mon0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x7f4488001680
[11:56] <jantje> 10.08.04_11:56:49.058281 7f448fbe0710 -- >> pipe(0x7f4488001860 sd=-1 pgs=0 cs=0 l=0).fault first fault
[11:57] <jantje> crap, sorry
[11:57] <jantje> i have past buffers
[11:57] <jantje> *hate
[11:57] <jantje> (no I dont need diagnostics!:P)
[12:26] <jantje> Hmm
[12:26] <jantje> 10.08.04_12:26:09.138215 7ff431041710 log [WRN] : lease_expire from mon0 was sent from future time 10.08.04_12:26:25.699573, clocks not synchronized
[12:27] <jantje> all ceph servers are running ntpd
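The warning above carries two timestamps, so the apparent skew can be computed directly. This is a minimal sketch, assuming the stamps use a two-digit-year `yy.mm.dd_HH:MM:SS.micro` layout; the function names and the `allowed_skew` threshold are hypothetical and this is not Ceph's actual check.

```python
from datetime import datetime

# Assumed layout of stamps such as 10.08.04_12:26:09.138215
LOG_TIME_FORMAT = "%y.%m.%d_%H:%M:%S.%f"


def skew_seconds(local_stamp, remote_stamp):
    """Seconds by which the remote stamp is ahead of the local one."""
    local = datetime.strptime(local_stamp, LOG_TIME_FORMAT)
    remote = datetime.strptime(remote_stamp, LOG_TIME_FORMAT)
    return (remote - local).total_seconds()


def is_from_future(local_stamp, remote_stamp, allowed_skew):
    """True when the remote stamp is ahead by more than `allowed_skew` seconds."""
    return skew_seconds(local_stamp, remote_stamp) > allowed_skew


# The two stamps from the warning: local log time, then the monitor's send time.
skew = skew_seconds("10.08.04_12:26:09.138215", "10.08.04_12:26:25.699573")
print(f"{skew:.6f} s ahead")
```

The local stamp is when the warning was logged, so the figure is only an approximation of the real clock difference between the two machines.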
[12:33] <jantje> I'm wondering if I can get old Centos 4 servers connected to ceph
[12:34] <jantje> would that be ceph-fuse?
[13:16] <jantje> I verified with ntpdate that clocks are in sync
[14:31] <jantje> Hmm, my box just rebooted while running bonnie :)
[15:46] * f4m8 is now known as f4m8_
[17:39] * Osso (osso@AMontsouris-755-1-10-232.w90-46.abo.wanadoo.fr) has joined #ceph
[18:52] <sagewk> wido: ok, the heartbeat thing is fixed now in testing branch
[19:26] <wido> sagewk: you mean 330?
[19:26] <sagewk> yeah
[19:27] * Osso (osso@AMontsouris-755-1-10-232.w90-46.abo.wanadoo.fr) Quit (Quit: Osso)
[19:27] <wido> my cluster is btw pretty damaged
[19:27] * Osso (osso@AMontsouris-755-1-10-232.w90-46.abo.wanadoo.fr) has joined #ceph
[19:27] <wido> ever since i restarted everything yesterday it has been broken badly
[19:27] <wido> i've lost 30GB of data, had 660GB in use, it's down to 630 right now
[19:29] <darkfader> wow ouch
[19:30] <wido> ofcourse it wasn't valuable data :-)
[19:30] <darkfader> jantje: i also got the "messages from future" thing and the vm's im testing with aren't really drifting. there is a ceph.conf setting that will allow a drift of 1 sec
[19:31] <darkfader> wido: hehe sure, still sorry thing to happen. i'm surprised i still got all my test data intact as my testing was very "intuitive"
[19:31] <gregaf> I think it's mon_lease_wiggle_room
[19:31] <darkfader> yeah
[19:31] <wido> sagewk: it's #326 and #327 killing most of my OSD's it seems
[19:31] <gregaf> you can make that larger if you don't think there are going to be problems
[19:32] <darkfader> gregaf: are you not getting these errors?
[19:32] <gregaf> with the leases from future?
[19:32] <darkfader> yes
[19:32] <gregaf> I don't think so
[19:33] <darkfader> i wonder why
[19:33] <darkfader> i'll re-check my dom0 clocks
[19:33] <darkfader> maybe they just need a driftfile to run super-stable
[19:33] <gregaf> well, don't run real installs very often, mostly just on my dev machine to test changes
[19:33] <gregaf> *I don't run
[19:33] <darkfader> sec said the wiggle room should stay under 2 seconds anyway
[19:33] <darkfader> i see
[19:34] <darkfader> <- always gets furious about devs who just test on their desktop pc *hahaha*
[19:34] <gregaf> yeah, wiggle room needs to be less than the lease time or you might get issues with timeouts not behaving properly
[19:34] <gregaf> well, we have real installs of it in our datacenter, I just don't have to handle them often
[19:35] <darkfader> this is fresh software so perfectly fine for other people helping you with the testing :>
[19:35] <darkfader> no need to justify to me hehe
[19:35] <darkfader> and time i checked the clocks
[19:36] <gregaf> wido: was it you asking about multiple NICs for the OSDs?
[19:38] <wido> no, was jantje
[19:38] <wido> binding each OSD to a separate NIC
[19:38] <wido> but i see you have a osd_msgr branch? separate data path for OSD sync traffic?
[19:38] <gregaf> yeah, just merged it into unstable
[19:39] <gregaf> it lets you set a "cluster address" for talking to other OSDs and a "public address" for talking to clients, monitors, and MDSes
[19:39] <gregaf> a few people have asked about it over the last several months, like they had gigabit connections for clients but had 10GigE or something that they could use between the OSDs
[19:39] <wido> IPv6 support? ;)
[19:40] <gregaf> heh, I didn't touch that but it'll be at the same level as the rest of the system
[19:40] <wido> but yes, i really see a benefit in it. There are a lot of situations where you could want this
[19:40] <wido> even for security
[19:41] <gregaf> I just thought it might have been you asking so wanted to point it out, but I guess I'll have to search my logs
[19:42] <wido> no, but i saw the plans for it, liked it, so did our network admin
[19:50] <monrad-65532> 10g also gets you some lower latency
[19:52] <wido> monrad-65532: depends on your NIC and connection, since there are some poor NIC's (even from Intel!) which have a 800us processtime per packet
[19:52] <wido> so you really want jumbopackets on them
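Why larger frames help a NIC with a high per-packet cost can be seen from the packet rate needed to fill a link. This is a rough sketch that ignores Ethernet, IP and TCP header overhead; the function name is made up for illustration, and 1500 and 9000 bytes are the usual standard and jumbo MTU values rather than figures from the log.

```python
def packets_per_second(link_bits_per_second, mtu_bytes):
    """Packets needed each second to fill the link, ignoring header overhead."""
    return link_bits_per_second / (mtu_bytes * 8)


TEN_GIG = 10_000_000_000  # 10 Gbit/s link

standard = packets_per_second(TEN_GIG, 1500)  # common default MTU
jumbo = packets_per_second(TEN_GIG, 9000)     # typical jumbo-frame MTU

# Packet rate at each MTU, and how many times fewer packets jumbo frames need.
print(round(standard), round(jumbo), standard / jumbo)
```

At the same throughput, the 9000-byte MTU needs one sixth as many packets, so any fixed per-packet processing cost is paid far less often.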
[19:54] <monrad-65532> also if you have 10g all the way you can get cutthrough switching
[19:54] <monrad-65532> but sure you have to have good NICs
[19:55] <wido> yes, that's true. Don't know how the NIC's are these days btw, but about a year ago i had a lot of kernel panics with the ixgbe driver from Intel
[19:55] <wido> and there are some cool switches on the market, which run Linux and you can modify them to whatever you want
[20:38] <gregaf> wido: I can't be sure but I suspect that crash you just reported was caused by my map changes; I pushed something that should fix it
[20:39] <gregaf> (I just can't be sure since I don't have access, if that doesn't work we'll look into it more)
[20:41] <darkfader> i also wondered if i should have a dedicated network for osd's or mds-osd
[20:42] <darkfader> they'll end up in some extra vlan/ib partition anyway
[20:45] <wido> gregaf: sagewk already has access, but if you want your key loaded into the cluster, no problem
[20:45] <wido> if that makes debugging easier, you're welcome
[20:45] <wido> i'll try the fix for now
[20:46] <gregaf> maybe I'll get it to you later — i think I have 2 or 3 floating around that I should consolidate before I start handing it out or I'll get myself confused ;)
[20:47] <wido> np, that's why i create issues, don't want to be bugging you on IRC about it
[21:00] <wido> btw, could i get the possibility to close or update my own issues? Sometimes i make a stupid typo in them, but then there is no edit button
[21:31] <wido> gregaf: could it be that there are a lot of crashes since the merge? For example, i have a crash on "PG::recover_master_log" right now, another one on "PG::read_log"
[21:50] <jantje> darkfader: i'm running on real hardware, but I used the mon lease wiggle room option, and now it looks OK
[21:52] <jantje> gregaf / wido : I'm now bonding 4 gbit nics, iperf gives me 2.6 gbps nominal throughput (20 clients simulated) , I think I'm hitting my CPU limit. (quad core 2.6GHz)
[21:53] <wido> 2.6gbps is pretty good
[21:53] <jantje> thats *without* the disks ofcourse :)
[21:54] <jantje> with a 'single' client I get 2gbps
[21:54] <jantje> so I'm good.
[21:56] <wido> gregaf: i'm going afk, if you want to take a look at my cluster, you're welcome. For now i'll leave it like it is. I'm afk, ttyl!
[21:57] <darkfader> jantje: do you understand the window handling in iperf?
[21:57] <darkfader> it often does confusing things to me
[21:57] <darkfader> TCP window size: 256 KByte (WARNING: requested 8.00 MByte)
[21:58] <wido> sage has access btw, he can log on. Afk now ;)
[21:58] <darkfader> have fun
[21:59] <jantje> darkfader: no, but I'm sure it has its reasons :)
[21:59] <darkfader> hehe
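A likely explanation for the iperf warning is that the operating system caps socket buffer sizes, so the window actually granted can differ from the one requested (on Linux the cap is typically set by `net.core.rmem_max` and `net.core.wmem_max`, and the kernel reports a doubled value to account for bookkeeping). The sketch below only illustrates that request-versus-grant behaviour with a plain socket; it is not how iperf itself is implemented.

```python
import socket

REQUESTED_BYTES = 8 * 1024 * 1024  # the 8 MByte asked for in the iperf run

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    try:
        # Ask for a large receive buffer; the kernel may silently clamp it.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED_BYTES)
    except OSError:
        # Some systems reject oversized requests instead of clamping them.
        pass
    # Read back what the kernel actually granted.
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
finally:
    sock.close()

print(f"requested {REQUESTED_BYTES} bytes, kernel reports {granted} bytes")
```

If the reported size is well below the request, raising the system-wide limits is the usual way to get a larger window.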
[22:44] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Ping timeout: 480 seconds)
[22:51] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[23:01] <jantje> wido: I might get more out of it with jumbo frames, but at first sight this isn't supported on my 82573E/L
[23:06] <darkfader> jantje: i'd bet it is
[23:07] <darkfader> what happens if you just turn the mtu up?
[23:22] <jantje> i cant put it higher than 1500
[23:23] <jantje> so its not supported :P
[23:23] <jantje> But this document suggests it's possible: http://download.intel.com/design/network/specupdt/82573.pdf
[23:24] <jantje> Hm, ok
[23:24] <jantje> I can set mtu for eth2 and eth3, but not for 0 and 1
[23:25] <jantje> (make that eth0 only)
[23:26] <jantje> so 82573E can't , but 82573L can
[23:26] <jantje> I dont understand why they ship different chips on the same motherboard
[23:30] <jantje> I can only increase the mtu of my bond when all slave nics support it, so I'm now off trying just eth1 2 3 with 9k mtu
[23:35] <darkfader> one should be able to make the mainboard vendor pay for an extra card for the wasted time
[23:39] <jantje> they never claimed it supports jumbo frames :-)
[23:43] <jantje> oh crap, one of the boxes just went down when changing mtu
[23:44] <jantje> oh well, time to sleep I guess. Nite

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.