#ceph IRC Log


IRC Log for 2010-11-17

Timestamps are in GMT/BST.

[0:04] <gregaf> jantje: sorry for the delay, was on a break, let me check
[0:06] <jantje> (add it to the twiki)
[0:07] <jantje> if it isn't there yet :)
[0:07] <gregaf> hmm, oddly enough we don't have a config option for it right now!
[0:07] <jantje> (also ;mds bal frag = true)
[0:07] <jantje> ah, ok
[0:08] <gregaf> I'll add it to the tracker
[0:08] <jantje> great
[0:10] <jantje> I'm currently building our source tree on a ceph filesystem, hopefully 3 servers with a total of 6 OSDs will give results identical to a 'local' build with just a single sata disk
[0:12] <johnl> ceph-client.git updated 117 seconds ago - yum, fresh meat!
[0:13] <jantje> but it's not as performant as I hoped; the cpu is more than 50% idle, meaning storage can't keep up, I think :(
[0:13] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[0:13] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:13] <gregaf> I dunno, I often see idle CPU time on my local builds too
[0:14] <jantje> we have highly parallelized build stuff, so on a local build there is 0 to 5% idle
[0:14] <jantje> and sometimes a bit of iowait
[0:14] <jantje> I can see a few MB/sec going to the ceph cluster
[0:14] <jantje> probably random writes or something, don't know how to describe the workload
[0:15] <jantje> and I'm even using a memory device to host my journal ;p
[0:16] <jantje> anyway, if you guys have any hints, it would be appreciated
[0:17] <jantje> (I have the default pool size, the journal on /dev/shm, 500MB for each OSD, and filestore journal parallel = 1)
[0:17] <gregaf> well, we haven't spent a lot of time figuring out the best settings for specific workloads
[0:18] <gregaf> if you time the compile on local disk versus Ceph and it's way different it might be something we want to look at
[0:18] <jantje> i'm currently timing
[0:18] <gregaf> this is your scratch space, without any replication?
[0:20] <jantje> On a part where the identical hardware with a sata disk takes 27min to compile an image, the same thing is already taking more than 38, i'll let you know the results
[0:20] <jantje> yes
[0:20] <jantje> for every test I start with a clean mkcephfs
[0:20] <jantje> (and remove journals, etc)
[0:21] <jantje> however I might have seen a strange thing, just a second
[0:22] <jantje> I'll clean it all up again tomorrow just to be sure
[0:24] <jantje> Ok, that log is gone, but when I did a new mkcephfs, it automatically created the journal
[0:24] <jantje> and when starting the first time, it appeared as if it was replaying the journal
[0:24] <jantje> and the mon reported more than 100MB storage used
[0:30] <jantje> let me know if you need me to test a universal workload, like a linux kernel
[0:30] <jantje> i'll do that tomorrow
[0:31] <jantje> what are you guys using to synchronize clocks?
[0:31] <jantje> I have ntpdate running, but I sometimes still get those messages
[0:31] <sagewk> ntpd
[0:31] <jantje> (and ntpdate reports successful sync)
[0:31] <jantje> I have s/ntpdate/openntpd/
[0:32] <jantje> hm
[0:32] <jantje> ok
[0:32] <sagewk> the thing with compilations is they're pretty metadata intensive, and many of those ops go to the mds. so there will always be some latency penalty there
[0:32] <sagewk> the ntpd daemon will correct slow clock drift over time
[0:32] <jantje> ok, would it help to have 3 active MDSs ?
[0:33] <sagewk> probably not in this case.. it's not that the mds is slow (probably, it's not eating 100% cpu is it?), but that the round-trip time
[0:33] <sagewk> is slowing things down.
[0:34] <jantje> I understand, no it's not eating all cpu, just 20-30%
[0:35] <jantje> the monitor
[0:35] <jantje> hmm
[0:35] <jantje> anyway, i might have some messy old configuration
[0:35] <jantje> I'll clean up tomorrow and let you know the result
[0:40] <sagewk> ok cool.
[0:41] <sagewk> jantje: about #549.. can you see if you can reproduce the problem with this? http://fpaste.org/wCFg/
[0:41] <sagewk> bonnie++ cleans up after it errors out, making it hard to catch the problem when it happens
[0:44] <jantje> ok
[0:48] <jantje> the script runs just fine
[0:48] <jantje> no diffs
[0:48] <jantje> bonnie still fails..
[0:48] <jantje> lets see if I can get more out of it
[1:05] <jantje> when mounting the cephfs 'locally' on any of the servers
[1:05] <jantje> bonnie works!
[1:05] * darkfader (~floh@host-82-135-62-109.customer.m-online.net) Quit (Ping timeout: 480 seconds)
[1:05] <jantje> the working one is debian, the failing one is centos
[1:06] <jantje> both running 2.6.37-rc1+
[1:06] <jantje> I really have no clue sage
[1:07] <sagewk> same version of bonnie++?
[1:08] <sagewk> can you capture an strace log for me?
[1:09] <jantje> working on that
[1:10] <jantje> installing strace .. on a diskless client, just takes tooo long :)
[1:14] <jantje> getdents(3, 0x89d1304, 32768) = -1 EOVERFLOW (Value too large for defined data type)
[1:14] <jantje> (just a quick look)
[1:15] <jantje> file is 7,5MB, i'll upload it
[1:16] <sagewk> ah interesting
[1:16] <sagewk> thanks
[1:18] <jantje> http://jan.sin.khk.be/bonnie.strace
[1:18] <jantje> should be there
[1:19] * allsystemsarego (~allsystem@188.26.32.123) Quit (Quit: Leaving)
[1:29] <jantje> i'm not sure if i'm still using that kernel where I removed that commit
[1:30] <sagewk> yeah, doesn't look related. i'll look through the strace..
[1:30] <jantje> I'm going to bed now
[1:30] <sagewk> :) sweet dreams
[1:31] <jantje> let me know if you want me to try things, i'll read the backlog
[1:31] <jantje> nite!
[1:32] * Jiaju (~jjzhang@222.126.194.154) Quit (Ping timeout: 480 seconds)
[1:38] * Jiaju (~jjzhang@222.126.194.154) has joined #ceph
[1:40] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[2:03] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[2:03] * michael-ndn (~michael-n@12.248.40.138) Quit (Read error: Connection reset by peer)
[2:08] * alexxy[home] (~alexxy@79.173.81.171) Quit (Ping timeout: 480 seconds)
[2:14] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[2:14] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[3:18] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[3:18] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[3:32] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:55] * greglap (~Adium@166.205.137.35) has joined #ceph
[3:56] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[4:07] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[4:59] * greglap (~Adium@166.205.137.35) Quit (Read error: Connection reset by peer)
[5:43] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[5:55] * Jiaju (~jjzhang@222.126.194.154) Quit (Read error: Connection timed out)
[5:56] * Jiaju (~jjzhang@222.126.194.154) has joined #ceph
[6:35] * sage1 (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[6:35] * sage (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Read error: No route to host)
[7:52] * Jiaju (~jjzhang@222.126.194.154) Quit (Ping timeout: 480 seconds)
[7:54] * Jiaju (~jjzhang@222.126.194.154) has joined #ceph
[7:57] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:14] * Jiaju (~jjzhang@222.126.194.154) Quit (Ping timeout: 480 seconds)
[8:21] * Jiaju (~jjzhang@222.126.194.154) has joined #ceph
[8:52] * gregorg (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[8:52] * gregorg (~Greg@78.155.152.6) has joined #ceph
[9:06] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:49] <jantje> i didn't verify, but the file journal should be deleted when doing a mkcephfs, I think I saw it replaying yesterday after creating a new cephfs
[9:59] * allsystemsarego (~allsystem@188.26.33.21) has joined #ceph
[10:16] * Yoric (~David@213.144.210.93) has joined #ceph
[11:37] <jantje> sagewk: I'm getting other weird stuff on that machine ...
[11:37] <jantje> 26214 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfccedd8) = -1 ENOTTY (Inappropriate ioctl for device)
[11:37] <jantje> 26214 _llseek(1, 0, [0], SEEK_CUR) = 0
[11:37] <jantje> 26214 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfccedd8) = -1 EINVAL (Invalid argument)
[11:37] <jantje> 26214 _llseek(2, 0, 0xbfccee00, SEEK_CUR) = -1 ESPIPE (Illegal seek)
[11:41] <jantje> I've read that it could be something with 32/64bit
[11:42] <jantje> ceph servers are 64bit
[11:42] <jantje> client is 32
[11:44] * tnt (~tnt@mojito.smartwebsearching.be) has joined #ceph
[12:11] <jantje> more Belgians :P
[14:03] * shdb (~shdb@217-162-231-62.dclient.hispeed.ch) has joined #ceph
[14:04] * shdb (~shdb@217-162-231-62.dclient.hispeed.ch) has left #ceph
[14:04] * shdb (~shdb@217-162-231-62.dclient.hispeed.ch) has joined #ceph
[14:04] * shdb (~shdb@217-162-231-62.dclient.hispeed.ch) has left #ceph
[14:05] * shdb (~shdb@217-162-231-62.dclient.hispeed.ch) has joined #ceph
[15:23] * fred_ (~fred@80-219-183-100.dclient.hispeed.ch) has joined #ceph
[15:23] <fred_> hi
[16:29] <wido> hi
[16:31] <wido> sagewk: Your btrfs test doesn't seem to work
[16:31] <wido> still seeing: "WARNING: at /build/buildd/linux-lts-backport-natty-2.6.37/fs/btrfs/inode.c:2143 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()"
[16:31] <wido> saw that message at 12:00 here today
[16:47] <wido> sagewk / yehudasa: you might be interested in: http://lists.wpkg.org/pipermail/sheepdog/2010-November/000761.html
[16:52] <sagewk> jantje: what process were you stracing?
[17:25] * fred_ (~fred@80-219-183-100.dclient.hispeed.ch) Quit (Quit: Leaving)
[17:27] <jantje> sagewk: cc1, it's at the top of the strace log
[17:27] <jantje> or the ioctl stuff?
[17:28] <sagewk> the ioctl /pipe errors above..
[17:28] <jantje> I did send you mail
[17:29] <sagewk> oh right, looking now
[17:30] <sagewk> oh, i see the cc error...
[17:30] <sagewk> 11034 stat64("..", {st_mode=S_IFDIR|0755, st_size=2177111503, ...}) = 0
[17:30] <sagewk> it's stating the directory and getting a large directory file size (because of the recursive stats)
[17:30] <sagewk> try mounting with 'norstat'
[17:31] <sagewk> er, 'norbytes' that is
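A minimal C sketch of the 32-bit overflow sagewk diagnoses above; this is not Ceph code and the compile command and output strings are invented. With recursive byte accounting in effect, a directory's st_size is the total size of everything beneath it, which easily exceeds 2^31-1, and EOVERFLOW ("Value too large for defined data type") is exactly what legacy 32-bit interfaces return when a value does not fit; mounting the client with the norbytes option stops recursive bytes from being reported as the directory size.

    /*
     * Sketch only: shows why a 32-bit client chokes on Ceph's recursive
     * directory sizes.  Build on the 32-bit client with large-file support:
     *   gcc -D_FILE_OFFSET_BITS=64 -o dirsize dirsize.c
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "..";
        struct stat st;

        if (stat(path, &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("%s: st_size = %lld\n", path, (long long)st.st_size);

        /* With rbytes in effect, st_size of a directory is the recursive
         * byte count, e.g. the 2177111503 seen in the strace above.
         * Anything above 2^31-1 cannot be represented by interfaces
         * compiled without large-file support, which fail with EOVERFLOW. */
        if (st.st_size > INT32_MAX)
            printf("too large for a 32-bit off_t -> EOVERFLOW on legacy calls\n");
        return 0;
    }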
[17:32] <jantje> it's crappy that i didn't pay attention to which process it was
[17:32] <jantje> i dumped it with strace -ff
[17:33] <jantje> (but removed it ...)
[17:36] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[18:04] * greglap (~Adium@166.205.139.89) has joined #ceph
[18:13] * Yoric_ (~David@213.144.210.93) has joined #ceph
[18:13] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[18:13] * Yoric_ is now known as Yoric
[18:24] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[18:57] * sjust (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:58] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:17] * greglap (~Adium@166.205.139.89) Quit (Ping timeout: 480 seconds)
[20:25] <wido> When trying to start today's unstable, all my OSDs go down with: http://pastebin.com/ZnSQmWWv
[20:25] <wido> New issue? Or known issue?
[20:25] <wido> Otherwise I'll create a new issue in the tracker
[20:26] <sagewk> new issue
[20:26] <sagewk> hmm
[20:27] <wido> They go down instantly
[20:27] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:27] <wido> Btw, are you busy today? I've been able to get some budget to buy new hardware for a cluster, would like to talk about the possible h/w configs
[20:28] <sagewk> sure
[20:28] <wido> Shall I create a new issue first for this crash?
[20:28] <sagewk> yeah
[20:28] <sagewk> mention a node or two that's crashing
[20:34] <wido> sagewk: yes, node07 seems to be having an issue too
[20:35] <wido> oh, I get it :-)
[20:35] <wido> yes, I'll do that
[21:11] <failboat> o hai
[21:12] <failboat> how's it going?
[21:26] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:36] * monrad-51468 (~mmk@domitian.tdx.dk) Quit (Ping timeout: 480 seconds)
[21:36] <gregaf> failboat: not bad, anything we can do for you?
[21:38] * monrad-51468 (~mmk@domitian.tdx.dk) has joined #ceph
[21:38] <wido> sagewk: you've got some time right now?
[21:38] <gregaf> wido: he's on a phone call right now, we'll let you know when he gets back :)
[21:39] <wido> ok, great :-)
[21:44] * Jiaju (~jjzhang@222.126.194.154) Quit (Ping timeout: 480 seconds)
[21:46] <sagewk> wido: back
[21:48] * monrad-51468 (~mmk@domitian.tdx.dk) Quit (Ping timeout: 480 seconds)
[21:49] * Jiaju (~jjzhang@222.126.194.154) has joined #ceph
[21:49] <wido> sagewk: ok, great
[21:49] <wido> got some time about the hardware?
[21:50] <sagewk> yeah
[21:51] <wido> Great, for 2011 I've been able to get a budget to take things a step further with testing
[21:51] <wido> my current setup lacks RAM and disks, and doesn't perform that well
[21:51] <wido> I know a monitor doesn't need that much power and an MDS requires a lot of RAM and a fast CPU, but what about the OSDs?
[21:52] <wido> to reduce power usage I've been looking into the Atom based solutions of SuperMicro: http://www.supermicro.com/products/motherboard/ATOM/ICH9/X7SPA.cfm?typ=H&IPMI=Y
[21:52] <wido> The total of 4GB RAM is not that much, but I'm thinking about attaching four 2TB disks for the OSDs and an SSD for journaling and extra swap
[21:53] <wido> do you think that would be sufficient? 8TB of storage with an SSD vs 4GB of RAM and a dual-core Atom
[21:53] <wido> On my current systems I see 1.8% memory usage on a 4GB machine, so it should fit, but sometimes OSDs tend to use a lot of memory, I assume during recovery operations
[21:53] <sagewk> that should work ok, i would think. 4gb ram is on the low side tho
[21:54] <wido> Yes, my point, 4GB is low indeed, but it seems to be an Atom limitation
[21:54] <sagewk> are those cpus 32 or 64 bit?
[21:54] <wido> 64 bit
[21:54] <wido> I've got some of them running, they are pretty fast and work fine
[21:55] <sagewk> how many are you getting?
[21:55] <wido> Power usage is somewhere around 20 ~ 30W
[21:55] <wido> In terms of money or machines?
[21:55] <sagewk> yeah esp with an ssh for swap i suspect it'll be fine.
[21:55] <sagewk> machines
[21:55] <wido> I hope somewhere around 30 ~ 40
[21:55] <sagewk> s/ssh/ssd/
[21:56] <wido> The SSD is 80GB, so about 4GB of journaling and the rest as SWAP, so about 70GB
[21:56] <wido> Due to the low memory, would you recommend 4 OSDs on the machine, or using btrfs striping? I personally lean towards the btrfs method; I know it poses extra risks, but it reduces the number of processes
[21:56] <wido> thus saves memory
[21:57] <wido> One disk crash would mean you lose 8TB of data instead of 2TB, that's the risk
[21:58] * Jiaju (~jjzhang@222.126.194.154) Quit (Ping timeout: 480 seconds)
[21:59] <sagewk> wido: i would try it both ways, actually. if the cpu can take it, 4 cosd's would be nicer, and i don't think there's that much per-instance overhead you'd need to worry about (that wouldn't be incurred anyway with 1 osd managing 4x the data)
[22:00] <wido> sagewk: Yes, but was just brainstorming about it
[22:00] <gregaf> with 4 OSDs on one SSD journal you might run into issues with your journal write speed handicapping your total performance
[22:01] <sagewk> i'd be very interested in seeing what the cpu load is as well (with both configurations).
[22:01] <gregaf> a 2-TB disk gets a lot more than 25% of the performance of most 80GB SSDs :)
[22:02] <wido> sagewk: Yes, me too. But I'm not so worried about the Atom, more about the RAM
[22:03] <wido> gregaf: Yes, but I could always add a second SSD. Most mainboards these days have 6 S-ATA ports
[22:03] <wido> so 4 disks and 2 SSDs should be possible
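A rough worked version of gregaf's concern, using throughput figures that come up later in this log rather than measured numbers: an SSD writing around 250MB/sec, shared as a journal by 4 OSDs, leaves roughly 62MB/sec of journal bandwidth per OSD, while a single modern 2TB SATA disk can stream on the order of 100MB/sec sequentially, so the shared journal rather than the disks can become the ceiling for sustained writes.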
[22:03] <gregaf> I don't think the RAM ought to be a problem if you've got an SSD for swap space
[22:03] <gregaf> you should only be running up near that limit during recovery anyway, right?
[22:04] <wido> Yes, I hope so
[22:04] <wido> Under normal situations an OSD uses < 100MB
[22:05] <wido> I've done some calculations today, 44 of those machines in one rack would use about 3kW of power and give 325.6TB of storage
[22:05] <wido> I'm thinking that 2TB disks might be a bit overkill, but we'll see
[22:06] <wido> But with a decent cluster i'm able to do much better testing and see how everything performs
[22:06] <wido> Then I could even apply for "Quality Assurance and Performance Testing" ;)
[22:07] <sagewk> most definitely :)
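A quick check of those rack numbers, assuming the 20-30W per board wido quotes plus a handful of watts for each of the four disks: 44 nodes at roughly 65-70W each does come to about 3kW, and 44 x 4 x 2TB is 352TB raw, so the 325.6TB figure presumably already accounts for formatted/usable capacity.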
[22:07] * monrad-51468 (~mmk@domitian.tdx.dk) has joined #ceph
[22:07] <sagewk> have to run.. did i answer your questions?
[22:07] <wido> sagewk: Yes, thanks
[22:08] <sagewk> np
[22:08] <wido> It's what I thought
[22:08] <wido> But time will tell if it works as I hope
[22:10] <wido> gregaf: About the OSDs saturating the SSD, do you know which blocksize Ceph writes its journal with?
[22:11] <gregaf> wido: nope, sorry
[22:12] <gregaf> I presume it's configurable, or could be, if it turns out to make a difference
[22:12] <gregaf> why?
[22:12] <wido> Well, SSDs perform quite differently depending on the blocksize
[22:15] <gregaf> ah, of course
[22:16] <wido> gregaf: for example, it depends on the SSD you have, how well it performs with a 2k blocksize
[22:17] <gregaf> well for the file-based journal it uses whatever blocksize is reported by a stat on the file
[22:17] <gregaf> so I'm not sure if the size is exposed as an option right now but it certainly can be
[22:18] <wido> gregaf: I mean how large the blocks it writes are, fwrite(, size);
[22:19] <wido> But that is something to look up in the code
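A hypothetical illustration of what gregaf describes above, picking the journal write granularity from the block size a stat on the journal file reports; the function name and the path are invented, and this is a sketch, not the actual Ceph implementation.

    #include <stdio.h>
    #include <sys/stat.h>

    /* Ask the filesystem what I/O size it prefers for the journal file;
     * journal writes would then be padded/aligned to this boundary. */
    static long journal_block_size(const char *journal_path)
    {
        struct stat st;

        if (stat(journal_path, &st) != 0)
            return 4096;            /* fall back to a common page size */
        return (long)st.st_blksize; /* filesystem's preferred block size */
    }

    int main(void)
    {
        /* /dev/shm/osd0.journal is just an example path, echoing the
         * tmpfs-backed journal mentioned earlier in this log. */
        printf("journal block size: %ld bytes\n",
               journal_block_size("/dev/shm/osd0.journal"));
        return 0;
    }

On an SSD this value matters because, as wido notes, small-block write performance varies a lot between drives.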
[22:19] <wido> But a good SSD gives you about 250MB/sec write
[22:20] <gregaf> depending on which ones you use, yeah
[22:20] <wido> So with a Gbit link to an OSD, the SSD outperforms the network link
[22:20] <gregaf> Intel's x25-m doesn't though, if you're looking at them
[22:20] <wido> No, but the new X25-M, on 25nm should give you about 170MB/sec, still outperforming a Gbit link
[22:21] <gregaf> heh, I don't know anything about their unreleased models :)
[22:21] <wido> In most cases even outperforming a 2Gbit bonded link
[22:21] <gregaf> anyway, if you're using Gb ethernet it's probably not a big deal, I just wasn't sure what network interconnect you were looking at
[22:21] <gregaf> we're starting to hear from more people with 10Gb or IP over IB
[22:22] <wido> Looking at 2x 1Gbit with bonding, but with good loadbalancing it will give you about 1.5Gbit
[22:22] <wido> Not going to use 10Gb on a Atom ;)
[22:23] <wido> So most SSDs will outperform the network link; when using 10Gb it's a different story indeed :-) But then, if you've got the budget for 10Gb, you'll also buy more expensive nodes
[22:24] <gregaf> I just wanted to let you know it was an issue — I'm used to thinking about the x25-m I have in my desktop which only does ~70MB
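For reference on the link-speed comparisons above: 1Gbit/s is 125MB/sec of raw bandwidth (roughly 110-118MB/sec usable over TCP), a 2x1Gbit bond carrying ~1.5Gbit/s is about 190MB/sec, and 10Gbit/s is 1250MB/sec; so a ~250MB/sec SSD outruns anything short of 10GbE, a ~170MB/sec drive still beats a single gigabit link but not the bond, and the ~70MB/sec gregaf sees on his x25-m sits below even one gigabit link.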
[22:25] <gregaf> and depending on how you want to handle your replication/CRUSH map you might have replicas on the same machine when you're running multiple disks/daemons on one machine, which could make sense as long as you have at least one copy on a different machine (for reducing network traffic)
[22:26] <wido> Yes, I was thinking about storing every replica on another OSD
[22:27] <wido> But it would still require a lot of testing, that's why I want to buy a new setup
[22:28] <gregaf> I don't want to try and derail you, we just got a bug report from somebody who was having issues partly because their journal couldn't keep up with the rest of their filesystem, so I'm a bit paranoid about it ;)
[22:28] <wido> Yes, I'm subscribed to that issue ;)
[22:28] <wido> Also to the issue with the 208 nodes
[22:31] <gregaf> yeah, I just emailed that guy since I don't think he's watching the tracker any more
[22:32] <gregaf> I have a suspicion that he didn't apply the rados patch, and that any other benchmarks he ran were also accessing the exact same files at the same time, but we really need to make sure!
[22:34] <wido> gregaf: that's a shame, since 208 nodes is a big cluster
[22:34] <wido> I thought the biggest was somewhere around 50 nodes
[22:35] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[22:35] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[22:35] <gregaf> my thoughts exactly ;)
[22:51] <wido> gregaf: about that issue with the journal, you mean #531 ?
[22:52] <gregaf> uh, yep!
[22:53] <wido> You said "their journal couldn't keep up with the rest of their filesystem"
[22:54] <wido> so I thought it was a H/W issue, but it's a software issue, from what I make of it?
[22:54] <gregaf> it's a combination of several issues
[22:54] <gregaf> we haven't identified all of them yet
[22:56] <wido> ok, but the best thing is to separate the I/O path of the journal from that of the data dir
[22:56] <gregaf> their average write speed seems to be the speed of their journal (80MB/s) divided by their replication level
[22:57] <gregaf> and something strange is happening when their journal fills up (which is presumably happening because their general storage is lots faster, and it's sending out the appropriate ack for that)
[22:57] <wido> Where did you pick up that the journal does 80MB/sec? Can't find it in the issue report
[22:57] <gregaf> they're using consumer MLC SSDs
[22:58] <wido> aha, ok, and use HW RAID for the OSD storage
[22:58] <gregaf> we had some off-list discussions :)
[22:58] <gregaf> yeah, very large HW RAID
[22:58] <wido> your journal should always be faster than the OSD storage I assume
[22:58] <wido> to have any performance benefit
[22:59] <gregaf> it depends on the setup you're using, and your usage
[22:59] <gregaf> as always ;)
[22:59] <gregaf> SSD journals, even if slower than the storage, can still provide a big net benefit due to their super-low latency
[22:59] <wido> depends on how you define "fast"
[23:00] <wido> ;)
[23:00] <wido> But I get the point, a journal simply has to accept the write as fast as possible
[23:00] <gregaf> and if you're using parallel journaling it shouldn't even cause any issues (if the journal falls far enough behind it can toss out sufficiently-secure writes), although it's not as well-tested as write-ahead is
[23:01] <wido> parallel, you mean the journal on an SSD and data on an HDD?
[23:01] <gregaf> nope, it's an algorithmic strategy
[23:02] <gregaf> the default until very recently was write-ahead, where everything goes into the journal before it's written to the storage
[23:02] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[23:02] <gregaf> it's still the default on most FSes, but if you're using btrfs the default is now parallel, where it writes to the journal and the storage at the same time
[23:02] <wido> Ah, but I was talking about the OSD's journal, that is something different
[23:03] <gregaf> nope, that's the journal I mean too :)
[23:03] <wido> btrfs would just do its journaling on the HDD
[23:03] <wido> while the OSD does its journaling on the SSD
[23:04] <wido> But, I'm going afk, enough for today
[23:04] <gregaf> for data safety in the case of a power failure, the OSD always needs to have some known-consistent state for its store, and then a journal of all changes since that known-consistent state
[23:04] <wido> ok, one last thing
[23:04] <gregaf> generally, to maintain that invariant the OSD journal needs to be written to before it writes to storage
[23:05] <wido> ok, get it
[23:05] <wido> thanks!
[23:05] <gregaf> but with btrfs it can get a known-consistent state using the built-in snapshots, so it can take a snapshot, then dispatch a write to the store and the journal at the same time, and if the power fails it can roll back to that snapshot and then read the journal :)
[23:05] <gregaf> (these are btrfs snapshots, of course)
[23:05] <gregaf> np :)
[23:05] <wido> yes, pretty nice though
[23:05] <wido> ttyl!
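A minimal C sketch of the two journaling modes gregaf explains above (write-ahead versus parallel on btrfs). All function names are invented stand-ins, the snapshot is shown per write purely for illustration (in practice it would be periodic), and this is a conceptual outline, not Ceph's actual code.

    #include <stdio.h>

    /* Stand-ins for the real persistence operations. */
    static void journal_append(const char *op) { printf("journal <- %s\n", op); }
    static void store_apply(const char *op)    { printf("store   <- %s\n", op); }
    static void btrfs_snapshot(void)           { printf("snapshot: known-consistent point\n"); }

    /* Write-ahead: the operation must be durable in the journal before it is
     * applied to the object store, so after a crash the store can always be
     * brought forward by replaying the journal.  Works on any filesystem. */
    static void writeahead_submit(const char *op)
    {
        journal_append(op);   /* 1. persist to the journal first */
        store_apply(op);      /* 2. only then touch the store    */
    }

    /* Parallel: a btrfs snapshot provides the known-consistent state, so the
     * journal write and the store write can be dispatched at the same time;
     * after a power failure the store is rolled back to the snapshot and the
     * journal is replayed from there. */
    static void parallel_submit(const char *op)
    {
        btrfs_snapshot();     /* in reality taken periodically, not per write */
        journal_append(op);   /* dispatched together ... */
        store_apply(op);      /* ... with the store write */
    }

    int main(void)
    {
        writeahead_submit("write A");
        parallel_submit("write B");
        return 0;
    }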
[23:10] * ajnelson (~ajnelson@soenat3.cse.ucsc.edu) has joined #ceph
[23:26] * MarkN (~nathan@59.167.240.178) Quit (Remote host closed the connection)
[23:27] * MarkN (~nathan@59.167.240.178) has joined #ceph
[23:28] * allsystemsarego (~allsystem@188.26.33.21) Quit (Quit: Leaving)
[23:48] <ajnelson> gregaf: Howdy. Do you have a minute to talk about Hadoop patch logistics? (This is semi-related to Ceph.)
[23:48] <gregaf> ajnelson: sure!
[23:48] <gregaf> I haven't done anything with Hadoop besides talk to you guys in about a year though, fyi :)
[23:48] <ajnelson> Ah, ok. This was more a question on submitting a Hadoop patch;
[23:49] <ajnelson> I think its first round is going to ceph-devel.
[23:49] <ajnelson> The hangup was that my summer job and thesis work ate my time before I could finish running the test suites,
[23:49] <gregaf> ah
[23:49] <gregaf> I don't recall everything that's in your patch?
[23:49] <ajnelson> and the dev cluster entered some kind of unavailable period. Re-imaging over summer break, etc.
[23:50] <ajnelson> Patch has a locality adaptation for Ceph as an underlying file system for Hadoop;
[23:50] <ajnelson> Ceph acts as a raw file system,
[23:50] <gregaf> it's mostly Java extending the Hadoop FileSystem interface, right?
[23:50] <ajnelson> but Hadoop uses the block location ioctl.
[23:50] <ajnelson> The Java is the raw file system interface, which is higher in the class hierarchy than HDFS.
[23:50] <gregaf> with some hooks to use various ioctls on the Ceph mount?
[23:51] <ajnelson> Aye, it's a set of hooks, with some JNI that make it Linux-dependent.
[23:51] <gregaf> yeah
[23:51] <ajnelson> Maybe it'd work in OS X, I haven't tried yet...
[23:52] <gregaf> I'm just not sure how much feedback you'll get from the ceph-devel list, but it's perfectly fine to post it :)
[23:52] <gregaf> I'm not sure cfuse even builds in OS X any more, and that doesn't support the ioctls AFAIK, no point worrying about that
[23:52] <ajnelson> Right.
[23:52] <ajnelson> My bigger concern is, if it's going to be submitted to Hadoop, how much of that test suite does it have to pass.
[23:53] <ajnelson> I've tried running `ant test-core`,
[23:53] <ajnelson> and that falls on its face for even the 0.20.2 build I had.
[23:53] <ajnelson> *0.20.2 built fresh from the git tag.
[23:53] <gregaf> well, I never got the userspace patch accepted, so I'm certainly no expert on their process
[23:53] <ajnelson> With my patch, I don't seem to fail the tests in any spectacular way...but I can't get them to run to completion.
[23:53] <gregaf> that's Ceph .20.2?
[23:54] <ajnelson> No, Hadop 0.20.2
[23:54] <ajnelson> *Hadoop
[23:54] <gregaf> oh, didn't know they were using git now
[23:54] <ajnelson> I ran with the last release, 0.23
[23:54] <ajnelson> They have git on Apache's site and a github page too, for some reason.
[23:54] <gregaf> I think when I submitted my FS they just wanted some tests for the code I was submitting
[23:55] <gregaf> though obviously it'd be good to figure out why the unit tests are failing
[23:55] <ajnelson> Well,
[23:55] <ajnelson> I'm not sure the unit tests failing are related at all to my code, because they failed on the release as well.
[23:55] <gregaf> ah
[23:55] <gregaf> well you should tell them that, then
[23:55] <gregaf> they did have a number of tests that were pretty easy to get to fail in various ways when I was looking at it
[23:55] <ajnelson> Also, the tests on my patch are failing just because my local Ceph service (just running everything on one box) doesn't last. Something dies.
[23:56] <gregaf> in particular their HDFS and LocalFileSystem unit tests actually had different expectations
[23:56] <gregaf> if Ceph is failing, that's something we need to look into
[23:57] <ajnelson> Ok.
[23:57] <gregaf> any particular failure scenarios you were seeing?
[23:58] <ajnelson> Unsure. I just ran it once today, and that it failed at all (like it did before summer) made me think I should just get you on the horn.
[23:58] <ajnelson> So, I'm not sure how to classify this failure.
[23:58] <ajnelson> It just died in one of the Hadoop unit tests - copying files.
[23:59] <gregaf> which daemon?
[23:59] <ajnelson> Unsure. I have the logs handy. What would be easiest to grep for?
[23:59] <ajnelson> (16MB of logs.)
[23:59] <ajnelson> Oh
[23:59] <gregaf> well if there was a daemon crash it should say at the tail end of that daemon's log
[23:59] <gregaf> and maybe even a core dump if the size limits were set high enough
[23:59] <ajnelson> And I was using the userspace client with cfuse instead of the kernel client.

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.