#ceph IRC Log


IRC Log for 2011-04-20

Timestamps are in GMT/BST.

[0:00] <djlee> i checked the controllers' speed too, especially when formatting 12 disks at once, e.g., i get about 1000MB/s max for highspec,
[0:00] <djlee> but about 250MB/s max in lowspec,
[0:01] <Tv> gregaf: ffsb is part of our autotest suite
[0:02] <Tv> gregaf: as far as i know it succeeds fine
[0:02] <Tv> gregaf: we have not yet stored any metrics from it
[0:02] <Tv> because, well, let's fix the hang bugs first ;)
[0:08] <gregaf> djlee: looks like your disk controllers are much worse for the atoms, and that would cause problems in random read scenarios
[0:08] <djlee> gregaf: how though? given that each osd is pulling out a mere 10MB/s
[0:09] <gregaf> djlee: well it depends a lot on the specifics of your test
[0:09] <gregaf> it could be that they're really providing 30MB/s but it's not talking to them all at the same time
[0:10] <gregaf> or it could just be that you really are doing small random reads, and it's actually not hard to get a hard drive down to .5MB/s with small random reads...
[0:10] <djlee> yeah exactly, thats why i have the realistic ffsb
[0:10] <djlee> haha
[0:11] <djlee> ok, i think i should first get it right with both small-file-size, and big-file size for ffsb
[0:11] <gregaf> djlee: you could try running a scaled-down version of it on each type of box and see how it changes
[0:11] <gregaf> (without Ceph at all, I mean)
[0:12] <gregaf> it may be that Ceph is just scaling up the performance of the benchmark on its boxes
[0:14] * cmccabe1 (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[0:15] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[0:49] <pombreda> howdy :)
[0:50] <pombreda> sage, gregaf: any update on the playground status?
[0:59] <sagewk> pombreda: sigh...
[0:59] <sagewk> "neglected?"
[1:00] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[1:10] <gregaf> Tv: this looks to have succeeded?
[1:11] <gregaf> http://autotest.ceph.newdream.net/afe/#tab_id=view_job&object_id=515
[1:11] <Tv> gregaf: yup
[1:11] <Tv> gregaf: now the question is, did the box hang..
[1:11] <gregaf> how does it hang if the test succeeded?
[1:11] <Tv> it did not
[1:12] <Tv> gregaf: apparently the test didn't sync, so it hung *after* the test when syncing for a reboot
[1:12] <Tv> which makes me think all tests should probably sync.. ;)
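Tv's "all tests should probably sync" suggestion can be sketched as a tiny teardown step. This is a hypothetical autotest-style helper, not actual code from the suite:

```python
import os

def teardown():
    # Hypothetical teardown step: flush dirty pages to disk before the
    # harness reboots the machine, so a hang in sync(2) surfaces inside
    # the test run rather than during the post-test reboot.
    os.sync()

teardown()
```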
[1:12] <gregaf> ah, okay
[1:12] <gregaf> still, apparently it didn't have a problem with lchown
[1:13] <Tv> gregaf: it might be racy
[1:14] <Tv> gregaf: but we can hope..
[1:14] <Tv> or maybe we did read the test wrong in the first place; that'd be a good thing
[1:14] <Tv> i stared at it so long my eyes crossed
[1:15] <gregaf> heh, I didn't look quite closely enough
[1:15] <gregaf> it also might be that it's racy but it's a general race rather than a specific one
[1:15] <gregaf> and so we'll get it in the course of making the kclient happy with a multi-MDS system
[1:17] <Tv> yeah
[1:38] <sagewk> i wouldn't be surprised if it's a multimds thing.. it can be confusing getting inode updates from multiple sources. the version checks are subtle.
[1:38] <sagewk> there have been similar bugs in the past
[1:40] <gregaf> sagewk: yeah, but I've never been able to reproduce it using UML and a multi-MDS vstart, and this latest test on autotest didn't reproduce either
[1:40] <gregaf> I'm just guessing that it's a generic multi-mds issue rather than an lchown-specific one
[1:40] <sagewk> it might be very timing sensitive :(
[1:41] <sagewk> yeah
[1:41] <sagewk> i'm guessing it's any inode update when the client has caps for that inode from multiple mdss
[1:51] <Tv> so autotest hates the idea of me gzipping the log files :(
[1:54] <jjchen> Two questions regarding Rados placement groups: 1. Can Rados support placement group split right now? 2. Can we increase the number of placement groups for a pool? From the paper, it seems we can double the placement groups if needed. Thanks in advance for your help.
[1:56] <gregaf> jjchen: 1) not sure what you mean by this
[1:56] <gregaf> 2) yes, you can increase the number of placement groups
[1:56] <gregaf> it does not need to be by a factor of two
[1:57] <gregaf> http://ceph.newdream.net/wiki/Changing_the_number_of_PGs contains a VERY short introduction
[1:57] <gregaf> note that you can't shrink the number of PGs, though
[1:59] <jjchen> Thanks a lot. For 1), I mean, if there are too many objects mapped into a PG, is there a way to split that PG into 2 and move objects in the original Pg into the two new PGs?
[2:01] <gregaf> oh, there's not a way to do it directly; no
[2:01] <gregaf> you just have to up the number of PGs
[2:02] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:10] <jjchen> So if some PG becomes too hot, we can increase the number of PG in the pool. Hopefully, Rados will remap some objects into the new PGs to even out the load. In this case, I assume Rados has to move some objects from one PG to another. Is this done in an atomic fashion? While the object is being moved, can users still access the object? Thanks.
[2:18] * greglap (~Adium@ has joined #ceph
[2:21] <cmccabe1> greglap: I'm almost ready to submit the config observer stuff
[2:21] <cmccabe1> greglap: just testing a few last things here
[2:22] <cmccabe1> greglap: it should be fairly self-explanatory. If you attach an observer right after calling common_init, you'll get notifications whenever injectargs happens
[2:23] <greglap> cmccabe1: awesome
[2:23] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:24] <greglap> jjchen: the data is remapped into the new PGs in a fairly deterministic fashion — new PGs are created by splitting up existing ones into pieces based on how they hash
[2:25] <greglap> so the new PGs are initially located on the OSDs which already hold the data, then are migrated to the correct OSDs which they actually map to
[2:26] <greglap> I don't recall the specifics off the top of my head, but users can still access the object during most of that process — it will be inaccessible for a short period but it should nominally only be a few seconds, during which the access will just hang until it becomes readable
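The hash-based splitting greglap describes can be illustrated in Python. This is a sketch modeled on Ceph's `ceph_stable_mod` mapping, not the production placement code: when `pg_num` doubles, each object either keeps its PG or lands in a brand-new PG split off from the old one, which is why new PGs start out colocated with the data they inherit.

```python
def stable_mod(x: int, b: int, bmask: int) -> int:
    # Map hash value x into one of b buckets; bmask is
    # (next power of two >= b) - 1. Hash values that would land past
    # bucket b-1 fall back into the lower half, so growing b only
    # ever *splits* existing buckets.
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)

def pg_for(obj_hash: int, pg_num: int) -> int:
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return stable_mod(obj_hash, pg_num, bmask)

# Doubling pg_num from 8 to 16: every object either keeps its PG
# or moves to old_pg + 8, a fresh PG split off from the old one.
for h in range(1000):
    old, new = pg_for(h, 8), pg_for(h, 16)
    assert new in (old, old + 8)
```

Non-power-of-two values also work (every result stays below `pg_num`), which is why the increase does not need to be by a factor of two.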
[2:39] <jjchen> Thanks for the explanation. That makes sense. One more question: when a node goes down that holds the primary of a PG, if a client happens to access the PG at this moment, what happens? Will it get an error back or be rerouted to one of the secondary copies? I assume it will get an error back and has to retry later until the reconfig finishes and by then the client will get an updated map that contains the new primary info. If this is the
[2:47] <cmccabe1> are ProfLoggers always enabled?
[2:47] <cmccabe1> I'm searching through the code, but not seeing anything like a disable flag
[2:47] <cmccabe1> oh, never mind
[2:49] * joshd1 (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:54] <greglap> jjchen: the request hangs until the object is accessible somewhere
[2:55] <greglap> if an OSD goes down then it shouldn't take more than 30 seconds for it to get flagged as down, and then a new OSDMap will get pushed out with one of the replicas as the primary
[3:04] * cmccabe1 (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:15] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[3:41] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[4:18] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[4:29] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[5:19] * lxo (~aoliva@ has joined #ceph
[5:46] * lxo (~aoliva@ Quit (Quit: later)
[7:03] * djlee (~dlee064@des152.esc.auckland.ac.nz) Quit (Quit: Leaving.)
[7:59] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:14] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[8:25] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:20] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:50] * allsystemsarego (~allsystem@ has joined #ceph
[9:58] * Yoric (~David@87-231-38-145.rev.numericable.fr) has joined #ceph
[11:15] * Yulya (~Yulya@ip-95-220-130-133.bb.netbynet.ru) has joined #ceph
[11:16] <Yulya> hello guys
[11:16] <Yulya> how can i define default stripe_size and stripe_count?
[11:18] <Yulya> should i do it during mkcephfs or in ceph.conf or with cephfs tool after mount?
[11:24] * Jiaju (~jjzhang@ has joined #ceph
[11:26] * macana (~ml.macana@ has joined #ceph
[12:37] * sakib (~sakib@ has joined #ceph
[12:40] * sakib (~sakib@ Quit ()
[13:07] * MarkN (~nathan@ has joined #ceph
[13:35] * jantje_ (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[13:38] * jantje (~jan@paranoid.nl) has joined #ceph
[16:37] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:17] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:24] * st-3107 (foobar@ has joined #ceph
[17:26] * st-3107 (foobar@ Quit ()
[17:33] * st-3307 (foobar@ has joined #ceph
[17:34] * st-3307 (foobar@ Quit ()
[17:34] * st-3326 (foobar@ has joined #ceph
[17:34] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:34] * st-3326 (foobar@ Quit ()
[17:49] * Yulya (~Yulya@ip-95-220-130-133.bb.netbynet.ru) Quit (Quit: leaving)
[17:50] * Yulya (~Yulya@ip-95-220-130-133.bb.netbynet.ru) has joined #ceph
[18:00] * greglap (~Adium@ has joined #ceph
[18:07] * lxo (~aoliva@ has joined #ceph
[18:49] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[18:54] * cmccabe (~cmccabe@ has joined #ceph
[18:57] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:08] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:44] <lxo> uh oh. it looks like my cluster lost a whole subtree while I was away. although it presumably had completed replicating everything to the 3 osds before I left, two of the osds experienced btrfs instability and went down
[19:45] <lxo> some of the subtrees could be recovered simply by accessing them by name, even though they didn't show up in ls, but one was really gone, and the space taken up by files in it was not recovered
[19:46] <lxo> what kind of info would help diagnose/fix this? must I restart from scratch to get the space back, or can something else do it?
[19:46] <gregaf> Ixo: you lost a whole subtree of the fs?
[19:47] <lxo> yup
[19:47] <gregaf> and you had 3 OSDs?
[19:47] <gregaf> 2x or 3x replication?
[19:47] <lxo> 3 osds, 3x replication
[19:48] <lxo> I created the filesystem with one of the osds down, then brought it up. when I left it was just about to complete syncing it up
[19:48] <gregaf> was the metadata also updated to 3x?
[19:48] <lxo> yup. all 4 pools were 3x
[19:48] <gregaf> and when you say 2 of them were brought down by btrfs issues, you mean transient ones, right?
[19:49] <gregaf> ie the on-disk filesystem is still okay?
[19:49] <lxo> yeah, restarting the machines fixed it
[19:49] <lxo> actually, only one of them was a btrfs issue, I recall now. the other was a disconnected network cable. the machine remained up, but disconnected from the two others
[19:49] <gregaf> heh
[19:50] <gregaf> that really shouldn't be able to lose data then....*ponders*
[19:50] <lxo> prolly shouldn't keep the servers in the living room ;-)
[19:50] <gregaf> what's ceph -s output?
[19:52] <lxo> http://pastebin.com/tRZv041A
[19:53] <gregaf> so the cluster thinks itself perfectly healthy...
[19:54] <gregaf> how much data does the ceph fs think it's holding?
[19:55] <gregaf> ls -lha on the root dir should give it to you
[19:56] <cmccabe> should Client::unmount really be able to fail?
[19:56] <cmccabe> I think having close/unmount/shutdown functions that can fail is usually a bad idea. What is the user supposed to *do* if unmount fails?
[19:57] <gregaf> cmccabe: doesn't look like it ever returns anything except 0 right now anyway
[19:57] <cmccabe> gregaf: yeah
[19:58] <cmccabe> I'm just wondering if libceph's unmount should return an error code or not
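One way to frame the question cmccabe raises is to make teardown infallible from the caller's perspective: log internal failures and finish anyway, since a caller has no meaningful recovery action for a failed unmount. A toy Python sketch of that API shape (illustrative only, not the real libceph client):

```python
import logging

class ToyClient:
    # Illustrative only; not the real libceph client.
    def __init__(self):
        self.mounted = True

    def _flush(self) -> None:
        pass  # stand-in for cleanup work that could raise OSError

    def unmount(self) -> None:
        # Teardown never "fails" for the caller: internal errors are
        # logged and shutdown proceeds, because there is nothing
        # sensible the caller could do with an unmount error code.
        try:
            self._flush()
        except OSError:
            logging.warning("flush failed during unmount; continuing")
        self.mounted = False

c = ToyClient()
c.unmount()
assert not c.mounted
```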
[19:58] <lxo> yep. the only unusual messages I saw were of this form: 2011-04-20 13:58:40.755504 7f7f083dc700 log [ERR] : dir 10000000615.10000000615
[19:58] <lxo> object missing on disk; some files may be lost
[19:58] <gregaf> Ixo: okay, that makes sense in terms of the symptoms but I don't have the slightest idea how it could have gotten lost...
[19:59] <lxo> 922GB
[19:59] <gregaf> so we're looking at a few GB lost
[19:59] <sagewk> oh, i forgot to mention, i'll be out tomorrow!
[20:00] <gregaf> probably what happened is that the data is all still there but as you can see we lost some of the directories, and those contain all the metadata for every inode they contain
[20:00] <gregaf> we will eventually have fsck tools to repair stuff like that but we don't right now...
[20:03] <lxo> gregaf, nah, I think none of the data was actually lost, and now that I'm recreating the tree, I see some reports of inconsistent rstats between parent and child, too
[20:03] <gregaf> recreating the tree?
[20:04] <lxo> yeah, rsyncing again from the source
[20:04] <gregaf> ah, okay
[20:04] <gregaf> rstat issues aren't terribly surprising in this case
[20:05] <gregaf> all I can come up with is that maybe there actually was data loss due to the btrfs issue
[20:05] <gregaf> and then when that node rebooted it was the primary for some PGs and our recovery code has some bugs that prevented it from noticing the replicas had the lost objects
[20:06] <gregaf> what you could do is go look at the OSD stores and see if the lost directory objects exist in any of them
[20:06] <lxo> yeah, just did that. I see one of the inodes that are reported as missing
[20:06] <gregaf> if they do you could shut down the cluster, move them by hand onto any OSDs that are missing them, and then restart
[20:07] <gregaf> it should all be good at that point
[20:07] <gregaf> although running the rsync on top of some missing data might have confused things, not sure
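The by-hand repair gregaf outlines (whole cluster stopped, copy the object files from an OSD store that has them into one that is missing them) might look like the sketch below. The `current/<pg>_head/<name>_head` layout matches the paths quoted elsewhere in this log, but treat the paths and names as illustrative, not a supported procedure:

```python
import shutil
from pathlib import Path

def copy_object(src_osd: Path, dst_osd: Path, pg: str, name: str) -> Path:
    """Copy one object file, e.g. current/1.77_head/1000000638a.00000000_head,
    from a healthy OSD's store into one that is missing it.
    Only safe with every daemon in the cluster shut down."""
    rel = Path("current") / f"{pg}_head" / f"{name}_head"
    dst = dst_osd / rel
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_osd / rel, dst)  # copy2 preserves mtime as well
    return dst
```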
[20:07] <gregaf> do you have any logging from the period when things got broken?
[20:07] <lxo> oddly, it is present on all 3 osds
[20:07] <gregaf> has your rsync finished recreating everything?
[20:08] <lxo> no. the actual rsync hasn't even started, it's still creating directories
[20:08] <gregaf> the object names are deterministic based on inode, although I don't remember if the inode selection is deterministic or not
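Those deterministic names follow the pattern visible in this log (e.g. `1000000638a.00000000_head` on disk): the inode in hex, a dot, then the object index zero-padded to eight hex digits. A sketch; the 4 MB default object size is an assumption, since layouts are tunable:

```python
def object_name(ino: int, objno: int) -> str:
    # File data object N of inode I is named "<I in hex>.<N as 8 hex digits>";
    # on the OSD's disk the filename gains a _head (snapshot) suffix.
    return f"{ino:x}.{objno:08x}"

def objno_for_offset(offset: int, object_size: int = 4 << 20) -> int:
    # With 4 MB objects (an assumption; layouts are tunable),
    # a byte offset lands in object offset // object_size.
    return offset // object_size

assert object_name(0x1000000638a, 0) == "1000000638a.00000000"
```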
[20:08] * Meths_ is now known as Meths
[20:08] <lxo> it being a shell script that goes sort of find -type d | xargs mkdir, to speed up the subsequent rsync
[20:09] <gregaf> heh, nice
[20:09] <lxo> inodes are sort of sequential AFAICT
[20:09] <gregaf> yeah they definitely mostly are, so I'm wondering if maybe you've just recreated them and that's why they exist now
[20:09] <gregaf> (without any of the old data in them at this point)
[20:09] <lxo> no, I haven't created files, and the dir inode I'm looking at definitely contains files
[20:10] <gregaf> hmm
[20:10] <gregaf> well, you could try just shutting down the whole cluster and starting it back up again and see if it detects stuff properly
[20:10] <gregaf> we'd love to see whatever logs you have
[20:11] <lxo> that's an interesting idea. maybe I should bring down the one osd that failed, restart the others and see how that goes
[20:11] <gregaf> not sure we can do anything with the logs right away but we'll want to look at them at some point certainly to see if we can figure out what broke
[20:11] <lxo> I have all the logs, but just basic (default IIRC) logging
[20:12] <lxo> mds, osd, mon, the whole shebang?
[20:12] <gregaf> it'll still give us a little bit of info
[20:12] <gregaf> yeah
[20:12] <gregaf> at least in terms of where the dramatic errors are originating
[20:12] <lxo> woah, it looks like logrotate doesn't see through symlinks
[20:13] <lxo> 3 GB of logs for mon.0 alone
[20:14] <lxo> I'll pack it up somewhere
[20:16] <lxo> anything else I should save? say, one of the mon trees?
[20:17] <gregaf> hmmm
[20:18] <gregaf> if it's not too large, yeah
[20:22] <lxo> hmm. very odd. this one directory _head has been unchanged for 9 days on the two servers I used to create the whole tree, and for 7 days on the third server. it couldn't possibly have disappeared today
[20:23] <gregaf> which _head?
[20:24] <lxo> the dir inode, current/1.77_head/1000000638a.00000000_head
[20:24] <gregaf> that was one of the lost ones?
[20:24] <lxo> yep
[20:24] <gregaf> huh
[20:24] <lxo> reported as missing. not sure it was actually lost. checking...
[20:25] <gregaf> what's the timeline on when you brought up the third server, when you left, when you noticed issues?
[20:27] <lxo> first thing I did when I got back was to restart the servers and let all pgs go active+clean. then I restarted mdses. then I started looking at the fs and saw stuff was missing
[20:28] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (Quit: Ex-Chat)
[20:28] <wido> sjust: did you make any progress yet?
[20:29] <gregaf> lxo: I meant like you started up the third OSD 8 days ago and left 6 days ago and got back yesterday and saw the problems and think the btrfs issue happened 3 days ago
[20:29] <lxo> now, imagine this: this one directory that still has files listed in that inode appears on a mounted filesystem with only its subdirectories (none of its non-directory children) *but* the files in the subdirectories are still there!
[20:29] <gregaf> I'm trying to figure out when in the sequence that __head object was updated, and when it was declared missing, etc
[20:30] <lxo> it was declared missing today, when i started inspecting the filesystem, after the clean up
[20:31] <lxo> from the logs, it looks like the btrfs issue happened right after I left, on the 14th
[20:31] <gregaf> okay, 6 days ago, the day after the directory was last touched on-disk
[20:31] <lxo> it was last touched on the 13th
[20:32] <gregaf> and it's got all the subdirs but not the leaf children?
[20:32] <lxo> 'zactly
[20:32] <lxo> now, it might be that its parent is what went missing, no?
[20:32] <lxo> temporarily, at least
[20:33] <gregaf> ermm, not sure what you're suggesting but I don't think so
[20:35] <lxo> yeah, the parent was also reported as missing
[20:35] <lxo> and the root (inode 1), too!
[20:35] <lxo> holy cow, this is weird
[20:36] <lxo> it feels like the mds had totally lost touch with reality
[20:36] <sagewk> are you sure this isn't related to the fact that you did a mkdir on all the directories?
[20:37] <sagewk> it would be very strange for the mds to forget all files in a directory but remember all subdirs
[20:38] <lxo> I am, because I didn't recreate any of the files that are present in the subdirs
[20:39] <lxo> and, in fact, the inode is still the same, so this one directory wasn't actually re-created
[20:39] <gregaf> it's lunchtime for me, I'll be back later
[20:42] * st-4157 (foobar@ has joined #ceph
[20:44] * st-4157 (foobar@ Quit ()
[20:45] <lxo> gregaf, enjoy, thanks
[20:47] <lxo> it looks like this one mds1 had been stuck in replay for 2 days! it only started making progress when I brought osds 0 and 2 back up, even though it had all it needed locally
[20:47] <lxo> and just as I brought osds back up, it complained
[20:52] * Yoric (~David@87-231-38-145.rev.numericable.fr) Quit (Quit: Yoric)
[20:57] <lxo> interesting. the parent directory of the disappeared and now-recreated-empty subtree was modified today on the one osd that survived while the two others were down, but the others haven't had it changed since the 13th
[20:59] <lxo> nevertheless, the contents are exactly the same
[21:02] * Ifur (~osm@pb-d-128-141-48-242.cern.ch) has joined #ceph
[21:03] <Ifur> Hello, what is this keys/ceph-type.h that is needed to compile but missing in the source?
[21:10] <Ifur> should it be possible to compile the kernel client on 2.6.34, if not what is the oldest version allowed -- and which method should work :)
[21:16] <gregaf> Ifur: what did you try and build with?
[21:16] <gregaf> I believe the backports branch of ceph-client-standalone is still good, although I'm not sure if it's fully up-to-date at the moment
[21:17] <gregaf> other sources are only expected to build against the latest kernel
[21:17] <sjust> wido: we ended up deciding to fix it as part of a larger overhaul which we are currently working on
[21:17] <Ifur> gregaf: tried ceph-client-standalone for the most part, tried several things, even reverted some changes, but now I'm completely stuck: the keys/ceph-type.h file is nowhere to be found, and browsing the sources it looks like it was added 12 days ago...
[21:18] <gregaf> Ifur: did you try the backports branch of ceph-client-standalone?
[21:18] <Ifur> now I'm compiling ceph-client as is, more or less, to see if I can just insmod the modules from that kernel... :S
[21:18] <Ifur> gregaf: yes
[21:18] <gregaf> hmm
[21:19] <Ifur> gregaf: master and backports actually.
[21:19] <gregaf> the keys stuff was changed very recently and it might have leaked by mistake
[21:19] <Ifur> even upgraded kernel to 2.6.34 to see if it would help.
[21:19] <gregaf> I don't work on the kernel client too much though….sagewk, do you know?
[21:21] <Ifur> gregaf: it does look like it was leaked by mistake from my point of view; tried finding the missing files, but they're nowhere to be found. it's not only the header missing, but there is code in ceph_common.c using get_secret which seems to depend on it.
[21:21] <Ifur> rather not butcher the code and pray for the best, i have enough dirty hacks to deal with as is :P
[21:22] <gregaf> yeah, I'll ask Sage when he's free but the guy who made those changes is out sick today and the other kernel devs are busy atm
[21:23] <Ifur> If I get this test-setup working, I can evaluate and hopefully deploy this soon in the production environment; using Fraunhofer FS for the moment, which is rather stable, but doesn't handle a lot of files and has absolutely no HA features.
[21:28] <Ifur> but btw, the benchmarks i did comparing ceph with fhgfs, showed favorable results for ceph on small I/O.
[21:28] <Ifur> even by using cfuse client... versus fhgfs kernel client.
[21:28] <gregaf> Ifur: drat, it seems the backports stuff is broken right now
[21:29] <gregaf> there were a lot of changes made to the kernel client over the last few weeks and the backported version had fallen pretty far behind so the patches got moved into but we haven't done the work of actually backporting it yet :(
[21:30] <Ifur> is it safe to comment out the get_secrets part of the code?
[21:31] <gregaf> I think it'll take a bit more than that but sagewk could comment better on it
[21:33] <Ifur> will he be around soon?
[21:33] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:33] <gregaf> Ifur: okay, there's an old-backport branch that he just pushed into the repository
[21:33] <gregaf> it doesn't have newer bugfixes but it should build cleanly
[21:35] <gregaf> our team's pretty small so backporting fixes and changes onto older kernel client interfaces is non-trivial for us in terms of time, we'll need to address it at some point but haven't done so yet
[21:36] <Ifur> *nod* i suffer the same issues at work :/
[21:36] <Ifur> whats the old backport branch?
[21:37] <gregaf> he just pushed it, I think it's called "old-backport"
[21:38] <gregaf> *checks* yep
[21:38] <gregaf> it is old though, you might be better off with the fuse client
[21:40] <Ifur> the fuse client crashes a lot when using more than one client accessing a folder.
[21:41] <Ifur> at least that's what I assume is happening (due to missing locking?)
[21:41] <gregaf> oh hmm
[21:42] <gregaf> that's one scenario that isn't well-tested, could you file a bug report about it?
[21:42] <gregaf> is it actually crashing, or just hanging?
[21:42] <Ifur> gregaf: hanging, but doesn't recover.
[21:43] <gregaf> yeah, that's not too surprising
[21:43] <gregaf> it's something that's supposed to work though, so if you file a bug report we can prioritize it and probably get on it in the next couple weeks :)
[21:44] <Ifur> it's getting late here now, can file the bug report tomorrow, would like to find out if this works (there is also an issue of using a kernel patch called bigphysarea due to an FPGA without a buffer).
[21:44] <Ifur> gregaf: sounds excellent!
[21:44] <Ifur> and yeah, small I/O performance was very nice! :)
[21:44] <gregaf> all right, cya and thanks!
[21:45] <Ifur> backwards read on ceph, I got 6MB/s over GbE, FhGFS gives me 10MB/s using IPoIB (mellanox connectx 40Gb/s).
[21:47] <gregaf> wow, that is not what I would have expected
[21:47] <gregaf> I'm not too familiar with FhGFS though
[21:53] <Ifur> gregaf: it's a very simple, featureless performance filesystem, not even possible to add failover on the hardware level, as it attaches the hostname of the respective MDS in the metadata....
[21:53] <Ifur> only edge it has over ceph, is RDMA
[21:53] <Ifur> anyone working on that btw? :P
[21:54] <gregaf> we've had people ask about it, but no
[21:54] <gregaf> I think it doesn't do too badly with IPoIB
[21:54] <gregaf> and it's technically feasible but not something we've done yet
[21:55] <Ifur> within the next month, I've probably done testing on IPoIB with ceph.
[21:55] <gregaf> I know we've had people run it but I can't recall who atm
[21:56] <Ifur> problem with IPoIB, is slightly higher latency, and less throughput. Highest transfer speed I've gotten with IPoIB is ~1,1GB/s with RDMA / native infiniband, the theoretical maximum is 3,2GB/s
[21:57] <Ifur> forgot a period after 1,1GB/s there...
[21:58] <gregaf> wow, I didn't realize it was such a hit in throughput
[21:58] <Ifur> infiniband doesnt do "packets" in the same way
[21:58] <Ifur> on the physical layer, 40Gbs IB is equivalent to 10GbE
[21:59] <Ifur> (also in price, actually 10GbE is more pricey)
[21:59] <Ifur> the nice thing with IPoIB, is that you can get 64KB 'jumbo frames'.
[22:00] <Ifur> MTU:65520 (on ib0)
[22:25] <wido> sjust: Ah, ok. What would you recommend me to do? Just wipe the cluster?
[22:25] <darkfader> Ifur: are you using SDP yet?
[22:26] <sjust> wido: sorry for the delay, unfortunately, it's likely to happen again
[22:26] <sjust> wido: one option would be to remove the offending assert
[22:26] <Ifur> darkfader: No, an issue of having too many toys atm, and don't change it when it works...
[22:26] <sjust> assert(is_replay() && is_active() && !is_crashed());
[22:26] <wido> I'm not sure anymore how I ran into the situation
[22:27] <wido> ok, but that could lead into other crashes, we'll see
[22:27] <wido> I'll give that a try
[22:27] <darkfader> Ifur: i understand that (and i haven't made SDP really work or be understood) but I think it does a lot less performance-crippling
[22:27] <wido> sjust: Is there any more data you want to analyze on the cluster?
[22:27] <Ifur> oh uh, ceph supports SDP?
[22:28] <sjust> wido: removing that assert should simply rehide bugs that have been around a while
[22:28] <sjust> wido: nope, thanks though
[22:28] <darkfader> Ifur: any tcp application can be tunneled with sdp
[22:28] <wido> Ok, I'll remove the assert then and see what happens
[22:28] <darkfader> Ifur: and i think if you're doing real benchmarks it might be worth trying
[22:28] <Ifur> darkfader: might be worth a test then, but first i need to focus a bit on getting things stable.
[22:28] <wido> sjust: are #990, #991, #992 and #996 related to the same bug?
[22:28] <darkfader> :)))
[22:29] <Ifur> I'm ecstatically happy I got allowed to toss out OpenAFS... :)
[22:29] <darkfader> Ifur: welcome to the new millennium
[22:30] <cmccabe> *cmccabe still has a soft spot for AFS from his college days
[22:40] * Juul (~Juul@slim.dhcp.lbl.gov) has joined #ceph
[22:41] <Ifur> cmccabe: hrm... try maintaining a decade-old installation with custom modifications hacked and glued together to work on ubuntu with a buggy Kerberos implementation, you'd get rid of that nostalgia quickly! :)
[22:42] <cmccabe> lfur: heh, sorry to hear that
[22:43] <cmccabe> lfur: I had a friend who hacked on the OpenAFS sources a bit; he said the ifdefs were getting out of control
[22:44] * Juul (~Juul@slim.dhcp.lbl.gov) Quit ()
[22:45] <sjust> wido: not sure about 992, 990 is the same bug, 991 might be
[22:45] <Ifur> cmccabe: openafs is out of control, tape-drive optimized :D
[22:46] <cmccabe> lfur: does openafs actually have any hierarchical file storage / tape drive reading code?
[22:47] <cmccabe> lfur: I know it had some read-only snapshot mechanism that was kind of backup-y, but I don't know if it has anything like Tivoli integrated
[22:50] <Ifur> cmccabe: cern has something they call castor integrated somehow with it.
[22:50] <Ifur> but i was more thinking that its from the tape-drive era.
[22:51] <Ifur> but yeas, security wise and backup (as long as you use AFS only/native) backup its fine.
[22:51] <cmccabe> lfur: well, the tape drive era hasn't completely ended for some organizations :)
[22:51] <Ifur> but you have to struggle to get above 10MB/s no matter what your hardware is.
[22:51] <cmccabe> lfur: although I think most people generally admit that D2D is more cost-effective
[22:52] <cmccabe> lfur: I predict that someday, SSDs will be the new hard drives, and hard drives will be the new tape
[22:52] <gregaf> less durable though, right?
[22:52] <cmccabe> lfur: perhaps the 10MB/s comes from the limitations of the ethernet interfaces?
[22:52] <Ifur> cmccabe: yeah, it also respects Moore's law a bit more. who cares about long-term, other than keeping your data up to date.
[22:52] <cmccabe> lfur: well, actually, that would be ~100MB/s
[22:53] <Ifur> cmccabe: openafs is comparable to all other filesystem below 10MB/s :P
[22:53] <cmccabe> gregaf: both tape and hard drives degrade over time. I haven't really seen much information about what degrades faster
[22:53] <gregaf> I thought that in storage tapes were expected to last tens of decades and drives were expected to last about a decade
[22:53] <Ifur> there is a project here by a guy to get AFS into this millennium, by using GPFS/Lustre etc, as object-storage/backend for OpenAFS....
[22:53] <gregaf> could be wrong, though
[22:54] <Ifur> sounds like even more fun, openafs ontop of Lustre, mhmmm!
[22:54] <cmccabe> gregaf: I would guess tape lasts longer because it doesn't have capacitors to leak and circuits to... do that thing where they stop working
[22:54] <Ifur> can google openafs+OSD
[22:54] <Ifur> gregaf: consider if you have 100 2TB disks, and how much will a disk cost in a decade?
[22:55] <Ifur> as long as you don't buy big, every decade, and constantly upgrade, you're fine!
[22:55] <cmccabe> gregaf: NBTI makes circuits stop working?
[22:55] <gregaf> I think it's more a concern for certain institutions that are expected to archive their data and don't want to touch it ever again once they put it in a storage closet
[22:55] <cmccabe> gregaf: I dunno.
[22:55] <Ifur> and with the latest trend of not being 31337 unless you have data-center replication/redundancy, it gets even more absurd with tapes.
[22:56] <cmccabe> gregaf: yeah, I have heard tales of decades-old tape drives being dusted off
[22:56] <gregaf> cmccabe: oh, I meant drives were less durable than tape, dunno about SSDs
[22:57] <cmccabe> lfur: based on my limited experience, it really does seem like automatic replication is what people would like to move towards
[22:57] <cmccabe> lfur: similar to how Amazon S3 works
[22:57] <Ifur> "archiving" and tapes are a bit too much of a paper/library kind of thinking.
[22:57] <cmccabe> lfur: the question is whether something like S3 can ever be cost-competitive with a tape storage facility
[22:58] <cmccabe> lfur: long-term data storage is kind of a voodoo field because you can't really know if your long-term strategy worked until a decade or two goes past
[22:58] <Ifur> considering that 6 9's aren't enough anymore, and people wanting continent redundancy because of natural disasters, I don't see how tapes can ever be competitive with disks unless a new generation of cheaper tapes comes along.
[22:58] <cmccabe> lfur: and by that point, the people who made the decisions may have moved on to bigger and better things
[22:59] <Ifur> tape, with its intended use, is a bit too much a case of "putting all your eggs in one basket"
[23:00] <Ifur> and even so, someone has to check and maintain it anyway, no guarantees if it gets forgotten.
[23:00] <gregaf> on the other hand you can have tape backups in multiple locations that don't require x amount of power per terabyte of storage
[23:00] <cmccabe> gregaf: true
[23:00] <Ifur> gregaf: who says the disks need to be powered on at all locations at the same time?
[23:01] <gregaf> ah, I suppose, but they need to be hooked up somehow
[23:01] <Ifur> a single atom node, with ~60 disks powered off.
[23:01] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[23:01] <Ifur> then power on once per year
[23:01] <cmccabe> lfur: there was a company out there that was doing that.
[23:02] <iggy> powered but spun down drives
[23:02] <cmccabe> lfur: spinning down and powering off the disks when they weren't in use. They had like a WebDAV interface to the data, and I think an FTP interface too
[23:02] <Ifur> here they have about a petabyte per year storage requirement, and they use tapes!
[23:02] <Ifur> *sigh*
[23:05] <Ifur> but yeah, I honestly think that the industry is starting to think more ahead now in general: that moore's law is dependable, and that your computing/storage infrastructure should respect this.
[23:07] <Ifur> a lot of neat ideas in storage completely move away from traditional disk architecture (more or less like putting btrfs in the disk firmware, and having the disks interface with other disks for replication etc)
[23:08] <cmccabe> lfur: as far as I know, object-based storage is still kind of an exotic idea
[23:08] <cmccabe> lfur: the mainstream industry still only cares about SATA-II and SATA-6gbps and SAS. Maybe a little bit of iSCSI or FC for the enterprise guys
[23:09] <cmccabe> lfur: in general, those are very poor interfaces, but getting everyone to agree on better ones is hard
[23:12] <Ifur> cmccabe: yeah :)
[23:20] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:30] <darkfader> Ifur: err
[23:30] <darkfader> have you ever actually seen a larger backup setup?
[23:30] <darkfader> like 10k+ tapes?
[23:31] <darkfader> people will have disks as cache for that but that's it
[23:31] <darkfader> tape's a lot cheaper at that scale
[23:32] <darkfader> and once you need to put the media *offline* in the tape term you aren't having fun with disks
[23:32] <darkfader> not just turned off, but driven to a hidden vault and such
[23:33] <darkfader> (the disk caches exist for small random stuff restores and because almost no server can feed at the speed a tape needs - thats one thing that will change a lot when you can use ceph :)
[23:38] <Ifur> darkfader: anything gets absurd at scale. but no, haven't seen a 10k+ tape storage setup. But even with tapes, at some point you'd be better off with a completely custom setup, even if it means custom tape/setup, which I assume the largest setups are.
[23:39] <Ifur> but with high-performance tapes, it does get silly again, and those are almost exclusively for banking-type tape backup, where you get tapes doing 2GB/s
[23:40] <Ifur> the backup solution that IBM provides for some of the biggest banks in the world includes a "backup data center in a trailer", where they come running out with backup tapes to get things up again.
[23:40] <darkfader> hehe would be my dream job in some way
[23:40] <Ifur> wouldn't be surprised if that moves over to using lightpeak type connection and SSD in the distant future
[23:41] <darkfader> not day-to-day stuff but run in and bring everything up as fast as possible
[23:41] <Ifur> as long as IBM guarantees that it works, and you don't trip :P
[23:41] <darkfader> hehehe
[23:41] <Ifur> would suck to be responsible for millions of dollars lost per second :)
[23:42] <Ifur> have a friend who is really into p/z systems from IBM, that IS silly.
[23:43] <Ifur> when you configure entire servers in a RAID setup, where even the motherboard is hotswappable.
[23:43] <darkfader> do like.
[23:44] <Ifur> Reg/ECC becomes a bit pathetic in comparison to those systems that not only guarantee no errors on ALL bits, but even have a third cpu comparing the results of the two others...
[23:44] <darkfader> is that the sysplex stuff?
[23:44] <Ifur> mhm
[23:44] <Ifur> p systems or z systems, can't remember
[23:44] <darkfader> probably zseries
[23:44] <darkfader> though I never saw pseries in the last 5 years or so
[23:44] <darkfader> but those features sound serious, so probably zseries
[23:47] <Ifur> the guy teaches z systems, and he goes on and on about it. showed me the admin interfaces the other day, looks like the easiest sysadmin job in the world
[23:47] <darkfader> hehe
[23:47] <Ifur> showed me how to configure a VM with guaranteed cpu clock cycles, and did live growing and shrinking of the root filesystem.
[23:48] <Ifur> take LVM on linux today, and add a decade of development.
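The live grow Ifur describes does map onto stock Linux LVM; a minimal sketch, assuming a hypothetical volume group "vg0" with free extents and an ext4 logical volume "root" (names are illustrative, not from the log):

```shell
# Hypothetical example: online-growing a root filesystem with LVM on Linux.
# Assumes VG "vg0" with free space and a mounted ext4 LV "root"; run as root.
lvextend --size +10G /dev/vg0/root   # grow the logical volume by 10 GiB
resize2fs /dev/vg0/root              # ext4 grows online while still mounted
# Shrinking is the asymmetric case: ext4 cannot shrink while mounted, so the
# zSeries-style live shrink has no direct equivalent here.
```

With `lvextend -r` (`--resizefs`) the two steps collapse into one, as lvextend then invokes the filesystem resize itself.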
[23:48] <darkfader> a friend of mine almost got fired (~2002) when he ran seti on a zseries linux vm
[23:48] <darkfader> it ate up the scheduler
[23:49] <darkfader> Ifur: linux lvm is a very sad ripoff
[23:49] <darkfader> i think the sistina guys didn't really understand lvm
[23:49] <darkfader> oh well
[23:50] <darkfader> but the zseries costs more than a few nice houses
[23:50] <darkfader> so lets stick to lvm :)
[23:50] <Ifur> mhm, want me to go weep in a corner over the fact that banking/economy has all the bleeding edge...
[23:51] <darkfader> haha
[23:51] <darkfader> no
[23:51] <Ifur> hm?
[23:51] <darkfader> just think of how many finance sysadmins wait months to get a new server through their "processes"
[23:51] <darkfader> and it will stop the weeping
[23:52] <Ifur> i have to wait months! (paper mill, first, get bids from tenders, delivery time, etc etc...)
[23:53] <darkfader> erm. hmm :)
[23:53] <Ifur> a lot of rules and regulation in europe (open competition translates to you suffering so others can get a fair chance).
[23:54] <Ifur> over a certain sum, i think a million USD, you have to do an open call for bids across the entire european union (you do send invitations, but anyone who wants to can join the competition)
[23:55] <Ifur> luckily you get to set/make the so-called 'objective criteria for evaluation'.
[23:56] * MarkN (~nathan@ has left #ceph
[23:57] <darkfader> so you have to buy from the guy who matches the criteria best, even if you got a bad feeling?
[23:57] <darkfader> plus wait for them to send in
[23:57] <darkfader> i see
[23:57] <Ifur> its generally a good process, just doesn't work well for IT.
[23:59] <darkfader> back to ceph... :> what OS are you using for the ipoib tests?

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.