#ceph IRC Log


IRC Log for 2011-02-15

Timestamps are in GMT/BST.

[0:19] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:20] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[0:37] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[0:51] <Tv> heh, the way the thesis paper describes recovering open files after an MDS crash sounds very much like what we talked about recently with NFSv4 "client knows the pathname"
[0:55] <bchrisman> would be nice… but looks like linux nfsv4 clients don't implement that stuff…
[0:56] <bchrisman> maybe they will in the future.
[1:10] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[1:10] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[1:11] <gregaf> johnl: did you have any OSD logging on when your MDS first crashed?
[1:22] <johnl> gregaf: no
[1:22] <gregaf> bummer
[1:22] <johnl> :/
[1:22] <gregaf> it looks like one block in the journal got deleted (or else somehow skipped, but I don't think that can happen)
[1:23] <gregaf> but the on-disk header didn't get updated
[1:23] <gregaf> that shouldn't be able to happen though
[1:23] <johnl> hrm
[1:23] <gregaf> so maybe there's a mismatch in how they commit and get replayed
[1:23] <gregaf> I'll need to check
[1:23] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[1:23] <johnl> ok. need anything else off that cluster?
[1:24] <gregaf> nope
[1:24] <gregaf> gonna kill it?
[1:25] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[1:28] <johnl> wouldn't mind
[1:28] <johnl> sleep first tho
[1:29] <johnl> nnight!
[1:29] <gregaf> night!
[1:29] <gregaf> oh wow, somehow expire < trim
[1:29] <gregaf> it should always be larger
[1:30] <johnl> ?
[1:32] <johnl> sorry, knackered. need sleep.
[1:35] <gregaf> in the on-disk journal header — no reason you'd know what it was :)
[2:12] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:29] * bless_ (~bless@ has joined #ceph
[2:29] <bless_> hi
[2:30] <bless_> anyone reading the ceph code right now?
[2:31] <cmccabe> bless_: yes, why?
[2:33] <bless_> do you know its architecture?
[2:35] <cmccabe> bless_: somewhat
[2:39] <bless_> what is wrong with your ceph?
[2:39] <gregaf> ?
[2:40] <cmccabe> bless_: I'm confused
[2:44] <bless_> haven't you run into any errors with ceph?
[2:44] <gregaf> there are many possible errors to run into
[2:44] <gregaf> some of them are user error and some of them are programming issues
[2:45] <gregaf> but it certainly isn't plug-and-play so if you have an issue you're going to need to give us more than that
[2:47] <bless_> i am reading the code, so if you have any error with ceph, i am happy to help deal with bugs
[2:48] <bless_> now i want to find some people who are analyzing the code, so we can talk about it
[2:48] <bless_> thanks
[2:48] <gregaf> ah
[2:49] <bchrisman> heh
[2:50] <gregaf> well if you'd like a higher-level overview there are some papers on the ceph website that discuss the algorithms in play
[2:50] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:50] <gregaf> there are 7 or so of us here at Dreamhost working on it who hang out in the channel, and there are some guys from Tcloud working on it who do their own thing
[2:50] <gregaf> plus we're picking up a few others like bchrisman who just left, from other companies
[2:52] <bless_> code architecture is first to me,
[2:56] * cmccabe (~cmccabe@ has left #ceph
[3:00] <bless_> how can i find whether monitor is ok?
[3:00] <gregaf> have you looked through the wiki at all?
[3:01] <bless_> yes!!!
[3:02] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[3:03] <bless_> monitor selection is random, so i do not know which is good and which is bad! with ceph osd dump -o -, i can find every osd's status
[3:11] <gregaf> ceph -s or ceph -w will tell you the basic status of things
[3:12] <gregaf> including the status of the monitors
[3:12] <bless_> i know ceph -s/-w, but how can i find out which monitor is good and which is bad?
[3:13] <gregaf> I believe that after a timeout period (once the system knows a monitor is bad) it will tell you that it's bad
[3:14] <gregaf> but unlike OSDs and MDSes the system can't boot out dead monitors, so you may need to set up your own way of checking them
[3:14] <gregaf> anyway, I have to go now, sorry!
[3:14] <bless_> mon e2: 3 mons at {0=,1=,2=} . this msg just tells me how many mons, but not each one's status?
[3:19] <bless_> how to find out which monitor is leader?
[3:24] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[3:25] <gregaf> hmm, yeah, I guess it doesn't report mon status right now
[3:25] <gregaf> which monitor is the leader isn't important and is not exposed
[3:35] <bless_> thanks
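gregaf's "set up your own way of checking them" could start as simple TCP probes against each monitor address. A minimal sketch; the 6789 port default, the timeout, and the helper names are all assumptions for illustration, and a successful connect only shows a process is listening, not that it is a healthy quorum member:

```python
import socket

def mon_reachable(host, port=6789, timeout=2.0):
    """True if something accepts TCP connections at host:port."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:  # refused, timed out, unreachable, ...
        return False

def check_mons(addrs):
    """addrs: iterable of (host, port) pairs -> {(host, port): bool}."""
    return {addr: mon_reachable(*addr) for addr in addrs}
```

Run from cron and alert on any False value, for instance; this works around the missing per-mon status in ceph -s that bless_ ran into.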
[4:02] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[4:03] * bless_ (~bless@ has left #ceph
[4:04] * bless_ (~bless@ has joined #ceph
[4:04] <bless_> someone familiar with ceph.ko?
[6:11] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) has joined #ceph
[6:27] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) has joined #ceph
[6:33] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[7:10] * DJLee (82d8d198@ircip3.mibbit.com) has joined #ceph
[7:10] <DJLee> anyone know tools that stress the mds node? like dir creation?
[7:22] <prometheanfire> general FS stress testing tools would work (I don't know of any, but will be testing eventually)
[7:25] <DJLee> well i was doing those iop tools, and saw osd nodes getting burned, but the mds hardly does any work.
[7:25] <DJLee> maybe less than 10kb/s
[7:25] <prometheanfire> I'd say that's normal, just metadata, but you are looking for something to test the mds?
[7:26] <DJLee> yeah, a bit more towards the mds, heh;
[7:28] <prometheanfire> I'd hack a script if I were you :D
[7:28] <DJLee> which script?
[7:29] <prometheanfire> create folders aa-zz^10
[7:29] <DJLee> lol, yeah
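prometheanfire's aa-zz idea is easy to hack up. A sketch along those lines (the function name is made up, and the target root would be whatever ceph client mount point you want to hammer):

```python
import itertools
import os
import string

def make_dirs(root, depth=2, letters=string.ascii_lowercase):
    """Create one directory per two-letter combination (aa..zz) under
    root, recursing `depth` levels; each mkdir is a metadata op that
    lands on the MDS rather than the data OSDs. Returns dirs created."""
    count = 0
    def recurse(base, level):
        nonlocal count
        if level == 0:
            return
        for a, b in itertools.product(letters, repeat=2):
            path = os.path.join(base, a + b)
            os.mkdir(path)
            count += 1
            recurse(path, level - 1)
    recurse(root, depth)
    return count

# e.g. make_dirs("/mnt/ceph/stress", depth=2) issues 676 + 676**2 mkdirs
```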
[7:29] <DJLee> btw, the mds doesn't seem to use any storage, at least per the config, so where's its data stored, if anywhere?
[7:30] <DJLee> i'm guessing it's in the cache, but then how/where are the 2x, 3x replicas stored for the metadata..? ;;
[7:31] <prometheanfire> I think on the ods, but I forget
[7:31] <prometheanfire> no, that can't be right
[7:31] <prometheanfire> I think it is stored separately for each mds instance
[7:48] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:22] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[8:27] * andret (~andre@pcandre.nine.ch) has joined #ceph
[8:46] * allsystemsarego (~allsystem@ has joined #ceph
[8:48] <bless_> STATE_STARTING,STATE_LEADER,STATE_PEON what is the meaning of the state?
[8:59] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:07] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) Quit (Remote host closed the connection)
[9:07] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) has joined #ceph
[9:08] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) Quit ()
[9:08] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) has joined #ceph
[9:39] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[9:45] * ao (~ao@ has joined #ceph
[10:15] <bless_> how to get mdsmap?
[10:22] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) Quit (Remote host closed the connection)
[10:27] * Yoric (~David@ has joined #ceph
[10:40] * bless_ (~bless@ Quit (Ping timeout: 480 seconds)
[12:56] * bbigras__ (quasselcor@ Quit (Remote host closed the connection)
[12:58] * bbigras (quasselcor@bas11-montreal02-1128535712.dsl.bell.ca) has joined #ceph
[12:59] * bbigras is now known as Guest1424
[15:44] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[15:44] <greglap> DJLee: prometheanfire:the MDS stores all its data on the OSDs
[15:45] <greglap> there's a "metadata" pool in RADOS that holds an object per directory as well as each MDS journal
[15:46] <greglap> and just plain old inode creation should be capable of burning up the MDS as long as it's fast enough
[15:47] <greglap> bless_: STATE_[STARTING|LEADER|PEON] are the monitor states, they get STATE_STARTING during startup (and maybe elections?) and then either LEADER or PEON depending on the election results
[15:47] <greglap> oh, bless is gone, n/m that then
[15:50] <prometheanfire> greglap: if you know any tool I'd be grateful, I hope to test eventually...
[16:22] * ao (~ao@ Quit (Quit: Leaving)
[16:35] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[16:47] * greglap (~Adium@ has joined #ceph
[16:48] <greglap> prometheanfire: I usually just use shell scripts
[16:53] <prometheanfire> ok
[17:35] * greglap (~Adium@ Quit (Quit: Leaving.)
[17:40] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[17:57] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:07] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:43] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:58] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:06] * Yoric (~David@ Quit (Quit: Yoric)
[19:15] * Jiaju (~jjzhang@ Quit (Ping timeout: 480 seconds)
[19:20] * Jiaju (~jjzhang@ has joined #ceph
[19:24] <gregaf> Tv: sjust: joshd: let's push our meeting today back to 11:00 since Colin isn't in yet
[19:24] <sjust> ok
[19:24] <Tv> gregaf: ok
[19:24] <joshd> ok
[19:27] * cmccabe (~cmccabe@ has joined #ceph
[19:27] <gregaf> oh, there he is...
[19:38] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[19:47] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[19:54] <Tv> the autotest stuff: http://ceph.newdream.net/git/?p=ceph-autotests.git;a=blob;f=README.rst;h=1ca87465ca4ca12f63d3668e8cc2a10af85129f5;hb=HEAD
[19:54] <Tv> clone that repo
[19:54] <cmccabe> k
[19:54] <cmccabe> um, it's http?
[19:54] <cmccabe> should be ssh for us right
[19:55] <Tv> it's hosted just like ceph.git
[19:55] <cmccabe> cmccabe@metropolis:~/src$ git clone ssh://ceph.newdream.net/git/autotests.git
[19:55] <cmccabe> Cloning into autotests...
[19:55] <cmccabe> fatal: '/git/autotests.git' does not appear to be a git repository
[19:55] <cmccabe> fatal: The remote end hung up unexpectedly
[19:55] <Tv> read ;)
[19:55] <Tv> what's the project name?
[19:56] <cmccabe> oh, probably ceph-autotests
[19:57] <Tv> time to put in some cloneurls into gitweb, i guess..
[19:58] <Tv> there
[19:59] <cmccabe> looks like this requires libxml2
[19:59] <Tv> you don't need to run anything
[19:59] <cmccabe> I ran bootstrap
[19:59] <Tv> you don't need to
[20:00] <Tv> read the readme
[20:00] <cmccabe> what's the function of bootstrap
[20:00] <cmccabe> it's not mentioned in the README as far as I can see
[20:00] <Tv> yeah, so you don't care about it
[20:01] <Tv> if this is too confusing, i'll have to split the repo into two, so you don't see the parts you're not supposed to care about
[20:01] <cmccabe> just out of curiosity, what is its function
[20:03] <Tv> teuthology.web
[20:03] <cmccabe> which is a web server doing the front end stuff then
[20:03] <Tv> no, serving the test tarballs
[20:04] <cmccabe> I'm looking at the diagram now
[20:06] <cmccabe> how do I get a login on http://autotest.ceph.newdream.net/
[20:07] <cmccabe> never mind, I think you sent one yesterday.
[20:08] <cmccabe> well, if it's all right with you, I'm going to try starting a test
[20:09] <Tv> i'll be surprised if it breaks anything
[20:09] <gregaf> the whole point of a lot of it is it doesn't need to be all right with anybody
[20:09] <gregaf> right?
[20:10] <cmccabe> yeah.
[20:10] <Tv> yes to locking not asking humans, if that's what you mean
[20:10] <Tv> and the scheduler will just automatically lock hosts for the tests etc
[20:10] <gregaf> yeah
[20:13] <cmccabe> I started an fsx test
[20:13] <cmccabe> it seems like all the standard autotest tests are available and it seemed easy
[20:14] <cmccabe> I guess for ceph tests, we'd be using "edit control file"?
[20:14] <Tv> yeah, at least for now
[20:15] <cmccabe> this might be described somewhere in the README, but just in case it isn't....
[20:15] <cmccabe> how do we get things on http://ceph.newdream.net:8116/tarball/
[20:16] <cmccabe> so that's teuthology.web, right
[20:16] <Tv> yeah, doublechecking the readme
[20:16] <gregaf> the great thing about READMEs is that you can read them before you take up time asking questions
[20:16] <Tv> like 69
[20:16] <Tv> yeah, it's so awesome
[20:16] <Tv> s/like/line/
[20:17] <gregaf> and by consulting the README first and then asking if you still don't know, you help the author improve their README!
[20:17] <cmccabe> it says "They are served via ``teuthology.web``, from the files in the ``tests/`` directory of the source tree."
[20:17] <cmccabe> whose source tree?
[20:17] <cmccabe> mine, yours, santa claus's?
[20:18] <Tv> cmccabe: if you're reading a readme in a source tree, what does "the source tree" refer to?
[20:18] <cmccabe> are you implying I should run my own instance of teuthology.web
[20:18] <Tv> no
[20:18] <Tv> read the next paragraph, dammit
[20:19] <Tv> look, i'm actually interested in knowing what the readme leaves ambiguous
[20:19] <Tv> but you're asking sentence by sentence
[20:19] <Tv> read the damn thing, if necessary read it again, then give me a list of things that are unclear
[20:19] <cmccabe> so, I think I get this now.
[20:19] <Tv> oh and don't be afraid of looking at the source tree -- you're a programmer, after all
[20:20] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:20] <cmccabe> because the git repo is hosted on ceph.newdream.net, and because we can push to the git repo, pushing to the git repo is the way to make tests available to autotest
[20:20] <cmccabe> It's in general a well-written README, and thanks for putting it together
[20:21] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) has joined #ceph
[20:21] <cmccabe> however the concept of pushing to a repo in order to test is not obvious
[20:21] <cmccabe> I feel you should introduce that concept a little earlier in the file
[20:22] <Tv> you can run your own teuthology.web if you want
[20:22] <Tv> or use any url to .tar.bz2
[20:22] <cmccabe> but overall, creating a branch on ceph.newdream.net is probably easiest
[20:23] <cmccabe> I do appreciate the flexibility, it's great
[20:27] <cmccabe> so, first test seems to have gone pretty well
[20:28] <cmccabe> but that's just standard fsx
[20:28] <cmccabe> I gave my email address to the form, but I didn't get any mail. Do we only get mail on failure?
[20:28] <Tv> never tried that feature
[20:29] <Tv> there's nothing installed on the box to send them
[20:30] <cmccabe> well, we could always install msmtp or something
[20:30] <Tv> yeah need to test whether mail.hq likes it or not
[20:30] <Tv> 554 5.7.1 <tommi.virtanen@dreamhost.com>: Relay access denied
[20:31] <Tv> no can do right now
[20:31] <cmccabe> so I think most tests would like to return error status if a daemon dies
[20:31] <cmccabe> can we somehow add that to teuthology
[20:31] <Tv> that and many other things
[20:32] <Tv> i'm already checking whether cfuse died too early
[20:32] <cmccabe> indeed. the unattainable is... unknown!
[20:32] <Tv> but not sure how to generalize things, quite yet -- need to have more dumb & simple tests, to see the patterns emerge
[20:32] <cmccabe> sorry... obscure reference
[20:33] <cmccabe> I feel like it would be best if the status monitoring lived at the second tier of autotest, the cluster management level
[20:33] <cmccabe> rather than in the part that runs on individual machines.
[20:33] <cmccabe> I forget, do those layers have well-agreed-on names?
[20:33] <Tv> don't get too far ahead of yourself -- it doesn't even have any multi-machine tests yet
[20:34] <Tv> right now, a simple health check after an otherwise successful run will tell you whether any daemons died
[20:34] <Tv> and that's plenty good enough
[20:35] <cmccabe> I'm pretty familiar with the show health code, I did write it after all
[20:35] <cmccabe> the problem is that there is kind of a tradeoff between performance and accuracy
[20:36] <cmccabe> like if all OSDs die at once, no more PGStatsUpdates will be sent to the monitors
[20:36] <cmccabe> then after 10 minutes, the monitors will time out the OSDs for failure to update status
[20:36] <cmccabe> but 10 minutes is a rather long time
[20:36] <Tv> bah, we have NO TESTS right now
[20:36] <cmccabe> anyway, these are issues we can work through later.
[20:36] <Tv> 10 minutes is perfectly fine
[20:37] <cmccabe> my original tests didn't check for daemon death explicitly either
[20:37] <cmccabe> actually, some of them killed daemons deliberately
[20:39] <cmccabe> Anyway, looks good. I'll try some other stuff in a bit... my network connection has finally been repaired and I can finish what I was doing earlier.
[21:04] <Tv> ok this is probably a dumb question, but i'm hoping to reach the point where i see why it's dumb...
[21:04] <Tv> why does ceph use PGs?
[21:04] <Tv> why not go straight from original object id to osds that store it?
[21:05] <cmccabe> tv: crudely put, placement groups are a kind of sharding
[21:06] <cmccabe> tv: the PG that an object is in determines where it will be placed
[21:06] <Tv> yeah but it's object_id ---deterministic--> pg --deterministic--> osds
[21:06] <Tv> why not object_id ---deterministic--> osds
[21:07] <cmccabe> tv: I think the answer might lie in the workings of crush
[21:08] <Tv> i've been reading that, hence the question ;)
[21:10] <cmccabe> tv: partly PGs are a management thing
[21:10] <cmccabe> tv: like the OSD which is primary for a PG is responsible for managing replication for the objects in that PG
[21:11] <cmccabe> tv: I guess you could imagine alternate ways to manage that of course
[21:11] <Tv> ahh so it's a way of assigning special roles wrt that subset of the data
[21:11] <Tv> still not seeing the big picture.. i guess i'll pester the mailing list
[21:11] <cmccabe> tv: that's definitely one thing that it is. Each PG has a set of OSDs that it maps to, and one "primary" osd
[21:12] <cmccabe> the primary OSD does all the writes, handles scrub and recovery
[21:12] <Tv> but that doesn't *have* to be by pg, it could be by object, etc
[21:13] <Tv> i can see how it's used, not why
[21:13] <cmccabe> yeah, I'm curious what sage or greg would have to say about this
[21:15] <cmccabe> a system without PGs would struggle to keep up with a changing cluster topology I think
[21:15] <cmccabe> in our system, when an OSD goes down, the OSDMap is updated and sent out to everyone
[21:16] <cmccabe> then clients know that their requests for objects in PG XYZ must be sent to set {o1,o2,o3} rather than {o4,o2,o3}.
[21:17] <Tv> but afaik osdmap is just a "program" executed with CRUSH
[21:17] <Tv> i don't see how the internal structure of it changes its use
[21:17] <cmccabe> actually I guess the relevant concept here is PGMap
[21:17] <cmccabe> the PGMap is altered by changes in cluster topology
[21:20] <gregaf> I'm not so great with CRUSH, actually — it's just a magic box to me
[21:20] <gregaf> but part of it is to scalably handle the data
[21:21] <gregaf> when the primary for a PG changes, it's notified by the previous OSDs that held that PG
[21:21] <gregaf> PGs limit the amount of data each OSD needs to process on every new map generation
[21:21] <Tv> i think i have a good grasp of CRUSH (on a whiteboard level; not the code)
[21:21] <gregaf> whereas if it was per-object placement then they'd need to handle updates per-object
[21:21] <Tv> but the "limit the amount of data" might be it
[21:22] <cmccabe> gregaf: map = PGMap?
[21:22] <gregaf> OSDMap
[21:22] <sjust> PGs also provide a fundamental unit of recovery of logging
[21:22] <sjust> otherwise you would need to maintain a log for and manage recovery on an object by object basis
[21:23] <sjust> *fundamental unit of recovery and logging
[21:23] <Tv> sjust: that just needs some way of grouping, it doesn't seem to require specific placement
[21:23] <cmccabe> technically you still have to move around the same amount of data when the OSDMap changes.
[21:23] <sjust> the objects in the recovery/logging unit would always need to be on the same node
[21:23] <cmccabe> I'm not sure about metadata though
[21:24] <Tv> sjust: hmm ok i haven't read up yet on osd data migration or recovery
[21:24] <Tv> hopefully that'll be illuminating
[21:24] <sjust> Tv: it's true that it's a method of grouping, but the grouping needs to remain the same through cluster changes
[21:24] <gregaf> I think that's basically it
[21:24] <gregaf> coordination is expensive and PGs provide a way to place bounds on the required coordination
[21:25] <gregaf> it doesn't matter much in our small test clusters, but in a large cluster of a few thousand nodes then an object->OSD mapping would basically require every OSD to be a peer to every other OSD
[21:25] <gregaf> which would get expensive
[21:25] <gregaf> whereas with object->PG->OSD then the number of peers is bounded by the number of PGs/OSD
[21:25] <gregaf> (that's PGs per OSD)
[21:25] <Tv> what got me interested was how much of the first 140 pages of the thesis is about "declustering" things; and then it just says CRUSH(rule_nrep, pgid) -> (osd1, osd2, osd3) -- that clumps a bunch of things back up
[21:25] <gregaf> which according to old experiments ideally stands at ~100
[21:26] <cmccabe> gregaf: so the PG->OSD mapping is encoded in the PGMap
[21:26] <cmccabe> gregaf: which is generated by crush originally, but which is explicitly created and passed around later
[21:26] <gregaf> errm, not sure
[21:26] <gregaf> but I think it's another CRUSH mapping from PG to OSD
[21:26] <Tv> cmccabe: PG->OSD is done with CRUSH, and thus sounds like osdmap
[21:26] <sjust> cmccabe: I think the osd's recompute the placement when necessary
[21:26] <gregaf> the PGMap handles overrides
[21:26] <Tv> object_id->pg is just a modulo
[21:26] <gregaf> and some kind of history
[21:27] <Tv> i have no clue about overrides ;)
[21:27] <Tv> the thesis doesn't speak much about migration
[21:27] <gregaf> I think it's discussed briefly when it talks about failure recovery
[21:27] <Tv> yeah not quite there yet i guess
[21:29] <Tv> "Placement groups provide a means of controlling the level of replication declustering. That is, instead of an OSD sharing all of its replicas with one or more devices (mirroring), or sharing each object with different device(s) (complete declustering), the number of replication peers is related to the number of PGs μ it stores—typically on the order of 100 in the current system."
[21:29] <Tv> seems that the "limit overhead" explanation is the right one
[21:31] <cmccabe> tv: to complete the explanation, you have to show that limiting the number of replication peers is better than not limiting
[21:31] <Tv> yeah, no details yet, reading..
[21:31] <cmccabe> tv: perhaps if the number of replication peers was O(num_osds), doing the equivalent of a PGLog merge would not scale.
[21:32] <cmccabe> tv: also I guess on some level, having 100 sockets open instead of 1000 is probably more resource-efficient.
[21:33] <Tv> sockets are surprisingly cheap when idle
[21:33] <cmccabe> tv: there is a fixed overhead per packet you send (effectively, there's no point in sending less than 1500 bytes)
[21:33] <gregaf> cmccabe: think about johnl's problems and your experiments with PG scaling issues...
[21:34] <cmccabe> gregaf: it is true that having many PGs in the current system doesn't scale too well
[21:34] <cmccabe> gregaf: however we have identified some non-algorithmic reasons for that (like the hashmaps you saw)
[21:34] <gregaf> yeah
[21:35] <gregaf> but my point is that if we weren't using PGs then we would see those issues proportional to the number of OSDs
[21:35] <gregaf> rather than the number of PGs
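The object->PG->OSD structure the channel is describing can be put into a toy model (all names here are made up, and plain md5 hashing stands in for CRUSH, which instead places PGs pseudo-randomly while respecting the cluster hierarchy). The structural point survives the simplification: the object id determines a PG, the PG determines the OSD set, so an OSD's replication peers are bounded by the PGs it holds rather than by the number of objects:

```python
import hashlib

def stable_hash(s):
    """Deterministic across runs, unlike Python's builtin hash()."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def object_to_pg(object_id, pg_num):
    """Stage 1: object -> placement group ("just a modulo")."""
    return stable_hash(object_id) % pg_num

def pg_to_osds(pg, osd_ids, replicas=2):
    """Stage 2: PG -> ordered OSD set (first entry acting as primary).
    A stand-in for CRUSH; only determinism matters for this sketch."""
    start = stable_hash("pg-%d" % pg) % len(osd_ids)
    return [osd_ids[(start + i) % len(osd_ids)] for i in range(replicas)]

def locate(object_id, pg_num, osd_ids, replicas=2):
    pg = object_to_pg(object_id, pg_num)
    return pg, pg_to_osds(pg, osd_ids, replicas)
```

Every object in the same PG shares one OSD set, so with ~100 PGs per OSD the peer count stays around 100 no matter how many objects exist; a direct object->OSD hash would instead push each OSD toward peering with the whole cluster.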
[21:35] <cmccabe> one thing that I'd like to figure out better is pools
[21:35] <Tv> well, on the other hand the whole PG mechanism wouldn't exist ;)
[21:35] <cmccabe> I didn't get a very good sense of how they worked from the thesis... I think it's been revised
[21:35] <gregaf> pools didn't exist when he wrote that :D
[21:36] <cmccabe> heh
[21:39] <gregaf> and there's your email from Sage :)
[21:39] <Tv> yeah
[21:40] <Tv> the double failure thing smells like a weak argument
[21:40] <gregaf> ?
[21:40] <Tv> multiple failure without pgs: high probability of losing small amounts of data
[21:40] <Tv> multiple failure with pgs: low probability of losing large amounts of data
[21:41] <Tv> that's my first reading of it
[21:41] <Tv> a*b is roughly the same for both, but one is regrettable and the other is a disaster
[21:41] <gregaf> ah
[21:42] <gregaf> well, as he says the math works out as a wash at 2x and a win at higher replication levels ;)
[21:42] <Tv> defending against multiple failures is a losing game anyway; just up your replication level until you're happy
[21:42] <gregaf> the other thing is that, really, once you lose data it's pretty much a disaster
[21:42] <gregaf> I'm no netop guy but I'd think it'd be cheaper to handle the occasional long restore-from-backup than the constant small restore-from-backup?
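The "wash at 2x" claim gregaf relays can be checked with back-of-envelope arithmetic. A toy model under uniform random-placement assumptions; the function and its parameters are inventions for illustration:

```python
def expected_loss_2x(num_osds, pgs_per_osd=None):
    """Expected fraction of data lost when exactly two OSDs fail at
    once, under 2x replication with uniform random placement.
    pgs_per_osd=None models complete per-object declustering.
    Returns (p_any_loss, fraction_lost_given_loss, expected_fraction)."""
    n = num_osds
    if pgs_per_osd is None:
        # every OSD pair shares some objects: loss is near-certain,
        # but only the sliver both disks happened to hold is gone
        p_loss = 1.0
        frac_if_loss = 2.0 / (n * (n - 1))
    else:
        mu = pgs_per_osd
        # loss only if the failed pair shares a PG (prob ~ mu/(n-1))...
        p_loss = mu / (n - 1.0)
        # ...but then a whole PG is gone: 1/total_pgs of the data
        total_pgs = n * mu / 2.0
        frac_if_loss = 1.0 / total_pgs
    return p_loss, frac_if_loss, p_loss * frac_if_loss
```

Both expected values reduce to 2 / (n * (n - 1)), which is Tv's a*b point: PGs trade frequent small losses for rare large ones without changing the expectation at 2x, and the claimed win only appears at higher replication levels.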
[21:43] <cmccabe> there was a google paper recently where they said, roughly, that failures tended to be clustered in time and space
[21:43] <Tv> having been in the webby space a lot, that space considers restore from backup almost as bad as total loss
[21:43] <cmccabe> I'm not sure how general their results were. They seemed to be based on some pretty heavy assumptions related to the google environment
[21:44] <Tv> well clustered in time & space makes sense for datacenter setups, cooling etc
[21:45] <cmccabe> tv: they developed some extremely elaborate statistical model and found that it matched their actual results fairly well
[21:46] <cmccabe> tv: I read the paper expecting a discussion of systems software, but it was almost like... a statistics paper.
[21:47] <cmccabe> tv: similar to calculating insurance rates in a population of 25-35 year old males, or whatever I guess
[21:47] <Tv> hehe, large scale storage = insurance against data loss
[21:48] <cmccabe> http://research.google.com/pubs/pub36737.html
[21:48] <Tv> i recall seeing an insurance statistics paper that said societies where pedestrians have lots of small accidents with cars are safer
[21:48] <gregaf> Tv: partly — a lot of Ceph's basic research was originally done as research into archival storage
[21:49] <Tv> their explanation was that the drivers are traumatized for a while, and thus drive slower, and don't actually kill people as much
[21:50] <cmccabe> tv: there was a movement to take out traffic lights and stop signs from certain pedestrian crossings
[21:50] <cmccabe> tv: on the theory that without those things, drives paid more attention to the road
[21:50] <Tv> a lot of roads in .fi are intentionally painted narrower than they have to be
[21:50] <Tv> to make you scared
[21:51] <cmccabe> tv: narrow lanes are so annoying
[21:51] <cmccabe> tv: or do you just mean one lane roads?
[21:51] <Tv> one lane
[21:51] <Tv> per direction, that is
[21:52] <cmccabe> in pittsburgh there are a lot of roads with no lane markings. They have space about 1.5 normal lane widths
[21:53] <cmccabe> it really leads to a teeth-grinding experience when someone tries to creep up in your blind spot on a road like that
[21:53] <Tv> there's a .fi joke about that
[21:53] <cmccabe> the other fun thing was on-ramps with stop signs at the end before the merge.
[21:53] <cmccabe> there oughta be a law....
[21:53] <Tv> driving in 1-foot deep snow, and seeing only 3 deep tire tracks in the snow
[21:54] <Tv> so you have 2*0.666.. lanes for a road that's normally 1 lane per direction
[21:54] <Tv> makes you kinda more alert
[21:55] <cmccabe> yeah, that sounds unpleasant
[21:55] <cmccabe> especially if you don't know which tire tracks lie on the actual pavement
[23:19] * greglap (~Adium@ has joined #ceph
[23:43] * greglap (~Adium@ Quit (Quit: Leaving.)
[23:44] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[23:56] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.