#ceph IRC Log


IRC Log for 2011-04-28

Timestamps are in GMT/BST.

[0:09] * MarkN (~nathan@ has left #ceph
[0:26] * joshd (~jdurgin@ Quit (Ping timeout: 480 seconds)
[0:40] <yehudasa> Tv: added code to FileStore.cc that uses MD5 (for now), but for some reason ceph::crypto::MD5 crashes in the MD5 constructor
[0:40] <Tv> yehudasa: funky
[0:40] <yehudasa> PK11_CreateDigestContext() returns NULL
[0:40] <Tv> yehudasa: can you write a standalone app with the same crash?
[0:40] <yehudasa> the ceph::crypto::init() is being called in common_init()
[0:41] <yehudasa> Tv: I can try
[0:45] <Tv> getting null out of that does seem like init was not done
[1:01] <Tv> 650MB log from mds after gzip -9 :(
[1:01] <Tv> we need to visit what's a realistic log level
[1:01] <Tv> even copying this data around takes too long
[1:04] <gregaf> any discussion about that will include adjusting our current logging levels
[1:04] <gregaf> we really only use 1,5,10,20
[1:04] <Tv> well the tests are explicitly kinda verbose
[1:04] <gregaf> and a lot of them are in areas of code that are now pretty safe
[1:05] <Tv> bumping them down from 20 and then you can bump them up when you want to seems probably better
[1:05] <Tv> i just don't have that sort of knowledge of the code yet
[1:06] <gregaf> yeah
[1:06] <gregaf> recompiling is a bitch, though...
[1:07] <Tv> oh i mean the "debug foo = 20" in ceph.conf
[1:07] <Tv> no recompiles
[1:07] <gregaf> yeah, that can help too
[1:08] <gregaf> it would be nice though if we had some way of turning down less important outputs as time goes by
[1:08] <gregaf> or tuning the outputs a little more precisely
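[editor's note: a concrete example of the ceph.conf knob Tv mentions; the subsystem name and levels here are illustrative, not a recommendation]

```
[mds]
        debug mds = 1      ; quiet default for routine test runs
        ;debug mds = 20    ; maximum verbosity, only while chasing a bug
```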
[1:09] <Tv> so this is a 9GB mon.0 log from a test that took ~20min and succeeded 100%
[1:09] <Tv> that's.. just.. not optimal
[1:11] <yehudasa> Tv: doesn't reproduce easily. My guess is that it only happens if the call is from one of our static libraries
[1:11] <yehudasa> Tv: so the init is being called in the main, and there might be multiple instances of libnss or something like that
[1:12] <Tv> yehudasa: multiple instances of libnss sounds.. wrong
[1:13] <yehudasa> Tv: sounds wrong as 'this shouldn't happen, let's fix that!' or sounds wrong as 'wtf are you talking about'?
[1:13] <Tv> yehudasa: as in how does that even happen to a C library
[1:13] <yehudasa> Tv: hmm.. autotools?
[1:13] <Tv> the symbols would collide..
[1:13] <Tv> no clue
[1:13] <yehudasa> yeah, we've seen such issues
[1:14] <Tv> yehudasa: push your branch & let me know how to reproduce.. i can try and poke it
[1:14] <yehudasa> ok, will just push whatever I have
[1:15] <trollface> okay, sjust, gregaf, I'm an idiot
[1:15] <trollface> /thread
[1:16] * trollface sadface
[1:16] <gregaf> heh, what was it?
[1:16] <trollface> I was trying to do fancy stuff with crushmap
[1:16] <trollface> essentially I had root{ host {device} host {device} } etc
[1:17] <trollface> with selection rules select device from root
[1:17] <trollface> for some reason this didn't want to work - dunno why
[1:17] <trollface> I added all devices to root directly, it took a while but now it's all clean
[1:18] <trollface> the bonus question would be wtf is wrong with crushmap
[1:18] <trollface> ah, oh, I started those experiments trying to go to osd per disk, not per host
[1:19] <gregaf> hmm, I'm not a crushmap expert but I wonder if the rules were somehow not quite satisfiable so it just gave up on some of them, hrm
[1:19] <trollface> created the config like root{ rack { host {disk disk disk disk} host {disk} host {disk} } } and then rule to select host
[1:19] <gregaf> maybe sagewk would know
[1:19] <gregaf> oh, the machines were unbalanced and then you were selecting hosts before disks?
[1:20] <trollface> it didn't work without any explanation, I poked it with various utensils, then decided to revert as there were concurrent problems with the cluster (somebody decided to hijack a rack)
[1:20] <trollface> gregaf: so I started with 3-device cluster with replication 3
[1:20] <sagewk> trollface: did you have a two choose step rule, or were you using chooseleaf?
[1:21] <trollface> I tried to do step take root
[1:21] <trollface> step choose firstn 0 type host
[1:21] <trollface> wrong?
[1:21] <trollface> (and step emit of course)
[1:23] <sagewk> if you take that route you also need 'step choose firstn 1 type disk' to choose 1 disk (or is it device?) out of that host
[1:23] <sagewk> or, you chooseleaf.. that'll give you better results in the end.
[1:24] <sagewk> see http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
[1:25] <trollface> aha
[1:25] <trollface> I didn't pay attention to choose*leaf*
[1:25] <trollface> artificial intelligence stands no chance against natural stupidity
[1:25] <trollface> :(
[1:26] <trollface> thanks
[1:26] <trollface> will retry this thing tomorrow then.
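[editor's note: to make sagewk's two alternatives concrete, the decompiled crushmap rules would look roughly like this; the rule name, ruleset number, and sizes are illustrative, and whether the leaf type is 'device' or 'disk' depends on the map - see the wiki page linked above]

```
# alternative 1: two explicit choose steps
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step choose firstn 0 type host      # pick one host per replica...
        step choose firstn 1 type device    # ...then one device in each host
        step emit
}

# alternative 2: a single chooseleaf step does both at once
#       step take root
#       step chooseleaf firstn 0 type host
#       step emit
```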
[1:26] <trollface> by the way, did anyone do any testing of btrfs osd-per-machine vs. osd-per-disk config ?
[1:27] <trollface> performance wise, I guess
[1:28] <gregaf> not a lot - most people are trying to do an OSD-per-disk so far, I guess for the faster recovery in case of error
[1:28] <gregaf> it seems that you need about a full core per OSD to handle large-scale startup/recovery, though
[1:28] <yehudasa> Tv: pushed the lfn branch, doesn't work if configured --with-nss --without-cryptopp
[1:29] <yehudasa> a test case would be: ./rados -c ceph.conf -p data put bla12345678901234567890 bla
[1:29] <yehudasa> that'll crash the osd, but it might be that it'll crash even before
[1:56] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:56] <Tv> yehudasa: is the crash client or server side?
[1:56] <Tv> oh in osd i guess
[1:56] <Tv> as it's lfn
[1:56] <yehudasa> Tv: server side
[1:56] <yehudasa> right
[1:58] <Tv> make[3]: *** No rule to make target `osd/ClassHandler.cc', needed by `ClassHandler.o'. Stop.
[1:59] <Tv> :(
[1:59] <yehudasa> Tv: make distclean helps
[2:00] <Tv> bleh make :(
[2:02] <Tv> git clean -fXd is my friend
[2:03] <Tv> and ccache too..
[2:05] <Tv> yehudasa: you wouldn't happen to get this in your logs
[2:06] <Tv> 2011-04-27 17:05:27.962073 7f617e047700 cannot convert AES key for NSS: -8023
[2:06] <yehudasa> Tv: have no idea.. my logs are long gone
[2:08] <Tv> that sounds related
[2:22] <Tv> the usages being all out of whack wrt reality is starting to seriously annoy me
[2:24] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:55] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:59] * djlee (~dlee064@des152.esc.auckland.ac.nz) has joined #ceph
[3:16] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[3:18] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[3:39] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[4:07] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[4:15] <djlee> greg are you here?
[4:52] * djlee1 (~dlee064@des152.esc.auckland.ac.nz) has joined #ceph
[4:57] * djlee (~dlee064@des152.esc.auckland.ac.nz) Quit (Ping timeout: 480 seconds)
[5:08] * djlee (~dlee064@des152.esc.auckland.ac.nz) has joined #ceph
[5:13] * djlee1 (~dlee064@des152.esc.auckland.ac.nz) Quit (Ping timeout: 480 seconds)
[5:25] * djlee1 (~dlee064@des152.esc.auckland.ac.nz) has joined #ceph
[5:30] * djlee (~dlee064@des152.esc.auckland.ac.nz) Quit (Read error: Operation timed out)
[5:39] * djlee (~dlee064@des152.esc.auckland.ac.nz) has joined #ceph
[5:44] * djlee1 (~dlee064@des152.esc.auckland.ac.nz) Quit (Ping timeout: 480 seconds)
[5:44] * djlee (~dlee064@des152.esc.auckland.ac.nz) has left #ceph
[8:03] * cephnewbie (~cephnewbi@173-24-225-53.client.mchsi.com) has joined #ceph
[8:13] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[8:24] * cephnewbie (~cephnewbi@173-24-225-53.client.mchsi.com) Quit (Quit: Leaving)
[8:26] * joshd (~jdurgin@ has joined #ceph
[8:30] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[8:34] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:48] * allsystemsarego (~allsystem@ has joined #ceph
[8:53] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[8:59] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:03] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[9:07] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:18] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:18] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit ()
[9:24] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) has joined #ceph
[9:49] * joshd (~jdurgin@ Quit (Quit: Leaving.)
[9:58] * Yoric (~David@87-231-38-145.rev.numericable.fr) has joined #ceph
[10:00] * cephnewbie2 (~cephnewbi@173-24-225-53.client.mchsi.com) has joined #ceph
[10:44] * greghome (~greghome@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[10:44] * DanielFriesen (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[10:47] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Read error: Operation timed out)
[11:06] * greghome (~greghome@cpe-76-170-84-245.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[12:00] * greghome (~greghome@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[12:16] * greghome (~greghome@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: ~ Trillian Astra - www.trillian.im ~)
[12:46] * Guest3338 (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) Quit (Remote host closed the connection)
[12:48] * bbigras (quasselcor@bas11-montreal02-1128536388.dsl.bell.ca) has joined #ceph
[12:49] * bbigras is now known as Guest3488
[15:55] * allsystemsarego_ (~allsystem@ has joined #ceph
[15:59] * allsystemsarego (~allsystem@ Quit (Read error: Connection reset by peer)
[16:02] * allsystemsarego_ (~allsystem@ Quit (Quit: Leaving)
[17:36] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:44] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:52] * greglap (~Adium@ has joined #ceph
[17:54] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:13] * andret (~andre@pcandre.nine.ch) Quit (Ping timeout: 480 seconds)
[18:17] * andret (~andre@pcandre.nine.ch) has joined #ceph
[18:20] * Yoric (~David@87-231-38-145.rev.numericable.fr) Quit (Quit: Yoric)
[18:42] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:46] * Yulya_ (~Yulya@ip-95-220-190-12.bb.netbynet.ru) Quit (Quit: leaving)
[18:48] * Yulya (~Yulya@ip-95-220-190-12.bb.netbynet.ru) has joined #ceph
[18:52] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[18:58] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:14] * verwilst (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[19:17] * trollface (~stingray@stingr.net) Quit (Ping timeout: 480 seconds)
[19:21] <sagewk> bchrisman: pushed libceph patch adding 'struct' prefix
[19:22] <bchrisman> sagewk: thx
[19:31] <sagewk> skype!
[19:32] <sagewk> bchrisman: what's your skype id?
[19:32] <sagewk> (in a different room today)
[19:34] <bchrisman> scaleca
[19:52] * tjikkun_ (~tjikkun@195-240-187-63.ip.telfort.nl) has joined #ceph
[19:53] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Read error: No route to host)
[19:57] <sagewk> bchrisman: pushed
[20:00] <Tv> a magic recompile made it work :(
[20:02] <Tv> and now it fails again
[20:04] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[20:14] <Tv> it is almost as if it worked on secret keys that happen to be even
[20:20] <Tv> ok it is threading
[20:21] <Tv> "./cmon -i a -c ceph.conf" fails, "./cmon -i a -c ceph.conf -f" works
[20:21] <Tv> yehudasa: quick workaround: -f
[20:22] <yehudasa> Tv: -f makes it not call daemonize()
[20:22] <Tv> yeah just reading the defails
[20:22] <Tv> details
[20:38] <Tv> it seems i cannot fix nss without adding something like common_init_daemonized(), that must be called after msgr->start()
[20:49] <jjchen> I have some questions about Rados client read/write/remove APIs. Our application needs to support storing large objects (a few GB each in size). The semantics is that while a reader is reading, the content should not change. A write always creates a new copy and replaces the old copy when existing readers finish. I am thinking about how to use the existing Rados APIs to implement these semantics. My first question is how we can tell whether
[20:49] <Tv> jjchen: you got cut off after "whether"
[20:50] <jjchen> Resend the second part: My first question is how we can tell whether there are pending readers on an object in Rados? In addition, since each object can be large, users may want to call read multiple times to get back the entire object instead of using a single read. In between two consecutive reads, is there a way to prevent replacement or removal? Certainly we can always add locking in application code to implement this. But I am
[20:51] <Tv> jjchen: for the multiple reads, snapshots might be what you want
[20:51] <Tv> again, "But I am"
[20:51] <Tv> irc has silent maximum line length limits..
[20:51] <jjchen> resend the 3rd part: But I am wondering if Rados can handle this at its layer. If not, are there any thoughts for Rados to support such semantics in the future? Thanks a lot.
[20:52] <Tv> snapshots might be what you're looking for
[20:52] <Tv> that means you can still end up with two writers on the same object at the same time, i think
[20:54] <jjchen> In our case, I guess each write will need to create a new snapshot. The application handles assigning the version number. So whoever has the largest version number wins.
[20:54] <Tv> jjchen: i'm not fully familiar with rados, but my understanding is the snapshots are read-only, so you could take a snap when you begin to read, consume data at your pace, destroy the snapshot when you're done
[20:54] <jjchen> My question is more about removing the old versions, since we only want to keep one public version. When we try to remove the old snapshot, how can we know whether some readers are reading it?
[20:56] <Tv> jjchen: well each reader could use their own snapshot
[20:56] <Tv> but i really don't know what the performance tradeoffs are
[20:57] <jjchen> Is snapshot a metadata op only or creating a real data copy?
[20:57] <Tv> if nobody writes while the snapshot exists, no data needs to be copied
[21:00] <Tv> so tools/common.cc does not seem to use common_init at all
[21:00] <Tv> ah the tools themselves call it
[21:04] <jjchen> So let's say all readers read the current snapshot. Writers create a new copy. When writer finishes, we would want to switch to the new snapshot for new readers. We can delete the old snapshot after some time. Is this doable using Rados snapshot?
[21:05] <Tv> new reader takes a new snapshot
[21:05] <Tv> i have no idea whether that's prohibitively slow or not
[21:06] <Tv> but you asked for how to get consistency across multiple reads; that's the only thing i see in librados that can provide that
[21:06] <Tv> perhaps a better way is that you construct your object names carefully
[21:07] <jjchen> we can create snapshot per object, not for entire pool, right?
[21:09] <Tv> looks like it's the whole pool
[21:09] <Tv> but this is really just me reading librados.h and making educated guesses
[21:11] <jjchen> OK, thanks Tv. I hope somebody familiar with snapshot on the list can make some comments too. Well I am going to study more about rados snapshots.
[21:13] <wido> I tried debugging my memory hungry mon today, but the mon doesn't seem to use tcmalloc. The memory profiling page on the Wiki says the same
[21:13] <wido> Debugging such issues is new terrain for me, so some advice would be nice :)
[21:13] <Tv> wido: it should use tcmalloc unless you built with ./configure --without-tcmalloc
[21:14] <wido> Tv: http://ceph.newdream.net/wiki/Memory_Profiling
[21:14] <wido> That says different. My OSD and MDS indeed use tcmalloc, but the mon doesn't
[21:14] <wido> ldd /usr/bin/cmon|cosd|cmds shows the same
[21:15] <wido> cmon is not linked against libtcmalloc
[21:15] <Tv> wido: ah you're probably right
[21:15] <Tv> i just expected them all to be same
[21:33] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[21:53] <sagewk> wido: http://pastebin.com/gaiYANCf
[21:55] <gregaf> jjchen: Tv: I'm not sure how good a fit RADOS snapshots are for something like you're discussing
[21:55] <Tv> gregaf: yup, he might be better off naming his objects smartly
[21:56] <gregaf> unfortunately the use of our non-pool snapshots needs to be coordinated at the application level
[21:56] <gregaf> it's probably harder than locking, it's just that it's significantly more powerful
[22:03] * joshd (~jdurgin@ has joined #ceph
[22:05] <wido> sagewk: tnx! any particular reason why this isn't in the default Makefile?
[22:05] <sagewk> not really. i'm adding it..
[22:06] <Tv> anyone know what's the idea behind src/tools/ and why many of the things that seem like "tools" are not in it / using its init function?
[22:09] <gregaf> looks like Colin made the tools dir in late October
[22:09] <gregaf> I didn't know there was an init function?
[22:09] <gregaf> so my guess is incomplete aggregation of "tools" into a new subdir, followed by other people not using it
[22:10] <Tv> the tools/* call both common_init and a ceph_tool_common_init
[22:11] <Tv> aargh cfuse and its special daemonization
[22:12] <Tv> i can't find a nice place to plug this initialization into
[22:12] <Tv> it must happen after daemonization
[22:12] <Tv> i was going to do it in common_init on behalf of non-daemonizing things, but that breaks cfuse
[22:13] <gregaf> well cfuse loves you and thanks you for noticing ;)
[22:13] <gregaf> is it bad if the nss initialization happens twice?
[22:13] <Tv> yeah, second time does nothing
[22:13] <Tv> even if you forked in between
[22:13] <Tv> it *must* happen with the final pid, and you must never fork again
[22:13] <gregaf> wow, that is super lame
[22:13] <Tv> that's what my testing says
[22:14] <Tv> nss is not just lame, it is the yardstick of code smell
[22:15] <Tv> the SI unit of bad code is a millimozilla
[22:15] <gregaf> this is the library we're using because it's certified by the feds, right?
[22:15] <Tv> oh yeah
[22:15] <gregaf> heh
[22:15] <gregaf> I feel like I'm in an 80s action movie
[22:15] <gregaf> damn feds
[22:15] <Tv> notice how i made autoconf prefer crypto++ whenever possible
[22:17] <gregaf> wait, how do you tell in common_init if something's going to fork?
[22:17] <gregaf> isn't that decision made when it starts up the messenger?
[22:17] <gregaf> ...no, n/m
[22:17] <gregaf> separate options
[22:17] <Tv> gregaf: well it gets a "I'm not a daemon" thing
[22:17] <Tv> daemons can init crypto themselves, after the point where they either fork or not
[22:18] <Tv> but now cfuse messed up that plan
[22:18] <gregaf> yeah
[22:18] <Tv> though reading the code seems to reveal more cfuse daemonize funnies
[22:18] <Tv> checking that theory..
[22:19] <gregaf> yeah, cfuse is pretty ridiculous
[22:19] <gregaf> it needs to fork but it can't do so until it knows if *all* the separate component startups are going to work
[22:21] <Tv> yup, common_init sets daemonize=false for cfuse, always
[22:21] <Tv> its daemonization support has been broken for a while now ;)
[22:22] <Tv> since 7fe7a816 on 2011-03-09
[22:24] <gregaf> well cfuse still works?
[22:24] <Tv> it just never daemonizes
[22:24] <gregaf> actually I think the daemonization hacks got some major surgery from sage after that commit
[22:25] * joshd (~jdurgin@ Quit (Quit: Leaving.)
[22:30] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[22:39] <Tv> so here's the big suck
[22:39] <Tv> if you init libceph, you can't fork & keep using it
[22:39] <Tv> re-initing won't help
[22:47] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) has joined #ceph
[23:22] <Tv> pushed that horror to wip-nss-vs-fork
[23:25] * verwilst (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[23:37] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) Quit (Quit: Leaving.)
[23:40] * benpol (~benp@garage.reed.edu) has joined #ceph
[23:46] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) has joined #ceph
[23:50] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[23:54] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.