#ceph IRC Log


IRC Log for 2011-01-06

Timestamps are in GMT/BST.

[0:30] <bchrisman> any performance hit possibilities having a single osd with multiple disks? Will osd multithread I/O if/when it needs to?
[0:31] <cmccabe> bchrisman: the osd daemon is multithreaded
[0:31] <bchrisman> ok.. thx
[0:31] <cmccabe> bchrisman: it really depends on how you're using the multiple disks, and where the bottlenecks are (if any)
[0:31] <cmccabe> bchrisman: for example, you could easily create a system where the bottleneck was the raid controller
[0:32] <cmccabe> bchrisman: if you're using RAID to combine those disks into 1 device
[0:32] <bchrisman> cmccabe: yeah.. I was planning on just handing OSDs the raw disks and letting them sort it out.
[0:33] <bchrisman> cmccabe: unless there's a performance advantage to running md underneath.
[0:33] <bchrisman> cmccabe: no hardware raid right now.
[0:33] <cmccabe> bchrisman: I've mostly worked with single-disk OSD setups, I'm not sure what we recommend just yet
[0:34] <bchrisman> cmccabe: okay.. if I go multidev, I'll let you guys know if I hit anything.
[0:34] <cmccabe> bchrisman: yeah
[0:35] <bchrisman> cmccabe: osd journal is metadata only, not full data logging… correct?
[0:35] <cmccabe> bchrisman: yes
[0:35] <bchrisman> cmccabe: cool
[0:36] <bchrisman> cmccabe: thx
[0:38] <gregaf> cmccabe: bchrisman: osd journal is full data logging
[0:38] <gregaf> every write that goes to the OSD gets written to the journal
[0:39] <cmccabe> gregaf: oh, my bad. You're right...
[0:40] <bchrisman> so journal and data disk speeds/bandwidths are probably best when approximately equal…
[0:40] <gregaf> the biggest potential disadvantages of running a raid or whatever are 1) if you lose the whole RAID you've got a lot more data to recover across the network, and 2) if your RAID is big enough the journal might be your rate limiter instead of your disk random I/O
[0:40] <gregaf> bchrisman: yeah
[0:41] <gregaf> it's less of an issue if you're using btrfs, because with btrfs data can get into the filesystem without passing through the journal (if the journal gets too slow)
[0:41] <gregaf> but if you're using any other underlying fs it needs to pass everything through the journal to make sure its data store stays in a consistent state after a crash
[0:42] <bchrisman> ahh.. will be using btrfs… hmm
[0:43] <bchrisman> will try some options..
[0:43] <gregaf> how big and bad is your hardware?
[0:43] <gregaf> if your network's slower anyway then it's not going to matter
[0:44] <bchrisman> have access to a 3 node cluster, 4 2TB drives each.. SATA… gigabit private and public
[0:44] <cmccabe> bchrisman: gigabit maxes out at 125 MB/s theoretical (more like 90 MB/s actual)
[0:44] <gregaf> yeah, you probably don't need to worry about it then
[0:44] <bchrisman> yeah
[0:45] <cmccabe> bchrisman: 3.5" hard drives usually get around 125 MB/s
[0:45] <gregaf> cmccabe is being a little pessimistic about throughput, I've seen people get 115MB/s under the right workloads
[0:46] <bchrisman> gigabit-limited would be great performance..
[0:46] <gregaf> but if you're not running badass RAID cards with 10+ disks your streaming-write journal is unlikely to limit your random-write data store
[0:46] <gregaf> it depends a lot on the workload
[0:46] <cmccabe> gregaf: the 90 MB/s was from some tests I ran over TCP. Sometimes UDP or other protocols can do better. Also depends on your NIC, etc.
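
(As a rough check on the numbers above: 1 Gbit/s is 10^9 bits/s ÷ 8 = 125 MB/s of raw line rate. With standard 1500-byte frames, Ethernet, IP, and TCP headers consume roughly 5-6% of that, so a single bulk TCP stream tops out around 117 MB/s in practice; well-tuned setups land in the 110-118 MB/s range, while less tuned ones see closer to 90 MB/s.)
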
[0:47] <bchrisman> if I hang 3 disks on a single osd and configure one disk for the journal.. will the osd be writing sequentially to the log (or do the three devices journal separately anyways)?
[0:47] <gregaf> streaming writes from one client are pretty easy to get that on, but as you add complexity it gets less likely to saturate that way
[0:48] <gregaf> OSDs just have one journal
[0:48] <bchrisman> good.. thx
[0:48] <gregaf> to put 3 disks on it you need to somehow present a single volume for the OSD, either via RAID or by putting them together under btrfs
[0:49] <gregaf> I'd try some benchmarks; I'm not sure if a 3-disk store + 1-disk journal would actually be any faster than a 4-disk store&journal
[0:49] <bchrisman> yeah…
[0:50] <bchrisman> wait… 3 disks must be presented as one, though osds can handle multiple disks… because…? (integer multiple of 2 required?)
[0:51] <gregaf> the OSD just isn't equipped to handle multiple filesystems for its store
[0:51] <gregaf> the complexity in data management and safety would go up a great deal
[0:51] <bchrisman> ahh.. so osd-per-disk is the way to go there?
[0:52] <gregaf> well btrfs can present a single filesystem that spans multiple disks, and since it's our favored filesystem anyway there seems little point to adding the complexity into the OSD itself
[0:53] <bchrisman> yeah.. if btrfs is handling I/O balancing already.. definitely.
[0:53] <gregaf> OSD-per-disk versus spanned disks isn't something we've tested yet, but I suspect that in a scenario with low numbers of disks like this that spanning is a better choice
[0:53] <gregaf> since it leaves more memory for disk caching, etc
[0:54] <bchrisman> larger numbers of disks would then only be constrained by the poorer failure modes, I guess?
[0:54] <Tv|work> gregaf: how would disk caching be different? isn't the active set the same, either way..
[0:54] <gregaf> yeah, plus the ever-increasing amount of data you need to push over the network if you have disk issues
[0:55] <gregaf> well the cosd takes up a few hundred megs on its own, which reduces the amount available to the kernel for caching
[0:55] <gregaf> so less of the total data set can live in-memory
[0:55] <Tv|work> gregaf: ah so there's significant per-osd memory overhead in ceph's userspace component
[0:56] <gregaf> well, we like to think it's not too significant, but depending on the machine you're running yes
[0:56] <gregaf> ;)
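
A minimal sketch of the spanned-disk layout being discussed, assuming three data disks combined under btrfs plus a dedicated journal device; the device names, mount point, and hostname are illustrative, and the old-style spaced ceph.conf option names are assumed for this era:

    # combine three disks into a single btrfs filesystem for the OSD store
    mkfs.btrfs /dev/sdb /dev/sdc /dev/sdd
    mount -t btrfs /dev/sdb /srv/osd0

    # ceph.conf fragment pointing one cosd at the spanned store and a raw journal device
    [osd.0]
        host = node0
        osd data = /srv/osd0
        osd journal = /dev/sde1
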
[0:58] * Tv|work rereads cosd(8) and shivers at the thought of journal on flash memory..
[0:58] <Tv|work> but that may be just because i worked in the embedded space before the modern SSDs came to market
[0:58] <gregaf> haha, enterprise SLC drives can take that kind of load just fine
[0:58] <cmccabe> tv: I think wido experimented a little bit with journals on SSD
[0:59] <gregaf> good consumer MLC drives can take it for a few years depending on how constant the writes are
[0:59] <cmccabe> tv: I'm not sure what the conclusion was, if any
[0:59] <Tv|work> let's just say that we killed a few cf drives just by manual qa runs
[0:59] <gregaf> we've talked about this a few times
[0:59] <cmccabe> gregaf: yeah.
[0:59] <Tv|work> but yeah, modern ssds are supposed to be way better
[0:59] <gregaf> I think wido is running a couple Intel x25-m drives somewhere and they haven't died yet
[1:00] <cmccabe> gregaf: I was more wondering about performance
[1:00] <gregaf> theoretical calculations I've seen online say that model would last 6 months of constant full-speed writing
[1:00] <gregaf> but the OSDs don't have constant writes
[1:00] <cmccabe> gregaf: I was pretty sure it wasn't going to die in the short term
[1:00] <gregaf> and if you switch to an SLC drive you get a significant multiplier (20x or something)
[1:01] <gregaf> I dunno how long the new Sandforce drives are good for, though
[1:02] <cmccabe> gregaf: yeah, I read some analyses that basically concluded that SLC was more suited to applications where the data changes a lot than MLC
[1:02] <Tv|work> the other thing is, my other laptop has one of those intel ssds
[1:02] <cmccabe> gregaf: this was a while ago and perhaps the technology for MLC has gotten better
[1:02] <Tv|work> it's not all that fast for streaming writes
[1:02] <Tv|work> journals are streaming writes
[1:03] <gregaf> 70 MB/s
[1:03] <cmccabe> gregaf: however, at the time, SLC was much more suitable for use in a streaming / caching / files rapidly changing scenario
[1:03] <ijuz_> flash gets worse and not better due to the smaller structures
[1:03] <gregaf> newer SSDs have actually gotten much better, some of them actually use the sata-6gbps bandwidth
[1:03] <gregaf> SLC is just better than MLC, period, but way more expensive
[1:04] <gregaf> it's inherent in MLC being the exact same thing, except storing 2 (or 4 even?) bits per cell rather than 1
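
(A hedged back-of-the-envelope for the endurance figures above, using assumed rather than measured numbers: a 160 GB MLC drive rated for roughly 5,000 program/erase cycles can absorb about 160 GB × 5,000 = 800 TB of writes; at a sustained 70 MB/s that is ~6 TB/day, so 800 TB ÷ 6 TB/day ≈ 130 days, i.e. a few months of constant full-speed writing, in the same ballpark as the "6 months" estimate. SLC parts are typically rated around 100,000 cycles, which is where the roughly 20x multiplier comes from.)
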
[1:05] <cmccabe> I think one of the big reasons sata-6gbps was even rolled out is because of flash
[1:06] <cmccabe> since spinning disks kind of seemed to max out at 15k/20k RPM (if I remember correctly), they didn't really need the extra bandwidth
[1:07] <cmccabe> and raid controllers can just hang off the PCIe bus, so they didn't need sata-6gbps either
[2:20] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[2:24] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:34] * bbigras (quasselcor@bas11-montreal02-1128535712.dsl.bell.ca) has joined #ceph
[2:34] * bbigras is now known as Guest3437
[2:36] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[2:38] * Guest2806 (quasselcor@bas11-montreal02-1128535712.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[2:49] * greglap (~Adium@166.205.139.224) has joined #ceph
[2:52] * alexxy (~alexxy@79.173.81.171) Quit (Remote host closed the connection)
[2:58] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[3:11] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:17] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:36] * greglap (~Adium@166.205.139.224) Quit (Read error: Connection reset by peer)
[3:54] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[4:29] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[4:40] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[4:41] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (Quit: leaving)
[5:02] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[5:34] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[6:02] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (synthon.oftc.net oxygen.oftc.net)
[6:02] * MarkN (~nathan@59.167.240.178) Quit (synthon.oftc.net oxygen.oftc.net)
[6:02] * cmccabe (~cmccabe@208.80.64.200) Quit (synthon.oftc.net oxygen.oftc.net)
[6:02] * pruby (~tim@leibniz.catalyst.net.nz) Quit (synthon.oftc.net oxygen.oftc.net)
[6:03] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[6:03] * MarkN (~nathan@59.167.240.178) has joined #ceph
[6:03] * cmccabe (~cmccabe@208.80.64.200) has joined #ceph
[6:03] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[6:46] * ijuz_ (~ijuz@p4FFF6B07.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[6:55] * bchrisman1 (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[6:55] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[6:55] * ijuz_ (~ijuz@p4FFF585D.dip.t-dialin.net) has joined #ceph
[7:32] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[7:59] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:54] * allsystemsarego (~allsystem@188.27.165.135) has joined #ceph
[9:07] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:30] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[9:31] * DJLee (82d8d198@ircip3.mibbit.com) has joined #ceph
[9:46] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (Quit: leaving)
[10:01] * shdb (~shdb@217-162-231-62.dclient.hispeed.ch) Quit (Quit: leaving)
[10:23] * Yoric (~David@213.144.210.93) has joined #ceph
[10:44] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[10:44] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[11:35] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[11:49] * Yoric_ (~David@213.144.210.93) has joined #ceph
[11:49] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[11:49] * Yoric_ is now known as Yoric
[12:50] * shdb (~shdb@217-162-231-62.dclient.hispeed.ch) has joined #ceph
[13:42] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[13:44] <wido> greglap: Yes, I'm running about 30 X25-M's in some heavy database servers, they are still working fine after more than a year
[13:45] * wido (~wido@fubar.widodh.nl) Quit (Quit: Changing server)
[13:47] * wido (~wido@fubar.widodh.nl) has joined #ceph
[13:48] * wido (~wido@fubar.widodh.nl) Quit ()
[13:54] <stingray> http://stingr.net/g/personal_appeal.jpg
[14:03] <jantje> mkcephfs requires '-k /path/to/admin/keyring'. default location is /etc/ceph/keyring.bin.
[14:04] <jantje> I don't use keys/authentication
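
For context, a typical invocation from this era looks something like the following; the config path is the default, and the -a (all-hosts) flag is an assumption about the mkcephfs of this period:

    mkcephfs -c /etc/ceph/ceph.conf -a -k /etc/ceph/keyring.bin
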
[14:09] <jantje> 2011-01-06 14:08:13.018808 7fa065cf5710 journal op_submit_finish 9377 expected 9376, OUT OF ORDER
[16:07] <jantje> sagewk: you here?
[16:29] <sage> jantje: what version was that? can you send a full osd log leading up to that?
[16:38] <jantje> On IRC you gave me a link to a patch for the 64bit server <=> 32bit client issue : cc1: ..: Value too large for defined data type
[16:38] <jantje> I thought that one was fixed
[16:39] <jantje> so I installed 2.6.37 on client and server
[16:39] <jantje> but still have that issue
[16:39] <jantje> can't remember what exactly that was, do you remember?
[16:39] <jantje> because I need to get that working .. hehe :-)
[16:41] <jantje> for the out of order issue, not sure if I can reproduce that, the osd log is quite empty, it's with ceph version 0.24 (commit:180a4176035521940390f4ce24ee3eb7aa290632)
[16:45] <jantje> (more details on the 32 vs 64-bit issue are in the email I sent you on 17/11/2010, subject: X86 vs x64)
[16:46] <jantje> I'm sure you came up with a patch, but just can't remember it
[16:55] * wido (~wido@fubar.widodh.nl) has joined #ceph
[16:59] <jantje> hi wido
[17:06] <wido> hi jantje
[17:08] <wido> still going on with your tests? :)
[17:08] <jantje> I bumped into a 64bit server <=> 32bit client issue
[17:08] <jantje> I assumed it got fixed in 2.6.37
[17:09] <jantje> but I was wrong, I think :-)
[17:09] <jantje> and I lost the patch, so I'm feeling dumb
[17:09] * gregphone (~gregphone@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[17:09] * gregphone (~gregphone@cpe-76-90-239-202.socal.res.rr.com) has left #ceph
[17:09] <jantje> and now I'm on hold :-)
[17:16] <wido> jantje: Why are you still running 32-bit?
[17:16] <wido> Any particular reason?
[17:17] <jantje> our corporate compiler software is running on 32bit
[17:17] <jantje> and some other nasty things
[17:17] <jantje> I don't get to switch to x64 :-)
[17:18] * jantje got to go
[17:20] <wido> jantje: bye!
[17:31] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[17:39] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[17:51] * greglap (~Adium@166.205.138.131) has joined #ceph
[17:54] * bchrisman1 (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:24] <sagewk> jantje: still there?
[18:25] <jantje> yes, i'm at home now
[18:25] <sagewk> the commit is 3105c19c450ac7c18ab28c19d364b588767261b3
[18:26] <wido> jantje: to get back on the topic, too bad you're stuck with 32b
[18:27] <sagewk> ..and it was merged in 2.6.37-rc3. :/ you're sure you're running that code?
[18:28] <jantje> i'm going to rebuild my client kernel
[18:28] <sagewk> ok. as for the other crash you saw, that's fixed in the testing branch (soon to be v0.24.1). commit:259c509a8941bf7cdad8bd4ede0ccd73ca8a83d3
[18:29] <jantje> oh, cool
[18:33] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:36] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[18:38] <jantje> hm, i did a git reset --hard on my kernel tree, it should point back to what I pulled from git, right?
[18:38] <jantje> sometimes I really love cvs ...
[18:38] <sagewk> yeah..
[18:39] <sagewk> (about git --reset! :)
[18:39] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:39] * greglap (~Adium@166.205.138.131) Quit (Quit: Leaving.)
[18:42] <jantje> HEAD is now at 3c0eee3 Linux 2.6.37
[18:42] <jantje> ok, that's what I had
[18:46] <sagewk> maybe try something like http://fpaste.org/5adA/ to see wtf is going on?
[18:46] <sagewk> the second value should _always_ fall within 32 bits
[18:49] <jantje> i'll put it in
[19:01] <Tv|work> hehe $monkeyring
[19:01] <sagewk> :)
[19:01] <bchrisman> heh.. yeah.. that's amusing :)
[19:02] <jantje> http://fpaste.org/M22D/ btrfs
[19:03] <jantje> I wonder why I always have bad luck :-)
[19:03] <jantje> dinner time
[19:04] <bchrisman> Is there a use for a secondary gigabit link between nodes in a ceph cluster? Right now, I only see a single ip/host entry for osd's.. and the client (on public) will need to connect there… but when some sort of change (disk failure/rebalance/whatever) occurs, can those interactions be pushed over a secondary interface (private intra-cluster network)?
[19:04] <sagewk> jantje: bad news for josef :( what was the workload?
[19:04] <Tv|work> bchrisman: i'd just bond the ethernets together..
[19:06] <sagewk> bchrisman: you can also use a second interface for the osd<->osd communication (replication, recovery, peering)
[19:06] <gregaf> bchrisman: the OSDs have separate cluster and client connections, which can be placed on different interfaces if you like
[19:07] <bchrisman> there's a config parameter in there for that then? I'll look through the wiki and see if I can find it.
[19:07] <Tv|work> the one valid argument for the separation i can see is security; sniffability etc
[19:07] <gregaf> yeah, there's a separate param
[19:07] <Tv|work> anything else, you're just making statements like "i won't need 2Gbps towards the client"
[19:08] <gregaf> we got some requests for it, and due to the way the messaging system works having separate connections allows for some simpler code too
[19:08] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:09] <gregaf> I think people wanted it partly for security, and partly because they could afford to set up an IB or 10G network on the OSDs but not for all the clients ;)
[19:13] * cmccabe1 (~cmccabe@adsl-76-200-188-5.dsl.pltn13.sbcglobal.net) has joined #ceph
[19:17] <bchrisman> cluster addr seems to be the one.. would be set in ceph.conf inside each osd definition as 'cluster addr (hostname)'?
[19:17] <gregaf> yeah
[19:18] <bchrisman> cool
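
A concrete sketch of the split described above, with made-up addresses and the old-style spaced option names assumed:

    [osd.0]
        host = node0
        ; client-facing traffic
        public addr = 10.0.1.10
        ; osd<->osd replication, recovery, and peering traffic
        cluster addr = 10.0.2.10
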
[19:24] <jantje> sagewk: iozone
[19:35] <cmccabe1> johnl: you there?
[19:35] <cmccabe1> gregaf: I remember you guys were discussing how to run pprof, what was the conclusion for that?
[19:35] <jantje> i'll just send those traces to the btrfs list
[19:37] <gregaf> cmccabe1: it's just a perl script
[19:39] <cmccabe1> k
[19:40] <cmccabe1> I'm trying to figure out what the input is
[19:40] <cmccabe1> I'm pretty sure it's these .heap files, but none of the examples on the web site show .heap files as an argument?
[19:42] <gregaf> input the heap files and the executable
[19:42] <gregaf> http://google-perftools.googlecode.com/svn/trunk/doc/heapprofile.html
[19:43] <cmccabe1> oh, I feel stupid now. I was looking at this: http://goog-perftools.sourceforge.net/doc/cpu_profiler.html
[19:43] <cmccabe1> I guess the pprof script has multiple uses
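
For anyone following along, the heap flavour of pprof takes the binary plus one or more .heap dumps; the binary and profile names below are placeholders:

    pprof --text ./cosd cosd.0001.heap                         # text summary of allocations
    pprof --gv ./cosd cosd.0003.heap                           # graphical call graph
    pprof --text --base=cosd.0001.heap ./cosd cosd.0003.heap   # diff two snapshots
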
[19:57] <Tv|work> whee i'm getting good at making ceph crash ;)
[19:58] <Tv|work> http://pastebin.com/raw.php?i=PqXAnphP
[19:58] <Tv|work> mon/MonMap.h: In function 'const entity_addr_t& MonMap::get_addr(const std::string&)':
[19:58] <Tv|work> mon/MonMap.h:118: FAILED assert(mon_addr.count(n))
[19:59] <cmccabe1> sounds like you failed to give an address to the monitor?
[19:59] <cmccabe1> tv: of course, we should catch this and print a nice message, rather than an assert
[19:59] <Tv|work> sure, i dug that from the assert already
[19:59] <Tv|work> just saying, lots of low-hanging fruit
[20:00] <Tv|work> also, adding one makes it crash another way, hold on ;)
[20:00] <gregaf> eh, we should make it not crash but all it's going to do is print out an error and quit anyway :p
[20:01] <Tv|work> sure, this is more about discoverability
[20:01] <Tv|work> the other one looked a bit more difficult, trying to reproduce cleanly..
[20:01] <cmccabe1> tv: I always try to fix things like these as I find them
[20:01] <cmccabe1> tv: I think I had a few patches a while back to improve error messages
[20:02] <Tv|work> cmccabe1: i'm basically trying to use my "new guy" fresh goggles to see all that needs improving to make ceph approachable
[20:02] <Tv|work> like, intentionally making all the little mistakes, just to see what happens
[20:03] <cmccabe1> tv: check out the backtrace and see who is assuming that the monitor has an address
[20:03] <Tv|work> i got this on another run, still trying to reproduce cleanly:
[20:03] <cmccabe1> tv: if everyone assumes that, perhaps the error message needs to be in the common parsing somewhere
[20:03] <cmccabe1> tv: if it's only cmon, I would put it there.
[20:03] <Tv|work> mon/MonmapMonitor.cc: In function 'virtual void MonmapMonitor::encode_pending(ceph::bufferlist&)':
[20:03] <Tv|work> mon/MonmapMonitor.cc:99: FAILED assert(mon->monmap->epoch + 1 == pending_map.epoch || pending_map.epoch == 1)
[20:03] <cmccabe1> tv: perhaps greg or yehuda has better advice, but that's where i would start.
[20:04] <gregaf> Tv|work: that's an odd one...
[20:04] <cmccabe1> tv: that is one I haven't seen, perhaps try asking the others about the full backtrace?
[20:05] <Tv|work> it seems to be timing related
[20:06] <Tv|work> i'm just gathering up these nice little bug reports, for now
[20:06] <cmccabe1> tv: can you put the full backtrace into a pastebin? We can at least take a glance before you pass it along to sage
[20:06] <gregaf> well the assert is saying that the incoming map isn't of the expected version, and there are a few potential problem areas there, but hopefully it's a race that got recently introduced or something
[20:07] <cmccabe1> tv: check out the core file. See if the version makes sense or if you're getting an uninitialized variable or something
[20:08] <cmccabe1> tv: I had that problem in 6cdfa3045521d6c8f09218ba6361d9953e109aa3
[20:08] <Tv|work> cmccabe1: i expect all these to be easy; i'm just noting down how to reproduce, for now
[20:08] <gregaf> oh, wait, that's the leader, not a follower
[20:09] <gregaf> that's scarier
[20:09] <gregaf> hopefully just a problem with the store or something though
[20:09] <Tv|work> http://pastebin.com/raw.php?i=4V3PAPGR
[20:09] <Tv|work> that one i haven't been able to work around, yet
[20:12] <gregaf> I bet it's a store problem, but I don't play around with the config parts of startup much so I'm not sure
[20:13] <gregaf> you guys ready to meet?
[20:13] <cmccabe1> ready
[20:14] <Tv|work> ready
[20:38] <cmccabe1> I'm seeing lines like "_get_pool 2187 0 -> 1" in johnl's log
[20:38] <cmccabe1> is 2187 a reasonable number of pools?
[20:39] * breed (~breed@nat-dip6.cfw-a-gci.corp.yahoo.com) has joined #ceph
[20:40] <breed> anybody out there?
[20:40] <cmccabe1> breed: I'm here. I think it's lunchtime for some
[20:40] <breed> are there any prebuilt binaries for redhat 5.1 with statically linked libraries?
[20:41] <cmccabe1> I don't think so
[20:41] <breed> i'm running into a problem getting libcrypto++
[20:41] <cmccabe1> we have a spec file now and RPMs can be built
[21:41] <breed> I'm on a redhat 5.1 machine for which i do not have root and trying to get this thing compiled
[20:42] <breed> do you know where to pull the libcrypto++ from? i can't seem to find a project page
[20:42] <cmccabe1> not sure, yehuda did the libcrypto++ integration
[20:43] <cmccabe1> probably here? http://www.cryptopp.com/
[20:43] <gregaf> cmccabe1: I think johnl said he had 2600 PGs; it was something in that neighborhood
[20:44] <cmccabe1> also when you run configure, you can supply CRYPTOPP_CFLAGS and CRYPTOPP_LIBS
[20:44] <breed> ah right
[20:44] <breed> thanks!
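
A sketch of a non-root build against a locally unpacked Crypto++, using the configure variables mentioned above; the install prefix and paths are only examples:

    ./configure --prefix=$HOME/ceph \
        CRYPTOPP_CFLAGS="-I$HOME/local/include" \
        CRYPTOPP_LIBS="-L$HOME/local/lib -lcryptopp"
    make && make install
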
[21:02] <jantje> sagewk: [ 7608.483762] readdir ino 10000004e71 returning ino 4f71
[21:02] <jantje> works just fine ...
[21:03] <jantje> i'm compiling
[21:03] <jantje> i'm curious if i'll hit it again
[21:10] <sagewk> me too! :)
[21:11] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has left #ceph
[21:15] <bchrisman> libcrypto++ for redhat is in either EPEL or rpmforge.
[21:16] * cmccabe2 (~cmccabe@adsl-76-202-117-29.dsl.pltn13.sbcglobal.net) has joined #ceph
[21:17] <bchrisman> (precompiled)
[21:21] * cmccabe1 (~cmccabe@adsl-76-200-188-5.dsl.pltn13.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[21:21] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:27] <jantje> sagewk: seems to work, weird! thanks anyway :-)
[21:27] <jantje> now I seem to keep on hitting some btrfs bug
[21:29] <jantje> sagewk: Ok, I think I get it again
[21:29] <jantje> the printk looks OK
[21:43] <jantje> [ 9955.814226] readdir ino 100000002ea returning ino 3ea
[21:43] <jantje> [ 9955.819383] readdir ino 10000001037 returning ino 1137
[21:43] <jantje> [ 9955.824626] readdir ino 10000000cb6 returning ino db6
[21:43] <jantje> [ 9955.829784] readdir ino 1000000161b returning ino 171b
[21:43] <jantje> [ 9955.835027] readdir ino 10000000ffb returning ino efb
[21:43] <jantje> [ 9955.840185] readdir ino 1000000031e returning ino 21e
[21:43] <jantje> that looks different !
[21:46] <jantje> sagewk: i've sent you a link with strace in private
[21:49] <jantje> st_size=2402519022
[21:52] <jantje> that would be >2GB
[21:57] <jantje> ok, that stat is correct, nevermind
[22:00] <Tv|work> so office peeps, what's the story with lunch today? did i miss the train?
[22:01] <gregaf> oh, I think you might have, the guys around you all brought food
[22:01] <gregaf> sorry :(
[22:02] <Tv|work> nah that's fine
[22:02] <Tv|work> i'm a habitual lunch-skipper anyway :(
[22:02] <gregaf> you won't miss it tomorrow, there'll be too much noise ;)
[22:02] <Tv|work> so the catered days are tue & fri?
[22:02] <sjust> yup
[22:02] * allsystemsarego (~allsystem@188.27.165.135) Quit (Quit: Leaving)
[22:02] <Tv|work> need to remember that
[22:29] <jantje> sagewk: let me know if you have any other ideas to try :-)
[22:30] <sagewk> jantje: what is the current issue?
[22:35] <sagewk> still the readdir thing? or the 2gb thing?
[22:44] <wido> hi
[22:45] <wido> Just a small thing I'm working on, to show what you could do with phprados: http://zooi.widodh.nl/ceph/radosview/screenshots/
[22:45] <wido> I can't seem to crash the current RC with my tests atm except for #666
[22:46] <cmccabe2> test 666 or bug 666?
[22:46] <wido> bug #666
[22:47] <cmccabe2> I thought we fixed bug #666 recently
[22:47] <wido> oh, yes :-)
[22:47] <wido> I seem to have missed that notification
[22:47] <wido> I'll give it a try right away
[22:47] <cmccabe2> oh never mind, I see the latest updates on the page
[22:47] <cmccabe2> your information is more current than mine :)
[22:48] <cmccabe2> since it was reopened 44 minutes ago :)
[22:48] <wido> Yeah, I just read the update
[22:48] <wido> But i'll try the unstable anyway, see what it does in my situation
[22:50] <wido> Btw, now that I'm working on RADOSView (see the link above), I'd like to give a web interface a try which outputs the same info as 'ceph -s' (or almost the same)
[22:50] <wido> I checked libceph, that doesn't seem to provide all that info, does it?
[22:50] <wido> neither librados nor libceph gives me full access to that info, do they?
[22:52] <cmccabe2> that's an interesting project
[22:53] <jantje> sagewk: readdir
[22:53] <jantje> i get stuff like
[22:53] <jantje> 21:43 < jantje> [ 9955.840185] readdir ino 1000000031e returning ino 21e
[22:55] <cmccabe2> wido: yeah, I'm not really aware of any interface that exports that information now
[22:55] <wido> cmccabe2: It would be nice to have, since you could build an SNMP MIB on top of it
[22:55] <wido> or plugins for systems like Nagios or Zabbix
[22:56] <cmccabe2> wido: although maybe we should consider adding one. Those stats don't seem likely to change very much in the future since the fundamental things like PGs, UP/DOWN, IN/OUT concepts aren't changing
[22:56] <wido> right now you have to do a lot of bash tricks over the 'ceph' tool
[22:56] <cmccabe2> wido: we do have the ./ceph health interface
[22:56] <cmccabe2> wido: which was designed to give a one-word status of the system... I think HEALTH_OK, HEALTH_WARN, or HEALTH_... um... error?
[22:57] <wido> Yes, I know
[22:57] <cmccabe2> HEALTH_ERR
[22:57] <wido> mon0 -> 'HEALTH_OK' (0)
[22:57] <cmccabe2> wido: but that's not a very exciting display
[22:57] <sagewk> jantje: "[ 9955.840185] readdir ino 1000000031e returning ino 21e" is okay i think?
[22:57] <cmccabe2> wido: it really would be nice to get statuses of components
[22:57] <wido> Yes, sure, but it would be nice to be able to extract the degradation level
[22:57] <wido> num of up / down osd's, a simple struct would be enough with a few int's
[22:57] <sagewk> the problem is that the ceph ino's are 64-bit and it's hashing them down to 32 bits. the kernel does that internally for its own hash table
[22:58] <sagewk> and the resulting ino is given to userspace as is
[22:58] <wido> cmccabe2: I understand the priority is somewhere else at the moment, but building such plugins would be something I could contribute with :)
[22:58] <sagewk> we could improve the hash somewhat, i guess... but if htere are problems with collisions that's just delaying the inevitable
[22:58] <cmccabe2> wido: well, I guess file a feature request to add it to libceph
[22:59] <wido> Yeah, I'll do that
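
Until such an interface exists, a crude monitoring hook can be built on top of the existing CLI output. A minimal Nagios-style sketch, assuming 'ceph health' is on the PATH and prints a status string containing HEALTH_OK/HEALTH_WARN as shown earlier:

    #!/bin/bash
    # map 'ceph health' output to Nagios exit codes: 0=OK, 1=WARNING, 2=CRITICAL
    status=$(ceph health 2>/dev/null)
    case "$status" in
        *HEALTH_OK*)   echo "OK: $status";       exit 0 ;;
        *HEALTH_WARN*) echo "WARNING: $status";  exit 1 ;;
        *)             echo "CRITICAL: $status"; exit 2 ;;
    esac
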
[23:03] <jantje> sagewk: ow
[23:03] <jantje> damn, so the problem must be somewhere else
[23:09] <sagewk> what is the symptom you're seeing? you're getting an EOVERFLOW error in your compile still?
[23:10] <jantje> this is what's being printed: cc1: error: /mnt/ceph/panos-1: Value too large for defined data type
[23:10] <jantje> i'm trying to find out *what* cc command exactly
[23:11] <cmccabe2> try compiling with VERBOSE=1
[23:11] <cmccabe2> hmm, maybe that's just cmake
[23:11] <sagewk> if you can get an strace of the bad return code that'll help
[23:12] <cmccabe2> well, if you don't know what verbose is in your compile environment, you can always write a shell script like this
[23:12] <cmccabe2> #!/bin/bash
[23:12] <cmccabe2> echo $@
[23:12] <cmccabe2> gcc $@
[23:12] <cmccabe2> and then do CC=/path/to/my/shell/script
[23:18] <jantje> I wish it was that easy, apparently our project contains tons of makefiles and scripts that override all those things
[23:19] <cmccabe2> well you can always run it in a chroot if it gets out of control
[23:20] <cmccabe2> or just temporarily replace /bin/gcc if you're... that frustrated with your build :)
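
Pulling cmccabe2's wrapper idea together into one piece, with proper quoting and a log file so the failing compiler invocation survives even when the build system swallows the output; paths are illustrative:

    #!/bin/bash
    # log every compiler invocation, then run the real gcc with the same arguments
    echo "$*" >> /tmp/gcc-invocations.log
    exec gcc "$@"

and then point CC (or whatever variable the build actually honours) at the script.
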
[23:33] <wido> cmccabe2: Filed a feature request. There is no 'libceph' category, so I left it blank
[23:33] <cmccabe2> k
[23:44] <wido> I'm going afk, ttyl
[23:45] <cmccabe2> wido: see you!
[23:56] <jantje> it appears that gcc returns that error when I do gcc -I/path/to/dir
[23:56] <jantje> but does not complain on a different directory
[23:56] <jantje> so there must be something special about it?
[23:56] <cmccabe2> jantje: try running strace $GCC_COMMAND
[23:57] <cmccabe2> and maybe attach the output to a bug or paste it somewhere
[23:57] <jantje> yes, doing that
[23:57] <jantje> not seeing anything strange, to me
[23:57] <jantje> i'll put it online
[23:57] <cmccabe2> well, one of the system calls is failing presumably.
[23:57] <cmccabe2> so knowing which one could be important.
[23:58] <jantje> http://www.fpaste.org/aons/
[23:59] <jantje> cc1.orig is my renamed cc1 binary

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.