#ceph IRC Log


IRC Log for 2011-01-05

Timestamps are in GMT/BST.

[0:08] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[0:11] <iggy> grrr, kvm hasn't done a merge with upstream recently, so I'm going to have to use upstream qemu to test qemu-rbd
[0:13] <yehudasa> iggy: the qemu-rbd changes are not too intrusive. You can try and apply them over the kvm tree directly
[0:13] <iggy> I thought about that too
[0:13] <yehudasa> basically it's block/rbd* and some makefile changes
[0:13] <yehudasa> and configure
[0:13] <iggy> oh, good, it's that self contained
[0:22] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has left #ceph
[0:30] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:34] <johnl_> hi, anyone around that can help me with memory profiling osd?
[0:34] <gregaf> johnl_: yeah
[0:35] <johnl_> ha, hi greg! perfect
[0:35] <johnl_> I read the wiki page you gave me on the mailing list: http://ceph.newdream.net/wiki/Memory_Profiling
[0:35] <johnl_> first off, it uses the term g_conf which is not mentioned anywhere else in the wiki
[0:35] <johnl_> I assume it means global config section of ceph.conf ?
[0:36] <gregaf> ah
[0:37] <gregaf> or the osd section — g_conf is the configuration setup we have, and it can be set in the ceph.conf in the appropriate sections (so, for the OSDs, the global section, the OSD section, or the per-osd specific section)
[0:37] <gregaf> or on the command line when you start the daemon
[0:38] <johnl_> ah k. I put "tcmalloc_profiler_run = true" in the global section of my ceph.conf
[0:38] <gregaf> that should be fine
[0:39] <johnl_> where will it write the profile info? I can't see it in /, /tmp, /var/log/ceph or /var/log/ceph/stat
[0:39] <johnl_> (or the current dir)
[0:40] <gregaf> well, if it's working it should output wherever your logs output to
[0:41] <johnl_> ah, I have high debug level on so was probably drowning it out
[0:41] <johnl_> the tcmalloc docs mention a separate file so I was looking for it
[0:42] <gregaf> oh, yes, I meant in that dir
[0:42] <gregaf> it'll output its own file
[0:43] <johnl_> ah right. it's not doing that before cosd crashes (few minutes of running). I'll double check this is built with tcmalloc
[0:44] <gregaf> hmmm, that's odd
[0:44] <johnl_> ldd says so
[0:44] <gregaf> maybe I'd better check the profiling is actually working, I don't think it's been used much since getting put in
[0:44] <gregaf> you're on v0.24, right?
[0:45] <gregaf> and since you're here, did you get that info on how far the OSDs get on startup?
[0:46] <johnl_> yeah, 0.24
[0:46] <johnl_> yeah, I collected it whilst trying the profiler
[0:46] <johnl_> 2sec
[0:46] <johnl_> feel I should reply to the post though, for posterity :)
[0:46] <gregaf> yeah
[0:49] <gregaf> okay, it's at least basically working — start up one of your osds and then run "ceph osd tell # heapdump"
[0:49] <gregaf> and see if it's created a file there
[0:50] <gregaf> ohhh, I know what the problem is, duh
[0:50] <gregaf> you want "tcmalloc profiler run = true"
[0:50] <gregaf> without the underscores
[0:51] <johnl_> ah heh
[0:51] <johnl_> I tried the "ceph osd tell 0 start_profiler" as a fallback though and that didn't work either
[0:52] <johnl_> I'll fix the config though
[0:52] <johnl_> "turning on heap profiler with prefix /var/log/ceph/osd.0
[0:52] <johnl_> Starting tracking the heap
[0:52] <johnl_> "
[0:52] <johnl_> :D
[0:52] <gregaf> yay
[0:53] <johnl_> do I need the tell heapdump?
[0:53] <johnl_> assume not now
[0:53] <iggy> pkg-config needs to be added to the build deps on the checking out page
[0:54] <gregaf> if it's working properly it should dump automatically as it goes through memory
[0:54] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has left #ceph
[0:54] <gregaf> at every 1GB of allocations and 100MB increase in mem usage, IIRC
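A rough model of the dump schedule gregaf describes above: one dump after every 1GB of cumulative allocations, or when in-use memory climbs 100MB above the previous high-water mark. This is a simplified Python sketch, not tcmalloc's code; the function name dump_points and the event format (signed byte counts) are made up for illustration.

```python
GIB = 1 << 30
MIB = 1 << 20


def dump_points(events, alloc_interval=GIB, inuse_interval=100 * MIB):
    """Return (event_index, reason) for each simulated heap dump.

    events: iterable of signed byte counts (positive = allocate,
    negative = free). Hypothetical input format, for illustration only.
    """
    total_alloc = 0      # cumulative bytes ever allocated
    in_use = 0           # bytes currently held
    last_dump_alloc = 0  # total_alloc at the time of the last dump
    high_water = 0       # in_use high-water mark at the last dump
    dumps = []
    for i, delta in enumerate(events):
        if delta > 0:
            total_alloc += delta
        in_use += delta
        if total_alloc - last_dump_alloc >= alloc_interval:
            reason = "alloc"
        elif in_use - high_water >= inuse_interval:
            reason = "inuse"
        else:
            continue
        dumps.append((i, reason))
        last_dump_alloc = total_alloc
        high_water = max(high_water, in_use)
    return dumps


# Churn: allocate and free 64MB sixteen times. In-use memory never grows,
# but cumulative allocations reach 1GB on the sixteenth allocation.
churn = [64 * MIB, -64 * MIB] * 16
print(dump_points(churn))
```

A leaking daemon would mostly trip the "inuse" rule, while a busy but healthy one would mostly trip the "alloc" rule, which is why both kinds of dump file can show up in the log directory.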
[0:54] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:55] <iggy> and libcrypto++-dev
[0:56] <gregaf> iggy: on the wiki?
[0:56] <iggy> yes
[0:56] <gregaf> k
[0:57] <iggy> I'll do it later if nobody gets to it, I'm just kind of documenting as I go
[0:57] <gregaf> yeah, in general it is a wiki :p
[0:57] <gregaf> but libcrypto replaced some dependencies and I'm not sure about pkg-config
[0:57] <gregaf> so I'll get the guys here to do it
[0:57] <iggy> it's required for autogen.sh
[0:59] <iggy> I'm going to try to benchmark qemu/kvm-rbd vs kernel-rbd + qemu/kvm
[0:59] <johnl_> gregaf: am I ok to just send the raw output? or do I need to preprocess them first (with pprof?)
[1:00] <gregaf> if you can preprocess them first that's easiest
[1:00] <gregaf> otherwise getting the symbols and stuff is a pain
[1:02] <cmccabe1> gregaf: pkgconfig is something I introduced to help with the gtk libraries
[1:02] <cmccabe1> gregaf: it's pretty closely integrated with autotools but still somehow a separate package on most systems
[1:03] <johnl_> helpfully, the pprof binary is missing from these ubuntu packages. thanks ubuntu. will sort it out
[1:03] <gregaf> johnl_: are there separate dev packages you don't have installed, maybe?
[1:05] <gregaf> actually, it looks like it's just a perl script
[1:06] <johnl_> not in any dev package. will just try to grab the perl script then
[1:06] <gregaf> so you should just be able to grab it out of their source control, eg http://google-perftools.googlecode.com/svn-history/r100/trunk/src/pprof
[1:07] <johnl_> yer, ta. was just looking for it :)
[1:20] <cmccabe1> johnl_: I replied to your mail
[1:21] <cmccabe1> johnl_: I did have my cosd get killed by the OOM killer a week or two ago
[1:21] <cmccabe1> johnl_: like I wrote in the mail, there were some other large programs running at the time (like a TCL shell and some compiles) so I kind of wrote it off as that
[1:22] <cmccabe1> johnl_: however, it's at least possible that we have a memory leak of some kind
[1:23] <johnl_> cmccabe1: this cluster has nothing else on it, and I can reproduce this, so looks like a leak
[1:23] <johnl_> getting all the debug info together that gregaf asked for.
[1:24] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[1:24] <cmccabe1> johnl_: sounds good!
[1:25] <gregaf> either that or some of our recent OSD changes inadvertently introduced some scaling memory use that we'll have to track down
[1:25] <cmccabe1> gregaf: it is possible, but I saw the issue when running from vstart.sh, with a completely standard config
[1:26] <gregaf> cmccabe1: on another note, did you remove generic_dout or do anything to it?
[1:26] <cmccabe1> gregaf: and little to no data stored!
[1:26] <gregaf> running the profiler via a config option doesn't work on the current unstable due to an assert failure in generic_dout, but it's fine on v0.24
[1:27] <cmccabe1> gregaf: generic_dout is the same as it ever was
[1:27] <gregaf> I tried to git bisect but the first commit it took me to the monitor wouldn't even start up ;)
[1:27] <cmccabe1> gregaf: I think I know what it is
[1:28] <cmccabe1> gregaf: generic_dout bypasses the dout mutex, causing problems if it is the first invocation of dout
[1:28] <gregaf> something about asserting the log file is opened
[1:30] <cmccabe1> gregaf: ah, generic_dout needs to take the mutex all the time, I can see now
[1:34] <iggy> that was easier than the last time I tried setting up a ceph cluster
[1:36] <yehudasa> iggy: as you may remember, we've come a long way
[1:38] <iggy> yeah, I've been following along for a while
[1:38] <gregaf> johnl_: I'm heading out now but I'll get back on later; if you leave me a message I'll see it ;)
[1:38] <iggy> the last time I tried there wasn't an init script... so that's how long it's been since I've tried
[1:42] <johnl_> gregaf, cmccabe1: mailed debug info
[1:43] <johnl_> unfortunately I'm quite tired and need to go to sleep. any questions before I go?
[1:43] <cmccabe1> johnl_: thanks. have a good one
[1:44] <cmccabe1> johnl_: it might be handy if you sent the ceph.conf
[1:47] <johnl_> right, will do
[1:49] <cmccabe1> johnl_: I forgot, did you ever give us a login on your machine?
[1:50] <cmccabe1> johnl_: unless I'm mis-interpreting it, the pprof output has a lot of unresolved function addresses :\
[1:50] <cmccabe1> johnl_: anyway, keep around those binaries, and when you next come online we can try to figure out what the mystery functions are.
[1:51] <cmccabe1> johnl_: have a good one.
[1:52] <johnl_> I did give logins yeah. though I've since removed public ips
[1:52] <johnl_> I can add back trivially. gimme a mo
[1:53] <johnl_> I used ldd to find every linked library I could and ensured the appropriate debug packages were installed
[1:53] <cmccabe1> johnl: oh, great. I just want to know what function is using 32.8% of the memory... and I'm pretty sure gdb will tell
[1:53] <johnl_> (which resolved many of the symbols)
[1:53] <johnl_> but not those top ones
[1:54] <cmccabe1> johnl: yeah, symbol resolution is something that consistently seems to annoy
[1:55] <johnl_> you should be able to ssh in to root@public.srv-9pjdl.gb1.brightbox.com now
[1:55] <cmccabe1> johnl: somehow symbols in shared libraries sometimes don't come up. Nobody has had time to thoroughly investigate
[1:55] <johnl_> your key is still on there
[1:55] <johnl_> that is the node I've been running the profiler on
[1:55] <cmccabe1> johnl_: thx
[1:55] <johnl_> the ceph.conf is on there too obviously
[1:56] <johnl_> complete test cluster, feel free to be destructive if you need to. no real data either.
[1:58] <johnl_> added your key to the other two nodes if you need it
[1:58] <cmccabe1> johnl_: thanks! I'm just going to check out some symbol names in gdb though. Shouldn't be destructive of anything.
[1:58] <johnl_> accessible via the 10 network via that first node
[1:58] <johnl_> k cool
[1:59] <johnl_> right, sleep time. back in 8 hours or so :)
[2:00] <johnl_> nn
[2:01] <cmccabe1> johnl_: night
[2:08] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[2:09] * greglap (~Adium@ has joined #ceph
[2:23] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:24] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:43] * greglap1 (~Adium@ has joined #ceph
[2:45] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[2:50] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:51] * greglap1 (~Adium@ Quit (Quit: Leaving.)
[3:02] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Read error: Connection reset by peer)
[3:08] * terang (~me@ip-66-33-206-8.dreamhost.com) has joined #ceph
[3:21] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[4:48] <darkfader> anyone still here?
[5:22] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[5:23] * terang (~me@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving)
[6:21] * gregorg_taf (~Greg@ has joined #ceph
[6:21] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[6:33] * ijuz_ (~ijuz@p4FFF7176.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[6:42] * ijuz_ (~ijuz@p4FFF6B07.dip.t-dialin.net) has joined #ceph
[8:44] * cmccabe1 (~cmccabe@adsl-76-200-188-5.dsl.pltn13.sbcglobal.net) has left #ceph
[8:45] * cmccabe1 (~cmccabe@adsl-76-200-188-5.dsl.pltn13.sbcglobal.net) has joined #ceph
[8:45] * cmccabe1 (~cmccabe@adsl-76-200-188-5.dsl.pltn13.sbcglobal.net) has left #ceph
[10:09] * Yoric (~David@ has joined #ceph
[10:15] * allsystemsarego (~allsystem@ has joined #ceph
[11:24] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[11:27] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[12:02] <jantje> *yawn*
[12:02] <jantje> so it's Ceph playtime again
[12:04] <jantje> I just hope it's a bit faster on random read/write with the new version, last time my compiles took longer on a 6 OSD ceph filesystem (3 nodes) than on a single local machine with 1 harddisk.
[12:04] <jantje> (read longer = *3)
[12:05] <jantje> well, i will let you know what it's like tomorrow
[12:05] <jantje> maybe I should also increase my node size
[12:08] <stingray> [censored] scrubs, how do they work?
[15:01] <darkfader> hey all
[15:02] <darkfader> jantje: small block r/w is the hardest for distributed fs i think
[15:29] * allsystemsarego_ (~allsystem@ has joined #ceph
[15:35] * allsystemsarego (~allsystem@ Quit (Ping timeout: 480 seconds)
[16:02] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[16:03] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[17:50] * greglap (~Adium@ has joined #ceph
[17:51] * bchrisman1 (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[17:51] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[17:56] * bchrisman1 (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit ()
[18:18] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:32] * Yoric (~David@ Quit (Quit: Yoric)
[18:38] <greglap> jantje: I don't think your compile times are going to have improved in this version
[18:38] <greglap> most of the work was in failure recovery and such
[18:43] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:53] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:58] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:05] <bchrisman> I've been looking for setup recommendations in terms of disk layout etc… I might be missing a wiki page on it?
[19:06] <bchrisman> such as… are we better off piling our disks under a software/hardware raid… or run naked devices? and if so.. does it make sense to split a single device to provide both a journal partition and a data partition?
[19:07] <bchrisman> Or will ceph's mkfs pretty much optimally setup a cluster, given a list of disks/osds?
[19:08] <gregaf> bchrisman: it's not something we've experimented with a lot
[19:08] <gregaf> but it depends partly on what characteristics you're after with your cluster
[19:08] * Anticimex (anticimex@netforce.csbnet.se) Quit (Quit: server move)
[19:08] * sjust (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:10] <gregaf> you've basically got the choice between putting them in a RAID and running one cosd, or running one cosd per disk (and any combination between there)
[19:11] <gregaf> the first option takes less RAM but will require more data recovery in the event you lose a single disk
[19:12] <bchrisman> I'm trying to think of the general write path… it's going to write to the journal first… and then eventually to data storage… (this is an assumption on my part) .. if I have the journal and the data storage on the same drive, even in separate partitions, a few parallel writes could start crunching the disk…? I'm also assuming that each osd has its own journal… is that correct?
[19:13] <gregaf> yes, that's correct
[19:13] <bchrisman> My use case is basically I'm going to put up a 'very volatile disk space' server that people in my office can bang on…
[19:14] <bchrisman> Is it possible to journal one OSD on another node in the cluster? From looking at the config file, I would assume no...
[19:14] <gregaf> no no, that wouldn't make much sense
[19:15] <gregaf> the OSD journal is a mechanism to make sure the OSD can always get to a consistent disk state, so it can't rely on other devices being up
[19:15] <bchrisman> definitely.
[19:16] <gregaf> administration will be easier on you if you just run one cosd per server, you won't have to deal with setting up proper CRUSH maps and stuff
[19:17] <gregaf> although if your RAID bandwidth drastically outpaces your journaling bandwidth you might just want to, say, devote 1 disk to journal for every 4 other disks so that you don't get the journal limiting write speeds
[19:18] <gregaf> I don't think I would bother splitting them until you get over ~5 drives, though, I think that's how many disks our OSDs here are using
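The journal-versus-data-store trade-off in this exchange can be put as simple arithmetic: every write has to pass through both the journal and the data store, so sustained throughput is capped by the slower of the two. A minimal Python sketch, assuming made-up bandwidth figures and hypothetical function names (nothing here is a Ceph API):

```python
def sustained_write_mb_s(journal_mb_s, data_store_mb_s):
    """Sustained write rate is capped by the slower of the two devices."""
    return min(journal_mb_s, data_store_mb_s)


def limiting_factor(journal_mb_s, data_store_mb_s):
    """Name whichever side caps throughput (ties go to the data store)."""
    return "journal" if journal_mb_s < data_store_mb_s else "data store"


# Illustrative numbers only: a 4-disk RAID that can absorb 200 MB/s,
# journaling to a single disk that can stream 100 MB/s.
print(sustained_write_mb_s(100, 200), limiting_factor(100, 200))
```

With numbers like these the journal is the cap, which is the situation where dedicating a separate disk to the journal starts to pay off.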
[19:18] <stingray> setcrushmap, how does it work
[19:20] <stingray> gregaf: so I adjust weights, and push it
[19:20] <stingray> degraded percentometer immediately shoots to, say, 22%
[19:20] <stingray> ok, wrong starting point
[19:20] <stingray> I have 3 servers and pg_size 3
[19:21] <stingray> so every group is replicated on every server
[19:21] <stingray> then I add one more server
[19:21] <stingray> and adjust crushmap
[19:21] <stingray> now
[19:21] <bchrisman> gregaf: trying to make sure I'm grokking that… journal write speed will limit overall performance. I'm guessing data write speed will *eventually* limit performance as well, but not as quickly as journal write spamming?
[19:21] <stingray> I kill 2 of my original servers
[19:21] <stingray> will the system be able to get 100% coverage ?
[19:22] <stingray> (the degraded percentometer may show 33% but that's another story)
[19:22] <gregaf> bchrisman: well, in general you expect data store write speed to limit performance, but we've seen some people running with high-end RAID controllers and 10+ disks who can sustain 200+MB/s random writes, and in cases like that their journaling device is an unexpected limiting factor
[19:23] <gregaf> stingray: if you've got 3 cosds and pg_size 3 and nothing's degraded, you will maintain read/write access even with one cosd operating, yes
[19:23] <gregaf> it'll be 66% degraded, though, I think
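The 66% figure follows from counting missing replicas. A small Python sketch under the assumption stated in the conversation (every placement group is meant to have pg_size copies, one per cosd); degraded_fraction is a hypothetical helper, and how Ceph itself counts degraded objects may differ in detail:

```python
def degraded_fraction(pg_size, live_replicas):
    """Fraction of expected replicas that are currently missing."""
    missing = max(pg_size - live_replicas, 0)
    return missing / pg_size


# pg_size 3 with only one cosd left: two of three copies are missing.
print(f"{degraded_fraction(3, 1):.0%}")
```

The data stays readable and writable as long as one copy survives; the percentage only says how far the cluster is from full redundancy.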
[19:23] <stingray> notice the order of events here
[19:24] <stingray> I added 4th osd
[19:24] <stingray> I set weight to 1
[19:24] <gregaf> oh, I didn't see the extra server
[19:24] <stingray> it started peering/moving stuff/whatever
[19:24] <stingray> THEN I shot 2 of my original servers
[19:25] <stingray> is it possible that chunks that are not supposed to be on the remaining osd (osd0 for short) are already discarded/disappeared/whatever
[19:25] <stingray> and I actually lose some data
[19:26] <bchrisman> gregaf: thanks
[19:27] <stingray> ?
[19:27] <sagewk> the peering/recovery does not delete old pgs until the pg is fully recovered/replicated in the new location(s).
[19:28] <sagewk> this is pretty conservative, and can push you toward ENOSPC in some cases, but it is better than losing data ;)
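The rule sagewk states can be written as a single predicate: an old copy of a PG is kept until every new location holds a fully recovered copy. A toy Python sketch with hypothetical names; the real logic lives in the OSD's peering and recovery code and is considerably more involved:

```python
def can_delete_old_copy(new_locations, recovered):
    """True only when every new location has finished recovering the PG.

    new_locations: set of OSD ids the PG now maps to.
    recovered: set of OSD ids that hold a complete copy.
    """
    return bool(new_locations) and new_locations <= recovered


# The PG is being moved to osd3 but recovery has not finished yet,
# so the old copy on the original OSD has to stay.
print(can_delete_old_copy({"osd3"}, set()))
print(can_delete_old_copy({"osd3"}, {"osd3"}))
```

Holding on to the old copy is what costs extra disk space during rebalancing, hence the ENOSPC caveat.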
[19:37] <stingray> great
[19:37] <stingray> who knows the pg placement? mons?
[19:38] <sagewk> everyone; it's encoded in the osdmap
[19:38] <bchrisman> client included?
[19:38] <sagewk> yeah
[19:38] <stingray> what if I do the following:
[19:38] <bchrisman> makes sense
[19:39] <stingray> 1. kill one osd
[19:39] <stingray> 2. wipe /data/osd
[19:39] <stingray> 3. cosd --mkfs blabla
[19:39] <stingray> 4. start osd
[19:40] <stingray> the pg placement suddenly shall become incorrect
[19:40] <gregaf> what do you mean, the placement shall become incorrect?
[19:41] <gregaf> proper placements are calculated based on the set of osds that are marked "in"; if an OSD goes down for any reason it's marked as such and the monitor defines a new "acting" set for its PGs so that they remain operational
[19:42] <gregaf> in the case you've outlined there, the OSD will just start recovery and all the PGs will get sent back to it from its peers
[19:43] <bchrisman> will look basically like an osd replace, I'd expect?
[19:43] <gregaf> you'll lose access to its PGs for a few seconds while the system determines the OSD is down and sets its peers to take over duty as the "primary" for PGs, but that will be short and everything will continue as normal once it's done
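A toy model of the "acting" set gregaf describes: when one OSD in a PG's mapping goes down, the remaining members carry on and the first of them takes over as primary. This Python sketch assumes a PG's mapping is simply an ordered list of OSD ids; acting_set is a hypothetical helper, and real placement is computed by CRUSH and distributed in the osdmap.

```python
def acting_set(mapped_osds, up_osds):
    """Members of the PG's mapping that are currently up, in mapping order.

    The first entry of the result plays the role of the primary.
    """
    return [osd for osd in mapped_osds if osd in up_osds]


# PG mapped to osd 0, 1 and 2; osd0 is killed and wiped.
acting = acting_set([0, 1, 2], up_osds={1, 2})
print(acting, "primary:", acting[0])

# Once osd0 is restarted it rejoins the set and recovers the PG's data
# from its peers.
print(acting_set([0, 1, 2], up_osds={0, 1, 2}))
```

The short loss of access mentioned above corresponds to the window between osd0 dying and the monitor publishing a map with the reduced acting set.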
[19:43] <stingray> great
[19:43] <stingray> thanks
[19:43] <stingray> I was just trying to understand failure modes :)
[19:44] <gregaf> we should probably write some new reference material for that kind of thing
[19:44] <gregaf> data placement is calculated based on the set of OSDs that are "in"
[19:45] <gregaf> if an OSD fails (and all issues are failures, basically), it will get quickly marked "down" (how quickly is configurable, but generally about 20 seconds)
[19:45] <gregaf> and data that is supposed to be on that OSD is degraded, so there aren't enough copies
[19:46] <stingray> ok, so another thing
[19:46] <stingray> I start it back
[19:46] <gregaf> if after a configurable amount of time the OSD isn't back "up", it is marked "out" and then a new mapping is calculated
[19:46] <stingray> knowing that some data is good but some is bad
[19:46] <stingray> you can never know if power went down
[19:46] <stingray> how can I prevent this thing from marking everything back healthy unless it verifies the checksums or whatever
[19:47] <gregaf> in the intervening period the map has a different "acting" set encoded for PGs that include that OSD, which is just the other OSDs in that placement group, so that there's a new primary and the PG remains accessible
[19:47] <gregaf> at present it assumes all data it can find is good
[19:47] <gregaf> there are hooks for checksumming, though
[19:48] <gregaf> oh, sorry, I misspoke — it assumes all data it can find which has the same size as the other nodes have is good, it does comparisons to check for that
[19:48] <gregaf> and it replays the journal to make sure any in-progress writes make it through
[19:53] <ijuz_> no idea if that is interesting for someone, there are "comparisons" of different parallel filesystems, though i was not able to learn from that what i was searching for http://wr.informatik.uni-hamburg.de/research/labs
[20:02] <bchrisman> nice document… reading through it… would've been nice if they got gfs & tahoe working but… :)
[20:03] <ijuz_> a colleague thought that fhgfs might be an option and i was searching for something that makes clear that it sucks :)
[20:05] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:06] <alexxy> sagewk: sage: ping =)
[20:06] <alexxy> sagewk: sage: i added ceph to gentoo repos =)
[20:28] <sagewk> alexxy: nice!
[20:29] <alexxy> http://packages.gentoo.org/package/sys-cluster/ceph
[20:30] <alexxy> and i know i forgot about 0.24
[20:30] <alexxy> =)
[20:30] <alexxy> so i'll add it today
[20:35] <cmccabe> johnl_: are you there?
[21:01] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:01] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[21:20] <wido> great news, my Ceph cluster on 'noisy' has been running for 13 days now
[21:20] <wido> No crash :)
[21:20] <cmccabe> :)
[21:20] <wido> It's just running a single VM, nothing special
[21:20] <wido> But I'm writing some apps against librados, everything seems to be holding out
[21:45] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:45] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[21:59] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:59] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[21:59] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) Quit ()
[22:07] * Anticimex (anticimex@netforce.csbnet.se) has joined #ceph
[23:47] * allsystemsarego_ (~allsystem@ Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.