#ceph IRC Log

IRC Log for 2011-04-26

Timestamps are in GMT/BST.

[0:02] <Tv> bleh still get the same openssl internal error
[0:05] <cmccabe> yehudasa: I seem to have to restart apache after adding rgw users
[0:05] <cmccabe> yehudasa: is there a shortcut for that
[0:05] <Tv> hey this time around i found an android bug for that.. it might even get fixed some day
[0:07] <yehudasa> cmccabe: probably apache just caches some info?
[0:07] <yehudasa> cmccabe: just wait a couple of minutes?
[0:07] <cmccabe> yehudasa: this is more of an FYI
[0:07] <cmccabe> yehudasa: I don't know whether to call it a bug or not
[0:07] <cmccabe> yehudasa: I guess maybe it is caching the 403 forbidden error in my case
[0:08] <yehudasa> cmccabe: probably
[0:13] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[0:22] <cmccabe> yehudasa: I suppose that sort of thing is somewhat expected in a system with eventual consistency
[0:23] <cmccabe> yehudasa: still, I think we may want to turn down the caching a little bit to prevent too many surprises
[0:53] <bchrisman> cmccabe: I'm converting testceph.cc to straight C… looks like just converting cout/cerr to printf.. anything else you can think of that might be problematic with this?
[0:54] <bchrisman> cmccabe: converting meaning: making a testceph.c out of it.
[1:00] <cmccabe> bchrisman: back
[1:01] <cmccabe> bchrisman: it should be fine as straight C
[1:02] <cmccabe> bchrisman: I'm not sure why it was C++ to begin with; the libceph API is C
[1:02] <bchrisman> cmccabe: yeah… odd :)
[1:06] <Tv> Oh hey -- who's on support watch this week?
[1:07] <Tv> it's monday and we forgot!
[1:07] <cmccabe> tv: it's me
[1:08] <Tv> ok good to know
[1:08] <cmccabe> we forgot in the meeting, but I think greg sent to the channel afterwards
[1:09] <bchrisman> cmccabe: Did you see this before? Looks like a makefile issue… it's not picking up testceph.o and thus not finding main? http://pastebin.com/HTXq89QF
[1:14] <cmccabe> bchrisman: automake isn't the smartest kid on the block
[1:14] <cmccabe> bchrisman: I think you may need to do make distclean after a change like this
[1:15] <bchrisman> cmccabe: ok.. will check that.
[1:15] <Tv> bchrisman: does it actually have a main?-)
[1:16] <Tv> ohhh wait even more so -- why are there no .o's listed to be linked into the output?
[1:16] <Tv> bchrisman: can you show a diff of what you did?
[1:16] <bchrisman> yup
[1:17] <bchrisman> that's the primary problem… the .o file isn't getting into the link command
[1:17] <Tv> btw that's still saying g++
[1:17] <bchrisman> not sure if distclean will fix that but I'll run it and see.
[1:17] <bchrisman> yeah… it's compiling testceph.cc right now
[1:17] <Tv> perhaps not even distclean but ./autogen.sh
[1:17] <bchrisman> (or rather, attempting)
[1:17] <bchrisman> yeah.. will run from autogen on down
[1:17] <Tv> bchrisman: make your branch, or a diff, visible to us, can't debug blind
[1:18] <bchrisman> Tv: understood
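As an aside, a minimal sketch of the kind of Makefile.am change being discussed; the target name, path, and library here are hypothetical, and the real ceph Makefile.am entries may differ:

    # replace the old C++ source with the new C file for the libceph test client
    bin_PROGRAMS += testceph
    testceph_SOURCES = client/testceph.c
    testceph_LDADD = libceph.la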
[1:19] <cmccabe> tv: the problem is you need to rerun make distclean
[1:19] <cmccabe> tv: if you want to experience it for yourself, just move a .cc file from one directory to another
[1:19] <Tv> cmccabe: i find ./autogen.sh is good enough for Makefile.am changes
[1:19] <cmccabe> tv: and update Makefile.am
[1:19] <Tv> faster
[1:21] <cmccabe> tv: I don't think that works in this case, but I will verify
[1:27] <cmccabe> tv: nope, it doesn't work.
[1:27] <cmccabe> tv: you must use make distclean to clear the build products automake leaves around
[1:27] <Tv> bleh
[1:27] <cmccabe> tv: I sent out an email to the mailing list when I moved src/config.cc to src/common/config.cc
[1:27] <Tv> automake is so hideous
[1:27] <cmccabe> tv: I guess the memory must have faded
[1:28] <cmccabe> tv: yeah, automake is pretty bad
[1:28] <cmccabe> tv: but it's no good pointing that out without pointing out the non-hideous alternative: cmake.
[1:37] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Operation timed out)
[1:39] <Tv> actually, i'm not convinced ceph needs automake
[1:39] <Tv> autoconf, yes
[1:39] <Tv> but that would take a lot of effort to figure out
[1:42] <sagewk> fwiw i've been doing rm -r src/.deps and then ./do_autogen.sh -d 3
[1:42] <cmccabe> sagewk: if removing .deps works, that would be nifty
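To summarize the options floated above as a rough sketch (which one suffices depends on how invasive the change was):

    # lightest: drop stale dependency files, then regenerate and rebuild (sagewk's approach)
    rm -r src/.deps
    ./do_autogen.sh -d 3

    # for Makefile.am edits: regenerate the autotools output
    ./autogen.sh && ./configure && make

    # for moved/renamed source files: wipe all build products automake left around
    make distclean && ./autogen.sh && ./configure && make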
[2:00] <cmccabe> looks like Amazon EBS is GNBD (Global Network Block Device): http://openfoo.org/blog/amazon_ec2_underlying_architecture.html
[2:01] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:01] <cmccabe> and ultimately... SAN!
[2:03] <Tv> i've read that before, it has juicy bits in it but it's kinda misleading in many parts
[2:04] <Tv> e.g. he muses about instance storage being NFS-mounted -- it most definitely is not, the latency graphs tell you that
[2:04] <Tv> basically, even when networking makes EBS slow, local "instance storage" is still as fast
[2:05] <cmccabe> tv: yeah, I think AMIs are *not* mounted over NFS
[2:06] <cmccabe> tv: networking speeds do vary quite a bit on AWS, but local storage doesn't seem to experience that variance
[2:06] <cmccabe> tv: so whatever they're doing, most or all of the AMI eventually ends up next to the CPU
[2:07] <Tv> i'd be willing to argue that copying the ami to the local disk is probably the major reason why spinning up instances is so slow
[2:07] <Tv> it doesn't seem demand-paged in, it's fast from the start
[2:07] <Tv> but getting an instance going takes quite a while
[2:08] <cmccabe> tv: seems pretty reasonable. also explains the 15GB limit
[2:08] <Tv> and e.g. starting their elastic mapreduce is quite a bit faster -- those are probably pre-imaged
[2:09] <Tv> come kernel gitbuilder, complete a run...
[2:14] <Tv> s/come/come on/
[2:15] <Tv> it must be 5pm, i'm losing the ability to produce coherent sentences
[2:15] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:17] <Tv> yessss, it works
[2:25] <Tv> cmccabe: i wonder if 55ae580e0c7c823acd6fef63218dcc8b45b536fd is racy...
[2:26] <cmccabe> tv: in the presence of multiple rgw processes, I don't see how it could work
[2:27] <Tv> cmccabe: well it'd need an underlying pool delete op that only deletes empty pools
[2:27] <Tv> and that level can guarantee atomicity
[2:27] <cmccabe> tv: yeah
[2:28] <cmccabe> tv: that wouldn't be hard to add I think
[2:35] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Read error: Operation timed out)
[3:09] * cmccabe (~cmccabe@208.80.64.174) has left #ceph
[3:16] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[4:08] * damoxc (~damien@94-23-154-182.kimsufi.com) Quit (Ping timeout: 480 seconds)
[4:52] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (synthon.oftc.net charon.oftc.net)
[4:52] * frank_ (frank@november.openminds.be) Quit (synthon.oftc.net charon.oftc.net)
[4:52] * zoobab (zoobab@vic.ffii.org) Quit (synthon.oftc.net charon.oftc.net)
[4:52] * jeffhung_ (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (synthon.oftc.net charon.oftc.net)
[4:52] * DLange (~DLange@dlange.user.oftc.net) Quit (synthon.oftc.net charon.oftc.net)
[4:52] * jeffhung_ (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[4:52] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[4:52] * frank_ (frank@november.openminds.be) has joined #ceph
[4:52] * zoobab (zoobab@vic.ffii.org) has joined #ceph
[4:53] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[5:07] * zk (identsucks@whatit.is) has joined #ceph
[5:07] <zk> ceph looks pretty awesome. what sort of read/write performance could i expect with a small cluster?
[6:23] <sage> tv: it is racy. the question is, how hard is it to make it not racy, and is it worth it
[6:24] <sage> i think pretty hard
[7:56] * tuhl (~tuhl@p5089679E.dip.t-dialin.net) has joined #ceph
[7:57] <tuhl> how many nodes would you recommend for a minimal test setup?
[8:27] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:49] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[9:01] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[9:10] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:29] * tjikkun (~tjikkun@195-240-187-63.ip.telfort.nl) has joined #ceph
[9:36] <tuhl> what are the experiences with the latest version?
[9:44] * tuhlm (~mobile@p5089679E.dip.t-dialin.net) has joined #ceph
[10:01] * Yoric (~David@87-231-38-145.rev.numericable.fr) has joined #ceph
[10:02] * tuhl (~tuhl@p5089679E.dip.t-dialin.net) has left #ceph
[10:18] * lidongyang_ (~lidongyan@222.126.194.154) Quit (Read error: Connection reset by peer)
[10:44] * greghome (~greghome@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[10:55] * greghome (~greghome@cpe-76-170-84-245.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[11:20] * allsystemsarego (~allsystem@188.25.129.75) has joined #ceph
[11:39] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[13:04] * jantje (~jan@paranoid.nl) has joined #ceph
[13:04] * wido_ (~wido@fubar.widodh.nl) has joined #ceph
[13:05] * wonko_be_ (bernard@november.openminds.be) has joined #ceph
[13:05] * tjikkun (~tjikkun@195-240-187-63.ip.telfort.nl) Quit (reticulum.oftc.net kilo.oftc.net)
[13:05] * wonko_be (bernard@november.openminds.be) Quit (reticulum.oftc.net kilo.oftc.net)
[13:05] * andret (~andre@pcandre.nine.ch) Quit (reticulum.oftc.net kilo.oftc.net)
[13:05] * stingray (~stingray@stingr.net) Quit (reticulum.oftc.net kilo.oftc.net)
[13:05] * jantje_ (~jan@paranoid.nl) Quit (reticulum.oftc.net kilo.oftc.net)
[13:05] * wido (~wido@fubar.widodh.nl) Quit (reticulum.oftc.net kilo.oftc.net)
[13:16] * andret (~andre@pcandre.nine.ch) has joined #ceph
[13:16] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[13:16] * tuhlm (~mobile@p5089679E.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[13:22] * stingray (~stingray@stingr.net) has joined #ceph
[13:36] * johnl (~johnl@johnl.ipq.co) Quit (Remote host closed the connection)
[13:44] * johnl (~johnl@johnl.ipq.co) has joined #ceph
[13:55] <chraible> hi, can someone tell me what's the advantage of tcmalloc in ceph? Is it strongly needed for good performance?
[13:56] <chraible> I have the problem, that under RHEL6 64-Bit tcmalloc is not supported
[15:05] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:20] <stingray> chraible: just turn it off and try without ut
[15:20] <stingray> it
[15:20] <stingray> if you're using RHEL6 you're missing btrfs already, so no big deal
[15:38] <chraible> i've compiled my own kernel (i know, not really good for Enterprise Distributions ;) ) but it's needed for my bachelor thesis :D
[15:52] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[16:14] <Yulya> mds0 q.w.e.99:6800/22417 5 : [ERR] loaded dup inode 10000040ce3 [2,head] v7575 at /home/blah/blah/blah/libTAO_IDL_BE_la-be_visitor_structure.loT, but inode 10000040ce3.head v7762 already exists at /home/blah/blah/blah/libTAO_IDL_BE_la-be_visitor_structure.lo
[16:14] <Yulya> ??????
[16:14] <Yulya> hmm
[16:14] <Yulya> what does it mean?
[16:45] <stingray> chraible: if you compiled your own kernel then I guess you shouldn't care whether tcmalloc is supported there or not
[16:45] <stingray> chraible: just install google-perftools-devel rpm from fedora
[16:45] <stingray> it'll be grand
[17:01] <wido_> chraible: tcmalloc gives you much better memory usage
[17:02] <wido_> so you are able to run with less memory
[17:02] * wido_ is now known as wido
[17:33] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:36] * Yoric_ (~David@87-231-38-145.rev.numericable.fr) has joined #ceph
[17:36] * Yoric (~David@87-231-38-145.rev.numericable.fr) Quit (Read error: Connection reset by peer)
[17:36] * Yoric_ is now known as Yoric
[17:51] * alexxy (~alexxy@79.173.81.171) Quit (Remote host closed the connection)
[17:55] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[18:04] * zk (identsucks@whatit.is) Quit (Remote host closed the connection)
[18:05] * zk (identsucks@whatit.is) has joined #ceph
[18:13] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: zzZZZZzz)
[18:18] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:23] * joshd (~jdurgin@12.43.172.10) has joined #ceph
[18:35] <bchrisman> chraible: afaik, google-perf-tools rpms do not include x86_64 support. You will probably need to compile ceph starting with ./configure --without-tcmalloc
[18:35] <Tv> bchrisman: are the rpms special in that sense?
[18:36] <Tv> google-perf-tools upstream doesn't support x86_64 *for some operations*
[18:36] <Tv> tcmalloc is just fine
[18:36] <bchrisman> chraible: (or rather, there are no google-perf-tools rpms for rhel6 with the x86_64 arch)
[18:36] <bchrisman> yeah… it's not a tcmalloc issue
[18:36] <Tv> oh i think i remember this conversation, RH decided that because one part of the rpm is not supported, they'll disable the whole damn thing
[18:36] <bchrisman> yeah.. really annoying.
[18:37] <bchrisman> will have to either find a package or repackage something when it comes time for perf testing (hopefully soon)
[18:53] <Tv> yeah i don't know enough about the RH/Fedora/Centos ecosystem to suggest much.. on debian/ubuntu, someone could easily provide an unofficial repo with the packages.
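A sketch of the workaround bchrisman mentions, for hosts where no suitable google-perftools package is available:

    # build ceph without linking against tcmalloc (google-perftools)
    ./autogen.sh
    ./configure --without-tcmalloc
    make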
[18:54] <Tv> oh wow i found an ambiguous git sha, that's pretty rare
[18:55] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[18:55] <Tv> alright now kernel gitbuilder has the non-compiling -rc2 hidden from view, because it's not our fault..
[18:56] <Tv> still 86 warnings :(
[18:57] <Tv> i wonder if i can split the ceph compilation out, and ignore the warnings for the rest..
[19:00] <Tv> yup that seems to be the way to go
[19:01] <Tv> actually no, it'll hide errors :(
[19:01] <Tv> because we do touch the mainline kernel in our changes
[19:01] * Yoric (~David@87-231-38-145.rev.numericable.fr) Quit (Quit: Yoric)
[19:02] <Tv> yeah all i can do is filtering known false positives :(
[19:02] <Tv> -> ignoring for now
[19:08] <gregaf> zk: depends on the exact pattern, unfortunately -- I think most benchmarks I've seen put a real cluster at 2-10MB/s on small read workloads
[19:08] <gregaf> write is much faster due to buffering, assuming you don't sync after each one
[19:10] <zk> oh really
[19:10] <zk> thanks
[19:16] * fred_ (~fred@191-253.76-83.cust.bluewin.ch) has joined #ceph
[19:16] <fred_> hi
[19:16] <cmccabe> fred_: hi fred
[19:17] <stingray> bchrisman:
[19:17] <stingray> Installed:
[19:17] <stingray> google-perftools-devel.x86_64 0:1.7-1.fc14
[19:17] <stingray> Dependency Installed:
[19:17] <stingray> google-perftools.x86_64 0:1.7-1.fc14 libunwind.x86_64 0:0.99-0.13.20090430betagit4b8404d1.fc13
[19:19] <fred_> sagewk, are you around?
[19:20] <sagewk> fred_: here!
[19:21] <fred_> sagewk, great, about 1022. the cluster managed to restart using your patch. does it still make sense to send you additional debug as you requested ?
[19:22] <sagewk> yeah
[19:22] <sagewk> i'd like to reproduce the error if i can
[19:22] <fred_> ok, will do!
[19:23] <fred_> in the meantime, I got a crash in OSDMap::object_locator_to_pg will report soon
[19:24] <wido> My cluster is still acting weird; it seems it's not able to recover. I still don't want to mkcephfs it, since I think it could still give good insight into what could go wrong. The Atom CPUs seem to be a bit too slow when all OSDs are recovering.
[19:25] <wido> Right now I'm starting my OSDs up, one by one, but some OSDs are eating 100% CPU and when I try to kill them, they become a zombie
[19:25] <Tv> wido: that usually means the filesystem is the one being slow
[19:26] <stingray> wido: btrfs?
[19:26] <Tv> cpu at 100% is still odd
[19:26] * Yoric (~David@78.250.110.144) has joined #ceph
[19:26] <wido> Tv: Yes, that could be, but all the OSD's have their own 2TB disk with about 600G of data
[19:26] <wido> Yes, all btrfs
[19:26] <stingray> wido: and they are completely unkillable after that?
[19:26] <wido> I don't want to mkcephfs yet, do think it holds some valuable data to analyze
[19:27] <Tv> wido: you could use sysrq to get an idea of where the kernel execution is going, what code is responsible
[19:27] <Tv> 'l' - Shows a stack backtrace for all active CPUs.
[19:27] <Tv> that one
[19:28] <Tv> http://www.kernel.org/doc/Documentation/sysrq.txt
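For reference, a quick sketch of triggering that from a root shell, assuming the kernel was built with magic-sysrq support:

    # enable the sysrq interface, then dump a stack backtrace for all active CPUs
    echo 1 > /proc/sys/kernel/sysrq
    echo l > /proc/sysrq-trigger
    dmesg | tail -n 100   # the backtraces show up in the kernel log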
[19:28] <gregaf> recovery is the most intensive scenario for OSDs and we have seen issues with loaded OSDs hogging CPU -- there's a big difference in CPU usage between idle/working/recovering
[19:28] <gregaf> how many OSDs do you have on each of those Atoms again, wido?
[19:29] <wido> gregaf: 4 OSD's at the moment, recovery ops is set to two
[19:29] <wido> So yes, I do think the CPU is the bottleneck here, but even starting the OSD's one by one isn't possible
[19:29] <Tv> ohhh i just realized
[19:29] <gregaf> did you happen to see where the CPU time was going?
[19:29] <Tv> they might not be unkillable
[19:30] <Tv> you might just be sending them SIGTERM, and them being sluggish to exit
[19:30] <Tv> whether kill -9 kills them or not changes the diagnosis completely
[19:30] <Tv> (yes => osd code itself just eats lots of cpu; no => btrfs is messing up somehow)
[19:31] <wido> I just killed one OSD with the init script; it goes into a zombie "osd <defunct>" and keeps eating 100% CPU. The stack of the PID doesn't show anything special
[19:31] <Tv> you could also just strace one of them and if it does stuff in userspace, it's the first case
[19:31] <Tv> ok that puts the blame in kernelspace pretty definitely
[19:32] <wido> They actually do die after a few hours
[19:32] <Tv> wow
[19:32] <wido> But before that, they were eating 100%, even before I killed them
[19:32] <Tv> wido: your system performance is comparable to running an 8GB workload on a machine with 1GB of ram... it "works" but at glacial speeds
[19:32] <cmccabe> wido: what kernel version are you on
[19:32] <Tv> meeting time!
[19:32] <sjust> cmccabe: meeting time
[19:33] <wido> cmccabe: 2.6.38
[19:34] <cmccabe> wido: wow, that's pretty current
[19:34] <wido> Tv: The Atom's are just a test, wanted to see if they are up to the task
[19:34] <wido> but go to your meeting, we'll talk later
[19:47] <Tv> we're back
[19:48] <Tv> wido: yeah i have to say the behavior you're seeing is definitely undesirable, but it seems like the problem is in btrfs/kernel buffer management/your block drivers/something like that.
[19:49] <Tv> so autotest collects logs from a worker node three times :(
[19:50] <wido> Tv: Yes. Btw, I do understand the Atom might be a bit underpowered for the heavy recovery tasks, but still, there are some issues
[19:50] <wido> I've also seen my Monitors going OOM with 4G of RAM
[19:51] <Tv> wido: you mean a host running nothing but mon.* ?
[19:52] <wido> Tv: yes
[19:52] <Tv> that should definitely not happen
[19:52] <wido> OOM killer comes along and kills cmon
[19:52] <wido> Yeah, my exact thought, haven't been able to look at it
[19:52] <wido> Seems to happen when my OSD's are responding slowly
[19:53] <Tv> wido: how many mon instances on a single machine?
[19:53] <Tv> ah that might be realted
[19:53] <Tv> related
[19:53] <wido> Tv: only one monitor
[19:53] <Tv> perhaps a bug in the messaging layer
[19:53] <Tv> yeah cmon is normally ~130MB VSS, for that to grow to 4GB is just insane
[19:55] <wido> indeed, but I still have to take a look at it, make sure it's nothing else
[19:57] <wido> but, back to the OSD, the cosd process is still in Z, the stack doesn't show anything special. I've got another cosd process eating 100%, which I didn't kill yet, that stack doesn't show anything related to the filesystem either
[19:58] <Tv> wido: can you strace -p the one that's 100% cpu?
[19:58] <wido> Tv: sure, the stack is: http://pastebin.com/QLKiwiGS
[19:58] <Tv> Z means the process exited but its parent hasn't noticed
[19:59] <Tv> if that parent is not init, that might be a bug; if the parent is init, then it's just that the system is soo slow
[20:00] <Tv> (a bug in the parent, whatever that is)
[20:00] <wido> Oh, wait. I've got one in Z which is eating 100% CPU. Another which is also eating 100% CPU, but not in Z, since I didn't kill it yet
[20:00] <wido> when I'd kill that proces, it would become a zombie
[20:00] <wido> the stack here is from the non-Zombie process
[20:00] <Tv> zombies are not supposed to consume any cpu
[20:00] <Tv> that's.. odd
[20:01] <wido> No, it shouldn't, but this one is doing so. 100% CPU, 0% mem
[20:01] <wido> the strace of the other process stays at: "futex(0x1bc91cc, FUTEX_WAIT_PRIVATE, 1, NULL"
[20:02] <Tv> wido: can you strace -p please? snapshots of stack are not the same as seeing what operations it does; if it spends say 90% of time sitting on a lock, you might miss the things in between by only looking at the snapshots
[20:02] <Tv> wido: and there's a huge difference between "sitting in that same lock forever" and "locking, doing stuff, unlocking"
[20:03] <wido> Tv: the strace of the still running process doesn't show anything other than FUTEX_WAIT_PRIVATE
[20:03] <Tv> then it is stuck on a lock
[20:03] <wido> strace -p <pid> -o <output>
[20:04] <Tv> at least that one thread.. i don't actually know how to work strace well with threads :-/
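A hedged sketch of tracing a multi-threaded cosd with strace; depending on the strace version, -f may only pick up threads created after attaching, so attaching to individual thread IDs is the safe fallback:

    # attach and follow threads, logging syscalls to a file
    strace -f -p <pid> -o /tmp/cosd.strace

    # or enumerate the thread IDs and attach to one directly
    ls /proc/<pid>/task
    strace -p <tid>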
[20:04] <wido> If I would kill it, it would go into Z straight away and keep eating that much CPU
[20:04] <Tv> yeah that's really odd
[20:04] <wido> Right now, I have 23 OSD's up, but in the cluster 30 cosd processes are running
[20:04] <wido> those 7 are all doing the same thing, eating 100% CPU
[20:05] <Tv> wido: is there any logic to which 7 etc.. is it like, something in osd.13's journal makes it always do that
[20:05] <sagewk> tv: rbd showmapped or rbd show?
[20:05] <Tv> sagewk: are there other kinds of show?
[20:06] <sagewk> there's a list that lists images in a pool...
[20:06] <sagewk> or 'rbd mapped'?
[20:06] <Tv> sort of depends on what the output is
[20:06] <wido> Tv: Not that I could find, but those 7 processes are spread out over the hosts
[20:06] <sagewk> device pool name snap
[20:06] <sagewk> /dev/rbd0 rbd foo -
[20:07] <Tv> if it's just raw list of "major:minor cephmount", then i'd say mapped is good
[20:07] <Tv> show makes me think it should show all kinds of details about the item
[20:07] <sagewk> i can put major:minor in there too, tho nobody cares about those :)
[20:07] <Tv> sagewk: with udev, your /dev/rbd0 might not be how the user is accessing it at all
[20:08] <sagewk> yeah, but it will always be there (unless they have wonked-out rules), and we don't know what else udev may have made...
[20:08] * Yoric (~David@78.250.110.144) Quit (Quit: Yoric)
[20:09] <Tv> sagewk: if udev never hides the original dev, just adds more symlinks to it, then that's safe and will let you get at maj:min etc too
[20:10] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[20:10] <sagewk> worth including maj:min in there?
[20:11] <Tv> sagewk: not really if this is human-consumable
[20:11] <Tv> sagewk: i was thinking of machine parseable for a while, for some reason
[20:11] <sagewk> it could be.. it's just tab delimited
[20:12] <Tv> sagewk: honestly, i think i'd want a "rbd mapped /dev/my-funky-naming/*" that says "/dev/my-funky-naming/foo-bar-whatever mypool myname -"
[20:12] <Tv> sagewk: that way, you can have the names the user expects
[20:13] <Tv> sagewk: and not /dev/rbd0
[20:13] <Tv> sagewk: but this is splitting hairs
[20:13] <fred_> got the following crash: http://pastebin.com/RsXpDXnj will create a ticket tomorrow. could you tell me if you need more information, which I will then add to the ticket?
[20:13] <Tv> sagewk: and i haven't used rbd enough to have good feel for how it behaves
[20:13] <sagewk> fred_: p oid and p loc and 'ceph osdmap dump -o -'
[20:14] <fred_> ok, now or in ticket ?
[20:14] <sagewk> either :)
[20:15] <fred_> first two are in pastebin already
[20:17] <fred_> osdmap dump yields: unrecognized subsystem... WTH?
[20:23] <sagewk> fred_: oh! so it is.
[20:23] <sagewk> oops, it's 'ceph osd dump -o -'
[20:23] <stingray> git diff
[20:23] <stingray> oops
[20:23] <stingray> sorry
[20:25] <fred_> sagewk, http://pastebin.com/bC0yyvk0
[20:25] <sagewk> fred_: can you f 7 and p op->oloc
[20:28] <fred_> sagewk, $1 = {pool = 3, preferred = -1, key = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x8e4778 ""}}}
[20:37] <wido> Tv: I can't find any similarity between those OSDs. A strace on them shows they are all waiting on that FUTEX
[20:37] <Tv> wido: yeah i don't really have anything useful for you either :-/
[20:38] <Tv> wido: but please don't let this get buried, it sounds like you have found an actual, serious, bug that is just easier to trigger with the atom platform (different balance of operation speeds)
[20:39] <wido> Tv: that's why I didn't mkcephfs yet
[20:39] <wido> I'll dig up as much as I can tomorrow
[20:39] <Tv> perhaps gregaf has time to help you
[20:39] <wido> and create an issue for it, which can then be split up into several issues as we find what is actually going wrong
[20:39] <Tv> i'm nowhere near as knowledgeable about ceph's threading & locking
[20:40] <wido> I have the feeling gregaf is a bit busy, no problem at all!
[20:41] <wido> but my feeling is that, although the Atom lacks CPU power at some points, there is actually something wrong in the code; the recovery seems to bounce up and down and never completes
[20:41] <Tv> yeah
[20:41] <Tv> but the thing is, i've run clusters on pretty feeble virtual machines, and that was fine
[20:42] <cmccabe> wido: sam and josh are refactoring some of the OSD code
[20:42] <wido> cmccabe: Yes, I read it on the ml, so I'm waiting for that
[20:42] <cmccabe> wido: hopefully they will put some more thought into locking
[20:42] <Tv> lunch time
[20:42] <wido> but until then, I won't format my cluster
[20:43] <wido> there is about 4TB of data on it, really want to see if I'm able to get it running again
[20:43] <wido> seems like a good test scenario
[20:49] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) Quit (Remote host closed the connection)
[21:01] * fred_ (~fred@191-253.76-83.cust.bluewin.ch) Quit (Quit: Leaving)
[21:06] * tjikkun_ (~tjikkun@195-240-187-63.ip.telfort.nl) has joined #ceph
[21:07] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Read error: No route to host)
[21:07] * tjikkun_ (~tjikkun@195-240-187-63.ip.telfort.nl) Quit ()
[21:07] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[21:20] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:37] * joshd (~jdurgin@12.43.172.10) Quit (Ping timeout: 480 seconds)
[21:47] <wido> cmccabe: Tv: I'll try to gather as much as I can tomorrow and open an issue for it
[21:47] <cmccabe> wido: sounds good
[21:48] <wido> As I see now, a lot of OSD's are blocking, simply waiting on that FUTEX
[21:48] <cmccabe> wido: you probably should use --debug_ms=1
[21:48] <cmccabe> wido: and maybe debug osd = 20
[21:48] <wido> Yeah, that seems reasonable, but osd is a bit difficult
[21:48] <wido> it eats so much diskspace
[21:48] <cmccabe> wido: yeah, I understand
[21:48] <cmccabe> wido: how long does it take to get into this condition?
[21:49] <wido> cmccabe: Problem is, it seems to be triggered into it
[21:49] <cmccabe> it might be good to start with just 10 minutes worth of logs
[21:49] <cmccabe> just create a run where you start up, wait 10 minutes, and then shut down
[21:49] <cmccabe> and save just the log from that
[21:49] <wido> So when I start one OSD, a 'random' OSD starts to block, might be when a particular PG gets recovered
[21:50] <wido> cmccabe: Yes, that's what I've been doing, I really hoped the remote syslog would work better
[21:50] <cmccabe> wido: well, you can't expect to run with just one osd
[21:50] <cmccabe> wido: generally it's best to run all of them
[21:50] <wido> cmccabe: Damn my english :) What I mean: I've got 20 OSD's running
[21:50] <cmccabe> wido: since the bug you're looking at it is probably in the osd recovery code
[21:50] <wido> I start number 21, and number 8 then goes into the spin
[21:51] <cmccabe> wido: I see.
[21:51] <wido> I'm trying to get all 40 up and running again, but that seems mission impossible right now
[21:51] <wido> starting them all up is too much to ask of the atoms
[21:51] <cmccabe> wido: when they're in "the spin" no logs are produced?
[21:53] <cmccabe> wido: one thing you can try is adding "export CEPH_LOCKDEP=2" to /etc/init.d/ceph
[21:53] <cmccabe> wido: that will lead to a lot of debug output about mutex locks and unlocks
[21:53] <cmccabe> wido: actually, maybe try that before filing the bug, and see if there's any interesting output
[21:53] <cmccabe> wido: this problem seems a lot like a deadlock
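A sketch of cmccabe's suggestion, assuming the stock sysvinit setup; the exact daemon-name syntax accepted by the init script may differ:

    # near the top of /etc/init.d/ceph, so cosd logs mutex lock/unlock activity
    export CEPH_LOCKDEP=2

    # then restart the affected daemon and watch its log
    /etc/init.d/ceph restart osd.11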
[21:53] <wido> cmccabe: osd.11 is blocking right now, no logs from today
[21:54] <wido> has been blocking since the 22nd
[21:54] <wido> I'll try your suggestions, tnx!
[21:54] <wido> I'm afk for now, ttyl
[21:54] <cmccabe> wido: maybe we should implement something where you send a signal to the process, and it prints out a list of the locks it's holding
[21:55] <wido> cmccabe: Could be useful I assume?
[21:55] <wido> But I'll play around with it tomorrow, see what I can find
[21:55] <wido> really afk now
[21:55] <cmccabe> wido: yeah, I'm tempted to try that out. It should be trivial to implement
[21:55] <cmccabe> wido: bye!
[22:05] <Tv> cmccabe: 100% cpu consumption -> livelock not deadlock?
[22:06] <cmccabe> tv: hmm
[22:06] <Tv> and would the lock status thing be doable with gdb, with no code changes?
[22:06] <cmccabe> tv: there is something called an adaptive mutex, which is a spinlock that turns into a mutex
[22:06] <cmccabe> tv: but I don't think we are using that
[22:06] <Tv> yeah futexes spin a while but switch to sleeping
[22:06] <Tv> and it is doing futex
[22:07] <cmccabe> tv: oh, then we are.
[22:07] <Tv> his just stays pegged at 100%
[22:08] <cmccabe> sjust: just an FYI, it looks like wido found some kind of livelock or other bug in the recovery code
[22:08] <cmccabe> sjust: he's going to retrieve some logs and file a bug at some point
[22:09] <cmccabe> tv: I can't seem to find any way to get a list of the locks a thread holds from gdb
[22:09] <cmccabe> tv: except for manually looking at all those memory locations and hoping that I didn't miss any
[22:09] <Tv> cmccabe: there's nothing special about a lock that would make any central location hold a list of all known locks
[22:09] <cmccabe> tv: google search reveals nothing but a very unhelpful page where someone brags that solaris has a way to list all locks
[22:10] <Tv> cmccabe: that's true whether it's in your code or via gdb
[22:10] <Tv> futexes are nothing but shared memory
[22:11] <cmccabe> tv: I think most programmers would be happy to use a pthreads implementation where it kept a thread-local list of the addresses of locks taken
[22:12] <Tv> cmccabe: considering that locks can be attached to dynamically allocated objects, that sounds like a good way to slow things down
[22:12] <cmccabe> tv: what does dynamic allocation have to do with anything
[22:12] <cmccabe> tv: presumably you're not freeing the memory used by your lock while holding it
[22:13] <Tv> every lock & unlock would actually need to manipulate this list you desire
[22:13] <cmccabe> tv: apparently solaris did it that way
[22:13] <Tv> yes and it's called slowaris for a reason
[22:15] <cmccabe> tv: speed doesn't help if you're just driving off the cliff faster
[22:15] <cmccabe> tv: anyway, hopefully wido will uncover something with lockdep.
[22:15] <Tv> driving off the cliff slower is just as bad
[22:29] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[22:32] * joshd (~jdurgin@sccc-66-78-236-243.smartcity.com) has joined #ceph
[22:36] <Tv> cmccabe: do you know of any neat trick to merge to ceph.confs.. so that first one could contain "defaults" and the other one just does something like [osd] debug ms = 20 and then write out one that contains all the settings but with the [osd] debug ms = 20 in it
[22:36] <Tv> s/to/two/
[22:36] <cmccabe> tv: not sure quite what you're asking
[22:36] <cmccabe> tv: there is one thing you should know, which is that sections can be "re-opened"
[22:37] <cmccabe> tv: so you can just cat two configs together
[22:37] <Tv> yeah i'm worried that'll be non-intuitive for humans
[22:37] <Tv> but i might just go for that
[22:37] <cmccabe> tv: I haven't tested that very much, but that's the way I designed it
[22:37] <cmccabe> tv: you can mention the same section as many times as you want
[22:37] <cmccabe> tv: later values win
[22:38] <Tv> yeah that makes sense
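A sketch of that approach with hypothetical file names and paths; since sections can be re-opened and later values win, plain concatenation acts as an overlay:

    # defaults shared by every deployment (quoted heredoc keeps $id literal for ceph's own expansion)
    cat > defaults.conf <<'EOF'
    [osd]
            osd data = /data/osd.$id
    EOF

    # per-run overrides
    cat > overrides.conf <<'EOF'
    [osd]
            debug ms = 20
    EOF

    # later sections re-open earlier ones, so the override values win
    cat defaults.conf overrides.conf > ceph.conf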
[22:38] * allsystemsarego (~allsystem@188.25.129.75) Quit (Quit: Leaving)
[22:42] <cmccabe> yehudasa: do you have any insight into the relationship between user ID and DisplayName in S3?
[22:45] <yehudasa> cmccabe: the display name doesn't have to be unique I think
[22:45] <cmccabe> yehudasa: I'm having trouble understanding the point of having it in the ACL
[22:46] <yehudasa> cmccabe: just an extra info
[22:46] <cmccabe> yehudasa: it just seems to do nothing
[22:46] <cmccabe> yehudasa: the user ID is all-powerful
[22:46] <yehudasa> cmccabe: yes it is
[22:47] <yehudasa> but it saves extra operations when listing ACLs
[22:47] <cmccabe> yehudasa: why is that
[22:47] <cmccabe> yehudasa: you mean it saves you from having to look up the pretty name from the id?
[22:47] <Tv> cmccabe: i actually don't think AWS stores the displayname in the acl, i think they fetch it from the user db on the fly
[22:48] <Tv> assuming the displayname is changeable
[22:48] <yehudasa> cmccabe: yes
[22:48] <cmccabe> tv: I wonder if I even need to include the pretty name in the XML I generate
[22:48] <Tv> cmccabe: yes you have to
[22:48] <cmccabe> tv: or what happens if I get it wrong
[22:48] <Tv> everything is going to try to display that
[22:48] <cmccabe> tv: does the pretty name change for that user? :)
[22:48] <Tv> because the userids are 40 hexdigits, nobody can identify those
[22:49] <cmccabe> tv: my point is that when I tell AWS to do something, it only cares about the user id
[22:49] <cmccabe> tv: it already knows what pretty name goes with that user id
[22:49] <yehudasa> cmccabe: it only cares about the user id
[22:49] <Tv> cmccabe: displayname is for humans
[22:49] <cmccabe> tv: right, but obsync talks to S3, not to humans
[22:49] <yehudasa> if you give it a wrong name it overrides it
[22:50] <Tv> cmccabe: yeah i don't expect *writing* the displayname in an acl to do anything useful
[22:50] <cmccabe> anyway, doesn't matter, I'll just try to pass it through unchanged
[22:50] <cmccabe> weird design though
[22:50] <Tv> cmccabe: hence me saying i don't think AWS really stores displaynames in acl
[22:50] <Tv> cmccabe: it makes sense for the things displaying the acls
[22:50] <cmccabe> I suppose it cuts down on the server traffic
[22:50] <Tv> cmccabe: because the usernames are completely nonsensical to humans
[22:50] <cmccabe> it's a form of denormalization
[22:50] <cmccabe> or something like that
[22:51] <Tv> you could force every acl-using thing to fetch them, yes
[22:51] <Tv> they decided not to
[22:51] <Tv> *shrug*
[22:51] <Tv> (fetch displaynames separately, i mean)
[22:51] <cmccabe> yeah
[23:00] * joshd (~jdurgin@sccc-66-78-236-243.smartcity.com) Quit (Ping timeout: 480 seconds)
[23:02] <cmccabe> tv: isn't there some way to just transform XML into a python struct
[23:02] <Tv> cmccabe: python has no structs ;)
[23:02] <cmccabe> tv: I feel like all this silliness with callback functions could be avoided by just looking at it like JSON
[23:02] <Tv> oh don't use a SAX-style parser unless you need streaming
[23:02] <Tv> lxml.etree is decent
[23:03] <cmccabe> tv: looks like etree has some of the methods I'd expect
[23:03] <cmccabe> tv: however, tags aren't guaranteed to be unique, which I guess makes it hard to have a really 1:1 mapping between XML and nested classes
[23:04] <Tv> that's why you access it as a tree
[23:05] <Tv> lxml.etree and xpath for element/attribute access is a decent combo
[23:05] <Tv> pretty much best you can do with the monster that is xml
[23:05] <cmccabe> tv: looks pretty reasonable
[23:30] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.