#ceph IRC Log


IRC Log for 2011-01-11

Timestamps are in GMT/BST.

[0:11] <johnl_> hey gregorg_taf, you about?
[0:21] <Tv|work> alright.. there's a branch "gtest" in the git repo, anyone brave enough should take a look at it
[0:21] <Tv|work> cmccabe: ^
[0:21] <cmccabe> tv: great, will take a look!
[0:22] * gnp421 (~hutchint@c-75-71-83-44.hsd1.co.comcast.net) has joined #ceph
[0:30] * tranceConscious (~tranceCon@ppp079166119202.dsl.hol.gr) has joined #ceph
[0:31] <tranceConscious> hello
[0:31] <tranceConscious> anyone here?
[0:31] <tranceConscious> when I run this...
[0:31] <tranceConscious> mkcephfs -c /etc/ceph/ceph.conf --allhosts -v
[0:32] <tranceConscious> as instructed by this...
[0:32] <tranceConscious> http://ceph.newdream.net/wiki/Installing_on_Debian
[0:33] <tranceConscious> I get this...
[0:33] <tranceConscious> mkcephfs requires '-k /path/to/admin/keyring'. default location is /etc/ceph/keyring.bin.
[0:33] <tranceConscious> usage: /sbin/mkcephfs -c ceph.conf [--allhosts] [--mkbtrfs] [-k adminkeyring]
[0:33] <tranceConscious> ** be careful, this WILL clobber old data; check your ceph.conf carefully **
[0:33] <tranceConscious> billy@node0:~$
[0:33] <cmccabe> tranceConscious: I think you might need a -k argument
[0:34] <tranceConscious> -k pointing to what file?
[0:34] <cmccabe> a keyring file that you have generated
[0:35] <tranceConscious> well I'm going by the wiki and it doesn't say anything about generating keyrings...
[0:35] <tranceConscious> when did I generate the keyring?
[0:35] <cmccabe> tranceConscious: yeah, I believe it needs updating
[0:35] <cmccabe> tranceConscious: you might try running vstart.sh -d -n to get a quick start
[0:36] <tranceConscious> where do I find vstart.sh?
[0:36] <cmccabe> ./src/vstart.sh
[0:37] <tranceConscious> woa
[0:38] <bchrisman> I think the -k flag will create that key
[0:38] <bchrisman> (calling mkcephfs)
[0:38] <bchrisman> unless I'm mistake.
[0:38] <bchrisman> err mistaken?
[0:38] <tranceConscious> sorry for flooding but check this out
[0:38] <bchrisman> that's creating a key for the filesystem while it's making it… I believe? then you'll need to reference that key in the mount.
[0:39] <tranceConscious> http://paste.ubuntu.com/552632/
[0:39] <tranceConscious> I run that script
[0:39] <cmccabe> bchrisman: no, you need to use ./cauthtool to create a key
[0:40] <cmccabe> tranceConscious: do mkdir -p dev/mon.a dev/mon.b dev/mon.c
[0:41] <cmccabe> tranceConscious: also there is something weird where it is not finding your ceph.conf; not sure what that is about
[0:42] <tranceConscious> I think it's looking in the current dir for ceph.conf
[0:42] <tranceConscious> I have it on /etc/cept
[0:42] <cmccabe> vstart.sh should generate the conf
[0:42] <tranceConscious> I have it on /etc/ceph
[0:42] <cmccabe> so vstart.sh will generate the conf unless you pass -k
[0:42] <cmccabe> in which case it uses the current one
[0:42] <cmccabe> in your src directory
[0:43] <tranceConscious> when you say do mkdir dev/mon.a you mean /dev/mon.a?
[0:43] <cmccabe> no, vstart does everything relative to the src dir
[0:43] <cmccabe> it's... kind of a hack
[0:44] <cmccabe> but can be useful for testing
[0:45] <bchrisman> I was using: mkcephfs -c /etc/ceph/ceph.conf -a --mkbtrfs -k /etc/ceph/keyring.bin and then cauthtool --print-key /etc/ceph/keyring.bin > /etc/ceph/filesystem.key
[0:45] <bchrisman> would've sworn I had that working from a clean install
[0:45] <tranceConscious> http://paste.ubuntu.com/552634/
[0:46] <cmccabe> bchrisman: I'm not deeply familiar with the auth stuff and what program needs what credentials
[0:46] <cmccabe> bchrisman: I do know that vstart.sh invokes cauthtool first, before cmon --mkfs
[0:47] <cmccabe> bchrisman: actually, looks like authentication is optional?
[0:47] <tranceConscious> what does it mean it can't find ceph.conf?
[0:47] <cmccabe> bchrisman: we ought to document that a little bit better
[0:48] <cmccabe> tranceConscious: I don't think that's anything to worry about
[0:48] <tranceConscious> so what do I do now?
[0:48] <cmccabe> tranceConscious: it happens because vstart.sh just invokes "init-ceph stop" before writing out the conf
[0:48] <tranceConscious> what does vstart actually do?
[0:48] <cmccabe> tranceConscious: it's a little script for testing, intended to make things easier
[0:48] <cmccabe> it doesn't always succeed apparently
[0:49] <tranceConscious> oh...
[0:49] <tranceConscious> so what do I do now?
[0:49] <cmccabe> so do you have a ceph.conf in the src dir?
[0:49] <tranceConscious> no, I have it where the wiki script put it, on /etc/ceph/ceph.conf
[0:49] <tranceConscious> shall I copy over?
[0:50] <yehudasa> tranceConscious: have you compiled from source, or are you running a precompiled package?
[0:50] <tranceConscious> I did the git thingy
[0:50] <tranceConscious> enabled the rc
[0:50] <tranceConscious> the dpkg-buildpackage
[0:50] <tranceConscious> and then installed the debs
[0:51] <yehudasa> in any case the default ceph.conf doesn't use the auth stuff at all
[0:52] <yehudasa> do you have a 'supported auth' line in your ceph.conf?
[0:52] <tranceConscious> there is a ceph.conf inside the src directory, but that's not the one I created by copying and pasting from the wiki page
[0:52] <yehudasa> oh, I see
[0:52] * eternaleye_ is now known as eternaleye
[0:52] <tranceConscious> I was going by the wiki
[0:52] <tranceConscious> and when it failed I said to visit
[0:53] <tranceConscious> I was actually following this...
[0:53] <tranceConscious> http://ceph.newdream.net/wiki/Installing_on_Debian
[0:53] <tranceConscious> exactly
[0:53] <tranceConscious> but the mkcephfs command gives me error
[0:54] <yehudasa> for the vstart that you were running, you did it with a sudo, and I assume that was the problem
[0:55] <cmccabe> yehudasa: good catch. Running vstart with sudo is not a good idea
[0:55] <tranceConscious> well I tried again without the sudo...
[0:55] <cmccabe> yehudasa: still confused about how exactly that led to the missing ceph.conf, but that script is kind of fugly
[0:56] <cmccabe> tranceConscious: first do chown -R <self> in the src dir
[0:56] <yehudasa> and for the mkcephfs, I assume there's a 'supported auth' line in your ceph.conf; if you remove it you should be fine for now
[0:57] <yehudasa> we should document how to enable auth later on
[0:57] <tranceConscious> that's the funny part
[0:57] <tranceConscious> there are no auth lines in that conf I created, look at wiki page
[0:58] <tranceConscious> that's what confused me in the first place
[0:58] <tranceConscious> and I still get no luck with vstart
[0:58] <tranceConscious> http://paste.ubuntu.com/552638/
[0:59] <yehudasa> can you run 'cconf -c ceph.conf "auth supported"'?
[0:59] <tranceConscious> yes
[1:00] <tranceConscious> but where should I run it and as what user?
[1:00] <yehudasa> just run it as billy
[1:00] <yehudasa> ceph.conf should point at your ceph.conf
[1:00] <yehudasa> there may be some issue with vstart not able to find the ip of your host
[1:01] <tranceConscious> I have my hosts file fixed and my ips fixed too
[1:01] <tranceConscious> billy@node0:~/ceph/src$ cconf -c ceph.conf "auth supported"
[1:01] <tranceConscious> none
[1:01] <tranceConscious> billy@node0:~/ceph/src$
[1:01] <yehudasa> can you run 'host node0'?
[1:02] <tranceConscious> billy@node0:~/ceph/src$ host node0
[1:02] <tranceConscious> Host node0 not found: 3(NXDOMAIN)
[1:02] <tranceConscious> billy@node0:~/ceph/src$ ping node0
[1:02] <tranceConscious> PING node0.musichaos ( 56(84) bytes of data.
[1:02] <tranceConscious> 64 bytes from node0.musichaos ( icmp_req=1 ttl=64 time=0.591 ms
[1:02] <tranceConscious> 64 bytes from node0.musichaos ( icmp_req=2 ttl=64 time=0.048 ms
[1:02] <tranceConscious> 64 bytes from node0.musichaos ( icmp_req=3 ttl=64 time=0.035 ms
[1:02] <tranceConscious> 64 bytes from node0.musichaos ( icmp_req=4 ttl=64 time=0.040 ms
[1:02] <tranceConscious> 64 bytes from node0.musichaos ( icmp_req=5 ttl=64 time=0.039 ms
[1:02] <tranceConscious> ^C
[1:02] <tranceConscious> --- node0.musichaos ping statistics ---
[1:02] <tranceConscious> 5 packets transmitted, 5 received, 0% packet loss, time 3998ms
[1:02] <tranceConscious> rtt min/avg/max/mdev = 0.035/0.150/0.591/0.220 ms
[1:02] <tranceConscious> billy@node0:~/ceph/src$
[1:03] <yehudasa> maybe your /etc/resolv.conf needs to be modified?
[1:03] <tranceConscious> ???
[1:03] <tranceConscious> what for? ping works fine?
[1:04] <yehudasa> the vstart.sh script assumes it can find the ip address by using the host utility
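A note on why `ping node0` succeeds while `host node0` returns NXDOMAIN: `ping` resolves names through the system resolver, which normally consults /etc/hosts, whereas the `host` utility sends a DNS query to the configured nameserver and does not read /etc/hosts. The Python sketch below is illustrative only and is not part of vstart.sh; the function name `resolve_like_ping` is made up for this example. It performs the resolver-style lookup, so it can succeed for a name that exists only in /etc/hosts.

```python
import socket


def resolve_like_ping(name):
    """Resolve a name through the system resolver (getaddrinfo).

    This path normally honours /etc/hosts, so it can succeed for a
    name that no DNS server knows about.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(resolve_like_ping("localhost"))
```

If this lookup returns an address for a hostname but `host` does not, the name is most likely coming from /etc/hosts rather than from DNS.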
[1:04] <tranceConscious> damn
[1:05] <tranceConscious> well what do I do to resolv conf then?
[1:06] <tranceConscious> nameserver
[1:06] <tranceConscious> domain musichaos
[1:06] <tranceConscious> search musichaos
[1:06] <tranceConscious> ~
[1:06] <tranceConscious> this is what I have
[1:06] <tranceConscious> and that's my router's address
[1:06] <yehudasa> so you have a local dns server?
[1:06] <tranceConscious> no
[1:06] <tranceConscious> that's just my router
[1:06] <yehudasa> hmm, I see
[1:07] <tranceConscious> I'm also setting it up tomorrow at work where there is a dns
[1:07] <tranceConscious> but how do I proceed now?
[1:07] <yehudasa> we can modify vstart.sh
[1:08] <yehudasa> oh, actually, are you running everything locally?
[1:09] <tranceConscious> what do you mean?
[1:09] <tranceConscious> I'm on a macpro and running two vmware vm's with ubuntu 10.10 and ceph
[1:10] <yehudasa> just for testing everything, you can run the vstart.sh with '-l'
[1:11] <tranceConscious> http://paste.ubuntu.com/552642/
[1:12] <yehudasa> yeah, you should run it with '-n' also for the first time (to create a new filesystem)
[1:12] <tranceConscious> well I did, -d -n
[1:12] <yehudasa> -n -l
[1:13] <tranceConscious> woa...
[1:13] <tranceConscious> http://paste.ubuntu.com/552645/
[1:14] <yehudasa> now run ./ceph -s
[1:16] <tranceConscious> 2011-01-11 02:15:50.508013 7f0f3f36b900 -- :/3600 messenger.start
[1:16] <tranceConscious> 2011-01-11 02:15:50.515009 7f0f3f36b900 -- :/3600 --> mon0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1d6e710
[1:16] <tranceConscious> 2011-01-11 02:15:53.514993 7f0f35636700 -- :/3600 mark_down -- 0x1d729b0
[1:16] <tranceConscious> 2011-01-11 02:15:53.515323 7f0f35636700 -- :/3600 --> mon1 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1d71560
[1:16] <tranceConscious> 2011-01-11 02:15:53.562094 7f0f34e35700 -- :/3600 >> pipe(0x1d74010 sd=4 pgs=0 cs=0 l=0).fault first fault
[1:16] <tranceConscious> 2011-01-11 02:15:56.515833 7f0f35636700 -- :/3600 mark_down -- 0x1d74010
[1:16] <tranceConscious> 2011-01-11 02:15:56.517539 7f0f35636700 -- :/3600 --> mon0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1d747e0
[1:16] <tranceConscious> 2011-01-11 02:15:59.518176 7f0f35636700 -- :/3600 mark_down -- 0x1d749f0
[1:16] <tranceConscious> 2011-01-11 02:15:59.518541 7f0f35636700 -- :/3600 --> mon1 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1d74540
[1:16] <tranceConscious> 2011-01-11 02:15:59.568137 7f0f34c33700 -- :/3600 >> pipe(0x1d76050 sd=6 pgs=0 cs=0 l=0).fault first fault
[1:16] <tranceConscious> 2011-01-11 02:16:02.519083 7f0f35636700 -- :/3600 mark_down -- 0x1d76050
[1:16] <tranceConscious> 2011-01-11 02:16:02.519563 7f0f35636700 -- :/3600 --> mon0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1d76800
[1:16] <tranceConscious> 2011-01-11 02:16:05.520038 7f0f35636700 -- :/3600 mark_down -- 0x1d76a10
[1:16] <tranceConscious> 2011-01-11 02:16:05.520439 7f0f35636700 -- :/3600 --> mon1 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1d7a050
[1:16] <tranceConscious> 2011-01-11 02:16:05.566200 7f0f34a31700 -- :/3600 >> pipe(0x1d7a260 sd=8 pgs=0 cs=0 l=0).fault first fault
[1:16] <tranceConscious> 2011-01-11 02:16:08.520917 7f0f35636700 -- :/3600 mark_down -- 0x1d7a260
[1:16] <tranceConscious> 2011-01-11 02:16:08.521196 7f0f35636700 -- :/3600 --> mon0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1d7a9f0
[1:16] <tranceConscious> 2011-01-11 02:16:11.521716 7f0f35636700 -- :/3600 mark_down -- 0x1d7ac00
[1:16] <tranceConscious> 2011-01-11 02:16:11.522055 7f0f35636700 -- :/3600 --> mon1 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1d7a790
[1:16] <tranceConscious> 2011-01-11 02:16:11.567728 7f0f3482f700 -- :/3600 >> pipe(0x1d7e050 sd=10 pgs=0 cs=0 l=0).fault first fault
[1:16] <tranceConscious> shit
[1:16] <tranceConscious> sorry
[1:19] <yehudasa> maybe try ./ceph -c ./ceph.conf -s
[1:21] <tranceConscious> billy@node0:~/ceph/src$ ./ceph -c ./ceph.conf -s
[1:21] <tranceConscious> 2011-01-11 02:21:21.639202 pg v2: 24 pgs: 24 creating; 0 KB data, 0 KB used, 0 KB / 0 KB avail
[1:21] <tranceConscious> 2011-01-11 02:21:21.639773 mds e8: 3/3/3 up {0=up:creating,1=up:creating,2=up:creating}
[1:21] <tranceConscious> 2011-01-11 02:21:21.639883 osd e1: 0 osds: 0 up, 0 in
[1:21] <tranceConscious> 2011-01-11 02:21:21.640008 log 2011-01-11 02:20:37.791821 mon0 5 : [INF] mds? up:boot
[1:21] <tranceConscious> 2011-01-11 02:21:21.640090 class rbd (v1.3 [x86-64])
[1:21] <tranceConscious> 2011-01-11 02:21:21.640139 mon e1: 3 mons at {a=,b=,c=}
[1:21] <tranceConscious> billy@node0:~/ceph/src$
[1:22] <yehudasa> hmm.. you don't have any osds up
[1:22] <yehudasa> mkdir dev
[1:22] <yehudasa> cd dev
[1:22] <yehudasa> mkdir osd0
[1:23] <yehudasa> then go to ceph/src again, ./stop.sh, vstart.sh -n -l again
[1:23] <yehudasa> dev/osd0 should point to your osd partition
[1:25] <tranceConscious> http://paste.ubuntu.com/552647/
[1:27] <yehudasa> you're not mounted with user_xattr..
[1:27] <yehudasa> do you have a btrfs partition you want the osd to use?
[1:28] <tranceConscious> no
[1:28] <yehudasa> are you using ext3/4?
[1:28] <tranceConscious> well default install of ubuntu64 on 20gig disk
[1:28] <tranceConscious> on both node0 and node1
[1:28] <yehudasa> sudo mount -oremount,user_xattr /
[1:29] <yehudasa> or wherever your home directory is mounted
[1:29] <tranceConscious> and was planning to use the demo ceph.conf to understand the deal and then add disks to the machines and do the btrfs partitions on them
[1:37] * tranceConscious (~tranceCon@ppp079166119202.dsl.hol.gr) Quit (Quit: tranceConscious has no reason)
[1:41] <Tv|work> FYI office dwellers: i installed ccache on flak
[1:42] <yehudasa> tv: I think the idea was that those 5 servers have the same compilation environment so that we can distcc over them
[1:43] <Tv|work> yehudasa: sorry if this is a dumb question but how do i access them all?
[1:44] <Tv|work> i don't see any obvious dns names, like flak2 or such
[1:44] <yehudasa> flak, vit, swab, slider and kai
[1:44] <Tv|work> ahh thanks first time i hear that list
[1:44] <Tv|work> i'll put ccache on the rest
[1:44] <yehudasa> ccache is ok, however, can you make it that it doesn't use it by default?
[1:44] <Tv|work> it's not used by default
[1:44] <yehudasa> ok, great
[1:50] <Tv|work> btw, in case you guys aren't familiar with these tools, anyone at the office with a serious love for speed, install ccache & distcc locally and use something like: DISTCC_HOSTS='localhost/2 @flak/16 @vit/16 @swab/16 @slider/16 @kai/16' make check CXX='ccache distcc g++' -j100
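As a rough reading of the DISTCC_HOSTS value above: each entry has the form HOST/LIMIT, where LIMIT caps the number of jobs sent to that host, and the leading @ selects distcc's SSH mode. The Python sketch below simply adds up those limits; `total_slots` is a hypothetical helper written for this note, not a distcc tool.

```python
def total_slots(distcc_hosts):
    """Sum the /LIMIT job limits in a DISTCC_HOSTS-style string.

    Assumes every entry carries an explicit /LIMIT, as in the line
    above; entries without one are skipped rather than guessed at.
    """
    total = 0
    for entry in distcc_hosts.split():
        _host, sep, limit = entry.partition("/")
        if sep and limit.isdigit():
            total += int(limit)
    return total


hosts = "localhost/2 @flak/16 @vit/16 @swab/16 @slider/16 @kai/16"
print(total_slots(hosts))  # 82
```

The limits add up to 82, so `-j100` asks make for somewhat more parallel jobs than the advertised slots.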
[1:51] <cmccabe> tv: ccache is a great tool
[1:51] <cmccabe> tv: I am confused about what those servers are "normally" for
[1:51] <Tv|work> cmccabe: sage said "dev"
[1:51] <Tv|work> cmccabe: as in, feel free to log in & compile things
[1:51] <cmccabe> tv: I'd like to have one for system testing
[1:52] <Tv|work> cmccabe: not from that pool
[1:52] <Tv|work> cmccabe: as in, those machines are not supposed to run ceph, crash, etc
[1:52] <cmccabe> tv: interesting
[1:52] <cmccabe> tv: they really are purely for compiling?
[1:52] <Tv|work> cmccabe: there's literally dozens of other machines
[1:53] <Tv|work> cmccabe: i'll work some sense into the system, just give me some time ;)
[1:53] <cmccabe> tv: ok... compile cluster it is then...
[2:01] <yehudasa> tv, cmccabe: afaik, those machines are both for compiling and testing
[2:02] <Tv|work> yehudasa: yeah, but not "boot my custom kernel" kind of testing, and we need to be able to share them nicely
[2:02] <yehudasa> no.. for server side testing
[2:02] <Tv|work> share = need to figure out what to do about ports
[2:02] <Tv|work> fixed ports are bad for multiple people on the same host
[2:02] <yehudasa> yeah.. actually it was a server per developer for those
[2:03] <Tv|work> ah, assign hosts, sure that'll work as long as the pile of hardware is high enough
[2:03] <cmccabe> we should dedicate a machine or two to just compiling
[2:03] <cmccabe> but I would advise against running distcc on all of them if we're also doing server testing there
[2:03] <Tv|work> i don't see a real reason why we shouldn't use all of them for distcc
[2:04] <gnp421> why not use KVM, and setup a ton of virtual machines?
[2:04] <Tv|work> i mean, you're not benchmarking
[2:04] <cmccabe> in my experience, the spikes in load/memory/cpu consumption can make the results of testing unpredictable
[2:04] <Tv|work> gnp421: in the plans also ;)
[2:04] <Tv|work> cmccabe: that's a bad test, then :(
[2:04] <cmccabe> performance tests are not always bad
[2:04] <Tv|work> cmccabe: i hate "rainy day" tests
[2:04] <gnp421> Tv|work: awesome
[2:04] <cmccabe> in fact they're going to be more and more required in the project
[2:04] <Tv|work> cmccabe: for performance tests, you should get dedicated hardware from a different pile
[2:04] <Tv|work> cmccabe: we have that pile already
[2:05] <Tv|work> literally, based on what sage told me, we have ~70 machines to use
[2:05] <cmccabe> I've had some frustrating scenarios where people logged into flab, ran huge memory-hogging processes, and generally made my ssh experience unbearable
[2:06] <cmccabe> I think it's great to have "distcc machines"... but you won't catch me logging in to any of them
[2:06] <DJL> hey guys just for a quick interruption, in few hours later, im going to post some of my consolidated ceph questions in the mailinglist , so thanks in advance :)
[2:06] <DJL> be warned :p !
[2:07] <Tv|work> accordingly, i just named my dsh configuration for the flak/vit/etc pool "flaky" ;)
[2:07] <yehudasa> DJL: feel free to ask anything! that is.. ceph related
[2:07] <Tv|work> s/accordingly/appropriately/
[2:07] <cmccabe> heh
[2:07] <DJL> ive had like 5 pages initially (including outputs, etc) but they are kind of cut-down , eheh
[2:08] <DJL> i also have huge plots of performance measurements, but will cut down to some specific ones too, oh btw does the mailing list accept pdf attachments?
[2:09] <yehudasa> DJL: not really.. you should send a link instead
[2:09] <DJL> ok, thx!
[2:09] <Tv|work> fwiw the "flaky" machines are completely idle except for the autobuilder i threw in there.. i rather have them be everyone's distcc and used than sacred and not used ;)
[2:10] <yehudasa> tv: sage usually uses flak, greg uses kai
[2:10] <Tv|work> i'm gonna work towards a "cloud" machines you can allocate for yourself, when you need dedicated boxes
[2:10] <cmccabe> tv: my cluster manager should take care of that
[2:10] <yehudasa> joshd and sam also use one each
[2:11] <Tv|work> yehudasa: well i put the autobuilder on flak right now, so it's good that i'll annoy sage first -- things will get resolved faster that way ;)
[2:11] <yehudasa> heh :)
[2:11] <cmccabe> tv: I'll merge in the cluster manager at the end of the week, can you hold off on the cloud machine stuff until then?
[2:11] <Tv|work> Tv|work: all i have now is libvirt-using scripts
[2:11] <Tv|work> kinda different
[2:11] <Tv|work> err, why do i talk to myself
[2:11] <cmccabe> tv: yeah, let's integrate our efforts on friday
[2:12] <cmccabe> tv: I see that you checked in the gtest project in the gtest branch
[2:12] <Tv|work> yeah have a go at that thing
[2:12] <cmccabe> tv: should this be an external dependency, or does it have to be part of our tree?
[2:12] <Tv|work> cmccabe: they recommend everyone build it for themselves
[2:12] <cmccabe> there's a general aversion to bundled libraries in linux-land
[2:12] <cmccabe> and I see that apt-get install libgtest0 works quite well
[2:14] <cmccabe> anyway, I will try to convert some unit tests to the new framework
[2:14] <cmccabe> but before we merge I'm sure sage will want to get rid of the bundled code
[2:16] <Tv|work> basically, there are cases where a single pre-built lib will behave wrong, and they recommend that every project using gtest compiles it themselves
[2:17] <Tv|work> i am well aware of the disdain for bundling, but in this case using a pre-built copy is explicitly recommended against
[2:17] <cmccabe> can you show me this recommendation?
[2:18] <Tv|work> not sure where it was, either in their wiki or source tree :-/
[2:18] <cmccabe> also, if you want to include their project in ours, ours will have to be dual licensed. And you should preserve their git history when importing.
[2:18] <cmccabe> I've used this library before in the normal way, and no bad effects occurred.
[2:19] <Tv|work> http://code.google.com/p/googletest/wiki/FAQ#Why_is_it_not_recommended_to_install_a_pre-compiled_copy_of_Goog
[2:20] <Tv|work> no dual licensing needed, LGPL can subsume their NewBSD perfectly fine
[2:20] <Tv|work> ceph as a whole will still be LGPL
[2:20] <cmccabe> I don't think this will be an issue as long as libgtest-dev is compiled with the same flags as libgtest0
[2:20] <Tv|work> and i don't intend to track their whole history; we can just slurp in their releases when needed
[2:20] <Tv|work> cmccabe: that doesn't seem to solve the #if problem
[2:21] <Tv|work> cmccabe: remember, gtest is heavily based on headers
[2:21] <Tv|work> i can make it use a system install of gtest just fine, i'm just going based on recommendations
[2:22] <Tv|work> i really don't want to have to debug the unit test framework
[2:22] <cmccabe> what a mess
[2:23] <cmccabe> doesn't gtest export the flags it was compiled with somewhere?
[2:23] <cmccabe> maybe we could just slurp those and use them in our test compile.
[2:23] <Tv|work> there's no "test compile"
[2:23] <Tv|work> you need to test your actual product
[2:23] <cmccabe> well, I assume that tests will include gtest.h etc.
[2:24] <cmccabe> but probably not so much the other files
[2:24] <Tv|work> and link against libceph or something
[2:24] <Tv|work> not sure how much *those* can safely differ, either
[2:25] <cmccabe> I'm starting to have doubts about this library
[2:25] <Tv|work> yet it seems to be the best there is for c/c++ ;)
[2:27] <cmccabe> perhaps they should have made it header-file only
[2:27] <cmccabe> C++ in general has a lot of issues with ABI, it seems like this is just an especially bad case
[2:34] <cmccabe> well, it looks like gtest is 3-clause BSD
[2:35] <cmccabe> so as I understand it, that's not incompatible with LGPL
[2:35] <Tv|work> as I said 15 minutes ago..
[2:37] <cmccabe> tv: that is true. It's frustrating to have to do stuff like this, though
[2:38] <Tv|work> well, i tend to think of more like, it's already done now, let's move on & start improving the code quality
[2:48] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:07] <Tv|work> i'm not exactly sure i understand why you guys are testing infinite recursions
[3:07] <cmccabe> tv: ignore that, it was just something I was experimenting with
[3:08] <cmccabe> tv: I was hoping to create a signal handler that could print a sensible message in that case by using an alternative signal stack
[3:08] <Tv|work> cmccabe: that's a bit complex..
[3:08] <cmccabe> tv: not really, it's just a few POSIX calls
[3:08] <Tv|work> (as in, use core files & shove it in another process to contain its bugs)
[3:09] <cmccabe> tv: man sigaltstack
[3:09] <cmccabe> tv: however I didn't finish working out some glitches, so currently infinite recursion doesn't generate a nice backtrace in the logs (it just goes splat and you get no logs)
[3:13] <Tv|work> alright time to head home; tomorrow, try to write unit tests for some of the low-level things like osdmaps
[3:13] <cmccabe> good night
[3:21] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[3:23] * cmccabe (~cmccabe@ has left #ceph
[3:24] * NoahWatkins (~NoahWatki@soenat3.cse.ucsc.edu) Quit (Remote host closed the connection)
[6:16] * gnp421_ (~hutchint@c-75-71-83-44.hsd1.co.comcast.net) has joined #ceph
[6:24] * gnp421 (~hutchint@c-75-71-83-44.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[6:26] * gnp421_ (~hutchint@c-75-71-83-44.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[6:49] * f4m8 is now known as f4m8_
[6:49] * f4m8_ is now known as f4m8
[6:52] * ijuz__ (~ijuz@p4FFF622F.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[7:02] * ijuz__ (~ijuz@p4FFF69B1.dip.t-dialin.net) has joined #ceph
[7:42] * Su2zuki (Su2zuki@c-68-44-246-23.hsd1.nj.comcast.net) has joined #ceph
[7:50] * morse (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[7:56] * gnp421 (~hutchint@c-75-71-83-44.hsd1.co.comcast.net) has joined #ceph
[8:08] * Su2zuki (Su2zuki@c-68-44-246-23.hsd1.nj.comcast.net) Quit ()
[8:10] * gnp421 (~hutchint@c-75-71-83-44.hsd1.co.comcast.net) Quit (Read error: Connection reset by peer)
[8:20] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:23] * Meths_ (rift@ has joined #ceph
[8:29] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[9:05] <gregorg_taf> johnl_: gni ?
[9:06] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:22] * Jiaju (~jjzhang@ Quit (Remote host closed the connection)
[9:29] * Jiaju (~jjzhang@ has joined #ceph
[9:36] * bchrisman1 (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[9:36] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[9:37] * tranceConscious (~billy@ has joined #ceph
[9:45] <tranceConscious> I'm trying to set up a two node test environment
[9:45] <tranceConscious> anyone around to give a hand?
[9:59] * Yoric (~David@ has joined #ceph
[10:15] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[10:16] <jantje> tranceConscious: sure, should be easy
[10:16] <jantje> i think there is a ceph.conf example in the src/ directory
[10:17] <jantje> (if you did build from source)
[10:17] <jantje> tranceConscious: and the twiki contains a lot of things too
[10:30] <tranceConscious> was following the wiki, for debian, did the git thing, and dpkg-buildpackage, then installed the debs
[10:31] <tranceConscious> but when I created the sample conf from the wiki [as it says here http://ceph.newdream.net/wiki/Installing_on_Debian]
[10:31] <tranceConscious> and run the mkcephfs
[10:31] <tranceConscious> I got an error
[10:33] <tranceConscious> created a test setup with two virtual machines
[10:34] <tranceConscious> now I'm at the office and I'll try to create a two physical machine setup
[10:37] <jantje> what error?
[10:41] * allsystemsarego (~allsystem@ has joined #ceph
[10:45] <tranceConscious> I'll copy paste it when I get it again...
[11:04] * billy (~quassel@ has joined #ceph
[11:05] * tranceConscious (~billy@ Quit (Quit: Ex-Chat)
[11:13] <stingray> my uplink broke ipv6 again
[11:13] <stingray> fffuuuu
[11:49] <billy> libcrypto++-dev is a dependency not mentioned on the wiki
[11:52] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[13:42] * billy (~quassel@ Quit (Remote host closed the connection)
[15:04] <jantje> [ 4225.389761] libceph: get_reply unknown tid 1764837 from osd3
[15:04] <jantje> [ 4225.399746] libceph: get_reply unknown tid 1764838 from osd3
[15:04] <jantje> [ 4492.918272] libceph: osd2 weight 0x0 (out)
[15:04] <jantje> [ 4492.918275] libceph: osd5 weight 0x0 (out)
[15:04] <jantje> [ 4522.961342] libceph: osd0 weight 0x0 (out)
[15:04] <jantje> [ 4522.961344] libceph: osd1 weight 0x0 (out)
[15:04] <jantje> [ 4556.562810] libceph: tid 1736722 timed out on osd3, will reset osd
[15:05] <jantje> wicked :)
[15:48] * f4m8 is now known as f4m8_
[16:05] <jantje> wido: are you here?
[16:05] <jantje> wido: if you have time, can you run: iozone -t 4 -s 2G -i 0 -i 2 -i 8 on your ceph clients
[16:06] <jantje> and can you tell me if you see btrfs errors on your servers?
[16:07] <jantje> because, when I get a btrfs warning (only when it's busy with the random read/write workload), my performance drops
[16:16] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[16:26] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[16:26] * bchrisman1 (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[16:32] * greglap (~Adium@ has joined #ceph
[16:53] * Meths_ is now known as Meths
[17:04] <greglap> stingray: what were you saying about the objecter or OSD being broken?
[17:11] * billy (~quassel@ has joined #ceph
[17:24] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[17:47] * billy (~quassel@ Quit (Remote host closed the connection)
[18:04] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:22] * cmccabe (~cmccabe@ has joined #ceph
[18:44] * raso (~raso@debian-multimedia.org) has joined #ceph
[18:50] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:08] <wido> jantje: still there?
[19:09] <wido> jantje: I think you are seeing: http://tracker.newdream.net/issues/563
[19:10] <stingray> gregaf: I said I have logs with "debug objecter=10" and they have more info for dump-journal run
[19:11] <gregaf> and you think the bug is in that layer?
[19:11] <gregaf> or you just have the logs?
[19:11] <gregaf> oh, right, it's coming back to me now
[19:13] <gregaf> stingray: did you post them somewhere I can access?
[19:14] * Yoric (~David@ Quit (Quit: Yoric)
[19:32] <wido> cmccabe: I'll give your fix a try, but I couldn't reproduce the crash today
[19:32] <cmccabe> wido: ah, sorry to hear that it's hard to reproduce
[19:33] <wido> yes.. Yesterday I kept going down by removing those three pools shortly after each other
[19:33] <wido> when I tried the same today, it wouldn't go
[19:33] <wido> nevertheless, i'll build and see if it stays up
[19:33] <cmccabe> wido: k
[20:14] <johnl_> hey cmccabe: you were looking for me a few days ago.
[20:15] <johnl_> our irc use finally coincides
[20:15] <cmccabe> johnl_: yeah, I was going to ask you about your cluster setup
[20:15] <cmccabe> johnl_: but I figured it out.
[20:15] <johnl_> oki.
[20:15] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:15] <gregaf> cmccabe did some investigating and he says it looks like your memory use problem is actually related to the number of pools, not objects
[20:16] <gregaf> johnl_: is that plausible based on your usage scenario?
[20:18] <cmccabe> gregaf, johnl: I think that having thousands of pools is something we haven't tested that well, and I encountered a lot of problems when trying to set it up on my own machine
[20:18] <cmccabe> I also think that the large number of pools causes a correspondingly large number of PGs, which may lead to the heavy memory consumption we were seeing
[20:20] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:24] <johnl_> hrm, right
[20:24] <johnl_> I could see the s3 gateway thing encouraging multiple pools
[20:25] <johnl_> I kind of saw pools like amazon's buckets
[20:25] <johnl_> how does the gateway emulate buckets?
[20:25] <gregaf> it makes a pool per bucket
[20:25] <gregaf> we knew they were a little heavier but we didn't think it would cause problems
[20:25] <gregaf> apparently we were wrong
[20:25] <johnl_> :/
[20:26] <johnl_> I'm assuming a pool per bucket makes it easier to enumerate objects in a bucket
[20:26] <gregaf> most of our concerns at the time were focused on the OSDMap rather than on memory use
[20:26] <gregaf> yes
[20:26] <johnl_> I assume if you use a prefix or something on the object names that you'd need to enumerate all objects and filter the prefix yourself
[20:26] <gregaf> and makes life easier if you want to allow multiple forms of access, though I don't remember if we ever actually set that up
[20:26] <johnl_> though I saw something about filtering by xattr recently?
[20:27] <gregaf> yehuda's working on that
[20:27] <johnl_> suppose you could use an xattr to specify what bucket an object is in.
[20:27] <gregaf> our current use case is to do some fsck work if we lose directory inodes or whatever
[20:27] <johnl_> I'm guessing enumerating by xattr is much slower than enumerating by pool?
[20:27] <johnl_> s/is/will be
[20:27] <gregaf> I don't know, but I'm pretty sure
[20:28] <gregaf> enumerating by xattr will involve all the OSDs looking at every object they possess
[20:28] <johnl_> yer :/
[20:28] <cmccabe> buckets are pretty heavyweight in AWS too
[20:28] <johnl_> we need an xattr index :)
[20:28] <gregaf> whereas objects are stored in directories corresponding to their PGs, and PGs belong to a specific pool
[20:28] <cmccabe> creating a bucket is a Big Deal and may involve your account getting charged
[20:28] <cmccabe> also you can't move things between buckets without getting charged... ever
[20:29] <gregaf> a solution to this is not going to come by changing how rgw works, I don't think
[20:29] <gregaf> we haven't looked much at reducing per-PG memory usage and I'm not sure how feasible that is
[20:29] <johnl_> this something that can be fixed you reckon? or a major change?
[20:29] <johnl_> ah
[20:30] <cmccabe> I think we can do a lot of optimizations but we need to have some smoke tests first
[20:30] <cmccabe> otherwise it's very hard to know whether any particular change helps or hurts
[20:30] <cmccabe> also performance/memory regressions are hard to spot, even for experienced devs
[20:31] <cmccabe> devs never can predict where performance is lost, and always think they can... it's one of the oldest truths in programming.
[20:31] <cmccabe> Measure first, then optimize.
[20:31] <gregaf> this might just mean that using rgw to power an S3 emulating service is infeasible, although to be honest it wasn't meant to scale that high anyway (since rgw is stuck to the limits of one machine anyway)
[20:31] <johnl_> can load balance to multiple rgw right?
[20:31] <gregaf> but I bet that most of the PGs sucking up memory are completely unused, so we can probably do something to evict them out of memory
[20:32] <cmccabe> gregaf: I'm trying to figure out why PGs can only be in one pool at once
[20:32] <gregaf> I'm just not sure how that would impact performance or how difficult it would be to maintain guarantees
[20:32] <gregaf> cmccabe: that's what pools are, they're logical groupings of different PGs with different placement rules
[20:32] <cmccabe> gregaf: virtual memory already does something pretty similar to your evict-on-unused
[20:33] <johnl_> side note: I can't find anything about S3 buckets being heavy weight. They group objects by a region. acls are applied per bucket. and versioning is done per bucket.
[20:33] <gregaf> cmccabe: yes, but some people like to keep swap off
[20:33] <cmccabe> johnl_: I worked at a web 2.0 company previously and we had about a dozen buckets, total
[20:33] <gregaf> that doesn't mean they're heavy-weight, though
[20:33] <cmccabe> johnl_: and moving data between buckets was generally a big no-no
[20:33] <gregaf> just that you don't need many and working between them is difficult
[20:33] <cmccabe> johnl_: because the company got charged for doing that
[20:34] <johnl_> cmccabe: yeah, I mean, it's not common to create lots of buckets, but amazon don't treat them as heavyweight.
[20:34] <gregaf> that doesn't mean they're heavy from Amazon's perspective
[20:34] <gregaf> they are free to create (and destroy? unless you need to delete their contents first)
[20:35] <gregaf> anyway, don't remember exactly when PGs are created but I'm thinking specifically of the preferred ones, which are bound to a specific OSD as primary
[20:35] <gregaf> if those are created on every OSD for every pool, even when unused, we can probably optimize them away pretty easily
[20:35] <cmccabe> https://forums.aws.amazon.com/message.jspa?messageID=179086
[20:36] <johnl_> I keep losing grasp of what a pg is
[20:36] <wido> cmccabe: I tried your fix, it is working or I'm not hitting the bug again
[20:36] <cmccabe> 100 bucket limit set on amazon S3 accounts
[20:36] <johnl_> cmccabe: interesting
[20:36] <gregaf> johnl_: pg = placement group
[20:36] <gregaf> objects are hashed into a placement group
[20:36] <cmccabe> you might disagree, but I think having a 100 bucket limit per user means that buckets *are* heavyweight from amazon's perspective.
[20:36] <wido> cmccabe: I created 25 pools and removed them all, OSD's stayed up
[20:36] <gregaf> and the placement groups are super-awesome-cool-stable hashed onto OSDs
[20:37] <cmccabe> you can get around this by having multiple billing accounts, but it sounds like amazon doesn't "expect" users to have a bazillion buckets
[20:37] <johnl_> cmccabe: indeed
[20:37] <cmccabe> wido: great!
[20:38] <johnl_> and there are a number of pgs per pool then?
[20:38] <cmccabe> wido: although seeing as it's hard to reproduce I guess we should be cautious before declaring victory
[20:38] <gregaf> johnl_: yes, and you can change the number of PGs/pool there are (on the order of 100/OSD is generally good)
[20:39] <johnl_> depending on your data usage
[20:39] <gregaf> maybe we need less now, I forget exactly, but the more PGs you have the more even your data placement will be
[20:39] <johnl_> so 100 osd means 1% of your data is in each osd
[20:40] <johnl_> sorry 100 pgs
[20:40] <johnl_> 1% in each pg
[20:40] <gregaf> there is a reasonably small per-PG memory overhead and not all PGs go to all OSDs, but that per-PG overhead is currently fixed, so with 2400 pools...
[20:40] <johnl_> so assuming 100 osds, one osd dies and only 1% of your objects are missing.
[20:40] <gregaf> well, it's a probabilistic distribution, but essentially yes
[20:40] <gregaf> well, you ought to have replication turned on so you aren't missing any of it!
[20:40] <johnl_> yeah sure
[20:41] <johnl_> but you only need to redistribute 1% :)
[20:41] <gregaf> basically, yes
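The point gregaf makes above (objects are hashed into placement groups, and more PGs give a more even spread) can be illustrated with a small sketch. This is only a toy model: `pg_for_object` and `pg_distribution` are hypothetical names, and MD5 stands in for the hash. Ceph's real object hash and the CRUSH mapping of PGs onto OSDs are not reproduced here.

```python
import hashlib
from collections import Counter


def pg_for_object(name: str, pg_num: int) -> int:
    """Map an object name to a PG index (illustrative hash, not Ceph's)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % pg_num


def pg_distribution(names, pg_num):
    """Count how many objects land in each PG."""
    return Counter(pg_for_object(n, pg_num) for n in names)


if __name__ == "__main__":
    names = [f"object{i}" for i in range(100_000)]
    counts = pg_distribution(names, 100)
    # With 100 PGs each one should hold roughly 1% of the objects.
    print("smallest PG:", min(counts.values()))
    print("largest PG:", max(counts.values()))
```

With 100 PGs, each PG ends up with roughly 1% of the objects, which is the "probabilistic distribution" mentioned in the conversation.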
[20:41] <johnl_> interesting that pgs are per pool
[20:41] <johnl_> pools "felt" like a lightweight way of grouping objects
[20:42] <gregaf> they were meant to be a lightweight way of grouping objects
[20:42] <gregaf> they are maybe less lightweight than we thought
[20:42] <johnl_> which is all I'm looking for. I'm indifferent (in this case) to each pool managing its own pgs
[20:42] <cmccabe> I wonder why we need new PGs per pool, rather than using the existing ones
[20:42] <johnl_> right
[20:42] <cmccabe> surely that would also ensure good distribution
[20:42] <johnl_> cmccabe: I suppose there is a use case where it's handy :)
[20:43] <gregaf> cmccabe: well, PGs belong to pools or there wouldn't be much point to pools?
[20:43] <cmccabe> oh yeah, I forgot... the placement stuff
[20:43] <cmccabe> pools are a nice way to get some things placed differently than others
[20:43] <gregaf> yes
[20:43] <johnl_> ah indeed
[20:43] <johnl_> as per the crush config.
[20:43] <gregaf> and if you have pre-existing PGs then they already have data in them...
[20:43] <johnl_> so you can replicate your metadata more than your data, or whatever. etc.
[20:43] <cmccabe> but again, why can't pools reuse PGs that exist
[20:44] <cmccabe> not all pools need to include all PGs of course
[20:44] <cmccabe> then placement could still be different for different pools
[20:44] <cmccabe> we could have pools prefix the object names with a special prefix or something
[20:44] <gregaf> well if you had PGs in two different pools, then objects placed via one pool would show up in the other pool....
[20:44] <cmccabe> I guess it might make finding pool size usage harder
[20:45] <gregaf> and what happens if you create a new pool that goes into 4 PGs, and then you place 90% of your data into that pool?
[20:45] <gregaf> then your data isn't well-distributed at all
[20:45] <cmccabe> how is that different than now?
[20:45] <gregaf> ....?
[20:45] <cmccabe> fewer PGs lead to worse data distribution
[20:45] <gregaf> right now pools get their own PGs and so the data of the pool is distributed over the entire cluster
[20:46] <gregaf> you're talking about squashing pools into pre-existing PGs but eventually you're going to run out of PGs
[20:46] <cmccabe> I was just saying that pools *could* use a subset of PGs, not that all pools would *have* to.
[20:46] <gregaf> not to mention, btw, prefixing would change the hash
[20:46] <cmccabe> my thought was basically to have pools share PGs.
[20:46] <cmccabe> we already know about the running out of PGs problem; that's why we have PG expansion
[20:47] <cmccabe> if anything, the current pool scheme is worse for PG expansion
[20:47] <cmccabe> since doubling the number of PGs is ridiculously infeasible when you have 2000 pools, and 100 PGs per pool...
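The arithmetic behind cmccabe's remark is simple enough to write down. The sketch below only multiplies the figures quoted in the conversation (2000 pools at 100 PGs each, and 2400 pools at the default of 8). The function names are hypothetical, and `per_pg_bytes` is a placeholder parameter, since no actual per-PG memory figure is given in the log.

```python
def total_pgs(pools: int, pgs_per_pool: int) -> int:
    """Cluster-wide PG count when every pool has its own PGs."""
    return pools * pgs_per_pool


def pg_memory_bytes(pools: int, pgs_per_pool: int, per_pg_bytes: int) -> int:
    """Total fixed PG overhead; per_pg_bytes is an assumed input, not a measured one."""
    return total_pgs(pools, pgs_per_pool) * per_pg_bytes


if __name__ == "__main__":
    print(total_pgs(2000, 100))  # 2000 pools, 100 PGs each
    print(total_pgs(2000, 200))  # the same pools after doubling pg_num
    print(total_pgs(2400, 8))    # 2400 pools at the default of 8 PGs
```

These are cluster-wide totals; as noted earlier in the conversation, not every PG lives on every OSD, so the per-OSD share is smaller.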
[20:47] <sagewk> the main problem is that placement is controlled at pool granularity (i.e. which crush map, which osds get data, etc.)
[20:48] <sagewk> which makes sharing pgs impossible.
[20:48] <cmccabe> ic
[20:48] <cmccabe> so pools have their own crush maps?
[20:48] <sagewk> i don't think there's any problem with the current scheme unless you start creating bazillions of pools.
[20:48] <gregaf> which is something we encourage with rgw
[20:48] <sagewk> even then, pg = pool subset is fine, the main issue is osdmap bloat.
[20:49] <cmccabe> sagewk: yeah, as I said, having many buckets (=pools) is not a common use case in the real world
[20:49] <cmccabe> sagewk: so we might be ok as is, as long as we document our expected use cases
[20:49] <sagewk> if it's osd memory utilization we're worried about, we can reduce the in-memory pg footprint...
[20:49] <gregaf> it's not good if you have a bunch of users all using only a few buckets/pools, though
[20:49] <cmccabe> why not?
[20:50] <gregaf> because we're a scalable system
[20:50] <cmccabe> there was something I found on the web where some guy was like "I'm using AWS to create a bucket for each user that uses my website, and it scales poorly"
[20:50] <gregaf> and if you say "you can have no more than 10 pools per 5 MB of osd RAM" and somebody has 50,000 users....
[20:50] <cmccabe> and the response was like "don't do that"
[20:50] <gregaf> yeah
[20:51] <gregaf> but Amazon has a bucket for each user!
[20:51] <gregaf> we're not the web2.0 guy, we're Amazon!
[20:51] <johnl_> I wanted to provide lots of users access to one ceph cluster. was going to give them a pool each (or multiple pools each) and have rgw do some kind of access control.
[20:51] <sagewk> if you really want multiple users' data to be managed in the same set of pgs, then your app can name objects like "user/object" or something.
[20:51] <johnl_> I don't want to have to run lots of ceph clusters. that doesn't seem right.
[20:51] <cmccabe> hmm, I see what you mean. In order to give users 100 buckets each, and have a bunch of users, we need better scalability.
[20:52] <johnl_> sagewk: yeah, but then I need a way to index the objects by that prefix.
[20:52] <johnl_> which is fine. I just thought pools gave me that.
[20:52] <johnl_> need the index so I can enumerate objects for a given "bucket"
[20:52] <sagewk> what do you mean by "index"?
[20:53] <sagewk> ah
[20:53] <johnl_> without retrieving all object names and doing it myself.
[20:53] <johnl_> pools gave me that for free.
[20:53] <johnl_> heh, cept for the ram use :)
[20:53] <sagewk> if your buckets are really small, you can reduce the number of pgs in them (default is 8)
[20:53] <cmccabe> I thought S3 had a way to enumerate all objects whose name started with a prefix
[20:53] <cmccabe> surely that is what you should use?
[20:53] <sagewk> yeah, but i think it's still o(n) where n is size of bucket, not result size
[20:54] <sagewk> we can do the same, but it's not fast :)
[20:54] <gregaf> and it'll slow down anybody else using the cluster
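The difference sagewk describes, between a listing that costs O(bucket size) and one that costs O(result), can be sketched as follows. Both functions are hypothetical illustrations over an in-memory list of names; neither is how rgw or the OSDs list objects. The second one assumes a sorted index of names already exists, which is exactly what the flat namespace does not provide for free.

```python
import bisect


def list_by_scan(names, prefix):
    """O(n): examine every name in the bucket and keep the matches."""
    return sorted(n for n in names if n.startswith(prefix))


def list_by_index(sorted_names, prefix):
    """O(log n + result): binary-search to the first match, then walk forward."""
    i = bisect.bisect_left(sorted_names, prefix)
    matches = []
    while i < len(sorted_names) and sorted_names[i].startswith(prefix):
        matches.append(sorted_names[i])
        i += 1
    return matches


if __name__ == "__main__":
    names = ["alice/a", "alice/b", "bob/x", "bob/y", "carol/z"]
    print(list_by_scan(names, "bob/"))
    print(list_by_index(sorted(names), "bob/"))
```

The indexed version works because, in sorted order, every name sharing a prefix sits in one contiguous run starting at the position `bisect_left` returns.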
[20:54] <Tv|work> sounds like you might need a "directory"
[20:54] <johnl_> reducing number of pgs is a good start. though I do have users with 8 million files. and others with 10,000 files.
[20:54] <sagewk> probably the solution here is to try to break the current invariant that all pgs have in-memory state. or reduce the amount of in-memory state.
[20:55] <Tv|work> hehe
[20:55] <sagewk> if the pgs are mostly idle (which they presumably will be if there are so many) then it's just a matter of using memory more intelligently, right?
[20:55] <sagewk> you adjust pg_num for each pool independently
[20:55] <johnl_> ok doing that on the fly?
[20:55] <sagewk> small pools get few pgs, big ones get lots.
[20:56] <johnl_> would rather avoid needing to specify how big your bucket is going to be when you create it
[20:56] <sagewk> the implementation is a bit weak at the moment, but that's the end goal.
[20:56] <johnl_> again, thinking rgw
[20:56] <sagewk> the goal would be to make the monitor auto-adjust as the pool grows/shrinks over time.
[20:57] <sagewk> currently we can split (with horrible locking) to increase pg_num, but not merge (reduce it).
[20:57] <Tv|work> side note: here's what i can do for testing cli tools:
[20:57] <Tv|work> $ cat src/test/cli/monmaptool/simple.t
[20:57] <Tv|work> $ monmaptool
[20:57] <Tv|work> usage: [--print] [--create [--clobber]] [--add name] [--rm name] <mapfilename>
[20:57] <Tv|work> [1]
[20:57] <Tv|work> that's it, that's the entirety of that test
[20:58] <Tv|work> the create, add, print one is a bit more interesting, but too long to paste
[20:58] <johnl_> I think I'd be tempted to just use one pool here and keep an external index of objects to "buckets"
[20:58] <johnl_> but seems a shame to have to keep that myself.
[20:58] <johnl_> suppose the metadata service gives me that :)
[20:58] <sagewk> yeah, i don't think that's the answer.
[20:58] <sagewk> yeah, then you get files and directories :)
[20:58] <johnl_> could just put files in directories on a ceph filesystem :)
[20:58] <johnl_> heh
[20:59] <sagewk> i think it's mainly a matter of figuring out how to reduce osd memory usage.
[20:59] <Tv|work> well, rgw needs a cheaper fs than ceph -- posix and all that..
[20:59] <sagewk> your use case is radosgw?
[20:59] <johnl_> though I was keen on the (perceived?) lower overhead of direct rados
[20:59] <Tv|work> osds themselves are already a really trivial fs
[20:59] <cmccabe> sagewk: we should figure out whether listing files in S3 should be cheap
[20:59] <cmccabe> sagewk: I have a gut feeling that it is cheap under AWS
[20:59] <Tv|work> sagewk: if it wasn't originally, it is now
[20:59] <sagewk> cmccabe: yeah
[20:59] <cmccabe> sagewk: if so, that should be the scalability point we should work on, since S3 users will expect that
[20:59] <Tv|work> err, cmccabe ^
[21:00] <Tv|work> this irc client is making me do stupid things
[21:00] <cmccabe> sagewk: experienced S3 users will not create a lot of buckets
[21:00] <johnl_> sagewk: close enough of a use case yeah. buckets and objects. lots of buckets. variable number of objects.
[21:00] <sagewk> one thing we could do is make objects name "foo/bar" actually store in directories on the osds, so that listing with a / delimited prefix is efficient.
[21:01] <cmccabe> sagewk: how is glob(2) implemented?
[21:01] <sagewk> that wouldn't be a terribly difficult change, really. All in os/FileStore.cc, and some changes to the list_objects code
[21:01] <cmccabe> er glob(3)
[21:01] <sagewk> it isn't?
[21:02] <johnl_> sagewk: that's interesting. though my next bug reports are going to be about the storage overhead of files on the fs :s
[21:02] <sagewk> iirc you can specify a prefix when listing objects but that's all?
[21:02] <cmccabe> no, I mean the glibc function glob
[21:02] <cmccabe> does it just do readdir and go to town, or is it more clever
[21:02] <sagewk> readdir + filter
[21:02] <johnl_> suppose the same could be done with the bdb implementation, should it ever be updated :)
[21:02] <cmccabe> sigh
[21:03] <johnl_> http://docs.amazonwebservices.com/AmazonS3/latest/API/index.html?RESTBucketGET.html
[21:03] <sagewk> the storage overhead of foo/bar having a 'foo' parent directory you mean?
[21:03] <johnl_> you specify a delimiter
[21:03] <johnl_> and a prefix
[21:03] <johnl_> apparently
[21:03] <johnl_> oh no, they're separate things.
[21:03] <sagewk> interesting. they must have a bucket then index
[21:03] <sagewk> bucket index then
[21:03] <johnl_> you can search on a prefix. and group by prefix-to-delimiter
[21:04] <cmccabe> seems like the sticky wicket is that they have some kind of prefix tree
[21:04] <cmccabe> and slash is not special for them
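The prefix and delimiter behaviour johnl_ describes (search on a prefix, then group by prefix-to-delimiter, with no character being special) can be sketched like this. `list_bucket` is a hypothetical helper that mimics the semantics over an in-memory key list; it ignores the marker and max-keys pagination and says nothing about how S3 implements the listing internally.

```python
def list_bucket(keys, prefix="", delimiter=""):
    """Return (contents, common_prefixes) for the given prefix and delimiter.

    Keys that contain the delimiter after the prefix are rolled up into a
    single common prefix ending at the first such delimiter.
    """
    contents = []
    common_prefixes = []
    seen = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            end = rest.index(delimiter) + len(delimiter)
            rolled_up = prefix + rest[:end]
            if rolled_up not in seen:
                seen.add(rolled_up)
                common_prefixes.append(rolled_up)
        else:
            contents.append(key)
    return contents, common_prefixes


if __name__ == "__main__":
    keys = ["photos/2010/a.jpg", "photos/2011/b.jpg",
            "photos/index.html", "readme.txt"]
    print(list_bucket(keys, prefix="photos/", delimiter="/"))
    print(list_bucket(keys, delimiter="-"))
```

Because the delimiter is just an argument, "/" behaves no differently from any other string, which is why a directory-per-slash layout on the OSDs would only cover the common case.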
[21:05] <johnl_> enumerations are not necessarily fast on S3 btw. you can only get 1000 at a time anyway. takes ages to enumerate millions of files over HTTP
[21:06] <cmccabe> johnl_: there's two different aspects of speed... latency between making the List Objects request and getting the data, and throughput
[21:07] <johnl_> cmccabe: yeah sure, but what I mean is S3 is not optimised for enumeration speed.
[21:07] <cmccabe> johnl_: throughput is probably constrained somewhat by the HTTP protocol, but how much is not clear
[21:07] <sagewk> the real question is whether it is O(result) or O(bucket size)
[21:07] <cmccabe> johnl_: also, try doing a readdir in a directory with a million files
[21:07] <cmccabe> johnl_: then tell me that S3 is slower. You might find that it's not!
[21:07] <johnl_> heh, ok.
[21:08] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[21:08] <cmccabe> johnl_: indeed, I think ext2 had a limit that was far less than a million files per directory... I think it was in the thousands? I forget.
[21:09] <sagewk> you're thinking the 32768 subdirectory limit probably (due to signed short nlink)
[21:09] <cmccabe> sagewk: yeah, that is the real question
[21:09] <cmccabe> sagewk: O(result) or O(bucket size)
[21:10] <cmccabe> sagewk: should be pretty easy to answer by talking to some S3 users
[21:10] <cmccabe> sagewk: in my use of S3, I didn't notice a significant slowdown based on bucket size. But to be sure, should check with big enterprise users who really do have a million objects per bucket.
[21:16] <johnl_> Amazon claim: "The total number of keys in a bucket doesn't substantially affect list performance."
[21:17] <johnl_> "nor by the
[21:17] <johnl_> presence or absence of the prefix, marker, maxkeys, or delimiter arguments.
[21:17] <cmccabe> ic
[21:19] <johnl_> S3 supposedly uses Amazon's Dynamo db system though. which is more like cassandra/riak. so perhaps better suited to these kinds of queries.
[21:20] <johnl_> perhaps they're just throwing concurrency at the work of enumerating/filtering
[21:21] <johnl_> one interesting thing, you can't delete a bucket on S3 without deleting all the objects first.
[21:22] <johnl_> might just be a way to charge you for the work of deleting, but it might also be a clue to implementation
[22:21] <Tv|work> johnl_: i'm not convinced it uses dynamo
[22:21] <Tv|work> more like SimpleDB, these days
[22:23] <sagewk> the main hints for me are statements about how a write followed by a read may return a stale result
[22:23] <Tv|work> sure but that might be just replication lag; dynamo adds the everpresent possibility of two conflicting writes
[22:24] <sagewk> normally on a dynamo read tho you read N replicas to recover a consistent result. i guess they could just skip that on s3, but it doesn't seem to map cleanly to me.
[22:25] <Tv|work> i think i agree with what Kyle said over lunch.. just state rgw v1 limits well, optionally build rgw v2 later that uses rados for the objects but has an "index"
[22:26] <sagewk> yeah
[22:26] <Tv|work> the index doesn't really have to map to osd pools etc concepts
[22:28] <cmccabe> I had another thought, also during lunch
[22:28] <cmccabe> should we be creating an on-disk hierarchy of some kind to optimize readdir on the osds
[22:29] <Tv|work> "a cheaper mds when you don't need posix"
[22:29] <cmccabe> I guess that might still be problematic since we distribute the objects in the pool across multiple PGs
[22:30] <cmccabe> but the number of PGs isn't that large
[22:31] <cmccabe> so as I understand it, the problem with our current implementation of List Objects is O(num-objects-in-bucket), whereas Amazon's is O(request-size)
[22:33] <cmccabe> however, we can come closer to amazon's behavior by storing files smarter in the object store inside a single collection
[22:33] <cmccabe> so right now, collections store files 0 through 100 like this:
[22:33] <cmccabe> dev/osd0/<collection-name>/object0
[22:33] <cmccabe> dev/osd0/<collection-name>/object1
[22:33] <cmccabe> ...
[22:33] <cmccabe> dev/osd0/<collection-name>/object100
[22:34] <sagewk> yeah. if we make the / delimiter special, at least, we can just use directories in the underlying fs to do that.
[22:34] <cmccabe> well, or we just decide to break the collection name arbitrarily at some point
[22:34] <cmccabe> like if you're storing objects abcdefghij and xyz123456, we store them as
[22:34] <cmccabe> abcde/fghij
[22:34] <cmccabe> and
[22:34] <cmccabe> xyz12/3456
[22:34] <Tv|work> sagewk: but that's not s3 compliant
[22:35] <gregaf> forget being compliant, that seems gross to me
[22:35] <Tv|work> sagewk: s3 is not segments with slashes, it's an arbitrary flat namespace
[22:35] <sagewk> tv: right. to handle arbitrary delimiters we need an index. but / is common at least.
[22:35] <gregaf> RADOS provides a flat namespace and breaking that up is something we'd need to think long and hard about
[22:35] <cmccabe> if we arbitrarily choose some point to break the collection name, our performance will be somewhere in between "all files in same dir" and "all characters are directories"
[22:35] <Tv|work> just add an index on top, just like you did for ceph directories!
[22:36] <cmccabe> tv: the problem is that we want to rely on the native filesystem
[22:36] <cmccabe> tv: not start running our own (we did that already as EBOFS, but no longer do)
[22:36] <Tv|work> Tv|work: so store the index in an osd object (just like you did for ceph!)
[22:36] <cmccabe> tv: *rolling our own
[22:36] <Tv|work> cmccabe: oh wait ignore what i just said
[22:36] <gregaf> I love your irc client, tv :P
[22:37] <Tv|work> gregaf: goddammit :-/
[22:37] <Tv|work> cmccabe: how would having an index force you to not use btrfs or such?
[22:37] <cmccabe> tv: it wouldn't, but I also believe that it would be slow
[22:37] <sagewk> the index could be maintained by os/FileStore.cc (internal to the pg), actually.
[22:38] <cmccabe> I really think that would be a big mistake.
[22:38] <cmccabe> it would be re-inventing the dcache
[22:38] <sagewk> listing+filtering then becomes O(result) + O(num pgs). which is inevitable, given that we're hashing the namespace across multiple nodes.
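The O(result) + O(num pgs) shape sagewk mentions comes from asking every PG for its matches and merging the answers. The sketch below assumes each PG keeps its own sorted list of names (the per-PG index being discussed, which does not exist in the current code); `pg_matches` and `list_across_pgs` are hypothetical names.

```python
import bisect
import heapq


def pg_matches(sorted_names, prefix):
    """Yield one PG's names that start with prefix, in sorted order."""
    i = bisect.bisect_left(sorted_names, prefix)
    while i < len(sorted_names) and sorted_names[i].startswith(prefix):
        yield sorted_names[i]
        i += 1


def list_across_pgs(pg_indexes, prefix):
    """Query every PG index and merge the sorted partial results."""
    partials = (pg_matches(index, prefix) for index in pg_indexes)
    return list(heapq.merge(*partials))


if __name__ == "__main__":
    pgs = [["bar", "foo1", "foo3"], ["baz", "foo2"], ["qux"]]
    print(list_across_pgs(pgs, "foo"))
```

Every PG has to be consulted even if it holds no matches, which is the unavoidable O(num pgs) term when names are hashed across nodes.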
[22:38] <cmccabe> but without the benefit of the years of development spent on the dcache
[22:38] <Tv|work> sagewk: my instincts say don't complicate rados due to s3, build on top
[22:38] <Tv|work> sagewk: you guys are already drowning in quality problems, due to the codebase getting bigger and bigger..
[22:38] <cmccabe> but consider the multi-dir solution please
[22:38] <cmccabe> it's so simple
[22:39] <cmccabe> and I think it would solve the problem and possibly other issues related to having so many files in a single collection dir
[22:39] <gregaf> tv is right, everything is simple but you put them together and it gets complicated
[22:39] <sagewk> directories don't solve the arbitrary delimiter problem
[22:39] <Tv|work> what's simple is storing the index in a frigging sql (or nosql!) db on the rgw machines
[22:39] <gregaf> if rgw is our sole motivator for a change to RADOS, it's not something we should be spending any time on
[22:39] <Tv|work> gregaf: agree 100%
[22:40] <cmccabe> sagewk: consider a directory structure that we split arbitrarily after every 5 characters in a hash of object name
[22:40] <cmccabe> sagewk: er sorry, can't hash, need sorting
[22:40] <cmccabe> sagewk: but anyway, I think directories do make the list behavior better in the common case
[22:40] <gregaf> cmccabe: try and come up with a model for how many directories that will create, would you?
[22:41] <cmccabe> sagewk: for example, I have foobar, foobar2, foo, and fugly in my collection
[22:41] <gregaf> I could be mistaken, but I suspect it's going to be the same order as the number of files
[22:41] <gregaf> except with tree traversal going on too
[22:41] <gregaf> now, I think that reducing pool/PG memory use will have applications for real users of Ceph that we want to be seducing
[22:41] <cmccabe> I create:
[22:41] <cmccabe> foob/ar
[22:41] <cmccabe> foob/ar2
[22:41] <cmccabe> foo
[22:41] <cmccabe> fugl/y
[22:41] <gregaf> and if somebody wants to spend a week or whatever making rgw work better, cool
[22:41] <cmccabe> then when searching for fooba*, I only need to search the foob/ directory
[22:41] <cmccabe> not readdir on /
[22:42] <gregaf> but we need rados to be done, basically
[22:42] <cmccabe> that is a real savings!
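cmccabe's multi-dir idea can be sketched as a toy model. The names `split_name`, `build_layout` and `list_prefix` are hypothetical, the layout is an in-memory dict rather than a real filesystem, and the break point of 4 is taken from the foob/ar example (the earlier abcde/fghij example uses 5).

```python
from collections import defaultdict


def split_name(name, break_at=4):
    """Path an object would be stored under: 'foobar' -> 'foob/ar'."""
    if len(name) <= break_at:
        return name  # short names stay at the top level of the collection
    return name[:break_at] + "/" + name[break_at:]


def build_layout(names, break_at=4):
    """Group object names by directory; '' is the collection's top level."""
    layout = defaultdict(list)
    for name in names:
        directory = name[:break_at] if len(name) > break_at else ""
        layout[directory].append(name)
    return layout


def list_prefix(layout, prefix, break_at=4):
    """Return (matching names, number of directories that had to be read)."""
    if len(prefix) > break_at:
        dirs = [prefix[:break_at]]  # one directory is enough
    else:
        dirs = list(layout)         # short prefix: read everything
    matches = sorted(n for d in dirs for n in layout.get(d, [])
                     if n.startswith(prefix))
    return matches, len(dirs)


if __name__ == "__main__":
    layout = build_layout(["foobar", "foobar2", "foo", "fugly"])
    print(list_prefix(layout, "fooba"))
    print(list_prefix(layout, "f"))
```

A search for fooba reads a single directory, while a search for f still has to read all of them, which is the objection Tv|work raises next.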
[22:42] <Tv|work> cmccabe: and i search for f*
[22:42] <gregaf> and it's because of little changes like this that it wasn't done before I started working on this project
[22:42] <cmccabe> sure. Then you still have bad performance. But it's about optimizing the common case
[22:42] <cmccabe> and single-char prefixes are not going to be that I think
[22:43] <cmccabe> ideally, you would change the directory structure dynamically based on the actual contents of the collection
[22:43] <gregaf> back out a bit, cmccabe: how common a use case is this outside of rgw?
[22:43] <gregaf> what performance benefits does it give a user of the Ceph filesystem?
[22:44] <cmccabe> storing a lot of files in the same dir doesn't work well on ext2/3
[22:44] <cmccabe> I don't know how well it works on ext4
[22:44] <gregaf> well it's a good thing we can handle that via PGs and splitting
[22:44] <gregaf> and we recommend btrfs anyway
[22:44] <cmccabe> so it may speed up collections and the object-store in general
[22:44] <sagewk> ok guys, i don't think we need to worry about this for now. :)
[22:45] <cmccabe> :)
[22:45] <cmccabe> one last thing... here's a crazy question...
[22:45] <cmccabe> what if we create a new directory for every character in the object name?
[22:45] <Tv|work> cmccabe: how osd handles its own backing store is also unrelated to what's visible on the network
[22:45] <cmccabe> like object abcde creates
[22:45] <cmccabe> a/b/c/d/e/obj
[22:46] <gregaf> then we have to traverse directory trees to get anything done, which is expensive
[22:46] <cmccabe> I guess deleting objects would become a lot more expensive
[22:46] <cmccabe> ok, maybe not such a good idea.
[22:46] <sagewk> :)
[22:46] <cmccabe> someday maybe we'll have to roll our own object store again I guess
[22:47] <sagewk> cmccabe: hopefully not :)
[22:49] <cmccabe> it's kind of sad that there's no POSIX interface for "get all files in the directory starting with FOO"
[22:50] <Tv|work> the underlying problem here is that you don't have a convenient index in one place; rados is flat so it doesn't contain an index on that level; even if osds could give you their chunk of btrfs readdir fast, crush means files between "bar" and "baz" are spread over practically all the osds --> to have a fast s3-style index, you need an actual index
[22:50] <sagewk> sad for apps, good for fs devs. readdir is already a nightmare as is.
[22:50] <Tv|work> sagewk: well, telldir/seekdir is -- kill those, require full iteration always, and the world will be a happier place
[22:51] <gregaf> although this is another application of a pool-wide search function...
[22:51] <Tv|work> gregaf: do you really want to make that *fast*? fsck doesn't need it to be fast
[22:51] <sagewk> cmccabe: can you rebase the refactor_pg stuff? i fixed the merge conflicts but have an auto_ptr error (missing #include?)
[22:51] <cmccabe> k
[22:52] <gregaf> Tv|work: oh, that was just a reference to another conversation we had about how to do the fsck stuff
[22:52] <Tv|work> ah
[22:53] <gregaf> yehudasa was originally doing it as a new object Class thing but making it work was so hacky he (after discussion) just built it into the OSD codebase
[22:53] <sagewk> tv: exactly
[22:53] <gregaf> meanwhile, though, we should probably focus on our bugs which we have in the tracker
[22:54] <gregaf> at least until tomorrow's meeting when we will discuss things like stability versus feature creep
[22:57] <cmccabe> pushed
[22:58] <cmccabe> yeah, readdir sucks. Too bad about that interface.
[22:58] <cmccabe> readdir_r isn't even safe to use at all
[22:58] <cmccabe> I guess that's at the libc level though. Probably the same syscall in the end.
[23:59] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.