#ceph IRC Log

IRC Log for 2013-08-21

Timestamps are in GMT/BST.

[0:01] <alphe> for the custom apache2 install, how do I install the SSL module into it ?
[0:01] <alphe> is it built in ?
[0:02] <alphe> and rewrite module ?
[0:02] * jeff-YF (~jeffyf@67.23.117.122) Quit (Read error: Operation timed out)
[0:04] * kyann (~kyann@did75-15-88-160-187-237.fbx.proxad.net) Quit (Quit: Try HydraIRC -> http://www.hydrairc.com <-)
[0:08] * torment1 (~torment@pool-173-78-201-45.tampfl.fios.verizon.net) has joined #ceph
[0:10] * clayb (~kvirc@proxy-nj1.bloomberg.com) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[0:17] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[0:17] * ircolle1 (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Read error: Operation timed out)
[0:21] * gregmark (~Adium@cet-nat-254.ndceast.pa.bo.comcast.net) Quit (Quit: Leaving.)
[0:24] * markbby (~Adium@168.94.245.4) Quit (Quit: Leaving.)
[0:26] <sage> sjust: have you tried to reproduce 5951? it's come up 2-3 times over the last 2 days
[0:26] <sjust> yes, that localhost ifup/down was from one of those
[0:26] <sjust> I've been pasting the job ids into the bug
[0:27] <sjust> no luck so far
[0:27] <sage> ah. weird that it's so hard to hit..
[0:27] <sjust> probably an issue with my yaml
[0:29] * AfC (~andrew@2407:7800:200:1011:205e:9d22:671d:4992) has joined #ceph
[0:29] <alphe> yaml = yet another markup language ?
[0:30] <dmick> yep
[0:31] <alphe> there is no doc about mod-fastcgi and apache2 special compilation right ?
[0:33] <alphe> the installation from source has all messed up my poor apache ... there are no files in the usual ubuntu places ...
[0:33] <sage> joshd: can you look at wip-6004?
[0:34] <alphe> my apache2 from github has no a2enmod ...
[0:34] <alphe> strange ...
[0:35] * tnt_ (~tnt@109.130.102.13) Quit (Ping timeout: 480 seconds)
[0:35] <dmick> alphe: why are you trying to build from source?
[0:36] <alphe> dmick I am following the advice from 1000-continue
[0:37] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[0:37] <alphe> dmick basically I tried the regular apt-get way and ended up with an s3-amazon-capable website but without a working SSL layer
[0:38] <alphe> and most MS windows s3-amazon clients (drive-mapping style) only work with an SSL layer to an s3-amazon cloud-drive-like share
[0:40] * mschiff (~mschiff@85.182.236.82) Quit (Remote host closed the connection)
[0:41] <dmick> do you mean you believe that the packages on ceph.com have SSL disabled? I find that hard to believe
[0:42] <alphe> dmick hum nope
[0:42] <alphe> in the ceph.com docs there is 2 ways explained in fact
[0:42] <alphe> one the so called "manual install" (for ceph object storage)
[0:43] * zhyan_ (~zhyan@101.82.104.206) has joined #ceph
[0:43] <alphe> in this one you are told to download the sources for apache2 and mod_fastcgi, build them yourself, then follow some tips on configuring and enabling ssl
[0:44] <alphe> mostly how to create the ssl selfsigned certificate
[0:44] <alphe> but that information is based on a location and files that a ubuntu distro regular apt-get install apache2 will give you
[0:45] * torment2 (~torment@pool-173-78-201-127.tampfl.fios.verizon.net) has joined #ceph
[0:45] <alphe> for example I downloaded the apache2 from the github with
[0:45] <alphe> git clone --recursive https://github.com/ceph/apache2.git
[0:45] <dmick> those aren't two ways; the configuration and ssl enable are necessary whether you get the packages from ceph.com or build yourself; the reason to build yourself is if you believe there's something missing or wrong with the ceph.com packages
[0:46] <dmick> but, regardless, yes, building something as big as apache and getting it installed in teh 'standard' way is a big job
[0:46] <alphe> dmick there is indeed two docs mixed up there
[0:47] <dmick> ok. well, you're in deep building your own, and most/all of the advice is not ceph-specific, so you'll be better off asking elsewhere for that sort of thing. Not many people here build their own apaches.
[0:47] <alphe> because if you apt-get from the regular ubuntu repository, then enable SSL, create your self-signed certificate, and then add the /etc/apache2/sites-available/rgw.conf, you end up with ssl not working
[0:47] <dmick> I mean, someone may be able to help, but I suspect you'll get better answers on an apache forum
[0:48] <alphe> because it is lacking some things ... for example the rgw.conf file does not have the directives to activate the ssl layer for the port described by the virtualhost, only the fastcgi tuning ...
[0:48] * torment1 (~torment@pool-173-78-201-45.tampfl.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[0:49] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[0:50] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[0:50] <alphe> dmick before trying the compile mojo I followed the regular apt-get install way and ended up with an s3-amazon website working fine but without the ssl layer, and there are really few drive-mapping clients for windows that let you mount an s3-amazon share without an ssl layer
[0:50] <dmick> alphe: I understood what you said, but it doesn't change my advice
[0:51] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[0:51] * ChanServ sets mode +v andreask
[0:51] * zhyan__ (~zhyan@101.83.119.69) Quit (Ping timeout: 480 seconds)
[0:51] <alphe> dmick so I compiled and all this isn't the problem ... the problem comes when I try to do the next step of the documentation, which is a2enmod rewrite (no need, it is built in at compilation stage), but a2enmod fastcgi doesn't work
[0:52] <alphe> since I don't have a2enmod (or its path isn't loaded in my shell's env)
[0:52] <joshd> sage: should merge_left() use the max last_read_tid just like last_write_tid?
[0:53] * piti (~piti@82.246.190.142) Quit (Ping timeout: 480 seconds)
[0:53] <sage> joshd: we disallow merges entirely for rx buffers..
[0:54] <alphe> my problem: the windows client that allows me to mount an s3-amazon drive as a mapped drive is not "stable"; for example after 30 minutes of file transfers I go down from 10 connection threads to 2 ...
[0:54] <alphe> for no reason ...
[0:54] <sage> and we only do exact matches on the read tid
[0:55] <alphe> my workaround: try to get the ssl layer of my s3-amazon share working ... and my only guide to achieve that is the ceph.com documentation
[0:55] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Read error: Connection reset by peer)
[0:55] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[0:55] * ChanServ sets mode +v andreask
[0:56] <alphe> when I get the ssl layer working then I will be able to use the wide choice of s3 clients
[0:56] <paravoid> dmick: #6049 for dumpling?
[0:56] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) Quit (Quit: ...)
[0:56] <dmick> paravoid: yes?
[0:56] <joshd> sage: ok, right with the exact match only that makes sense
[0:58] <joshd> sage: looks ok to me assuming it passes librbd fsx and ceph_test_objectcacher_stress
[0:59] * AfC (~andrew@2407:7800:200:1011:205e:9d22:671d:4992) Quit (Read error: No route to host)
[0:59] * AfC (~andrew@2407:7800:200:1011:205e:9d22:671d:4992) has joined #ceph
[0:59] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Read error: Operation timed out)
[1:00] * AfC (~andrew@2407:7800:200:1011:205e:9d22:671d:4992) Quit ()
[1:03] <dmick> paravoid: what did you want to know about 6049?
[1:03] <sage> joshd: thanks, i'll kick off the rbd suite against it
[1:04] * alram (~alram@cpe-76-167-50-51.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[1:11] * LeaChim (~LeaChim@176.24.168.228) Quit (Ping timeout: 480 seconds)
[1:15] <yehudasa_> sage: any luck?
[1:15] <sage> closer
[1:18] * mozg (~andrei@host109-151-35-94.range109-151.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[1:18] <sage> i'm writing a wrapper around yasm so that libtool can call it
[1:19] <yehudasa_> you think that's going to fix it?
[1:19] <sage> it will link!
[1:19] <alphe> dmick I understand what you say well I think a2enmod and a2ensite are ubuntu tools ...
[1:19] <alphe> fastcgi ssl and rewrite are loaded anyway
[1:19] <sage> the symbols appear in the .libs/libwhatever.a file
[1:22] <sage> links!
[1:22] <sage> now to make cpu detection work
[1:25] <yehudasa_> hmm.. sage, when not using all that magic, and using .o as the object suffix it warns about *** objects common/crc32c_intel_fast_asm.o is not portable!
[1:25] <yehudasa_> my guess is that it's the lack of -fPIC
[1:26] <sage> yeah, and i ultimately couldn't make libtool link it in
[1:26] <sage> even with that warning
[1:26] <sage> instead i let it run the assembler and wrote a wrapper to filter out weird args
[1:26] <sage> and voila it works
[1:27] * flickerdown|2 (~flickerdo@97-95-180-66.dhcp.oxfr.ma.charter.com) has joined #ceph
[1:34] * flickerdown (~flickerdo@westford-nat.juniper.net) Quit (Ping timeout: 480 seconds)
[1:34] * adam2 (~adam@46-65-111-12.zone16.bethere.co.uk) has joined #ceph
[1:38] * zhyan_ (~zhyan@101.82.104.206) Quit (Ping timeout: 480 seconds)
[1:40] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:41] * adam1 (~adam@46-65-111-12.zone16.bethere.co.uk) Quit (Ping timeout: 480 seconds)
[1:45] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[1:45] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[1:53] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[1:54] * torment2 (~torment@pool-173-78-201-127.tampfl.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[2:04] <sage> behold: better crc32c code and autotools black magic: https://github.com/ceph/ceph/pull/519
[2:05] * torment2 (~torment@pool-173-78-201-127.tampfl.fios.verizon.net) has joined #ceph
[2:05] <sage> make that https://github.com/ceph/ceph/pull/520
[2:06] * piti (~piti@82.246.190.142) has joined #ceph
[2:10] * Cube1 (~Cube@88.128.80.12) has joined #ceph
[2:10] * Cube (~Cube@88.128.80.12) Quit (Read error: Connection reset by peer)
[2:22] * rafael (~rafael@ip70-162-218-204.ph.ph.cox.net) has joined #ceph
[2:29] * devoid (~devoid@130.202.135.225) Quit (Quit: Leaving.)
[2:33] * Dark-Ace-Z (~BillyMays@50.107.55.36) has joined #ceph
[2:34] * DarkAce-Z (~BillyMays@50.107.55.36) Quit (Ping timeout: 480 seconds)
[2:35] * AfC (~andrew@2407:7800:200:1011:6184:136a:9897:663b) has joined #ceph
[2:37] <dmick> judgmental commit messages. I like it.
[2:46] <rafael> is there a reason why "ceph-deploy create foo:srv/bar" succeeds but my cluster still shows 0 OSD's up?
[2:47] * madkiss (~madkiss@64.125.181.92) Quit (Quit: Leaving.)
[2:48] * Cube (~Cube@88.128.80.12) has joined #ceph
[2:48] * Cube1 (~Cube@88.128.80.12) Quit (Read error: Connection reset by peer)
[2:48] <rafael> overall ceph-deploy is a pretty frustrating tool
[2:48] <alfredodeza> rafael: what version of ceph-deploy are you using?
[2:49] <rafael> 1.2.1
[2:50] <alfredodeza> and I take it that when you say "ceph-deploy create" you really mean "ceph-deploy osd create" ?
[2:50] <rafael> yup sorry
[2:51] <rafael> ceph-deploy osd create ceph{1,2}:/srv/osd
[2:51] <alfredodeza> also useful would be a bit more context on what you did before getting there
[2:51] <rafael> that succeeds
[2:51] <alfredodeza> and what your logs are saying
[2:51] <rafael> two node cluster and I ran through the ceph-deploy documentation page
[2:51] <rafael> http://ceph.com/docs/next/start/quick-ceph-deploy/
[2:52] <alfredodeza> what about your logs?
[2:52] <rafael> gathering it.. one minute
[2:52] * yanzheng (~zhyan@134.134.137.73) has joined #ceph
[2:53] <rafael> https://gist.github.com/rferreira/6289200
[2:54] <rafael> https://gist.github.com/rferreira/6289207
[2:55] <rafael> last, but not least: https://gist.github.com/rferreira/6289215
[2:55] <rafael> don't get me wrong, ceph-deploy looks very promising
[2:56] <rafael> but ceph is already fairly complex
[2:56] <rafael> and if ceph-deploy doesn't work just right it just makes things messier
[2:56] <rafael> and I'm sure this is inevitably going to be something that I did wrong
[2:57] <rafael> I would just like to know what
[2:59] <alphe> ok got the ssl layer working the fastcgi loads so I m on the right track I think
[3:00] <alphe> now I need the radosgw install and setup fase
[3:00] <alphe> then bridging apache/ssl/fastcgi modules to radosgw :)
[3:00] * yy-nm (~Thunderbi@122.233.231.235) has joined #ceph
[3:01] * smiley (~smiley@cpe-67-251-108-92.stny.res.rr.com) Quit (Quit: smiley)
[3:01] <alphe> along the way I took a good amount of notes, so I could propose documentation to replace the current one if you like.
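For reference, a minimal sketch of what the SSL-enabled radosgw virtual host being pieced together above can look like on a packaged Ubuntu apache2 (the socket path, hostname and certificate paths are assumptions, not taken from this conversation, and the rewrite rule is only an approximation of the one in the ceph.com docs):

    # /etc/apache2/sites-available/rgw.conf (hypothetical)
    FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/radosgw.sock
    <VirtualHost *:443>
        ServerName s3.example.com
        SSLEngine on
        SSLCertificateFile    /etc/apache2/ssl/apache.crt
        SSLCertificateKeyFile /etc/apache2/ssl/apache.key
        DocumentRoot /var/www
        RewriteEngine On
        RewriteRule ^/(.*) /s3gw.fcgi?%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
    </VirtualHost>

    # on a packaged apache2 the modules/site are enabled with something like:
    #   a2enmod ssl rewrite fastcgi && a2ensite rgw.conf && service apache2 restart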
[3:05] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[3:07] <alphe> rafael do you have monitors on your cluster ?
[3:07] <rafael> there should have been two
[3:07] <alphe> do you have disks setup ?
[3:07] <rafael> yup
[3:07] <rafael> I'm just using a directory
[3:07] <rafael> /srv/osd
[3:07] <rafael> on both nodes
[3:07] <rafael> xfs
[3:08] <dmick> your problem is obviously that the osd's aren't running. Do you have a log from ceph-deploy that shows its debug from running the osd create?
[3:08] <alphe> hum iptables and selinux ?
[3:08] <alphe> any mds ?
[3:08] <rafael> hmmmm
[3:08] <rafael> MDS is up as well
[3:08] <rafael> but apparmor...
[3:08] <rafael> let me check that
[3:09] <alphe> i would do netstat on all the machines to see what happens
[3:09] <alphe> could be a netstat -altn
[3:09] <rafael> got it, one sec, here's the mon and MDS running https://gist.github.com/rferreira/6289289
[3:10] <dmick> rafael: are the osd's on separate machines from the monitor?
[3:10] <rafael> same machines
[3:10] <dmick> mds is irrelevant
[3:10] <rafael> 2 machines, 2 osd and 2 monitors 1 mds (irrelevant)
[3:10] <alphe> dmick don t be so hard with mds ... it is a poor creature !
[3:10] <dmick> 2 mons is not good
[3:10] <dmick> 1, or 3; 2 is right out
[3:11] <dmick> so ceph-osd is not running on either machine?
[3:11] <alphe> ps -aux | grep osd ?
[3:11] <alphe> ok running out
[3:11] <rafael> netstat https://gist.github.com/rferreira/6289299
[3:11] <alphe> goodbye all !
[3:11] <rafael> later!
[3:11] <dmick> looks like ceph2 might have one
[3:11] * alphe (~alphe@0001ac6f.user.oftc.net) Quit (Quit: Leaving)
[3:12] <dmick> 0.0.0.0:6800. or that might be the mds
[3:12] <rafael> yup
[3:12] <rafael> MDS on ceph2
[3:12] <dmick> ok nm
[3:12] <rafael> for no good reason
[3:12] <dmick> so are there /var/log/ceph entries for either osd?
[3:12] <rafael> I'm shutting down app armor and rebooting
[3:12] <dmick> s/entries/files/
[3:12] <dmick> apparmor can't hurt.
[3:12] <dmick> I mean shutting it down can't hurt
[3:13] <rafael> there's a strange empty /var/log/ceph/ceph-osd..log
[3:13] <rafael> notice the dual ..
[3:13] <dmick> yeah
[3:13] <dmick> neither machine has a real osd log?
[3:13] <rafael> same thing on both nodes
[3:13] <rafael> nope
[3:13] <kraken> http://i.imgur.com/foEHo.gif
[3:13] <rafael> that's how I feel
[3:13] <rafael> :)
[3:14] <dmick> hm. so are there dirs in /var/lib/ceph that look like they're setting up osds?
[3:14] <dmick> /var/lib/ceph/osd/ceph-0 and/or ceph-1, for instance?
[3:14] <rafael> give me a sec
[3:14] <rafael> I rebooted everything
[3:16] <rafael> everything is back up but still no odd
[3:16] <rafael> OSD
[3:16] <rafael> /var/lib/ceph/osd/ is empty
[3:16] <rafael> let me post my ceph.conf file too
[3:16] <rafael> maybe is something I messed up there
[3:16] <rafael> but again
[3:16] <rafael> ceph-deploy was all happy
[3:16] <rafael> creating the osd
[3:17] <rafael> https://gist.github.com/rferreira/6289331
[3:17] <dmick> so this could be that, since you're using paths, ceph-deploy is assuming you're going to mount those filesystems where you configured them. again, do you have a ceph-deploy log?
[3:23] <rafael> this one? https://gist.github.com/rferreira/6289371
[3:23] <rafael> the filesystems are in fstab
[3:24] <rafael> and getting mounted at boot
[3:24] <rafael> I can try again with the device name
[3:26] <dmick> yeah, that one
[3:26] * Ilya_Bolotin (~ibolotin@38.122.20.226) has left #ceph
[3:26] * lightspeed (~lightspee@81.187.0.153) Quit (Ping timeout: 480 seconds)
[3:27] <dmick> /srv/osd are just plain paths, not mounts, right?
[3:27] <rafael> they are mounts
[3:27] <rafael> I tried this https://gist.github.com/rferreira/6289387
[3:27] <rafael> and still no osd
[3:28] <rafael> sorry for being such a pest guys
[3:28] <rafael> but I've been stuck here for 2 days now
[3:28] <rafael> many rebuilds
[3:28] <dmick> so osd activation is tricky
[3:28] <dmick> basically, it sets things up to be triggered by udev
[3:28] <rafael> hmmm
[3:28] <dmick> for actual block devices/partitions
[3:29] <dmick> and kicks udev in the head, which goes and runs rules from 95-ceph-whatever
[3:29] <dmick> that actually runs the stuff that mounts the devices on /var/lib/ceph/osd/* and starts the daemons
[3:30] <dmick> for paths, since the mount doesn't need to happen, things are simpler/different
[3:30] <dmick> in all cases, /var/lib/ceph/osd/* has to exist and contain osd stuff
[3:30] <dmick> by the time the daemon starts up
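A quick way to check the state dmick describes, as a sketch (osd ids and paths depend on the cluster):

    # a prepared osd data dir contains fsid, whoami, current/, etc.
    ls /var/lib/ceph/osd/
    ls /var/lib/ceph/osd/ceph-0/
    # are the daemons running, and did they write anything useful?
    ps aux | grep ceph-osd
    ls -l /var/log/ceph/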
[3:31] <rafael> still, this is a standard ubuntu 12.04 LTS
[3:31] <dmick> but it looks like you're using devices in all cases I've seen here, so, let's focus on that
[3:31] <rafael> most of that should hopefully work reliably
[3:31] <dmick> what state are you in now (what was the last create you tried at this point)?
[3:32] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[3:32] <rafael> I tried the device on ceph1
[3:32] <rafael> and ceph-deploy worked but no OSDs are up
[3:32] <rafael> the cluster is still health_err
[3:33] <dmick> I was hoping for something more specific than "the device"
[3:33] * dpippenger1 (~riven@tenant.pas.idealab.com) Quit (Remote host closed the connection)
[3:33] <rafael> err sorry
[3:33] <rafael> https://gist.github.com/rferreira/6289387
[3:33] <dmick> ok. is /dev/vdb1 mounted?
[3:34] <dmick> i.e. in mount output?
[3:34] <rafael> checking
[3:34] <rafael> nope
[3:34] <dmick> mount it somewhere separate, like /mnt
[3:34] <rafael> not mounted, I unmounted it
[3:34] <rafael> before running the ceph-deploy
[3:34] <dmick> wait, you unmounted it?
[3:34] <dmick> oh
[3:35] <rafael> I did when I tried the ceph-deploy with the block device instead of the dir
[3:35] <dmick> the activate (part of osd create) should have caused it to be mounted
[3:35] <dmick> so mount it on /mnt and let's look at it
[3:35] <rafael> doing it
[3:35] <rafael> three files, ceph_fsid fsid and magic
[3:36] <dmick> ah. that's not enough.
[3:36] <dmick> so for whatever reason the creation was not successful
[3:36] <dmick> where do you expect journals to go?
[3:36] <rafael> I didn't expect
[3:36] <rafael> I didn't specify
[3:37] <dmick> is there anything else on other partitions of vdb?
[3:37] <rafael> nope that's the only partition
[3:37] <rafael> I did mess around with zapping it
[3:37] <rafael> at one point and it just made a bigger mess with the GPT partition table
[3:37] <dmick> so, while I would have expected errors,
[3:37] <dmick> I don't think ceph-deploy can deal with being given a partition for data and nothing for journal
[3:38] <rafael> oooh
[3:38] <rafael> well I can fix that
[3:38] <dmick> it can take the entire dev, at which point it will partition it and put both data and journal on it
[3:38] <rafael> oh
[3:38] <dmick> so I would 1) zap, and 2) specify vdb
[3:38] <rafael> so what's the right way here
[3:38] <rafael> k
[3:38] <rafael> I can handle that
[3:38] <dmick> (that is not vdb1)
[3:38] <rafael> standby
[3:38] * xmltok_ (~xmltok@pool101.bizrate.com) Quit (Remote host closed the connection)
[3:38] <rafael> yup got it
[3:38] <dmick> (and let c-d partition)
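The sequence dmick is suggesting, sketched with the hostname and device from this conversation (syntax as used with ceph-deploy of this era):

    ceph-deploy disk zap ceph1:vdb      # wipe the old GPT/partition table
    ceph-deploy osd create ceph1:vdb    # whole device: ceph-deploy partitions it for data + journal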
[3:38] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[3:39] * rafael2 (rafael2@a.clients.kiwiirc.com) has joined #ceph
[3:39] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[3:40] <rafael> I'm switching computers
[3:40] <rafael> give me 1 min
[3:40] * rafael (~rafael@ip70-162-218-204.ph.ph.cox.net) has left #ceph
[3:41] <dmick> gonna have to jet, sorry.
[3:41] <rafael2> no worries thanks
[3:46] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:05] <rafael2> yeah snake eyes
[4:12] * julian (~julian@125.69.106.188) has joined #ceph
[4:12] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[4:22] * rafael2 is now known as yeah
[4:22] * yeah is now known as cleverfoo
[4:22] <cleverfoo> yeah I'm giving up..
[4:23] * jjgalvez (~jjgalvez@ip72-193-217-254.lv.lv.cox.net) Quit (Quit: Leaving.)
[4:28] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[4:31] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[4:32] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[4:33] * xmltok_ (~xmltok@pool101.bizrate.com) has joined #ceph
[4:36] * mikedawson_ (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[4:38] * cleverfoo (rafael2@a.clients.kiwiirc.com) Quit (Quit: http://www.kiwiirc.com/ - A hand crafted IRC client)
[4:38] * jjgalvez (~jjgalvez@ip72-193-217-254.lv.lv.cox.net) has joined #ceph
[4:38] * xmltok (~xmltok@pool101.bizrate.com) Quit (Ping timeout: 480 seconds)
[4:42] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[4:44] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[4:49] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[4:52] * jjgalvez (~jjgalvez@ip72-193-217-254.lv.lv.cox.net) Quit (Quit: Leaving.)
[5:05] * fireD_ (~fireD@93-142-214-218.adsl.net.t-com.hr) has joined #ceph
[5:06] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[5:07] * fireD (~fireD@93-139-168-118.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:08] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[5:12] * xmltok_ (~xmltok@pool101.bizrate.com) Quit (Ping timeout: 480 seconds)
[5:13] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[5:20] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[5:21] * jlogan2 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[5:21] * haomaiwang (~haomaiwan@125.108.229.13) has joined #ceph
[5:24] * rafael (~rafael@ip70-162-218-204.ph.ph.cox.net) has joined #ceph
[5:24] * Cube (~Cube@88.128.80.12) Quit (Quit: Leaving.)
[5:24] * haomaiwang (~haomaiwan@125.108.229.13) Quit (Read error: Connection reset by peer)
[5:24] <rafael> does anyone know why ceph-deploy doesn't update the ceph.conf file?
[5:25] * haomaiwang (~haomaiwan@125.108.229.13) has joined #ceph
[5:31] * haomaiwa_ (~haomaiwan@125.108.229.13) has joined #ceph
[5:31] * haomaiwang (~haomaiwan@125.108.229.13) Quit (Read error: Connection reset by peer)
[5:33] * haomaiwang (~haomaiwan@125.108.229.13) has joined #ceph
[5:33] * haomaiwa_ (~haomaiwan@125.108.229.13) Quit (Read error: Connection reset by peer)
[5:34] * haomaiwa_ (~haomaiwan@125.108.229.13) has joined #ceph
[5:34] * haomaiwang (~haomaiwan@125.108.229.13) Quit (Read error: Connection reset by peer)
[5:35] * haomaiwang (~haomaiwan@125.108.229.13) has joined #ceph
[5:35] * haomaiwa_ (~haomaiwan@125.108.229.13) Quit (Read error: Connection reset by peer)
[5:42] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[5:42] <grepory> wrapping my head around pgp_num is killing me.
[5:42] <grepory> it might just be exhaustion. idk
[5:44] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Read error: Connection reset by peer)
[5:45] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[6:03] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[6:06] <sage> rafael: it doesn't need to
[6:10] * rafael (~rafael@ip70-162-218-204.ph.ph.cox.net) Quit (Quit: rafael)
[6:30] <grepory> oh. so if i have two pools, and i should have 1500 pgs… and i expect pool a to store roughly 90 percent of all objects, then pg_num would be 1500, and for pool a pgp_num would be 1500 * .9?
[6:34] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:37] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[6:41] * smiley (~smiley@cpe-67-251-108-92.stny.res.rr.com) has joined #ceph
[6:42] <grepory> nope!
[6:42] <grepory> heh. this is fun
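For what it's worth, pg_num and pgp_num are per-pool settings, and pgp_num is normally kept equal to that pool's pg_num rather than being a fraction of a cluster-wide total; a sketch with a hypothetical pool name:

    ceph osd pool create mypool 1500 1500     # pg_num and pgp_num for this pool
    ceph osd pool set mypool pg_num 2048      # when growing a pool later...
    ceph osd pool set mypool pgp_num 2048     # ...raise pgp_num to match so placement actually rebalances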
[6:42] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Quit: Leaving...)
[6:44] * jjgalvez (~jjgalvez@ip72-193-217-254.lv.lv.cox.net) has joined #ceph
[6:51] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[6:52] <Qu310> Hi All, when using journal on the same disk as the osd is it best to use a partition or file based journal?
[6:53] <sage> Qu310: partition is faster
[6:54] <sage> which is why ceph-deploy/ceph-disk default to that
[6:57] * madkiss (~madkiss@184.105.243.180) has joined #ceph
[6:57] <lurbs> Speaking of journals on partitions, I saw a proposed patch on bug 5599 that'll cause ceph-disk to issue a partprobe on a journal's block device so that it can then see the newly created partition.
[7:06] <Qu310> sage: thanks
[7:06] <sage> lurbs: yeah not sure when/why that is needed as i haven't seen the need on any of our systems
[7:06] <sage> haven't looked closely yet
[7:07] <Qu310> sage: so ceph doesn't need a filesystem on the journal partition?
[7:07] <sage> right
[7:07] <sage> it is a raw file or partition or device
[7:07] <Qu310> nice
[7:07] <Qu310> ah
[7:07] <Qu310> thanks again
[7:10] <lurbs> sage: Odd, it causes consistent breakage when using ceph-deploy for us - the creation of the second and subsequent journals on the same block device fails.
[7:10] <sage> not having that partprobe breaks it, you mean?
[7:10] <lurbs> Yeah.
[7:10] <sage> what is the distro?
[7:11] <sage> and what kind of devices are they?
[7:12] <lurbs> 12.04 LTS. Journal devices are SSDs, although we can replicate the failure on pretty much any block device we give it.
[7:12] <lurbs> The partition gets created, but /dev isn't populated, the mkjournal fails.
[7:13] <sage> do you remember the bug number?
[7:13] <lurbs> http://tracker.ceph.com/issues/5599
[7:13] <sage> are you mark?
[7:13] <lurbs> Nope, he's a workmate.
[7:15] <sage> it's strange that sgdisk isn't triggering a probe.
[7:15] <sage> is there perhaps a partition on the device that is a mounted fs?
[7:15] <lurbs> The only thing that those devices are being used for is Ceph journals.
[7:16] <sage> do you mind testing this patch instead? http://fpaste.org/33668/77062183/
[7:16] <lurbs> Sure.
[7:16] <sage> i think that's a better place to do it
[7:18] <sage> thanks!
[7:24] <lurbs> Yeah, that seems to work.
[7:29] <lurbs> Thanks for that, hopefully it or something similar will make it into 0.67.2. :)
[7:31] <lurbs> We've just been manually pushing out a patched /usr/sbin/ceph-disk, which works but is a little suboptimal.
[7:35] * alexxy[home] (~alexxy@masq118.gtn.ru) has joined #ceph
[7:37] <sage> i'll stick it in master and see if it passes all of our tests
[7:39] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Ping timeout: 480 seconds)
[7:41] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[7:42] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[7:44] <Qu310> sage: any real reason why one would have a journal bigger than 1gb?
[7:44] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Quit: Konversation terminated!)
[7:45] <sage> under high io load that is only ~10 seconds between syncs, which can lead to the journal filling and associated slowness to compensate
[7:45] <sage> 5gb is the default now, iirc
[7:45] <Qu310> ah
[7:45] <lurbs> Qu310: http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings has a brief description on how to size the journals.
[7:45] <Qu310> so 10 should be heaps
[7:46] <Qu310> yeah just stumbled across that link
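The sizing guidance in that doc amounts to roughly osd journal size >= 2 * (expected throughput * filestore max sync interval); a hedged ceph.conf sketch (the numbers are illustrative, and the value is in MB):

    [osd]
    # e.g. ~100 MB/s of sustained writes * 5 s default sync interval * 2 ~= 1 GB minimum,
    # so a few GB leaves comfortable headroom
    osd journal size = 10240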
[7:46] <sage> yanzheng: pushed to wip-coverity
[7:46] * alexxy[home] (~alexxy@masq118.gtn.ru) Quit (Ping timeout: 480 seconds)
[7:47] <yanzheng> I didn't fix "time of check to time of use" issues
[7:50] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[7:51] <yanzheng> sorry, send the same patch again
[7:51] * tnt (~tnt@109.130.102.13) has joined #ceph
[8:04] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[8:06] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[8:07] * iggy (~iggy@theiggy.com) Quit (Remote host closed the connection)
[8:07] * iggy (~iggy@theiggy.com) has joined #ceph
[8:16] * sherry (~sherry@wireless-nat-10.auckland.ac.nz) has joined #ceph
[8:17] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[8:17] <sherry> hi, I want to start developing ceph, what is the most suitable IDE for ubuntu 13.04?
[8:23] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[8:25] * odyssey4me (~odyssey4m@165.233.71.2) has joined #ceph
[8:26] <sherry> sorry but I'm new to ceph and I want to start developing ceph, what is the most suitable IDE for ubuntu 13.04?
[8:28] * alexxy[home] (~alexxy@masq118.gtn.ru) has joined #ceph
[8:33] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Ping timeout: 480 seconds)
[8:33] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[8:34] * sherry (~sherry@wireless-nat-10.auckland.ac.nz) Quit (Quit: Konversation terminated!)
[8:34] <yanzheng> I guess most devs use vim
[8:40] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Quit: Bye!)
[8:43] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[8:44] * alexxy[home] (~alexxy@masq118.gtn.ru) Quit (Read error: Operation timed out)
[8:50] * smiley (~smiley@cpe-67-251-108-92.stny.res.rr.com) Quit (Quit: smiley)
[8:52] * Meths_ (rift@2.25.193.59) has joined #ceph
[8:54] * Meths (rift@2.25.214.150) Quit (Read error: Operation timed out)
[9:08] * jbd_ (~jbd_@2001:41d0:52:a00::77) has joined #ceph
[9:11] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:18] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[9:18] * ChanServ sets mode +v andreask
[9:19] * zetheroo (~zeth@home.meteotest.ch) has joined #ceph
[9:26] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[9:31] * vipr (~vipr@78-23-114-68.access.telenet.be) has joined #ceph
[9:38] * mschiff (~mschiff@p4FD7DCC6.dip0.t-ipconnect.de) has joined #ceph
[9:40] * madkiss (~madkiss@184.105.243.180) Quit (Quit: Leaving.)
[9:41] <zetheroo> with ceph, is split brain a possible issue?
[9:45] <yanzheng> no
[9:48] * AfC (~andrew@2407:7800:200:1011:6184:136a:9897:663b) Quit (Ping timeout: 480 seconds)
[9:50] <zetheroo> currently we are using glusterfs and are looking at possibly replacing this setup with ceph ... but I am trying to understand how ceph compares ...
[9:53] <zetheroo> we have an LVM pool on each server and then a xfs filesystem on top - this is then replicated to the next server
[9:54] <zetheroo> I am wondering if with ceph, do you hand it the disks without any partitioning etc and then perform your LVM or other partitioning on top?
[10:00] <yanzheng> ceph stores data on top of local filesystem
[10:01] <yanzheng> usually, one disk -> one filesystem -> one ceph-osd
[10:01] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[10:02] * ssejour (~sebastien@out-chantepie.fr.clara.net) has joined #ceph
[10:02] <zetheroo> ok, so we could hand over a LVM pool with an xfs partition to ceph
[10:04] <darkfader> yes
[10:05] <darkfader> but if you mean a volume group don't call it a pool. lvm does have pools too
[10:05] <darkfader> for thin provisioning
[10:05] <darkfader> but that's so unstable you could as well store to /dev/null
[10:05] <zetheroo> ok sorry ... vg ...
[10:06] <zetheroo> how does ceph compare to drbd ? ... I hear drbd is prone to split brain ...
[10:06] <yanzheng> one disk one ceph-osd usually has better performance
[10:08] <darkfader> ceph has "monitors" (1, 3, 5 ... as many as you need, always an odd number) that talk to each other, so a full split brain isn't very likely
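A minimal ceph.conf sketch of a three-monitor setup like that (names and addresses are hypothetical):

    [global]
    mon initial members = mon-a, mon-b, mon-c
    mon host = 10.0.0.1, 10.0.0.2, 10.0.0.3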
[10:12] * jjgalvez (~jjgalvez@ip72-193-217-254.lv.lv.cox.net) Quit (Quit: Leaving.)
[10:14] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[10:14] * DLange (~DLange@dlange.user.oftc.net) Quit (Quit: witty message goes here)
[10:16] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[10:20] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[10:25] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[10:27] * Almaty (~san@81.17.168.194) has joined #ceph
[10:36] * tnt (~tnt@109.130.102.13) Quit (Ping timeout: 480 seconds)
[10:43] <loicd> joao: good morning :-) Running teuthology task mon_recovery it fails with "[WRN] message from mon.0 was stamped 0.250693s in the
[10:43] <loicd> future, clocks not synchronized" in cluster log". Using today's master. Does that ring a bell ? I'll try again anyway :-)
[10:43] <ccourtaut> morning
[10:43] <loicd> ccourtaut: \o
[10:43] <ccourtaut> loicd: o/
[10:44] * LeaChim (~LeaChim@176.24.168.228) has joined #ceph
[10:50] * Cube (~Cube@et-0-30.gw-nat.bs.kae.de.oneandone.net) has joined #ceph
[10:52] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[10:53] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[10:53] * SpamapS (~clint@xencbyrum2.srihosting.com) Quit (Read error: Operation timed out)
[10:53] * SpamapS (~clint@xencbyrum2.srihosting.com) has joined #ceph
[10:55] * yanzheng (~zhyan@134.134.137.73) Quit (Remote host closed the connection)
[10:55] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Read error: Operation timed out)
[10:58] * baffle (baffle@jump.stenstad.net) Quit (Remote host closed the connection)
[10:58] * baffle (baffle@jump.stenstad.net) has joined #ceph
[10:59] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[11:05] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[11:06] * Dark-Ace-Z (~BillyMays@50.107.55.36) Quit (Ping timeout: 480 seconds)
[11:08] * yy-nm (~Thunderbi@122.233.231.235) Quit (Quit: yy-nm)
[11:13] * ssejour (~sebastien@out-chantepie.fr.clara.net) Quit (Quit: Leaving.)
[11:16] <joao> loicd, you have clock skews among the servers where your monitors live
[11:16] <joao> or maybe latency, but clock skew is the most likely
[11:16] <joao> try running a ceph health detail --format json
[11:16] <joao> it will show you a "timechecks" section
[11:17] <loicd> joao: these are mira machines, I should probably nuke - reboot them to make sure they are properly reset. Thanks :-)
[11:18] <joao> loicd, just check if the monitors are in fact reporting clock skews and, if so, restart ntpd
[11:18] <joao> that ought to be enough (it usually is)
[11:19] <loicd> teuthology uninstalls everything after running which makes it impractical. Is there a way to say : "keep all on error" ?
[11:19] <joao> yes; add 'interactive-on-error: true' to the yaml file
[11:19] <joao> or even '- interactive: true' under the tasks
[11:20] <loicd> \o/
[11:20] <loicd> cool. I should have asked that earlier ...
[11:20] <joao> we probably should have documented it, if not already :p
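A minimal sketch of a job yaml with that flag set (the roles and task list are assumptions; only interactive-on-error and the mon_recovery task come from this conversation):

    interactive-on-error: true
    roles:
    - [mon.a, mon.b, mon.c, osd.0, osd.1, client.0]
    tasks:
    - install:
    - ceph:
    - mon_recovery: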
[11:22] * ssejour (~sebastien@out-chantepie.fr.clara.net) has joined #ceph
[11:39] * Cube (~Cube@et-0-30.gw-nat.bs.kae.de.oneandone.net) Quit (Quit: Leaving.)
[11:41] <loicd> joao: the tasks runs fine after a nuke of the machines. It probably was a leftover from before I locked them.
[12:02] * Dark-Ace-Z (~BillyMays@50.107.55.36) has joined #ceph
[12:14] * gaveen (~gaveen@175.157.128.208) has joined #ceph
[12:17] * gaveen (~gaveen@175.157.128.208) Quit (Remote host closed the connection)
[12:22] <yo61> Morning all
[12:23] <joelio> yo61: morning!
[12:23] <yo61> Am back to my attempt to create a ceph cluster
[12:23] <yo61> :)
[12:23] <yo61> I've installed ceph on all nodes
[12:24] <yo61> firewall ports open on all nodes
[12:24] <yo61> I created a new cluster
[12:27] <joelio> ok, so anything wrong?
[12:27] <yo61> Am still testing...
[12:27] <yo61> (thought there was an issue, found the problem, am retesting)
[12:28] <joelio> n/p
[12:29] <yo61> Deployed first monitor
[12:29] <yo61> Waiting for it to finish
[12:29] <yo61> ceph-create-keys is still running
[12:29] <joelio> yea, give it a minute or two
[12:30] <yo61> It will exit when finished?
[12:32] <joelio> I'm not 100% actually, all I know is I had to leave a minute when rolling out my cluster
[12:32] <joelio> just try and proceed with the next steps
[12:33] <yo61> ceph_deploy.gatherkeys][WARNIN] Unable to find /etc/ceph/ceph.client.admin.keyring on ['ceph01']
[12:33] <andreask> probably waiting for entropy on key generation?
[12:34] <joelio> yea, perhaps.. try and generate some entropy or install rngtools etc..
[12:40] <yo61> Still running...
[12:43] <joelio> shouldn't take this long tbh
[12:44] <yo61> I stopped it, purged and redeployed
[12:44] <yo61> Got haveged running this time
[12:45] <yo61> Hmmm
[12:46] <yo61> Stracing the create-keys process, it says it's not in quorum
[12:47] * foosinn (~stefan@office.unitedcolo.de) has joined #ceph
[12:48] <yo61> So, let me be clear...
[12:48] <yo61> I've creating a 5-node cluster
[12:48] <yo61> Going to have mons on 3 of the 5
[12:48] <yo61> 01, 03, 05
[12:48] <mikedawson_> yo61: ceph-create-keys will not complete without quorum. For quorum, you need at least 2 of your three monitors up and in sync
[12:48] <yo61> <sigh>
[12:49] <yo61> yesterday, I was told I needed to make sure I started one monitor first to make sure there was no fight over which would be master
[12:49] <yo61> So what's the "right" way to do this?
[12:49] <foosinn> hi all, can i use an ssd as the journal for multiple osds? do i have to set up partitions for every osd? and finally: as i understood it, i should have a mirrored device as the journal, is that true?
[12:49] <joelio> for mine, I did ... ceph-deploy -v new vm-ds-0{1,2,3}.mcuk
[12:50] <joelio> 3 mons
[12:50] <joelio> ceph-deploy -v mon create vm-ds-0{1,2,3}.mcuk
[12:50] <joelio> wait!!
[12:50] <joelio> ceph-deploy -v gatherkeys vm-ds-01.mcu
[12:51] <yo61> So, it's OK to create all three mons at once?
[12:52] <joelio> it's what I did and it worked :)
[12:52] <mikedawson_> foosinn: you can use one SSD for multiple journals. I do 3 osds per ssd. I use partitions, but you can choose files instead. I do not use mirrored devices for the journal
[12:52] <yo61> But you gathered keys from just one?
[12:52] <joelio> yo61: yes
[12:53] <yo61> And that will have all the required keys?
[12:53] <joelio> as I said, it worked for me
[12:54] * Cube (~Cube@et-0-30.gw-nat.bs.kae.de.oneandone.net) has joined #ceph
[12:54] <foosinn> mikedawson_, can i simply add the path to the folder to the ceph-deploy osd command?
[12:54] <mikedawson_> foosinn: don't know. I haven't used ceph-deploy
[12:55] <foosinn> ok, thanks anyway :)
[12:55] <yo61> OK, so I gathered keys
[12:58] <yo61> Bloody hell, it seems to be working :)
[12:58] <joelio> tadaaaa :)
[12:59] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:59] <yo61> Need to work out how to get the OSDs to use a different network...
[13:00] * yanzheng (~zhyan@101.82.167.152) has joined #ceph
[13:00] <yo61> OK, so I've deployed OSDs to all 5 nodes
[13:01] <joelio> yo61: http://ceph.com/docs/master/rados/configuration/network-config-ref/
[13:01] <joelio> not sure how you do that via ceph-deploy tbh, I'm isolated so not bothered with that
[13:02] <yo61> Does ceph-deploy use the local ceph.conf?
[13:03] <joelio> I don't think it reads ceph.conf, no - it'll just set up the initial monitor members
[13:03] <joelio> I use config pull, edit, config push (with overwrite) to admin mine
[13:04] <yo61> OK
[13:04] <joelio> I'm not entirely certain that's best practice, but in the absence of something conclusive and it working, I'm rolling with it :)
[13:06] <yo61> How do I check cluster status from the admin machine?
[13:07] <joelio> ceph health
[13:07] <joelio> ceph -s
[13:07] <joelio> ceph -w
[13:07] <joelio> etc.
[13:07] <yo61> I get a missing keyring error - do I need to set something up locally?
[13:08] <joelio> set the machine to be admin host
[13:08] <joelio> ceph-deploy -v admin {host}
[13:08] <yo61> Yay
[13:13] <yo61> joelio: what do you mean by "isolated" ?
[13:13] <joelio> I mean it's an isolated, internal network
[13:13] <yo61> Ah, OK
[13:13] <yo61> Mine too
[13:13] <yo61> So, is the one config file shared across all hosts?
[13:14] <yo61> So you set up the config and push it to all hosts?
[13:14] * jabadia_ (~jabadia@194.90.7.244) has joined #ceph
[13:14] <joelio> so I have mitigated public/private by making it all private. The compute nodes are the only things that talk to the cluster, they're all internal too - but adequate isolation
[13:14] <jabadia_> any easy way to verify Rados GW is up and running ? ( for nagios )
[13:14] <joelio> yo61: yea, any changes that you make on your local copy will need to be pushed
[13:15] <paravoid> jabadia_: I do an end-to-end check by having a special monitoring container and an object in there
[13:15] <paravoid> jabadia_: and then check_http fetching that file
[13:15] <yo61> Any chance of getting a copy of your config file?
[13:15] <jabadia_> paravoid: Thanks, so you keep this special container there always?
[13:16] <paravoid> yes
[13:16] <jabadia_> paravoid, Great, thanks.
[13:16] <paravoid> with read access * to everyone
[13:16] <paravoid> you can also check_procs radosgw of course
[13:17] <paravoid> but I have this as an end-to-end check: it checks that apache, mod-fastcgi, radosgw are running, radosgw can connect to ceph and ceph at least partially works
[13:18] <joelio> yo61: it's like minimal, but sure.. https://gist.github.com/joelio/25e5c690f4f74bd634b9
[13:18] * smiley (~smiley@cpe-67-251-108-92.stny.res.rr.com) has joined #ceph
[13:18] <jabadia_> paravoid :sounds great, can you share it?
[13:18] <paravoid> share what?
[13:18] <jabadia_> nagios scripts
[13:18] <yo61> joelio: cheers
[13:19] <paravoid> create a container called "health", put an object named "backend" with contents "OK" there, have a check_http that fetches /health/backend expecting a 200 (or even a body of "OK")
[13:19] <paravoid> that's it
[13:19] <yo61> joelio: oh - very minimal :)
[13:19] <jabadia_> ok, thanks alot
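A sketch of the end-to-end check described here, using the stock nagios check_http plugin (hostname and plugin path are assumptions):

    /usr/lib/nagios/plugins/check_http -H rgw.example.com -u /health/backend -s OK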
[13:20] <joelio> yo61: yup - I also set crush tunables to optimal
[13:20] <yo61> Am trying to figure out how to best configure my OSDs to use the storage network
[13:20] <joelio> you can use an admin socket too to check running config
[13:21] <joelio> xeph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show
[13:21] <joelio> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show
[13:21] <joelio> even
[13:21] * Almaty (~san@81.17.168.194) Quit (Ping timeout: 480 seconds)
[13:21] <yo61> Is that on the OSD node?
[13:21] <joelio> yea, as it needs an osd .asoc
[13:21] <joelio> .asok even, damn fingers
[13:22] <yo61> Ah, I hjave .1.asok
[13:22] <yo61> Is that config specified somewhere?
[13:23] <joelio> just replace .0 with .1
[13:23] <joelio> it all introspects the same info
[13:23] <yo61> So when it reboots, where does it read from?
[13:23] <joelio> read what?
[13:23] <yo61> The config
[13:23] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[13:23] * ChanServ sets mode +v andreask
[13:24] <yo61> Or is that stored in the mon nodes and the osd nodes just pull from there?
[13:24] <joelio> /etc/ceph/ceph.conf and will also look in /var/lib/ceph/osd etc.
[13:24] <joelio> ceph-deploy is just an admin tool, it'll pull config and keys to its current working dir
[13:24] <joelio> don't get confused by that (caught me out)
[13:25] <joelio> when you push to the host, it'll push to /etc/ceph/
[13:25] <joelio> or if you pull config, then it'll end up in $CWD
[13:25] <yo61> OK, so if I set different cluster IPs on the admin machine, then push to the cluster nodes, it will restart the nodes and use the new config?
[13:26] <joelio> that I don't know
[13:26] <joelio> I've not touched network configs, as I mentioned
[13:26] <yo61> k
[13:27] <yo61> Am still fumbling around trying to get a handle on this thing!
[13:27] * jabadia_ (~jabadia@194.90.7.244) Quit (Ping timeout: 480 seconds)
[13:28] <yo61> when you say you use config pull/push/edit - I don't see that listed in ceph --help ?
[13:29] <joelio> ?:
[13:29] <joelio> cehp-deploy config pull {host}
[13:29] <yo61> Ah
[13:29] <yo61> ceph-deploy!
[13:29] <joelio> edit ceph.conf in vi/emacs/pico whatever
[13:29] <joelio> push
[13:31] <yo61> Gotcha
[13:33] <joelio> the push command is a bit peculiar to me, as you have to set it to overwrite config. I don't think I've had a situation where there's not been config on there to overwrite.. as if you use deploy, it'll add config
[13:33] <joelio> right at provisioning stage
[13:33] <joelio> but maybe I'm using it wrong. Still, works great though :)
[13:35] <joelio> to me the tool should be checking existing config on a push, diffing and then asking you to merge changes, but that's just me
[13:35] * julian (~julian@125.69.106.188) Quit (Quit: Leaving)
[13:36] <joelio> get the nasty feeling something else under the hood is being changed
[13:36] <yo61> Hmm, didn't seem to restart the services to read the new config
[13:39] <joelio> yea, it won't, it's a config tool, not a service management tool
[13:39] <joelio> restart ceph-osd-all
[13:39] <joelio> or restart ceph-all etc..
[13:39] <yo61> With what tool?
[13:40] <joelio> restart is upstart
[13:40] <joelio> depends on your distro
[13:40] <joelio> service
[13:40] <joelio> /etc/init.d
[13:40] <joelio> same way as you'd restart any other service :)
[13:40] <yo61> OK, on EL6 I have just one ceph item in /etc/init.d/
[13:40] <joelio> you know, docs do exist - http://ceph.com/docs/master/rados/operations/operating/
[13:40] <joelio> ;)
[13:41] <yo61> Bah
[13:41] <yo61> :)
[13:41] <yo61> As we agreed yesterday, what's missing is the 10,000ft view describing which docs you need to read
[13:42] * sleinen (~Adium@2001:620:0:46:f504:2be1:d863:12b6) has joined #ceph
[13:42] <joelio> is 'all of them' an option? :)
[13:43] <yo61> It's not an ideal option, no
[13:43] <yo61> forest/trees, etc.
[13:51] <ccourtaut> loicd: could you take a quick look to https://github.com/ceph/ceph/pull/522 ?
[13:56] <loicd> I would change the title to better reflect the purpose of the patch
[13:56] <loicd> "enable multiple clusters" or something
[13:57] <ccourtaut> loicd: it allows you to configure the directories used by a ceph cluster launched with vstart.sh
[13:57] <ccourtaut> so you can launch vstart.sh multiple times with differents directories
[13:57] <ccourtaut> but vstart.sh does not directly launch multiple clusters at the same time
[13:57] <loicd> This is clear from the description but the current title is vague : "Adds more ENV variables to configure dev cluster"
[13:58] <ccourtaut> that's why i didn't want to insist on that point
[13:58] <ccourtaut> "Allow to configure cluster working directories" ?
[13:59] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[13:59] <ccourtaut> loicd: ^
[14:02] * mschiff_ (~mschiff@p4FD7DCC6.dip0.t-ipconnect.de) has joined #ceph
[14:02] * mschiff (~mschiff@p4FD7DCC6.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[14:04] <loicd> ccourtaut: the title show state what the purpose of the patch, not the mean. You want to run multiple clusters with vstart : that's what you're after. The mean by which you accomplish this is by configuring vstart with environment variables.
[14:04] <loicd> s/show state/should state/
[14:04] <ccourtaut> ok
[14:04] <ccourtaut> loicd: updated the pull request
[14:08] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:09] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:13] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[14:30] * yanzheng (~zhyan@101.82.167.152) Quit (Ping timeout: 480 seconds)
[14:30] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[14:33] * yanzheng (~zhyan@101.82.248.42) has joined #ceph
[14:45] <Fetch_> anyone have issue IDs handy for the problem(s) Sage alluded to with the bobtail->dumpling upgrade? I ran that upgrade, I just wanted to read up on what I might expect, problem-wise
[14:47] * janos too
[14:47] <janos> Fetch_, how's that upgrade doing for you so far?
[14:49] <Fetch_> good. My issues have revolved more around rbd/openstack regressions than anything osd/mon related (although 0.67.0->0.67.1 did a number on my mons)
[14:50] <janos> i'm hoping to do my upgrade soon
[14:52] <joelio> me too, holding off for a couple of point releases though, keep saying "must not upgrade when not needed".. but it's so shiny!
[14:55] * dosaboy_ (~dosaboy@host109-158-236-83.range109-158.btcentralplus.com) Quit (Quit: leaving)
[15:01] <jmlowe> I'm trying to get a support contract in place before I upgrade
[15:01] * flickerdown (~flickerdo@westford-nat.juniper.net) has joined #ceph
[15:05] * markl (~mark@tpsit.com) Quit (Quit: leaving)
[15:05] * markl (~mark@tpsit.com) has joined #ceph
[15:07] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Read error: Connection reset by peer)
[15:07] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[15:07] * flickerdown|2 (~flickerdo@97-95-180-66.dhcp.oxfr.ma.charter.com) Quit (Ping timeout: 480 seconds)
[15:08] <mikedawson_> Made the upgrade from 0.61.7 -> 0.67.1 this am. Very smooth.
[15:08] <joelio> stop tempting me :)
[15:09] <mikedawson_> joelio: it also didn't fix my main issue... rbd volumes hang when recovering/backfilling
[15:16] * ggreg_ is now known as ggreg
[15:24] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[15:25] * hybrid5121 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[15:25] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[15:30] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[15:31] * jefferai (~quassel@corkblock.jefferai.org) Quit (Ping timeout: 480 seconds)
[15:35] * jefferai (~quassel@corkblock.jefferai.org) has joined #ceph
[15:37] <yo61> Bah, re-trying the same steps I did earlier and 1/3 mon is not coming up
[15:38] <joelio> did you purge * and ensure nothing left over
[15:38] <yo61> Yup
[15:38] <yo61> rebuilt the machines
[15:38] <joelio> logs?
[15:39] * markbby (~Adium@168.94.245.4) has joined #ceph
[15:39] <yo61> ceph -s output: https://gist.github.com/robinbowes/6294559
[15:40] <joelio> 2 is not coming up, what are the logs on that when you try and start the mon?
[15:40] <yo61> 2?
[15:40] <kraken> 2 is not coming up, what are the logs on that when you try and start the mon? (joelio on 08/21/2013 09:40AM)
[15:40] <yo61> logs?
[15:40] <joelio> quorum 0,1 ceph01,ceph03
[15:40] <yo61> mons are 01, 03, 05
[15:41] <joelio> ok, 5 then
[15:41] <yo61> OK
[15:42] <yo61> Hmm, monmap e2: 3 mons at {ceph01=172.16.102.21:6789/0,ceph03=172.16.102.23:6789/0,ceph05=172.16.102.25:6800/0}, election epoch 236, quorum 0,1 ceph01,ceph03
[15:42] <yo61> port 6800?
[15:42] * vipr (~vipr@78-23-114-68.access.telenet.be) Quit (Remote host closed the connection)
[15:43] * allsystemsarego (~allsystem@5-12-37-127.residential.rdsnet.ro) has joined #ceph
[15:43] * Wolff_John (~jwolff@vpn.monarch-beverage.com) has joined #ceph
[15:45] <yo61> OK, it's up now
[15:45] <yo61> (restarted a few times)
[15:45] <yo61> but why is it on port 6800?
[15:45] * allsystemsarego_ (~allsystem@5-12-37-127.residential.rdsnet.ro) has joined #ceph
[15:46] * allsystemsarego_ (~allsystem@5-12-37-127.residential.rdsnet.ro) Quit ()
[15:46] <joelio> le shrug
[15:46] <joelio> no idea I'm afraid
[15:46] * thorus (~jonas@212.114.160.100) has joined #ceph
[15:46] * thorus (~jonas@212.114.160.100) has left #ceph
[15:46] * thorus (~jonas@212.114.160.100) has joined #ceph
[15:46] <yo61> ceph -s also shows this sometimes:
[15:46] <yo61> 2013-08-21 13:46:23.588495 7fcb187ff700 0 -- :/1005720 >> 172.16.102.25:6789/0 pipe(0x7fcb14021380 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fcb140215e0).fault
[15:47] <thorus> How to rebalance the data between the disks? At the moment I have a disk with 55% used and one with 24%, and their size is the same.
[15:48] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[15:50] * vipr (~vipr@78-23-114-68.access.telenet.be) has joined #ceph
[15:50] <absynth> i don't think there is a mechanism for that
[15:50] <absynth> probably you have a couple quite large PGs which hinder an even replication
[15:54] * haomaiwang (~haomaiwan@125.108.229.13) Quit (Ping timeout: 480 seconds)
[15:55] * BillK (~BillK-OFT@58-7-52-33.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[15:55] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[15:56] <joelio> yo61: looks like the original mon that was down is still in the monmap -what does -s say now?
[15:56] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) has joined #ceph
[15:56] * smiley (~smiley@cpe-67-251-108-92.stny.res.rr.com) Quit (Quit: smiley)
[15:58] * madkiss (~madkiss@184.105.243.180) has joined #ceph
[15:58] <yo61> I killed it and re-created it
[15:58] <yo61> Still not come back
[15:59] * yanzheng (~zhyan@101.82.248.42) Quit (Ping timeout: 480 seconds)
[15:59] <yo61> create keys still running
[16:00] <joelio> and you installed an entropy generator?
[16:00] <yo61> Not on this node, no
[16:00] * alram (~alram@cpe-76-167-50-51.socal.res.rr.com) has joined #ceph
[16:04] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[16:05] <topro> anyone else experiencing huge memory footprint of OSD daemons? up to 7GB per osd process (debian wheezy + cuttlefish 0.61.8)
[16:06] <topro> and keeps growing steadily
[16:08] * netsrob (~thorsten@212.224.79.27) has joined #ceph
[16:08] * haomaiwang (~haomaiwan@218.71.125.126) has joined #ceph
[16:09] <netsrob> hi, is the repo for centos and dumpling broken?
[16:11] <absynth> topro: hm, that should have been solved in cuttlefish + some patch
[16:11] <absynth> topro: most memory leaks i read about had to do with scrubbing
[16:11] <absynth> do you see your cluster scrubbing currently?
[16:12] <joelio> yo61: so you're not doing the same as you did last time then?
[16:12] <yo61> I did, originally
[16:12] <joelio> tbh you shouldn't need it
[16:12] <topro> absynth: not right now but it did (even deep I think) a couple of minutes ago. and will scrub again as it does so nearly all the time (1-2 pg get scrubbed nearly all the time)
[16:13] <absynth> yeah
[16:13] <absynth> ok
[16:13] <absynth> disable scrubbing
[16:13] <absynth> restart osds
[16:13] <absynth> and see if the issue persists
[16:14] <absynth> (set min_scrub_interval = max_scrub_interval = 10000000000 or so to effectively disable scrubbing)
[16:14] <absynth> there are various threads on the mailing list
[16:14] <topro> how about "ceph osd set noscrub" ?
[16:14] <absynth> that is not 100% effective, but i don't remember why
[16:14] <absynth> could have been a bug
[16:15] <absynth> or it excluded deep scrubbing
[16:15] <absynth> don't remember really
[16:15] <topro> ok, I'll try as soon as cluster load allows some testing. thanks
[16:15] * haomaiwang (~haomaiwan@218.71.125.126) Quit (Read error: Connection reset by peer)
[16:15] <absynth> the weird thing is that i thought cuttlefish had this fixed generally
[16:16] <topro> my feeling is that it was better with 0.61.7, maybe?
[16:17] <absynth> it should have been fixed before that, really
[16:17] * jeff-YF (~jeffyf@67.23.117.122) has joined #ceph
[16:17] <absynth> last scrub-related bugfix was in 0.61.4
[16:18] <topro> thats what I mean. it was better with 0.61.7(.6, .5, ...) than it is now with 0.61.8?
[16:18] <absynth> that would mean regression
[16:18] * haomaiwang (~haomaiwan@125.108.228.243) has joined #ceph
[16:18] <topro> maybe what I see here is not even scrubbing related. I'll test to see
[16:18] <absynth> you should probably try to get ahold of either sam or sage as soon as they are awake and responsive
[16:18] <absynth> (_if_ you can determine that it is scrub related)
[16:19] <topro> well, testing what you proposed here will take some time as i will have to wait until tomorrow to get real (comparable) cluster load again
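The two approaches mentioned above, sketched; the interval option names below are the standard osd settings (which differ slightly from the shorthand used in the conversation) and the values are just "very large":

    # runtime flags (no restart needed):
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # or via ceph.conf, then restart the osds:
    [osd]
    osd scrub min interval = 1000000000
    osd scrub max interval = 1000000000
    osd deep scrub interval = 1000000000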
[16:24] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[16:24] * vata (~vata@2607:fad8:4:6:40a1:464d:8581:6fb8) has joined #ceph
[16:24] * mikedawson_ (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[16:32] * smiley (~smiley@cpe-67-251-108-92.stny.res.rr.com) has joined #ceph
[16:33] * haomaiwa_ (~haomaiwan@218.71.79.58) has joined #ceph
[16:35] <zetheroo> my plan:
[16:35] <zetheroo> 1. Install Debian Wheezy on 3 identical servers
[16:35] <zetheroo> 2. Install Proxmox 3.0 on all 3 servers via PPA
[16:35] <zetheroo> 3. Install ceph via PPA's and setup ceph cluster
[16:35] <zetheroo> 4. Hook ceph storage up to Proxmox
[16:35] <zetheroo> would this work?
[16:37] * haomaiwang (~haomaiwan@125.108.228.243) Quit (Ping timeout: 480 seconds)
[16:39] * thorus (~jonas@212.114.160.100) has left #ceph
[16:40] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[16:43] * ssejour (~sebastien@out-chantepie.fr.clara.net) Quit (Quit: Leaving.)
[16:48] <tnt> zetheroo: does proxmox use RBD? and if yes, does it use the kernel client?
[16:49] <zetheroo> yes to it using rbd ... don't know about it using a kernel client ...
[16:51] <tnt> if it does then it's highly discouraged to have the rbd.ko module on the same machine as the OSD because of some possible weird interactions AFAIR.
[16:52] * dmsimard (~Adium@108.163.152.2) has joined #ceph
[16:54] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) has joined #ceph
[16:57] * rudolfsteiner (~federicon@181.21.154.81) has joined #ceph
[16:59] * madkiss (~madkiss@184.105.243.180) Quit (Quit: Leaving.)
[16:59] <zetheroo> so ceph cannot be installed on any server running proxmox?
[17:00] <jmlowe> this dependency chain usually ends in tears: kernel module -> user space -> kernel module(s)
[17:00] * ssejour (~sebastien@out-chantepie.fr.clara.net) has joined #ceph
[17:01] <Gugge-47527> zetheroo: sure it can, but if it uses the kernel rbd driver, you can get deadlocks
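(A quick way for zetheroo to see which case applies on a Proxmox node, sketched under the assumption that shell access is available: qemu/librbd guests show "rbd:" drive specs on the qemu command line, while the kernel client shows up as a loaded rbd module and /dev/rbd* devices.)

    lsmod | grep '^rbd'                    # kernel rbd client loaded?
    ls /dev/rbd* 2>/dev/null               # any mapped kernel rbd devices?
    ps aux | grep -o 'rbd:[^ ]*' | head    # qemu processes using librbd directly (the safer case next to local OSDs)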
[17:04] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:07] * golrrr (~SamiTorad@69.174.99.94) has joined #ceph
[17:08] <golrrr> hello !
[17:11] <yo61> I'm trying to understand publc vs. cluster addresses for ceph
[17:13] <mikedawson> yo61: cluster is for osd<->osd replication, public is for clients
[17:13] <yo61> Ah, OK
[17:13] <tnt> note that you don't need to do that at all ... you can just use 1 address for all traffic.
[17:13] <yo61> So, in a "real" install, you would put both on 10G?
[17:14] <tnt> depends on your use case, performance needed, budget, ...
[17:14] <yo61> Yeah, I mean, both would have similar bandwidth requirements
[17:15] <mikedawson> After the upgrade to Dumpling (0.67.1), I now have /var/log/ceph/ceph-osd.X.log for every osd in my cluster on every host. The files for all the non-local OSDs are 0 bytes. But why are they created in the first place?
[17:16] <tnt> yo61: both would carry data traffic.
[17:16] <yo61> OK
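(A minimal ceph.conf sketch of the split being described; the subnets are placeholders, and as tnt says, omitting both lines simply puts all traffic on one network.)

    [global]
        public network  = 10.0.1.0/24    # clients <-> mons/osds
        cluster network = 10.0.2.0/24    # osd <-> osd replication and recovery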
[17:17] <golrrr> hello !
[17:17] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) Quit (Quit: Ex-Chat)
[17:18] * golrrr (~SamiTorad@69.174.99.94) Quit ()
[17:21] <joelio> /window/wi11
[17:21] <joelio> doh
[17:29] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[17:29] * noob2 (~cjh@173.252.71.189) has joined #ceph
[17:30] <noob2> loicd: are you around?
[17:30] <loicd> yes, how are you today ?
[17:30] <noob2> good :), yourself?
[17:30] <noob2> i had a question about your owncloud setup with ceph. how did you configure owncloud to talk to ceph? One giant mount point or s3?
[17:30] <loicd> excellent :-) jerasure does that to people ;-)
[17:31] * zetheroo (~zeth@home.meteotest.ch) has left #ceph
[17:31] <noob2> yes RS encoding rocks :D
[17:31] <loicd> noob2: you mean the installation described at http://dachary.org/?p=2087 ?
[17:31] <noob2> yup
[17:32] * BillK (~BillK-OFT@58-7-52-33.dyn.iinet.net.au) has joined #ceph
[17:32] <loicd> it's using cephfs
[17:32] <noob2> ah, i didn't consider that haha
[17:33] <noob2> i was thinking i could make an rbd mount point and point owncloud at that
[17:33] <noob2> but i wasn't sure how i'd scale it to multiple web servers
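(A rough sketch of the cephfs approach loicd describes: mount the same filesystem on every web node so owncloud's data directory is shared. Hostnames, paths and the key name here are placeholders.)

    # kernel cephfs client; repeat on each web server
    mount -t ceph mon1.example.com:6789:/ /var/www/owncloud/data \
        -o name=admin,secretfile=/etc/ceph/admin.secret

An rbd image, by contrast, can only safely be mounted read-write on one host at a time, which is exactly the multi-webserver scaling problem noob2 is worried about.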
[17:38] * Cube (~Cube@et-0-30.gw-nat.bs.kae.de.oneandone.net) Quit (Quit: Leaving.)
[17:42] <loicd> rados qa-suite is running and watch-suite shows 43 pass and no fail on wip-5510. I feel lucky today :-D
[17:42] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[17:42] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[17:42] * ChanServ sets mode +v andreask
[17:43] * haomaiwang (~haomaiwan@218.71.77.214) has joined #ceph
[17:46] * haomaiwa_ (~haomaiwan@218.71.79.58) Quit (Ping timeout: 480 seconds)
[17:48] * netsrob (~thorsten@212.224.79.27) Quit (Quit: Lost terminal)
[17:48] * haomaiwang (~haomaiwan@218.71.77.214) Quit (Read error: Connection reset by peer)
[17:48] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[17:48] * tnt (~tnt@212-166-48-236.win.be) Quit (Read error: Operation timed out)
[17:51] * haomaiwang (~haomaiwan@218.71.120.193) has joined #ceph
[17:54] * carif (~mcarifio@146-115-183-141.c3-0.wtr-ubr1.sbo-wtr.ma.cable.rcn.com) has joined #ceph
[17:54] * ssejour (~sebastien@out-chantepie.fr.clara.net) Quit (Quit: Leaving.)
[17:56] <yo61> Well, I think I'm nearly there
[17:56] <yo61> Got a 5-node cluster up with 3 mon nodes
[17:56] <joelio> satisfying :)
[17:57] * haomaiwa_ (~haomaiwan@218.71.72.115) has joined #ceph
[17:57] <sage> loicd: there?
[17:57] <loicd> yes
[17:57] <yo61> Am now starting to figure out how to add the rados stuff
[17:57] <sage> loicd: the doc gitbuilder has some problems with teh rst in the ec doc: http://gitbuilder.sepia.ceph.com/gitbuilder-doc/log.cgi?log=a35ab949fd8e6e2e259076d76d0d41742f045398
[17:58] <loicd> ok, I'll fix it right now
[17:58] <joelio> yo61: what are you planning on using the cluster for>?
[17:58] <yo61> Openstack + cloud stack
[17:58] <yo61> Swift storage + block device
[17:59] * haomaiwang (~haomaiwan@218.71.120.193) Quit (Ping timeout: 480 seconds)
[17:59] * indego (~indego@91.232.88.10) Quit (Quit: Leaving)
[18:00] * devoid (~devoid@130.202.135.235) has joined #ceph
[18:00] <loicd> sage: I conveniently ignored the errors that were showing when running bash admin/build-doc I'll be more careful in the future.
[18:00] <sage> hehe
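(For anyone following along: the doc build loicd refers to is just a script in the ceph tree, and running it locally surfaces the same rst errors the doc gitbuilder trips on.)

    cd ceph
    bash admin/build-doc    # watch for sphinx/rst warnings and errors in the output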
[18:01] * carif (~mcarifio@146-115-183-141.c3-0.wtr-ubr1.sbo-wtr.ma.cable.rcn.com) Quit (Quit: Ex-Chat)
[18:02] <joelio> yo61: cool, running ONE here, really like it
[18:03] <yo61> ONE what? :)
[18:03] <joelio> OpenNebula
[18:03] <yo61> Oh, OK
[18:03] * BillK (~BillK-OFT@58-7-52-33.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:03] <joelio> users like it a little too much perhaps.. just had a request in for a VM with 100GB RAM
[18:04] <yo61> Heh
[18:04] <joelio> have approved, purely for 'testing' :D
[18:05] <joelio> normally we just allow them to use whatever we set in the templates, but being able to change resources in flight is a major win
[18:05] <yo61> Is that thin provisioned?
[18:05] <yo61> Or do you have really large hypervisors?
[18:06] * Cube (~Cube@88.128.80.12) has joined #ceph
[18:06] <joelio> 128G in each, which doesn't seem that much nowadays
[18:06] * markbby (~Adium@168.94.245.4) Quit (Quit: Leaving.)
[18:07] * tnt (~tnt@109.130.102.13) has joined #ceph
[18:07] * markbby (~Adium@168.94.245.4) has joined #ceph
[18:07] <yo61> Would like to see that 100G VM migrate :)
[18:07] * joelio recalls 'Under Siege 2' where the villain is trying to crack Segal's palm "A Gig of RAM should do the trick"
[18:07] <janos> hahah
[18:07] <joelio> did a 32G one yesterday, 4 minutes
[18:07] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:08] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[18:08] * foosinn (~stefan@office.unitedcolo.de) Quit (Quit: Leaving)
[18:10] * mschiff_ (~mschiff@p4FD7DCC6.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[18:11] <yo61> So, do I install apache/fcgi/etc on each OSD node?
[18:12] <joelio> rados is s3 stuff, do you need s2?
[18:12] * X3NQ (~X3NQ@195.191.107.205) Quit (Ping timeout: 480 seconds)
[18:12] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[18:12] <joelio> s3 even? VM hosting. block storage is done via rbd
[18:14] <yo61> rados does swift too
[18:14] <joelio> ok, not sure then, not a stacker
[18:14] <yo61> k
[18:14] <joelio> rbd is what you want for VMs though
[18:14] <yo61> bbiab - going for train
[18:14] <joelio> yea, I'm off home too
[18:15] * loicd trying to figure out [ERR] scrub mismatch" in cluster log', http://pastebin.com/cgcrZD9n
[18:16] <loicd> ( one error out of 82 passes in the rados qa-suite, I'm still hopeful ;-)
[18:17] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[18:17] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[18:18] * madkiss (~madkiss@64.125.181.92) has joined #ceph
[18:18] <loicd> no core dump
[18:23] <alfredodeza> docs?
[18:23] <kraken> docs are http://pecan.readthedocs.org/ (alfredodeza on 08/20/2013 04:39PM)
[18:23] <alfredodeza> kraken: forget docs
[18:23] <kraken> roger
[18:32] * jjgalvez (~jjgalvez@ip72-193-217-254.lv.lv.cox.net) has joined #ceph
[18:32] * noob2 (~cjh@173.252.71.189) Quit (Read error: Connection reset by peer)
[18:37] * rudolfsteiner (~federicon@181.21.154.81) Quit (Quit: rudolfsteiner)
[18:38] * mschiff (~mschiff@85.182.236.82) has joined #ceph
[18:49] * hybrid5121 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[18:52] * jmlowe (~Adium@c-98-223-198-138.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[18:52] * rudolfsteiner (~federicon@181.21.164.196) has joined #ceph
[18:54] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[18:59] * gregaf (~Adium@2607:f298:a:607:fd96:c553:1184:eb0a) Quit (Quit: Leaving.)
[18:59] * clayb (~kvirc@199.172.169.79) has joined #ceph
[19:00] <loicd> I see this http://pastebin.com/FyQxGi5i in the mon logs
[19:00] * gregaf (~Adium@2607:f298:a:607:3d75:84b1:f173:d53b) has joined #ceph
[19:01] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[19:03] <mikedawson> sjusthm: ping
[19:03] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: For Sale: Parachute. Only used once, never opened, small stain.)
[19:04] <loicd> joao: these are mon errors, do they ring a bell ? http://pastebin.com/FyQxGi5i ? I can't imagine how / why they could relate to the changes introduced in wip-5510 ( ObjectContext shared_ptr).
[19:05] <loicd> joao: no worries if they don't, I'm reproducing the error to make sure it's not an artifact :-)
[19:05] <joao> loicd, first time I'm seeing this
[19:05] <joao> let me know if this is reproducible
[19:06] <loicd> I will :-)
[19:06] <gregaf> this is reporting mismatches between the monitor leveldb stores (largely shared with the OSD ones), so if you changed anything in there as part of your transformation I'd examine that
[19:06] <gregaf> or it could be some new issue from elsewhere; I don't know how often those tests get run (sage/sagewk will know, as he wrote them ;))
[19:07] <paravoid> yehudasa: ping?
[19:10] <paravoid> sjusthm: or you, I have a different bug for each of you :)
[19:15] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has joined #ceph
[19:21] * nwat (~nwat@eduroam-237-79.ucsc.edu) has joined #ceph
[19:24] <sjusthm> paravoid: I am here
[19:24] <sjusthm> mikedawson: what's up?
[19:25] <paravoid> heh
[19:25] <paravoid> so http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Ceph+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[19:26] <paravoid> one, or maybe two, OSDs have quite elevated CPU usage
[19:26] * jbd_ (~jbd_@2001:41d0:52:a00::77) has left #ceph
[19:26] * ninkotech (~duplo@static-84-242-87-186.net.upcbroadband.cz) Quit (Read error: Connection reset by peer)
[19:26] <paravoid> I tried restarting but it persisted
[19:26] <paravoid> (now I've stopped it completely)
[19:26] * haomaiwang (~haomaiwan@218.71.78.15) has joined #ceph
[19:26] * ninkotech (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[19:26] <paravoid> I tried perf top but there was some unknown symbol with 100% CPU, that was weird
[19:27] <sjusthm> try dumpling head
[19:29] <paravoid> you're suspecting the PGLog thing?
[19:29] <paravoid> that's why I'm pinging you :)
[19:29] <sjusthm> yeah
[19:29] <paravoid> the details elude me obviously; would it affect just one or two out of 144 osds?
[19:30] <sjusthm> no, and it wouldn't cause that much cpu either
[19:30] * haomaiwa_ (~haomaiwan@218.71.72.115) Quit (Ping timeout: 480 seconds)
[19:30] <sjusthm> but it's an easy one to try
[19:32] * tobru (~quassel@217-162-50-53.dynamic.hispeed.ch) has joined #ceph
[19:33] <paravoid> how to pinpoint it further?
[19:34] <paravoid> I tried perf dump but ohsomanycounters
[19:35] <sjusthm> perf dump won't help
[19:35] <sjusthm> we'd need a profiler
[19:35] <sjusthm> so perf top
[19:35] <sjusthm> maybe install the -dbg package?
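(A sketch of what sjusthm is suggesting, assuming Ubuntu packaging; the pid selection is illustrative, pick the busy osd from top/ps.)

    apt-get install ceph-dbg                            # debug symbols, so perf can resolve ceph-osd frames
    perf top -p $(pidof ceph-osd | awk '{print $1}')    # profile a single osd process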
[19:46] * Wolff_John (~jwolff@vpn.monarch-beverage.com) Quit (Ping timeout: 480 seconds)
[19:48] <sagewk> joao: https://github.com/ceph/ceph/pull/524
[19:50] * gaveen (~gaveen@175.157.128.208) has joined #ceph
[19:51] * haomaiwa_ (~haomaiwan@218.71.75.115) has joined #ceph
[19:52] * mschiff (~mschiff@85.182.236.82) Quit (Ping timeout: 480 seconds)
[19:55] * haomaiwang (~haomaiwan@218.71.78.15) Quit (Ping timeout: 480 seconds)
[19:59] * Meths_ is now known as Meths
[20:06] <mikedawson> sjusthm: no luck with no_osd_recover_clone_overlap. Email sent on that thread. I feel like Windows guests suffer more than Linux guests. Could crushmap topology or COW play a role here?
[20:06] <sjusthm> mikedawson: not sure, if there's a difference between windows and linux vms, it's probably not osd side
[20:07] <sjusthm> if the raring vm is on the same cluster, then the crushmap probably isn't the issue
[20:07] <sjusthm> how are you using snapshots?
[20:08] <mikedawson> sjusthm: we aren't using snapshots, guest glance/cinder COW
[20:09] * carif (~mcarifio@146-115-183-141.c3-0.wtr-ubr1.sbo-wtr.ma.cable.rcn.com) has joined #ceph
[20:09] <sjusthm> in that case, that config shouldn't have had any effect at all
[20:09] <sjusthm> and your issue is not related to stefan's
[20:09] <sjusthm> glance/cinder COW ends up as rbd COW?
[20:09] <mikedawson> sjusthm: I believe so
[20:10] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[20:10] * ChanServ sets mode +v andreask
[20:11] <mikedawson> basically a "golden image" in the glance pool is the parent of the boot volume in the rbd pool
[20:11] <sjusthm> if it's an issue of slow osd requests, I'll need a dump of client/osd requests which were slow along with osd logs from the cluster during the period surrounding the slow requests, with debug_ms=1, debug_osd=20, debug_filestore=20
[20:11] <sagewk> yehudasa yehudasa_: wip-6046 looks good... that should go to next and cherry-pick to dumpling, right?
[20:11] <sjusthm> obtaining the client/osd requests might be possible from the rbd admin socket if there is one?
[20:11] <sjusthm> joshd: is there a way to dump in progress rbd requests?
[20:12] <yehudasa_> sagewk: yes
[20:12] <joshd> sjusthm: yeah, if the admin socket is enabled you can dump requests from the objecter
[20:12] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[20:12] <mikedawson> sjusthm: I don't have rbd admin sockets, but I do have client.volume.logs
[20:12] <sjusthm> joshd: is that related?
[20:12] <sjusthm> joshd: is there a way to get objecter logging?
[20:12] <sagewk> k
[20:14] <joshd> sjusthm: logs with debug ms will show if there are more requests, debug objecter will show info about resends etc.
[20:14] <mikedawson> sjusthm: I think sagewk put in objecter logging for me in wip-5955. Would that be the same thing? Wonder if that is in 0.67.1...
[20:14] <sjusthm> joshd: how would mikedawson enable it?
[20:15] <joshd> easiest way is to put it in ceph.conf on the compute node, then detach/reattach or restart the vm
[20:16] <sjusthm> mikedawson: ok, so do debug_ms=1 debug_objecter=20 on the client which you expect will see slow requests
[20:16] <sjusthm> mikedawson: and debug_osd=20, debug_filestore=20, debug_ms=1 on all of the osds
[20:16] * carif (~mcarifio@146-115-183-141.c3-0.wtr-ubr1.sbo-wtr.ma.cable.rcn.com) Quit (Quit: Ex-Chat)
[20:16] <mikedawson> sjusthm: will do
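(A sketch of the settings sjusthm asks for, assuming they go into ceph.conf on the compute node for the client side and injectargs for the osds; restart or detach/reattach the vm so the client picks up the [client] section, as joshd notes.)

    # compute node, ceph.conf
    [client]
        debug ms = 1
        debug objecter = 20
        admin socket = /var/run/ceph/$cluster-$type.$id.$pid.asok    # optional, enables dumping in-flight requests

    # on the cluster, without restarting osds (or per daemon with osd.N)
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

    # with the client admin socket enabled, in-flight requests can be dumped as joshd mentions
    ceph --admin-daemon /var/run/ceph/<the-client-asok> objecter_requests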
[20:16] * ninkotech_ (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[20:20] * ninkotech (~duplo@static-84-242-87-186.net.upcbroadband.cz) Quit (Read error: Connection reset by peer)
[20:20] * Wolff_John (~jwolff@64.132.197.18) has joined #ceph
[20:21] * rudolfsteiner (~federicon@181.21.164.196) Quit (Quit: rudolfsteiner)
[20:23] * jmlowe (~Adium@2001:18e8:2:28cf:f000::dab8) has joined #ceph
[20:30] * Wolff_John_ (~jwolff@vpn.monarch-beverage.com) has joined #ceph
[20:33] * rudolfsteiner (~federicon@181.21.154.81) has joined #ceph
[20:34] * gaveen (~gaveen@175.157.128.208) Quit (Remote host closed the connection)
[20:35] * Wolff_John (~jwolff@64.132.197.18) Quit (Ping timeout: 480 seconds)
[20:35] * Wolff_John_ is now known as Wolff_John
[20:36] * gregmark (~Adium@68.87.42.115) has joined #ceph
[20:38] * vipr (~vipr@78-23-114-68.access.telenet.be) Quit (Remote host closed the connection)
[20:41] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:45] * danieagle (~Daniel@186.214.76.117) has joined #ceph
[20:46] * ntranger (~ntranger@proxy2.wolfram.com) has joined #ceph
[20:48] * ninkotech__ (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[20:50] * ninkotech_ (~duplo@static-84-242-87-186.net.upcbroadband.cz) Quit (Ping timeout: 480 seconds)
[20:52] * noob2 (~cjh@173.252.71.4) has joined #ceph
[20:52] * noob2 (~cjh@173.252.71.4) Quit ()
[20:56] * mschiff (~mschiff@85.182.236.82) has joined #ceph
[20:57] * ninkotech__ (~duplo@static-84-242-87-186.net.upcbroadband.cz) Quit (Ping timeout: 480 seconds)
[21:05] * Wolff_John_ (~jwolff@ftp.monarch-beverage.com) has joined #ceph
[21:07] <paravoid> sjusthm: I already tried perf top
[21:07] <paravoid> 99% of an unknown symbol
[21:09] * Wolff_John (~jwolff@vpn.monarch-beverage.com) Quit (Ping timeout: 480 seconds)
[21:09] * Wolff_John_ is now known as Wolff_John
[21:15] * ninkotech__ (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[21:16] <sjusthm> paravoid: did you install the -dbg package?
[21:17] <paravoid> yes :)
[21:26] * paravoid (~paravoid@scrooge.tty.gr) Quit (Remote host closed the connection)
[21:28] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Read error: Connection reset by peer)
[21:30] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[21:32] * paravoid (~paravoid@scrooge.tty.gr) has joined #ceph
[21:32] <sjusthm> darn
[21:36] <paravoid> yehudasa_: around?
[21:38] <yehudasa_> paravoid, yeah
[21:38] <paravoid> so I substantially increased load in production today and it produced three outages
[21:38] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[21:39] <yehudasa_> what kind of outages?
[21:39] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[21:39] <paravoid> it worked fine for three hours, then it broke, then worked fine for about 30-60 minutes between each outage
[21:40] <paravoid> requests were timing out
[21:40] <paravoid> but ceph logs showed nothing
[21:40] <yehudasa_> paravoid: how many concurrent requests?
[21:40] <paravoid> rgw runs at debug level 1, unfortunately
[21:40] <paravoid> I have 4 rgw in a cluster
[21:40] <paravoid> about 100 req in total, so ~25 each?
[21:41] <paravoid> but I increased the thread pool to 600 just in case
[21:41] <yehudasa_> how many fds per process?
[21:41] <paravoid> I did find, at one point, one of them waiting on the thread pool lock
[21:42] <paravoid> I didn't check at the time
[21:42] <paravoid> I have two more important data points
[21:42] * ninkotech__ (~duplo@static-84-242-87-186.net.upcbroadband.cz) Quit (Ping timeout: 480 seconds)
[21:42] <paravoid> the first is, that we did the exact same thing before with cuttlefish
[21:42] <paravoid> and it was fine for weeks
[21:42] <paravoid> so this is new
[21:43] <paravoid> the second data point is that the outages had an interesting pattern: in all three occurences, two out of the four timed out at the same time
[21:43] <paravoid> and a third one 3-4 minutes later
[21:43] <sjusthm> paravoid: urgh, we'll need to get perf working
[21:44] <paravoid> sjusthm: that's a different bug btw :)
[21:44] <sjusthm> paravoid: I know
[21:44] <sjusthm> I'm talking about the osd one :P
[21:44] <paravoid> so the osd perf symbols are there, just not that function I think
[21:45] <sjusthm> oh, the symbol was outside of ceph-osd?
[21:45] <paravoid> maybe?
[21:45] <kraken> maybe is something I messed up there (rafael on 08/20/2013 09:16PM)
[21:45] <paravoid> hm, let me do another test
[21:46] <paravoid> Events: 8K cycles, Thread: ceph-osd(3306), DSO: ceph-osd
[21:46] <paravoid> 96.42%  0x5636ff
[21:46] <paravoid> 0.68%  leveldb::Version::LevelFileNumIterator::~LevelFileNumIterator()
[21:46] <paravoid> 0.62%  std::_Rb_tree<hobject_t, std::pair<hobject_t const, std::_List_iterator<hobject_t> >, std::_Select1st<std::pair<hobject_t co
[21:46] <paravoid> this is what I'm seeing now
[21:46] <yehudasa_> paravoid: was the cluster completely healthy when that happened?
[21:46] <paravoid> yehudasa_: yes
[21:46] <paravoid> well, the first time :)
[21:47] <paravoid> the rest of the times it was backfiling
[21:47] <sjusthm> ok, let's try this, gdb attach to the misbehaving process and create a bug with the backtraces of all threads
[21:47] <yehudasa_> so it might have been that requests were timing out due to a slow backend, then retrying, which in the end consumes the whole thread pool
[21:48] <yehudasa_> but it's hard to say without any logs
[21:50] <yehudasa_> also, in such a case you might have ended up with fds that are larger than 1024, and depending on your libfcgi (not the apache module, but the userspace library) version, you might have hit some bug
[21:50] <yehudasa_> paravoid, do you have libfcgi0ldbl installed?
[21:53] <paravoid> sorry, connection died
[21:53] <paravoid> let me check
[21:53] <paravoid> libfcgi0ldbl 2.4.0-8.1
[21:53] <paravoid> so, yes
[21:53] <paravoid> so regarding logs
[21:54] <paravoid> debug rgw is already noisy with two lines per request
[21:54] <yehudasa_> hmm.. I think that's a problematic version
[21:54] <paravoid> with debug rgw 2 (let alone higher), at ~25 req/s it starts growing
[21:54] <yehudasa_> I need to look at the logs, are you running ubuntu?
[21:54] <paravoid> yeah, precise
[21:54] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[21:55] <yehudasa_> you might be hitting the specific bug I was talking about, I'll need to check the package though
[21:55] <paravoid> what bug is that?
[21:55] <paravoid> fds > 1024?
[21:55] <yehudasa_> yeah, it doesn't handle that correctly
[21:55] <paravoid> with 25 req/s that would be weird though
[21:55] <paravoid> that's a lot of in-flight requests
[21:57] <yehudasa_> not necessarily; due to a slow backend it might have accumulated requests, and there are also more fds used for the rados communication, so in essence you only need something like 400-500 requests to reach that
[21:57] <yehudasa_> also, it's not about how many fds you have, but what's the actual fd value
[21:57] <paravoid> how would I find out if it was a slow backend issue
[21:57] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[21:57] <paravoid> it could have easily been
[21:58] <paravoid> the first time the outage happened, it was when the other osd issue I was telling sjusthm about spiked, so there's a correlation there
[21:58] <yehudasa_> to really find out you'd need to have messenger debugging
[21:58] <paravoid> I stopped that OSD so after that there was recovery traffic, which could have slowed things down I guess
[21:58] <yehudasa_> but you can try looking at the apache logs, see if requests were timing out just before
[21:59] <yehudasa_> (e.g., got 500 responses, then usually clients retry)
[22:02] <yehudasa_> paravoid: 2.4.0-8.1ubuntu1 has a fix for that select/poll issue, however, I don't see that available for precise
[22:04] <paravoid> well 500s don't say much
[22:04] <paravoid> I had 500s, I knew that :)
[22:04] <paravoid> because of timeouts
[22:05] <yehudasa_> paravoid: well, apache times out, if the gateway waits on a rados request then the thread is still alive, as well as the fds
[22:05] <paravoid> right
[22:05] <yehudasa_> so it does sound like this issue, and on top of that you have the bad package version
[22:06] <paravoid> but why did the requests pile up in the first place?
[22:06] * ninkotech__ (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[22:06] <sjusthm> paravoid: probably because something froze your osds
[22:06] <sjusthm> probably that thing using all that cpu
[22:06] <paravoid> note how it happened on two boxes simultaneously, then a third
[22:06] <paravoid> sjusthm: that could be the first outage, but then I killed that OSD
[22:07] <paravoid> so it doesn't account for the others
[22:07] <sjusthm> it doesn't happen on other osds?
[22:07] <paravoid> no
[22:07] <sjusthm> withdrawn
[22:07] <paravoid> just one, maaybe a second one, I wasn't sure so I killed that too
[22:07] <yehudasa_> paravoid: but at that point the cluster wasn't healthy, right?
[22:07] <paravoid> it was backfilling
[22:07] <sjusthm> paravoid: that should be ok
[22:07] <sjusthm> shouldn't cause timeouts
[22:07] <sjusthm> were you seeing slow request warnings?
[22:07] <paravoid> nope
[22:08] <kraken> http://i.imgur.com/foEHo.gif
[22:08] <yehudasa_> paravoid: try lowering your thread pool to 300-400, see if that happens again
[22:09] <paravoid> I need to find a way to reproduce this in lab first...
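(For reference, the knob being discussed is a radosgw option in ceph.conf; a sketch of dialling it back down as yehudasa_ suggests. The section name depends on how the gateway instance was named, and the value is illustrative.)

    [client.radosgw.gateway]
        rgw thread pool size = 300    # default is 100; paravoid had raised it to 600

    # restart the radosgw process afterwards so it takes effect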
[22:09] * doxavore (~doug@99-7-52-88.lightspeed.rcsntx.sbcglobal.net) has joined #ceph
[22:10] <yehudasa_> paravoid: did you restart the gateways after the first issue?
[22:10] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has left #ceph
[22:10] <paravoid> I think after the second outage, when I increased the thread pool count
[22:13] <yehudasa_> paravoid: and you also need to upgrade that package, it's buggy
[22:13] <paravoid> this didn't happen with cuttlefish though
[22:13] <yehudasa_> these symptoms really remind me of an issue we've seen before
[22:13] <paravoid> so it could have contributed but it wasn't the root cause
[22:14] <yehudasa_> it's hard for me to point at the root cause without any logs
[22:14] <paravoid> I know :)
[22:14] <paravoid> I'm sorry for that
[22:15] <paravoid> and thanks for the attempt at least
[22:15] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) has joined #ceph
[22:24] * mucky (~mucky@mucky.socket7.org) has joined #ceph
[22:24] * mucky is now known as mmmucky
[22:27] * danieagle (~Daniel@186.214.76.117) Quit (Read error: Operation timed out)
[22:31] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:34] <loicd> I get "DEBUG:teuthology.run_tasks:Exception was not quenched, exiting: error: [Errno 111] Connection refused"
[22:34] <loicd> on a number of teuthology tasks, should I be worried about it? It looks like it's stuck but maybe it's part of the normal process?
[22:35] <loicd> it looks like a failure to connect to the lock server
[22:36] <loicd> http://pastebin.com/xrjxerdU
[22:37] <loicd> maybe I should kill the suite ?
[22:37] <dmick> yep. do you expect to have a lock server running, or does your yaml set the option to avoid locking?
[22:37] <loicd> it's the rados test suite run with
[22:38] <loicd> ./schedule_suite.sh rados wip-5510 testing loic@dachary.org basic master plana
[22:39] <loicd> dmick: so I don't really run an individual yaml. It ran 234 tests successfully and now it looks like it's stuck.
[22:39] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) Quit (Quit: Ex-Chat)
[22:42] <dmick> the option is check-locks: false, and the log should contain "Lock checking disabled." if it's set
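(A sketch of the job-yaml fragment dmick is referring to; only use it when there genuinely is no lock server to talk to.)

    check-locks: false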
[22:44] * sagelap (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) has joined #ceph
[22:44] <sagewk> loicd: i see a bunch of the processes are dead.. did you kill them or did they die off?
[22:45] <sagewk> houkouonchi-home: looks like the lockserver was temporarily unavailable
[22:47] * tobru (~quassel@217-162-50-53.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[22:47] * odyssey4me (~odyssey4m@165.233.71.2) Quit (Ping timeout: 480 seconds)
[22:48] <sagewk> loicd: nuking the leaked locks; should start making progress shortly
[22:49] <sagewk> zackc: also a bunch of machines locked from your sentry run.. you need to put nuke-on-error: true in the yaml or else the machines won't get cleaned up on a failed run
[22:51] <zackc> sagewk: huh. where's that written?
[22:51] <sagewk> in the job yaml
[22:51] * paravoid (~paravoid@scrooge.tty.gr) Quit (Ping timeout: 480 seconds)
[22:51] <zackc> sagewk: no i mean how should i have known that
[22:51] <sagewk> :)
[22:51] * zackc is still super confused about how to schedule suites
[22:51] <sagewk> it's added by schedule_suite.sh normally
[22:52] <zackc> hm. i used teuthology-suite to schedule that run.
[22:52] <sagewk> if you are scheduling using something else, you should add it yourself.
[22:52] <sagewk> yeah
[22:52] <zackc> am i correct in guessing that schedule_suite.sh wraps teuthology-suite?
[22:52] <sagewk> iirc teuthology-suite can take an extra input of yaml that it just sandwiches together with the rest
[22:52] <zackc> (if so, why?)
[22:52] <sagewk> yeah
[22:53] <sagewk> teuthology-suite isn't very convenient (because of things like this)
[22:53] <sagewk> one of the things we should clean up :)
[22:53] <zackc> if the tool everyone actually uses wraps it... it'll be pretty tough to clean it up
[22:53] <sagewk> the .sh is very prescriptive
[22:53] * jmlowe (~Adium@2001:18e8:2:28cf:f000::dab8) Quit (Quit: Leaving.)
[22:54] <sagewk> well, we should replace the .sh with something cleaner and more useful, and take a fresh look at what teuthology-suite does (i don't think anybody uses it directly).
[22:54] <zackc> i tried to use schedule_suite.sh to run a suite and it just exited with status 8. no error message.
[22:54] <zackc> :(
[22:54] <sagewk> yeah, it isn't friendly
[22:55] <sagewk> it expects a ceph-qa-suite.git checkout in ~/src/ceph-qa-suite, among other things
[22:55] <sagewk> and makes assumptions about how branch names should be resolved into sha1's
[22:55] <zackc> i did this on teuthology.front
[22:55] <sagewk> happy to do a g+ hangout and go over how it all fits together currently
[22:57] * allsystemsarego (~allsystem@5-12-37-127.residential.rdsnet.ro) Quit (Quit: Leaving)
[22:57] <zackc> hmm, i guess we should do that
[22:58] <sagewk> https://plus.google.com/hangouts/_/be1a7d88aab31cf6eb59abb8915aec19c3353fe6 if anybody else is interested
[22:59] * paravoid (~paravoid@scrooge.tty.gr) has joined #ceph
[22:59] <dmick> oh duh it's on the plana cluster :)
[23:01] <sagewk> ./schedule_suite.sh <suite> <ceph branch> <kernel branch> <your email>
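(Pulling the thread together: schedule_suite.sh adds housekeeping yaml like nuke-on-error for you, while teuthology-suite needs that supplied as extra yaml. A rough sketch; branch names and email are placeholders, and the wrapper expects ~/src/ceph-qa-suite checked out, as sagewk says above.)

    # the wrapper script
    ./schedule_suite.sh rados wip-5510 testing you@example.com

    # roughly the extra yaml to feed teuthology-suite if driving it directly
    nuke-on-error: true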
[23:02] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[23:02] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[23:03] * dmsimard (~Adium@108.163.152.2) Quit (Quit: Leaving.)
[23:03] * Gamekiller77 (~oftc-webi@128-107-239-234.cisco.com) has joined #ceph
[23:03] <Gamekiller77> I'm having a problem placing an image from openstack onto my ceph cluster. It just stays in the saving state with nothing to see in the glance-api logs
[23:03] <Gamekiller77> any ideas
[23:04] <Gamekiller77> or a way to manually place a file from the glance server to the glance rbd pool
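(A rough sketch of testing that path by hand, assuming the glance pool and a client.glance key exist; names are placeholders, and note this only proves rados/rbd connectivity, it won't register anything with glance itself.)

    # push a file straight into the pool as a test object, then list it
    rados -p glance --id glance put test-object /tmp/cirros.img
    rados -p glance --id glance ls

    # or import it as an rbd image, which is closer to what glance's rbd backend does
    rbd -p glance --id glance import /tmp/cirros.img test-image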
[23:04] * dmsimard (~Adium@108.163.152.2) has joined #ceph
[23:05] <sagewk> http://pad.ceph.com/p/teuthology_notes
[23:05] * ntranger (~ntranger@proxy2.wolfram.com) Quit (Ping timeout: 480 seconds)
[23:06] * Guest4093 (~jr@190.38.48.76) has joined #ceph
[23:06] <loicd> sagewk: thanks !
[23:07] * loicd says hi from http://c-base.org/
[23:08] <loicd> sagewk: I see progress indeed.
[23:08] * Guest4093 (~jr@190.38.48.76) Quit ()
[23:08] <L2SHO> has anyone run something like mongodb on top of ceph? I'm trying to think if there is a good way to make an expandable database storage backend with ceph
[23:08] <kraken> http://youtu.be/b2F-DItXtZs
[23:09] * ntranger (~ntranger@proxy2.wolfram.com) has joined #ceph
[23:09] * ntranger_ (~ntranger@proxy2.wolfram.com) has joined #ceph
[23:09] * ntranger_ (~ntranger@proxy2.wolfram.com) Quit (Remote host closed the connection)
[23:12] * dmsimard (~Adium@108.163.152.2) Quit (Ping timeout: 480 seconds)
[23:14] <Gamekiller77> if i do a rados -p glance ls and get a fault, what should i look at?
[23:14] * Wolff_John (~jwolff@ftp.monarch-beverage.com) Quit (Ping timeout: 480 seconds)
[23:14] * rudolfsteiner (~federicon@181.21.154.81) Quit (Quit: rudolfsteiner)
[23:16] * Dark-Ace-Z is now known as DarkAceZ
[23:22] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[23:22] * ChanServ sets mode +v andreask
[23:22] <Gamekiller77> no one for glance help i see
[23:22] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:24] * BillK (~BillK-OFT@58-7-52-33.dyn.iinet.net.au) has joined #ceph
[23:25] * loicd feeling brave and trying to compile ceph on AIX 7.1 ...
[23:25] * ntranger_ (~ntranger@proxy2.wolfram.com) has joined #ceph
[23:26] <lurbs> I'm unconvinced that 'brave' is the correct word.
[23:26] <lurbs> But good luck. :)
[23:26] * ntranger (~ntranger@proxy2.wolfram.com) Quit (Remote host closed the connection)
[23:27] * lmb (lmb@212.8.204.10) Quit (Ping timeout: 480 seconds)
[23:27] * lmb (lmb@212.8.204.10) has joined #ceph
[23:29] <loicd> ./autogen.sh[17]: libtoolize: not found.
[23:29] * loicd trying to get root access ...
[23:30] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[23:30] <loicd> lurbs: :-)
[23:31] * haomaiwang (~haomaiwan@218.71.77.108) has joined #ceph
[23:33] * markbby (~Adium@168.94.245.4) Quit (Quit: Leaving.)
[23:36] * haomaiwa_ (~haomaiwan@218.71.75.115) Quit (Ping timeout: 480 seconds)
[23:36] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[23:37] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[23:38] * jmlowe (~Adium@2601:d:a800:97:6403:f9a9:1669:5ce4) has joined #ceph
[23:42] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[23:42] * ircolle1 (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[23:42] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Read error: Connection reset by peer)
[23:43] * gregmark (~Adium@68.87.42.115) Quit (Quit: Leaving.)
[23:44] <joshd> sagewk: wip-6070-cuttlefish fixes the hang with rbd locks on a full cluster
[23:45] <gregaf> sagewk: in ReplicatedPg::prepare_transaction, there's a line
[23:45] <gregaf> ctx->new_obs.oi.version = ctx->at_version;
[23:45] <sagewk> joshd: k, can look in a few minutes
[23:45] <gregaf> which is immediately repeated inside of an "if (ctx->new_obs.exists) {"
[23:45] <gregaf> it looks like it's just redundant but I'm no expert here and it's making me nervous :p
[23:46] * carif (~mcarifio@ip-207-145-81-212.nyc.megapath.net) has joined #ceph
[23:51] * haomaiwa_ (~haomaiwan@218.71.73.3) has joined #ceph
[23:52] * haomaiwang (~haomaiwan@218.71.77.108) Quit (Ping timeout: 480 seconds)
[23:54] <loicd> is leveldb available as a debian package for sparc ?
[23:56] * Gamekiller77 (~oftc-webi@128-107-239-234.cisco.com) Quit (Remote host closed the connection)
[23:59] * BillK (~BillK-OFT@58-7-52-33.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.