#ceph IRC Log


IRC Log for 2013-05-28

Timestamps are in GMT/BST.

[0:00] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) Quit (Quit: leaving)
[0:09] * sh_t (~sht@NL.privatevpn.com) Quit (Ping timeout: 480 seconds)
[0:10] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) has joined #ceph
[0:12] * BillK (~BillK@124-169-77-36.dyn.iinet.net.au) has joined #ceph
[0:14] * vata (~vata@2607:fad8:4:6:6deb:15c:a201:cea4) Quit (Quit: Leaving.)
[0:18] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[0:21] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[0:29] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[0:31] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:32] * loicd (~loic@magenta.dachary.org) has joined #ceph
[0:35] * ghartz (~ghartz@33.ip-5-135-148.eu) Quit (Remote host closed the connection)
[0:36] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) has joined #ceph
[0:49] * sleinen (~Adium@2001:620:0:26:c0b:2911:50d4:bd5b) Quit (Quit: Leaving.)
[0:51] * fmarchand (~fmarchand@85-168-75-207.rev.numericable.fr) Quit (Ping timeout: 480 seconds)
[0:52] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) has joined #ceph
[0:56] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:56] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:02] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) Quit (Ping timeout: 480 seconds)
[1:04] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[1:12] * andrei (~andrei@host86-155-31-94.range86-155.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[1:25] * diegows (~diegows@ has joined #ceph
[1:26] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) has joined #ceph
[1:38] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Read error: Operation timed out)
[1:44] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[1:45] * KindTwo (KindOne@h29.45.28.71.dynamic.ip.windstream.net) has joined #ceph
[1:45] * KindTwo is now known as KindOne
[1:46] * IHS (~horux@ Quit (Quit: Saindo)
[1:49] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[1:49] * The_Bishop (~bishop@e177088119.adsl.alicedsl.de) has joined #ceph
[1:51] * BManojlovic (~steki@fo-d- Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:54] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[2:01] * musca (musca@tyrael.eu) has joined #ceph
[2:45] * redeemed (~redeemed@cpe-192-136-224-78.tx.res.rr.com) has joined #ceph
[2:47] * DarkAce-Z (~BillyMays@ has joined #ceph
[2:48] * DarkAceZ (~BillyMays@ Quit (Ping timeout: 480 seconds)
[2:52] * musca (musca@tyrael.eu) has left #ceph
[3:02] * mrjack (mrjack@office.smart-weblications.net) has joined #ceph
[3:02] <mrjack> re
[3:18] * ChanServ sets mode +v scuttlemonkey
[3:18] * ChanServ sets mode +v joao
[3:18] * ChanServ sets mode +v elder
[3:22] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[3:32] * mdxi_ (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[3:37] * diegows (~diegows@ Quit (Read error: Operation timed out)
[3:40] <doubleg> dmes
[3:40] <doubleg> whoops, wrong term
[3:51] <mrjack> :)
[4:09] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Ping timeout: 480 seconds)
[4:10] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[4:15] * The_Bishop_ (~bishop@f052100129.adsl.alicedsl.de) has joined #ceph
[4:18] * The_Bishop (~bishop@e177088119.adsl.alicedsl.de) Quit (Read error: Operation timed out)
[4:41] * redeemed (~redeemed@cpe-192-136-224-78.tx.res.rr.com) Quit (Quit: bia)
[4:47] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) Quit (Remote host closed the connection)
[4:50] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[4:52] * mrjack (mrjack@office.smart-weblications.net) Quit (Ping timeout: 480 seconds)
[4:56] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) has joined #ceph
[5:11] * Vanony (~vovo@i59F79DAF.versanet.de) has joined #ceph
[5:18] * Vanony_ (~vovo@ Quit (Ping timeout: 480 seconds)
[5:23] * jshen (~jshen@108-231-76-84.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[6:16] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[6:32] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 22.0/20130521223249])
[6:34] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Quit: noahmehl)
[6:36] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[6:40] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[6:42] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[6:46] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit ()
[6:46] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[6:53] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[6:54] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:15] * jshen (~jshen@108-231-76-84.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[7:40] * jerker (jerker@Psilocybe.Update.UU.SE) has joined #ceph
[8:03] * sean (~sean@ has joined #ceph
[8:08] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (Ping timeout: 480 seconds)
[8:13] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:31] * sean (~sean@ Quit ()
[8:40] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[8:55] * bergerx_ (~bekir@ has joined #ceph
[8:56] * LeaChim (~LeaChim@ has joined #ceph
[8:58] * tnt (~tnt@ has joined #ceph
[9:05] * ScOut3R (~ScOut3R@ has joined #ceph
[9:18] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:26] * sleinen (~Adium@ has joined #ceph
[9:27] * BManojlovic (~steki@ has joined #ceph
[9:28] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:28] * ChanServ sets mode +v andreask
[9:28] * sleinen1 (~Adium@2001:620:0:26:b577:2be7:a700:7875) has joined #ceph
[9:29] * sleinen (~Adium@ Quit (Read error: Operation timed out)
[9:37] * eschnou (~eschnou@ has joined #ceph
[9:44] * leseb (~Adium@ has joined #ceph
[9:45] * ChanServ sets mode +v leseb
[10:03] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[10:18] * dxd828 (~dxd828@ Quit (Remote host closed the connection)
[10:22] * dxd828 (~dxd828@ has joined #ceph
[10:26] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[10:29] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[10:34] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) has joined #ceph
[10:37] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:44] * pixel (~pixel@ has joined #ceph
[10:44] * capri (~capri@ has joined #ceph
[11:09] * athrift (~nz_monkey@ has joined #ceph
[11:10] * Vjarjadian (~IceChat77@ Quit (Quit: He who laughs last, thinks slowest)
[11:19] * dcasier (~dcasier@ has joined #ceph
[11:19] <saaby> tnt: I just had one more monitor bite the dust this morning (not the one we recreated yesterday). - crashed, and kept crashing if restarted.
[11:19] <saaby> have you seen something like that too?
[11:20] <tnt> nope. The one I recreated works fine, no issues with it.
[11:20] <saaby> disk usage wasn't too high when it happened, only ~16GB
[11:20] <tnt> what does the crash look like ?
[11:20] <tnt> "only" 16G ...
[11:20] <saaby> yeah.. :)
[11:21] <tnt> mons are supposed to take a couple hundred MB
[11:21] <saaby> this is not the same one that crashed yesterday - that one is fine now
[11:22] <saaby> yesterday they hit +100GB when they crashed
[11:22] <tnt> even then ... they're not supposed to take that much space.
[11:24] <saaby> this was the error on the other monitor today: mon/PGMonitor.cc: In function 'virtual void PGMonitor::update_from_paxos()' thread 7ff88772b700 time 2013-05-28 10:54:12.250371
[11:24] <saaby> mon/PGMonitor.cc: 173: FAILED assert(err == 0)
[11:25] <saaby> whops, sry for spam..
[11:26] <saaby> I have now recreated it just like the other one yesterday, and it's fine now.. I'm just a bit worried about mons failing fast..
[11:26] <saaby> so, is this worth reporting - or does it look like the bug you have been chasing?
[11:26] <tnt> well, until 0.61.3 is released or you install a fixed-up version manually, you'll probably have weird issues.
[11:27] <saaby> ok
[11:27] <saaby> we plan to install the fixup today
[11:36] <joao> saaby, do you have logs for that monitor?
[11:36] <joao> how often can you trigger that error?
[11:38] <joao> saaby, did that monitor go through a synchronization?
[11:39] <saaby> joao: yup, have logs. - will post them in a minute
[11:39] <joao> great
[11:39] <joao> thanks
[11:39] <joao> I'll go grab some coffee in the meantime then
[11:39] <joao> :)
[11:39] <saaby> :)
[11:42] * tnt disabled compact on trim ... but so far it doesn't look good
[11:46] <joao> tnt, started growing I presume?
[11:47] <tnt> yes
[11:48] <joao> okay, at least that proves the compact-on-foo patches were justified
[11:48] <joao> sucks we need to do it manually though
[11:48] <joao> s/manually/explicitly
[11:49] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[11:49] <tnt> I'll let it run a bit maybe the "steady state" space is just larger ...
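The "compact on trim" toggle tnt is experimenting with corresponds to a monitor config option. A minimal sketch of flipping it, assuming the cuttlefish-era option names `mon compact on trim` / `mon compact on start` and a placeholder mon id `a` - verify the spellings and the restart command against your version:

```shell
# Append to ceph.conf, then restart the affected monitor.
# Option names and the mon id "a" are assumptions, not taken from the log.
cat >> /etc/ceph/ceph.conf <<'EOF'
[mon]
    # stop compacting leveldb on every paxos trim (the behavior being tested)
    mon compact on trim = false
    # compact the whole store once at monitor startup instead
    mon compact on start = true
EOF
service ceph restart mon.a   # restart command varies by distro/init system
```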
[11:49] <niklas> wido: just implemented a method to delete a pool, sent you a pull request
[11:50] <wido> niklas: Cool! I'll look at it shortly.
[11:50] <saaby> joao: here is the log from the failed mon: http://www.saaby.com/files/mon_crash.log.gz
[11:51] <niklas> wido: also I'm rather new to git, so please tell me if I did something wrong using git
[11:52] <joao> saaby, are you able to trigger this often?
[11:52] <joao> if so, would you mind cranking up debug mon to 10 ?
[11:53] <wido> niklas: Sure, I'll let you know
[11:53] <wido> tnx!
[11:53] <saaby> joao: well, I'm not sure if this is what happened to one of the other mons yesterday (which we rebuilt).
[11:53] <saaby> I tried restarting this mon a couple of times, which resulted in an immediate crash
[11:54] <wido> niklas: There is a typo ;)
[11:54] <saaby> this mon is now rebuilt, and hasn't failed again (yet)
[11:54] <joao> ah
[11:54] <joao> okay
[11:54] <wido> niklas: https://github.com/Niklas974/rados-java/commit/5c074ca25702568a61350619b9b4a6117545844d
[11:54] <wido> see the log error: Failed to create pool
[11:54] <wido> I'll merge it in and fix the wrong message
[11:55] <niklas> ok, thx
[11:55] <joao> saaby, next time you hit a crash on start, please set 'mon debug = 20', re-run it and let it crash, grab the logs, grab a copy of the store, point me to them and then feel free to rebuild the monitor :)
[11:55] <tnt> joao: on the plus side, the IO rate is much lower :p
[11:55] <joao> tnt, eh, I bet
[11:55] <saaby> deal
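The debug setting joao asks for ('mon debug = 20') is spelled `debug mon` in ceph.conf; a sketch of raising it before reproducing the crash, where the admin-socket path and mon id `a` are placeholders:

```shell
# Persistent: raise mon debug level in ceph.conf before the next restart.
cat >> /etc/ceph/ceph.conf <<'EOF'
[mon]
    debug mon = 20
    debug paxos = 10   # optional: paxos detail often helps with store issues
EOF
# Or inject at runtime on a running monitor via its admin socket
# (socket path and mon id "a" are placeholders for your setup):
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_mon 20
```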
[11:55] <joao> I'm going to look into what the other guy on ##leveldb suggested
[11:56] <joao> using basho's leveldb implementation
[11:56] <joao> but I worry that's not going to be a viable option
[11:57] <tnt> too bad the enhancements aren't backported
[11:57] * guilhemL (~guilhem@tui75-3-88-168-236-26.fbx.proxad.net) has joined #ceph
[11:59] <wido> niklas: Merged and fixed pushed
[12:07] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[12:08] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Ping timeout: 480 seconds)
[12:14] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[12:18] * andrei (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[12:19] <andrei> Hello guys
[12:19] <andrei> i was wondering if anyone could help me with ceph + kvm with rbd?
[12:19] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:19] * ChanServ sets mode +v andreask
[12:19] <andrei> i am having massive issues with writing to disks attached to virtual machines
[12:23] <absynth> of which form?
[12:23] <andrei> by massive issues I mean that after a short period of writing I get kernel messages that the process was blocked for more than 120 seconds
[12:23] <andrei> and I can't unmount the file system
[12:23] <andrei> and I can't even shutdown the virtual machine
[12:23] <andrei> it just hangs
[12:24] <andrei> when I forcefully restart it and mount the ceph disk, it mounts without any of the changes I made during the previous session
[12:24] <andrei> i am using ceph from ubuntu ppa 0.61.2 I think
[12:25] <andrei> with qemu 1.5 and libvirt 1.0.5
[12:27] <wido> andrei: Is the Ceph cluster healthy?
[12:28] <andrei> wido: yeah, it is
[12:28] <andrei> health HEALTH_OK
[12:28] <andrei> 3 mons and all 17 osds are up
[12:28] <andrei> and I can read data without any issues
[12:29] <andrei> i am using default ceph install by following the 5 minute quick start guide
[12:30] <andrei> no optimisation or changes from default settings from both ceph and kvm side
[12:30] <wido> andrei: No, it should work with default values
[12:30] <wido> have you tried kernel RBD to see what that does?
[12:31] * Volture (~quassel@office.meganet.ru) has joined #ceph
[12:31] * sleinen1 (~Adium@2001:620:0:26:b577:2be7:a700:7875) Quit (Quit: Leaving.)
[12:31] * sleinen (~Adium@ has joined #ceph
[12:32] * sleinen1 (~Adium@ has joined #ceph
[12:32] * sleinen (~Adium@ Quit (Read error: Connection reset by peer)
[12:33] * sleinen1 (~Adium@ Quit ()
[12:33] * sleinen (~Adium@ has joined #ceph
[12:33] <andrei> wido: you mean mounting rbd disk directly from the client rather than doing it through the virtual machine?
[12:33] <wido> andrei: Indeed. Using the 'rbd map' command
[12:34] <andrei> I will give it a go
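wido's suggestion can be tried with something like the following sketch, assuming the default `rbd` pool and a throwaway image name `test` (both placeholders), run as root on a client with the rbd kernel module available:

```shell
# Test kernel RBD directly, bypassing qemu/librbd, to isolate the hang.
modprobe rbd                        # load the kernel client (librbd doesn't need it)
rbd create test --size 10240        # size is in MB, so this is a 10 GB image
rbd map test                        # image appears as a block device, e.g. /dev/rbd0
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt
# crude sequential write test; watch dmesg for blocked-task messages meanwhile
dd if=/dev/zero of=/mnt/bench bs=1M count=1024 oflag=direct
umount /mnt
rbd unmap /dev/rbd0
```

If writes survive here but hang inside the VM, that points at the qemu/librbd path rather than the cluster.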
[12:36] * sleinen (~Adium@ Quit (Read error: Operation timed out)
[12:41] <pixel> hi everybody, trying to run a config using rbd with xen but I always get the error: Block device type "rbd" is invalid. What can be wrong? http://pastebin.com/xjgvnFWV
[12:44] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Quit: upgrading)
[12:46] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:54] <wido> pixel: I assume RBD support is in your blktap driver?
[12:54] <wido> might want to ping tnt about this
[12:54] <wido> he wrote the Xen/RBD stuff
[13:00] <pixel> wido> I'm doing some modifications in the conf file, let you know in a few minutes
[13:01] * jahkeup (~jahkeup@ has joined #ceph
[13:02] <andrei> wido: I've noticed one thing. I do not have the rbd modules on my centos 6.4 server which are running kvm and virtual machines
[13:02] <andrei> which is odd. How does it manage to connect the ceph drive in the first place with rbd module?
[13:04] <andrei> I am trying the rbd map directly on one of the file servers and so far the write benchmarks are working fine, I do not see any kernel messages relating to blocked processes/tasks
[13:04] <wido> andrei: Qemu uses librbd, which is in userspace
[13:05] <andrei> wido: i see
[13:06] <andrei> okay, in that case, do you think there is some sort of an issue with qemu which causes write issues on the vm side?
[13:09] <wido> andrei: Could be, probably is. I've never seen this happen
[13:09] <wido> The writes seem to block
[13:09] <wido> Did you compile Qemu against ceph 0.61.2?
[13:10] <andrei> i believe so
[13:10] <andrei> i've initially installed ceph
[13:11] <andrei> and compiled both libvirt and qemu after
[13:11] <andrei> wido: do you use ceph + kvm yourself?
[13:11] <wido> andrei: Yes, running behind CloudStack
[13:12] <andrei> wido: are you widodh on #cloudstack?
[13:13] <wido> andrei: Yes ;)
[13:13] <andrei> ))
[13:13] <andrei> sorry, didn't realise
[13:13] <andrei> we've had a chat last week
[13:14] <wido> I know :)
[13:14] <andrei> thanks for your help
[13:14] <andrei> i am now closer to implementing ceph with CS
[13:14] <andrei> but that write issue is bugging me!
[13:14] <andrei> not sure how to deal with it or where to look
[13:15] <andrei> i do not have this problem when using rbd map directly from one of the storage servers
[13:15] <andrei> it seems to work okay
[13:20] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[13:21] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[13:23] <wido> andrei: Why not stick with the default Qemu and libvirt in Ubuntu 13.04?
[13:23] <wido> And only upgrade Ceph to 0.61.2
[13:29] * jahkeup (~jahkeup@ Quit (Quit: My MacBook Pro has gone to sleep. ZZZzzz…)
[13:36] <andrei> wido: I wish I could do that, but the Infiniband hardware is not working on ubuntu
[13:36] <andrei> that is why I have to stick with centos for time being
[13:36] <andrei> (((
[13:38] <wido> andrei: Ah, ok. No experience with that
[13:42] * jahkeup (~jahkeup@ has joined #ceph
[14:00] <tnt> joao: Interesting effect ... if you re-enable compact-on-trim on the leader, it seems to affect them all.
[14:02] <tnt> is that expected ?
[14:07] * noahmehl_ (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[14:07] * sleinen (~Adium@ext-dhcp-231.eduroam.unibe.ch) has joined #ceph
[14:09] * sleinen1 (~Adium@2001:620:0:25:3049:5952:afee:fb47) has joined #ceph
[14:10] <tnt> joao: Ah yes, the leader generates a transaction, encodes it and it's "replayed" on the peons so the compact decision is made on the leader only ... I overlooked that.
[14:12] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Ping timeout: 480 seconds)
[14:12] * noahmehl_ is now known as noahmehl
[14:15] * sleinen (~Adium@ext-dhcp-231.eduroam.unibe.ch) Quit (Ping timeout: 480 seconds)
[14:15] * IHS (~horux@ has joined #ceph
[14:16] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:17] * KindTwo (~KindOne@h176.239.22.98.dynamic.ip.windstream.net) has joined #ceph
[14:17] * KindTwo is now known as KindOne
[14:23] * diegows (~diegows@ has joined #ceph
[14:24] <joao> tnt, the compact however is not replayed
[14:24] <joao> that's what's been bugging me
[14:25] * sleinen1 (~Adium@2001:620:0:25:3049:5952:afee:fb47) Quit (Quit: Leaving.)
[14:25] * sleinen (~Adium@ext-dhcp-231.eduroam.unibe.ch) has joined #ceph
[14:27] * pixel (~pixel@ Quit (Quit: Ухожу я от вас (xchat 2.4.5 или старше))
[14:30] <tnt> joao: well ... it sure seems like it is replayed because in the LOG of level DB on a peon that has compact-on-trim=false, I see "Manual compaction at level-2 ...".
[14:31] <tnt> joao: where exactly do you see that compacts are not replayed on the peons ?
[14:32] * mrjack (mrjack@office.smart-weblications.net) has joined #ceph
[14:33] * sleinen (~Adium@ext-dhcp-231.eduroam.unibe.ch) Quit (Ping timeout: 480 seconds)
[14:34] <joao> tnt, oh, nevermind
[14:35] <joao> I forgot that Sage created a MonitorDBStore operation that triggers it once the transaction is applied
[14:35] <saaby> joao, tnt: all mons are now upgraded with the "fix".
[14:36] <joao> tnt, everything makes sense now, disregard what I said ;)
[14:36] <joao> saaby, cool; if you bump into any further crashes, please let me know
[14:37] <saaby> "fix" being: http://tracker.ceph.com/issues/4895
[14:37] <tnt> joao: heheh. Still doesn't really explain why the background compaction of leveldb lets the store grow. (and it does some "compaction" stuff, I see it in the LOG).
[14:38] <saaby> joao: will do!
[14:38] <joao> yeah, but from what I gathered from yesterday's chat with rvagg on ##leveldb, it's a known leveldb issue
[14:39] <joao> I've subscribed to their group hoping to see it resolved at some point; otherwise I might dive into the code if that's what it takes
[14:40] <joao> but would be great to be able to reproduce the whole shenanigan locally :\
[14:40] * IHS (~horux@ Quit (Quit: Saindo)
[14:41] <tnt> joao: it's kind of surprising you can't reproduce it. Unlike the other issue this one shows up directly.
[14:42] * jshen (~jshen@108-231-76-84.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[14:42] <joao> tnt, haven't run a test for it in a week or so, but the thing is that my stores grow reaaaally slow
[14:42] <joao> and get automatically compacted
[14:42] <joao> my guess is that I'm missing a real-world workload on it
[14:42] <tnt> you at least need a large pgmap ( I have 12k PGs )
[14:42] <joao> ahm
[14:42] <joao> that might be it
[14:43] <tnt> because this creates large keys with a bunch of sst files containing a single key. (in my case 5M)
[14:43] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (Read error: Operation timed out)
[14:43] <tnt> and then if there is some pgmap activity like stat update every second, then, it happens.
[14:49] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[14:49] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[14:49] <tnt> It's almost like it didn't delete the removed keys.
[14:50] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[14:51] * eegiks (~quassel@2a01:e35:8a2c:b230:115:9bf1:44d:8b34) Quit (Ping timeout: 480 seconds)
[14:52] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[14:55] <tnt> joao: https://code.google.com/p/leveldb/issues/detail?id=158
[14:58] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[14:58] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[14:59] <tnt> joao: so yeah ... basically the usage that ceph makes of leveldb is the worst possible workload for leveldb, and it handles it pretty much the worst way possible.
[15:00] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Quit: Leaving.)
[15:03] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[15:04] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[15:04] * allsystemsarego (~allsystem@ has joined #ceph
[15:09] * eegiks (~quassel@2a01:e35:8a2c:b230:8cbd:e0cd:45e7:762c) has joined #ceph
[15:13] * sakari (sakari@turn.ip.fi) has joined #ceph
[15:14] * redeemed (~redeemed@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[15:18] * nunoes (~oftc-webi@ has joined #ceph
[15:21] <nunoes> Hi all, is anyone here in the mood to help a noob sizing a ceph installation?
[15:22] <tnt> sizing ?
[15:23] <nunoes> yes, for performance
[15:24] <nunoes> I'm leading a project to implement ceph as a backend for openstack, most of my team members want a proprietary solution like emc or netapp
[15:25] * jshen (~jshen@108-231-76-84.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[15:26] <nunoes> i was thinking to implement ceph rbd, but everyone is talking about iops, and to be honest, i'm more of a network guy than a server guy :)
[15:26] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[15:27] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[15:27] <nunoes> so looking at all documentation, i've got a good look at the basics, but i can't seem to find any values for performance.
[15:27] <joao> nunoes, there was a couple of blog posts by nhm on performance
[15:28] * jeff-YF (~jeffyf@ has joined #ceph
[15:28] * PerlStalker (~PerlStalk@ has joined #ceph
[15:28] <nunoes> For instance, what would be the performance of a combination of 2 storage nodes with 12 SATA drives and SSD for journal?
[15:30] <nunoes> nhm?
[15:31] <joao> nunoes, http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[15:31] * Guest6791 (~julianwa@ Quit (Quit: afk)
[15:31] * sleinen (~Adium@ext-dhcp-231.eduroam.unibe.ch) has joined #ceph
[15:31] * andrei (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[15:32] <joao> nunoes, http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[15:32] <joao> nunoes, http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/ and http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/
[15:33] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[15:33] * sleinen1 (~Adium@2001:620:0:26:98b0:ccf:ca66:cf2) has joined #ceph
[15:33] <joao> also, this one might be interesting too: http://ceph.com/user-story/ceph-from-poc-to-production/
[15:35] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[15:37] <nunoes> Thanks for the information, i've already looked at them
[15:39] <joao> wrt performance, the best avenue here to get some feedback would be to either wait for nhm to come online, or for someone with a production deployment to share their insights
[15:39] <joao> I, for one, am pretty clueless on that regard :)
[15:39] * sleinen (~Adium@ext-dhcp-231.eduroam.unibe.ch) Quit (Ping timeout: 480 seconds)
[15:39] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[15:40] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[15:42] <nunoes> I understand. You see my question here is how I can demonstrate that a ceph installation is a better solution than a proprietary one. Vendors tend to say they support a certain number of IOPS, and i haven't found a way to explain how i can accomplish that with ceph.
[15:43] <tnt> I'd tend to say those numbers are BS anyway ..
[15:44] <nunoes> Yes i agree, but it's not easy to convince all the others of that :)
[15:45] <nunoes> I'm thinking of building small OSD's with 8 disks
[15:45] <nunoes> with 10G Ethernet interfaces for public and replication
[15:45] <janos> IOPS without all sorts of parameters exmplained are BS
[15:46] <janos> *explained
[15:46] <nunoes> SSD for journals
[15:46] <tnt> tbh, ceph isn't the best performance-wise (from my personal experience), but it gets better with each release :p
[15:46] <nunoes> but what type of disks should i use for the storage?
[15:47] <janos> ceph's major win is not raw speed. it's flexibility imo
[15:47] <janos> flexibility along a few different axes too. management, hardware, failure, usage scenarios
[15:48] <tnt> yes, and it can also provide much more than raw block devices but all that has a cost.
[15:48] <nunoes> yes but in order to run some virtual machines i need some degree of performance.
[15:48] <tnt> nunoes: if you want more IOPS ... get 15k SAS.
[15:48] <tnt> whether you really need that is another matter.
[15:48] <nunoes> correct
[15:49] <nunoes> that's what i don't know :)
[15:49] <tnt> personally I have the VM disks on 10k SAS disks with a battery-backed cache raid card (in single drive RAID-0, per osd).
[15:50] <tnt> But I have the bulk of the application data stored in a S3/RadosGW pool which is assigned to simple 7.2k 2T drives.
[15:50] <tnt> (and tbh, they're sitting IDLE 95% of the time :p)
[15:51] <nunoes> do you have ssd for journal?
[15:52] <tnt> no, I have the journals on the same drive. which is definitely not ideal.
[15:52] * jshen (~jshen@ has joined #ceph
[15:52] <janos> i could be wrong, but ssd seems more critical for speed gains with fewer osd's
[15:53] <janos> i would imagine a massive install isn't going to need them as much since writes are spread over so much hardware
[15:53] <nunoes> I got the impression SSD would allow for the use of 7.2k drives instead of having 10k or 15k drives.
[15:53] <tnt> janos: I guess it depends on the workload. if you have continuous operation, you'll fill the journal and the drives will have to follow the full rate anyway.
[15:53] <joao> janos, my guess is that it also depends on the workload imposed on such deployment :p
[15:53] <janos> yeah
[15:53] <janos> agreed
[15:54] <tnt> so ... back to "depends on what will run on it" :P
[15:54] <nunoes> Well, as far as i know, it will be mainly windows server, sqlserver
[15:54] <joao> hardly ever will you find a one-size-fits-all kind of solution
[15:55] <tnt> In my case I know that my limiting factor is not even ceph but some weird interaction between RBD and Ceph, then the network (using 2x1G bonded), then the actual drives.
[15:55] <tnt> between RBD and Xen I meant.
[15:56] <nunoes> I do get the feeling i'll only find out after testing :)
[15:56] <nunoes> my scenario,
[15:56] <nunoes> my issue here is that i need to prove my scenario first :)
[15:56] <nunoes> talk about a tough situation
[15:58] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[15:59] <joao> nunoes, if you have the available hardware, creating a simple poc to test your use-case should be straightforward; then you probably will have to fine tune it according to your needs I suppose
[16:00] * Volture (~quassel@office.meganet.ru) Quit (Remote host closed the connection)
[16:00] <joao> and then factor in scalability, self-healing and all that; then again, I'm hardly a specialist in this sort of thing
[16:00] <nunoes> Yes i agree.
[16:00] <scuttlemonkey> hey nunoes, a bit late to the conversation...but inktank has actually done several of what you are looking at if you are interested in someone to help
[16:01] <scuttlemonkey> have had a couple places that wanted to do a PoC, tweak performance, and stack against proprietary options
[16:02] <scuttlemonkey> last I heard we were stacking up quite favorably against the big guys, not just on features like self-managing/-healing, but also on raw performance...once tuned
[16:03] <nunoes> that could be an idea
[16:03] <jmlowe> I'm in need of some help from sage or sjust, they normally get in in an hour or so?
[16:03] <nhm> nunoes: heya
[16:03] <joao> jmlowe, they're usually around 17h GMT
[16:03] <joao> *at
[16:04] <jmlowe> ah so 2 hours from now
[16:04] <joao> yes
[16:04] <nhm> nunoes: How many IOPs are you guys targeting?
[16:04] <nhm> jmlowe: sjust might not be in until a bit later
[16:05] <nhm> Sage some times is on as early as 8am PST though (but might not be)
[16:05] <jmlowe> I had an osd crash and wound up with an inconsistent pg, don't know enough about the filestore layout to describe my bug
[16:06] * yehudasa_ (~yehudasa@2602:306:330b:1410:ea03:9aff:fe98:e8ff) has joined #ceph
[16:06] <nunoes> nhm: Hi, someone fed me about 2000 IOPS
[16:06] <nhm> nunoes: and how many disks?
[16:07] <jmlowe> I see the object on disk but it is marked as missing on the primary, so either something very unusual is going on or my understanding of the filestore is faulty
[16:08] * LeaChim (~LeaChim@ has joined #ceph
[16:08] <jmlowe> even more unusual is the secondary crashed, the primary stayed up the whole time but it is the one missing objects
[16:08] <nhm> jmlowe: I seem to remember Sage talking about some kind of repair utility, but I don't know how it works.
[16:08] <saaby> jmlowe: ceph pg repair <pg_num>
[16:09] <nhm> that's probably it. :)
[16:09] <saaby> :)
[16:09] <saaby> that should do it if they are just marked "inconsistent"
[16:09] <jmlowe> saaby: repair got my secondary in shape, it was also missing objects
[16:09] <saaby> ouch, so it's incomplete too?
[16:10] <jmlowe> I had a trim operation running and I'm guessing somewhere the journal has lost some transactions
[16:11] <nhm> jmlowe: yeah, this sounds like a Sage/Sam question.
[16:12] <jmlowe> saaby: well my secondary is complete, and the objects in question are on the disk, so at this point I don't even know what questions to ask
[16:12] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[16:12] <saaby> ok, do you see incomplete/inconsistent pg's reported from "ceph health detail" ?
[16:14] <jmlowe> also, there is this, I appear to have an old copy of the objects in the secondary but the primary has only one copy
[16:14] <jmlowe> find 2.363_head/ -name 'rb.0.132c.238e1f29.000000006773*'
[16:14] <jmlowe> 2.363_head/DIR_3/DIR_6/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[16:14] <jmlowe> 2.363_head/DIR_3/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[16:15] <jmlowe> primary:
[16:15] <nunoes> nhm: i would say the ones needed to achieve or surpass that value
[16:15] <jmlowe> find 2.363_head/ -name 'rb.0.132c.238e1f29.000000006773*'
[16:15] <jmlowe> 2.363_head/DIR_3/DIR_6/DIR_F/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[16:16] <nunoes> nhm: i haven't implemented anything yet so i'm free to implement what makes sense.
[16:16] <jmlowe> I thought objects on the primary and secondary would have the same path, but they don't
[16:16] <nhm> nunoes: is this for block storage or?
[16:18] <nunoes> nhm: it's for keeping OpenStack VM disks
[16:18] <nhm> nunoes: ok, using KVM?
[16:18] <nunoes> nhm: yes
[16:23] <nunoes> nhm: any ideas?
[16:25] <nhm> nunoes: yeah, was just looking over some of my data.
[16:26] * BillK (~BillK@124-169-77-36.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[16:28] <nhm> nunoes: so basically I think 2k IOPs should be entirely doable with a few 2U nodes with 12 spinning disks each and SSD journals. The thing you have to watch out for are any issues that could cause high latency, especially if you are doing lots of replication.
[16:30] <nunoes> nhm: I'm going to use 10G ethernet links, with different links for public and replication
[16:31] <nhm> nunoes: that will help. I've found that using controllers like the SAS9207-8i with SSD journals and no expanders in the backplane tends to yield the best performance.
[16:32] <nhm> nunoes: depending on how much storage you need, it might not be as important if you can just throw more servers at the problem.
[16:32] <nunoes> nhm: What types of spinning disks are you talking about?
[16:33] <nhm> nunoes: my test rig has 1TB 7200rpm enterprise SATA disks.
[16:33] <nhm> nunoes: higher RPM drives are certainly nice. :)
[16:35] * BillK (~BillK@124-169-221-201.dyn.iinet.net.au) has joined #ceph
[16:38] * sleinen1 (~Adium@2001:620:0:26:98b0:ccf:ca66:cf2) Quit (Quit: Leaving.)
[16:38] * sleinen (~Adium@ext-dhcp-231.eduroam.unibe.ch) has joined #ceph
[16:39] <nunoes> nhm: do you think there is a correlation between the number of drives/hosts and performance? e.g. If i need more iops, instead of using 10k drives, i'll add another osd...
[16:40] <nunoes> nhm: something like more hosts = more iops/performance
[16:40] <nhm> nunoes: It's complicated. It depends on the number of clients, how many concurrent IOs the app does, how many ops are backing up on any single OSD, etc.
[16:40] * jshen (~jshen@ Quit (Read error: Operation timed out)
[16:41] <nhm> nunoes: If you have lots of concurrency on the client side and the OSDs are well balanced (IE none of the disks are behaving badly) adding more OSDs should improve aggregate IOPS.
[16:42] * sleinen (~Adium@ext-dhcp-231.eduroam.unibe.ch) Quit (Read error: Operation timed out)
[16:49] <nunoes> nhm: thanks. I'll start with a simple scenario with two nodes each using 12 disks, and try to do some testing.
[16:49] * tkensiski (~tkensiski@108.sub-70-197-11.myvzw.com) has joined #ceph
[16:49] <nhm> nunoes: cool. Do you already have some nodes you can use?
[16:49] * tkensiski (~tkensiski@108.sub-70-197-11.myvzw.com) has left #ceph
[16:51] <nunoes> nhm: no i have to order them. i think in two weeks i'll have them
[16:51] <nhm> nunoes: what kind of servers?
[16:52] <nunoes> i'm looking at supermicro and intel
[16:53] <nunoes> nhm: any thoughts on that?
[16:53] * Rocky (~r.nap@ has joined #ceph
[16:53] <nhm> nunoes: the A version of the supermicro chassis with 2 controllers looks like it would be a potent combination with SAS9207-8i controllers and DC S3700 SSDs.
[16:54] <nhm> nunoes: never tested it directly in that chassis, but I've got the SC847A chassis and Intel 520 SSDs and SAS9207-8is and it does very well.
[16:55] <nunoes> actually they are proposing the 9211-8i
[16:57] <nhm> nunoes: The SAS2008 seems to work pretty well so long as you've got it in JBOD mode.
[16:58] <nhm> nunoes: but in both cases with that controller you really do need to have SSDs for the journals.
[16:58] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[16:58] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[16:58] <nhm> nunoes: the lack of WB cache really hurts if the journals are on the same disks as the data.
[16:59] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:02] <nunoes> nhm: I'm looking at the SC826E
[17:05] * andrzej (~chatzilla@ has joined #ceph
[17:05] * andrzej (~chatzilla@ Quit ()
[17:06] <nhm> nunoes: Might work fine. On some nodes I've seen behavior that makes me think the expander backplanes are hurting performance, but it's not clear for certain how much of an effect it has.
[17:07] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Quit: Ex-Chat)
[17:07] * Marcos (~Marco@ has joined #ceph
[17:07] * Wolff_John (~jwolff@vpn.monarch-beverage.com) has joined #ceph
[17:10] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Quit: afk)
[17:10] <nhm> nunoes: anyway, if you are thinking at all about support, it might be worth doing it sooner rather than later so you can get Inktank involved early on in the process.
[17:12] <nunoes> nhm: yes i'm thinking about that, but first i have to do a proof of concept. I'll let you know how it goes.
[17:13] <Marcos> hello, i am installing radosgw on CentOS. 'hostname -f' returns 'ceph.example.com', 'hostname -s' returns 'ceph' and 'hostname' returns 'ceph.example.com'. In ceph.conf, in the [client.radosgw.gateway] section, should 'host' be set to 'ceph'? Am I correct?
[17:13] <nhm> nunoes: sure, I know how it goes.
[17:15] <nhm> nunoes: if you do end up going with the SC826E chassis I'd be interested to see how it does. I haven't been able to do much testing with the expanders they use.
[17:19] <nunoes> nhm: ok. i'll see what my supplier can spare for testing.
[17:19] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:21] <nhm> nunoes: cool. Like I said, I suspect the A chassis with 2 controllers may still be faster, but it's entirely possible the E series may do as well.
[17:23] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[17:32] <nunoes> nhm: I'll ask for a comparison of the A and E chassis
[17:32] * gucki (~smuxi@77-56-36-164.dclient.hispeed.ch) has joined #ceph
[17:32] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[17:33] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Remote host closed the connection)
[17:34] * jgallard (~jgallard@gw-aql-129.aql.fr) has joined #ceph
[17:34] <nunoes> nhm: a quad core and 16GB RAM sound good for the 12 disk setup?
[17:36] * aliguori (~anthony@ has joined #ceph
[17:36] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[17:41] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[17:52] * kyle_ (~kyle@ has joined #ceph
[17:52] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:55] * jgallard (~jgallard@gw-aql-129.aql.fr) Quit (Quit: Leaving)
[17:57] * KindTwo (~KindOne@h36.236.22.98.dynamic.ip.windstream.net) has joined #ceph
[17:58] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:58] * KindTwo is now known as KindOne
[18:01] * Cube (~Cube@ has joined #ceph
[18:06] * tnt (~tnt@ has joined #ceph
[18:08] * scalability-junk (uid6422@id-6422.tooting.irccloud.com) Quit (Remote host closed the connection)
[18:08] * Tribaal (uid3081@id-3081.hillingdon.irccloud.com) Quit (Remote host closed the connection)
[18:08] * gregaf1 (~Adium@2607:f298:a:607:91c6:a098:e1e7:319f) Quit (Quit: Leaving.)
[18:09] * gregaf (~Adium@ has joined #ceph
[18:10] * Wolff_John (~jwolff@vpn.monarch-beverage.com) Quit (Ping timeout: 480 seconds)
[18:11] * Volture (~quassel@office.meganet.ru) has joined #ceph
[18:11] * aliguori (~anthony@ Quit (Ping timeout: 480 seconds)
[18:11] * leseb (~Adium@ Quit (Quit: Leaving.)
[18:17] * guilhemL (~guilhem@tui75-3-88-168-236-26.fbx.proxad.net) Quit (Remote host closed the connection)
[18:17] <jmlowe> sage: I need some help with recovery and filing at least one bug
[18:19] * tkensiski (~tkensiski@ has joined #ceph
[18:19] * tkensiski (~tkensiski@ has left #ceph
[18:30] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[18:32] * portante (~user@ has joined #ceph
[18:34] * sleinen (~Adium@2001:620:0:26:9d2:e876:6d75:fa65) has joined #ceph
[18:37] <nhm> nunoes: Might be better off with a 6-core, but a 3GHz+ quad core is probably enough.
[18:38] <nhm> nunoes: if you are running the mon(s) on the same machines you might want to go with the hexacore.
[18:38] <nhm> Something like an E5-2630 would do in that case.
[18:45] <nhm> hrm, something like an E3-1240 v2 would probably work fine and might be a bit cheaper.
[18:46] <nunoes> i was thinking more of an Opteron 6320
[18:46] <nunoes> 8 cores at 2.8
[18:47] <nhm> nunoes: I haven't tested an opteron in a high performance configuration, so can't say much about it.
[18:48] * The_Bishop_ (~bishop@f052100129.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[18:48] <kyle_> hello all. I've been having trouble creating osds since ceph-deploy started being used. Is it even possible to use osd create when i only have one disk (/dev/sda) with the OS partitioned off? If I provide /dev/sda as the disk, it says it cannot be mounted. If i create a new partition and provide that to osd create, it never gets past "Preparing host data0 disk /dev/sda6......"
[18:49] <nunoes> neither have i :) but since the KVM host will also have Opterons, i'm keeping the same cpu for this whole project
[18:52] <nunoes> nhm: also, for the monitors, is it ok to run them on the same node as the OSDs? it sure lowers the investment, but is it really ok?
[18:53] <nhm> nunoes: there can be some overhead, both in terms of CPU/Memory, and fsyncs. It's hard to say how much without testing because it depends on the clients, number of OSDs, etc.
[18:54] <nunoes> nhm: i'll go with dedicated hardware for the mon
[18:55] <nhm> nunoes: for the POC, running the mon on the OSD nodes may be ok.
[18:55] <nhm> nunoes: but in production you may want to put it on a management node or dedicated node.
[18:56] <nhm> nunoes: and you may want 3 mons for added redundancy.
[18:56] <nunoes> nhm: ok, looking at the hardware requirements, it says 1 GB per daemon; what daemon is this? the osd's?
[18:56] <nhm> nunoes: fyi there's a bug that we are fighting right now in 0.61.2 that may cause a lot of extra mon disk and CPU usage. Should be a new point release soon that helps.
[18:57] * loicd (~loic@LLagny-156-36-54-154.w80-13.abo.wanadoo.fr) has joined #ceph
[18:57] <nhm> nunoes: yes, but I like to spec closer to 2GB per OSD personally.
[18:57] <nhm> nunoes: IE 24GB or 32GB of ram in a 2U OSD node instead of 16GB.
[18:58] <nunoes> nhm: u got me confused :)
[18:58] <nhm> nunoes: so in a 2U node, if you have 12 disks/osds, 24+GB of ram is how I like to spec them.
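The sizing above is just arithmetic on nhm's personal rule of thumb (roughly 2 GB of RAM per OSD, which is deliberately above the 1 GB-per-daemon floor in the docs):

```shell
# ~2 GB of RAM per OSD daemon (nhm's personal spec, not an official
# requirement); a 12-disk 2U node then wants 24+ GB.
osds=12
gb_per_osd=2
echo "$(( osds * gb_per_osd )) GB"
```
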
[18:59] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Ping timeout: 480 seconds)
[19:00] <nunoes> nhm: are u talking about the memory of the 2U node or the monitor node?
[19:00] <kyle_> Does anyone know if it's okay to use a partition such as /dev/sda6 as the disk and journal for "osd create"? Or where i can look to figure out why "osd create" does not ever finish the prepare step? Or should i not be using RAID configurations for OSDs?
[19:00] * aliguori (~anthony@ has joined #ceph
[19:00] <nhm> nunoes: 2U node.
[19:02] <nunoes> nhm: ah! ok! i was talking about the monitor.
[19:02] * gucki_ (~smuxi@77-56-36-164.dclient.hispeed.ch) has joined #ceph
[19:03] * sleinen (~Adium@2001:620:0:26:9d2:e876:6d75:fa65) Quit (Quit: Leaving.)
[19:04] <nhm> nunoes: oh, it can be kind of variable depending on the size of the cluster. 16GB should be more than enough for the POC.
[19:06] * gucki_ (~smuxi@77-56-36-164.dclient.hispeed.ch) Quit (Remote host closed the connection)
[19:06] <nunoes> nhm: ok, but the guidelines on the site state 1G per daemon; is this daemon the monitor daemon itself, or is it per daemon on the 2U nodes?
[19:08] <nhm> nunoes: link?
[19:08] <nunoes> nhm: http://ceph.com/docs/next/install/hardware-recommendations/
[19:10] <jmlowe> sjust: sage: either of you around?
[19:10] <nhm> nunoes: oh, they mean per mon-daemon, but I'd spec more than that.
[19:10] <sjust> jmlowe: I'm here
[19:11] <nhm> nunoes: often for small POCs the mon will hang around 200-300MB of ram, but I've seen it climb quite a bit higher in some scenarios.
[19:11] <jmlowe> ok, great, I think I need to file a bug but my understanding of what went wrong and what the state of my filestore is keeping me from doing that
[19:12] <jmlowe> starts with this: http://tracker.ceph.com/issues/5163
[19:13] <jmlowe> I restart the osd, recovery happens, periodic scrub marks one pg inconsistent
[19:13] <nunoes> nhm: ok i have hardware around to handle that, 8G RAM + dual core xeon
[19:13] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:13] <sjust> jmlowe: inconsistent how?
[19:14] <jmlowe> secondary crashed, specifically this pg 2.363 [1,9], I had 4 of these disk size (2297856) does not match object info size (4177920)
[19:14] <jmlowe> I managed to fix those
[19:14] <jmlowe> I had a bunch of missing objects on the primary and secondary
[19:14] <jmlowe> secondary has all its objects after ceph repair
[19:15] <jmlowe> primary is missing 57 objects as of last scrub
[19:15] * loicd (~loic@LLagny-156-36-54-154.w80-13.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[19:15] <jmlowe> now here is where it gets weird and I'm out of my depth
[19:15] <jmlowe> the objects are there but aren't found
[19:15] <jmlowe> find 2.363_head/ -name 'rb.0.132c.238e1f29.000000006773*'
[19:15] <jmlowe> 2.363_head/DIR_3/DIR_6/DIR_F/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:15] <jmlowe> and on the secondary
[19:15] <jmlowe> find 2.363_head/ -name 'rb.0.132c.238e1f29.000000006773*'
[19:15] <jmlowe> 2.363_head/DIR_3/DIR_6/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:15] <jmlowe> 2.363_head/DIR_3/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:16] * madkiss (~madkiss@ has joined #ceph
[19:18] <jmlowe> why does the secondary have two copies of the same object? why does the primary think the object is missing? Why do the paths differ for the most recent copies on the primary and secondary?
[19:19] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[19:19] <jmlowe> what exactly would I put in a bug report that would describe this?
[19:20] <jmlowe> If I didn't know better I would say that I lost some transactions from the journal during the initial crash/recovery
[19:21] <elder> joshd, let me know when you're in
[19:23] <jmlowe> oh, and how do I make that pg consistent again?
[19:25] * davidzlap (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:26] <nhm> jmlowe: one sec, those guys are in a short meeting
[19:26] <sjust> jmlowe: what do you mean you lost some transactions?
[19:27] <jmlowe> well, one of the rbd devices was doing some fstrim's, so maybe that's how I wound up with missing objects, the trim/discard operation didn't complete
[19:28] <sjust> losing journal entries could definitely cause this
[19:29] <jmlowe> yeah, journal entries not transactions is what I meant
[19:30] <sjust> are your journals on the same disk as the filestore?
[19:30] <jmlowe> yes
[19:30] <sjust> xfs?
[19:30] <jmlowe> yes
[19:31] <jmlowe> no filesystem or other os problems in dmesg
[19:31] <sjust> do you have journal aio = true/
[19:31] <sjust> ?
[19:31] <jmlowe> checking
[19:32] <jmlowe> not set in ceph.conf
[19:32] <jmlowe> so whatever the default is
[19:32] <sjust> did this initially happen prior to cuttlefish?
[19:32] <jmlowe> no
[19:32] <sjust> can you verify that those two "copies" are the same inode?
[19:33] <jmlowe> happened Friday 2013-05-24
[19:33] <jmlowe> they have different mtimes
[19:33] <sjust> that's surprising
[19:34] * Tribaal (uid3081@id-3081.hillingdon.irccloud.com) has joined #ceph
[19:34] <jmlowe> ls -l 2.363_head/DIR_3/DIR_6/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2 2.363_head/DIR_3/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:34] <jmlowe> -rw-r--r-- 1 root root 4194304 May 26 06:44 2.363_head/DIR_3/DIR_6/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:34] <jmlowe> -rw-r--r-- 1 root root 4194304 Mar 27 18:21 2.363_head/DIR_3/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:35] * madkiss (~madkiss@ Quit (Quit: Leaving.)
[19:35] <jmlowe> I'm also the one who triggered the 0.61.1 release with the extra stuff in my filestores
[19:35] <jmlowe> and I split my pg's
[19:36] <kyle_> Does anyone know if it's okay to use a partition such as /dev/sda6 as the disk and journal for "osd create"? Or where i can look to figure out why "osd create" does not ever finish the prepare step? Or should i not be using a RAID configuration for OSDs (one disk with the OS partitioned off)?
[19:36] <jmlowe> the one with the two copies is the secondary and appears to be complete, if I understand the err's from the deep-scrub
[19:37] <sjust> complete?
[19:37] <sjust> I think this is unrelated to the 61.1 bug
[19:37] <jmlowe> well, not missing objects
[19:37] <sjust> ok
[19:38] <jmlowe> from what I can tell 2.363_head/DIR_3/DIR_6 is the real one, 2.363_head/DIR_3 and 2.363_head/DIR_3/DIR_6/DIR_F are not recognized
[19:40] <sjust> 2.363_head/DIR_3/DIR_6/DIR_F exists?
[19:42] <jmlowe> on osd.1 which says "osd.1 missing e9beff63/rb.0.132c.238e1f29.000000006773/head//2" but I have this
[19:42] <jmlowe> ls -l 2.363_head/DIR_3/DIR_6/DIR_F/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:42] <jmlowe> -rw-r--r-- 1 root root 4194304 Mar 16 16:30 2.363_head/DIR_3/DIR_6/DIR_F/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:42] * rturk-away is now known as rturk
[19:43] <jmlowe> 2.363_head/DIR_3/DIR_6/DIR_F does not exist on osd.9
[19:45] <sjust> is there a DIR_3/DIR_6/DIR_F/DIR_F on osd.1?
[19:45] <jmlowe> yes
[19:45] <jmlowe> wait
[19:45] <sjust> sorry, 2.363_head/DIR_3/DIR_6/DIR_F/DIR_F
[19:46] <jmlowe> for osd.1: yes to 2.363_head/DIR_3/DIR_6/DIR_F/ no to 2.363_head/DIR_3/DIR_6/DIR_F/DIR_F
[19:46] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[19:48] <jmlowe> for osd.9: no to 2.363_head/DIR_3/DIR_6/DIR_F
[19:48] <sjust> which osds failed when?
[19:49] <jmlowe> osd.1 has never faild, osd.9 failed a few hours before a scrub picked up the inconsistent pg
[19:49] <jmlowe> that was friday
[19:49] <sjust> which currently is primary?
[19:49] <jmlowe> osd.1
[19:50] * Wolff_John (~jwolff@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[19:50] <sjust> can you do a read on rb.0.132c.238e1f29.000000006773
[19:50] <sjust> using rados get
[19:50] <jmlowe> 2.363 1130 0 0 0 4497661952 0 0 active+clean+inconsistent 2013-05-27 19:21:53.227457 10782'262433 10779'708188 [1,9] [1,9] 10782'260820 2013-05-27 19:21:53.227345 10782'260820 2013-05-27 19:21:53.227345
[19:54] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:55] <jmlowe> ok I can get the object and it seems to match the md5 sum of osd9: 2.363_head/DIR_3/DIR_6/rb.0.132c.238e1f29.000000006773__head_E9BEFF63__2
[19:56] * Wolff_John (~jwolff@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 21.0/20130511120803])
[19:57] <sjust> ok, osd.9 seems to have a strangely corrupt subdir structure, seems like xfs lost (or we failed to adequately fsync) the in progress split marker on 2.363_head/DIR_3/DIR_6
[19:57] <sjust> or perhaps the in progress merge
[19:57] <sjust> which might make more sense
[19:57] <sjust> you say osd.1 shows up as having missing objects
[19:57] <sjust> even though it hadn't had a failure?
[19:58] <jmlowe> yes, that is correct
[19:58] <sjust> the two copies on osd.9, are they similar?
[19:58] <sjust> lots of overlapping byte ranges?
[19:58] * scalability-junk (uid6422@id-6422.hillingdon.irccloud.com) has joined #ceph
[19:58] <sjust> I think osd.1 probably is correct and osd.9 has zombie objects
[19:58] <jmlowe> same size
[19:59] <jmlowe> huh, identical
[20:00] <sjust> can you post a ls -lahr on the osd.9 and osd.1 2.363 directories?
[20:02] * dmick (~dmick@2607:f298:a:607:2de1:5205:5306:e5df) has joined #ceph
[20:04] <sjust> well, I think I know what the problem is, I didn't fsync the tag on start_merge and start_split
[20:05] <jmlowe> also after changing pg_num, I never changed pgp_num
[20:05] <jmlowe> if that makes any difference
[20:05] <sjust> pgp_num only affects placement
[20:05] <jmlowe> ok
[20:05] <sjust> though you don't get any benefit from adjusting pg_num until you adjust pgp_num to match
[20:05] <jmlowe> I'm assuming you want the whole tree
[20:06] <sjust> just for that pg, but yeah
[20:06] <jmlowe> this doesn't recurse like I'd expect -lahR
[20:06] <sjust> oh, I meant the one that recurses
[20:06] <sjust> oops
[20:06] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[20:07] * eegiks (~quassel@2a01:e35:8a2c:b230:8cbd:e0cd:45e7:762c) Quit (Ping timeout: 480 seconds)
[20:10] <jmlowe> trying to figure out the correct arguments for tree
[20:10] <sjust> yeah
[20:11] <joshd> elder: I'm here
[20:12] <jmlowe> just the filenames or the metadata also?
[20:13] <sjust> mostly filenames
[20:13] <sjust> I think I know what happened, need to confirm and let you know the right way to clean up
[20:13] <sjust> although actually the best way would probably be to remove the osd.9 copy of the pg and let it recover
[20:13] <sjust> I'll take a look first though
[20:15] * The_Bishop (~bishop@2001:470:50b6:0:d14c:a623:a4fd:1381) has joined #ceph
[20:19] <jmlowe> https://iu.box.com/s/oas9a97oxv0m4w2f34q4
[20:19] <jmlowe> https://iu.box.com/s/7y3nrfz5coq159yvzk8u
[20:20] <sjust> sagewk: wip-5180 review?
[20:20] * newbie82 (~kvirc@pool-71-164-242-68.dllstx.fios.verizon.net) has joined #ceph
[20:21] * gucki (~smuxi@77-56-36-164.dclient.hispeed.ch) Quit (Remote host closed the connection)
[20:21] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[20:21] * Vjarjadian (~IceChat77@ has joined #ceph
[20:27] * eschnou (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:27] <sagewk> sjust: looking
[20:27] * eegiks (~quassel@2a01:e35:8a2c:b230:3c6d:e42b:7cbf:28db) has joined #ceph
[20:28] <sagewk> sjust: was this on ext4?
[20:28] <sjust> xfs
[20:29] <sjust> it's on a distinct directory from the actual split merge operation
[20:29] <sagewk> elder: i thought that the xfs journal would effectively order all metadata/namespace operations
[20:29] <sjust> so it's not necessarily ordered?
[20:29] <elder> Hmm.
[20:29] <sjust> elder: and does that necessarily include directory xattr updates
[20:29] <sjust> ?
[20:29] <sagewk> i think it's needed for ext4 tho
[20:29] <sagewk> oh, it probably does not include xattrs....
[20:29] <elder> I have to look at the history here, what you're talking about.
[20:29] * newbie82 (~kvirc@pool-71-164-242-68.dllstx.fios.verizon.net) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[20:29] <elder> xattrs are metadata, essentially.
[20:30] <sjust> the order would be:
[20:30] <sjust> set xattr(../)
[20:30] <sjust> link (../a/b/c ../a/c)
[20:30] <sjust> would the link necessarily order after the set xattr?
[20:32] <elder> OK, I really haven't caught up on the context of this, but...
[20:33] <elder> Since the operands of those two operations (xattr has ../, link has ../a/b/c and ../a/c and the two involved parent directories) I'm not sure if they are necessarily ordered.
[20:33] <sjust> yeah
[20:34] <elder> But I really don't know that for sure.
[20:34] <sagewk> regardless, it can't hurt.
[20:34] <sjust> and it's not a common operation
[20:34] <elder> Oh, I didn't complete that sentence. Since they're independent, I'm not sure they're ordered.
[20:35] <sagewk> elder: i thought that since all the metadata ops hit the journal, in general they would be.. the question would be whether the xattr contents are journaled, or whether it's wonky write-back ala file data
[20:35] <elder> No, xattr ops are journaled.
[20:35] <elder> Just like metadata.
[20:35] <elder> That includes xattr data.
[20:36] <sagewk> wouldn't that enforce an ordering then?
[20:36] <elder> You're probably right though, from a practical standpoint I suspect they will be ordered, even if they may technically not need to be.
[20:36] <elder> Let me see if I can quickly get a little more solid answer. Give me a few minutes.
[20:37] <jmlowe> sjust: would nobarrier have made a difference? my raid controller is battery-backed
[20:37] <sjust> you have nobarrier on?
[20:38] <jmlowe> correct, all the osd's had it on, over the weekend I had to shutdown half and when I brought that half back up I didn't set no barrier
[20:38] <tnt> Anyone knows if 0.61.3 is coming soon ?
[20:38] <sjust> please do not turn on nobarrier.
[20:39] <jmlowe> sjust: is that serious enough I should turn it off everywhere right now?
[20:40] <sjust> jmlowe: I don't really know, it depends a lot on exactly what the raid controller is doing I suppose
[20:40] * bergerx_ (~bekir@ Quit (Quit: Leaving.)
[20:41] <sjust> you probably want to go ahead and turn that off.
[20:41] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) Quit (Read error: Operation timed out)
[20:45] <elder> OK, sagewk, sjust, it looks to me after a fairly quick scan through the code that Sage is right: the xattr operation will commit its transaction before the subsequent link transaction, and as a result it will get logged first.
[20:45] <elder> And that basically means it got done first, in that order.
[20:47] <elder> xattrs in XFS have this funky "rolling transaction" concept and while I don't think that affects that answer, I'm not 100% sure either.
[20:52] * Marcos (~Marco@ Quit (Quit: ChatZilla 0.9.90 [Firefox 20.0/20130329030848])
[20:52] * Volture (~quassel@office.meganet.ru) Quit (Remote host closed the connection)
[20:53] * eschenal (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:58] * eschenal (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) Quit (Quit: Leaving)
[20:59] * eschenal (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:00] * eschnou (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[21:04] * vata (~vata@2607:fad8:4:6:9541:d2f3:3800:4fec) has joined #ceph
[21:05] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[21:06] * sleinen1 (~Adium@2001:620:0:26:496b:e6e7:230c:b2ff) has joined #ceph
[21:12] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[21:13] <sjust> elder: thanks
[21:13] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[21:15] <sjust> jmlowe: I think you can recover by removing all of the non-directories from 2.363_head/DIR_3 on osd.9 (you should be able to find each of them in 2.363_head/DIR_3/DIR_6/ as well)
[21:15] * fridudad (~oftc-webi@p5B09D970.dip0.t-ipconnect.de) has joined #ceph
[21:16] * tziOm (~bjornar@ti0099a340-dhcp0745.bb.online.no) has joined #ceph
[21:19] <jmlowe> sjust: then what do I need to do?
[21:19] * eschenal (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) Quit (Read error: Connection reset by peer)
[21:19] <sjust> restart the osd
[21:19] <fridudad> upgrading bobtail to cuttlefish. I'm not sure how to upgrade the monitors. Should i install new binaries and then restart all monitors at once? or should i restart one after another? Is it a problem that the clients are still bobtail?
[21:19] <jmlowe> sjust: ok
[21:19] <sjust> jmlowe: instead of removing
[21:19] <jmlowe> move them ?
[21:19] <sjust> maybe just rename into a tmp dir out of the way
[21:19] <sjust> but yeah, should work fine
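The cleanup sjust describes can be rehearsed safely first. This sketch rebuilds the layout from the log in a throwaway temp directory and quarantines only the plain files sitting directly in DIR_3, leaving DIR_6 intact; on the real node the base would be /data/osd.9/current, with the OSD stopped first:

```shell
# Simulation in a scratch directory (file names below mimic the log).
base=$(mktemp -d)
mkdir -p "$base/current/2.363_head/DIR_3/DIR_6"
touch "$base/current/2.363_head/DIR_3/rb.0.132c.stray__head_E9BEFF63__2" \
      "$base/current/2.363_head/DIR_3/DIR_6/rb.0.132c.good__head_E9BEFF63__2"
mkdir -p "$base/2.363_repair_tmp"

# quarantine (don't delete) the non-directories sitting directly in DIR_3;
# -maxdepth 1 leaves DIR_6 and its contents untouched
find "$base/current/2.363_head/DIR_3" -maxdepth 1 -type f \
     -exec mv -t "$base/2.363_repair_tmp" {} +

ls "$base/current/2.363_head/DIR_3"     # only DIR_6 remains
```
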
[21:19] <kyle_> Does anyone know if it's okay to use a partition such as /dev/sda6 as the disk and journal for "osd create"? Or where i can look to figure out why "osd create" does not ever finish the prepare step? Or should i not be using a RAID configurations for OSDs (one disk with OS partioned off)?
[21:19] <jmlowe> ok, stand by
[21:23] <jmlowe> my current/2.363_head only has DIR_3 now
[21:23] <PerlStalker> Crap. I just had an osd segfault on me.
[21:23] <sjust> ok, restart the osd
[21:24] <jmlowe> wait, that wasn't exactly what you asked for
[21:24] <sjust> ?
[21:24] <jmlowe> move those back?
[21:24] <sjust> wait, you had files in 2.363_head/
[21:24] <sjust> ?
[21:24] <jmlowe> nevermind, I did mv current/2.363_head/DIR_3/rb* 2.363_repair_tmp/
[21:24] <sjust> that's right
[21:25] <jmlowe> root@gwioss1:/data/osd.9# ls current/2.363_head/DIR_3/
[21:25] <jmlowe> DIR_6
[21:25] <sjust> correct
[21:25] <sjust> ok, restart the osd
[21:25] <sjust> and rescrub
[21:26] <jmlowe> 2699 active+clean, 1 active+clean+inconsistent
[21:26] <jmlowe> now scrubbing
[21:26] <saaby> PerlStalker: cuttlefish?
[21:27] <PerlStalker> bobtail
[21:27] * Volture (~quassel@office.meganet.ru) has joined #ceph
[21:27] <jmlowe> scrub stat mismatch, got 993/1130 objects, 7/7 clones, 3959644160/4497661952 bytes.
[21:27] <saaby> ok, then it's probably not the same as what we see: http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/15138
[21:27] <jmlowe> 2.363 scrub 236 missing, 0 inconsistent objects
[21:27] <jmlowe> 2013-05-28 15:26:38.227918 osd.1 [ERR] 2.363 scrub 237 errors
[21:27] <jmlowe> they all appear to be from osd.9
[21:28] <jmlowe> osd.1 [ERR] 2.363 osd.9 missing e9beff63/rb.0.132c.238e1f29.000000006773/head//2
[21:28] <Volture> Hi all
[21:28] <jmlowe> sjust: do I need to repair now?
[21:28] <Volture> Please tell me how to deal with it? http://pastebin.com/F2pesTYr
[21:33] <sjust> yeah
[21:33] <sjust> wait
[21:33] <sjust> 2.363 has missing objects?
[21:33] <sjust> ok, post the lses again
[21:34] <jmlowe> fresh ones?
[21:34] <sjust> yeah
[21:34] <sjust> that should have fixed it
[21:35] <jmlowe> this does look better "got 993/1130 objects, 7/7 clones, 3959644160/4497661952 bytes" it looked inverted before
[21:35] <jmlowe> deep-scrub 235 missing, 1 inconsistent objects
[21:35] * `10 (~10@juke.fm) has joined #ceph
[21:35] <kyle_> what is the ideal way to configure an OSD in RAID10? OS on a separate partition and then create a second partition to point "ceph-deploy osd create" at?
[21:35] <jmlowe> osd.9: soid 14115f63/rb.0.24fc.238e1f29.000000000a56/head//2 digest 135122252 != known digest 2868628257, size 2715648 != known size 4177920
[21:36] <fridudad> upgrading bobtail to cuttlefish. I'm not sure how to upgrade the monitors. Should i install new binaries and then restart all monitors at once? or should i restart one after another?
[21:36] * `10_ (~10@juke.fm) has joined #ceph
[21:36] <kyle_> i believe one after another is the correct way.
[21:36] <Volture> Please tell me how to deal with it? http://pastebin.com/F2pesTYr
[21:37] <kyle_> upgrade on mon at a time restarting each as after it's upgraded
[21:37] <kyle_> one*
[21:37] * eschenal (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:37] <fridudad> kyle_ thanks
[21:38] <Volture> Please tell me how to deal with it? http://pastebin.com/F2pesTYr ceph 0.61.2
[21:38] <jmlowe> https://iu.box.com/s/cfxq71vq7bltcmzp66dw
[21:38] <jmlowe> https://iu.box.com/s/url0x5dp8plvblkn4vzi
[21:38] * `10__ (~10@juke.fm) has joined #ceph
[21:41] * jshen (~jshen@157.sub-70-197-3.myvzw.com) has joined #ceph
[21:42] * `10` (~10@juke.fm) Quit (Ping timeout: 480 seconds)
[21:43] <sjust> 2.363_head/DIR_3/DIR_6/rb.0.2a03.238e1f29.000000007c25__head_4B4D1F63__2
[21:43] <sjust> that one somehow got removed
[21:43] <sjust> and every other file in DIR_3/DIR_6/*
[21:43] <sjust> oops
[21:43] * `10 (~10@juke.fm) Quit (Ping timeout: 480 seconds)
[21:43] <sjust> in DIR_3/DIR_6
[21:44] <sjust> jmlowe: did you remove the files in 2.363_head/DIR_3/DIR_6 as well?
[21:45] <sjust> anyway, repair should handle it correctly at this point, I think
[21:45] * `10_ (~10@juke.fm) Quit (Ping timeout: 480 seconds)
[21:47] <jmlowe> ok, so do repair?
[21:47] <sjust> yeah, but did you also remove the files in 2.363_head/DIR_3/DIR_6?
[21:47] <sjust> or just 2.363_head/DIR_3/?
[21:49] <jmlowe> just 2.363_head/DIR_3/
[21:50] <sjust> that's strange
[21:50] <sjust> anyway, repair away
[21:51] <jmlowe> 2700 pgs: 2700 active+clean
[21:51] <sjust> cool
[21:51] <sjust> probably want to rescrub
[21:51] * fmarchand (~fmarchand@85-168-75-207.rev.numericable.fr) has joined #ceph
[21:52] <jmlowe> 2.363 deep-scrub ok
[21:52] <jmlowe> whew
[21:52] <fmarchand> hi !
[21:52] * davidzlap1 (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[21:52] <sjust> jmlowe: cool, glad it worked out
[21:53] <jmlowe> sjust: thanks a lot, you have no idea how stressful it's been the past few days
[21:53] * davidzlap (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Read error: Connection reset by peer)
[21:53] <sjust> jmlowe: you'll want to go ahead and enable barriers, the barriers should be cheap anyway and no good can come of disabling barriers
[21:54] <jmlowe> yeah, I shouldn't have been so sanguine about the xfs faq saying I should have them off with a bbu
[21:54] <jmlowe> barriers are on everywhere now
[21:54] <sjust> k
[21:56] <jmlowe> would it be a good idea over time to unweight each osd in turn and once empty wipe it out then put it back in at full weight?
[21:56] <sjust> jmlowe: don't think so
[21:56] <sjust> if scrub thinks you are ok, you should be fine
[21:56] <fmarchand> I have a question ... I have to install/configure and compare a rados gw ceph cluster and an openstack cluster ... I know that a lot of work has been done with openstack and ceph ... what are the benefits of ceph over openstack according to you guys?
[21:56] <fridudad> since upgrading to cuttlefish i get a lot of "inconsistent snapcolls" and "unaccounted for links on object" messages - is this normal?
[21:57] <jmlowe> ok, just worried about that extra junk in my filestores
[21:57] <sjust> which extra junk?
[21:57] <nhm> jmlowe: btw, support contract would be awesome if you can swing it. ;)
[21:57] <jmlowe> yeah, I was thinking the same thing
[21:58] <jmlowe> can somebody call me with a ballpark number?
[21:58] <nhm> jmlowe: let me hook you up with one of our business folks.
[21:59] <jmlowe> throw an 'o' in after the j of my irc username and append at iu edu
[22:00] <redeemed> jmlowe, you mean this? [insert full email address] ;)
[22:01] <jmlowe> the spam bots are real and they are listening :)
[22:02] <redeemed> the matrix is all around you
[22:02] <jmlowe> sjust: does this mean http://tracker.ceph.com/issues/5163 isn't an actual bug?
[22:02] <fmarchand> no advice or point of view ?
[22:03] <dmick> fmarchand: why not openstack *with* ceph? I don't understand the "one or the other" question
[22:03] <sjust> jmlowe: if you were able to restart it, you are probably fine
[22:03] <sjust> only one osd?
[22:04] <kyle_> Should using one device (/dev/sda) for the OS, osd, and journal be okay as long as the OS is partitioned separately?
[22:04] <sjust> kyle_: yeah
[22:04] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[22:04] <kyle_> okay because i cannot get osd create to finish
[22:04] <kyle_> seems to hang on prepare
[22:05] <kyle_> it's a RAID 10 with the OS and swap on their own partitions.
[22:05] <sjust> dmick: thoughts?
[22:05] <fmarchand> dmick : if I propose that, I need to convince people why I should use ceph under the hood ... and I don't yet know enough about either cluster
[22:08] <fmarchand> dmick : I know that ceph means the crush algorithm first of all, but I don't know what that really means ... except that it means a non-centralized cluster ...
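[22:08] <fmarchand> the gist as I understand it (a toy illustration, NOT the real CRUSH algorithm -- real CRUSH walks a hierarchy of buckets and minimizes data movement on topology changes): placement is computed from the object name, so any client can locate data without asking a central directory

```shell
# Toy "computed placement" sketch: map an object name to an osd
# deterministically with a hash, no lookup table involved.
place() {
    obj=$1; num_osds=$2
    # cksum is deterministic, so every client computes the same answer
    h=$(printf '%s' "$obj" | cksum | cut -d' ' -f1)
    echo "osd.$(( h % num_osds ))"
}
place rb.0.2a03 4
```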
[22:09] <kyle_> so when i run: ceph-deploy disk list data0 i get:
[22:09] <dmick> sjust: on?
[22:09] <kyle_> /dev/sda :
[22:09] <kyle_> /dev/sda1 other, xfs, mounted on /
[22:09] <kyle_> /dev/sda2 other
[22:09] <kyle_> /dev/sda5 swap, swap
[22:09] <kyle_> /dev/sda6 other, xfs
[22:09] <kyle_> /dev/sr0 other, unknown
[22:09] <kyle_> then ceph-deploy osd create data0:/dev/sda6:/dev/sda6
[22:09] <sjust> dmick: sorry, kyle_'s thing
[22:09] <kyle_> but that never actually finishes
[22:09] <sjust> dmick: not sure who knows about ceph-disk-prepare
[22:10] <dmick> kyle_: you're telling ceph-deploy to create an OSD data and journal in the same partition
[22:10] <kyle_> yes, do they need to be separate?
[22:10] <dmick> that's different than same device
[22:10] <kyle_> i see
[22:11] <dmick> I don't think that works; I think you can use the same device (i.e. /dev/sda) but then it wants to create its own two partitions
[22:11] <kyle_> gotcha. i'll try a separate partition for the journal
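[22:11] <dmick> for reference, the spec is HOST:DATA[:JOURNAL] -- a rough sketch of how that string splits apart (my own illustration, not ceph-deploy's actual parser), which shows why sda6:sda6 was asking for data and journal in one partition:

```shell
# Sketch: split a ceph-deploy osd spec HOST:DATA[:JOURNAL] and flag
# the case where data and journal name the same partition.
parse_spec() {
    spec=$1
    host=${spec%%:*}
    rest=${spec#*:}
    case $rest in
        *:*) data=${rest%%:*}; journal=${rest#*:} ;;
        *)   data=$rest; journal="" ;;
    esac
    if [ -n "$journal" ] && [ "$data" = "$journal" ]; then
        echo "WARNING: data and journal share $data"
    else
        echo "host=$host data=$data journal=$journal"
    fi
}
parse_spec data0:/dev/sda7:/dev/sda6
```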
[22:14] <jmlowe> nhm: got the email
[22:14] <jmlowe> sjust: only the one osd
[22:15] <sjust> jmlowe: you're probably fine
[22:16] <nhm> jmlowe: looks like Dona is out of town, but Nigel might be able to get you some numbers if he's not totally swamped. :)
[22:18] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[22:18] * ChanServ sets mode +v andreask
[22:25] * BManojlovic (~steki@fo-d- has joined #ceph
[22:30] * danieagle (~Daniel@ has joined #ceph
[22:33] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:40] * danieagle_ (~Daniel@ has joined #ceph
[22:42] * fridudad (~oftc-webi@p5B09D970.dip0.t-ipconnect.de) Quit (Quit: Page closed)
[22:42] <Volture> Please tell me how to deal with it? http://pastebin.com/F2pesTYr ceph 0.61.2
[22:42] * danieagle (~Daniel@ Quit (Ping timeout: 480 seconds)
[22:43] <dmick> Volture: do your hosts have real IP addresses in /etc/hosts?
[22:43] <dmick> looks wrong
[22:44] <Volture> dmick: Yes
[22:44] <dmick> then Ceph shouldn't be trying to use
[22:45] <Volture> dmick: My /etc/hosts http://pastebin.com/mArfYgp5
[22:46] * jshen (~jshen@157.sub-70-197-3.myvzw.com) Quit (Ping timeout: 480 seconds)
[22:47] <dmick> ceph osd dump,ceph mon dump show in any IP addrs?
[22:52] <Volture> dmick: http://pastebin.com/pCdSL5Kn how fix this problem ?
[22:52] * jshen (~jshen@ has joined #ceph
[22:52] <Volture> dmick: 1 osd is (((
[22:53] * fmarchand (~fmarchand@85-168-75-207.rev.numericable.fr) Quit (Ping timeout: 480 seconds)
[22:58] * danieagle_ (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[23:05] * etr (~etr@cs27053056.pp.htv.fi) has joined #ceph
[23:11] <dmick> Volture: well I dunno. What's different about that host? Is /etc/hosts identical? Is its local interface up? Does ceph.conf enumerate daemon hosts, or are you using ceph-deploy?
[23:11] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 21.0/20130511120803])
[23:12] <Volture> dmick: Thanks. All osds are up and ceph is available again
[23:12] <dmick> what was the problem?
[23:13] * dcasier (~dcasier@ Quit (Remote host closed the connection)
[23:14] * MrNPP (~MrNPP@0001b097.user.oftc.net) Quit (Ping timeout: 480 seconds)
[23:15] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) has joined #ceph
[23:15] <kyle_> dmick: so i created separate partitions for the osd and journal but it looks like i'm still stuck in the same place. Looks like prepare still does not finish.
[23:15] <kyle_> ceph_deploy.osd DEBUG Preparing host data0 disk /dev/sda7 journal /dev/sda6 activate True
[23:15] <kyle_> partitions*
[23:18] * MrNPP (~MrNPP@ has joined #ceph
[23:18] * jahkeup (~jahkeup@ Quit (Ping timeout: 480 seconds)
[23:18] <dmick> Volture: what did you fix?
[23:18] <Volture> dmick: Honestly, nothing really. I thought I had lost all the data on ceph
[23:18] <Volture> dmick: osd2 restart
[23:19] <dmick> kyle_: well that's odd. is there a ceph-disk process running on host data0?
[23:19] <dmick> Volture: strange. Maybe its networking wasn't up the first time it started the OSD
[23:20] <Volture> dmick: Maybe ((
[23:20] <kyle_> dmick: negative. no logs have even been created on the data0 host either.
[23:20] <kyle_> data0 kernel: [351242.145364] XFS (sda7): Mounting Filesystem
[23:20] <kyle_> May 28 14:11:25 data0 kernel: [351242.196894] XFS (sda7): Ending clean mount
[23:21] <kyle_> ^^ data0 syslog last entry
[23:22] <kyle_> have not been able to get a cluster going since ceph-deploy. keep getting stuck on the osd deployment
[23:24] <dmick> kyle_: so ceph-deploy is hung, but the task it's waiting on is gone. that's....odd.
[23:26] <kyle_> yeah pretty much. i think i may try separating the OS and OSD stuff onto separate raid arrays so i can give "osd create" the device name and let it handle partitioning. Problem is i have to go down to the colo for that haha.
[23:26] <dmick> is /dev/sda7 mkfs'ed and mounted?
[23:27] <dmick> and does it have a /var/lib/ceph/osd/* either symlinked to or mounted with a populated filesystem in it?
[23:27] <kyle_> not mounted, but it looks like mkfs happened... /etc/ceph/ceph-deploy/ceph-deploy disk list data0 produces:
[23:27] <kyle_> /dev/sda :
[23:27] <kyle_> /dev/sda1 other, xfs, mounted on /
[23:27] <kyle_> /dev/sda2 other
[23:27] <kyle_> /dev/sda5 swap, swap
[23:27] <kyle_> /dev/sda6 other, Linux filesystem
[23:27] <kyle_> /dev/sda7 other, xfs
[23:27] <kyle_> /dev/sr0 other, unknown
[23:28] * madkiss (~madkiss@ has joined #ceph
[23:28] <kyle_> /var/lib/ceph/osd is empty
[23:28] * wer (~wer@206-248-239-142.unassigned.ntelos.net) Quit (Ping timeout: 480 seconds)
[23:28] <kyle_> the dir is there but empty
[23:28] * eschenal (~eschnou@60.197-201-80.adsl-dyn.isp.belgacom.be) Quit (Quit: Leaving)
[23:30] <elder> sjust, sagewk, I talked with Dave Chinner about the xattr/ln thing that was being discussed earlier.
[23:31] <elder> If there is concurrent activity things can get sort of mis-ordered, but it only matters if there's a crash involved.
[23:31] <etr> a really stupid question. I used ceph-deploy to create a cluster with one monitor and 4 osds. the thing seems to be working but when I went to see the config at /etc/ceph/ceph.conf I see only 7 lines of configuration with no mon.a or osd.0, 1 etc. what am I doing wrong?
[23:32] * wer (~wer@206-248-239-142.unassigned.ntelos.net) has joined #ceph
[23:33] <dmick> kyle_: does data0 have a ceph-create-keys process running?
[23:33] <dmick> etr: ceph-deploy doesn't put host configs in ceph.conf
[23:33] <dmick> startup scripts use the presence of things in /var/lib/ceph to determine which daemons to start
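[23:33] <dmick> roughly like this (a self-contained sketch of the idea -- it globs a temp dir here instead of the real /var/lib/ceph/osd, and just prints what it would start):

```shell
# Sketch: the starter decides which osd daemons to launch by globbing
# /var/lib/ceph/osd/<cluster>-<id> directories; temp dir used for safety.
root=$(mktemp -d)
mkdir -p "$root/osd/ceph-0" "$root/osd/ceph-2"

found=""
for d in "$root"/osd/*; do
    id=${d##*-}          # trailing number in the dir name is the osd id
    found="$found$id "
done
echo "would start ceph-osd ids: $found"
rm -rf "$root"
```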
[23:34] <kyle_> dmick: ps ax | grep ceph shows nothing
[23:35] <sjust> elder: yeah, there was a crash
[23:35] <dmick> kyle_: what OS on data0?
[23:35] <kyle_> ubuntu 13.04 the mds and mons are on 12.04
[23:35] <kyle_> the 12.04 boxes have the upgraded kernel
[23:35] <etr> dmick, oh, ok
[23:37] <dmick> kyle_: anything interesting in /var/log/upstart/ceph*?
[23:37] <kyle_> dmick: no, nothing ceph related in there
[23:39] <kyle_> well not on data0 at least
[23:39] <dmick> hm, would have expected something
[23:40] <dmick> initctl list | grep ceph show jobs?
[23:41] <kyle_> ceph-osd-all start/running
[23:41] <kyle_> ceph-mds-all-starter stop/waiting
[23:41] <kyle_> ceph-mds-all start/running
[23:41] <kyle_> ceph-osd-all-starter stop/waiting
[23:41] <kyle_> ceph-all start/running
[23:41] <kyle_> ceph-mon-all start/running
[23:41] <kyle_> ceph-mon-all-starter stop/waiting
[23:41] <kyle_> ceph-mon stop/waiting
[23:41] <kyle_> ceph-create-keys stop/waiting
[23:41] <kyle_> ceph-osd stop/waiting
[23:41] <kyle_> ceph-mds stop/waiting
[23:41] <etr> dmick, so in the future there won't be a ceph.conf file like in most of the examples around?
[23:43] <dmick> ceph-deploy doesn't use it for determining host/daemon mappings, but it's still useful for configuration settings
[23:44] <dmick> kyle_: I'm suspecting ceph-osd isn't working right, but I'm confused as to why it's not logging at all
[23:47] <etr> ok. thanks
[23:47] <dmick> perhaps initctl log-priority debug will make it more verbose, and then maybe restart ceph-osd-all
[23:47] <kyle_> yeah i'm not sure. i've tried purging and starting over for days, but i always seem to get stuck on the osd create
[23:47] <dmick> failing that, maybe manually issuing some of the things in /etc/ceph/ceph-osd.conf
[23:47] <dmick> sorry
[23:47] <dmick> /etc/init/ceph-osd.conf
[23:48] * jeff-YF (~jeffyf@ Quit (Ping timeout: 480 seconds)
[23:50] * sjustlaptop (~sam@2607:f298:a:697:d08f:26ad:1620:ff3) has joined #ceph
[23:51] <kyle_> hmm okay. i can give it a try. although i suspect it's out of my league.
[23:52] * Cube (~Cube@ Quit (Quit: Leaving.)
[23:54] <kyle_> dmick: are there instructions anywhere for manually deploying cuttlefish?
[23:56] * madkiss (~madkiss@ Quit (Quit: Leaving.)
[23:58] * sjustlaptop (~sam@2607:f298:a:697:d08f:26ad:1620:ff3) Quit (Ping timeout: 480 seconds)
[23:58] <dmick> kyle_: don't know if there's anything specifically like that

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.