#ceph IRC Log


IRC Log for 2012-10-24

Timestamps are in GMT/BST.

[0:01] * sagelap (~sage@soenat3.cse.ucsc.edu) Quit (Ping timeout: 480 seconds)
[0:06] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[0:21] * scalability-junk (~stp@188-193-208-44-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[0:21] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Ping timeout: 480 seconds)
[0:22] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:25] * mdrnstm (~mdrnstm@ has joined #ceph
[0:29] * tryggvil (~tryggvil@163-60-19-178.xdsl.simafelagid.is) has joined #ceph
[0:32] * nhmlap (~nhm@ has joined #ceph
[0:33] * tjikkun (~tjikkun@82-169-255-84.ip.telfort.nl) has joined #ceph
[0:45] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:45] * loicd (~loic@magenta.dachary.org) has joined #ceph
[0:49] * Q310 (~Q@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Read error: Connection reset by peer)
[0:54] * nhmlap (~nhm@ Quit (Remote host closed the connection)
[0:57] * lofejndif (~lsqavnbok@28IAAIKSA.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[0:57] * sagelap (~sage@soenat3.cse.ucsc.edu) has joined #ceph
[0:58] * PerlStalker (~PerlStalk@ Quit (Quit: rcirc on GNU Emacs 24.2.1)
[1:00] * nhmlap (~nhm@ has joined #ceph
[1:30] * steki-BLAH (~steki@bojanka.net) Quit (Quit: I'm off, and you do whatever you want...)
[1:30] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[1:31] * aliguori (~anthony@cpe-70-123-146-246.austin.res.rr.com) Quit (Remote host closed the connection)
[1:31] * renzhi (~xp@ Quit (Quit: Leaving)
[1:34] * cdblack (86868b4a@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[1:37] * Tv_ (~tv@2607:f298:a:607:190f:ecf7:102b:da8f) Quit (Quit: Tv_)
[1:53] * nwatkins1 (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[2:12] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[2:13] * sagelap1 (~sage@soenat3.cse.ucsc.edu) has joined #ceph
[2:13] * sagelap (~sage@soenat3.cse.ucsc.edu) Quit (Quit: Leaving.)
[2:16] * nhmlap (~nhm@ Quit (Remote host closed the connection)
[2:20] * sagelap1 (~sage@soenat3.cse.ucsc.edu) Quit (Quit: Leaving.)
[2:20] * sagelap (~sage@soenat3.cse.ucsc.edu) has joined #ceph
[2:22] * nhmlap (~nhm@ has joined #ceph
[2:26] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[2:30] * Cube (~Cube@ Quit (Quit: Leaving.)
[2:33] * nwatkins1 (~Adium@soenat3.cse.ucsc.edu) has left #ceph
[2:33] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[2:37] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[2:40] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[2:45] * sagelap (~sage@soenat3.cse.ucsc.edu) Quit (Ping timeout: 480 seconds)
[2:48] * chutzpah (~chutz@ Quit (Quit: Leaving)
[2:50] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[3:23] * mdrnstm (~mdrnstm@ Quit (Quit: Leaving.)
[3:27] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:35] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[3:37] * loicd (~loic@magenta.dachary.org) has joined #ceph
[3:38] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[3:41] * adjohn (~adjohn@ Quit (Quit: adjohn)
[3:46] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[3:50] * nhmlap (~nhm@ Quit (Ping timeout: 480 seconds)
[3:59] * renzhi (~renzhi@ has joined #ceph
[4:13] * nhmlap (~nhm@253-231-179-208.static.tierzero.net) has joined #ceph
[4:30] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[4:54] * slang (~slang@ace.ops.newdream.net) Quit (Ping timeout: 480 seconds)
[4:54] * slang (~slang@ace.ops.newdream.net) has joined #ceph
[5:05] * slang (~slang@ace.ops.newdream.net) Quit (Quit: slang)
[5:11] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[5:30] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[5:32] * loicd (~loic@magenta.dachary.org) has joined #ceph
[5:56] * sagelap (~sage@ has joined #ceph
[6:14] * dmick (~dmick@2607:f298:a:607:1c99:a06a:7608:e987) Quit (Ping timeout: 480 seconds)
[6:23] * nhmlap (~nhm@253-231-179-208.static.tierzero.net) Quit (Ping timeout: 480 seconds)
[6:25] * dmick (~dmick@2607:f298:a:607:5c63:ce2a:d9d9:8516) has joined #ceph
[6:28] * sagelap1 (~sage@8.sub-70-197-2.myvzw.com) has joined #ceph
[6:33] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[6:33] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[6:34] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit ()
[6:44] * deepsa (~deepsa@ has joined #ceph
[6:55] * sagelap1 (~sage@8.sub-70-197-2.myvzw.com) Quit (Ping timeout: 480 seconds)
[6:59] * Cube (~Cube@ has joined #ceph
[7:06] * Cube (~Cube@ Quit (Quit: Leaving.)
[7:20] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[7:25] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[7:27] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[7:34] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:47] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[8:04] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[8:07] * LarsFronius (~LarsFroni@95-91-242-157-dynip.superkabel.de) has joined #ceph
[8:11] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[8:12] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[8:14] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:17] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:20] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[8:26] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[8:34] * ninkotech (~duplo@ Quit (Read error: Connection reset by peer)
[8:34] * ninkotech (~duplo@ has joined #ceph
[8:37] <todin> morning
[8:45] * ao (~ao@ has joined #ceph
[9:08] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[9:10] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:47] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[10:07] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[10:12] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[10:17] * LarsFronius (~LarsFroni@95-91-242-157-dynip.superkabel.de) Quit (Quit: LarsFronius)
[10:23] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:35] * loicd (~loic@ has joined #ceph
[10:38] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:39] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[10:40] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:52] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[10:59] <joao> morning #ceph
[11:18] * loicd (~loic@ has joined #ceph
[11:20] <liiwi> good afternoon
[11:42] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[11:48] * jksM (~jks@4810ds1-ns.5.fullrate.dk) has joined #ceph
[11:55] * jks (~jks@3e6b7571.rev.stofanet.dk) Quit (Ping timeout: 480 seconds)
[11:57] * jksM (~jks@4810ds1-ns.5.fullrate.dk) Quit (Ping timeout: 480 seconds)
[12:15] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[12:24] * LarsFronius_ (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[12:24] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[12:24] * LarsFronius_ is now known as LarsFronius
[12:38] * jks (~jks@3e6b7199.rev.stofanet.dk) has joined #ceph
[12:42] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[12:46] <jeffhung_> ls
[12:50] <laevar> hi, we are thinking about offering bachelor's and master's theses around ceph. Is this of interest to the ceph team?
[13:00] <joao> I don't see why not; any academic work around ceph is bound to be interesting :)
[13:04] * liiwi tries to figure pun about "academic workaround"
[13:06] <laevar> :)
[13:06] <laevar> are there some special interests or people we could/should ask to coordinate things?
[13:08] <laevar> there is of course the university group at the university of california
[13:09] <joao> I have no idea if there's interest in getting the team officially involved, but it doesn't hurt to either send an email to the list or maybe ask sage?
[13:09] <laevar> thx
[13:39] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[13:41] * Leseb (~Leseb@ has joined #ceph
[13:45] * morpheus__ (~morpheus@foo.morphhome.net) Quit (Ping timeout: 480 seconds)
[13:49] * loicd (~loic@magenta.dachary.org) has joined #ceph
[13:58] * Leseb (~Leseb@ Quit (Quit: Leseb)
[14:07] <madkiss> (22:27:39) <Brian_H> nice just got a pre-alpha release of drbd 9 :D
[14:07] <madkiss> (22:27:51) <Brian_H> 1 -> 32 replication here I come
[14:07] <madkiss> hmmmmmm.
[14:13] <ninkotech> people in the company here are against using rbd for virtual machines - they are scared that they might (in case of some weird accident) lose everything. it would take way too long to get everything up and running again, even if backups are good. is there some kind of counterargument for this?
[14:13] <ninkotech> or some way to make them feel safe?
[14:14] <madkiss> ninkotech: hu? parse error.
[14:14] <vhasi> built-in redundancy is risky, mmmkay?
[14:14] <madkiss> "people afraid of losing everything in case of weird accident" is, well, not something you can really argue against
[14:15] <madkiss> if there's Airbus A380 crashing into your DC, would that count as "weird accident"?
[14:16] <tontsa> you can lose everything with real servers too.. RAID firmware can go wonky, SSD-cache on RAID card can go wonky, etc. having working backups is always essential
[14:16] * stingray (~stingray@stingr.net) has joined #ceph
[14:16] <stingray> huh
[14:16] <madkiss> tontsa: totally agree.
[14:16] <tontsa> there are backup solutions that can backup whole system even every 5 minutes if needed without doing file level scan
[14:18] <stingray> nowadays what's the canonical way of changing an IP address of a monitor, if I have a single-monitor system?
[14:20] <ninkotech> its better to lose one server or disk, having problem with one customer
[14:20] <ninkotech> than to get problem with 200 customers
[14:20] <ninkotech> thats the argument
[14:20] <ninkotech> that it might kill the hosting company if that happens
[14:20] <ninkotech> (having few hundreds servers now)
[14:20] <tontsa> well same argument applies to SAN storage too
[14:21] <tontsa> just have multiple clusters so you don't end up with clusterfuck :)
[14:21] <ninkotech> yes, thats why they use raid-1
[14:21] <madkiss> ninkotech: that argument is valid for almost any setup out there.
[14:21] <ninkotech> i can see, i just could not find a good answer
[14:22] <ninkotech> which would encourage even testing of rbd
[14:22] <madkiss> ninkotech: A SAN is a SPOF.
[14:22] <madkiss> ninkotech: So you guys are using RAID1. Are you making sure that you are not running disks from the same production charge in one RAID1 array?
[14:22] <ninkotech> the point, again, is that having many SPOFs is better than having one big, even if chance of getting it clusterfucked is small
[14:22] <tontsa> well if they are too afraid they can go with traditional SAN.. but even with that you can have a huge failure. like Tieto in Sweden. whole datacenter and all customers lost almost all of their data
[14:23] <ninkotech> tontsa: we are talking about backend for VPS servers
[14:24] <ninkotech> and they didnt like sans too...
[14:25] <tontsa> managing local storage whether raid1 or raid10 requires a lot of manual labour, not to mention wasted disk space and electricity
[14:25] <madkiss> ninkotech: the best argument, in my eyes, is that in RADOS, you can define the numbers of replicas you want
[14:25] <madkiss> in whatever server and whatever rack you need them in.
[14:25] <madkiss> so even if a complete rack goes down or something, you can still run your storage.
[14:28] <ninkotech> what if it will stop working? did that happen to someone ?
[14:28] <ninkotech> i mean for example, that the VPS will not be able to mount the RBD device
[14:28] <ninkotech> or all VPS will fail
[14:29] <ninkotech> you know, those fears might be irrational...
[14:29] <ninkotech> but people are scared
[14:29] <madkiss> ninkotech: "VPS"?
[14:29] <ninkotech> qemu-kvm machine
[14:29] <tontsa> ninkotech, why not just install small test cluster and try to break it
[14:29] <ninkotech> using volumes from rbd
[14:29] <madkiss> you mean your hypervisor nodes?
[14:30] <zynzel> tontsa: i do that, and destroy cluster in less than 30 min ;)))
[14:30] <ninkotech> tontsa: thats what i would like to do, but there is no will to go into it (because of the fear)
[14:31] <tontsa> i guess you are stuck managing local storage then
[14:31] <ninkotech> tontsa: i do not give up so easily ;)
[14:31] <ninkotech> but you might be right
[14:31] <tontsa> it's pretty hard to change stubborn minds.. and whether it's worth the effort even
[14:32] <ninkotech> tontsa: great point! thanks
[14:33] <ninkotech> (in another company, i installed it in a few hours, and they have been playing with rbd since then)
[14:34] <ninkotech> i would say, here they think like RBD is SPOF too
[14:34] <ninkotech> (like SAN is SPOF)
[14:49] * nhmlap (~nhm@253-231-179-208.static.tierzero.net) has joined #ceph
[14:55] * aliguori (~anthony@cpe-70-123-146-246.austin.res.rr.com) has joined #ceph
[15:07] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:19] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) Quit (Ping timeout: 480 seconds)
[15:30] * alphe (~alphe@ has joined #ceph
[15:30] <alphe> hello all :)
[15:31] <alphe> I dug some more into the size display issue, and apparently windows only knows block sizes up to a maximum of 64K
[15:31] <alphe> so the 1MB block size will not be properly evaluated
[15:32] <alphe> if you try to send 1TB to a 176GB drive obviously windows will have some conceptual troubles
[15:33] <alphe> don't know if samba can cheat with windows clients and send block numbers * 256 for blocksize = 4K
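The rescaling alphe describes can be sketched in a few lines of Python. This only illustrates the arithmetic (the same number of bytes expressed in smaller blocks); it is not what Samba does internally. The 64K limit is alphe's observation above, and `rescale_blocks` is a hypothetical helper name.

```python
def rescale_blocks(block_count: int, block_size: int,
                   target_block_size: int = 4096) -> int:
    """Return how many target-sized blocks cover the same number of bytes."""
    if block_size % target_block_size != 0:
        raise ValueError("block_size must be a multiple of target_block_size")
    # Each original block splits into (block_size / target_block_size) blocks.
    return block_count * (block_size // target_block_size)


# One 1 MiB block reported in 4 KiB units: the multiplier alphe mentions.
factor = rescale_blocks(1, 1024 * 1024)
print(factor)
```

With a 4K target, the multiplier for 1 MiB blocks is the "* 256" from the message above.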
[15:38] * MikeMcClurg (~mike@ has joined #ceph
[15:38] * rlr219 (43c87e04@ircip2.mibbit.com) has joined #ceph
[15:48] * scalability-junk (~stp@188-193-208-44-dynip.superkabel.de) has joined #ceph
[15:50] <rlr219> morning all. we have 8 OSDs running. 6 are on Ubuntu Precise, 2 are on Ubuntu Quantal. we are running ceph version 0.48.1argonaut and ceph version 0.48.2argonaut for our OSDs. We seem to keep having OSD crashes. I am pasting the log from a recent crash: http://pastebin.com/2sFZxnm9
[15:52] <rlr219> It seems that a pg has bad data??? How do we fix? Also had an OSD crash for heartbeat map failure suicide timeout. Log:http://pastebin.com/PRKbEZ4B How do we prevent that?
[15:53] <rlr219> Thanks, in advance, for any help!
[15:57] * PerlStalker (~PerlStalk@ has joined #ceph
[15:58] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[15:58] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[16:11] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:27] * jefferai <3 Arista
[16:27] <jefferai> I type "bash"
[16:27] <jefferai> and I get a bash promp
[16:27] <jefferai> prompt
[16:27] <jefferai> wut
[16:31] * cdblack (c0373727@ircip1.mibbit.com) has joined #ceph
[16:33] <alphe> rlr219: upgrading your osd to ubuntu quantal is a simple process
[16:34] <alphe> and on top of that you can update only the ceph part and its dependencies if you don't want the whole thing (the kernel will be updated as a dependency and auto-installed)
[16:35] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[16:35] <alphe> I recommend you use the mini.iso version for installing your osd/mds because it's small to download and very minimalistic; you don't need X11/gnome/KDE etc...
[16:41] <rlr219> alphe: thanks, I have updated to Quantal. that's not the issue. The issue is I am having random osd crashes. and these are 2 of the logs. one crashed because: [ERR] : scrub 4.2 3983aeca/rb.0.9.000000000044/d6//4 on disk size (0) does not match object info size (4194304)
[16:41] <rlr219> The other because of suicide timeout. I need to know how to fix these so they do not continue to happen.
[16:44] <jefferai> rlr219: so I've heard that the development release packages (http://ceph.com/docs/master/install/debian/)
[16:44] <jefferai> tend to be quite stable
[16:44] <jefferai> and that they have a lot of bug fixes/updates
[16:44] <jefferai> you could give that a try, but may want to wait for one of the devs to comment on how wise that is
[16:47] * tziOm (~bjornar@ has joined #ceph
[16:47] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[16:48] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[16:55] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[16:56] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit ()
[17:01] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[17:06] * ao (~ao@ Quit (Quit: Leaving)
[17:07] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[17:08] * vata (~vata@ has joined #ceph
[17:09] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[17:28] * loicd (~loic@ has joined #ceph
[17:39] * loicd (~loic@ Quit (Quit: Leaving.)
[17:43] * sagelap (~sage@84.sub-174-254-86.myvzw.com) has joined #ceph
[17:51] * sagelap1 (~sage@143.sub-70-197-145.myvzw.com) has joined #ceph
[17:55] * sagelap (~sage@84.sub-174-254-86.myvzw.com) Quit (Ping timeout: 480 seconds)
[18:07] <gregaf> sjust: mikeryan: one of you should look at rlr219's OSD crashes
[18:07] <gregaf> when you get in
[18:09] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[18:15] <rlr219> Thanks gregaf!
[18:18] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[18:22] * Tv_ (~tv@ has joined #ceph
[18:23] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[18:31] * aliguori_ (~anthony@cpe-70-123-134-29.austin.res.rr.com) has joined #ceph
[18:31] * sagelap1 (~sage@143.sub-70-197-145.myvzw.com) Quit (Ping timeout: 480 seconds)
[18:31] * mdrnstm (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[18:31] * sagelap (~sage@2607:f298:a:607:1d17:77a3:d8f5:26d2) has joined #ceph
[18:32] * aliguori (~anthony@cpe-70-123-146-246.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[18:35] * nhmlap (~nhm@253-231-179-208.static.tierzero.net) Quit (Ping timeout: 480 seconds)
[18:38] * tryggvil (~tryggvil@163-60-19-178.xdsl.simafelagid.is) Quit (Quit: tryggvil)
[18:42] * mikeryan (mikeryan@lacklustre.net) Quit (Ping timeout: 480 seconds)
[18:43] * mikeryan (mikeryan@lacklustre.net) has joined #ceph
[18:43] <mikeryan> rlr219: scrub 4.2 3983aeca/rb.0.9.000000000044/d6//4 on disk size (0) does not match object info size (4194304)
[18:43] <mikeryan> it looks like that's scrub behaving correctly
[18:44] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[18:44] <mikeryan> it implies that you're having underlying fs corruption
[18:44] <mikeryan> what fs are you running?
[18:45] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:46] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) has joined #ceph
[18:46] * loicd (~loic@jem75-2-82-233-234-24.fbx.proxad.net) has joined #ceph
[18:46] <rlr219> btrfs
[18:47] <dilemma> I have a couple questions about choosing an appropriate number of placement groups for a given pool
[18:47] <dilemma> The docs suggest OSDs x 100
[18:47] <dilemma> but is the goal to have each pair of OSDs sharing, on average, 1 placement group?
[18:47] <mikeryan> rlr219: i strongly suspect btrfs corruption is the cause
[18:48] <dilemma> that would suggest that a better formula would be (OSDs * (OSDs - 1)) / 2
[18:48] <Tv_> dilemma: that's less relevant than about 1000 per machine consumes about the right amount of RAM
[18:48] <mikeryan> rlr219: you said it crashed after that message?
[18:48] * mikeryan_ (mikeryan@lacklustre.net) has joined #ceph
[18:48] <Tv_> dilemma: less makes the load uneven, more is unneeded overhead
[18:48] * mikeryan (mikeryan@lacklustre.net) Quit (Quit: leaving)
[18:48] * mikeryan_ is now known as mikeryan
[18:48] <dilemma> what if you intend on expanding your cluster over time, and want to choose a number of placement groups appropriate for a theoretical future cluster size
[18:49] <mikeryan> rlr219: it should not crash, it should just mark your PG as inconsistent
[18:49] <Tv_> dilemma: currently pg number is fixed, so plan for the expansion
[18:49] <Tv_> dilemma: as long as you're not expanding by several magnitudes, that'll work
[18:49] <rlr219> yes. I pasted the last part of the log. http://pastebin.com/2sFZxnm9
[18:49] <dilemma> planning for 4x expansion
[18:49] <dilemma> define "overhead" though
[18:49] <dilemma> RAM? Mon CPU time?
[18:50] <Tv_> dilemma: most significant hit is recovery time RAM use
[18:50] <dilemma> on the OSDs?
[18:51] <dilemma> The docs are unclear about a lot of this, so this is incredibly helpful, Tv_
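The two rules of thumb from this exchange (the "OSDs x 100" figure dilemma cites from the docs, and dilemma's pairwise formula) can be compared with a small Python sketch. The function names and the 10-OSD starting size are hypothetical, chosen only for illustration; the sketch just evaluates the formulas and does not account for the RAM considerations Tv_ raises.

```python
def pgs_docs_rule(osds: int, per_osd: int = 100) -> int:
    """PG count from the 'OSDs x 100' rule of thumb."""
    return osds * per_osd


def pgs_pairwise(osds: int) -> int:
    """PG count if every pair of OSDs shared one PG on average: n(n-1)/2."""
    return osds * (osds - 1) // 2


def planned_pgs(current_osds: int, growth_factor: int = 4,
                per_osd: int = 100) -> int:
    """Size the pool for the expanded cluster, since the PG count is fixed."""
    return pgs_docs_rule(current_osds * growth_factor, per_osd)


current = 10  # hypothetical cluster size, not taken from the log
print(pgs_docs_rule(current), pgs_pairwise(current), planned_pgs(current))
```

Sizing for the planned 4x expansion up front follows Tv_'s advice that the PG number is currently fixed at pool creation.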
[18:52] <Tv_> dilemma: there's quite a lot more in the picture, and the Inktank professional services people are really the best for that...
[18:52] <Tv_> there's so many different use cases and hardware variations etc that the docs will never cover everything perfectly
[18:53] <mikeryan> is pastebin down for anyone else?
[18:54] <rlr219> mikeryan: my replication level is 3. each OSD that has that page crashed and all gave this error message. I could see fs corruption on one, but all 3?
[18:54] <rlr219> mikeryan: yes to pastebin being down
[18:56] <rlr219> mikeryan: pastebin looks to be back up.
[18:57] <dilemma> Tv_: we have someone from Inktank coming on site in a couple weeks, so I'll discuss it with them. Thanks!
[18:58] * Cube (~Cube@ has joined #ceph
[18:59] <mikeryan> rlr219: yep that looks like a bug on our part
[19:00] <dilemma> oh, one more question - any ballpark on when the feature for increasing the PG count will be available?
[19:00] * Cube (~Cube@ Quit (Read error: Connection reset by peer)
[19:01] * Cube (~Cube@ has joined #ceph
[19:01] <sagewk> dilemma: Real Soon Now(tm). the recovery qos and small io performance improvements are in front of it in the queue
[19:01] <sagewk> hopefully the first release after bobtail... ~2 months?
[19:01] <dilemma> haha, awesome. Just hoping it wasn't ~2 years :)
[19:02] * denken (~denken@dione.pixelchaos.net) has joined #ceph
[19:05] * MikeMcClurg (~mike@ Quit (Ping timeout: 480 seconds)
[19:05] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:05] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[19:05] * Leseb_ is now known as Leseb
[19:07] <rlr219> mikeryan: is this something fixed in a later release?
[19:10] * nhmlap (~nhm@me10436d0.tmodns.net) has joined #ceph
[19:10] * chutzpah (~chutz@ has joined #ceph
[19:13] <mikeryan> rlr219: no, most likely not
[19:15] * adjohn (~adjohn@ has joined #ceph
[19:16] * nhmlap_ (~nhm@ma20436d0.tmodns.net) has joined #ceph
[19:19] <rlr219> mikeryan: Ouch. what about the 2nd OSD crash for heartbeat map failure suicide timeout. Log:http://pastebin.com/PRKbEZ4B ?
[19:21] * nhmlap (~nhm@me10436d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[19:21] <mikeryan> rlr219: sjust had a theory about that, but i think he's going to take a look at that log in a sec
[19:22] * rweeks (~rweeks@ has joined #ceph
[19:22] <rlr219> mikeryan: Ok.
[19:26] * loicd (~loic@jem75-2-82-233-234-24.fbx.proxad.net) Quit (Quit: Leaving.)
[19:27] <mikeryan> rlr219: can you give us full logs?
[19:28] <rlr219> mikeryan: on which?
[19:28] <mikeryan> both
[19:30] <rlr219> mikeryan: give me a few minutes.
[19:30] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[19:35] * aliguori_ (~anthony@cpe-70-123-134-29.austin.res.rr.com) Quit (Remote host closed the connection)
[19:36] * sjustlaptop (~sam@ has joined #ceph
[19:36] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[19:50] * loicd (~loic@ has joined #ceph
[19:55] * nwatkins1 (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[19:57] <joao> I finally figured out why thunderbird has been so reluctant on updating my inbox; apparently I had a 7GB mailbox and it deals poorly with that
[20:05] <rlr219> mikeryan: rlr219-ceph-osd.15.log.1.gz & rlr219-ceph-osd.6.log.1.gz are the logs for the "pg bug". rlr219-ceph-osd.12.log.2.gz is the log for the suicide timeout error.
[20:09] <mikeryan> rlr219: cool, taking a look
[20:18] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Ping timeout: 480 seconds)
[20:23] * Tv_ (~tv@ Quit (Quit: Tv_)
[20:23] * nhmlap_ (~nhm@ma20436d0.tmodns.net) Quit (Read error: Connection reset by peer)
[20:30] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[20:30] <rweeks> you have 7gb of mail?
[20:30] <rweeks> erm.
[20:30] <rweeks> why? :)
[20:31] * jefferai easily has that
[20:31] <jefferai> Except I have a good server (Dovecot) and an SSD, so things are still snappy
[20:32] <elder> joao, you're having trouble with Thunderbird too?
[20:32] <elder> I have been, and Mark K said he was too.
[20:32] <jefferai> elder: update recently?
[20:32] <jefferai> or are you all Inktank guys, in which case that would suggest your server
[20:32] <rlr219> mikeryan: I finally got the last log from the 3rd osd with the "pg bug". it is rlr219-ceph-osd.10.log.1.gz
[20:32] <rlr219> I am uploading it now.
[20:33] <joao> elder, yeah
[20:33] <joao> it simply wasn't downloading new messages
[20:33] <joao> and started consuming *a lot* of disk space
[20:33] <elder> I was getting sign-on errors.
[20:33] <joao> and I mean, like 60 MB each 20 minutes
[20:33] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[20:33] <joao> apparently I needed to compress some mailboxes (that fixed the unable to download new messages)
[20:34] * Cube (~Cube@ Quit (Remote host closed the connection)
[20:34] <joao> still struggling with disk consumption though
[20:34] <elder> If I knew what was required I'd do it.
[20:35] <sjustlaptop> rlr219: I think the rlr219-ceph-osd.12.log.2.gz at least is truncated
[20:35] <sjustlaptop> can you reupload?
[20:35] <mikeryan> rlr219: the ones you sent me were truncated as well
[20:35] <mikeryan> make sure you can gunzip them on your box, perhaps something happened to them before you uploaded
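One way to run the check mikeryan suggests before uploading is to read each compressed log to the end and see whether decompression fails. The following is a minimal Python sketch using only the standard library; `gzip_is_intact` is a hypothetical helper name.

```python
import gzip
import zlib


def gzip_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end; a truncated stream raises EOFError here.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError also covers a missing file or a bad gzip header.
        return False
    return True
```

A call such as `gzip_is_intact("rlr219-ceph-osd.12.log.2.gz")` would be run on the local box before the upload; `gzip -t` on the command line performs a comparable test.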
[20:38] * Cube (~Cube@ has joined #ceph
[20:38] * Tobarja (~athompson@cpe-071-075-064-255.carolina.res.rr.com) Quit (Read error: Connection reset by peer)
[20:45] <joao> elder, have you enabled two-factor auth on google?
[20:46] <elder> Um, I don't know.
[20:46] <elder> Does that have anything to do with Google Authenticator?
[20:46] <joao> think so
[20:46] <elder> Then maybe. Or, maybe not.
[20:47] <joao> that's the app one uses to generate auth codes, right?
[20:47] <elder> Numeric strings for one-time pad of some kind.
[20:47] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[20:47] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Quit: Leaving)
[20:48] <joao> yeah
[20:48] * Cube (~Cube@ Quit (Quit: Leaving.)
[20:49] <joao> elder, if you have that enabled, you'll need to go to 'account' -> 'security' -> 'authorize applications and sites' and add thunderbird to the application-specific passwords section
[20:50] * BManojlovic (~steki@ has joined #ceph
[20:51] <elder> Is it easy to verify I have it enabled (so I don't cause myself even more problems)?
[20:51] <dmick> log out from the web interface and try logging back in
[20:51] <dmick> if you only need a password, it's not enabled
[20:52] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[20:52] <elder> So I would otherwise need to provide the number from my authenticator app every time I sign in?
[20:52] <joao> dmick, if he has defined that computer as trusted he won't need the verification code
[20:52] <dmick> ugh
[20:52] <elder> Ahh.
[20:53] <joao> for instance, I don't need the code on my desktop, but I do on my laptop
[20:53] <dmick> this is why I don't mess with these 'advanced' security things
[20:53] <elder> 2-step verification Status: OFF
[20:53] <dmick> they're frigging impossible to use and diagnose
[20:53] <joao> elder, I guess that's not the issue then :p
[20:53] <joao> well, dinner's ready; brb
[20:54] <elder> frig·ging/ˈfrigən/
[20:54] <elder> Adjective:
[20:54] <elder> vulgar. Used for emphasis, esp. to express anger, annoyance, contempt, or surprise.
[20:55] <rlr219> sjust & mikeryan: let me take a look.
[20:55] <dmick> elder: yes, it's a bowdlerized f-word
[20:58] <jefferai> they're only hard to diagnose with third-party apps
[20:58] <elder> (I knew that. I just wanted to put the formal definition out there.)
[20:58] <jefferai> any of the Google properties make it easy, they prompt you for the authenticator code
[20:58] <jefferai> you just have to remember with third party apps to generate a password
[20:59] <jefferai> You guys hack filesystems, I believe in your ability to figure it out :-D
[21:02] * aliguori (~anthony@ has joined #ceph
[21:11] <elder> joshd, http://tracker.newdream.net/issues/3379 would not be a problem for version 2 images, would it?
[21:14] * LarsFronius (~LarsFroni@2a02:8108:3c0:79:9de4:7e2:789a:ef55) has joined #ceph
[21:17] <todin> joshd: hi, I tried your openstack tutorial, everything went fine, except the instance doesn't find a boot device
[21:17] <alphe> gregaf are you around ?
[21:17] <alphe> I found the workaround for the windows size display bug
[21:18] <joshd> elder: that's right
[21:18] <alphe> create an executable bash script with this: df $1 | tail -1 | awk '{printf "%.0f %.0f", $(NF-4),$(NF-2)}'
[21:18] <alphe> then add to your samba share definition dfree command = /pathtoyourscript
[21:18] <alphe> and tada windows sees the proper size
[21:19] <alphe> it's magical
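For readers who prefer to see what the one-liner extracts, here is a rough Python equivalent. It assumes the default `df` column layout (Filesystem, 1K-blocks, Used, Available, Use%, Mounted on) and, like the awk version, picks the fields by position from the end of the last line. `dfree_from_df` is a hypothetical name, and the script is a sketch rather than a tested Samba integration.

```python
import subprocess
import sys


def dfree_from_df(df_output: str) -> str:
    """Return 'total available' (in df's block units) from df output."""
    last_line = df_output.strip().splitlines()[-1]
    fields = last_line.split()
    # awk's $(NF-4) and $(NF-2) are the 5th and 3rd fields from the end.
    total, available = fields[-5], fields[-3]
    return f"{int(total)} {int(available)}"


# Samba passes the directory as the first argument, like $1 in the bash script.
if __name__ == "__main__" and len(sys.argv) > 1:
    result = subprocess.run(["df", sys.argv[1]], capture_output=True,
                            text=True, check=True)
    print(dfree_from_df(result.stdout))
```

Samba's `dfree command` expects the total and available block counts on one line, which is what both the awk one-liner and this sketch print.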
[21:19] <joshd> todin: what's the qemu command line look like?
[21:21] * danieagle (~Daniel@ has joined #ceph
[21:23] <todin> joshd: http://pastebin.com/faiQUi09
[21:23] <elder> joshd, I updated that bug. Not sure if it should be closed or what...
[21:26] <rlr219> mikeryan & sjust: 6 & 10 are good. will re-upload other 2 in a few minutes.
[21:26] <joshd> todin: so you're not using cephx? could you verify that the libvirt/nova user can read /etc/ceph/ceph.conf?
[21:27] * adjohn (~adjohn@ Quit (Quit: adjohn)
[21:28] <todin> joshd: no cephx, and ceph.conf is world readable
[21:28] * alphe (~alphe@ Quit (Quit: Leaving)
[21:28] <joshd> todin: is the image a format other than raw?
[21:28] <todin> joshd: I use a ubuntu cloud image, not sure if it matters
[21:29] <todin> joshd: how do I know? glance image-list says it's qcow2
[21:31] <joshd> todin: yeah, that's the standard ubuntu cloud image. you'll need to convert it to raw (i.e. qemu-img convert -f qcow2 -O raw img.qcow2 img.raw) before uploading to glance
[21:31] <todin> joshd: ok, I will try it and then give you feedback. I think that info should go into your tutorial
[21:32] <joshd> yeah, someone else had the same issue as well
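The conversion step can be scripted. In `qemu-img convert`, `-f` names the source format and `-O` the output format, followed by the input and output file names. The Python sketch below only builds the command line; `qemu_img_convert_cmd` is a hypothetical helper, and the file names are the placeholders used in the conversation.

```python
import shlex


def qemu_img_convert_cmd(src: str, dst: str, src_fmt: str = "qcow2",
                         dst_fmt: str = "raw") -> list:
    """Build the argument list for converting a disk image with qemu-img."""
    return ["qemu-img", "convert", "-f", src_fmt, "-O", dst_fmt, src, dst]


cmd = qemu_img_convert_cmd("img.qcow2", "img.raw")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would perform the conversion on a machine with qemu-img installed; the resulting raw file is what gets uploaded to glance.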
[21:34] * tziOm (~bjornar@ti0099a340-dhcp0778.bb.online.no) has joined #ceph
[21:35] * scalability-junk (~stp@188-193-208-44-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[21:35] <joshd> elder: it makes me think we should add a rados method for listing watchers, and check that there are no watchers *before* renaming a format 1 image
[21:36] * mdrnstm (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[21:36] <elder> That's not a bad idea. Racy, but still useful.
[21:37] <joshd> yeah, at least it's better than ignoring any currently in use ones
[21:39] <rlr219> mikeryan & sjust: it looks like the osd12 file is corrupt. I am uploading a good copy of the osd.15 file now.
[21:40] <mikeryan> rlr219: cool, let us know when that's done
[21:42] <rlr219> mikeryan: Done
[21:42] <mikeryan> k grabbing
[21:45] * Cube (~Cube@2607:f298:a:697:fd91:8ff3:d807:f014) has joined #ceph
[21:48] <tziOm> I have a problem compiling here:
[21:48] <tziOm> mon/Monitor.cc: In member function 'void Monitor::handle_signal(int)':
[21:48] <tziOm> mon/Monitor.cc:334:412: error: 'sys_siglist' was not declared in this scope
[21:49] <tziOm> also:
[21:49] <tziOm> mds/MDS.cc: In member function 'void MDS::handle_signal(int)':
[21:49] <tziOm> mds/MDS.cc:1534:447: error: 'sys_siglist' was not declared in this scope
[21:51] * vata (~vata@ Quit (Remote host closed the connection)
[21:52] <todin> joshd: with the raw images everything works, thanks
[21:54] <gregaf> tziOm: where are you building? (glowell1, is this part of your area now?)
[21:55] <joshd> todin: great, I'll clarify that in the docs
[21:56] <glowell1> gregaf: I can help track down build problems, but there are a lot of variables.
[21:57] <glowell1> tziOm: can you tell me what branch you're building, and on what platform?
[21:57] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[22:04] <tziOm> latest git
[22:04] <tziOm> compiling against uclibc
[22:06] <dmick> tziOm: sys_siglist is definitely in libc on precise
[22:07] <dmick> (eglibc)
[22:08] <tziOm> it is there.
[22:09] <tziOm> but seems to be something wrong with code, then
[22:09] <dmick> maybe header files then
[22:09] <dmick> what distro/release are you building on?
[22:10] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[22:10] <glowell1> We've seen headers that were included by default on ubuntu not be included on fedora, requiring an explicit include in the source.
[22:11] <tziOm> the headers are there
[22:11] <tziOm> building, as I said, with a separate toolchain against uclibc
[22:11] <dmick> tziOm: well then it must be working :)
[22:13] <tziOm> What do I pass to configure to disable the ./common/BackTrace.h inclusion?
[22:14] <tziOm> dmick, it says: was not declared in _this scope_
[22:15] <glowell1> tziOm: is your toolchain creating dependency files in .deps ? There should be an entry for Monitor.o with the header file signal.h.
[22:16] <tziOm> it does create deps
[22:16] <tziOm> but no Monitor.po
[22:17] <tziOm> 865 files in .deps
[22:17] <tziOm> 863 I meant
[22:17] <tziOm> libmon_a-Monitor.Po ?
[22:21] <glowell1> Yes, libmon_a-Monitor.po. My ubuntu 12.04 build shows /usr/include/signal.h.
[22:21] <tziOm> ok, it's there -- but I found the problem
[22:21] <tziOm> # ifdef __UCLIBC_HAS_SYS_SIGLIST__
[22:22] <tziOm> ...
[22:22] <glowell1> Great. Another mystery solved.
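The build error above traces back to uClibc declaring sys_siglist only when __UCLIBC_HAS_SYS_SIGLIST__ is set. One possible way to avoid depending on that array is to ask the C library for the description through strsignal(), which POSIX.1-2008 specifies. The following is a minimal sketch, not the code in mon/Monitor.cc or mds/MDS.cc; the helper name signal_description is hypothetical, and the sketch assumes the target libc provides strsignal().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a printable description of a signal
 * without referencing sys_siglist directly. Assumes the C library
 * implements strsignal() as specified by POSIX.1-2008. */
static const char *signal_description(int signum)
{
    const char *desc = strsignal(signum);

    /* Some implementations may return NULL for unknown signals. */
    return desc != NULL ? desc : "unknown signal";
}
```

A signal handler's log message could then print signal_description(signum) where it previously indexed sys_siglist. The exact wording of the returned string is libc-specific.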
[22:23] * vata (~vata@ has joined #ceph
[22:23] <tziOm> Now I have a few other problems
[22:23] <tziOm> but do you really need sys_siglist ?
[22:23] <dmick> tziOm: good to hear you're building against uclibc
[22:24] <tziOm> ./common/BackTrace.h:5:22: fatal error: execinfo.h: No such file or directory
[22:24] <tziOm> this is "ok", since this is glibc specific
[22:24] <tziOm> how can I disable this backtrace functionality?
[22:24] <dmick> no, I'm serious, more-portable is better.
[22:24] <dmick> sys_siglist is only used for error messages, so it's certainly not critical.
[22:25] <gregaf> oh, please don't take that away
[22:25] <gregaf> debugging user errors is near-impossible without that backtrace functionality
[22:25] <dmick> backtrace: #ifdef's, one assumes, but yes, it's useful
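The #ifdef approach dmick mentions could look roughly like the sketch below: keep the backtrace where execinfo.h exists and degrade to a short notice elsewhere. This is an illustration in plain C, not the contents of common/BackTrace.h. HAVE_EXECINFO_H is the macro an autoconf check such as AC_CHECK_HEADERS([execinfo.h]) would define; whether Ceph's configure script performs that check is an assumption here, and print_backtrace and MAX_FRAMES are hypothetical names.

```c
#include <stdio.h>

#ifdef HAVE_EXECINFO_H
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() */
#endif

#define MAX_FRAMES 64   /* hypothetical cap on captured frames */

/* Hypothetical helper: write a backtrace to stderr when the platform
 * supports it. Returns the number of frames captured, or 0 when the
 * backtrace facility was not compiled in. */
static int print_backtrace(void)
{
#ifdef HAVE_EXECINFO_H
    void *frames[MAX_FRAMES];
    int count = backtrace(frames, MAX_FRAMES);

    /* File descriptor 2 is stderr; this call does not allocate memory. */
    backtrace_symbols_fd(frames, count, 2);
    return count;
#else
    fputs("(backtrace not available on this platform)\n", stderr);
    return 0;
#endif
}
```

With this shape, glibc builds keep the debugging output gregaf relies on, while a libc without execinfo.h still compiles.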
[22:26] <tziOm> Also, can '::posix_fallocate' perhaps be done another way? it's not portable
[22:26] <tziOm> gregaf, yeah, but it should be configurable
[22:26] <tziOm> and configure should detect if one has the functions ..?
[22:27] <dmick> tziOm: targeting a new toolchain is generally the time such decisions are made and portability mechanisms added. congratulations on being on the vanguard
[22:28] <gregaf> the use of posix_fallocate there is a performance thing; feel free to add a case which allocates it using a different implementation if not available
[22:28] <joshd> tziOm: a while ago someone ported ceph to freebsd and added a bunch of that infrastructure, but it sounds like your platform has other differences
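The alternative allocation path gregaf suggests might be sketched as follows. This is a hedged illustration only: preallocate is a hypothetical helper, HAVE_POSIX_FALLOCATE is assumed to be set by a configure check, and the fallback simply writes zeros past the current end of file. Unlike posix_fallocate, the fallback does not fill holes inside the existing file, and it may also zero-fill any gap between the old end of file and the requested offset.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reserve space for [offset, offset + len).
 * Returns 0 on success or an errno value, mirroring posix_fallocate. */
static int preallocate(int fd, off_t offset, off_t len)
{
#ifdef HAVE_POSIX_FALLOCATE
    return posix_fallocate(fd, offset, len);
#else
    struct stat st;
    char zeros[4096];
    off_t end, pos;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;

    end = offset + len;
    memset(zeros, 0, sizeof(zeros));

    /* Start at the current end of file so existing data is never
     * overwritten; nothing is written if the file already covers
     * the requested range. */
    for (pos = st.st_size; pos < end; ) {
        size_t chunk = sizeof(zeros);
        ssize_t written;

        if ((off_t)chunk > end - pos)
            chunk = (size_t)(end - pos);
        written = pwrite(fd, zeros, chunk, pos);
        if (written < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            return errno;
        }
        if (written == 0)
            return EIO;         /* no progress, avoid looping forever */
        pos += written;
    }
    return 0;
#endif
}
```

The fallback trades speed for portability, which matches gregaf's point that posix_fallocate is used there for performance rather than correctness.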
[22:29] <tziOm> I'm doing this for ipxe booting ceph osd into 10MB mem :)
[22:29] <gregaf> erm
[22:29] <gregaf> sorry, what?
[22:30] <dmick> putting the osd into a limited-memory system.
[22:30] <gregaf> I don't think 10MB is going to cut it :/
[22:30] <dmick> seems like a challenge to me
[22:30] <tziOm> not necessarily limited memory, but "no disk for os"
[22:30] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) Quit (Quit: Leaving)
[22:30] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:31] <gregaf> hmm
[22:31] <tziOm> with a bootstrapped debian, system is easily 400MB with ceph
[22:32] <tziOm> Anyway.. point being ceph should compile against other toolchains, for example uClibc
[22:33] <tziOm> and it's probably just execinfo.h and posix_fallocate that are showstoppers
[22:33] <rturk> we're always happy to review patches :)
[22:34] <tziOm> :)
[22:34] <tziOm> I probably have to, so
[22:35] <gregaf> well, it'd be interesting to see how much smaller libraries slim down the OSD memory footprint :)
[22:36] <tziOm> total system now incl python/libvirt/openvswitch/quantum/nova is 26MB
[22:36] <tziOm> that is cramfs
[22:36] <gregaf> nice
[22:36] * adjohn (~adjohn@ has joined #ceph
[22:36] <tziOm> need to get ceph into the party
[22:37] <rweeks> neat. what are you going to be doing with those memory limited systems?
[22:39] * nhmlap (~nhm@mb40436d0.tmodns.net) has joined #ceph
[22:39] * pentabular (~pentabula@ has joined #ceph
[22:43] <tziOm> systems are not memory limited
[22:44] * slang (~slang@173-163-208-195-westflorida.hfc.comcastbusiness.net) has joined #ceph
[22:44] <dmick> just boot-disk limited
[22:44] <rweeks> ah, I see
[22:45] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:46] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[22:51] * tryggvil (~tryggvil@163-60-19-178.xdsl.simafelagid.is) has joined #ceph
[22:51] <tziOm> The X/Open definition of `signal' specifies the SVID semantic. Use the additional function `sysv_signal' when X/Open compatibility is requested
[22:52] <dmick> elder: http://tracker.newdream.net/issues/2933 yay
[22:52] * gregaf1 (~Adium@2607:f298:a:607:4da6:6dc3:f488:9e18) has joined #ceph
[22:53] * slang (~slang@173-163-208-195-westflorida.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[22:53] * gregaf1 (~Adium@2607:f298:a:607:4da6:6dc3:f488:9e18) Quit ()
[22:54] * gregaf1 (~Adium@2607:f298:a:607:4da6:6dc3:f488:9e18) has joined #ceph
[22:55] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[22:56] * Kioob (~kioob@luuna.daevel.fr) Quit (Ping timeout: 480 seconds)
[22:57] <elder> dmick, I'm pretty happy about it...
[22:58] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:00] <rweeks> wow, nice to squash bugs!
[23:01] <dmick> rweeks: that one has been quite the pain
[23:01] <rweeks> seems so
[23:01] * slang (~slang@ace.ops.newdream.net) has joined #ceph
[23:04] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[23:05] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[23:05] * aliguori (~anthony@ has joined #ceph
[23:08] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:10] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:14] <nwatkins1> buck: will you give me a ping when you have a look at the patches in wip-java-cephfs? I can manually add your signoff to the merge commit to make things easy. no rush, though.
[23:18] <gregaf1> nwatkins1: sounds like you mean reviewed-by, not signed-off-by :)
[23:18] <nwatkins1> Ahh ya. What's the difference? sign-off for author?
[23:19] <dmick> nwatkins1: yes
[23:19] <gregaf1> let me see if I can dig up the precise meanings
[23:19] <gregaf1> we use the kernel conventions
[23:19] <nwatkins1> Ahh, ok. I think I have a reference to some of Greg KH talk on that
[23:20] <buck> nwatkins1: will do. I should be able to get to that around 3 pm PST
[23:20] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:20] <gregaf1> I guess it's this doc
[23:20] <gregaf1> http://www.kernel.org/doc/Documentation/SubmittingPatches
[23:20] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:20] <dmick> yeah, item 12
[23:21] <gregaf1> basically, Signed-off-by is a statement about ownership and copyright to establish a chain of trust
[23:21] <nwatkins1> Cool. thanks for the ref
[23:22] <gregaf1> you wrote it, or you had custody of it at some point, and you believe it's properly open-sourced etc
[23:22] <gregaf1> Reviewed-by is what it sounds like
[23:23] <gregaf1> we don't use Acked-by, which is a sort of acknowledgement that the patch should be good from your perspective but not a full review
[23:23] <gregaf1> Reported-by and Tested-by are for bugfix patches and do what they sound like
[23:44] * vata (~vata@ Quit (Quit: Leaving.)
[23:46] * slang (~slang@ace.ops.newdream.net) Quit (Quit: slang)
[23:56] * aliguori_ (~anthony@ has joined #ceph
[23:56] * aliguori (~anthony@ Quit (Read error: Connection reset by peer)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.