#ceph IRC Log

IRC Log for 2013-05-23

Timestamps are in GMT/BST.

[0:01] * BillK (~BillK@124-169-236-155.dyn.iinet.net.au) has joined #ceph
[0:01] * jeff-yf (~oftc-webi@67.23.117.122) has joined #ceph
[0:02] * jeff-yf (~oftc-webi@67.23.117.122) Quit ()
[0:02] * drokita1 (~drokita@199.255.228.128) Quit (Ping timeout: 480 seconds)
[0:02] * jeff-YF (~jeffyf@67.23.117.122) has joined #ceph
[0:02] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:04] <athrift> Hrmm, I am having issues removing an unhappy mon and re-adding it. I have removed it with "ceph mon remove a" after stopping the service, but when re-adding it by extracting the monmap/keyring and using "ceph-mon -i a --mkfs --monmap monmap.out --keyring monauth.out" I then do a "ceph mon add a 10.8.1.1:6789" which then complains "mon a 10.8.1.1:6789/0 already exists" which is odd as ceph -s after the mon remove command definitely
[0:04] <athrift> showed it was removed
[0:04] <athrift> can anyone offer some advice ?
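For reference, the usual sequence for rebuilding a dead monitor like the one described above is roughly the following. This is a minimal sketch only: the filenames are illustrative, the paths assume the default /var/lib/ceph layout, and the service command depends on the init system in use.

    # on a surviving monitor, export the current monmap and mon keyring
    ceph mon getmap -o monmap.out
    ceph auth get mon. -o monauth.out

    # on the node being rebuilt: stop the old daemon, wipe its store, recreate it
    rm -rf /var/lib/ceph/mon/ceph-a
    ceph-mon -i a --mkfs --monmap monmap.out --keyring monauth.out

    # start it again; if the monitor is still present in the monmap,
    # "ceph mon add" is unnecessary and will report "already exists"
    service ceph start mon.a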
[0:06] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Quit: Ex-Chat)
[0:06] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[0:08] <tnt> what does monmaptool --print monmap.out show ?
[0:10] * jtang1 (~jtang@80.111.97.194) Quit (Quit: Leaving.)
[0:11] <athrift> monmaptool: monmap file monmap.out
[0:11] <athrift> epoch 2
[0:11] <athrift> fsid 350bdef4-263e-48bf-839d-14434accad17
[0:11] <athrift> last_changed 2013-05-23 09:43:19.528552
[0:11] <athrift> created 2013-02-12 11:19:00.896657
[0:11] <athrift> 0: 10.8.1.2:6789/0 mon.b
[0:11] <athrift> 1: 10.8.1.3:6789/0 mon.c
[0:12] <tnt> oh, but re-reading it, it seems normal. Starting the mon will actually add it automatically.
[0:12] * pja (~pja@a.clients.kiwiirc.com) Quit (Quit: http://www.kiwiirc.com/ - A hand crafted IRC client)
[0:13] <athrift> tnt: hrmm I have done that but it still shows it as down, even though I can see it running in ps
[0:13] <athrift> health HEALTH_WARN 1 mons down, quorum 1,2 b,c
[0:13] <athrift> monmap e5: 3 mons at {a=10.8.1.1:6789/0,b=10.8.1.2:6789/0,c=10.8.1.3:6789/0}, election epoch 54, quorum 1,2 b,c
[0:13] <tnt> mmm, what does the log of mon.a say ?
[0:14] <athrift> 2013-05-23 10:14:19.488696 7fbd5bf6e700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption
[0:14] <athrift> 2013-05-23 10:14:19.488701 7fbd62ec1700 0 -- 10.8.1.1:6789/0 >> 10.8.1.2:6789/0 pipe(0x1566280 sd=22 :39363 s=1 pgs=0 cs=0 l=0).failed verifying authorize reply
[0:14] <athrift> so looks like possibly keyring ?
[0:14] * jeff-YF (~jeffyf@67.23.117.122) Quit (Ping timeout: 480 seconds)
[0:14] <tnt> athrift: are all mons at the same version ?
[0:18] <athrift> tnt: yes
[0:18] <athrift> 0.61.2
[0:19] <tnt> what distribution ? and were they upgraded from a previous version ?
[0:20] <tnt> given the "created 2013-02-12 11:19:00.896657" in the monmap I would say that they've been upgraded.
[0:25] <athrift> Ubuntu 12.04LTS and yes
[0:25] <tnt> run "apt-cache policy ceph ceph-common" on all the mons and pastebin the result
[0:30] <tchmnkyz> ok i know how to enable it
[0:30] <tchmnkyz> just need to figure out what level to enable
[0:30] <tchmnkyz> debug ms = 1
[0:30] <tchmnkyz> debug rbd = 20
[0:30] <tchmnkyz> log file = /tmp/rbd.$pid.log
[0:30] <tchmnkyz> something like that work?
[0:31] <tchmnkyz> it created a boat load of files
[0:31] <athrift> tnt: http://pastebin.com/axJDAWui
[0:32] <tnt> athrift: see those "Installed: 0.56.4-1precise" :)
[0:32] <tchmnkyz> http://pastebin.com/G4Tjka58
[0:32] <tchmnkyz> that is what i got from using those debug settings on one of the nodes
[0:32] <athrift> tnt: yes
[0:33] <athrift> oh wow
[0:33] <athrift> I wonder whats going on there
[0:33] <tnt> athrift: well, means two of the mon are actually still at 0.56.4 ... do apt-get install ceph on those two to force the update.
[0:34] <tnt> the new 'ceph' package had new dependencies and so "apt-get upgrade" didn't update them.
[0:34] <tchmnkyz> like now it has already created over 500 files in 10 minutes of turning it on
[0:34] <tnt> But it did update ceph-common ... and so you end up with mixed version.
[0:34] <tnt> that's like the 4th time I've seen people here with that exact situation
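A hedged sketch of the fix tnt is describing, run on each monitor host that still reports the old version (Ubuntu 12.04 package names; the restart syntax varies with the init system):

    apt-cache policy ceph ceph-common   # confirm which versions are actually installed
    apt-get update
    apt-get install ceph                # pulls the held-back 'ceph' package forward
    service ceph restart mon.a          # restart the local mon so it runs the new binary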
[0:35] <tchmnkyz> tnt: would my proxmox nodes having an older version than my ceph cluster cause problems like this?
[0:35] <athrift> on the upside, everything kept running pretty nicely :)
[0:35] <tnt> tchmnkyz: not unless it's really old.
[0:35] <tchmnkyz> the proxmox nodes are on like 56.3 and ceph side is on 61.2
[0:36] <tnt> shouldn't matter then.
[0:36] <tchmnkyz> k
[0:37] <tchmnkyz> just figured i would ask
[0:37] <tchmnkyz> then i really dont know where to go with this
[0:37] <tchmnkyz> something is causing this to happen
[0:37] <tchmnkyz> and i dont know what
[0:38] <tchmnkyz> got any ideas?
[0:40] <tchmnkyz> ok guys i need to head out for the night i guess we can bash my head into the wall with this tomorrow.
[0:40] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[0:43] <athrift> tnt: Thank you for your assistance, the problem is all solved now :)
[0:48] * diegows (~diegows@200.68.116.185) Quit (Ping timeout: 480 seconds)
[0:51] <athrift> Im loving ceph -w in 0.61.2 showing read and write bandwidth as well as op/s
[0:52] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[0:55] * PerlStalker (~PerlStalk@72.166.192.70) Quit (Quit: ...)
[0:58] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:58] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:03] * tnt (~tnt@91.177.210.16) Quit (Ping timeout: 480 seconds)
[1:08] * lofejndif (~lsqavnbok@19NAAC8ZE.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[1:09] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[1:10] * tkensiski1 (~tkensiski@209.66.64.134) has joined #ceph
[1:10] * tkensiski1 (~tkensiski@209.66.64.134) has left #ceph
[1:13] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[1:33] * LeaChim (~LeaChim@176.250.160.87) Quit (Ping timeout: 480 seconds)
[1:35] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[1:36] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[1:37] * esammy (~esamuels@host-2-102-69-49.as13285.net) Quit (Quit: esammy)
[1:41] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[1:43] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[1:44] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has left #ceph
[1:45] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[1:55] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[1:57] * wer (~wer@206-248-239-142.unassigned.ntelos.net) Quit (Remote host closed the connection)
[1:57] * wer (~wer@206-248-239-142.unassigned.ntelos.net) has joined #ceph
[2:00] * sh_t (~sht@lu.privatevpn.com) Quit (Ping timeout: 480 seconds)
[2:01] * sh_t (~sht@NL.privatevpn.com) has joined #ceph
[2:10] * alram (~alram@38.122.20.226) Quit (Ping timeout: 480 seconds)
[2:14] * The_Bishop (~bishop@f052098056.adsl.alicedsl.de) has joined #ceph
[2:28] * jeff-YF (~jeffyf@pool-71-163-34-123.washdc.fios.verizon.net) has joined #ceph
[2:29] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[2:37] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[3:06] * Tamil2 (~tamil@38.122.20.226) Quit (Quit: Leaving.)
[3:11] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[3:12] * jeff-YF (~jeffyf@pool-71-163-34-123.washdc.fios.verizon.net) Quit (Quit: jeff-YF)
[3:23] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[3:31] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[3:44] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[3:50] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[3:58] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[4:02] * treaki_ (88156254f0@p4FDF7CCD.dip0.t-ipconnect.de) has joined #ceph
[4:06] * treaki__ (0b0c2150c1@p4FDF714E.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[4:12] * The_Bishop_ (~bishop@f052096225.adsl.alicedsl.de) has joined #ceph
[4:17] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[4:18] * The_Bishop (~bishop@f052098056.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[4:22] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[4:23] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[4:24] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[4:25] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[4:33] <tchmnkyz> dmick: or tnt: got some time to try and help me figure this crap out?
[4:34] <dmick> maybe?...
[4:34] <tchmnkyz> i am trying to track down this corruption issue
[4:34] <tchmnkyz> i just lost an entire virtual drive
[4:35] <tchmnkyz> /dev/sdb just lost everything
[4:36] <dmick> oh
[4:36] <dmick> that's not pleasant
[4:37] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[4:37] <tchmnkyz> yea
[4:37] <tchmnkyz> it just keeps happening over and over
[4:38] <dmick> so I guess I need to prompt you for information? What does "lost a drive" and "lost everything" mean?
[4:41] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[4:41] <tchmnkyz> the vmdisk looks empty
[4:41] <tchmnkyz> no partition table
[4:42] <dmick> did it become filled with zeros? Is it still the right size?
[4:43] <tchmnkyz> size stayed the same
[4:44] <tchmnkyz> not filled with 0's but random data
[4:45] <dmick> what are the overall sizes of the cluster involved? (hosts, OSDs, mons)
[4:46] <tchmnkyz> i have 5 OSDs
[4:46] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[4:46] <tchmnkyz> 3 mons
[4:46] <tchmnkyz> mons are separate servers from the osds
[4:46] * loicd (~loic@magenta.dachary.org) has joined #ceph
[4:46] <dmick> so 8 hosts? or 4? or?
[4:46] <tchmnkyz> yes 8 physical devices
[4:47] <tchmnkyz> the osd are Dual Xeon 5620 with 48GB ram
[4:47] <tchmnkyz> 3ware 9750 RAID controller
[4:47] <tchmnkyz> [client]
[4:47] <tchmnkyz> debug ms = 1
[4:47] <tchmnkyz> debug rbd = 20
[4:47] <tchmnkyz> log file = /tmp/rbd.$pid.log
[4:47] <dmick> "physical device" == "machine". ok
[4:47] <tchmnkyz> sorry wrong paste
[4:47] <dmick> how many VMs/images?
[4:47] <tchmnkyz> 55tb raid 50 array
[4:48] <tchmnkyz> right now i think we have like 50 vm with maybe 75 - 100 images
[4:48] <dmick> they're running on different hosts?
[4:48] <dmick> separate from the cluster, I mean?
[4:48] <tchmnkyz> yes
[4:48] * jackhill (jackhill@pilot.trilug.org) has joined #ceph
[4:48] <tchmnkyz> i have 2 Dell M1000E Chassis
[4:48] <tchmnkyz> with 16 Dual Xeon Nodes
[4:49] <dmick> not important.
[4:49] <tchmnkyz> 32 total servers available for VM's
[4:49] * jackhill (jackhill@pilot.trilug.org) Quit ()
[4:49] <tchmnkyz> k
[4:49] <dmick> ok, so the one that lost sdb
[4:49] <tchmnkyz> the backend is Infiniband 10gb ipoib
[4:49] <dmick> find his image name, and rbd info about it
[4:49] <dmick> the sdb image, I mean, if he had more than one
[4:50] <tchmnkyz> rbd image 'vm-5000-disk-2': size 10240 GB in 2621440 objects order 22 (4096 KB objects) block_name_prefix: rb.0.a5b4d1.238e1f29 format: 1
[4:51] <dmick> ok. rados [-p pool] ls | grep rb.0.a5b4d1.238e1f29 | wc -l
[4:52] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[4:52] <tchmnkyz> i think what is happening is something is causing kvm to mess up the disk
[4:52] <tchmnkyz> that is running now btw
[4:52] <dmick> what leads you to think that?
[4:53] <tchmnkyz> because not all of my vms have the problem
[4:53] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[4:53] <tchmnkyz> on some
[4:53] <tchmnkyz> root@stor01:~ # rados -p moto ls | grep rb.0.a5b4d1.238e1f29 | wc -l
[4:53] <tchmnkyz> 11353
[4:53] <dmick> ok, so that's a pretty sparse image
[4:54] <tchmnkyz> yea it was one that i messed up on and made mbr not gpt
[4:54] <tchmnkyz> and rbd did not allow me to shrink it
[4:55] <dmick> ? rbd didn't allow it?
[4:55] <tchmnkyz> i could not figure out how to shrink an image once i had resized it to bigger
[4:55] <dmick> well from the rbd perspective, rbd resize, but
[4:56] <dmick> of course you have to get the filesystem to agree and shrink the partition first
[4:56] <dmick> with parted or some such
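The order dmick describes matters: shrink the data inside the guest first, then the image. A rough sketch under those assumptions (device names and sizes are illustrative, and shrinking destroys anything stored beyond the new size):

    # inside the guest: shrink the filesystem, then the partition, so nothing
    # lives beyond the first 1 TB (resize2fs for ext3/4; other filesystems differ)
    resize2fs /dev/sdb1 1024G
    # recreate or resize the partition with parted/fdisk so it ends within 1 TB

    # then, from a ceph client, shrink the RBD image itself (--size is in MB;
    # newer rbd releases also require --allow-shrink when reducing an image)
    rbd -p moto resize vm-5000-disk-2 --size 1048576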
[4:56] <dmick> but anyway:
[4:56] <tchmnkyz> the partition is 1tb, the vdrive is 10tb
[4:56] <dmick> presumably rados -p moto stat rb.0.a5b4d1.238e1f29.0 shows a 4M block?
[4:57] * Cube (~Cube@12.248.40.138) Quit (Ping timeout: 480 seconds)
[4:57] <tchmnkyz> error stat-ing moto/rb.0.a5b4d1.238e1f29.0: No such file or directory
[4:58] <dmick> oh, yeah, it's probably got more zeros there
[4:58] <dmick> hm, how many
[4:58] <tchmnkyz> i would not know
[4:58] <dmick> sigh, yes, thinking out loud
[4:59] <tchmnkyz> and the thing that makes me think it is not ceph/rbd is that if i take a snapshot of the vdisk i can restore that snap and it is fine
[4:59] <tchmnkyz> dont know if that helps any
[4:59] <dmick> that doesn't mean much; the snap is not written to
[5:00] <tchmnkyz> o ok
[5:00] <tchmnkyz> just figured i would give as much info as i can
[5:00] <dmick> so if ceph is somehow muffing the write, it would be to the newer obs
[5:00] <tchmnkyz> good point
[5:01] <tchmnkyz> my first thoughts were a IO lag because i had seen other users with proxmox have that problem
[5:01] <tchmnkyz> they seen similar behavior with some write delays from seagate drives
[5:01] <tchmnkyz> so that is what pointed me to maybe an IO issue
[5:02] <dmick> how does io lag cause disk corruption?
[5:02] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[5:02] <dmick> rados -p moto stat rb.0.a5b4d1.238e1f29.000000000000
[5:02] <tchmnkyz> i am not sure but they said that changing from seagate to wd disks fixed the issue
[5:03] <dmick> sounds like voodoo
[5:03] <tchmnkyz> moto/rb.0.a5b4d1.238e1f29.000000000000 mtime 1369278137, size 4194304
[5:04] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Ping timeout: 480 seconds)
[5:04] <dmick> ok, that mtime is Wed May 22 20:02:17 2013
[5:05] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:05] <dmick> so something is writing to the first 4M of the disk as we speak?
[5:05] <tchmnkyz> i had to start restoring
[5:05] <dmick> !!
[5:06] <tchmnkyz> i have to it is a customers VM i have to restore it now i cant just sit on it
[5:06] <dmick> ok, well, I can't do much to help then, sorry
[5:07] <tchmnkyz> i have another vm that is having a similar issue
[5:07] <tchmnkyz> we can look at that one
[5:10] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[5:10] * loicd (~loic@magenta.dachary.org) has joined #ceph
[5:12] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[5:12] <dmick> tchmnkyz: I can't really help you with customer-related issues unless you have an agreement in place; we're small, and it's just too risky for us.
[5:13] * khanhndq (~khanhndq@123.30.135.76) has joined #ceph
[5:13] <tchmnkyz> ok so help me with an internal vm that is part of my internal stuff, not a customer's
[5:13] * khanhndq (~khanhndq@123.30.135.76) Quit ()
[5:13] * khanhndq (~khanhndq@123.30.135.76) has joined #ceph
[5:15] <khanhndq> hi everybody
[5:15] <khanhndq> now i faced one issue in ceph block device
[5:15] <khanhndq> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
[5:15] <khanhndq> health HEALTH_OK
[5:15] <khanhndq> monmap e1: 2 mons at {a=49.213.67.204:6789/0,b=49.213.67.203:6789/0}, election epoch 20, quorum 0,1 a,b
[5:15] <khanhndq> osdmap e53: 2 osds: 2 up, 2 in
[5:15] <khanhndq> pgmap v535: 576 pgs: 576 active+clean; 11086 MB data, 22350 MB used, 4437 GB / 4459 GB avail
[5:15] <khanhndq> mdsmap e29: 1/1/1 up {0=a=up:active}, 1 up:standby
[5:15] <khanhndq> I did a benchmark with one osd and the io speed is about 190MB/s
[5:15] <khanhndq> But when I add more osds with replication size = 2, the write performance degrades to about 90MB/s
[5:15] <khanhndq> Following best practice, the write performance should increase as more osds are added, but I can't get that. :(
[5:16] <tchmnkyz> dmick: how much would it cost for support on our cluster this size
[5:16] <khanhndq> anyone can you help me check what's wrong in config file or anything else?
[5:17] <dmick> tchmnkyz: I really don't know; you'd need to contact the people who deal in that space
[5:17] <tchmnkyz> ok
[5:17] <tchmnkyz> inktank support?
[5:17] <dmick> http://www.inktank.com/support-services/
[5:18] <tchmnkyz> thnx
[5:19] <khanhndq> anyone can you help me check what's wrong in config file or anything else?
[5:20] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[5:21] <tchmnkyz> dmick: this is the errors i see on the vms when one of the drives go defunct
[5:21] <tchmnkyz> http://imgur.com/MmmXbmq
[5:22] <dmick> yeah, I saw that earlier. My strategy would be to try to see evidence of that error somewhere else
[5:22] <tchmnkyz> k
[5:22] <tchmnkyz> i sent a email to get a quote from support
[5:23] <khanhndq> <tchmnkyz> did you do benchmark in ceph system ?
[5:23] <tchmnkyz> was not really sure how to
[5:25] <khanhndq> first, create an ceph storage and create new pool and an image after that map that image such as file system
[5:25] <tchmnkyz> khanhndq: my cluster is 272tb of available space
[5:25] <khanhndq> and do io performance on it
[5:25] <tchmnkyz> it was fine and working perfect till a few weeks ago
[5:25] <tchmnkyz> this started like 2 weeks ago
[5:27] <khanhndq> did you do a benchmark like as read/write io per ceph cluster ?
[5:27] <tchmnkyz> yea i was getting great speeds before i started spinning up vms like this
[5:28] <khanhndq> do you share me your benchmark ?
[5:28] <tchmnkyz> that was like 8 months ago
[5:28] <tchmnkyz> i dont know where they are now
[5:29] <khanhndq> do you remember how much is the write speed/ volumes ?
[5:29] <tchmnkyz> i was getting average of Sata3 speeds on the test VMs
[5:30] <tchmnkyz> i had ran them under Debian, Windows 2003 & 2008
[5:31] * athrift (~nz_monkey@222.47.255.123.static.snap.net.nz) Quit (Quit: No Ping reply in 180 seconds.)
[5:32] <tchmnkyz> dmick: sorry if i was going about this the wrong way. I just have the owner of the company riding me because i am the one that setup this cluster and i am having problems now
[5:33] <tchmnkyz> he is freaking out on me so i am trying to fix this
[5:37] <tchmnkyz> like now i dont get this. I just did a data restore to that VM. and the same drive is gone again.
[5:40] <tchmnkyz> dmick: can you tell me this much with a VM should i be using some kind of caching for the virtual disk or none?
[5:40] <tchmnkyz> all of my vms now have no caching enabled
[5:41] <dmick> tchmnkyz: I understand pressure to support customers; that's why you have support contracts for escalations.
[5:41] <dmick> on caching:
[5:41] <dmick> if you enable rbd caching, you have to be sure that qemu knows about it
[5:41] <dmick> or you'll get data corruption
[5:41] <tchmnkyz> i am talking more at the qemu side
[5:42] <dmick> http://ceph.com/docs/master/rbd/qemu-rbd/#running-qemu-with-rbd
[5:42] <tchmnkyz> thnx
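What dmick is warning about is that the librbd cache and qemu's drive cache mode have to agree. A hedged example of a qemu drive definition with writeback caching enabled on both sides (the pool/image names are illustrative; the qemu-rbd doc linked above is authoritative):

    # older qemu with the rbd block driver; newer builds use format=raw instead
    qemu-system-x86_64 ... \
        -drive format=rbd,file=rbd:moto/vm-5000-disk-2:rbd_cache=true,cache=writeback

The caution in the linked doc is that enabling rbd_cache while leaving the qemu drive at cache=none means qemu does not send flush requests to librbd, which is how the data corruption dmick mentions can happen.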
[6:06] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[6:12] * pja (~pja@a.clients.kiwiirc.com) has joined #ceph
[6:12] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[6:13] * loicd (~loic@magenta.dachary.org) has joined #ceph
[6:13] * pja (~pja@a.clients.kiwiirc.com) Quit ()
[6:14] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[6:14] * pja (~pja@a.clients.kiwiirc.com) has joined #ceph
[6:15] * pja (~pja@a.clients.kiwiirc.com) Quit ()
[6:17] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[6:18] * jjgalvez1 (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[6:22] * KindTwo (KindOne@h24.238.22.98.dynamic.ip.windstream.net) has joined #ceph
[6:27] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[6:30] * KindTwo (KindOne@h24.238.22.98.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[6:41] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[6:47] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[6:55] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[6:58] * portante is now known as portante|afk
[6:59] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[7:00] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[7:03] * khanhndq (~khanhndq@123.30.135.76) Quit (Ping timeout: 480 seconds)
[7:03] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[7:08] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[7:11] <MrNPP> I'm getting a ton of "currently waiting for pg to exist locally"
[7:11] <MrNPP> after a system with 6 osd's on it crashed
[7:11] <MrNPP> pretty much all of ceph is unusable at this point, and i'm not sure why
[7:12] <MrNPP> i tried turning on debugging and nothing, and the system logs don't show anything either
[7:12] <MrNPP> HEALTH_WARN 1515 pgs degraded; 8 pgs down; 8 pgs peering; 2416 pgs stale; 8 pgs stuck inactive; 2416 pgs stuck stale; 1599 pgs stuck unclean; recovery 35805/129330 degraded (27.685%)
[7:16] <MrNPP> i removed the down host from the crushmap and its backfilling, if one host of 16 goes down with say 25 other osd's should the entire cluster be affected this badly?
[7:18] <phantomcircuit> MrNPP, depends on if the crush map is replication per osd or per host
[7:18] * esammy (~esamuels@host-2-102-69-49.as13285.net) has joined #ceph
[7:27] <MrNPP> per osd
[7:27] <MrNPP> i tried to change it to per host, but it crashed
[7:27] <MrNPP> the monitor when i tried to inject it
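The distinction phantomcircuit drew earlier shows up in the rule's chooseleaf step. A hedged fragment of a decompiled crushmap showing both variants (the bucket and rule names are illustrative):

    rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        # per-osd replication: replicas may land on the same host
        #     step chooseleaf firstn 0 type osd
        # per-host replication: each replica goes to a different host
        step chooseleaf firstn 0 type host
        step emit
    }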
[7:31] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:54] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[7:56] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[8:00] * tnt (~tnt@91.177.204.242) has joined #ceph
[8:02] * jtang1 (~jtang@80.111.97.194) Quit (Ping timeout: 480 seconds)
[8:04] <MrNPP> going to try it again
[8:08] * derRichard (~derRichar@pippin.sigma-star.at) has left #ceph
[8:09] * Qten (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Read error: Connection reset by peer)
[8:09] * Rorik (~rorik@199.182.216.68) Quit (Read error: Connection reset by peer)
[8:09] * Qten (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[8:09] <MrNPP> yeah i tried to insert a change into the crushmap using ceph osd setcrushmap -i crush.map
[8:09] <MrNPP> and now ceph isn't responding
[8:09] * Rorik (~rorik@199.182.216.68) has joined #ceph
[8:10] * iggy (~iggy@theiggy.com) Quit (Remote host closed the connection)
[8:10] * iggy (~iggy@theiggy.com) has joined #ceph
[8:13] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[8:13] * markl_ (~mark@tpsit.com) Quit (Remote host closed the connection)
[8:13] * markl (~mark@tpsit.com) has joined #ceph
[8:15] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[8:21] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:22] * loicd (~loic@2a01:e35:2eba:db10:a1a6:68fa:2645:5e59) has joined #ceph
[8:26] * jtang1 (~jtang@80.111.97.194) has joined #ceph
[8:30] * jjgalvez1 (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[8:31] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[8:57] * jtang1 (~jtang@80.111.97.194) Quit (Quit: Leaving.)
[8:59] * bergerx_ (~bekir@78.188.204.182) has joined #ceph
[9:00] * KindTwo (~KindOne@h237.168.17.98.dynamic.ip.windstream.net) has joined #ceph
[9:02] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:02] * KindTwo is now known as KindOne
[9:21] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) has joined #ceph
[9:29] * tziOm (~bjornar@194.19.106.242) has joined #ceph
[9:33] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:33] * ChanServ sets mode +v andreask
[9:35] * ScOut3R (~ScOut3R@212.96.47.215) has joined #ceph
[9:36] * t4nk672 (~79f4c73c@webuser.thegrebs.com) has joined #ceph
[9:42] <t4nk672> asd
[9:42] * t4nk672 (~79f4c73c@webuser.thegrebs.com) has left #ceph
[9:47] * humbolt (~elias@178-190-255-82.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[9:58] * humbolt (~elias@178-190-244-123.adsl.highway.telekom.at) has joined #ceph
[10:01] * nooky (~nooky@190.221.31.66) has joined #ceph
[10:03] * nooky_ (~nooky@190.221.31.66) Quit (Ping timeout: 480 seconds)
[10:09] * LeaChim (~LeaChim@176.250.160.87) has joined #ceph
[10:10] * leseb (~Adium@83.167.43.235) has joined #ceph
[10:15] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[10:17] * lightspeed (~lightspee@lns-c10k-ld-01-m-62-35-37-66.dsl.sta.abo.bbox.fr) Quit (Read error: Operation timed out)
[10:25] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[10:28] * loicd (~loic@2a01:e35:2eba:db10:a1a6:68fa:2645:5e59) Quit (Quit: Leaving.)
[10:36] * tnt (~tnt@91.177.204.242) Quit (Ping timeout: 480 seconds)
[10:42] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[10:44] * yehuda_hm (~yehuda@2602:306:330b:1410:22:96ef:51be:2ef9) Quit (Ping timeout: 480 seconds)
[10:47] * nooky_ (~nooky@190.221.31.66) has joined #ceph
[10:49] * nhm (~nhm@65-128-142-169.mpls.qwest.net) Quit (Remote host closed the connection)
[10:49] * nhm (~nhm@65-128-142-169.mpls.qwest.net) has joined #ceph
[10:49] * nooky (~nooky@190.221.31.66) Quit (Ping timeout: 480 seconds)
[10:52] * tserong (~tserong@124-171-115-108.dyn.iinet.net.au) Quit (Quit: Leaving)
[10:52] * coyo|2 is now known as Codora
[11:13] * joao (~JL@89.181.159.84) has joined #ceph
[11:13] * ChanServ sets mode +o joao
[11:20] <blue> hi. I'm having some problems with radosgw, where creating and listing buckets seem to work, but everything else returns 403 (InvalidAccessKeyId)
[11:21] <blue> i basically just followd http://ceph.com/docs/master/man/8/radosgw/ and a bit from http://wiki.debian.org/OpenStackCephHowto
[11:21] <blue> any ideas what could be wrong?
[11:43] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Read error: Connection reset by peer)
[11:47] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[11:56] * Vanony (~vovo@88.130.192.131) has joined #ceph
[11:59] * diegows (~diegows@190.190.2.126) has joined #ceph
[12:23] * acalvo (~acalvo@208.Red-83-61-6.staticIP.rima-tde.net) has joined #ceph
[12:23] <acalvo> Hello
[12:25] <acalvo> Trying to follow the install instructions on CentOS 6.4, but it fails the mkcephfs and the key generation and mentions using ceph-deploy. Howerver, ceph-deploy seems to be Ubuntu based so it fails to run on CentOS. Any tutorial up to date with that?
[12:28] * portante|afk is now known as portante|ltp
[12:30] * tdb (~tdb@willow.kent.ac.uk) has joined #ceph
[12:32] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[12:41] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[12:41] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[12:42] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[13:23] * andrei (~andrei@82.150.98.65) has joined #ceph
[13:23] * ninkotech (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[13:27] * madkiss (~madkiss@217.237.167.132) has joined #ceph
[13:30] <andrei> hello guys
[13:30] <andrei> could some please suggest a way to deal with changes in drive letters in /dev/ ?
[13:31] <andrei> i've unplugged an unused HD and after a reboot the disk letters have changed in /dev/
[13:32] <andrei> as a result i get errors on every osd when i try to start it
[13:32] <andrei> like these: unable to authenticate as osd.0
[13:37] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) Quit (Ping timeout: 480 seconds)
[13:47] * KindTwo (~KindOne@h208.53.186.173.dynamic.ip.windstream.net) has joined #ceph
[13:47] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[13:48] * KindTwo is now known as KindOne
[13:49] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[13:55] * acalvo (~acalvo@208.Red-83-61-6.staticIP.rima-tde.net) Quit (Quit: Ex-Chat)
[13:56] * diegows (~diegows@190.190.2.126) Quit (Ping timeout: 480 seconds)
[14:03] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[14:04] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) Quit ()
[14:06] * KindTwo (~KindOne@h190.175.17.98.dynamic.ip.windstream.net) has joined #ceph
[14:08] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:08] * KindTwo is now known as KindOne
[14:12] * john_barbee_ (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:13] * john_barbee_ (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit ()
[14:13] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[14:18] * jackhill (jackhill@pilot.trilug.org) has joined #ceph
[14:20] * DarkAce-Z (~BillyMays@50.107.54.92) has joined #ceph
[14:22] * DarkAceZ (~BillyMays@50.107.54.92) Quit (Ping timeout: 480 seconds)
[14:32] <tnt> andrei: don't use raw device names ...
[14:33] <tnt> use udev names /dev/disk/by-*/* to refer to your partitions and then label them properly to get persistent names
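A hedged example of what tnt is suggesting: refer to OSD data devices by stable udev paths or labels instead of /dev/sdX, so reordering after a reboot does not matter (the labels and mount points here are illustrative):

    # see the persistent names udev already provides
    ls -l /dev/disk/by-id/ /dev/disk/by-uuid/ /dev/disk/by-label/

    # give the OSD filesystem a label once (xfs_admin for xfs, e2label for ext4)...
    xfs_admin -L osd.0 /dev/sdc1

    # ...then mount by label in /etc/fstab rather than by raw device name
    LABEL=osd.0  /var/lib/ceph/osd/ceph-0  xfs  noatime  0  0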
[14:39] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[14:42] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has left #ceph
[14:49] <andrei> tnt: thanks
[14:49] <andrei> i will!
[14:53] * ScOut3R_ (~ScOut3R@212.96.47.215) has joined #ceph
[14:53] <andrei> tnt: I seem to be getting issues with writes when I reboot one of the servers
[14:53] <andrei> i've tried to restart it with init 6
[14:54] <andrei> the clients are mounted with ceph-fuse
[14:55] <andrei> the dd write command freezes a few seconds after I do init 6 on one of the servers
[14:55] <andrei> there are 3 mons
[14:55] <andrei> and 2 osd servers
[14:55] <andrei> one mds
[14:56] <tnt> I don't use cephfs ... no idea
[14:56] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 21.0/20130511120803])
[14:57] <andrei> does any one here know why the write doesn't automatically resume and use the second osd server ?
[14:57] <andrei> i can list the fs contents
[14:58] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[15:00] * ScOut3R (~ScOut3R@212.96.47.215) Quit (Ping timeout: 480 seconds)
[15:02] * madkiss (~madkiss@217.237.167.132) Quit (Remote host closed the connection)
[15:02] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[15:03] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[15:07] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[15:21] * mrjack_ (mrjack@office.smart-weblications.net) Quit ()
[15:23] * jeff-YF (~jeffyf@67.23.117.122) has joined #ceph
[15:24] * fghaas (~florian@217.237.167.132) has joined #ceph
[15:24] * andrei (~andrei@82.150.98.65) Quit (Ping timeout: 480 seconds)
[15:31] * fghaas (~florian@217.237.167.132) Quit (Quit: Leaving.)
[15:31] * hufman (~hufman@rrcs-67-52-43-146.west.biz.rr.com) has joined #ceph
[15:32] <hufman> hey, is sage around? i would like to know if i should include both node's logs in bug #5031
[15:39] <nhm> hufman: it's around 6:30am in california right now
[15:39] <hufman> lol
[15:39] * yanzheng (~zhyan@134.134.137.71) has joined #ceph
[15:39] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:47] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) Quit (Remote host closed the connection)
[15:48] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) has joined #ceph
[15:49] <hufman> so how is everyone this morning?
[15:52] * drokita (~drokita@199.255.228.128) has joined #ceph
[15:56] * portante|ltp (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Ping timeout: 480 seconds)
[16:01] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[16:02] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:07] <absynth> joao or nhm around?
[16:08] <absynth> preferrably both for separate questions
[16:08] <joao> absynth, sup?
[16:08] <absynth> joao: do you have a very high-level diagram of ceph infrastructure components?
[16:08] <absynth> in the docs or something?
[16:08] <joao> how high?
[16:09] <joao> just components as in, the mons, osds, etc and how they all relate to each other?
[16:09] <absynth> yeah
[16:09] <absynth> exactly
[16:09] <joao> there's some of those diagrams in some of the presentations on http://ceph.com/presentations
[16:10] <joao> not sure if the docs also have something like that
[16:10] <joao> I thought they did, but last time I checked didn't find any, so I might have been wrong all along
[16:11] <absynth> also, is sage awake yet?
[16:11] <joao> I bet he is, just haven't seen him online
[16:14] * yehudasa_ (~yehudasa@2602:306:330b:1410:cd14:65a8:f853:77ed) has joined #ceph
[16:14] * yehudasa_ (~yehudasa@2602:306:330b:1410:cd14:65a8:f853:77ed) has left #ceph
[16:14] * yehuda_hm (~yehuda@2602:306:330b:1410:d90a:7f97:da86:902a) has joined #ceph
[16:14] <absynth> ok, i'll just reopen our old deepscrub memleak ticket then
[16:15] <absynth> and hope it gets assigned correctly
[16:15] <tnt> you still see it on cuttlefish ?
[16:15] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Remote host closed the connection)
[16:16] <absynth> no, we still see it on latest bobtail
[16:16] <absynth> updating to cuttlefish is currently not an option - did you see the mailinglist postings lately? ;)
[16:17] <tnt> absynth: which ones ?
[16:17] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[16:18] <absynth> the mon issues, the ceph pg repair issues... all that stuff
[16:18] <tnt> oh,that's from yesterday :p
[16:22] * fghaas (~florian@217.237.167.132) has joined #ceph
[16:25] <Vanony> Hi. I'm lost on this one: I created a small test cluster with 3 machines, two osds per machine (using ceph-deploy). Then extended to a 4th. Apparently I messed up at some point, because now I have 9 osds showing up in the status and crushmap, one is down and out - and never was created/started on the 4th machine. How can I get rid of that osd? I pastied some info in http://pastie.org/7948053
[16:26] * fghaas (~florian@217.237.167.132) Quit ()
[16:28] <jeff-YF> has anyone here had any success with lio or tgt with ceph?
[16:29] <Vanony> I also tried editing out the "device 6 device6" line in the decompiled crushmap, recompiled and put it back. to no avail
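For what it's worth, editing the decompiled map only touches CRUSH; a phantom OSD also has to be removed from the osdmap and the auth database. A hedged sketch of the usual removal sequence for an osd.6 that never really existed, plus the round trip for hand-editing the map:

    ceph osd crush remove osd.6   # drop it from the crushmap (instead of hand-editing)
    ceph auth del osd.6           # remove its cephx key, if one was ever created
    ceph osd rm 6                 # remove it from the osdmap

    # if you do edit the map by hand, the full round trip is:
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt, then recompile and inject it
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new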
[16:30] * Wolff_John (~jwolff@ftp.monarch-beverage.com) has joined #ceph
[16:33] <nhm> absynth: yo. :)
[16:34] <tnt> mmm, I'm having some weird behavior where an osd suddenly throws a bunch of "pipe(0x1e0b6280 sd=87 :60923 s=2 pgs=8598 cs=1673 l=0).fault, initiating reconnect" ... (and it's like thousands of them).
[16:34] * portante (~user@66.187.233.207) has joined #ceph
[16:35] <dspano> absynth: Not to give you false confidence, but I upgraded to cuttlefish yesterday with no problems so far. I upgraded the mons, then restarted them all at once, and they came up rather quick. The conversion only took a few seconds.
[16:36] <dspano> Then I upgraded my osds and finally my mds servers.
[16:37] <hufman> my problem is that i assumed the package upgrade to bobtail would restart the programs, so one of my nodes never finished upgrading from argonaut to bobtail
[16:45] <joao> tnt, can you infer to whom it is trying to connect?
[16:45] <joao> can't recall if that debug message has that information
[16:45] <absynth> dspano: are you using rbd?
[16:47] * aliguori (~anthony@32.97.110.51) has joined #ceph
[16:47] <tnt> joao: yes, I stripped the start but it connects to all the other OSDs.
[16:48] <joao> can you check if you can actually connect to those other osds servers from that server?
[16:48] <tnt> joao: it seems to start with dozens of "2013-05-23 13:43:31.166272 7f589c9dc700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f589a1d7700' had timed out after 15
[16:48] <joao> hmm
[16:48] <joao> okay
[16:49] <joao> don't think I can help you with that then
[16:49] <tnt> joao: well, a few minutes later it was restored without me doing anything so I guess it could connect.
[16:50] <tnt> and trying like 1000 connections inside 1 second seems like a bad idea ... (that pipe fault / initiating reconnect is present for several seconds and there is ~ 1000 msg/sec)
[16:51] <joao> yeah, that should probably be backed off or something
[16:51] * tziOm (~bjornar@194.19.106.242) Quit (Remote host closed the connection)
[16:52] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) Quit (Remote host closed the connection)
[16:54] * yanzheng (~zhyan@134.134.137.71) Quit (Remote host closed the connection)
[17:01] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[17:01] * samware (~bo_samwar@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[17:02] * portante_ (~user@66.187.233.206) has joined #ceph
[17:02] <tnt> joao: and at the same time the quorum leader (which is running the proposed fix for the growth issue) also seemed to just 'stop' ... (dmesg shows a hung process). But I'm not really sure if it's really the mon's fault or if something else triggered both events ...
[17:03] <dspano> absynth: Yeah. RBD with Openstack Folsom. I'm also using cephfs for a small share.
[17:04] * samware (~bo_samwar@static-71-170-33-24.dllstx.fios.verizon.net) Quit ()
[17:04] * samware (~bo_samwar@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[17:05] <joao> tnt, is the mon and that osd running on the same server?
[17:06] <dspano> absynth: I will also say, I am what you would call a small deployment, so I don't have a lot going on. I run two Dell R515s as OSD/Mon/MDS servers, and one Dell R210 as an MDS/Mon.
[17:06] <tnt> they're running as Xen VMs on the same physical server yes.
[17:06] <tnt> (they have distinct physical drives though)
[17:06] <dspano> Lol. My production is probably the size of most people's POC or test environments.
[17:07] <tchmnkyz> tnt/dmick: it seems that enabling writeback caching has helped prevent the corruption i was seeing
[17:07] * portante (~user@66.187.233.207) Quit (Ping timeout: 480 seconds)
[17:07] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[17:07] <tnt> tchmnkyz: or hiding it ...
[17:08] * jshen (~jshen@209.133.73.98) has joined #ceph
[17:08] <absynth> dspano: ok, we run hundreds of VMs, so probably should back off a little while more
[17:09] <tchmnkyz> i have reached out to inktank to purchase support
[17:09] <tchmnkyz> figured that should help too
[17:09] <absynth> that helps, yeah
[17:09] <absynth> one of the wiser business decisions i made the last year
[17:10] * diegows (~diegows@200.68.116.185) has joined #ceph
[17:11] <tnt> I'm wondering what the pricing is for it.
[17:11] <tchmnkyz> i hope not too horrible
[17:12] <tchmnkyz> also convinced my boss to spring for the ceph training as well
[17:15] <absynth> tnt: it's affordable even for a small company like us, and inktank has really gone to lengths to solve our problems
[17:15] <absynth> everyone with a production setup should seriously consider a contract
[17:15] <absynth> (joao, did we talk about commission yet?)
[17:16] <janos> lol
[17:17] <absynth> it is my conviction, though. seriously.
[17:17] <absynth> if you guys knew what inktank went through with us...
[17:17] * fghaas (~florian@217.237.167.132) has joined #ceph
[17:17] * fghaas (~florian@217.237.167.132) Quit ()
[17:18] <topro> hi, what do you think might be the bottleneck of a cephfs cluster when running a fio benchmark with 4k random read and write only gives about 40 iops, but on the ceph nodes neither cpu load nor disk usage seems to be the limit?
[17:18] * scuttlemonkey_ is now known as scuttlemonkey
[17:19] * ChanServ sets mode +o scuttlemonkey
[17:19] <jshen> hi everyone happy morning! i wonder if anyone knows how to change the cephfs mount size. i could not find it in the docs. thanks!
[17:20] <tnt> topro: network latency ?
[17:20] <topro> tnt: is there a way to tell?
[17:20] <tchmnkyz> provided that inktank is decent enough on the support contract i will def ink the deal on one
[17:20] <topro> s/tell/test/
[17:23] <tchmnkyz> and depending on the cost of the classes the boss man says i can do them too
[17:23] <samware> take me w/you, tchmnkyz
[17:24] <tchmnkyz> with the size of my cluster now i have to do something
[17:24] <tchmnkyz> we are at a 8 node (3 mon 5osd) and about to expand it out to 20 osd's
[17:25] * samware is now known as bo
[17:25] * bo (~bo_samwar@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Quit: bia)
[17:25] * bo (~bo@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[17:25] <tchmnkyz> by mid summer i hope to have 1.6pb of seph space
[17:25] <tchmnkyz> ceph
[17:25] <tchmnkyz> even
[17:25] <tchmnkyz> been awake way too long
[17:27] <bo> 20 OSDs = 1.6PBs in your projected deployment?
[17:28] <bo> raiding drives per OSD or something?
[17:29] <dspano> absynth: In your situation, that's probably a wise decision.
[17:33] <absynth> 20 TB per OSD, how is that going to work?
[17:35] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Quit: Leaving.)
[17:35] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Quit: Konversation terminated!)
[17:35] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[17:40] <topro> what would be a reasonable ping node-to-node and client-to-node?
[17:41] <absynth> you mean osd-to-osd? <1ms
[17:41] <topro> well I get ~0.1ms osd-to-osd and about ~0.2ms client-to-osd
[17:43] <topro> still, running a 4k random read write benchmark on cephfs I only get about 40iops sum (20 read, 20 write) but neither disks nor cpu of any of the osds would max out. where might the bottleneck be hiding?
[17:43] <imjustmatthew> topro: that's definitely workable latency, I'm using regional clients with cephfs at client-to-osd latencies of 5-8ms
[17:44] <imjustmatthew> topro: are you running the kernel client or fuse?
[17:44] <topro> kernel client (debian experimental supplied linux-image 3.8.5)
[17:45] * pja (~pja@a.clients.kiwiirc.com) has joined #ceph
[17:45] * portante_ is now known as portante
[17:46] * topro desperately waiting for a 3.9 linux-image to get tunables going
[17:47] <imjustmatthew> topro: hmm, what kind of speed are you getting from the rados bench?
[17:47] <pja> hello
[17:48] <hufman> what are some good rados_bench options?
[17:49] * pja (~pja@a.clients.kiwiirc.com) Quit ()
[17:49] * pja (~pja@a.clients.kiwiirc.com) has joined #ceph
[17:49] * dcasier (~dcasier@223.103.120.78.rev.sfr.net) has joined #ceph
[17:52] <topro> imjustmatthew: can't find rados bench cmdline, have a short hint, please?
[17:53] <imjustmatthew> topro: ceph osd tell N bench [BYTES_PER_WRITE] [TOTAL_BYTES]
[17:53] <imjustmatthew> http://ceph.com/docs/master/rados/operations/control/?highlight=bench
[17:53] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[17:54] <pja> health HEALTH_WARN clock skew detected on mon.b, mon.c
[17:54] <pja> but ive ntp running
[17:54] <pja> hwclock and date says identical
[17:54] <pja> where does the skew come from?
[17:55] <tchmnkyz> pja run a date at the same time on both nodes and make sure they match node to node
[17:57] <topro> imjustmatthew: osd.0 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.819383 sec at 66284 KB/sec
[17:57] <topro> its about the same for all 9 osds
[17:59] <tchmnkyz> bo: yes it is 20 x Raid 50 (55TB arrays)
[17:59] <imjustmatthew> topro: yeah, so the OSD part seems okay. That's all I can point you towards, I would ask gregaf or sagewk; they're both devs on US-Pacific time
[17:59] <tnt> joao: I'm fairly sure the issue I had a couple hours ago has nothing to do with the mon changes, I think it's an issue I've seen before when you get IO on a RBD disk hosted on the same Xen physical server as the OSD serving it ... sometimes it screws things up somehow.
[18:00] <topro> imjustmatthew: thanks so far
[18:00] <imjustmatthew> np, good luck
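For context, a 4k random read/write run of the kind topro describes might look roughly like this with fio (a sketch only; his actual job file is not shown in the log, and the target directory is illustrative):

    fio --name=randrw --directory=/mnt/cephfs/fio-test \
        --rw=randrw --rwmixread=50 --bs=4k --size=1g \
        --ioengine=libaio --direct=1 --iodepth=16 \
        --runtime=60 --time_based --group_reporting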
[18:07] <pja> @tchmnkyz: ive done this
[18:07] <cephalobot`> pja: Error: "tchmnkyz:" is not a valid command.
[18:07] <pja> tchmnkyz: ive done this
[18:08] * BillK (~BillK@124-169-236-155.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:09] <joao> tnt, let us know if that happens again
[18:09] * alram (~alram@38.122.20.226) has joined #ceph
[18:11] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:12] * ScOut3R_ (~ScOut3R@212.96.47.215) Quit (Ping timeout: 480 seconds)
[18:13] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:15] * Dark-Ace-Z (~BillyMays@50.107.54.92) has joined #ceph
[18:16] * tnt (~tnt@91.177.204.242) has joined #ceph
[18:18] <pja> but am i right that this skew failure is because of different times on the nodes?
[18:18] * DarkAce-Z (~BillyMays@50.107.54.92) Quit (Ping timeout: 480 seconds)
[18:19] <pja> root@hcmonko1:~# date && ssh hcmonko3 date && ssh hcmonko2 date
[18:19] <pja> Thu May 23 18:19:38 CEST 2013
[18:19] <pja> Thu May 23 18:19:38 CEST 2013
[18:19] <pja> Thu May 23 18:19:38 CEST 2013
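When the dates agree to the second, the skew the monitors complain about is usually sub-second, so it helps to look at the actual offsets. A couple of hedged checks (the drift tolerance is configurable and defaults to a small fraction of a second):

    ntpq -p              # the offset column shows what ntpd measures, in milliseconds
    ceph health detail   # the monitors report the skew they see and against which peer

    # if the skew is genuinely tiny, the tolerance can be loosened in ceph.conf:
    # [mon]
    #     mon clock drift allowed = 0.1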
[18:21] <sage> joao, tnt: how did that branch do?
[18:22] <joao> sage, haven't seen any complaints so far, but haven't heard any definite good news either
[18:22] <joao> tnt?
[18:22] <sage> mikedawson: ping!
[18:22] <sage> i'm inclined to merge into next and test further.. and backport to cuttlefish only when we get confirmation
[18:25] <tchmnkyz> pja restart the mons
[18:25] <joao> sage, imo it wouldn't hurt
[18:25] <sage> joao: 2 minor nits on that branch.. then let's merge
[18:25] <joao> kay
[18:25] <joao> just finishing updating a #5069
[18:25] <joao> s/a//
[18:26] <tchmnkyz> joao: have you seen any major reasons not to use infiniband for a 10gbps backend to ceph??
[18:26] <joao> I have absolutely no idea how to answer that
[18:27] <tchmnkyz> ok
[18:27] <tchmnkyz> basically i am running ceph and the network layer is ipoib 10gbps links
[18:27] <tchmnkyz> back to a IB grid director
[18:28] * Tamil (~tamil@38.122.20.226) has joined #ceph
[18:29] <tchmnkyz> ok i am kinda laughing my bawls off right now. I was just informed that this guy is a expert dba that does not know how to use ssh/cmdline
[18:29] <tchmnkyz> he needs a GUI to install oracle db
[18:29] <tchmnkyz> this makes me laugh so much
[18:30] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[18:32] <joao> sagewk, whenever you have the chance: http://tracker.ceph.com/issues/5069#note-4
[18:37] <mikedawson> sage: yessir
[18:38] <pja> ive restarted them
[18:38] <tchmnkyz> fix it?
[18:38] <pja> nope
[18:39] <tchmnkyz> that worked for me when mine was skewed
[18:39] <pja> restarted them more than one, also restarted all the servers
[18:39] <pja> one=once
[18:39] <pja> can we check, why ceph thinks its skew?
[18:40] <tchmnkyz> that is not something i would be able to help with
[18:41] <tchmnkyz> not sure where to start
[18:41] <tchmnkyz> mine was actually off on my cluster
[18:41] <pja> hmm
[18:41] <pja> i'm new to ceph
[18:42] <pja> when i want to add a node or only an osd.. is it right that i change the config, copy it via ssh to each node (mon, osd, ..) and then run ceph -a start to reload the new config, so the new servers are activated?
[18:43] <tchmnkyz> i think so
[18:43] <tchmnkyz> have not really added new nodes yet
[18:43] <tchmnkyz> will be doing that soon
[18:43] * BillK (~BillK@124-169-236-155.dyn.iinet.net.au) has joined #ceph
[18:43] <pja> another question: when i have to restart 1 server, i think the ceph mechanism will start replicating the data to the other hosts
[18:44] <pja> is it possible to mark a server for maintenance, so it doesnt start replicating the data, only because 1 osd is restarting
[18:50] * Wolff_John (~jwolff@ftp.monarch-beverage.com) Quit (Ping timeout: 480 seconds)
[18:52] <mikedawson> sage: I have seen no issues with wip-4895-cuttlefish so far
[18:53] <sage> excellent
[18:55] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) has joined #ceph
[18:56] <sage> joao: oh.. we should have a complete mon log for that
[18:56] <sage> from the teuthology failure
[18:57] * lyncos (~chatzilla@208.71.184.41) has joined #ceph
[18:58] <lyncos> Hi I need help, it seems I cannot write to my ceph cluster anymore .. I'm using version 0.61.1, doing 'rados bench -p data 300 write' creates slow requests.. is there any way to find out which OSD is not answering correctly ?
[18:58] <lyncos> I can delete and create pool just fine
[18:59] <lyncos> it seems I cannot put anything on the cluster anymore and I just did change the network interfaces to a bond ... and yes the network is working
[19:00] <lyncos> Latency is 19.0213 .. maybe it can help
[19:00] <lyncos> seems like a timeout .. it's always the same value for latency
[19:00] * bo is now known as redeemed
[19:00] * redeemed is now known as bo
[19:01] * bo is now known as redeemed
[19:01] <lyncos> by the way my health is OK
[19:02] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: Easy as 3.14159265358979323846... )
[19:03] * eegiks (~quassel@pro75-5-88-162-203-35.fbx.proxad.net) Quit (Remote host closed the connection)
[19:05] <lyncos> hmmm I see
[19:06] <joao> sage, where are the log for #5069 that you mentioned on the bug?
[19:06] <sage> should be in the teuth dir
[19:07] <joao> not the one initially reported
[19:07] <joao> those are gone
[19:07] <joao> run must have been nuked
[19:07] <sage> bah, ok.. needs more info then
[19:07] <sage> and lets downgrade to high, since we haven't seen this since.
[19:07] <joao> kay
[19:12] <lyncos> no one can help ?
[19:14] <jshen> could anyone help with my cephfs size configurationg (bobtail) question? the default is too small. thanks in advance!
[19:15] <sage> lyncos: rados --debug-ms 1 .... will give you some clue as to what part is slow
[19:16] <lyncos> sage ok thanks let me try this
[19:18] <lyncos> nothing seems to be wrong
[19:19] <lyncos> sage can you check please if something wrong... to me it seems fine: http://pastebin.com/MuSC47Hx
[19:19] <sage> repeat with -t 1 pls?
[19:20] <lyncos> ok
[19:21] <lyncos> http://pastebin.com/mzaiUYKt
[19:22] * diegows (~diegows@200.68.116.185) Quit (Ping timeout: 480 seconds)
[19:22] <lyncos> you see... cur MB/s is almost always 0
[19:22] <lyncos> sometime it works at around 100 MB/s but just for 1 iteration
[19:23] <lyncos> here is ceph -w
[19:23] <lyncos> http://pastebin.com/6UsAwV6c
[19:24] * dwt (~dwt@128-107-239-235.cisco.com) has joined #ceph
[19:24] * redeemed (~bo@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Quit: bia)
[19:25] * pja (~pja@a.clients.kiwiirc.com) Quit (Quit: http://www.kiwiirc.com/ - A hand crafted IRC client)
[19:25] * redeemed (~redeemed@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[19:26] <hufman> sage: for that bug 5031, which log(s) do you want? one node started up, then crashed when the second node started up
[19:26] <hufman> first node's log is nice and small, second node's log compressed to 14mb
[19:28] <sage> actually, one of yan's patches he sent out this morning addresses this issue
[19:29] * Wolff_John (~jwolff@ftp.monarch-beverage.com) has joined #ceph
[19:29] <sage> but maybe attach the log ot the bug for posterity's sake? :)
[19:29] <sage> both for good measure
[19:29] <hufman> so i can attach such logs? is txt.xz a good format?
[19:29] <lyncos> sage . you see something wrong ?
[19:29] <sage> xz is fine
[19:29] <sage> as is 14mb
[19:30] <hufman> ok :)
[19:30] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Ping timeout: 480 seconds)
[19:33] * leseb (~Adium@83.167.43.235) Quit (Quit: Leaving.)
[19:34] <lyncos> sage .. what is odd .. is that the recovery seems to work.. if it was a problem with an OSD I guess this would fail ?
[19:34] <sage> sorry, in a meeting..
[19:35] <lyncos> anyone else can help ?
[19:35] <sjust> lyncos: what is the output of ceph -s
[19:36] <sjust> also, ceph osd tree
[19:36] <lyncos> ok 1 sec
[19:37] <lyncos> http://pastebin.com/LAF841HL
[19:37] <lyncos> it's degraded now .. but I have the same issue when HEALTH_OK
[19:38] <lyncos> it doesn't seem to recover
[19:38] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[19:38] <sjust> you have 10 pgs on 5 osds
[19:38] <lyncos> I must have less than 5 ?
[19:38] <sjust> you should have around 100 pgs/osd
[19:38] <lyncos> ahhh let me re-create my pool
[19:38] <sjust> but that's not actually causing the problem
[19:39] <sjust> ceph version?
[19:39] <Kioob> (Hi)
[19:40] <lyncos> ceph version 0.61.1
[19:40] <lyncos> I did ceph osd pool create data 100 100
[19:40] <lyncos> now it scrubbing
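A rough version of the rule of thumb sjust mentioned, for choosing the pg_num in that create command (the ~100 PGs per OSD figure is a guideline, not a hard requirement):

    # pg_num ~= (number of OSDs * 100) / replica count, rounded up to a power of two
    # e.g. 5 OSDs * 100 / 2 replicas = 250  ->  256
    ceph osd pool create data 256 256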
[19:41] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[19:44] <lyncos> after re-creating my pool i get this http://pastebin.com/rQipLKQ6
[19:44] <lyncos> it seems I have mon down
[19:44] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:44] * ChanServ sets mode +v andreask
[19:52] * BillK (~BillK@124-169-236-155.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[19:52] * davidzlap (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:52] * LeaChim (~LeaChim@176.250.160.87) Quit (Ping timeout: 480 seconds)
[19:55] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[19:58] <lyncos> even with rbd it cannot write
[20:01] * LeaChim (~LeaChim@2.127.72.50) has joined #ceph
[20:02] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[20:05] * lyncos (~chatzilla@208.71.184.41) Quit (Remote host closed the connection)
[20:06] * rturk-away is now known as rturk
[20:06] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[20:06] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[20:07] * Cube (~Cube@12.248.40.138) has joined #ceph
[20:07] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:08] * lyncos (~chatzilla@208.71.184.41) has joined #ceph
[20:08] <lyncos> Argh I lost my connection.. did I miss something ?
[20:09] * eschnou (~eschnou@203.39-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:09] <hufman> nope :)
[20:09] <lyncos> I still have my problem of slow request
[20:09] <lyncos> I cannot write anything to my cluster
[20:10] <lyncos> osd_op(client.6336.1:211 rb.0.18c8.74b0dc51.00000001e842 [write 3584~29184] 10.84428b0d RETRY=-1 e304) currently waiting for ondisk
[20:10] <lyncos> what currently waiting for ondisk means ?
[20:10] * gucki (~smuxi@84-73-201-95.dclient.hispeed.ch) has joined #ceph
[20:10] <gucki> hi there
[20:11] <lyncos> means the data is on the journal ? or the journal is not working ?
[20:11] <lyncos> 21 slow requests, 1 included below; oldest blocked for > 581.232859 secs
[20:11] <gucki> i'm running the latest bobtail release 0.56.6 in production and would like to upgrade to cuttlefish. i just wonder if there are any known issues or bugs fixed whic have not yet been released?
[20:11] <lyncos> how to find what is blocking
[20:12] * SvenPHX1 (~scarter@wsip-174-79-34-244.ph.ph.cox.net) has left #ceph
[20:14] * kyle__ (~kyle@216.183.64.10) has joined #ceph
[20:15] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[20:16] * bergerx_ (~bekir@78.188.204.182) Quit (Quit: Leaving.)
[20:20] * kyle_ (~kyle@216.183.64.10) Quit (Ping timeout: 480 seconds)
[20:40] * rturk is now known as rturk-away
[20:41] * jeff-YF (~jeffyf@67.23.117.122) Quit (Quit: jeff-YF)
[20:43] * jeff-YF (~jeffyf@67.23.117.122) has joined #ceph
[20:46] * lyncos (~chatzilla@208.71.184.41) Quit (Quit: ChatZilla 0.9.90 [Firefox 21.0/20130512193848])
[20:49] * b1tbkt_ (~quassel@24-216-67-250.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[20:53] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[20:56] * b1tbkt (~b1tbkt@24-217-196-119.dhcp.stls.mo.charter.com) has joined #ceph
[21:02] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:16] * eschnou (~eschnou@203.39-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[21:20] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[21:21] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[21:22] * Tamil (~tamil@38.122.20.226) Quit (Quit: Leaving.)
[21:23] <wonko_be> hey, anyone here who can help me with some issues i have on using the ceph cookbooks?
[21:25] <wonko_be> ceph-mon seems to need an fsid when used with --mkfs
[21:26] <wonko_be> neither the manpage nor --help mentions this, and the online documentation yields no results when searching for ceph-mon
[21:27] <wonko_be> (i got 0.61.2-1precise installed)
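For reference, ceph-mon's --mkfs needs the cluster fsid either from an fsid= line in the [global] section of ceph.conf, from a monmap passed with --monmap, or explicitly on the command line; a minimal sketch, with the mon id, fsid and keyring path as placeholders:

    ceph-mon -i a --mkfs --fsid <cluster-fsid> --keyring /path/to/mon.keyring

The fsid has to match the existing cluster (e.g. as shown by "monmaptool --print" on an extracted monmap), not a freshly generated uuid.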
[21:27] * eschnou (~eschnou@203.39-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:40] * dcasier (~dcasier@223.103.120.78.rev.sfr.net) Quit (Ping timeout: 482 seconds)
[21:45] * Vjarjadian (~IceChat77@90.214.208.5) has joined #ceph
[21:45] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has joined #ceph
[21:47] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has left #ceph
[21:53] <wonko_be> no worries, found it
[22:13] <redeemed> wonko_be: what did you discover?
[22:14] <Kioob> gucki: I lost one MON during the migration (out of 5), but all 49 OSDs migrated fine.
[22:14] * diegows (~diegows@200.68.116.185) has joined #ceph
[22:14] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:15] <Kioob> You have to upgrade MONs one by one, then OSDs one by one (or host by host, depending on your CRUSH replication scheme)
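A rough sketch of that order, assuming Debian/Ubuntu packages and the sysvinit/upstart service names of that era (ids and numbers are placeholders):

    # on each monitor host, one at a time:
    apt-get update && apt-get install ceph ceph-common
    service ceph restart mon.<id>     # or "restart ceph-mon id=<id>" on upstart systems
    ceph quorum_status                # wait until the mon has rejoined the quorum

    # then on each OSD host (or CRUSH failure domain), one at a time:
    service ceph restart osd.<num>
    ceph -s                           # wait for recovery / active+clean before the next host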
[22:18] <wonko_be> seems the whole ceph-deploy thing is now the new way to go... and all the low-level commands have been ripped out of the docs
[22:21] <redeemed> wonko_be: ceph-deploy is unreliable in my environment :/ it's been a burden
[22:22] <wonko_be> yes
[22:22] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[22:22] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[22:22] <wonko_be> the lack of low-level documentation is a bit of a pain when trying to update and modify some things
[22:23] <wonko_be> i have to go and read the python wrapper to get an idea of the parameters and the normal way to do things
[22:24] <redeemed> wonko_be: perhaps the old docs are archived on their site or cached on google.
[22:25] <wonko_be> repeating my nick isn't really helpful either
[22:25] <wonko_be> but i might be talking to a bot now
[22:25] <wonko_be> i'll just leave it for today
[22:26] * Tamil (~tamil@38.122.20.226) has joined #ceph
[22:28] <janos> targeting nicks is pretty common practice so that communications don't get confused
[22:28] <janos> usually helpful when multiple conversations are going on
[22:30] <mrjack> it would be cool if the monitor would gracefully exit the quorum and notify the other mons that it will be down before it exits...
[22:30] <mrjack> if i stop the monitor via initscripts
[22:30] <mrjack> or do restart
[22:35] * drokita (~drokita@199.255.228.128) Quit (Ping timeout: 480 seconds)
[22:39] * redeemed_ (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[22:41] <tnt> gucki: there is a mon issue in cuttlefish where the store grows in size very fast, but that was narrowed down yesterday; mikedawson and I are currently testing a fix and it looks promising.
[22:43] <tnt> and when upgrading, it's actually best to restart a majority of mons at the same time ... because new mons can't talk to old mons during the bobtail -> cuttlefish upgrade, so you'll lose quorum anyway.
[22:44] * redeemed_ (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Remote host closed the connection)
[22:44] <tnt> also if you're on debian, make sure to do _both_ an "apt-get upgrade" and an "apt-get install ceph ceph-mds", or you'll end up in a mixed-version situation where things can get weird.
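A quick way to confirm nothing got held back after such an upgrade (package names are the usual Debian/Ubuntu ones, adjust as needed):

    apt-get update
    apt-get upgrade
    apt-get install ceph ceph-common ceph-mds   # pulls in packages that "upgrade" kept back
    dpkg -l | grep ceph                         # every ceph package should show the same version
    ceph --version                              # version of the installed binaries; restart daemons to pick it up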
[22:47] * eternaleye (~eternaley@2607:f878:fe00:802a::1) Quit (Ping timeout: 480 seconds)
[22:48] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Quit: Leaving.)
[22:50] * eternaleye (~eternaley@2607:f878:fe00:802a::1) has joined #ceph
[22:57] <mrjack> tnt: cool good news?! ;)
[22:57] <mikedawson> gucki, tnt: yep. Cuttlefish is a bit suspect right now, but the patch looks promising. If you can wait for 0.61.3, I would.
[22:59] * tnt agrees
[22:59] <tnt> because other than those mon issues, i've been seeing less OSD memory usage and better RBD perf on small IO so it's pretty good.
[23:04] * eschnou (~eschnou@203.39-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:06] * gippa (~gippa@corsico.wiran.net) has joined #ceph
[23:07] <gippa> hello! I'm having a problem installing ceph cuttlefish on two clusters: 2 nodes with debian7 and 2 nodes with ubuntu 12.04
[23:08] <gippa> mds is dying in both nodes
[23:08] <gippa> in both clusters
[23:08] <gippa> any help please?
[23:09] <gippa> thanks
[23:12] <gippa> ply(2 mds0_sessionmap [read 0~0] ack = -2 (No such file or directory)) v4 ==== 114+0+0 (3848257979 0 0) 0x26f7e00 con 0x270c840
[23:12] <gippa> 0> 2013-05-23 23:12:02.206228 7f3e94e94700 -1 *** Caught signal (Aborted) **
[23:12] <gippa> in thread 7f3e94e94700
[23:12] * Wolff_John (~jwolff@ftp.monarch-beverage.com) Quit (Remote host closed the connection)
[23:12] <gippa> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
[23:12] <gippa> 1: /usr/bin/ceph-mds() [0x874632]
[23:12] <gippa> 2: (()+0xf030) [0x7f3e99f85030]
[23:12] <gippa> 3: (gsignal()+0x35) [0x7f3e98882475]
[23:12] <gippa> 4: (abort()+0x180) [0x7f3e988856f0]
[23:12] <gippa> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f3e990d789d]
[23:12] <gippa> 6: (()+0x63996) [0x7f3e990d5996]
[23:12] <gippa> 7: (()+0x639c3) [0x7f3e990d59c3]
[23:12] <gippa> 8: (()+0x63bee) [0x7f3e990d5bee]
[23:12] <gippa> 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x7dc3a7]
[23:12] <gippa> 10: (SessionMap::decode(ceph::buffer::list::iterator&)+0x3a) [0x6ff27a]
[23:12] <gippa> 11: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x6e) [0x70026e]
[23:12] <gippa> 12: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1b) [0x7275bb]
[23:12] <gippa> 13: (MDS::handle_core_message(Message*)+0xae7) [0x513c57]
[23:12] <gippa> 14: (MDS::_dispatch(Message*)+0x33) [0x513d53]
[23:12] <gippa> 15: (MDS::ms_dispatch(Message*)+0xab) [0x515b3b]
[23:12] <gippa> 16: (DispatchQueue::entry()+0x393) [0x847ca3]
[23:12] <gippa> 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x7caeed]
[23:12] <gippa> 18: (()+0x6b50) [0x7f3e99f7cb50]
[23:12] <gippa> 19: (clone()+0x6d) [0x7f3e9892aa7d]
[23:12] <gippa> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
[23:12] <phantomcircuit> gippa, pastebin.com
[23:12] <phantomcircuit> please and thank you
[23:13] <scuttlemonkey> ^
[23:13] <gippa> sorry, being new here
[23:14] <gippa> http://pastebin.com/C81g5jFd
[23:15] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[23:15] <gippa> it's more or less the same on the nodes of the two clusters
[23:15] <gippa> it was originally on two nodes with ubuntu with multiple osds
[23:16] <gippa> tried to reproduce it on a smaller cluster on debian7 with two osds, just to test whether I could reproduce it
[23:16] <gippa> got another cluster on bobtail on CentOS working ok
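From the pastebinned trace, the MDS reads the mds0_sessionmap object, gets -2 (No such file or directory) back, and then aborts while trying to decode the empty result, so the sessionmap object looks missing or unreadable. A hedged first check, assuming the default metadata pool name of that era:

    rados -p metadata ls | grep sessionmap
    rados -p metadata stat mds0_sessionmap   # size/mtime, if the object exists

Whether the object exists at all, and its size, would be useful detail to attach to a bug report along with the backtrace.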
[23:17] <mrjack> tnt are you on wip-4895-cuttlefish?
[23:17] * Tamil (~tamil@38.122.20.226) Quit (Quit: Leaving.)
[23:20] <tnt> mrjack: I have those patches yes. (but I have some other custom patches as well so it's not exactly that branch).
[23:22] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 481 seconds)
[23:22] <mrjack> tnt: i haven't had much time for ceph the last few days; i tried to get a linux kernel patch that fixes a problem when using bonding mode 6 with a bridge backported to 3.4 and 3.2 and into the next stable kernel release.. was a struggle.. ;)
[23:23] <tnt> mrjack: which one is mode 6 ?
[23:23] <gippa> phantomcircuit any hint please? I can't understand why it's dying
[23:23] <mrjack> balance-alb
[23:24] <lurbs> mrjack: ARP stupidity?
[23:24] <mrjack> yep
[23:24] <lurbs> Yeah, I've seen that one. Made running VMs over it a bit useless.
[23:24] <mrjack> yes
[23:24] <mrjack> there is a patch for 3.4.46 now
[23:25] <lurbs> Good to hear. Would be great if it made it into a 12.04 LTS backport.
[23:25] <mrjack> it was rejected in february because of two linebreaks... :(
[23:26] <mrjack> and no one cared....
[23:26] <mrjack> lurbs: i don't know ubuntu or if they will backport it
[23:27] <mrjack> lurbs: but it's deb-based, so you could send a mail to the kernel package maintainer ...
[23:27] <tnt> mrjack: if it's backported to 3.2, then ubuntu will get it.
[23:27] <mrjack> lurbs: [ Upstream commit 567b871e503316b0927e54a3d7c86d50b722d955 ]
[23:27] <mrjack> tnt: yes it is also backported to 3.2
[23:28] <tnt> ubuntu kernel team tracks the longterm 3.2 tree.
[23:28] * madkiss (~madkiss@ds80-237-216-40.dedicated.hosteurope.de) has joined #ceph
[23:28] <lurbs> mrjack: Ta, I'll take a look.
[23:29] * rturk-away is now known as rturk
[23:30] * nigwil (~idontknow@174.143.209.84) Quit (Quit: leaving)
[23:30] * nigwil (~idontknow@174.143.209.84) has joined #ceph
[23:32] <mrjack> ceph performs a lot better with mode 6 compared to mode 5 (which i had to use before)
[23:32] * rturk is now known as rturk-away
[23:32] * vata (~vata@2607:fad8:4:6:98cb:c558:8f0:e7fb) Quit (Quit: Leaving.)
[23:32] * rturk-away is now known as rturk
[23:34] * drokita1 (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[23:38] <tnt> personally I use the lacp mode ...
[23:39] <tnt> my switch itself only does ip xor balancing anyway.
[23:39] * diegows (~diegows@200.68.116.185) Quit (Ping timeout: 480 seconds)
[23:39] <mrjack> tnt: i would use lacp if my switches supported it
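For reference, the modes being compared would look roughly like this with ifenslave-style stanzas in /etc/network/interfaces on Debian/Ubuntu (interface names and addressing are illustrative):

    auto bond0
    iface bond0 inet static
        address 10.0.0.10
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-miimon 100
        bond-mode balance-alb            # mode 6: no switch config needed, but known ARP quirks with bridges/VMs
        # bond-mode 802.3ad              # LACP: needs the switch ports configured as a LAG
        # bond-xmit-hash-policy layer3+4 # hash policy only matters for 802.3ad / balance-xor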
[23:40] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[23:44] * gippa (~gippa@corsico.wiran.net) Quit (Quit: gippa)
[23:59] * Dark-Ace-Z (~BillyMays@50.107.54.92) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.