#ceph IRC Log


IRC Log for 2013-05-23

Timestamps are in GMT/BST.

[0:04] <athrift> Hrmm, I am having issues removing an unhappy mon and re-adding it. I have removed it with "ceph mon remove a" after stopping the service, but when re-adding it by extracting the monmap/keyring and using "ceph-mon -i a --mkfs --monmap monmap.out --keyring monauth.out" I then do a "ceph mon add a" which then complains "mon a already exists" which is odd as ceph -s after the mon remove command definitely
[0:04] <athrift> showed it was removed
[0:04] <athrift> can anyone offer some advice ?
[0:08] <tnt> what does monmaptool --print monmap.out show ?
[0:11] <athrift> monmaptool: monmap file monmap.out
[0:11] <athrift> epoch 2
[0:11] <athrift> fsid 350bdef4-263e-48bf-839d-14434accad17
[0:11] <athrift> last_changed 2013-05-23 09:43:19.528552
[0:11] <athrift> created 2013-02-12 11:19:00.896657
[0:11] <athrift> 0: mon.b
[0:11] <athrift> 1: mon.c
[0:12] <tnt> oh, but re-reading it, it seems normal. Starting the mon will actually add it automatically.
[0:13] <athrift> tnt: hrmm, I have done that but it still shows it as down, even though I can see it running in ps
[0:13] <athrift> health HEALTH_WARN 1 mons down, quorum 1,2 b,c
[0:13] <athrift> monmap e5: 3 mons at {a=,b=,c=}, election epoch 54, quorum 1,2 b,c
[0:13] <tnt> mmm, what does the log of mon.a say ?
[0:14] <athrift> 2013-05-23 10:14:19.488696 7fbd5bf6e700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption
[0:14] <athrift> 2013-05-23 10:14:19.488701 7fbd62ec1700 0 -- >> pipe(0x1566280 sd=22 :39363 s=1 pgs=0 cs=0 l=0).failed verifying authorize reply
[0:14] <athrift> so looks like possibly keyring ?
[0:14] <tnt> athrift: are all mons at the same version ?
[0:18] <athrift> tnt: yes
[0:18] <athrift> 0.61.2
[0:19] <tnt> what distribution ? and were they upgraded from a previous version ?
[0:20] <tnt> given the "created 2013-02-12 11:19:00.896657" in the monmap I would say that they've been upgraded.
[0:25] <athrift> Ubuntu 12.04LTS and yes
[0:25] <tnt> run "apt-cache policy ceph ceph-common" on all the mons and pastebin the result
[0:30] <tchmnkyz> ok i know how to enable it
[0:30] <tchmnkyz> just need to figure out what level to enable
[0:30] <tchmnkyz> debug ms = 1
[0:30] <tchmnkyz> debug rbd = 20
[0:30] <tchmnkyz> log file = /tmp/rbd.$pid.log
[0:30] <tchmnkyz> something like that work?
[0:31] <tchmnkyz> it created a boat load of files
[0:31] <athrift> tnt: http://pastebin.com/axJDAWui
[0:32] <tnt> athrift: see those "Installed: 0.56.4-1precise" :)
[0:32] <tchmnkyz> http://pastebin.com/G4Tjka58
[0:32] <tchmnkyz> that is what i got from using those debug settings on one of the nodes
[0:32] <athrift> tnt: yes
[0:33] <athrift> oh wow
[0:33] <athrift> I wonder whats going on there
[0:33] <tnt> athrift: well, means two of the mon are actually still at 0.56.4 ... do apt-get install ceph on those two to force the update.
[0:34] <tnt> the new 'ceph' package had new dependencies and so "apt-get upgrade" didn't update them.
[0:34] <tchmnkyz> like now it has already created over 500 files in 10 minutes of turning it on
[0:34] <tnt> But it did update ceph-common ... and so you end up with mixed version.
[0:34] <tnt> that's like the 4th time I've seen people here with that exact situation
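The mixed-version state tnt describes can be spotted by comparing the "Installed:" lines that "apt-cache policy" prints for each package. A minimal sketch, assuming the usual output layout (an unindented "package:" line followed by an indented "Installed:" line). The sample text is illustrative and is not athrift's pastebin; only 0.56.4-1precise appears in the log, and the 0.61.2-1precise string and both function names are assumptions.

```python
import re

def installed_versions(policy_output):
    """Map package name -> installed version from `apt-cache policy` text."""
    versions = {}
    package = None
    for line in policy_output.splitlines():
        # An unindented "name:" line starts a new package section.
        header = re.match(r"^(\S+):\s*$", line)
        if header:
            package = header.group(1)
            continue
        installed = re.match(r"^\s+Installed:\s*(\S+)", line)
        if installed and package:
            versions[package] = installed.group(1)
    return versions

def is_mixed(versions):
    """True when the packages are not all at the same installed version."""
    return len(set(versions.values())) > 1

# Illustrative sample only (hypothetical output, see the note above).
SAMPLE = """\
ceph:
  Installed: 0.56.4-1precise
  Candidate: 0.61.2-1precise
ceph-common:
  Installed: 0.61.2-1precise
  Candidate: 0.61.2-1precise
"""

if is_mixed(installed_versions(SAMPLE)):
    print("mixed versions:", installed_versions(SAMPLE))
```

Running the same check on every mon host would flag the nodes where "apt-get upgrade" held back the ceph package.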
[0:35] <tchmnkyz> tnt: would my proxmox nodes having an older version than my ceph cluster cause the problems like this?
[0:35] <athrift> on the upside, everything kept running pretty nicely :)
[0:35] <tnt> tchmnkyz: not unless it's really old.
[0:35] <tchmnkyz> the proxmox nodes are on like 56.3 and ceph side is on 61.2
[0:36] <tnt> shouldn't matter then.
[0:36] <tchmnkyz> k
[0:37] <tchmnkyz> just figured i would ask
[0:37] <tchmnkyz> then i really dont know where to go with this
[0:37] <tchmnkyz> something is causing this to happen
[0:37] <tchmnkyz> and i dont know what
[0:38] <tchmnkyz> got any ideas?
[0:40] <tchmnkyz> ok guys i need to head out for the night i guess we can bash my head into the wall with this tomorrow.
[0:43] <athrift> tnt: Thank you for your assistance, the problem is all solved now :)
[0:51] <athrift> Im loving ceph -w in 0.61.2 showing read and write bandwidth as well as op/s
[4:33] <tchmnkyz> dmick: or tnt: got some time to try and help me figure this crap out?
[4:34] <dmick> maybe?...
[4:34] <tchmnkyz> i am trying to track down this corruption issue
[4:34] <tchmnkyz> i just lost an entire virtual drive
[4:35] <tchmnkyz> /dev/sdb just lost everything
[4:36] <dmick> oh
[4:36] <dmick> that's not pleasant
[4:37] <tchmnkyz> yea
[4:37] <tchmnkyz> it just keeps happening over and over
[4:38] <dmick> so I guess I need to prompt you for information? What does "lost a drive" and "lost everything" mean?
[4:41] <tchmnkyz> the vmdisk looks empty
[4:41] <tchmnkyz> no partition table
[4:42] <dmick> did it become filled with zeros? Is it still the right size?
[4:43] <tchmnkyz> size stayed the same
[4:44] <tchmnkyz> not filled with 0's but random data
[4:45] <dmick> what are the overall sizes of the cluster involved? (hosts, OSDs, mons)
[4:46] <tchmnkyz> i have 5 OSDs
[4:46] <dmick> so 8 hosts? or 4? or?
[4:46] <tchmnkyz> yes 8 physical devices
[4:47] <tchmnkyz> the osd are Dual Xeon 5620 with 48GB ram
[4:47] <tchmnkyz> 3ware 9750 Raid controller
[4:47] <tchmnkyz> [client]
[4:47] <tchmnkyz> debug ms = 1
[4:47] <tchmnkyz> debug rbd = 20
[4:47] <tchmnkyz> log file = /tmp/rbd.$pid.log
[4:47] <dmick> "physical device" == "machine". ok
[4:47] <tchmnkyz> sorry wrong paste
[4:47] <dmick> how many VMs/images?
[4:47] <tchmnkyz> 55tb raid 50 array
[4:48] <tchmnkyz> right now i think we have like 50 vm with maybe 75 - 100 images
[4:48] <dmick> they're running on different hosts?
[4:48] <dmick> separate from the cluster, I mean?
[4:48] <tchmnkyz> yes
[4:48] <tchmnkyz> i have 2 Dell M1000E Chassis
[4:48] <tchmnkyz> with 16 Dual Xeon Nodes
[4:49] <dmick> not important.
[4:49] <tchmnkyz> 32 total servers available for VM's
[4:49] <tchmnkyz> k
[4:49] <dmick> ok, so the one that lost sdb
[4:49] <tchmnkyz> the backend is Infiniband 10gb ipoib
[4:49] <dmick> find his image name, and rbd info about it
[4:49] <dmick> the sdb image, I mean, if he had more than one
[4:50] <tchmnkyz> rbd image 'vm-5000-disk-2': size 10240 GB in 2621440 objects order 22 (4096 KB objects) block_name_prefix: rb.0.a5b4d1.238e1f29 format: 1
[4:51] <dmick> ok. rados [-p pool] ls | grep rb.0.a5b4d1.238e1f29 | wc -l
[4:52] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[4:52] <tchmnkyz> i think what is happening is something is causing kvm to mess up the disk
[4:52] <tchmnkyz> that is running now btw
[4:52] <dmick> what leads you to think that?
[4:53] <tchmnkyz> because not all of my vms have the problem
[4:53] <tchmnkyz> on some
[4:53] <tchmnkyz> root@stor01:~ # rados -p moto ls | grep rb.0.a5b4d1.238e1f29 | wc -l
[4:53] <tchmnkyz> 11353
[4:53] <dmick> ok, so that's a pretty sparse image
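The "pretty sparse" remark follows from the numbers already in the log: the image is provisioned as 2621440 objects of 4096 KB (order 22), but only 11353 objects exist in the pool. A rough sketch of that arithmetic; the function name is hypothetical, and it treats every existing object as full-sized, which gives an upper bound on the allocated space.

```python
def image_usage(order, total_objects, existing_objects):
    """Return (provisioned GiB, allocated GiB upper bound, fraction of objects present)."""
    object_size = 1 << order  # order 22 -> 4 MiB objects
    provisioned_gib = total_objects * object_size / 2**30
    allocated_gib = existing_objects * object_size / 2**30
    return provisioned_gib, allocated_gib, existing_objects / total_objects

provisioned, allocated, fraction = image_usage(22, 2621440, 11353)
print(f"{provisioned:.0f} GiB provisioned, at most {allocated:.2f} GiB allocated "
      f"({fraction:.2%} of objects exist)")
```

So well under one percent of the 10 TB virtual drive has ever been written.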
[4:54] <tchmnkyz> yea it was one that i messed up on and made mbr not gpt
[4:54] <tchmnkyz> and rbd did not allow me to shrink it
[4:55] <dmick> ? rbd didn't allow it?
[4:55] <tchmnkyz> i could not figure out how to shrink an image once i had resized it to bigger
[4:55] <dmick> well from the rbd perspective, rbd resize, but
[4:56] <dmick> of course you have to get the filesystem to agree and shrink the partition first
[4:56] <dmick> with parted or some such
[4:56] <dmick> but anyway:
[4:56] <tchmnkyz> the partition is 1tb the vdrive is 10tb
[4:56] <dmick> presumably rados -p moto stat rb.0.a5b4d1.238e1f29.0 shows a 4M block?
[4:57] <tchmnkyz> error stat-ing moto/rb.0.a5b4d1.238e1f29.0: No such file or directory
[4:58] <dmick> oh, yeah, it's probably got more zeros there
[4:58] <dmick> hm, how many
[4:58] <tchmnkyz> i would not know
[4:58] <dmick> sigh, yes, thinking out loud
[4:59] <tchmnkyz> and the thing that makes me think it is not ceph/rbd is that if i take a snapshot of the vdisk i can restore that snap and it is fine
[4:59] <tchmnkyz> dont know if that helps any
[4:59] <dmick> that doesn't mean much; the snap is not written to
[5:00] <tchmnkyz> o ok
[5:00] <tchmnkyz> just figured i would give as much info as i can
[5:00] <dmick> so if ceph is somehow muffing the write, it would be to the newer obs
[5:00] <tchmnkyz> good point
[5:01] <tchmnkyz> my first thoughts were a IO lag because i had seen other users with proxmox have that problem
[5:01] <tchmnkyz> they seen similar behavior with some write delays from seagate drives
[5:01] <tchmnkyz> so that is what pointed me to maybe an IO issue
[5:02] <dmick> how does io lag cause disk corruption?
[5:02] <dmick> rados -p moto stat rb.0.a5b4d1.238e1f29.000000000000
[5:02] <tchmnkyz> i am not sure but they said that changing from seagate to wd disks fixed the issue
[5:03] <dmick> sounds like voodoo
[5:03] <tchmnkyz> moto/rb.0.a5b4d1.238e1f29.000000000000 mtime 1369278137, size 4194304
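The earlier stat at [4:57] failed only because the suffix was too short. Judging by the object names that appear in this log, format 1 RBD data objects are named with the block_name_prefix, a dot, and a 12-digit zero-padded hexadecimal object index. A minimal sketch of that mapping; the function name is hypothetical, and the 12-digit width is inferred from the names above, so treat it as an assumption for other image formats.

```python
def rbd_v1_object_name(block_name_prefix, byte_offset, order=22):
    """Name of the RADOS object holding byte_offset of a format 1 image."""
    index = byte_offset >> order  # which 2**order-byte object the offset falls in
    return "%s.%012x" % (block_name_prefix, index)

# First object of the image discussed above (holds the partition table).
print(rbd_v1_object_name("rb.0.a5b4d1.238e1f29", 0))
```

The resulting name can be passed to "rados -p moto stat", as dmick does at [5:02].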
[5:04] <dmick> ok, that mtime is Wed May 22 20:02:17 2013
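The mtime printed by "rados stat" is seconds since the Unix epoch. A small sketch of the conversion dmick did by hand; the UTC-7 offset is an assumption, chosen because it reproduces the wall-clock time he quotes, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def describe_mtime(epoch_seconds, utc_offset_hours=0):
    """Render an epoch timestamp in a fixed UTC offset, in `date`-like style."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%a %b %d %H:%M:%S %Y")

print(describe_mtime(1369278137, -7))  # the offset that matches dmick's rendering
print(describe_mtime(1369278137))      # the same instant in UTC
```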
[5:05] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:05] <dmick> so something is writing to the first 4M of the disk as we speak?
[5:05] <tchmnkyz> i had to start restoring
[5:05] <dmick> !!
[5:06] <tchmnkyz> i have to it is a customers VM i have to restore it now i cant just sit on it
[5:06] <dmick> ok, well, I can't do much to help then, sorry
[5:07] <tchmnkyz> i have another vm that is having a similar issue
[5:07] <tchmnkyz> we can look at that one
[5:12] <dmick> tchmnkyz: I can't really help you with customer-related issues unless you have an agreement in place; we're small, and it's just too risky for us.
[5:13] <tchmnkyz> ok so help me with an internal vm that is part of my internal stuff not customer
[5:15] <khanhndq> hi everybody
[5:15] <khanhndq> now i faced one issue in ceph block device
[5:15] <khanhndq> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
[5:15] <khanhndq> health HEALTH_OK
[5:15] <khanhndq> monmap e1: 2 mons at {a=,b=}, election epoch 20, quorum 0,1 a,b
[5:15] <khanhndq> osdmap e53: 2 osds: 2 up, 2 in
[5:15] <khanhndq> pgmap v535: 576 pgs: 576 active+clean; 11086 MB data, 22350 MB used, 4437 GB / 4459 GB avail
[5:15] <khanhndq> mdsmap e29: 1/1/1 up {0=a=up:active}, 1 up:standby
[5:15] <khanhndq> When I benchmark with one OSD, I get an IO speed of about 190MB/s
[5:15] <khanhndq> But when I add more OSDs, with replica size = 2, the write performance degrades to about 90MB/s
[5:15] <khanhndq> According to common practice, write performance should increase as OSDs are added, but I am not seeing that. :(
[5:16] <tchmnkyz> dmick: how much would it cost for support on our cluster this size
[5:16] <khanhndq> anyone can you help me check what's wrong in config file or anything else?
[5:17] <dmick> tchmnkyz: I really don't know; you'd need to contact the people who deal in that space
[5:17] <tchmnkyz> ok
[5:17] <tchmnkyz> inktank support?
[5:17] <dmick> http://www.inktank.com/support-services/
[5:18] <tchmnkyz> thnx
[5:19] <khanhndq> anyone can you help me check what's wrong in config file or anything else?
[5:21] <tchmnkyz> dmick: this is the errors i see on the vms when one of the drives go defunct
[5:21] <tchmnkyz> http://imgur.com/MmmXbmq
[5:22] <dmick> yeah, I saw that earlier. My strategy would be to try to see evidence of that error somewhere else
[5:22] <tchmnkyz> k
[5:22] <tchmnkyz> i sent a email to get a quote from support
[5:23] <khanhndq> <tchmnkyz> did you do benchmark in ceph system ?
[5:23] <tchmnkyz> was not really sure how to
[5:25] <khanhndq> first, create a ceph storage cluster, then create a new pool and an image; after that, map that image and use it as a file system
[5:25] <tchmnkyz> khanhndq: my cluster is 272tb of available space
[5:25] <khanhndq> and do io performance on it
[5:25] <tchmnkyz> it was fine and working perfect till a few weeks ago
[5:25] <tchmnkyz> this started like 2 weeks ago
[5:27] <khanhndq> did you do a benchmark like as read/write io per ceph cluster ?
[5:27] <tchmnkyz> yea i was getting great speeds before i started spinning up vms like this
[5:28] <khanhndq> do you share me your benchmark ?
[5:28] <tchmnkyz> that was like 8 months ago
[5:28] <tchmnkyz> i dont know where they are now
[5:29] <khanhndq> do you remember how much is the write speed/ volumes ?
[5:29] <tchmnkyz> i was getting average of Sata3 speeds on the test VMs
[5:30] <tchmnkyz> i had ran them under Debian, Windows 2003 & 2008
[5:32] <tchmnkyz> dmick: sorry if i was going about this the wrong way. I just have the owner of the company riding me because i am the one that setup this cluster and i am having problems now
[5:33] <tchmnkyz> he is freaking out on me so i am trying to fix this
[5:37] <tchmnkyz> like now i dont get this. I just did a data restore to that VM. and the same drive is gone again.
[5:40] <tchmnkyz> dmick: can you tell me this much with a VM should i be using some kind of caching for the virtual disk or none?
[5:40] <tchmnkyz> all of my vms now have no caching enabled
[5:41] <dmick> tchmnkyz: I understand pressure to support customers; that's why you have support contracts for escalations.
[5:41] <dmick> on caching:
[5:41] <dmick> if you enable rbd caching, you have to be sure that qemu knows about it
[5:41] <dmick> or you'll get data corruption
[5:41] <tchmnkyz> i am talking more at the qemu side
[5:42] <dmick> http://ceph.com/docs/master/rbd/qemu-rbd/#running-qemu-with-rbd
[5:42] <tchmnkyz> thnx
[7:11] <MrNPP> I'm getting a ton of "currently waiting for pg to exist locally"
[7:11] <MrNPP> after a system with 6 osd's on it crashed
[7:11] <MrNPP> pretty much all of ceph is unusable at this point, and i'm not sure why
[7:12] <MrNPP> i tried turning on debugging and nothing, and the system logs don't show anything either
[7:12] <MrNPP> HEALTH_WARN 1515 pgs degraded; 8 pgs down; 8 pgs peering; 2416 pgs stale; 8 pgs stuck inactive; 2416 pgs stuck stale; 1599 pgs stuck unclean; recovery 35805/129330 degraded (27.685%)
[7:16] <MrNPP> i removed the down host from the crushmap and its backfilling, if one host of 16 goes down with say 25 other osd's should the entire cluster be affected this bad?
[7:18] <phantomcircuit> MrNPP, depends on if the crush map is replication per osd or per host
[7:18] * esammy (~esamuels@host-2-102-69-49.as13285.net) has joined #ceph
[7:27] <MrNPP> per osd
[7:27] <MrNPP> i tried to change it to per host, but it crashed
[7:27] <MrNPP> the monitor when i tried to inject it
[7:54] * jtang1 (~jtang@ has joined #ceph
[7:56] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[8:00] * tnt (~tnt@ has joined #ceph
[8:02] * jtang1 (~jtang@ Quit (Ping timeout: 480 seconds)
[8:04] <MrNPP> going to try it again
[8:09] <MrNPP> yeah i tried to insert a change into the crushmap using ceph osd setcrushmap -i crush.map
[8:09] <MrNPP> and now ceph isn't responding
[11:20] <blue> hi. I'm having some problems with radosgw, where creating and listing buckets seem to work, but everything else returns 403 (InvalidAccessKeyId)
[11:21] <blue> i basically just followed http://ceph.com/docs/master/man/8/radosgw/ and a bit from http://wiki.debian.org/OpenStackCephHowto
[11:21] <blue> any ideas what could be wrong?
[12:23] <acalvo> Hello
[12:25] <acalvo> Trying to follow the install instructions on CentOS 6.4, but it fails the mkcephfs and the key generation and mentions using ceph-deploy. However, ceph-deploy seems to be Ubuntu based so it fails to run on CentOS. Any tutorial up to date with that?
[13:30] <andrei> hello guys
[13:30] <andrei> could someone please suggest a way to deal with changes in drive letters in /dev/ ?
[13:31] <andrei> i've unplugged an unused HD and after a reboot the disk letters have changed in /dev/
[13:32] <andrei> as a result i get errors on every osd when i try to start it
[13:32] <andrei> like these: unable to authenticate as osd.0
[14:32] <tnt> andrei: don't use raw device names ...
[14:33] <tnt> use udev names /dev/disk/by-*/* to refer to your partitions and then label them properly to get persistent names
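The udev-managed directories tnt mentions (such as /dev/disk/by-label or /dev/disk/by-uuid) contain symlinks whose names stay stable while the sdX node they point to may change between boots. A minimal sketch that lists where each persistent name currently resolves; the function name is hypothetical, and which by-* directories exist depends on the system.

```python
import os

def persistent_names(by_dir):
    """Map each symlink name in by_dir to the device node it currently resolves to."""
    mapping = {}
    for name in sorted(os.listdir(by_dir)):
        path = os.path.join(by_dir, name)
        if os.path.islink(path):
            # realpath follows the link, e.g. to /dev/sdb1 on this boot
            mapping[name] = os.path.realpath(path)
    return mapping
```

Calling persistent_names("/dev/disk/by-label") before and after a reboot shows which labels moved to a different drive letter; mount entries that refer to the label rather than to /dev/sdX keep working either way.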
[14:39] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[14:42] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has left #ceph
[14:49] <andrei> tnt: thanks
[14:49] <andrei> i will!
[14:53] <andrei> tnt: I seem to be getting issues with writes when I reboot one of the servers
[14:53] <andrei> i've tried to restart it with init 6
[14:54] <andrei> the clients are mounted with ceph-fuse
[14:55] <andrei> the dd write command freezes a few seconds after I do init 6 on one of the servers
[14:55] <andrei> there are 3 mons
[14:55] <andrei> and 2 osd servers
[14:55] <andrei> one mds
[14:56] <tnt> I don't use cephfs ... no idea
[14:57] <andrei> does any one here know why the write doesn't automatically resume and use the second osd server ?
[14:57] <andrei> i can list the fs contents
[14:58] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[15:32] <hufman> hey, is sage around? i would like to know if i should include both node's logs in bug #5031
[15:39] <nhm> hufman: it's around 6:30am in california right now
[15:39] <hufman> lol
[16:07] <absynth> joao or nhm around?
[16:08] <absynth> preferably both for separate questions
[16:08] <joao> absynth, sup?
[16:08] <absynth> joao: do you have a very high-level diagram of ceph infrastructure components?
[16:08] <absynth> in the docs or something?
[16:08] <joao> how high?
[16:09] <joao> just components as in, the mons, osds, etc and how they all relate to each other?
[16:09] <absynth> yeah
[16:09] <absynth> exactly
[16:09] <joao> there's some of those diagrams in some of the presentations on http://ceph.com/presentations
[16:10] <joao> not sure if the docs also have something like that
[16:10] <joao> I thought they did, but last time I checked didn't find any, so I might have been wrong all along
[16:11] <absynth> also, is sage awake yet?
[16:11] <joao> I bet he is, just haven't seen him online
[16:14] <absynth> ok, i'll just reopen our old deepscrub memleak ticket then
[16:15] <absynth> and hope it gets assigned correctly
[16:15] <tnt> you still see it on cuttlefish ?
[16:16] <absynth> no, we still see it on latest bobtail
[16:16] <absynth> updating to cuttlefish is currently not an option - did you see the mailinglist postings lately? ;)
[16:17] <tnt> absynth: which ones ?
[16:25] <Vanony> Hi. I'm lost on this one: I created a small test cluster with 3 machines, two osds per machine (using ceph-deploy). Then extended to a 4th. Apparently I messed up at some point, because now I have 9 osds showing up in the status and crushmap, one is down and out - and never was created/started on the 4th machine. How can I get rid of that osd? I pastied some info in http://pastie.org/7948053
[16:28] <jeff-YF> has anyone here had any success with lio or tgt with ceph?
[16:29] <Vanony> I also tried editing out the "device 6 device6" line in the decompiled crushmap, recompiled and put it back. to no avail
[16:30] * Wolff_John (~jwolff@ftp.monarch-beverage.com) has joined #ceph
[16:33] <nhm> absynth: yo. :)
[16:34] <tnt> mmm, I'm having some weird behavior where an osd suddenly throws a bunch of "pipe(0x1e0b6280 sd=87 :60923 s=2 pgs=8598 cs=1673 l=0).fault, initiating reconnect" ... (and it's like thousands of them).
[16:34] * portante (~user@ has joined #ceph
[16:35] <dspano> absynth: Not to give you false confidence, but I upgraded to cuttlefish yesterday with no problems so far. I upgraded the mons, then restarted them all at once, and they came up rather quick. The conversion only took a few seconds.
[16:36] <dspano> Then I upgraded my osds and finally my mds servers.
[16:37] <hufman> my problem is that i assumed the package upgrade to bobtail would restart the programs, so one of my nodes never finished upgrading from argonaut to bobtail
[16:45] <joao> tnt, can you infer to whom it is trying to connect?
[16:45] <joao> can't recall if that debug message has that information
[16:45] <absynth> dspano: are you using rbd?
[16:47] <tnt> joao: yes, I stripped the start but it connects to all the other OSDs.
[16:48] <joao> can you check if you can actually connect to those other osds servers from that server?
[16:48] <tnt> joao: it seems to start with dozens of "2013-05-23 13:43:31.166272 7f589c9dc700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f589a1d7700' had timed out after 15
[16:48] <joao> hmm
[16:48] <joao> okay
[16:49] <joao> don't think I can help you with that then
[16:49] <tnt> joao: well, a few minutes later it was restored without me doing anything so I guess it could connect.
[16:50] <tnt> and trying like 1000 connections inside 1 second seems like a bad idea ... (that pipe fault / initiating reconnect is present for several seconds and there is ~ 1000 msg/sec)
[16:51] <joao> yeah, that should probably be backed off or something
[16:52] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:05] <joao> tnt, is the mon and that osd running on the same server?
[17:06] <dspano> absynth: I will also say, I am what you would call a small deployment, so I don't have a lot going on. I run two Dell R515s as OSD/Mon/MDS servers, and one Dell R210 as an MDS/Mon.
[17:06] <tnt> they're running as Xen VMs on the same physical server yes.
[17:06] <tnt> (they have distinct physical drives though)
[17:06] <dspano> Lol. My production is probably the size of most people's POC or test environments.
[17:07] <tchmnkyz> tnt/dmick: it seems that enabling the writeback caching seems to have helped prevent the corruption i was seeing
[17:07] * portante (~user@ Quit (Ping timeout: 480 seconds)
[17:07] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[17:07] <tnt> tchmnkyz: or hiding it ...
[17:08] <absynth> dspano: ok, we run hundreds of VMs, so probably should back off a little while more
[17:09] <tchmnkyz> i have reached out to inktank to purchase support
[17:09] <tchmnkyz> figured that should help too
[17:09] <absynth> that helps, yeah
[17:09] <absynth> one of the wiser business decisions i made the last year
[17:11] <tnt> I'm wondering what the pricing is for it.
[17:11] <tchmnkyz> i hope not too horrible
[17:12] <tchmnkyz> also convinced my boss to spring to go through the ceph training as well
[17:15] <absynth> tnt: it's affordable even for a small company like us, and inktank has really gone to lengths to solve our problems
[17:15] <absynth> everyone with a production setup should seriously consider a contract
[17:15] <absynth> (joao, did we talk about commission yet?)
[17:16] <janos> lol
[17:17] <absynth> it is my conviction, though. seriously.
[17:17] <absynth> if you guys knew what inktank went through with us...
[17:18] <topro> hi, what do you think might be the bottleneck of a cephfs cluster when running fio benchmark with 4k random-read and write only gives about 40iops but on the ceph nodes neither cpu-load nor disk-usage seem to mark the limit?
[17:18] * scuttlemonkey_ is now known as scuttlemonkey
[17:19] <jshen> hi everyone happy morning! i wonder if anyone knows how to change the cephfs mount size. i could not find it in the docs. thanks!
[17:20] <tnt> topro: network latency ?
[17:20] <topro> tnt: is there a way to tell?
[17:20] <tchmnkyz> provided that inktank is decent enough on the support contract i will def ink the deal on one
[17:20] <topro> s/tell/test/
[17:23] <tchmnkyz> and depending on the cost of the classes the boss man says i can do them too
[17:23] <samware> take me w/you, tchmnkyz
[17:24] <tchmnkyz> with the size of my cluster now i have to do something
[17:24] <tchmnkyz> we are at a 8 node (3 mon 5osd) and about to expand it out to 20 osd's
[17:25] <tchmnkyz> ceph
[17:25] <tchmnkyz> even
[17:25] <tchmnkyz> been awake way too long
[17:27] <bo> 20 OSDs = 1.6PBs in your projected deployment?
[17:28] <bo> raiding drives per OSD or something?
[17:29] <dspano> absynth: In your situation, that's probably a wise decision.
[17:33] <absynth> 20 TB per OSD, how is that going to work?
[17:35] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Quit: Leaving.)
[17:35] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Quit: Konversation terminated!)
[17:41] <absynth> you mean osd-to-osd? <1ms
[17:41] <topro> well I get ~0.1ms osd-to-osd and about ~0.2ms client-to-osd
[17:43] <topro> still, running a 4k random read write benchmark on cephfs I only get about 40iops sum (20 read, 20 write) but neither disks nor cpu of any of the osds would max out. where might the bottleneck be hiding?
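One way to read topro's numbers: with a fixed number of requests in flight, throughput is bounded by per-operation latency (IOPS = requests in flight / latency). A back-of-the-envelope sketch, assuming a queue depth of 1, which the log does not state; the function name is hypothetical.

```python
def per_op_latency_ms(iops, queue_depth=1):
    """Average time each operation takes, given the observed IOPS and requests in flight."""
    return queue_depth * 1000.0 / iops

latency = per_op_latency_ms(40)   # 40 IOPS total, as reported by topro
network_share = 0.2 / latency     # ~0.2 ms client-to-osd, as reported by topro
print(f"{latency:.1f} ms per op, network round trip is {network_share:.1%} of that")
```

Under that assumption each 4k operation takes about 25 ms, so the measured network latency explains under one percent of it and the wait must be elsewhere in the request path.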
[17:43] <imjustmatthew> topro: that's definitely workable latency, I'm using regional clients with cephfs at client-to-osd latencies of 5-8ms
[17:44] <imjustmatthew> topro: are you running the kernel client or fuse?
[17:44] <topro> kernel client (debian experimental supplied linux-image 3.8.5)
[17:46] * topro desperately waiting for a 3.9 linux-image to get tunables going
[17:47] <imjustmatthew> topro: hmm, what kind of speed are you getting from the rados bench?
[17:52] <topro> imjustmatthew: can't find rados bench cmdline, have a short hint, please?
[17:53] <imjustmatthew> topro: ceph osd tell N bench [BYTES_PER_WRITE] [TOTAL_BYTES]
[17:53] <imjustmatthew> http://ceph.com/docs/master/rados/operations/control/?highlight=bench
[17:54] <pja> health HEALTH_WARN clock skew detected on mon.b, mon.c
[17:54] <pja> but ive ntp running
[17:54] <pja> hwclock and date says identical
[17:54] <pja> where does the skew come from?
[17:55] <tchmnkyz> pja run a date at the same time on both nodes and make sure they match node to node
[17:57] <topro> imjustmatthew: osd.0 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.819383 sec at 66284 KB/sec
[17:57] <topro> its about the same for all 9 osds
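The figure in the bench line can be cross-checked from its own fields: total data divided by elapsed time. A small sketch that parses the line as it appears above; the regular expression is written against this one sample and is an assumption about the general log format, and the function name is hypothetical.

```python
import re

BENCH_RE = re.compile(
    r"wrote (\d+) MB in blocks of (\d+) KB in ([\d.]+) sec at (\d+) KB/sec"
)

def parse_bench(line):
    """Extract the fields of an OSD bench result and recompute the throughput."""
    match = BENCH_RE.search(line)
    if match is None:
        raise ValueError("not a bench result line")
    total_mb, block_kb = int(match.group(1)), int(match.group(2))
    seconds, reported = float(match.group(3)), int(match.group(4))
    return {
        "block_kb": block_kb,
        "reported_kb_per_sec": reported,
        "computed_kb_per_sec": total_mb * 1024 / seconds,
    }

result = parse_bench(
    "osd.0 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.819383 sec at 66284 KB/sec"
)
print(result)
```

That is roughly 65 MB/s of large sequential writes per OSD, which is why imjustmatthew considers the OSD side healthy.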
[17:59] <tchmnkyz> bo: yes it is 20 x Raid 50 (55TB arrays)
[17:59] <imjustmatthew> topro: yeah, so the OSD part seems okay. That's all I can point you towards, I would ask gregaf or sagewk; they're both devs on US-Pacific time
[17:59] <tnt> joao: I'm fairly sure the issue I had a couple hours ago has nothing to do with the mon changes, I think it's an issue I've seen before when you get IO on a RBD disk hosted on the same Xen physical server as the OSD serving it ... sometimes it screws things up somehow.
[18:00] <topro> imjustmatthew: thanks so far
[18:00] <imjustmatthew> np, good luck
[18:07] <pja> @tchmnkyz: ive done this
[18:07] <cephalobot`> pja: Error: "tchmnkyz:" is not a valid command.
[18:07] <pja> tchmnkyz: ive done this
[18:08] * BillK (~BillK@124-169-236-155.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:09] <joao> tnt, let us know if that happens again
[18:18] <pja> but im right,  that this skew failure is because of different times on the nodes?
[18:19] <pja> root@hcmonko1:~# date && ssh hcmonko3 date && ssh hcmonko2 date
[18:19] <pja> Thu May 23 18:19:38 CEST 2013
[18:19] <pja> Thu May 23 18:19:38 CEST 2013
[18:19] <pja> Thu May 23 18:19:38 CEST 2013
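Three identical "date" lines only show that the clocks agree to within about a second, because "date" prints whole seconds by default. The monitors warn on much smaller offsets. A sketch of a sub-second comparison; the 0.05 s threshold is an assumption here (it is believed to be the default of the "mon clock drift allowed" option, so check the cluster's own setting), and the fractional timestamps are invented for illustration.

```python
def skewed_hosts(timestamps, reference, allowed=0.05):
    """Hosts whose clock differs from the reference host by more than `allowed` seconds."""
    ref = timestamps[reference]
    return sorted(
        host for host, t in timestamps.items() if abs(t - ref) > allowed
    )

# Hypothetical readings that all print as 18:19:38 when truncated to seconds.
sample = {
    "hcmonko1": 1369325978.10,
    "hcmonko2": 1369325978.45,
    "hcmonko3": 1369325978.12,
}
print(skewed_hosts(sample, "hcmonko1"))
```

Real readings with fractional seconds can be taken with "date +%s.%N" on each node.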
[18:21] <sage> joao, tnt: how did that branch do?
[18:22] <joao> sage, haven't seen any complaints so far, but haven't heard any definite good news either
[18:22] <joao> tnt?
[18:22] <sage> mikedawson: ping!
[18:22] <sage> i'm inclined to merge into next and test further.. and backport to cuttlefish only when we get confirmation
[18:25] <tchmnkyz> pja restart the mons
[18:25] <joao> sage, imo it wouldn't hurt
[18:25] <sage> joao: 2 minor nits on that branch.. then let's merge
[18:25] <joao> kay
[18:25] <joao> just finishing updating a #5069
[18:25] <joao> s/a//
[18:26] <tchmnkyz> joao: have you seen any major reasons not to use infiniband for a 10gbps backend to ceph??
[18:26] <joao> I have absolutely no idea how to answer that
[18:27] <tchmnkyz> ok
[18:27] <tchmnkyz> basically i am running ceph and the network layer is ipoib 10gbps links
[18:27] <tchmnkyz> back to a IB grid director
[18:28] * Tamil (~tamil@ has joined #ceph
[18:29] <tchmnkyz> ok i am kinda laughing my bawls off right now. I was just informed that this guy is a expert dba that does not know how to use ssh/cmdline
[18:29] <tchmnkyz> he needs a GUI to install oracle db
[18:32] <joao> sagewk, whenever you have the chance: http://tracker.ceph.com/issues/5069#note-4
[18:37] <mikedawson> sage: yessir
[18:38] <pja> ive restarted them
[18:38] <tchmnkyz> fix it?
[18:38] <pja> nope
[18:39] <tchmnkyz> that worked for me when mine was skewed
[18:39] <pja> restarted them more than one, restarted also all the servers
[18:39] <pja> one=once
[18:39] <pja> can we check, why ceph thinks its skew?
[18:40] <tchmnkyz> that is not something i would be able to help with
[18:41] <tchmnkyz> not sure where to start
[18:41] <tchmnkyz> mine was actually off on my cluster
[18:41] <pja> hmm
[18:41] <pja> i'm new to ceph
[18:42] <pja> when i want to add a node or only an osd.. is it right, that i change the config, copy it via ssh to each node(mon,osd,..) and then make ceph -a start to reload the new config, so the new servers are activated?
[18:43] <tchmnkyz> i think so
[18:43] <tchmnkyz> have not really added new nodes yet
[18:43] <tchmnkyz> will be doing that soon
[18:43] * BillK (~BillK@124-169-236-155.dyn.iinet.net.au) has joined #ceph
[18:43] <pja> another question: when i have to restart 1 server, i think the ceph mechanism will start replicating the data to the other hosts
[18:44] <pja> is it possible to mark a server for maintenance, so it doesnt start replicating the data, only because 1 osd is restarting
[18:53] <sage> excellent
[18:56] <sage> joao: oh.. we should have a complete mon log for that
[18:56] <sage> from the teuthology failure
[18:57] * lyncos (~chatzilla@ has joined #ceph
[18:58] <lyncos> Hi I need help it seems I cannot write to my ceph cluster anymore .. I'm using version 0.61.1 doing 'rados bench -p data 300 write' create slow requests.. is there any way to find out which OSD is not answering correctly ?
[18:58] <lyncos> I can delete and create pool just fine
[18:59] <lyncos> it seems I cannot put anything on the cluster anymore and I just did change the network interfaces to a bond ... and yes the network is working
[19:00] <lyncos> Latency is 19.0213 .. maybe it can help
[19:00] <lyncos> seems like a timeout .. it's always the same value for latency
[19:03] * eegiks (~quassel@pro75-5-88-162-203-35.fbx.proxad.net) Quit (Remote host closed the connection)
[19:05] <lyncos> hmmm I see
[19:06] <joao> sage, where are the log for #5069 that you mentioned on the bug?
[19:06] <sage> should be in the teuth dir
[19:07] <joao> not the one initially reported
[19:07] <joao> those are gone
[19:07] <joao> run must have been nuked
[19:07] <sage> bah, ok.. needs more info then
[19:07] <sage> and lets downgrade to high, since we haven't seen this since.
[19:07] <joao> kay
[19:12] <lyncos> no one can help ?
[19:14] <jshen> could anyone help with my cephfs size configuration (bobtail) question? the default is too small. thanks in advance!
[19:15] <sage> lyncos: rados --debug-ms 1 .... will give you some clue as to what part is slow
[19:16] <lyncos> sage ok thanks let me try this
[19:18] <lyncos> nothing seems to be wrong
[19:19] <lyncos> sage can you check please if something wrong... to me it seems fine: http://pastebin.com/MuSC47Hx
[19:19] <sage> repeat with -t 1 pls?
[19:20] <lyncos> ok
[19:21] <lyncos> http://pastebin.com/mzaiUYKt
[19:22] <lyncos> you see... cur MB/s is almost always 0
[19:22] <lyncos> sometimes it works at around 100 MB/s but just for 1 iteration
[19:23] <lyncos> here is ceph -w
[19:23] <lyncos> http://pastebin.com/6UsAwV6c
[19:24] * redeemed (~bo@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Quit: bia)
[19:26] <hufman> sage: for that bug 5031, which log(s) do you want? one node started up, then crashed when the second node started up
[19:26] <hufman> first node's log is nice and small, second node's log compressed to 14mb
[19:28] <sage> actually, one of yan's patches he sent out this morning addresses this issue
[19:29] * Wolff_John (~jwolff@ftp.monarch-beverage.com) has joined #ceph
[19:29] <sage> but maybe attach the log to the bug for posterity's sake? :)
[19:29] <sage> both for good measure
[19:29] <hufman> so i can attach such logs? is txt.xz a good format?
[19:29] <lyncos> sage . you see something wrong ?
[19:29] <sage> xz is fine
[19:29] <sage> as is 14mb
[19:30] <hufman> ok :)
[19:33] * leseb (~Adium@ Quit (Quit: Leaving.)
[19:34] <lyncos> sage .. what is wrong? the recovery seems to work .. if it was a problem with an OSD I guess this would fail ?
[19:34] <sage> sorry, in a meeting..
[19:35] <lyncos> anyone else can help ?
[19:35] <sjust> lyncos: what is the output of ceph -s
[19:36] <sjust> also, ceph osd tree
[19:36] <lyncos> ok 1 sec
[19:37] <lyncos> http://pastebin.com/LAF841HL
[19:37] <lyncos> it's degraded now .. but I have the same issue when HEALTH_OK
[19:38] <lyncos> it doesn't seem to recover
[19:38] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[19:38] <sjust> you have 10 pgs on 5 osds
[19:38] <lyncos> I must have less than 5 ?
[19:38] <sjust> you should have around 100 pgs/osd
[19:38] <lyncos> ahhh let me re-create my pool
[19:38] <sjust> but that's not actually causing the problem
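A minimal sketch of the sizing rule sjust mentions (around 100 PGs per OSD). The function name, the division by the replica count, and the rounding up to a power of two are assumptions added for illustration; they follow commonly cited Ceph guidance, not anything stated in this conversation.

```python
def suggested_pg_count(num_osds, replicas, pgs_per_osd=100):
    """Rough pool-wide PG count for a target of ~pgs_per_osd per OSD.

    Hypothetical helper: each PG is stored on `replicas` OSDs, so the
    pool total is divided by the replica count, then rounded up to the
    next power of two (an assumed convention).
    """
    raw = num_osds * pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


# lyncos's cluster has 5 OSDs; the replica count of 2 is an assumption.
print(suggested_pg_count(5, 2))
```

With these assumed inputs the raw target is 250, which rounds up to 256, far more than the 10 PGs the pool had.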
[19:39] <sjust> ceph version?
[19:39] <Kioob> (Hi)
[19:40] <lyncos> ceph version 0.61.1
[19:40] <lyncos> I did ceph osd pool create data 100 100
[19:40] <lyncos> now it's scrubbing
[19:44] <lyncos> after re-creating my pool i get this http://pastebin.com/rQipLKQ6
[19:44] <lyncos> it seems I have mon down
[20:01] * LeaChim (~LeaChim@ has joined #ceph
[20:02] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[20:08] <lyncos> Argh I lost my connection.. did I miss something ?
[20:09] <hufman> nope :)
[20:09] <lyncos> I still have my problem of slow request
[20:09] <lyncos> I cannot write anything to my cluster
[20:10] <lyncos> osd_op(client.6336.1:211 rb.0.18c8.74b0dc51.00000001e842 [write 3584~29184] 10.84428b0d RETRY=-1 e304) currently waiting for ondisk
[20:10] <lyncos> what does "currently waiting for ondisk" mean ?
[20:10] <gucki> hi there
[20:11] <lyncos> means the data is on the journal ? or the journal is not working ?
[20:11] <lyncos> 21 slow requests, 1 included below; oldest blocked for > 581.232859 secs
[20:11] <gucki> i'm running the latest bobtail release 0.56.6 in production and would like to upgrade to cuttlefish. i just wonder if there are any known issues or bugs fixed which have not yet been released?
[20:11] <lyncos> how to find what is blocking
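One way to start answering "what is blocking" is to pull the numbers and the reported state out of the log lines lyncos pasted. The sketch below only parses the two line shapes quoted above; the function names and regular expressions are hypothetical and are not part of any Ceph tool.

```python
import re

# Patterns written against the two log lines quoted in the channel.
SLOW_RE = re.compile(
    r"(\d+) slow requests, (\d+) included below; "
    r"oldest blocked for > ([\d.]+) secs"
)
STATE_RE = re.compile(r"\bcurrently (.+)$")


def parse_slow_summary(line):
    """Return counts and the oldest blocked time, or None if no match."""
    m = SLOW_RE.search(line)
    if m is None:
        return None
    return {
        "count": int(m.group(1)),
        "included": int(m.group(2)),
        "oldest_secs": float(m.group(3)),
    }


def parse_op_state(line):
    """Return the text after 'currently' in an osd_op line, or None."""
    m = STATE_RE.search(line)
    return m.group(1) if m else None


summary = parse_slow_summary(
    "21 slow requests, 1 included below; oldest blocked for > 581.232859 secs"
)
state = parse_op_state(
    "osd_op(client.6336.1:211 rb.0.18c8.74b0dc51.00000001e842 "
    "[write 3584~29184] 10.84428b0d RETRY=-1 e304) currently waiting for ondisk"
)
print(summary, state)
```

Grouping ops by their reported state over time is one possible way to see whether requests are all stuck at the same stage.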
[20:12] * SvenPHX1 (~scarter@wsip-174-79-34-244.ph.ph.cox.net) has left #ceph
[20:14] * kyle__ (~kyle@ has joined #ceph
[20:15] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[20:16] * bergerx_ (~bekir@ Quit (Quit: Leaving.)
[20:20] * kyle_ (~kyle@ Quit (Ping timeout: 480 seconds)
[21:25] <wonko_be> ceph-mon seems to need a fsid when using with --mkfs
[21:26] <wonko_be> neither the manpage nor the --help mentions this, and the online documentation yields no results when searching for ceph-mon
[21:27] <wonko_be> (i got 0.61.2-1precise installed)
[21:53] <wonko_be> no worries, found it
[22:13] <redeemed> wonko_be: what did you discover?
[22:14] <Kioob> gucki : I lost one MON during the migration (on 5), but all 49 OSD migrate fine.
[22:15] <Kioob> You have to upgrade MON one by one, then OSD one by one (or host per host, depending on your CRUSH replication scheme)
[22:18] <wonko_be> seems the whole ceph-deploy thing is now the new way to go... and all the low-level commands have been ripped out of the docs
[22:21] <redeemed> wonko_be: ceph-deploy is unreliable in my environment :/ been a burden
[22:22] <wonko_be> yes
[22:22] <wonko_be> the lack of low-level documentation is a bit of a pain when trying to update and modify some things
[22:23] <wonko_be> i have to go and read the python wrapper to get an idea of the parameters and the normal way to do things
[22:24] <redeemed> wonko_be: perhaps the old docs may be archived on their site or cached on google.
[22:25] <wonko_be> repeating my nick isn't really helpful either
[22:25] <wonko_be> but i might be talking to a bot now
[22:25] <wonko_be> i'll just leave it for today
[22:28] <janos> targeting nicks is pretty common practice so that communications don't get confused
[22:28] <janos> usually helpful when multiple conversations are going on
[22:30] <mrjack> it would be cool if the monitor would gracefully exit the quorum and notify the other mons that it will be down before it exits...
[22:30] <mrjack> if i stop the monitor via initscripts
[22:30] <mrjack> or do restart
[22:35] * drokita (~drokita@ Quit (Ping timeout: 480 seconds)
[22:41] <tnt> gucki: there is a mon issue in cuttlefish where the store would grow in size very fast, but that has been narrowed down yesterday and mikedawson and myself are currently testing a fix and it looks promising.
[22:43] <tnt> and when upgrading, it's actually best to restart a majority of mons at the same time ... because new mons can't talk to old mons during bobtail -> cuttlefish upgrade so you'll lose quorum anyway.
[22:44] <tnt> also if you're on debian, make sure to do _both_ a "apt-get upgrade" and a "apt-get install ceph ceph-mds" or you'll have a mixed version situation where things can get weird.
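The quorum reasoning behind tnt's upgrade advice can be sketched as simple arithmetic: monitors need a strict majority to form a quorum. The helper names below are hypothetical, and the mon counts in the usage lines are taken from examples in this log (Kioob's 5 mons, a 3-mon cluster).

```python
def quorum_size(num_mons):
    """Smallest number of monitors that is a strict majority."""
    return num_mons // 2 + 1


def has_quorum(num_mons, mons_up):
    """True if the monitors that can talk to each other form a majority."""
    return mons_up >= quorum_size(num_mons)


# Kioob: 5 mons, one lost during migration, quorum still holds.
print(has_quorum(5, 4))
# 3 mons upgraded one at a time: the single new mon cannot talk to the
# two old ones, and once a second mon is upgraded neither side of the
# split is reachable as a majority from the old side.
print(has_quorum(3, 1))
```

Since old and new mons form two separate groups during this upgrade, only the group holding a majority can serve, which is why restarting a majority together shortens the gap.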
[22:47] * eternaleye (~eternaley@2607:f878:fe00:802a::1) Quit (Ping timeout: 480 seconds)
[22:57] <mrjack> tnt: cool good news?! ;)
[22:57] <mikedawson> gucki, tnt: yep. Cuttlefish is a bit suspect right now, but the patch looks promising. If you can wait for 0.61.3, I would.
[22:59] * tnt agrees
[22:59] <tnt> because other than those mon issues, i've been seeing less OSD memory usage and better RBD perf on small IO so it's pretty good.
[23:07] <gippa> hello! I'm having a problem installing ceph cuttlefish in both 2 nodes with debian7 and 2 nodes with ubuntu 12.04
[23:08] <gippa> mds is dying in both nodes
[23:08] <gippa> in both clusters
[23:08] <gippa> any help please?
[23:09] <gippa> thanks
[23:12] <gippa> ply(2 mds0_sessionmap [read 0~0] ack = -2 (No such file or directory)) v4 ==== 114+0+0 (3848257979 0 0) 0x26f7e00 con 0x270c840
[23:12] <gippa> 0> 2013-05-23 23:12:02.206228 7f3e94e94700 -1 *** Caught signal (Aborted) **
[23:12] <gippa> in thread 7f3e94e94700
[23:12] <gippa> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
[23:12] <gippa> 1: /usr/bin/ceph-mds() [0x874632]
[23:12] <gippa> 2: (()+0xf030) [0x7f3e99f85030]
[23:12] <gippa> 3: (gsignal()+0x35) [0x7f3e98882475]
[23:12] <gippa> 4: (abort()+0x180) [0x7f3e988856f0]
[23:12] <gippa> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f3e990d789d]
[23:12] <gippa> 6: (()+0x63996) [0x7f3e990d5996]
[23:12] <gippa> 7: (()+0x639c3) [0x7f3e990d59c3]
[23:12] <gippa> 8: (()+0x63bee) [0x7f3e990d5bee]
[23:12] <gippa> 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x7dc3a7]
[23:12] <gippa> 10: (SessionMap::decode(ceph::buffer::list::iterator&)+0x3a) [0x6ff27a]
[23:12] <gippa> 11: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x6e) [0x70026e]
[23:12] <gippa> 12: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1b) [0x7275bb]
[23:12] <gippa> 13: (MDS::handle_core_message(Message*)+0xae7) [0x513c57]
[23:12] <gippa> 14: (MDS::_dispatch(Message*)+0x33) [0x513d53]
[23:12] <gippa> 15: (MDS::ms_dispatch(Message*)+0xab) [0x515b3b]
[23:12] <gippa> 16: (DispatchQueue::entry()+0x393) [0x847ca3]
[23:12] <gippa> 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x7caeed]
[23:12] <gippa> 18: (()+0x6b50) [0x7f3e99f7cb50]
[23:12] <gippa> 19: (clone()+0x6d) [0x7f3e9892aa7d]
[23:12] <gippa> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
[23:12] <phantomcircuit> gippa, pastebin.com
[23:12] <phantomcircuit> please and thank you
[23:13] <scuttlemonkey> ^
[23:13] <gippa> sorry, being new here
[23:14] <gippa> http://pastebin.com/C81g5jFd
[23:15] <gippa> it's more or less the same on the nodes of the two cluster
[23:15] <gippa> it was originally on two nodes with ubuntu with multiple osd
[23:16] <gippa> tried to reproduce it on a smaller cluster on debian7 with two osd just to test if I was able to reproduce
[23:16] <gippa> got another cluster on bobtail on CentOS working ok
[23:17] <mrjack> tnt are you on wip-4895-cuttlefish?
[23:20] <tnt> mrjack: I have those patches yes. (but I have some other custom patches as well so it's not exactly that branch).
[23:22] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 481 seconds)
[23:22] <mrjack> tnt: i had not much time for ceph the last few days, tried to get a patch for linux kernel that fixes a problem when using bonding mode 6 with bridge backported for 3.4 and 3.2 and put it in next stable kernel release.. was a struggle.. ;)
[23:23] <tnt> mrjack: which one is mode 6 ?
[23:23] <gippa> phantomcircuit any hint please? I can't understand why it is dying
[23:23] <mrjack> balance-alb
[23:24] <lurbs> mrjack: ARP stupidity?
[23:24] <mrjack> yep
[23:24] <lurbs> Yeah, I've seen that one. Made running VMs over it a bit useless.
[23:24] <mrjack> yes
[23:24] <mrjack> there is a patch for 3.4.46 now
[23:25] <lurbs> Good to hear. Would be great if it made it into a 12.04 LTS backport.
[23:25] <mrjack> it was rejected in february because of two linebreaks... :(
[23:26] <mrjack> and no one cared....
[23:26] <mrjack> lurbs: i don't know ubuntu or if they will backport it
[23:27] <mrjack> lurbs: but its deb based so you could send a mail to kernel-package maintainer ...
[23:27] <tnt> mrjack: if it's backported to 3.2, then ubuntu will get it.
[23:27] <mrjack> lurbs: [ Upstream commit 567b871e503316b0927e54a3d7c86d50b722d955 ]
[23:27] <mrjack> tnt: yes it is also backported to 3.2
[23:28] <tnt> ubuntu kernel team tracks the longterm 3.2 tree.
[23:28] <lurbs> mrjack: Ta, I'll take a look.
[23:29] * rturk-away is now known as rturk
[23:32] <mrjack> ceph performs a lot better with mode 6 compared to mode 5 (which i had to do before)
[23:32] * rturk is now known as rturk-away
[23:38] <tnt> personally I use the lacp mode ...
[23:39] <tnt> my switch itself only does ip xor balancing anyway.
[23:39] <mrjack> tnt: i would use lacp if my switches would support it
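A simplified illustration of the "ip xor balancing" tnt refers to: the link is chosen from the XOR of the source and destination IP addresses, modulo the number of links. This is an assumed, stripped-down model; real switches and bonding hash policies may mix in other fields, and the function name and addresses are made up for the example.

```python
import ipaddress


def pick_link(src_ip, dst_ip, num_links):
    """Choose a link index from the XOR of two IPv4 addresses.

    Simplified model only: actual hardware or driver hashing may differ.
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % num_links


# Hypothetical addresses on a 2-link bond.
print(pick_link("10.0.0.1", "10.0.0.2", 2))
print(pick_link("10.0.0.1", "10.0.0.3", 2))
```

Under this model a given pair of hosts always lands on the same link, so a single client-to-OSD flow cannot exceed one link's bandwidth; the aggregate gain comes from many host pairs hashing to different links.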
[23:40] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
