#ceph IRC Log

Index

IRC Log for 2013-07-09

Timestamps are in GMT/BST.

[0:01] * athrift (~nz_monkey@203.86.205.13) has joined #ceph
[0:01] * dpippenger (~riven@tenant.pas.idealab.com) Quit ()
[0:01] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[0:02] <skm> ceph df looks accurate as far as i can see
[0:03] <sagewk> nothing newer. looks ok then?
[0:03] <sagewk> gregaf: ^
[0:03] <nhm> skm: interesting
[0:04] <skm> POOLS:
[0:04] <skm> NAME ID USED %USED OBJECTS
[0:04] <skm> data 0 10557M 4.84 2653
[0:05] * drokita1 (~drokita@199.255.228.128) has joined #ceph
[0:07] * mozg (~andrei@host217-44-214-64.range217-44.btcentralplus.com) has joined #ceph
[0:07] * BillK (~BillK-OFT@124-169-221-120.dyn.iinet.net.au) has joined #ceph
[0:08] * drokita (~drokita@199.255.228.128) Quit (Read error: Operation timed out)
[0:09] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[0:13] * drokita1 (~drokita@199.255.228.128) Quit (Ping timeout: 480 seconds)
[0:16] * portante is now known as portante|afk
[0:21] <gregaf> sagewk: oh, nope, I missed some patches
[0:22] <gregaf> when I was talking about naming the enum it was for type safety, so if we're going to name it we should keep the op that way too and do a conversion for encode/decode…rest of it looks okay
[0:25] * PerlStalker (~PerlStalk@72.166.192.70) Quit (Quit: ...)
[0:26] * madkiss (~madkiss@217.194.70.18) has joined #ceph
[0:30] <sagewk> k
[0:34] * madkiss (~madkiss@217.194.70.18) Quit (Ping timeout: 480 seconds)
[0:44] * haomaiwang (~haomaiwan@211.155.113.223) has joined #ceph
[0:49] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[0:50] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[0:52] * haomaiwang (~haomaiwan@211.155.113.223) Quit (Ping timeout: 480 seconds)
[0:55] * mschiff (~mschiff@85.182.236.82) Quit (Remote host closed the connection)
[1:08] * b1tbkt_ (~b1tbkt@24-217-196-119.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[1:11] * b1tbkt (~b1tbkt@24-217-196-119.dhcp.stls.mo.charter.com) has joined #ceph
[1:16] * dosaboy (~dosaboy@host86-164-81-178.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[1:16] * b1tbkt (~b1tbkt@24-217-196-119.dhcp.stls.mo.charter.com) Quit (Read error: Connection reset by peer)
[1:18] * dosaboy (~dosaboy@host86-150-242-76.range86-150.btcentralplus.com) has joined #ceph
[1:19] * b1tbkt (~b1tbkt@24-217-196-119.dhcp.stls.mo.charter.com) has joined #ceph
[1:20] * madkiss (~madkiss@217.194.70.18) has joined #ceph
[1:25] * LeaChim (~LeaChim@90.217.166.163) Quit (Ping timeout: 480 seconds)
[1:27] * mozg (~andrei@host217-44-214-64.range217-44.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[1:28] * madkiss (~madkiss@217.194.70.18) Quit (Ping timeout: 480 seconds)
[1:29] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:45] * haomaiwang (~haomaiwan@211.155.113.223) has joined #ceph
[1:53] * haomaiwang (~haomaiwan@211.155.113.223) Quit (Ping timeout: 480 seconds)
[1:57] * nhm (~nhm@184-97-193-106.mpls.qwest.net) Quit (Quit: Lost terminal)
[2:07] * tnt (~tnt@109.130.77.55) Quit (Ping timeout: 480 seconds)
[2:09] * b1tbkt (~b1tbkt@24-217-196-119.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[2:10] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[2:14] * jtang1 (~jtang@blk-222-209-164.eastlink.ca) has joined #ceph
[2:16] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has left #ceph
[2:20] * nwat (~nwatkins@eduroam-226-128.ucsc.edu) has left #ceph
[2:21] * mtanski (~mtanski@69.193.178.202) Quit (Read error: Operation timed out)
[2:37] * portante|afk is now known as portante
[2:39] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[2:45] * sagelap (~sage@2600:1012:b019:c478:fd5e:b438:4682:65c6) has joined #ceph
[2:45] <sagelap> joao: back
[2:45] * haomaiwang (~haomaiwan@117.79.232.209) has joined #ceph
[2:47] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Remote host closed the connection)
[2:53] * haomaiwang (~haomaiwan@117.79.232.209) Quit (Ping timeout: 480 seconds)
[2:53] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[2:57] * yy (~michealyx@218.74.35.50) has joined #ceph
[3:01] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[3:05] * yanzheng (~zhyan@134.134.139.76) has joined #ceph
[3:06] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[3:11] * bwesemann (~bwesemann@2001:1b30:0:6:9c59:3517:55a:64af) Quit (Remote host closed the connection)
[3:12] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[3:12] * bwesemann (~bwesemann@2001:1b30:0:6:990e:2128:78c7:8573) has joined #ceph
[3:15] * b1tbkt (~b1tbkt@24-217-192-155.dhcp.stls.mo.charter.com) has joined #ceph
[3:15] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[3:19] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[3:20] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[3:22] * dosaboy_ (~dosaboy@host86-161-207-41.range86-161.btcentralplus.com) has joined #ceph
[3:22] * jtang1 (~jtang@blk-222-209-164.eastlink.ca) Quit (Quit: Leaving.)
[3:24] * haomaiwang (~haomaiwan@117.79.232.209) has joined #ceph
[3:26] * Brian (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[3:28] * Tamil (~tamil@38.122.20.226) Quit (Quit: Leaving.)
[3:29] * dosaboy (~dosaboy@host86-150-242-76.range86-150.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[3:35] * sagelap (~sage@2600:1012:b019:c478:fd5e:b438:4682:65c6) Quit (Ping timeout: 480 seconds)
[3:39] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[3:40] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[3:40] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[3:44] * n3c8-35575 (~mhattersl@84.19.35.10) Quit (Read error: Connection reset by peer)
[3:45] * markbby (~Adium@168.94.245.1) has joined #ceph
[3:45] * n3c8-35575 (~mhattersl@pix.office.vaioni.com) has joined #ceph
[3:45] * markbby (~Adium@168.94.245.1) Quit ()
[3:46] * markbby (~Adium@168.94.245.1) has joined #ceph
[3:48] * Midnightmyth_ (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[3:50] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Quit: Leaving.)
[3:55] * eternaleye (~eternaley@2002:3284:29cb::1) Quit (Ping timeout: 480 seconds)
[3:56] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[3:56] * AfC (~andrew@jim1020952.lnk.telstra.net) has joined #ceph
[3:58] * diegows (~diegows@190.190.2.126) has joined #ceph
[4:00] * eternaleye (~eternaley@2002:3284:29cb::1) has joined #ceph
[4:03] * AfC (~andrew@jim1020952.lnk.telstra.net) Quit (Quit: Leaving.)
[4:09] * AfC (~andrew@jim1020952.lnk.telstra.net) has joined #ceph
[4:16] * markbby (~Adium@168.94.245.1) Quit (Remote host closed the connection)
[4:17] * julian (~julianwa@125.69.104.140) has joined #ceph
[4:18] * AfC (~andrew@jim1020952.lnk.telstra.net) Quit (Quit: Leaving.)
[4:28] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) has joined #ceph
[4:41] * diegows (~diegows@190.190.2.126) Quit (Ping timeout: 480 seconds)
[5:00] * fireD1 (~fireD@93-139-162-89.adsl.net.t-com.hr) has joined #ceph
[5:05] * joshd1 (~jdurgin@2602:306:c5db:310:e110:2102:3a19:638c) Quit (Quit: Leaving.)
[5:06] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Quit: Leaving...)
[5:07] * fireD (~fireD@93-142-237-252.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:07] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[5:19] * johnugeorge (~chatzilla@99-31-208-175.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:20] <johnugeorge> hi
[5:25] <johnugeorge> I need some advice for my scenario. I am trying to integrate openstack with ceph. I need to have some shared folders for all vms , running above my KVM. so i created a new block device using rbd. Is it possible to mount this block device and share this mountpoint across all vms using virtio
[5:26] <janos> block devices - generally speaking - should not be shared
[5:27] <janos> think of a block device like a physical harddrive. and you're trying to run sata cables from multiple computers to it
[5:27] <janos> yes, there are conditions where it can be done, but based on your question - ignore that ;)
[5:28] <johnugeorge> :). But, how can I implement this scenario otherwise?
[5:28] <janos> depedning on your use-case, you may want to have a VM with samba
[5:29] <janos> and mount a share from all
[5:29] <janos> to handle locking issues
[5:30] <janos> or, if you want to stick with a ceph-based approach, you could do a rados-gw that they can all target
[5:30] <janos> byt it all depends on what your apps can/should do
[5:30] <janos> s/byt/but
[5:31] <johnugeorge> I was planning to stick with ceph itself. I didn't get , why you told that the idea of block device is bad in this scenario ? :)
[5:31] <janos> block devices by themselves do not manage locking
[5:31] <janos> multiple systems using one block device will likely corrupt it rather quickly
[5:32] * portante is now known as portante|afk
[5:32] <johnugeorge> oh.. ok.. thanks :). How can I use rados-gw for this?.. How can applications inside vm's access the ceph?
[5:33] <janos> well, rados-gw will present an s3 or swift compatible API
[5:33] <janos> so and apps inside VM's would need to be able to speak that
[5:34] <janos> if you want straight up mounted-drive type of file-level access, then samba may be the way for you
[5:34] <johnugeorge> ok. Which would be faster and will there be any other side effects?
[5:35] <janos> i think there is also a way to access ceph directly via samba, but i have not investigated
[5:35] <janos> well, it's a SPoF
[5:35] <janos> that's a drawback with that approach
[5:36] <janos> i would let others chime in to confirm/deny what i've said though
[5:38] <johnugeorge> ok. why is it SPOF?. it has to be taken care in RADOS right?
[5:38] * AfC (~andrew@jim1020952.lnk.telstra.net) has joined #ceph
[5:40] * portante|afk is now known as portante
[5:42] <johnugeorge> If openstack images are mapped onto RADOS with replication , how can it be SPOF?.
[5:43] * sagelap (~sage@76.89.177.113) has joined #ceph
[5:48] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[5:51] * AfC (~andrew@jim1020952.lnk.telstra.net) Quit (Quit: Leaving.)
[6:00] <sage> teh samba gateway would be a spof if there is only 1 daemon running and you do'nt have it set up in some HA way
[6:05] <johnugeorge> If one of the VM's in openstack cluster is configured to be the Smaba Server and if the VM images are handled using RADOS with replication factor > 1, how will it be a SPOF?.
[6:08] <johnugeorge> I was planning to have this architecture with multiple VM'S on openstack . http://ceph.com/docs/master/rbd/rbd-openstack/
[6:10] * portante is now known as portante|afk
[6:13] <sage> seems ok. it means cifs will be unavailable while the vm is rebooted elsewhere...
[6:18] <johnugeorge> oh. ok. Is there any other way where I can get file level access?(Hierarchy structure ). The other suggestion was RADOSGW, but I can't use it because of this.
[6:21] <johnugeorge> I wish to stick to ceph itself. As, I am not sure of the performance issues and also, as you suggested , smb shares can get unavailable .Is there any better way for this?
[6:21] <sage> dmick: ping
[6:44] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[6:45] * sleinen1 (~Adium@2001:620:0:25:e56a:7ef6:38a7:2535) has joined #ceph
[6:46] * sleinen1 (~Adium@2001:620:0:25:e56a:7ef6:38a7:2535) Quit ()
[6:46] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Read error: Connection reset by peer)
[6:47] <johnugeorge> sage, I was trying hadoop on openstack lying on top of RADOS. Since , ceph itself takes care of replication and other major stuff, HDFS is not needed. If we can map files and directories on to Object store, it must be fine.
[6:50] * AfC (~andrew@2001:44b8:31cb:d400:501c:4d92:724:bd1a) has joined #ceph
[6:55] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[7:04] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: Leaving.)
[7:04] * madkiss (~madkiss@217.194.70.18) has joined #ceph
[7:06] <dmick> sage: pong
[7:06] <sage> take a quick peek at wip-ceph-cli-dup?
[7:08] <dmick> ergh. it'll work, but, why not a wrapper...
[7:08] <sage> just makes it easy to enable via tuethology for an arbitrary test
[7:09] <sage> a wrapper would be trickier to install?
[7:09] <sage> maybe not
[7:10] <dmick> this is definitely easier on the test writer, but...bleh. I can swallow my offense I suppose.
[7:12] <johnugeorge> Am I missing something? :)
[7:13] <sage> np, i can make teuthology push a wrapper to replace it at /usr/bin/ceph instead
[7:13] <dmick> or just make the test case have a function that does it?...
[7:15] <sage> easier, but less coverage that way
[7:17] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[7:26] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Ping timeout: 480 seconds)
[7:27] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[7:28] <mtanski> is Yan ever on irc?
[7:30] <dmick> yes
[7:35] * madkiss (~madkiss@217.194.70.18) Quit (Ping timeout: 480 seconds)
[7:36] <mtanski> do you know what time zone he is?
[7:39] * zapotah (~zapotah@dsl-hkibrasgw2-50de22-61.dhcp.inet.fi) Quit (Read error: Operation timed out)
[7:39] <dmick> US Pacifict
[7:39] <dmick> s/ct/c/
[7:39] <dmick> assuming we're talking about the same Yan
[7:42] * zapotah (~zapotah@dsl-hkibrasgw2-50defd-137.dhcp.inet.fi) has joined #ceph
[7:48] * dosaboy (~dosaboy@host86-164-137-144.range86-164.btcentralplus.com) has joined #ceph
[7:50] <mtanski> Zheng
[7:55] * dosaboy_ (~dosaboy@host86-161-207-41.range86-161.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[7:58] * johnugeorge (~chatzilla@99-31-208-175.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[7:59] * tnt (~tnt@109.130.77.55) has joined #ceph
[8:10] <dmick> ah. no, sorry
[8:10] <dmick> I don't nkow about him
[8:11] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[8:14] * zapotah (~zapotah@dsl-hkibrasgw2-50defd-137.dhcp.inet.fi) Quit (Remote host closed the connection)
[8:15] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[8:15] * ChanServ sets mode +v andreask
[8:16] * zhangjf_zz2 (~zjfhappy@222.128.1.105) has joined #ceph
[8:20] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[8:21] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[8:28] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[8:33] * Vjarjadian (~IceChat77@90.214.208.5) Quit (Quit: Easy as 3.14159265358979323846... )
[8:36] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) Quit (Quit: mtanski)
[8:42] * madkiss (~madkiss@pD9E039FE.dip0.t-ipconnect.de) has joined #ceph
[8:43] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Read error: Operation timed out)
[8:46] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[8:51] * sleinen (~Adium@2001:620:0:26:4943:8960:2ba8:dd29) has joined #ceph
[8:53] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[8:55] * yanzheng (~zhyan@134.134.139.76) Quit (Remote host closed the connection)
[8:57] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[9:02] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:07] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:16] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[9:19] * johnugeorge (~chatzilla@99-31-208-175.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[9:21] * mschiff (~mschiff@pD9510356.dip0.t-ipconnect.de) has joined #ceph
[9:23] * johnugeorge (~chatzilla@99-31-208-175.lightspeed.sntcca.sbcglobal.net) Quit ()
[9:26] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:29] * tnt (~tnt@109.130.77.55) Quit (Ping timeout: 480 seconds)
[9:34] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:36] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[9:39] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:44] * leseb (~Adium@83.167.43.235) has joined #ceph
[9:47] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[9:53] * madkiss1 (~madkiss@pD9E025D2.dip0.t-ipconnect.de) has joined #ceph
[9:59] * jjgalvez1 (~jjgalvez@ip72-193-215-88.lv.lv.cox.net) Quit (Quit: Leaving.)
[9:59] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[9:59] * madkiss (~madkiss@pD9E039FE.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[9:59] * markit (~marco@151.78.74.112) has joined #ceph
[10:01] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[10:08] * vipr (~vipr@78-21-226-240.access.telenet.be) has joined #ceph
[10:10] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) has joined #ceph
[10:12] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[10:16] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[10:19] * sleinen (~Adium@2001:620:0:26:4943:8960:2ba8:dd29) Quit (Quit: Leaving.)
[10:19] * sleinen (~Adium@130.59.94.187) has joined #ceph
[10:20] * LeaChim (~LeaChim@90.217.166.163) has joined #ceph
[10:21] <markit> I'm really confused by ceph "quick start" documentation , I'm on this page: http://ceph.com/docs/master/start/quick-ceph-deploy/
[10:21] <markit> and trying to setup a 3 node (2 dedicated storage dist + 1 OS disk) cluster
[10:21] <markit> I'm in the fist node, and I use it as "admin node", correct?
[10:22] <markit> I've created "my-cluster" dir and issuing commands ceph-deploy from there
[10:23] <markit> I guess I've to run "ceph-install" only for ceph02 and ceph03, since in the current node I've installed ceph with apt-get, right?
[10:23] <markit> and I've already done a $ ceph-deploy new ceph0{1,2,3}
[10:24] * jtang1 (~jtang@blk-222-209-164.eastlink.ca) has joined #ceph
[10:25] * jks (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[10:25] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[10:25] * kyle__ (~kyle@216.183.64.10) has joined #ceph
[10:26] * zynzel_ (zynzel@spof.pl) has joined #ceph
[10:26] * darkfaded (~floh@88.79.251.60) has joined #ceph
[10:27] * sleinen (~Adium@130.59.94.187) Quit (Ping timeout: 480 seconds)
[10:27] * leseb1 (~Adium@83.167.43.235) has joined #ceph
[10:29] * fuzz_ (~pi@c-76-30-9-9.hsd1.tx.comcast.net) has joined #ceph
[10:29] * grifferz_ (~andy@specialbrew.392abl.bitfolk.com) has joined #ceph
[10:30] * Rocky_ (~r.nap@188.205.52.204) has joined #ceph
[10:30] * denken_ (~denken@dione.pixelchaos.net) has joined #ceph
[10:30] * jackhill_ (jackhill@pilot.trilug.org) has joined #ceph
[10:30] * tchmnkyz_ (~jeremy@ip23.67-202-99.static.steadfastdns.net) has joined #ceph
[10:30] * brambles_ (lechuck@s0.barwen.ch) has joined #ceph
[10:30] * terje- (~root@135.109.216.239) has joined #ceph
[10:30] * trond_ (~trond@trh.betradar.com) has joined #ceph
[10:30] * leseb (~Adium@83.167.43.235) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * BillK (~BillK-OFT@124-169-221-120.dyn.iinet.net.au) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * kyle_ (~kyle@216.183.64.10) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * terje-_ (~root@135.109.216.239) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * xmltok_ (~xmltok@pool101.bizrate.com) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * fuzz (~pi@c-76-30-9-9.hsd1.tx.comcast.net) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * rongze (~zhu@173-252-252-212.genericreverse.com) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * _Tassadar (~tassadar@tassadar.xs4all.nl) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * jamespage (~jamespage@culvain.gromper.net) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * darkfader (~floh@88.79.251.60) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * jksM (~jks@3e6b5724.rev.stofanet.dk) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * brambles (lechuck@s0.barwen.ch) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * markl (~mark@tpsit.com) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * tchmnkyz (~jeremy@0001638b.user.oftc.net) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * grifferz (~andy@specialbrew.392abl.bitfolk.com) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * trond (~trond@trh.betradar.com) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * jackhill (jackhill@pilot.trilug.org) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * denken (~denken@dione.pixelchaos.net) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * Rocky (~r.nap@188.205.52.204) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * zynzel (zynzel@spof.pl) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * nyerup (irc@jespernyerup.dk) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * Elbandi (~ea333@elbandi.net) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) Quit (resistance.oftc.net oxygen.oftc.net)
[10:30] * markl (~mark@tpsit.com) has joined #ceph
[10:30] * Elbandi (~ea333@elbandi.net) has joined #ceph
[10:30] * nyerup (irc@jespernyerup.dk) has joined #ceph
[10:30] * sleinen (~Adium@2001:620:0:26:7435:8db3:d166:b6db) has joined #ceph
[10:30] * _Tassadar (~tassadar@tassadar.xs4all.nl) has joined #ceph
[10:31] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) Quit (Remote host closed the connection)
[10:31] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) has joined #ceph
[10:31] * jamespage (~jamespage@culvain.gromper.net) has joined #ceph
[10:32] * BillK (~BillK-OFT@124-169-221-120.dyn.iinet.net.au) has joined #ceph
[10:32] * BillK (~BillK-OFT@124-169-221-120.dyn.iinet.net.au) Quit (Read error: Operation timed out)
[10:32] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) has joined #ceph
[10:32] * rongze (~zhu@173-252-252-212.genericreverse.com) has joined #ceph
[10:32] * BillK (~BillK-OFT@124-169-221-120.dyn.iinet.net.au) has joined #ceph
[10:33] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[10:33] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) has joined #ceph
[10:35] * masterpe_ (~masterpe@2a01:670:400::43) Quit (Quit: Changing server)
[10:37] * masterpe (~masterpe@2a01:670:400::43) has joined #ceph
[10:37] <markit> mmm in my admin ceph01 node I've no /var/lib/ceph !
[10:38] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[10:41] <tnt> markit: well if the admin not isn't running any daemon ... you don't need one.
[10:41] <tnt> the admin node is nothing else than a machine with packages installed (and so you have the 'ceph' and 'rados' and ... utilities installed) and a keyring with admin privileges.
[10:42] <markit> tnt: installing ceph-deploy on the admin node, I have ONLY ceph-deploy package
[10:44] <tnt> I never used ceph-deploy so I couldn't tell you ... but that looks suspect.
[10:48] <markit> "When using ceph-deploy, the tool enforces a single Ceph Monitor per node" what does it mean? can't have more than one monitor per node, or can't have a node without a monitor?
[10:48] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[10:49] <tnt> I'd guess the former. The latter just doesn't make sense.
[10:50] <markit> $ ceph-deploy mon create ceph0{1,2,3}
[10:50] <markit> ceph-mon: mon.noname-c 192.168.40.103:6789/0 is local, renaming to mon.ceph03
[10:50] <markit> I'm on ceph01, 192.168.40.101... what does it mean?
[10:52] * julian (~julianwa@125.69.104.140) Quit (Read error: Connection reset by peer)
[10:53] * julian (~julianwa@125.69.104.140) has joined #ceph
[10:57] <markit> ceph-deploy gatherkeys is obscure too
[10:58] <markit> mm probably has to be issued from a new node and specify a monitor one
[10:59] <markit> and "your local directory"... any one?
[10:59] <markit> anyone? I mean, any local directory is good?
[11:04] <markit> this helps a little: http://ceph.com/howto/deploying-ceph-with-ceph-deploy/
[11:17] * Macmonac (~opera@194.199.107.6) has joined #ceph
[11:21] <Macmonac> hello, i read in the doc, that is better to use ceph-deploy in Cuttlefish, but in 0.61.4 the option for dm-crypt doesn't work and i can't use a lvm on sdd for the journal.
[11:21] <Macmonac> have you an idée how a can "prepare" the osds ?
[11:28] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[11:30] <Macmonac> do you think we can "prepare" the OSD without formatting?
[11:38] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[11:39] <markit> Macmonac: newbie here, fighting with ceph-deploy doc... I need a bloc device only, and reading ceph-deploy osd create ../dev/sdb don't understand where specify the filesystem, after having read that XFS is the one to choose
[11:44] <Macmonac> markit: i understand the documentation. the create action make prepare action and activate action
[11:46] <Macmonac> so i have try to format myself cryptsetup partition to xfs and create the logical volume on lvm for the journal but the pepare action seems to make other operation and the osds won't start only with the activate action
[11:48] * zynzel_ is now known as zynzel
[11:49] <markit> Macmonac: I've instead ceph -s taht tells: monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
[11:49] <markit> I've no idea of what I did wrong, in /etc/ceph there is the admin key (in all 3 nodes)
[11:50] <Macmonac> ahhh sorry, it's a question.
[11:51] <markit> :)
[11:51] <markit> confused newbie here :(
[11:51] <Macmonac> have you done the "gatherkeys" operation
[11:51] <Macmonac> me to sorry ;)
[11:51] <markit> I try to setup a test 3 node cluster, just to use with proxmox
[11:52] <markit> yes, but it writes 3 files in current directory. I'm using ceph01 as admin node (I mean, I issue ceph-deploy commands from there)
[11:52] <markit> I did ceph-deploy gatherkeys ceph01
[11:52] <markit> since ceph01 is also a mon node
[11:53] <markit> the key on /etc/ceph has a different date, since I issued ceph-deploy gatherkeys ceph02 at the beginning, then deleted the keys and issued ceph-deploy gatherkeys ceph01
[11:53] <markit> but don't know what command created and delployed the /etc/ceph content in the 3 nodes
[11:54] <markit> also I don't understand ceph.conf... was created by ceph.deploy, but do I have to edit it manually and add stuff there?
[11:55] <markit> seens have no reference to the other nodes config
[11:55] <markit> (has only [global] section)
[11:56] * yy (~michealyx@218.74.35.50) has left #ceph
[11:58] * sleinen (~Adium@2001:620:0:26:7435:8db3:d166:b6db) Quit (Quit: Leaving.)
[11:58] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[12:00] <markit> Macmonac: to answer your question: ceph-deploy osd create -h
[12:00] <markit> (I think) I've seen --dmcrypt use dm-crypt on DISK
[12:01] <markit> and "subcommand" has also the option "prepare"
[12:11] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[12:12] * portante|afk is now known as portante
[12:14] <markit> mmm sudo ceph -s replied differently :(
[12:18] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[12:28] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[12:29] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[12:32] * smiley_ (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[12:39] * yy (~michealyx@218.74.35.50) has joined #ceph
[12:43] * stacker666 (~stacker66@90.163.235.0) has joined #ceph
[12:43] * portante is now known as portante|afk
[12:44] <stacker666> hi all
[12:45] * portante|afk is now known as portante
[12:46] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[12:46] <stacker666> someone has managed to export images using tgt?
[12:47] <stacker666> when i want to create the lun i have this message: tgtd: tgt_device_create(522) device 0 already exists
[12:59] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[13:01] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[13:07] * infinitytrapdoor (~infinityt@134.95.27.132) has joined #ceph
[13:13] * jtang1 (~jtang@blk-222-209-164.eastlink.ca) Quit (Quit: Leaving.)
[13:20] * yy (~michealyx@218.74.35.50) has left #ceph
[13:21] * Rocky_ (~r.nap@188.205.52.204) Quit (Quit: **Poof**)
[13:21] * Rocky (~r.nap@188.205.52.204) has joined #ceph
[13:25] * zhangjf_zz2 (~zjfhappy@222.128.1.105) Quit (Quit: 离开)
[13:26] * jks (~jks@3e6b5724.rev.stofanet.dk) Quit (Read error: Connection reset by peer)
[13:30] * sleinen (~Adium@130.59.94.187) has joined #ceph
[13:31] * sleinen1 (~Adium@2001:620:0:25:d9d9:6f60:417c:687f) has joined #ceph
[13:35] * yy (~michealyx@218.74.35.50) has joined #ceph
[13:35] * yy (~michealyx@218.74.35.50) has left #ceph
[13:38] * sleinen (~Adium@130.59.94.187) Quit (Ping timeout: 480 seconds)
[13:45] * Machske (~Bram@d5152D87C.static.telenet.be) has joined #ceph
[13:47] * portante is now known as portante|afk
[13:56] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:03] * jtang1 (~jtang@142.176.24.2) has joined #ceph
[14:14] * capri (~capri@212.218.127.222) has joined #ceph
[14:22] * jtang2 (~jtang@142.176.24.2) has joined #ceph
[14:22] * jtang1 (~jtang@142.176.24.2) Quit (Read error: Connection reset by peer)
[14:24] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) has joined #ceph
[14:24] * jtang2 (~jtang@142.176.24.2) Quit (Read error: Connection reset by peer)
[14:24] * jtang1 (~jtang@142.176.24.2) has joined #ceph
[14:32] <erwan_taf> anyone familiar with rados bench here ?
[14:32] <erwan_taf> I do think the aio code have some troubles :/
[14:33] <erwan_taf> but would love to speak about it with someone to {con|in}firm my idea
[14:34] <markit> I've just configured a 3 node test ceph cluster with ceph-deploy, I see in /etc/ceph/ceph.conf only the [general] part, no entries in /etc/fstab, but everything seems to work... is it ok? Is that long ceph.conf definition obsolete with 0.61.4?
[14:37] <Gugge-47527> markit: yes its okay, it uses some "magic" udev rules to mount, and the upstart scripts starts whatever is mounted in /var/lib/ceph
[14:37] * julian (~julianwa@125.69.104.140) Quit (Quit: afk)
[14:40] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:41] <markit> great! tons of doc to update ;P
[14:41] * leseb1 (~Adium@83.167.43.235) Quit (Quit: Leaving.)
[14:42] * jtang1 (~jtang@142.176.24.2) Quit (Read error: Connection reset by peer)
[14:42] <markit> Gugge-47527: I've also ceph.mon.keyring that is only on my ceph01, user dir, that I used as admin node... what is it's purpouse and is it ok?
[14:45] <erwan_taf> hum, ceph uses gettimeofday while a get clock monotonic would be surely better
[14:45] * Psi-Jack_ (~Psi-Jack@yggdrasil.hostdruids.com) has joined #ceph
[14:46] * mattch (~mattch@pcw3047.see.ed.ac.uk) has joined #ceph
[14:47] * julian (~julianwa@125.69.104.140) has joined #ceph
[14:48] * diegows (~diegows@190.190.2.126) has joined #ceph
[14:49] * ScOut3R_ (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[14:51] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[14:53] * vipr (~vipr@78-21-226-240.access.telenet.be) Quit (Remote host closed the connection)
[14:53] * ScOut3R__ (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[14:54] <markit> urgh, [WRN] mon.2 192.168.40.103:6789/0 clock skew 0.286895s > max 0.05s
[14:55] <Gugge-47527> setup ntpd on the nodes :)
[14:55] <markit> isn't that timing too restrictive? how can I fix it? ntp is not enough?
[14:55] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[14:55] <Gugge-47527> if ntpd is running (and has set the time), and you have more than 0.05s difference, something is broken :)
[14:55] <markit> Gugge-47527: mmm nothing automatically done by ceph-deploy? I've found no mention in the "fast howto" guide
[14:55] <markit> Gugge-47527: let me check :)
[14:55] <Gugge-47527> no, ceph-deploy does not setup ntpd
[14:56] * ScOut3R_ (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Read error: Operation timed out)
[14:57] <markit> Gugge-47527: I see two pacakges, openntpd and ntpdate
[14:57] <Gugge-47527> what distro?
[14:59] * markbby (~Adium@168.94.245.3) has joined #ceph
[15:00] <markit> Gugge-47527: ubuntu 12.04 server 64
[15:00] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 22.0/20130618035212])
[15:02] <Gugge-47527> the ntp package is not good enough?
[15:04] <markit> oh, sorry, I searched ntpd
[15:05] <niklas> Hi there. I am currently benchmarking my cluster, and I am wondering why a file that has a size of 10GB on my disk has a size of 16GB when put into ceph?
[15:05] <niklas> both measured with du -h
[15:06] <niklas> actually on disk it is onla 9.6G
[15:11] * leseb (~Adium@83.167.43.235) has joined #ceph
[15:12] <niklas> hmm, without doing anything to the file, du -h now gives me 9.6GB in ceph…
[15:12] <niklas> odd
[15:17] * Midnightmyth (~quassel@0x3e2c86fd.mobile.telia.dk) has joined #ceph
[15:19] <joelio> you need to put iburst after the ntp server name, otherwise it'll take ages
[15:19] <joelio> or force with ntpdate
[15:19] * nhm (~nhm@184-97-193-106.mpls.qwest.net) has joined #ceph
[15:19] * ChanServ sets mode +o nhm
[15:21] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Quit: Ex-Chat)
[15:23] <niklas> Does anyone have an Idea why "rados -p data 10gfile 10gfile" won't go full speed? On the local osd it will write with about 80MB/s, it utilizes the network with about 40-50MB/s (110MB/s possible) and will write with about 80MB/s to the remote osd
[15:23] * leseb (~Adium@83.167.43.235) Quit (Ping timeout: 480 seconds)
[15:23] <niklas> each osd got its own hard-drive which can handle 130MB/s
[15:24] <niklas> rados -p data bench 60 write will averages at about 110 MB/s
[15:24] <tnt> where are the journals ?
[15:25] <niklas> on the same disks as the data
[15:25] <tnt> keep in mind that each write will be doubled then. I.e. written to journal, then to final store.
[15:25] <niklas> oh, you mean the difference comes from the disks moving their heads about?
[15:26] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[15:26] <tnt> then you have the replication. IIRC, the write is send to master, then from the master to the slaves and the write will only return once the slave have written it.
[15:27] <niklas> ok, so factor 2 between net and disk comes from writing data twice, that makes sense
[15:27] <niklas> do you think the disks only writing at 80MB/s instead of 130MB/s comes from them having to move the heads from journal to the actual storage?
[15:28] <niklas> because they are not writing sequentially any more
[15:33] * AfC (~andrew@2001:44b8:31cb:d400:501c:4d92:724:bd1a) Quit (Quit: Leaving.)
[15:37] * leseb (~Adium@83.167.43.235) has joined #ceph
[15:38] * PodMan99 (~keith@dr-pepper.1stdomains.co.uk) Quit (Remote host closed the connection)
[15:46] * portante|afk is now known as portante
[15:47] * AfC (~andrew@2001:44b8:31cb:d400:501c:4d92:724:bd1a) has joined #ceph
[15:47] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[15:48] * Qu310 (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Read error: Connection reset by peer)
[15:49] * Qu310 (~Qten@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[15:49] * aliguori (~anthony@32.97.110.51) has joined #ceph
[15:50] * drokita (~drokita@199.255.228.128) has joined #ceph
[15:58] * markbby (~Adium@168.94.245.3) Quit (Quit: Leaving.)
[15:58] * markbby (~Adium@168.94.245.3) has joined #ceph
[15:59] * infinitytrapdoor (~infinityt@134.95.27.132) Quit (Read error: Connection reset by peer)
[16:00] * infinitytrapdoor (~infinityt@134.95.27.132) has joined #ceph
[16:00] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[16:00] * ChanServ sets mode +v andreask
[16:07] * mtanski (~mtanski@cpe-74-65-252-48.nyc.res.rr.com) Quit (Quit: mtanski)
[16:07] * yeled (~yeled@spodder.com) Quit (Quit: reboot)
[16:08] * yeled (~yeled@spodder.com) has joined #ceph
[16:14] * BillK (~BillK-OFT@124-169-221-120.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[16:25] * Midnightmyth_ (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[16:25] * vipr (~vipr@78-21-226-240.access.telenet.be) has joined #ceph
[16:29] * Midnightmyth (~quassel@0x3e2c86fd.mobile.telia.dk) Quit (Read error: Operation timed out)
[16:31] * rudolfsteiner (~federicon@220-122-245-190.fibertel.com.ar) has joined #ceph
[16:34] * markbby (~Adium@168.94.245.3) Quit (Remote host closed the connection)
[16:44] * haomaiwang (~haomaiwan@117.79.232.209) Quit (Remote host closed the connection)
[16:50] * AfC (~andrew@2001:44b8:31cb:d400:501c:4d92:724:bd1a) Quit (Quit: Leaving.)
[16:51] * infinitytrapdoor (~infinityt@134.95.27.132) Quit (Read error: Operation timed out)
[17:04] * ScOut3R__ (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Read error: Operation timed out)
[17:04] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[17:05] <Machske> Is there a way t minimize the impact of scrubbing ? When scrubbing starts, the performance drops radically. This is on a 0.61.4 release
[17:10] * madkiss1 (~madkiss@pD9E025D2.dip0.t-ipconnect.de) Quit (Quit: Leaving.)
[17:13] * Midnightmyth_ (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[17:14] * julian (~julianwa@125.69.104.140) Quit (Quit: afk)
[17:14] * haomaiwang (~haomaiwan@notes4.com) has joined #ceph
[17:16] * rudolfsteiner (~federicon@220-122-245-190.fibertel.com.ar) has left #ceph
[17:16] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:16] * elder (~elder@193.120.41.118) has joined #ceph
[17:21] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[17:23] * markbby (~Adium@168.94.245.4) has joined #ceph
[17:24] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[17:24] <Machske> I think I found the issue:
[17:24] <Machske> xfs_db -r /dev/sda2
[17:24] <Machske> xfs_db> frag
[17:24] <Machske> actual 294080, ideal 243337, fragmentation factor 17.25%
[17:24] <Machske> whooops that's a big fragmentation value. Looks to be on all OSD's
[17:27] * haomaiwang (~haomaiwan@notes4.com) Quit (Read error: Operation timed out)
[17:27] * mtanski_ (~mtanski@69.193.178.202) has joined #ceph
[17:29] * sleinen1 (~Adium@2001:620:0:25:d9d9:6f60:417c:687f) Quit (Quit: Leaving.)
[17:29] * sleinen (~Adium@130.59.94.187) has joined #ceph
[17:31] * mtanski (~mtanski@69.193.178.202) Quit (Ping timeout: 480 seconds)
[17:31] * mtanski_ is now known as mtanski
[17:32] * Midnightmyth_ (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[17:33] * stacker666 (~stacker66@90.163.235.0) Quit (Ping timeout: 480 seconds)
[17:33] * brambles_ is now known as brambles
[17:37] * sleinen (~Adium@130.59.94.187) Quit (Ping timeout: 480 seconds)
[17:39] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[17:40] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[17:42] * sagelap (~sage@76.89.177.113) Quit (Ping timeout: 480 seconds)
[17:44] * Midnightmyth_ (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Read error: Operation timed out)
[17:45] * sagelap (~sage@2600:1012:b029:c392:98cc:b5d6:b7b3:9525) has joined #ceph
[17:47] * elder (~elder@193.120.41.118) Quit (Quit: Leaving)
[17:49] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[17:51] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[17:52] * markit (~marco@151.78.74.112) Quit (Quit: Konversation terminated!)
[17:52] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:53] * sagelap (~sage@2600:1012:b029:c392:98cc:b5d6:b7b3:9525) Quit (Ping timeout: 480 seconds)
[17:57] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:00] * rturk-away is now known as rturk
[18:00] * sleinen (~Adium@2001:620:0:25:d4df:73d1:c80f:5aae) has joined #ceph
[18:01] * sagelap (~sage@193.sub-70-197-77.myvzw.com) has joined #ceph
[18:01] * sagelap (~sage@193.sub-70-197-77.myvzw.com) Quit ()
[18:02] * sagelap (~sage@193.sub-70-197-77.myvzw.com) has joined #ceph
[18:03] * sleinen (~Adium@2001:620:0:25:d4df:73d1:c80f:5aae) Quit ()
[18:07] * Midnightmyth_ (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[18:11] * Tamil (~tamil@38.122.20.226) has joined #ceph
[18:12] * jjgalvez (~jjgalvez@ip72-193-215-88.lv.lv.cox.net) has joined #ceph
[18:12] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Read error: Operation timed out)
[18:18] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[18:20] * haomaiwang (~haomaiwan@117.79.232.209) has joined #ceph
[18:20] * Midnightmyth__ (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[18:21] * sagelap1 (~sage@38.122.20.226) has joined #ceph
[18:22] * sagelap1 (~sage@38.122.20.226) Quit ()
[18:22] * sagelap (~sage@193.sub-70-197-77.myvzw.com) Quit (Ping timeout: 480 seconds)
[18:23] * Midnightmyth___ (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[18:25] * Midnightmyth_ (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[18:27] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[18:28] * haomaiwang (~haomaiwan@117.79.232.209) Quit (Ping timeout: 480 seconds)
[18:29] * tnt (~tnt@109.130.77.55) has joined #ceph
[18:30] * Midnightmyth__ (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[18:31] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[18:34] * leseb (~Adium@83.167.43.235) Quit (Quit: Leaving.)
[18:56] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[18:56] * ChanServ sets mode +v andreask
[18:58] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[19:00] * iii8 (~Miranda@91.207.132.71) Quit (Read error: Connection reset by peer)
[19:05] * hybrid512 (~walid@106-171-static.pacwan.net) Quit (Quit: Leaving.)
[19:06] <sagewk> gregaf: look at lastest wip-mon-newsync patches?
[19:07] <gregaf> making my way to HEAD of branch, get there shortly :)
[19:08] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[19:09] * sleinen1 (~Adium@2001:620:0:25:6963:1a91:6857:1c10) has joined #ceph
[19:11] * mschiff (~mschiff@pD9510356.dip0.t-ipconnect.de) Quit (Remote host closed the connection)
[19:11] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[19:13] * tchmnkyz_ (~jeremy@ip23.67-202-99.static.steadfastdns.net) Quit (Quit: Lost terminal)
[19:16] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[19:16] <gregaf> sagewk: comments on the new commits and a few others I hadn't gotten to yet
[19:17] <gregaf> I assume you're going to squash those commits? otherwise the descriptions need to get cleaned up
[19:17] * Midnightmyth___ (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[19:17] * tchmnkyz (~jeremy@0001638b.user.oftc.net) has joined #ceph
[19:18] <gregaf> sagewk: and do you need me to check over last night's branches too?
[19:18] <sagewk> i haven't looked
[19:21] * humbolt (~elias@212095007107.public.telering.at) has joined #ceph
[19:21] * haomaiwang (~haomaiwan@117.79.232.209) has joined #ceph
[19:21] * xdeller (~xdeller@91.218.144.129) has joined #ceph
[19:29] * haomaiwang (~haomaiwan@117.79.232.209) Quit (Ping timeout: 480 seconds)
[19:29] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[19:30] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[19:33] <grepory> is there a particular distro that you recommend for ceph? we're re-investigating our distro choice
[19:33] * mschiff (~mschiff@85.182.236.82) has joined #ceph
[19:33] * nwat (~nwatkins@eduroam-226-128.ucsc.edu) has joined #ceph
[19:35] <sagewk> gregaf: ok going to squash it all back down
[19:35] * rturk is now known as rturk-away
[19:35] <gregaf> sagewk: I think you still need to clear out in_sync and force_sync in Monitor::preinit?
[19:36] <sagewk> they will get cleared when sync completes
[19:36] <gregaf> that's not part of the sync_prefixes so it's not cleared out there
[19:36] <gregaf> hmm, k
[19:36] <sagewk> and if sync never complets, then we still need to restart, so no real value in clearing it earlier
[19:36] <gregaf> sounds good
[19:36] <gregaf> still need to rename Paxos::apply_transactions, since it doesn't
[19:37] <sagewk> that one confused me.. it does. it reads them and applies them
[19:37] <sagewk> you want read_and_apply_transactions instead?
[19:37] <gregaf> you ripped out the apply part, it's encoding them into a transaction
[19:37] <gregaf> which is applied elsewhere
[19:37] <sagewk> ooh, gotcha
[19:38] * kyle__ (~kyle@216.183.64.10) Quit (Quit: Leaving)
[19:38] * iii8 (~Miranda@91.207.132.71) has joined #ceph
[19:38] <gregaf> the is_consistent check sounds good; I'm still confused about how the last_committed floor can fix anything
[19:40] <gregaf> sagewk: and the sync is now safe, but if the sync gets restarted and our original start point has fallen off the end of the paxos trail we get forced into a full sync
[19:40] <gregaf> even if we have enough paxos transactions on disk to overlap the current run
[19:40] <sagewk> the recent doesn't set in_sync
[19:40] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[19:40] <sagewk> it just slurps recent commits..
[19:40] <sagewk> is that what you mean?
[19:43] <gregaf> yeah, but it only updates the last_committed when it catches up, right?
[19:43] <sagewk> yeah
[19:43] <sagewk> oh, it could apply as it goes
[19:44] <gregaf> so if it doesn't quite catch up, and it started at commit 500 and the cluster got to 1100 while slurping, then on restart it goes to full sync
[19:47] * TiCPU (~jeromepou@209.52.17.78) has joined #ceph
[19:48] * markit (~marco@88-149-177-66.v4.ngi.it) has joined #ceph
[19:48] * infinitytrapdoor (~infinityt@109.46.3.22) has joined #ceph
[19:48] <sagewk> yeah, i'll fix that
[19:57] * dpippenger1 (~riven@tenant.pas.idealab.com) has joined #ceph
[20:02] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Ping timeout: 480 seconds)
[20:09] * humbolt_ (~elias@212095007096.public.telering.at) has joined #ceph
[20:12] * humbolt (~elias@212095007107.public.telering.at) Quit (Ping timeout: 480 seconds)
[20:12] * humbolt_ is now known as humbolt
[20:13] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[20:13] * humbolt (~elias@212095007096.public.telering.at) Quit (Read error: Connection reset by peer)
[20:13] <jtang> has anyone (ab)used the watch functionality in librados as an alternative to something like 0mq?
[20:16] * humbolt (~elias@212095007096.public.telering.at) has joined #ceph
[20:17] * rturk-away is now known as rturk
[20:17] * humbolt (~elias@212095007096.public.telering.at) Quit (Read error: Connection reset by peer)
[20:20] * jtang1 (~jtang@142.176.24.2) has joined #ceph
[20:20] * humbolt (~elias@212095007096.public.telering.at) has joined #ceph
[20:21] * haomaiwang (~haomaiwan@notes4.com) has joined #ceph
[20:22] * jtang must keep an eye on libcrush as well
[20:25] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: garota, eu vou pra California... MTV, here I come)
[20:30] * haomaiwang (~haomaiwan@notes4.com) Quit (Ping timeout: 480 seconds)
[20:33] * rturk is now known as rturk-away
[20:34] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[20:34] * ChanServ sets mode +v andreask
[20:35] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[20:36] * jtang1 (~jtang@142.176.24.2) Quit (Quit: Leaving.)
[20:50] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[20:51] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[20:53] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[20:53] * Tamil (~tamil@38.122.20.226) Quit (Quit: Leaving.)
[20:57] * jks (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[21:02] * jtang1 (~jtang@142.176.24.2) has joined #ceph
[21:03] * Tamil (~tamil@38.122.20.226) has joined #ceph
[21:04] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:05] * danieagle (~Daniel@186.214.76.63) has joined #ceph
[21:20] * sleinen1 (~Adium@2001:620:0:25:6963:1a91:6857:1c10) Quit (Quit: Leaving.)
[21:23] * haomaiwang (~haomaiwan@117.79.232.209) has joined #ceph
[21:23] * jtang1 (~jtang@142.176.24.2) Quit (Quit: Leaving.)
[21:24] * jtang1 (~jtang@142.176.24.2) has joined #ceph
[21:26] * infinitytrapdoor (~infinityt@109.46.3.22) Quit (Ping timeout: 480 seconds)
[21:28] * drokita (~drokita@199.255.228.128) has left #ceph
[21:29] * jtang1 (~jtang@142.176.24.2) Quit ()
[21:31] * haomaiwang (~haomaiwan@117.79.232.209) Quit (Ping timeout: 480 seconds)
[21:31] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[21:36] * ScOut3R (~ScOut3R@540240D7.dsl.pool.telekom.hu) has joined #ceph
[21:47] * ScOut3R (~ScOut3R@540240D7.dsl.pool.telekom.hu) Quit (Ping timeout: 480 seconds)
[21:51] * danieagle (~Daniel@186.214.76.63) Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[21:57] * Cube (~Cube@12.248.40.138) has joined #ceph
[22:02] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:02] * gregaf (~Adium@2607:f298:a:607:90b6:a075:51bd:1b9a) Quit (Quit: Leaving.)
[22:03] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) Quit (Server closed connection)
[22:04] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[22:05] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (Server closed connection)
[22:05] <Azrael> sagewk: around?
[22:05] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) has joined #ceph
[22:05] <sagewk> y
[22:05] * jefferai (~quassel@corkblock.jefferai.org) Quit (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[22:05] * jtang1 (~jtang@142.176.24.2) has joined #ceph
[22:06] <Azrael> sagewk: was wondering if you need anything from us wrt: the osd crashes bug or downing and osd taking down others
[22:06] <sagewk> not yet
[22:06] <Azrael> ok
[22:07] * oddomatik (~Adium@12.248.40.138) has joined #ceph
[22:07] <Azrael> we believe we can reproduce the issue/effect, if needed for logs
[22:08] <sagewk> great. i think the logs so far will be sufficient, but we'll let you know!
[22:08] <Azrael> right, ok
[22:18] <Azrael> sagewk: thanks
[22:21] * LeaChim (~LeaChim@90.217.166.163) Quit (Ping timeout: 480 seconds)
[22:22] * madkiss (~madkiss@2001:6f8:12c3:f00f:4877:fd3e:dcda:4e57) has joined #ceph
[22:23] * gregaf (~Adium@38.122.20.226) has joined #ceph
[22:24] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[22:25] * sleinen1 (~Adium@2001:620:0:26:d110:eadc:ff70:884c) has joined #ceph
[22:26] * mikedawson (~chatzilla@206.246.156.8) has joined #ceph
[22:27] * Macmonac (~opera@194.199.107.6) Quit (Server closed connection)
[22:30] * LeaChim (~LeaChim@2.220.252.14) has joined #ceph
[22:31] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[22:32] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[22:36] * mikedawson_ (~chatzilla@206.246.156.8) has joined #ceph
[22:37] <sagewk> sjust: did you look at the final version of https://github.com/ceph/ceph/pull/410 ?
[22:37] <sjust> sagewk: oops, looking now
[22:38] <Kdecherf> Hm, what can cause a infinite "rejoin"/"replay" state of mds of a cluster after a upgrade to 0.65?
[22:39] * mikedawson (~chatzilla@206.246.156.8) Quit (Ping timeout: 480 seconds)
[22:39] * mikedawson_ is now known as mikedawson
[22:40] <sjust> sagewk: looks right
[22:42] <sagewk> thanks
[22:50] * jtang1 (~jtang@142.176.24.2) Quit (Quit: Leaving.)
[22:51] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[22:51] * ChanServ sets mode +v andreask
[22:54] <Kdecherf> I have a mds hanging on "1 mds.0.291 rejoin_joint_start" on 0.65, any idea?
[22:55] <paravoid> sjust: so, I upgraded to 0.66 today
[22:55] <paravoid> slow peer is slow :)
[22:55] * Machske (~Bram@d5152D87C.static.telenet.be) Quit (Server closed connection)
[22:55] <paravoid> (I was running i/o all weekend)
[22:55] * Machske (~Bram@d5152D87C.static.telenet.be) has joined #ceph
[22:56] <gregaf> Kdecherf: sounds like you're using multiple active MDSes?
[22:58] * fridudad_ (~oftc-webi@p4FC2C8FD.dip0.t-ipconnect.de) has joined #ceph
[22:59] <fridudad_> Help! ;-( i've some some OSDs running on 100% cpu but doing "nothing" and generating slow queries / down vms while running upstream/cuttlefish
[22:59] <Kdecherf> gregaf: 3 mds but one active at a time
[23:01] <Kdecherf> oh
[23:01] <Kdecherf> oh yeah
[23:01] <Kdecherf> I see what happened here
[23:01] * capri (~capri@212.218.127.222) Quit (Server closed connection)
[23:02] * capri (~capri@212.218.127.222) has joined #ceph
[23:03] <Kdecherf> gregaf: it seems that the mds in recovery/replay/rejoin state doesn't like to have 146 client session reconnects
[23:03] <gregaf> did it finally finish or something?
[23:03] <gregaf> that generally shouldn't be a problem, but maybe we have a gaping hole we don't know about
[23:04] <Kdecherf> I blacklisted all clients with iptables for waiting the mds to end its recovery
[23:04] <Kdecherf> the cluster is back online now
[23:04] <Kdecherf> in active state*
[23:07] * mtanski (~mtanski@69.193.178.202) Quit (Ping timeout: 480 seconds)
[23:09] * Pauline (~middelink@2001:838:3c1:1:be5f:f4ff:fe58:e04) has joined #ceph
[23:09] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[23:11] * diegows (~diegows@190.190.2.126) Quit (Server closed connection)
[23:11] * diegows (~diegows@190.190.2.126) has joined #ceph
[23:12] <darkfaded> fridudad_: usual questions, what state are you pg's in
[23:12] <fridudad_> darkfaded: oh right now in chaotic state as i'm trying to solve the issue for 4 hours now.
[23:13] <fridudad_> darkfaded: 4096 pgs: 2133 active+clean, 580 active+remapped+wait_backfill, 526 active+degraded+wait_backfill, 6 active+recovery_wait, 1 peering, 1 active+remapped, 4 active+remapped+backfilling, 110 active+degraded+backfilling, 665 active+degraded+remapped+wait_backfill, 1 active+recovery_wait+remapped, 2 active+recovery_wait+degraded, 4 remapped+peering, 63 active+degraded+remapped+backfilling;
[23:13] <darkfaded> ok let that sit for a while (i won't be able to help, but i hope someone else can with that info now)
[23:14] <darkfaded> incidentially, is there net traffic between the osd's still?
[23:16] <fridudad_> darkfaded: i had this status since exactly 04:30 hours
[23:16] <fridudad_> darkfaded: or at least a status like that
[23:16] <fridudad_> but i played with the rewight then and tried to set the not responding osds to weight 0
[23:17] <darkfaded> hint: working with storage: debugging is first, change nothing. sorry to say :(
[23:17] <darkfaded> but maybe for next time
[23:17] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 22.0/20130618035212])
[23:17] <saaby> fridudad_: did you have 5 pg's in peering all the time, or is that changing?
[23:18] * sleinen1 (~Adium@2001:620:0:26:d110:eadc:ff70:884c) Quit (Quit: Leaving.)
[23:19] <fridudad_> saaby: do you mean the 1 peering + 4 remapped+peering ?
[23:19] <saaby> yep
[23:19] <Machske> gregaf: Maybe you remember my problem about the cephfs being slow and I thought it was the mds, because I thought I had 1.1mil inodes as seen with df -i but that appeared not to be the inode count ?
[23:20] * mtanski (~mtanski@69.193.178.202) Quit (Quit: mtanski)
[23:20] <fridudad_> saaby: it's changing actual status is now 4096 pgs: 75 active, 2230 active+clean, 1183 active+remapped+wait_backfill, 79 active+degraded+wait_backfill, 15 active+recovery_wait, 55 peering, 1 active+remapped, 69 active+remapped+backfilling, 7 active+degraded+backfilling, 189 active+degraded+remapped+wait_backfill, 5 active+recovery_wait+remapped, 179 remapped+peering, 9 active+degraded+remapped+backfilling;
[23:20] <Machske> Anyway today I found out that all of my osd's have a fragmentation of 25% or more using xfs as filesystem
[23:20] <saaby> ok
[23:20] <saaby> are your osd's all up?
[23:20] <Machske> I'm running xfs_fsr on all osd's, hopefully this will help fix the problem
[23:20] <saaby> or are they flapping?
[23:20] <darkfaded> Machske: i have made a nagios check script for xfs frag
[23:21] <fridudad_> saaby: now it's 4096 pgs: 18 active, 2409 active+clean, 1249 active+remapped+wait_backfill, 68 active+degraded+wait_backfill, 13 active+recovery_wait, 40 peering, 1 active+remapped, 71 active+remapped+backfilling, 9 active+degraded+backfilling, 81 active+degraded+remapped+wait_backfill, 7 active+recovery_wait+remapped, 122 remapped+peering, 7 active+degraded+remapped+backfilling, 1 active+recovering;
[23:21] <Machske> darkfaded: cool! is it available ?
[23:21] <Machske> We use nagios for everything so that will certainly come in handy
[23:21] <fridudad_> saaby: sometimes flapping (always the same OSDs) sometimes all up
[23:21] <fridudad_> saaby: right now all are up
[23:22] <saaby> ok
[23:22] <darkfaded> Machske: yes i'll gice you a link in a second
[23:22] <saaby> I think I would set "ceph osd set nodown"
[23:22] <darkfaded> just don't use it as i do - i have been running it every min
[23:22] <darkfaded> made up for 50% of disk io
[23:22] <saaby> to try to stabilize things
[23:22] <Machske> :)
[23:22] <darkfaded> so, once every 6 hours is fine i think
[23:22] <fridudad_> saaby: now flapping only 27 of 28 are up
[23:22] <darkfaded> https://bitbucket.org/darkfader/nagios/src/eb1aea815814e26afc75f913b9deaabf89320678/check_mk/local/xfsfrag.py?at=default
[23:22] <Machske> I was actually planning to write something my self and let it run every 12 hours or so
[23:23] <gregaf> is that level of fragmentation actually a problem?
[23:23] <saaby> if your osd's are actually working (i.e. not crashed) setting "nodown" could stabilize your pg's
[23:23] <darkfaded> if you don't have cmk then just change it to use os.return and a print
[23:23] <fridudad_> saaby: the osd marked down is using 100% cpu
[23:23] <saaby> right
[23:23] <saaby> that is probably from peering
[23:23] <darkfaded> gregaf: imho 25% is still quite OK, but depends on use. the fragmentation report is _weird_ - my ISO fs has 99%
[23:23] <saaby> I am not sure, but I think you are maybe experiencing something we have seen too
[23:23] <darkfaded> so it doesn't account for average file size
[23:23] <darkfaded> on a ceph fs i would actually worry about 25%
[23:24] * haomaiwang (~haomaiwan@117.79.232.209) has joined #ceph
[23:24] <fridudad_> saaby: any solution, hint, dirty hack or workaround?
[23:24] <gregaf> I thought 25% would have been 125 extents for 125 files, which sounds fine, but maybe not
[23:24] <Machske> darkfaded: nice: 0 fs_status_XFS - OK - no heavily fragmented Filesystems
[23:24] <saaby> yeah, I would start by setting "nodown"
[23:24] <darkfaded> Machske: you can adjust the level of course :)
[23:24] <saaby> that prevntsthe mons from marking the osd's down. - which could stabilize things
[23:25] <fridudad_> saaby: no this OSD has killed itself the process is away
[23:25] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[23:25] <Machske> I would worry too, because docs says: a couple of % is a scary value :)
[23:25] <saaby> what do you mean?
[23:25] <saaby> ah ok
[23:25] <fridudad_> saaby: the process isn't running anymore
[23:25] <saaby> the process is missing?
[23:25] <saaby> ok
[23:25] <fridudad_> saaby: yes the ceph-osd process has gone
[23:25] <Machske> oom maybe ?
[23:26] <saaby> I would: set "nodown" - start that osd up - and make sure they all stay up (and if they crash restart)
[23:26] <Machske> to increase stability, I have a cron running every 5 mins on each cluster node with /etc/init.d/ceph start
[23:26] <Machske> I woke up one day only to find out that 30% of the osd's had killed themselves during the night
[23:26] <saaby> can you see how it crashed?
[23:26] <saaby> segfault? assert?
[23:27] <fridudad_> saaby: heartbeat suicide timeout
[23:27] <saaby> ok
[23:28] <fridudad_> saaby: but the problem is i've now restarted this process but now the next one runs at 100% and will die soon
[23:28] <saaby> oh, so the processes keep crashing?
[23:28] <darkfaded> Machske: they got emo?
[23:29] <fridudad_> saaby: no now it's another OSD
[23:29] <saaby> or the same proces keeps crashing?
[23:29] <Machske> emo ?
[23:29] <saaby> ok
[23:29] <darkfaded> Machske: depressed
[23:29] <saaby> did you set "nodown" ?
[23:29] <fridudad_> saaby: if i start this osd again another one will run at 100% puc...
[23:29] <saaby> yes
[23:29] <saaby> I think that is from peering
[23:29] <Machske> well it was mostly a problem in 0.48, it got A LOT better sinds 0.56 and so on
[23:30] <Machske> it were OOM issues, I'm guessing memory leaks
[23:30] <fridudad_> saaby: yes
[23:30] <Machske> but still, to give me a better "probably false" feeling, I'm still running that cron
[23:30] <saaby> ok
[23:31] <Machske> sense of confort I mean
[23:31] <saaby> and that still happens.. ok.
[23:32] * haomaiwang (~haomaiwan@117.79.232.209) Quit (Ping timeout: 480 seconds)
[23:33] <saaby> fridudad_: ok, if a new OSD keeps crashing everytime you restart the last crashed one (even with "nodown" set), then I'm in too deep. - And you should probably unset nodown again...
[23:33] <fridudad_> saaby: what do you mean by in too deep?
[23:34] <saaby> == I probably don't know what the root cause is.
[23:34] <fridudad_> saaby: let's see maybe it had crashed before i set it and i didn't recognized it
[23:34] <saaby> ok
[23:34] <fridudad_> saaby: right now all are up
[23:34] <saaby> aha!
[23:34] <saaby> good
[23:35] <saaby> but, rememerb that the "nodown" setting will not mark osd's down even if they crash
[23:35] <fridudad_> saaby: but still wondering why always the same osd processes using 100% cpu
[23:35] <saaby> so you have to check that all osd processes (on alle servers) are running
[23:36] <fridudad_> saaby: but it is visable via the degraded output? of ceph -s? or really to check all pids?
[23:36] <saaby> yeah.. My guess was that, that was because of it peering up it's pg's, but it's a bit conserning that it stays at 100%
[23:36] <saaby> can you see your inactive/peering pg's getting fewer?
[23:36] <Pauline> you might be lucky and "ceph osd tree" will show the actual osd status
[23:36] <saaby> Pauline: nope
[23:36] <saaby> It won't
[23:37] <markit> hi, newbie here, let's say that I've 3 nodes, with 3 ods, and set to keep 2 copies. Is ceph going to balance the traffic for max throughput (get/write data in more than one node) and max safety (not 2 copies on the same node)?
[23:37] <markit> (I'm using RBD)
[23:37] <fridudad_> saaby: actual status pgs: 2745 active+clean, 1147 active+remapped+wait_backfill, 49 active+degraded+wait_backfill, 11 active+recovery_wait, 1 active+remapped, 76 active+remapped+backfilling, 52 active+degraded+remapped+wait_backfill, 8 active+recovery_wait+remapped, 7 active+degraded+remapped+backfilling;
[23:38] <saaby> ok, that looks better
[23:38] <saaby> all active
[23:38] <saaby> still seeing slow requests?
[23:38] <Machske> If it's always the same osd, would it not be a better idea to remove it from the cluster, let it rebalance and then re-add it again as a new osd ?
[23:39] * humbolt (~elias@212095007096.public.telering.at) Quit (Quit: humbolt)
[23:39] <fridudad_> saaby: i've shutted down all VMs since 4 hours so i've no I/O. If i just strat one VM i see slow requests again as the processes running at 100% cpu do not serve any I/O at all to clients
[23:39] <saaby> Machske: yeah, I think so, but I understood that different osd's where involved in the crashes.
[23:40] <saaby> fridudad_: just to be sure - is it only one OSD running at 100% and not serving I/O?
[23:40] <saaby> and is it still doing that?
[23:40] <fridudad_> Machske: yes that was what i tried but then the next one went down via suicide
[23:40] <saaby> right, ok.
[23:41] <fridudad_> saaby: no - osd.12, osd.13 and osd.14 are the OSDs i've seen
[23:41] <saaby> ok, are you getting any more active+clean pg's?
[23:41] <Machske> do you use rbd to store the vm images or filebacked disk images on cephfs for your vm's
[23:41] <saaby> and is that one OSD still burning 100% cpu?
[23:41] <fridudad_> saaby: yes but very slow
[23:42] <fridudad_> saaby: now 2810 active+clean, 1092 active+remapped+wait_backfill, 46 active+degraded+wait_backfill, 11 active+recovery_wait, 77 active+remapped+backfilling, 1 active+degraded+backfilling, 49 active+degraded+remapped+wait_backfill, 8 active+recovery_wait+remapped, 2 active+degraded+remapped+backfilling;
[23:42] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[23:42] <fridudad_> Machske: rbd
[23:42] <fridudad_> saaby: from time to time two minutes ago all three now only one
[23:43] <saaby> fridudad_: ok, thats probably good. - I think the slowness could be because there is a lot of backfilling going on (because you rewrighted one or more OSD's?)
[23:43] <Machske> maybe lot's of iowait ?
[23:43] <saaby> fridudad_: ok, that is probably also good. I would give it a bit of time and see if it goes away.
[23:43] <fridudad_> saaby: yes i reverted right now all my changes regarding the reweight
[23:44] <saaby> oh ok - just now?
[23:44] <fridudad_> saaby: urg sorry no when we started to talk
[23:44] <saaby> ok
[23:44] * TiCPU (~jeromepou@209.52.17.78) Quit (Remote host closed the connection)
[23:44] <fridudad_> Machske: sure right now with all these backfilling and recovering i see a lot of I/O waits some machines 9% some other 40%
[23:45] <saaby> right.. that is probably normal..
[23:45] <fridudad_> Machske: but under normal workload i have between 0.5% and 1%
[23:45] <Machske> 9% till 40% aint that bad, I've got worse on my machines :s I'm always very carefull when reweighing
[23:45] <saaby> fridudad_: what happens if you try to give the cluster a bit of I/O? - still slow requests?
[23:46] <fridudad_> saaby: right now all OSDs on one host use 100% cpu and seem to stuck
[23:46] * BillK (~BillK-OFT@124-169-221-120.dyn.iinet.net.au) has joined #ceph
[23:46] <saaby> ok...
[23:46] <Machske> btw I set osd recovery max active = 2 instead of 5 to minimize the impact
[23:46] <fridudad_> Machske yes i had it set to 1 but right now...
[23:46] <Machske> recovery takes longer, but the impact is less
[23:47] <fridudad_> Machske i wanted to have a working machine and i had backfiling and recovery set to 2 and were still seeing slow requests and failing vms
[23:47] <saaby> it would probably be a good idea to set debug-osd to 20 and see what they are doing
[23:47] <saaby> if you can't see anything meaningful from the logs already
[23:47] <fridudad_> Machske: so my idea was to give it faster recover
[23:48] <Machske> how are your pools configured in terms of rep size and min_size ?
[23:48] <fridudad_> saaby: log just says: restarting backfill on
[23:48] <Machske> ceph osd dump | grep min_size
[23:48] <fridudad_> Machske: rep size 3 min size 1
[23:48] <saaby> hmm.. restarting backfill. that sounds wrong?
[23:49] <fridudad_> saaby: it's full of restarting backfill entries
[23:49] <Machske> hmm that's what I have, during rebalance I also get slow requests but just for a few moments when the rebalancing is being calculated. Once it starts, slow requests are gone
[23:50] <fridudad_> Machske nice ;-) i would like to have that too...
[23:53] <fridudad_> saaby: another idea was to reformat these osds having the 100% cpu problem i wasn't sure if this is better or worse
[23:54] <saaby> I think that sounds very scary at this point.. as long as you don't know the root cause
[23:54] <Machske> idd
[23:54] <Machske> would not destroy any osd's atm
[23:54] <saaby> but - if you want to try something like that shutting them down should give you the same result?
[23:54] * mtanski (~mtanski@69.193.178.202) Quit (Quit: mtanski)
[23:55] <Machske> set noout and shut those osd's ?
[23:55] <Kdecherf> gregaf: I have an issue on mds 0.65, I can't mount anymore my pool (mount error 5)
[23:55] <saaby> Machske: yeah.. and unset "nodown" again.
[23:55] <Kdecherf> I don't see anything in the logs of the active mds
[23:55] <fridudad_> saaby: why unset it again?
[23:55] <saaby> either that, or mark the shut-down osd's down manually.
[23:55] <gregaf> Kdecherf: I think that usually means that there isn't an MDS (not always, though)
[23:55] <Machske> I personaly do not set nodown if I temporary shutdown an osd, I just set noout which prevents auto recovery
[23:56] <saaby> sure. the nodown was to prevent the osd's to get marked down because of slowness.
[23:57] <Machske> ah ok
[23:57] <Kdecherf> gregaf: it's weird 'cause the cluster correctly report an active mds
[23:57] * mtanski (~mtanski@69.193.178.202) has joined #ceph
[23:57] <Machske> well I've they get marked down by slowness, that would scare me, I've had overloaded osd's in the past, but they never got marked as down
[23:58] <gregaf> Kdecherf: well, get logs of the client and see where the error is coming from?
[23:58] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[23:58] <fridudad_> saaby: so you would stay at nodown and keep all running?
[23:58] <Machske> doesnt that cause peering issues ?
[23:59] <Machske> osd's are down but you're basically saying they are not, so would they not try to contact the downed osd ?
[23:59] <saaby> fridudad_: well.. I still don't understand the rootcause of those osd's burning cpu. so I don't really have good bearings now.

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.