#ceph IRC Log


IRC Log for 2012-10-09

Timestamps are in GMT/BST.

[0:02] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[0:03] <gregaf> tziOm: yeah, you'll probably also want to specify data directory locations and stuff, but the OSD doesn't need to have the cluster enumerated in its conf file or anything like that
[0:06] * tryggvil (~tryggvil@163-60-19-178.xdsl.simafelagid.is) has joined #ceph
[0:08] * The_Bishop (~bishop@p4FCDF2AC.dip.t-dialin.net) has joined #ceph
[0:10] <tziOm> My thing is I want to boot all OSDs off a common image
[0:10] <tziOm> So basically it does not have a config (except the mon address it's told)
[0:13] * dty (~derek@pool-71-178-175-208.washdc.fios.verizon.net) has joined #ceph
[0:15] <dmick> tziOm: I think all the stuff that has to be per-daemon defaults
[0:16] <dmick> you'll want to add the new osd to your master ceph.conf to handle cluster restarts
[0:16] <dmick> but as far as per-OSD config, you might be able to get by with the defaults
[0:16] <dmick> you can also set config options on the commandline
[0:17] <tziOm> hmm..
[0:19] <tziOm> So you guys rsync the ceph.conf around?
[0:19] <dmick> there are a lot of different setup methodologies around
[0:20] <tziOm> ..and the preferred one?
[0:21] <dmick> I know the effort is to make it so that there is little to no per-host configuration, since obviously this is a win
[0:21] <dmick> I *think* the Chef method does not copy the entire .conf file, but I'd have to look at the code to be sure
[0:21] <dmick> the easy way when doing small manual clusters is to copy it everywhere, just because that requires no thought
[0:21] <tziOm> sure..
[0:22] * miroslavk (~miroslavk@69-170-63-147.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[0:23] <gregaf> most of our automation stuff generates a ceph.conf which contains the per-daemon defaults, the list of monitors, and nothing else
[0:23] <gregaf> and then starts up the right number of daemons with the appropriate IDs
[0:24] <tziOm> IC..
[0:25] <tziOm> but then back to my original question: any way I can grab this config off the cluster itself on bootup, having just the mon addresses and keyring? .. reason is that when I boot the first time and make the osd, I make an osd config with the appropriate osd.id and so on, but next time I boot, I know nothing.. just my hostname/ip..
[0:26] <tziOm> ..so would be nice to be able to ceph getconf hostname > /etc/ceph.conf
[0:26] <tziOm> ..then service ceph start
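For reference, a cluster-global ceph.conf of the shape gregaf describes above (per-daemon defaults, the monitor list, and nothing else) might look roughly like this. Every hostname, address, and path below is a placeholder of mine, not taken from the log:

```ini
; minimal templatable ceph.conf: monitor list plus per-daemon defaults
[global]
    auth supported = cephx
    keyring = /etc/ceph/keyring

[mon.a]
    host = mon01
    mon addr =

[osd]
    ; per-OSD paths derived from the id, so no per-host entries are needed
    osd data = /var/lib/ceph/osd/ceph-$id
    osd journal = /var/lib/ceph/osd/ceph-$id/journal
```

The point of keeping it this small is that the same file can be copied (or generated) identically on every host, which is what makes the common-boot-image approach workable.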
[0:26] <dmick> you should be able to grab any config options you've set by asking the monitor admin socket as above
[0:27] <dmick> ah, but that's on the monitor machine, I see what you're saying
[0:27] <tziOm> so --path can be ip address?
[0:27] <dmick> you'd like to be able to contact it from another host
[0:27] <tziOm> yes
[0:27] <tziOm> because I boot from net, and have a general "osd" image I boot into
[0:27] <dmick> yes, I'm finally following
[0:28] <dmick> sorry
[0:28] <tziOm> only small problem with this is I need my config at bootup, before starting osd's
[0:30] <dmick> --admin-daemon must refer to a local UNIX-domain socket, it seems
[0:33] <dmick> but yeah, this is standard templatable info; the cluster-global stuff can just be copied however is convenient; the per-OSD config (filesystem path, journal path, etc.) can be generated
[0:34] <dmick> so --admin-daemon is overkill; it's a way to validate the running config, which is where I started, but I don't think that's where you're going with this
[0:40] <tziOm> no..
[0:41] <tziOm> I can perhaps get what I need from ceph osd dump
[0:41] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[0:42] <tziOm> So I just need some logic where the lowest number is md0 and so on, mount at some standard location, and I am good to go..
[0:42] <tziOm> dont you think?
[1:12] * dty (~derek@pool-71-178-175-208.washdc.fios.verizon.net) Quit (Quit: dty)
[1:12] * dty (~derek@pool-71-178-175-208.washdc.fios.verizon.net) has joined #ceph
[1:16] * tziOm (~bjornar@ti0099a340-dhcp0358.bb.online.no) Quit (Remote host closed the connection)
[1:16] * senner (~Wildcard@point-mod100.wctc.net) has joined #ceph
[1:18] * senner (~Wildcard@point-mod100.wctc.net) Quit ()
[1:23] * lofejndif (~lsqavnbok@04ZAAAFUO.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[1:24] * The_Bishop (~bishop@p4FCDF2AC.dip.t-dialin.net) Quit (Quit: Who the hell is this peer? If I catch him I'll reset his connection!)
[1:28] * Cube1 (~Adium@ Quit (Quit: Leaving.)
[1:29] * jlogan1 (~Thunderbi@2600:c00:3010:1:573:72fb:f085:5346) Quit (Ping timeout: 480 seconds)
[1:32] * dty (~derek@pool-71-178-175-208.washdc.fios.verizon.net) Quit (Quit: dty)
[1:38] * gminks_ (~ginaminks@108-210-41-138.lightspeed.austtx.sbcglobal.net) Quit (Quit: gminks_)
[1:42] * dty (~derek@pool-71-178-175-208.washdc.fios.verizon.net) has joined #ceph
[1:52] * Meyer___ (meyer@c64.org) has joined #ceph
[1:54] * Meyer__ (meyer@c64.org) Quit (Ping timeout: 480 seconds)
[1:58] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[1:59] <joao> sagewk, still around?
[1:59] * edv (~edjv@ has left #ceph
[2:00] <sagewk> joao: yeah
[2:06] * phantomcircuit (~phantomci@173-45-240-7.static.cloud-ips.com) Quit (Quit: quit)
[2:07] * phantomcircuit (~phantomci@173-45-240-7.static.cloud-ips.com) has joined #ceph
[2:11] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[2:11] * mik (~michael@office-pat.ipsystems.com.au) has joined #ceph
[2:12] <mik> Hi, anyone here familiar with the internals of cephx?
[2:16] <gregaf> what're you after, mik?
[2:16] <gregaf> yehudasa wrote it
[2:16] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[2:16] <mik> I've been trawling through the code...
[2:16] <mik> And I was wondering why a fixed IV was used for AES-CBC encryption
[2:17] <gregaf> well, that's definitely a question for him then
[2:17] <mik> Is it like 3am for him?
[2:17] <yehudasa> no, 5pm
[2:17] <mik> Oh, awesome :)
[2:18] <mik> So I was trawling around the code, to see what security it gives
[2:19] <mik> And I noticed that CBC encryption with a fixed IV is used... are the keys ephemeral, or are they fixed/well-known?
[2:20] <yehudasa> hmm.. the keys are fixed
[2:20] <mik> Oh...
[2:20] <yehudasa> let's go one step back
[2:20] <mik> Ok
[2:20] <yehudasa> what keys are you asking about?
[2:21] <mik> The AES keys
[2:21] <mik> The protocol looks a lot like krb4 to me
[2:21] <yehudasa> well, each user has its own configurable key
[2:22] <mik> Maybe we should step back further a bit
[2:22] <mik> So who are the "actors" in this?
[2:23] <mik> Do the clients go to the monitor to get a ticket, then use the ticket to talk to OSDs?
[2:23] <yehudasa> basically yes
[2:23] <mik> Do they get a ticket per OSD?
[2:23] <yehudasa> same goes with the osds/mds
[2:24] <mik> Is there any "authorisation" once you're authenticated?
[2:24] <mik> (eg. which volumes you're allowed to access)
[2:24] <yehudasa> a ticket is per service type
[2:25] <yehudasa> e.g., a single ticket per all the osds
[2:25] <yehudasa> however
[2:25] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[2:25] <yehudasa> the osds have some 'rotating' ephemeral secret
[2:25] * Meyer___ (meyer@c64.org) Quit (Read error: Connection reset by peer)
[2:26] <mik> And the only crypto primitive you're using is AES?
[2:26] <yehudasa> iirc, yes
[2:27] <mik> Hm, this is really begging for real Kerberos :/
[2:27] * sagelap (~sage@110.sub-70-197-140.myvzw.com) has joined #ceph
[2:28] <mik> I was looking into coding a couple of things in, but the auth protocol is pretty critical for what I was looking at
[2:28] <mik> I wanted to add HMAC to the protocol (to prevent session hijacking)
[2:28] <mik> and Authorisation - if it's possible
[2:28] <yehudasa> well, there is extra 'caps' data attached to the tickets
[2:28] <mik> eg. limiting which volumes/pools a client has access to
[2:28] <yehudasa> that's already there
[2:29] <mik> Oh, awesome
[2:29] <mik> So assuming I had a network that couldn't be sniffed on/MITM attacked, I could limit which volumes a client can access based on auth key?
[2:29] <yehudasa> and there's a branch with message signing waiting to be merged in
[2:30] <yehudasa> you can limit a client to specific pools, yes
[2:30] <mik> Nice
[2:30] <mik> I'm looking at getting a new SAN infrastructure for work
[2:30] <mik> So I thought I'd take another look at Ceph, Gluster, etc
[2:31] <yehudasa> well, we have a guy working on our security stuff recently, and he has made some changes
[2:31] <mik> Ceph seems to be the best fit so far (although not for NAS, just SAN replacement and object store)
[2:31] <mik> Ah, full time employee?
[2:31] <yehudasa> part time, but as good as a full time :) dedicated to security
[2:33] <mik> Ok, so back to the ticket contents
[2:33] <mik> So the ticket is encrypted with the shared secret between the mon process and the OSDs
[2:34] <mik> and sent to the client, and has caps encrypted in it
[2:34] <mik> But the client just sees it as a binary blob?
[2:35] <yehudasa> yeah, the client doesn't see the ticket's content
[2:35] <mik> Ok, that's a pretty good start :)
[2:36] * sagelap (~sage@110.sub-70-197-140.myvzw.com) Quit (Ping timeout: 480 seconds)
[2:36] <mik> But you're sending multiple messages out with the same IV and key?
[2:37] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[2:37] <yehudasa> well, I think there's a nonce in every message
[2:39] <mik> Yeah, if that was at the start, it would be ok
[2:39] <gregaf> the messages aren't identical and it's not like it's sending out the actual key; they're signed….
[2:39] <gregaf> <— not a crypto expert, but confused by this conversation
[2:40] <yehudasa> the nonce is always at the first block
[2:40] <mik> So the thing that could happen is somebody could replace the service_id in the message with somebody else's
[2:41] <yehudasa> service_id?
[2:41] <gregaf> correction: they're not signed, that's what Peter is working on right now…I always make that up for some reason *sigh*
[2:41] <mik> At the start of CephXSessionAuthInfo
[2:41] <mik> Or is it CephXAuthorizer I should be looking at?
[2:43] <yehudasa> are you searching for the ticket?
[2:43] <mik> Oh, I see
[2:43] <mik> so nonce is right at the start
[2:43] <mik> so the IV is fine
[2:44] <mik> Well that was lucky ;)
[2:45] <yehudasa> I don't think luck had much to do with it
[2:45] <mik> It's all to do with CBC mode - it basically xors the plaintext block against the previous ciphertext block before encrypting
[2:45] <mik> So the IV is the "first" ciphertext block
[2:46] <mik> Normally it would be a nonce of some sort (either passed with the message, or pre-agreed)
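mik's point about CBC can be sketched concretely. The following is my own toy Python illustration, not Ceph's actual code; it uses a keyed hash as a stand-in block cipher (so it only demonstrates the encryption direction). It shows that with a fixed IV and a fixed key, identical plaintexts encrypt identically, while putting a fresh random nonce in the first plaintext block randomizes every subsequent ciphertext block via the CBC chaining:

```python
import os
import hashlib

BLOCK = 16  # AES block size in bytes

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def toy_block_encrypt(key, block):
    # stand-in for AES_K(block): deterministic and key-dependent,
    # which is all we need to demonstrate the leakage
    return hashlib.sha256(key + block).digest()[:BLOCK]

def cbc_encrypt(key, iv, plaintext):
    # zero-pad to a whole number of blocks (toy padding)
    plaintext += b"\x00" * ((-len(plaintext)) % BLOCK)
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        # CBC: C_i = E_K(P_i XOR C_{i-1}), with C_0 = IV
        prev = toy_block_encrypt(key, xor(plaintext[i:i + BLOCK], prev))
        out.append(prev)
    return b"".join(out)

key = b"k" * 16
fixed_iv = b"\x00" * BLOCK
msg = b"attack at dawn, same every time!"

# fixed IV + fixed key: equal plaintexts give equal ciphertexts,
# so an eavesdropper can see when the same message repeats
assert cbc_encrypt(key, fixed_iv, msg) == cbc_encrypt(key, fixed_iv, msg)

# a random nonce as the first plaintext block chains into every
# later block, hiding repeats even though the IV stays fixed
c1 = cbc_encrypt(key, fixed_iv, os.urandom(BLOCK) + msg)
c2 = cbc_encrypt(key, fixed_iv, os.urandom(BLOCK) + msg)
assert c1 != c2
```

This is why mik concludes the fixed IV is fine here: the nonce occupying the first block plays the IV's role.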
[2:47] * sagelap (~sage@217.sub-70-197-139.myvzw.com) has joined #ceph
[2:47] <mik> So the only thing is, that the nonce is 64bit
[2:47] <mik> And the AES block size is 128bit
[2:49] <mik> So in the description at the top of CephXProtocol.h
[2:49] <mik> A = mon
[2:49] <mik> P = client (eg. rbd.ko, qemu etc)
[2:49] <mik> S = osd
[2:50] <mik> ?
[2:50] <yehudasa> A = authenticator
[2:50] <yehudasa> P = client, osd, mon, mds
[2:50] <yehudasa> S = osd, mon, mds
[2:50] <mik> So the Authenticator is a separate process?
[2:50] <yehudasa> where mon is also the authenticator
[2:51] <mik> Ah, ok
[2:51] <mik> So it sounds like what you really want is just kerberos5 :)
[2:52] <mik> And use the negotiated session key that it produces as a hmac key
[2:53] <yehudasa> well, we would have used kerberos, except that we also needed an implementation that would scale and would not serve as a single point of failure
[2:53] <mik> Well, there's PKI, but that's a bit CPU heavy
[2:54] <mik> Kerberos is pretty scalable
[2:54] <mik> But HA is kind of active/standby
[2:54] <mik> Authenticating a distributed system is always a hard problem
[2:57] * sagelap (~sage@217.sub-70-197-139.myvzw.com) Quit (Ping timeout: 480 seconds)
[2:59] <mik> So authsecret is shared with mon, osd and mds, and rotates?
[3:01] <yehudasa> I think that authsecret is static per instance, svcsecret is shared among the same service type
[3:01] <yehudasa> svcsecret rotates
[3:01] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[3:02] * joao (~JL@ Quit (Remote host closed the connection)
[3:02] <mik> Ok, that's not too bad then...
[3:03] <mik> So I'd probably just find a better way to "rotate" keys, then somehow negotiate an HMAC key, and then I'd be happy
[3:03] <mik> For example, make a key by encrypting a timestamp with the "static" key
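mik's idea can be sketched with stdlib Python. This is my illustration of the suggestion, not anything in Ceph: derive a short-lived key from the static secret and a coarse timestamp, so every party holding the secret computes the same rotating key with no extra round trips. I use HMAC for the derivation rather than mik's "encrypt a timestamp" phrasing, since HMAC is the standard KDF-style primitive for this; the rotation period is an assumed parameter:

```python
import hmac
import hashlib
import time

ROTATION_PERIOD = 3600  # assumed: rotate hourly

def rotating_key(static_key, now=None):
    """Derive the current ephemeral key from the static shared secret."""
    if now is None:
        now = time.time()
    # every host computes the same epoch number within a window
    epoch = int(now) // ROTATION_PERIOD
    return hmac.new(static_key, str(epoch).encode(), hashlib.sha256).digest()

static = b"shared-static-secret"
# same window -> same derived key on every host
assert rotating_key(static, 1000.0) == rotating_key(static, 1500.0)
# different window -> different key, limiting exposure of any one key
assert rotating_key(static, 1000.0) != rotating_key(static, 5000.0)
```

A real scheme would also need to accept keys from adjacent windows to tolerate clock skew, which is roughly what the rotating service secrets yehudasa mentions have to handle.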
[3:04] <mik> But I guess you'll want me to find a break before I start messing with your auth protocol ;)
[3:05] <yehudasa> well, you're free to mess with it as much as you want to
[3:06] <yehudasa> whether any changes make any sense though...
[3:07] <mik> So the main attack vector I'm looking at is a compromised host gaining access to resources it's not authorised for
[3:09] <mik> But before I go re-inventing the wheel, I should probably look at what your security guy is doing with message signing
[3:16] <mik> On a different note
[3:16] <mik> With the object store stuff, what software stack would you use for a map/reduce style workload?
[3:16] <mik> Or is that not what it's for?
[3:22] <mik> Bah, gotta go
[3:22] <mik> thanks yehudasa
[3:22] * mik (~michael@office-pat.ipsystems.com.au) has left #ceph
[3:23] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[3:30] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[3:34] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[3:35] * senner (~Wildcard@24-196-37-56.dhcp.stpt.wi.charter.com) has joined #ceph
[3:35] * senner (~Wildcard@24-196-37-56.dhcp.stpt.wi.charter.com) Quit ()
[3:36] * senner (~Wildcard@24-196-37-56.dhcp.stpt.wi.charter.com) has joined #ceph
[3:36] * senner (~Wildcard@24-196-37-56.dhcp.stpt.wi.charter.com) Quit ()
[3:41] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[4:04] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[4:08] * gminks_ (~ginaminks@108-210-41-138.lightspeed.austtx.sbcglobal.net) has joined #ceph
[4:16] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[4:21] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[4:29] * dec (~dec@ec2-54-251-62-253.ap-southeast-1.compute.amazonaws.com) Quit (Remote host closed the connection)
[4:34] * kblin_ (~kai@kblin.org) has joined #ceph
[4:34] * gminks_ (~ginaminks@108-210-41-138.lightspeed.austtx.sbcglobal.net) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * Cube (~cube@ Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * hijacker (~hijacker@ Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * liiwi (liiwi@idle.fi) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * MarkS (~mark@irssi.mscholten.eu) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * jmcdice_ (~root@ Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * spaceman-39642 (l@ Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * Meths (rift@ Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * coredumb (~coredumb@ns.coredumb.net) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * acaos (~zac@209-99-103-42.fwd.datafoundry.com) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * rz (~root@ns1.waib.com) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * NaioN (stefan@andor.naion.nl) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * todin (tuxadero@kudu.in-berlin.de) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * wonko_be (bernard@november.openminds.be) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * kblin (~kai@kblin.org) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * DLange (~DLange@dlange.user.oftc.net) Quit (reticulum.oftc.net kilo.oftc.net)
[4:34] * dec (~dec@ec2-54-251-62-253.ap-southeast-1.compute.amazonaws.com) has joined #ceph
[4:35] * gminks_ (~ginaminks@108-210-41-138.lightspeed.austtx.sbcglobal.net) has joined #ceph
[4:35] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[4:35] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[4:35] * Cube (~cube@ has joined #ceph
[4:35] * hijacker (~hijacker@ has joined #ceph
[4:35] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[4:35] * liiwi (liiwi@idle.fi) has joined #ceph
[4:35] * MarkS (~mark@irssi.mscholten.eu) has joined #ceph
[4:35] * jmcdice_ (~root@ has joined #ceph
[4:35] * spaceman-39642 (l@ has joined #ceph
[4:35] * Meths (rift@ has joined #ceph
[4:35] * coredumb (~coredumb@ns.coredumb.net) has joined #ceph
[4:35] * wonko_be (bernard@november.openminds.be) has joined #ceph
[4:35] * todin (tuxadero@kudu.in-berlin.de) has joined #ceph
[4:35] * NaioN (stefan@andor.naion.nl) has joined #ceph
[4:35] * rz (~root@ns1.waib.com) has joined #ceph
[4:35] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[4:35] * acaos (~zac@209-99-103-42.fwd.datafoundry.com) has joined #ceph
[4:35] * kblin (~kai@kblin.org) has joined #ceph
[4:36] * kblin (~kai@kblin.org) Quit (Ping timeout: 480 seconds)
[4:58] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[5:09] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[5:28] * maelfius (~mdrnstm@ Quit (Quit: Leaving.)
[5:29] * maelfius (~mdrnstm@ has joined #ceph
[5:30] * gminks_ (~ginaminks@108-210-41-138.lightspeed.austtx.sbcglobal.net) Quit (Quit: gminks_)
[5:30] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[5:31] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[5:44] * miroslavk (~miroslavk@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[6:06] * Meyer__ (meyer@c64.org) has joined #ceph
[6:16] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[6:20] * maelfius (~mdrnstm@ Quit (Quit: Leaving.)
[6:22] * tren (~Adium@2001:470:b:2e8:a104:a841:6fb3:155d) Quit (Quit: Leaving.)
[6:24] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[6:25] <elder> sage1 remind me what the "kernel test" is in teuthology.
[6:25] <elder> Is it a workunit?
[6:25] <sage1> there is a kernel suite that runs a bunch of kernel jobs
[6:26] <sage1> ./schedule_suite.sh kernel master wip-kernel-foo elder@inktank.com
[6:27] <elder> Looks great. Thank you.
[6:27] <sage1> oh, you should double-check that that suite does msgr failure injection
[6:27] <elder> Is that going to use anything local other than my teuthology tree?
[6:27] <sage1> ~/src/ceph-qa-suite
[6:27] <elder> OK. What would that look like?
[6:27] <sage1> you'll want to make sure it's up to date
[6:27] <elder> That's why I asked.
[6:27] * tren (~Adium@216-19-187-18.dyn.novuscom.net) has joined #ceph
[6:28] <sage1> now that i think about it, the kernel suite doesn't inject failures... we should fix that tomorrow
[6:28] <sage1> :)
[6:29] <elder> Sounds like a good idea.
[6:30] <elder> Any way we can add that as a parameter of other teuthology runs somehow? (I don't know how it's accomplished). I would really like to cover that a LOT in our regular test runs.
[6:35] <sage1> best route is to add it to the suite, i think. the rados and rbd suites do lots of injection already.
[6:36] <sage1> there are some outstanding issues with the userspace fs client that prevent us from doing it for the fs, i think... it may be that the kernel client is ok, though. lets try it tomorrow.
[6:36] <elder> OK.
[6:36] <elder> Leaving shortly.
[6:41] * tren (~Adium@216-19-187-18.dyn.novuscom.net) Quit (Quit: Leaving.)
[6:45] <sage1> elder: 'night!
[6:45] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[6:45] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[6:54] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[6:58] * dmick (~dmick@2607:f298:a:607:1a03:73ff:fedd:c856) has left #ceph
[6:59] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[6:59] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[7:17] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[7:18] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[7:40] * miroslavk (~miroslavk@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[7:46] * hijacker (~hijacker@ Quit (Ping timeout: 480 seconds)
[7:57] * hijacker (~hijacker@ has joined #ceph
[8:05] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:16] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:21] * loicd (~loic@magenta.dachary.org) Quit ()
[8:34] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) has joined #ceph
[8:54] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[8:58] * cattelan (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) Quit (Ping timeout: 480 seconds)
[9:07] * cattelan (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) has joined #ceph
[9:11] * Leseb (~Leseb@2001:980:759b:1:e908:c6d:3724:cccf) has joined #ceph
[9:14] * loicd (~loic@ has joined #ceph
[9:29] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[10:09] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[10:10] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:10] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[10:10] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[10:13] * joao (~JL@ has joined #ceph
[10:23] * BManojlovic (~steki@ has joined #ceph
[10:27] * tziOm (~bjornar@ has joined #ceph
[10:34] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[10:38] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[10:40] * tziOm (~bjornar@ has joined #ceph
[10:41] <tziOm> I have a little problem here
[10:41] <tziOm> while playing, ran ceph mon add mon02
[10:41] <tziOm> and I allready had another mon running there...
[10:41] <tziOm> so now I cant do anything.. it just hangs..
[10:49] <tziOm> Anyone
[10:51] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[10:51] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[10:52] <joao> tziOm, what does './ceph -m mon_status' say?
[10:53] <tziOm> hangs
[10:54] <tziOm> thing was, I was running mon0 on
[10:54] <tziOm> then I ran ceph mon add mon02
[10:54] <tziOm> ...big mistake, because now I cannot do shit
[10:55] <joao> I'm assuming s/6/1/
[10:55] <joao> well, I believe it should not have worked
[10:55] <tziOm> it does not, thats for sure
[10:56] <joao> iirc, there are checks for same address/same name when adding monitors to the monmap
[10:58] <joao> tziOm, did 'ceph' return anything like "added mon.X at <address>"?
[11:00] <tziOm> yeah
[11:00] <tziOm> sure
[11:01] <tziOm> this mon is not running, tho (on .. but it is a bit scary anyway, that cluster is _dead_ for this reason..
[11:01] <tziOm> no, its
[11:01] <joao> so, there is no clash on address:port?
[11:01] <tziOm> no
[11:02] <joao> and the monitor, is it running?
[11:02] <joao> the new one
[11:02] <tziOm> no
[11:02] <joao> how many monitors does your cluster have?
[11:02] <tziOm> just one running (..and the other one..)
[11:03] <joao> that's it then
[11:03] <tziOm> quorum
[11:03] <joao> yes
[11:03] <tziOm> but heck.. should be possible to correct, yes?
[11:03] <joao> bring the new monitor up
[11:04] <joao> it will join the quorum and everything should be working in no time
[11:04] <tziOm> Hmm..
[11:04] <tziOm> unable to read magic from mon data.. did you run mkcephfs?
[11:04] <tziOm> I did not mkcephfs, and I cant seem to do it when ceph is hanging either..
[11:05] <joao> ./ceph-mon --mkfs -i <monid>
[11:05] <joao> should do wonders
[11:05] <tziOm> generated monmap has no fsid; use '--fsid <uuid>'
[11:06] <joao> right
[11:06] <joao> what version of ceph are you using?
[11:06] <tziOm> 48.2
[11:07] <joao> I think argonaut already has a 'current_uuid' file under the monitor data directory, where the fsid is kept
[11:07] <joao> but I would assume that
[11:07] <joao> ./ceph -m fsid
[11:07] <joao> would return the cluster fsid
[11:07] <tziOm> point is it hangs
[11:08] <tziOm> yeah, ok, managed with --admin-daemon
[11:08] <joao> yeah, that!
[11:08] <joao> I hate mornings
[11:08] <joao> should have thought of that right away :(
[11:09] <tziOm> Now the sucker can't read the keyring, but that's wrong..
[11:10] <joao> maybe it's a problem with ceph.conf?
[11:10] <tziOm> auth: error reading file: /etc/ceph/ceph.mon.mon02.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: can't open /etc/ceph/ceph.mon.mon02.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
[11:10] <joao> do you have the keyring on any of those locations?
[11:10] <tziOm> yes
[11:11] <tziOm> -rw------- 1 root root 63 Sep 30 18:10 /etc/ceph/ceph.keyring
[11:11] <joao> that's weird
[11:12] <tziOm> ah, bug?
[11:12] <tziOm> open("/etc/ceph/ceph.mon.mon02.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin", O_RDONLY) = -1 ENOENT (No such file or directory)
[11:12] <tziOm> ..that file does not exist here!
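The failure tziOm's strace shows is that the `keyring` option's default value is a comma-separated *search list* of paths, but the whole string was handed to open(2) as a single filename. A correct loader splits the list and tries each candidate in turn. A minimal Python sketch of that behavior (my illustration of the fix, not Ceph's actual C++):

```python
import os
import tempfile

def find_keyring(search_list):
    """Return the first existing path from a comma-separated search list."""
    for path in search_list.split(","):
        path = path.strip()
        if path and os.path.exists(path):
            return path
    return None

# demo under a temp dir: only ceph.keyring exists, as in the log
with tempfile.TemporaryDirectory() as d:
    present = os.path.join(d, "ceph.keyring")
    with open(present, "w") as f:
        f.write("[mon.]\n\tkey = ...\n")
    search = ",".join([
        os.path.join(d, "ceph.mon.mon02.keyring"),  # missing
        present,                                    # exists
        os.path.join(d, "keyring"),                 # missing
    ])
    assert find_keyring(search) == present
```

Joined into one string, those paths name a file that cannot exist, which is exactly the ENOENT in the strace above.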
[11:13] <joao> okay, the question might sound silly, but I've made that mistake soooo many times that I'm assuming other people can occasionally run into it
[11:13] <joao> but, are you looking for the ceph keyring on the same server where you're running the monitor? :p
[11:15] <tziOm> sure
[11:15] <tziOm> but if you look at strace, the file ceph-mon is trying to open does not exist anywhere on earth
[11:15] <joao> yes, hence my question
[11:16] <joao> you could have by mistake run the 'ls -l /etc/ceph/ceph.keyring' on some server that actually had it
[11:16] <tziOm> one can't use open that way...
[11:16] <joao> oh
[11:16] <tziOm> no - no, but look at C code
[11:16] <joao> yes
[11:16] <joao> I see what you're saying
[11:16] <joao> that's really, really weird
[11:16] <joao> and you are right
[11:16] <joao> let me run that here
[11:18] <joao> tziOm, I will make note of that
[11:18] <joao> for the time being
[11:18] <joao> run it with --keyring /etc/ceph/ceph.keyring (or wherever is the keyring file)
[11:18] <joao> it should work
[11:18] <tziOm> it is a definite bug
[11:19] <joao> it certainly appears so
[11:19] <joao> will look into it
[11:19] <tziOm> ceph-mon: created monfs at /var/lib/ceph/mon/ceph-mon02 for mon.mon02
[11:20] <joao> starting it should now work
[11:20] <tziOm> hmm.. now the mon is started
[11:20] <tziOm> but ceph still does not respond...
[11:20] <joao> let it simmer a bit
[11:20] * masterpe (~masterpe@ Quit (Quit: leaving)
[11:20] <tziOm> tcp 0 0* LISTEN 15468/ceph-mon
[11:20] <tziOm> tcp 0 0* LISTEN 10914/ceph-mon
[11:20] <joao> it will have to go through the probing phase and election
[11:20] * masterpe (~masterpe@2001:990:0:1674::1:82) has joined #ceph
[11:21] <joao> tziOm, you can check the status on the monitor with the --admin-daemon <path> mon_status
[11:21] <joao> it should be in either state probing or electing
[11:21] <tziOm> read only got 0 bytes of 4 expected for response length; invalid command?
[11:21] * guilhemfr (~guilhem@AMontsouris-652-1-103-134.w82-123.abo.wanadoo.fr) has joined #ceph
[11:22] <tziOm> hmm..
[11:22] <tziOm> but this works on mon.a
[11:22] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[11:22] <joao> what's the output?
[11:23] <tziOm> http://pastebin.com/nYkUHMFa
[11:24] <joao> (brb; going for a coffee refill)
[11:27] <joao> tziOm, my guess is that mon02 failed for some reason
[11:27] <joao> is it not so?
[11:27] <tziOm> Dunno..
[11:28] <joao> is it still running?
[11:28] <tziOm> yeah
[11:28] <tziOm> and answers on 6789
[11:29] <joao> yet './ceph --admin-daemon <path> mon_status' hangs?
[11:29] <joao> or produces an invalid response?
[11:29] <tziOm> no
[11:29] <tziOm> but a little strange is
[11:30] <tziOm> --admin-daemon /var/run/ceph/ceph-mon.a.asok help => shows mon_status, perf dump ...
[11:31] <tziOm> ceph --admin-daemon /var/run/ceph/ceph-mon.mon02.asok help => does not show quorum status or mon_status
[11:32] <joao> and running it with mon_status instead of help returns nothing?
[11:32] <tziOm> yes
[11:32] <tziOm> returns invalid command
[11:33] <joao> that's weird
[11:33] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[11:33] <joao> never seen that happening before
[11:34] <joao> tziOm, do you have a 'keyring' entry on your ceph.conf?
[11:34] <joao> this should be unrelated to the 'mon_status' command not running though
[11:34] <joao> err, not showing anything, I mean
[11:35] <joao> but *might* be the reason why the monitor is not entering the quorum
[11:35] <tziOm> joao, yes.
[11:35] <joao> we'd probably know more with logs
[11:35] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:35] <joao> I'm entering a fishing expedition here
[11:36] <joao> does ceph-mon.mon02.asok even exist?
[11:36] <tziOm> yeah yeah
[11:36] <tziOm> and it does "answer"
[11:37] <tziOm> but something is very weird here..
[11:37] <joao> yeah, it would have returned an error of sorts if it didn't
[11:37] <joao> are all the monitors being run the same version?
[11:37] <tziOm> I think the biggest problem here atm is me not being able (same version, yes) to remove the mon!!
[11:38] <joao> yeah, I'm not sure that's the biggest problem
[11:38] <joao> because afaik, we did everything by the book
[11:38] <tziOm> I think so...
[11:39] <joao> there's just something missing that is adding up
[11:39] <tziOm> and cluster is very down, and seems more or less impossible to fix it.
[11:39] <joao> *isn't
[11:40] <joao> any chance you can get me mon02 log?
[11:46] <tziOm> yeah..
[11:46] <tziOm> lets see, perhaps enable some debugging? howto?
[11:47] <joao> --debug-auth 20 --debug-mon 20 --debug-ms 10
[11:47] <joao> I usually prefer to do it directly on the cli
[11:47] <joao> but if you get that into the ceph.conf it will work perfectly as well
[11:48] <joao> actually, on the ceph.conf it should be under [mon]
[11:48] <joao> and as
[11:48] <joao> debug auth = 20
[11:48] <joao> debug mon = 20
[11:48] <joao> debug ms = 10
[11:52] <joao> tziOm, what does the 'keyring' entry in your ceph.conf look like?
[11:52] <joao> is it just a single path?
[11:52] * tryggvil (~tryggvil@163-60-19-178.xdsl.simafelagid.is) Quit (Quit: tryggvil)
[11:54] <tziOm> yes
[11:56] <tziOm> [mon]\nkeyring /etc/ceph/keyring
[11:57] <joao> I thought it was on /etc/ceph/ceph.keyring
[11:57] <joao> besides, I think it needs a '=' between 'keyring' and the path
[11:57] <tziOm> it has
[11:58] <tziOm> its ok, i wrote this just here..
[11:58] <joao> oh, okay :)
[11:58] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[11:59] <joao> here's the thing that I suspect to be the culprit on everything that has gone wrong:
[11:59] <joao> 2012-10-09 11:48:26.820366 7fe493bb3780 -1 mon.mon02@-1(probing) e0 unable to load initial keyring /etc/ceph/ceph.mon.mon02.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
[11:59] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Remote host closed the connection)
[11:59] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[12:00] <tziOm> ah..
[12:01] <tziOm> this is probably the same bug as earlier
[12:01] <joao> yeah
[12:01] <tziOm> it tries to load it as one filename, instead of 4
[12:01] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit ()
[12:01] <tziOm> I managed to fix it now, it seems
[12:01] <joao> the monitor should parse that option, but for some reason does not
[12:01] <tziOm> cp /etc/ceph/ceph.keyring /var/lib/ceph/mon/ceph-mon02/keyring
[12:02] <tziOm> did work
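The bug under discussion is the comma-separated keyring search path being treated as one filename. A minimal sketch of the intended behaviour, assuming hypothetical temp paths (this is not Ceph's code, just the idea: try each candidate in order, use the first that exists):

```shell
# Simulate the keyring search path from the error message above; only one
# of the candidate files actually exists.
tmp=$(mktemp -d)
touch "$tmp/ceph.keyring"   # pretend only this candidate exists
candidates="$tmp/ceph.mon.mon02.keyring,$tmp/ceph.keyring,$tmp/keyring"
found=""
IFS=','
for f in $candidates; do
    if [ -f "$f" ]; then
        found="$f"
        break
    fi
done
unset IFS
echo "using keyring: $found"
```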
[12:02] <joao> cool!
[12:02] <tziOm> now cluster is working again, but what a mess!!
[12:02] <joao> hey, it was probably my fault on the --mkfs command; not sure why it didn't initialize the keyring in the mon's data dir in the first place
[12:03] <joao> another thing to look into :)
[12:04] <tziOm> hmm...
[12:04] <tziOm> I need more coffee
[12:04] <joao> that makes two of us
[12:04] <joao> and I have to make a fresh batch
[12:12] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[12:13] <tziOm> we have the machine ;)
[12:14] <tziOm> Ok, so this is a done deal for now (but I hope you will spread the word about this bug?)
[12:15] <joao> yes, I'll be looking into it still this morning, and if by any chance it is new, will make sure action is taken accordingly ;)
[12:17] <tziOm> huh..
[12:17] <tziOm> found another strange thing here..
[12:17] <tziOm> ceph auth get mon.
[12:17] <tziOm> => gives me key...
[12:17] <tziOm> running command again: failed to find mon. in keyring
[12:18] <tziOm> wtf?
[12:19] <joao> tziOm, try running that command but with -m <ip>:<port> for each monitor you have
[12:20] <joao> I have a feeling that one will return the key and the other will return the error
[12:20] <tziOm> seems so..
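joao's per-monitor check can be run as a loop; the monitor addresses here are hypothetical, and the commands are only printed rather than sent to a live cluster:

```shell
# Print the same auth query once per monitor; with a real cluster you
# would drop the echo and run each command directly.
cmds=$(for addr in mon01:6789 mon02:6789; do
    echo "ceph -m $addr auth get mon."
done)
echo "$cmds"
```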
[12:20] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[12:21] <joao> and is mon02 the one returning the error by any chance? :)
[12:21] <jtang> has there been any application that use CRUSH outside of of the ceph project?
[12:21] <tziOm> joao, sure
[12:21] <joao> jtang, not that I am aware of
[12:22] <joao> tziOm, yeah, this must definitely be looked into
[12:22] <joao> there's something nasty going on with the monitor store
[12:23] <jtang> ok
[12:29] <tziOm> Hmm..
[12:29] <tziOm> seems like cluster hangs when adding new monitor (ceph mon add <name> <ip>[:<port>]) before actually starting it...
[12:31] <jtang> dont you need to give it the map as well?
[12:31] <tziOm> yes
[12:31] <jtang> i vaguely remember having the same problem before
[12:31] <tziOm> I now followed http://ceph.com/docs/master/cluster-ops/add-or-rm-mons/ exactly
[12:32] <tziOm> ceph mon delete mon02
[12:32] <tziOm> unknown command delete
[12:32] <tziOm> huh
[12:32] <jtang> cool, more docs, I don't remember seeing those docs before
[12:32] <tziOm> help says delete.. the actual command that works is remove.. nice
[12:34] <tziOm> damn, this is buggy
[12:36] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[12:42] <tziOm> Have you guys ever tried removing a monitor?!
[12:48] <tziOm> So for me it seems that the ceph.conf might be the problem here...
[12:49] <tziOm> isn't it shitty if one needs to keep synced plain-text files around the cluster at all times... doesn't it break the very idea..?
[12:49] <Fruit> there are several solutions for that, like chef, puppet, caspar
[12:51] <tziOm> yeah..
[12:51] <tziOm> but it's still doing the same..
[12:51] <tziOm> so you then need to detect that a monitor is down and copy configs around before the cluster works again!?
[12:51] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:53] <Fruit> "make install" done
[12:54] <Fruit> but that should only be necessary if the monitor is permanently removed
[12:55] <Fruit> which is not something you'd do every day anyway
[12:55] <tziOm> Seems not here..
[12:55] <tziOm> try stopping a monitor then try ceph mon status (with monitor still in ceph.conf)
[12:55] <tziOm> don't think it is just here that this fails!
[12:56] <Fruit> yeah it's slower, but it doesn't hang permanently for me
[12:56] <tziOm> nice?
[12:57] <tziOm> This cant be as designed?
[13:01] <Fruit> it's inherent in tcp
[13:01] <Fruit> hrm
[13:01] <Fruit> does it give a connection refused or a connection timeout?
[13:01] <Fruit> i.e., is the whole machine down? because then it's understandable
[13:03] <Fruit> ah I think I see
[13:03] <Fruit> when the ceph command gets a connection refused, it waits a little and then retries
[13:03] <Fruit> if you ask me it should just try the next mon instead
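Fruit's suggestion, sketched: on connection refused, move to the next monitor in the list instead of sleeping and retrying the same one. `try_mon` is a hypothetical stand-in for the real connect; here mon-a is "down":

```shell
# Failover loop: try each monitor once, stop at the first that accepts.
try_mon() {
    [ "$1" != "mon-a" ]   # pretend only mon-a refuses connections
}
connected=""
for m in mon-a mon-b mon-c; do
    if try_mon "$m"; then
        connected="$m"
        break
    fi
done
echo "connected to: $connected"
```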
[13:06] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[13:06] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[13:14] <tziOm> even worse if ceph mon del monfoo
[13:14] <tziOm> then it hangs forever!
[13:15] <tziOm> for some unknown reason..
[13:16] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[13:17] <joao> Fruit, where did you see that 'ceph' would retry the same monitor when it gets a connection refused?
[13:17] <joao> been looking and haven't found it yet
[13:17] <joao> a hint would be appreciated ;)
[13:30] <Fruit> saw it with strace
[13:33] <Fruit> [pid 13021] 13:33:05 connect(3, {sa_family=AF_INET, sin_port=htons(6789), sin_addr=inet_addr("")}, 16) = -1 ECONNREFUSED (Connection refused)
[13:33] <Fruit> [pid 13021] 13:33:05 futex(0x1e7680c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 5, {1349782386, 42212000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[13:33] <Fruit> first line is the failed connection, second is the ‘sleep’
[13:34] <Fruit> there's a second connrefused following that last line, one second later
[13:46] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[13:48] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:54] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[14:14] <Fruit> hrm, what are the remaining issues with cephfs? can't get it to break over here
[14:22] * mtk (jf8uae0jfM@panix2.panix.com) has joined #ceph
[14:31] * gminks_ (~ginaminks@108-210-41-138.lightspeed.austtx.sbcglobal.net) has joined #ceph
[14:33] <tziOm> Because of this mon down lockup, do you usually do ip takeover on mons?
[14:46] * dty (~derek@pool-71-178-175-208.washdc.fios.verizon.net) Quit (Quit: dty)
[14:46] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[14:56] <wido> tziOm: IP takeover on mons?
[15:04] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Ping timeout: 480 seconds)
[15:08] <tziOm> yes?
[15:12] * dty (~derek@testproxy.umiacs.umd.edu) has joined #ceph
[15:12] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:17] <wido> tziOm: What would be the point of that? The clients should swap over to another mon since they have the monmap
[15:17] <wido> No need for doing something special with IP-swapping while a monitor is down
[15:18] <tziOm> if you check further up...
[15:18] <tziOm> you will see..
[15:22] <nhm> good morning #ceph
[15:23] <joao> morning nhm :)
[15:23] <wido> tziOm: Ah, I missed that one
[15:29] * loicd (~loic@magenta.dachary.org) has joined #ceph
[15:40] * tryggvil (~tryggvil@163-60-19-178.xdsl.simafelagid.is) has joined #ceph
[15:48] * gminks_ (~ginaminks@108-210-41-138.lightspeed.austtx.sbcglobal.net) Quit (Quit: gminks_)
[15:55] * tren (~Adium@216-19-187-18.dyn.novuscom.net) has joined #ceph
[15:57] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[16:00] * tren (~Adium@216-19-187-18.dyn.novuscom.net) Quit ()
[16:04] <tziOm> I dont seem to be able to write data to my osd's
[16:04] <tziOm> health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
[16:04] <tziOm> osdmap e9: 2 osds: 2 up, 2 in
[16:10] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[16:11] * deepsa (~deepsa@ has joined #ceph
[16:15] * gminks (~ginaminks@66-90-217-168.dyn.grandenetworks.net) has joined #ceph
[16:21] <tziOm> I can run rados df
[16:21] <tziOm> but a rados -p data ls just hangs..
[16:38] <tziOm> Huh
[16:38] <tziOm> not very hard to find ceph bugs
[16:38] <tziOm> for example mon is crashing now if I run ceph osd tree
[16:38] <tziOm> nice
[16:39] <joao> care to share the log on that so we can take a look? :)
[16:39] <tziOm> http://pastebin.com/8BPGTwHS
[16:40] <joao> what happened before that?
[16:56] * gregaf1 (~Adium@cpe-76-174-249-52.socal.res.rr.com) has joined #ceph
[16:56] * gregaf1 (~Adium@cpe-76-174-249-52.socal.res.rr.com) Quit ()
[16:59] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[16:59] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[17:02] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[17:03] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[17:08] * Fruit managed to crash cephfs after all :P
[17:08] <Fruit> ceph: ceph_add_cap: couldn't find snap realm 100
[17:09] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:14] * Tv_ (~tv@2607:f298:a:607:5c1e:e9a0:aa30:35e7) has joined #ceph
[17:20] * tren (~Adium@ has joined #ceph
[17:25] * gaveen (~gaveen@ has joined #ceph
[17:26] * jlogan1 (~Thunderbi@2600:c00:3010:1:9914:e2f:738e:1fb1) has joined #ceph
[17:27] * BManojlovic (~steki@ has joined #ceph
[17:27] * tren1 (~Adium@ has joined #ceph
[17:27] * tren (~Adium@ Quit (Read error: Connection reset by peer)
[17:30] * gminks (~ginaminks@66-90-217-168.dyn.grandenetworks.net) Quit (Quit: gminks)
[17:31] <joao> sagewk, whenever you have the time, please check on bug #3276 (http://tracker.newdream.net/issues/3276) and potential fix on mon-3276-fix
[17:31] <joao> tziOm, fyi ^^^
[17:32] * gminks_ (~ginaminks@66-90-217-168.dyn.grandenetworks.net) has joined #ceph
[17:32] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[17:34] * sagelap (~sage@84.sub-70-197-143.myvzw.com) has joined #ceph
[17:45] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[17:46] * tryggvil (~tryggvil@163-60-19-178.xdsl.simafelagid.is) Quit (Quit: tryggvil)
[17:48] <guilhemfr> hi all
[17:48] <guilhemfr> is there someone to discuss about chef cookbook ?
[17:49] <guilhemfr> I already do some work in this branch : http://goo.gl/TqKzJ
[17:49] * Leseb_ (~Leseb@ has joined #ceph
[17:49] <guilhemfr> I'm faced with many choices and decisions and I would like to discuss them with someone from Inktank so we stay aligned
[17:52] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Quit: Leaving.)
[17:52] * Leseb (~Leseb@2001:980:759b:1:e908:c6d:3724:cccf) Quit (Read error: Connection reset by peer)
[17:52] * Leseb_ is now known as Leseb
[17:56] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[17:59] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[18:02] <gregaf> guilhemfr: Tv_ is probably your best bet; he's our "argh make it devops" guy ;)
[18:02] <gregaf> or I could talk later today or tomorrow, but I'm busy this morning
[18:03] <Tv_> guilhemfr: well hello there
[18:09] <guilhemfr> Hi
[18:17] * sagelap1 (~sage@ has joined #ceph
[18:20] * sagelap (~sage@84.sub-70-197-143.myvzw.com) Quit (Ping timeout: 480 seconds)
[18:21] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[18:23] * aliguori_ (~anthony@cpe-70-123-130-163.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:24] * gminks_ (~ginaminks@66-90-217-168.dyn.grandenetworks.net) Quit (Quit: gminks_)
[18:25] * gminks_ (~ginaminks@66-90-217-168.dyn.grandenetworks.net) has joined #ceph
[18:30] <tren1> gregaf: Morning :)
[18:30] * tren1 is now known as tren
[18:30] <gregaf> morning
[18:30] <gregaf> I have no news for you, sorry :(
[18:31] <tren> gregaf: no worries…the rsync is still going. the ceph-fuse process is at 17ish GB
[18:31] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[18:39] * The_Bishop (~bishop@2001:470:50b6:0:40cc:ebb6:79e3:a011) has joined #ceph
[18:39] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:43] <sagewk> slang: merged the libcephfs gtest stuff. is there a ceph-qa-suite task and workunit in place yet?
[18:43] <nhm> Cube1: ping
[18:43] <slang> sagewk: there is one already, yes
[18:43] <slang> sagewk: I need to figure out if/when it gets run
[18:44] <sagewk> k
[18:44] <slang> ./suites/smoke/basic/tasks/libcephfs_interface_tests.yaml
[18:44] <slang> ./suites/smoke/verify/tasks/libcephfs_interface_tests.yaml
[18:44] <slang> ./suites/fs/basic/tasks/libcephfs_interface_tests.yaml
[18:44] <slang> ./suites/fs/verify/tasks/libcephfs_interface_tests.yaml
[18:46] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[18:46] <nhm> slang: nice!
[18:46] <slang> nhm: yep it is, except I didn't write those so I shouldn't get the credit :-)
[18:47] <nhm> slang: sorry, I was half responding to your chaos monkeys email. ;)
[18:49] * aliguori (~anthony@ has joined #ceph
[18:50] <sagewk> slang: those are running a workunit .sh?
[18:50] <sagewk> that now runs the new gtest binary?
[18:50] <sagewk> (feel free to move things around if it isn't organized well)
[18:50] <Tv_> i want to call messenger randomization chaos monkeys seamonkeys ;)
[18:51] <slang> sagewk: the binary didn't actually change
[18:53] * vata (~vata@2607:fad8:4:0:d15e:437d:f3bd:c559) has joined #ceph
[19:01] * dty_ (~derek@129-2-129-152.wireless.umd.edu) has joined #ceph
[19:02] * dmick (~dmick@2607:f298:a:607:2134:dd42:86b5:dc17) has joined #ceph
[19:03] * ChanServ sets mode +o dmick
[19:03] * guilhemfr (~guilhem@AMontsouris-652-1-103-134.w82-123.abo.wanadoo.fr) Quit (Quit: Quitte)
[19:04] * chutzpah (~chutz@ has joined #ceph
[19:07] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[19:08] * dty (~derek@testproxy.umiacs.umd.edu) Quit (Ping timeout: 480 seconds)
[19:08] * dty_ is now known as dty
[19:12] <Tv_> anyone here more knowledgeable about dh_install than me?-(
[19:12] <Tv_> i want excludes inside the *.install files
[19:12] <Tv_> i don't think that exists
[19:15] * dmick1 (~dmick@2607:f298:a:607:90be:2bd9:9652:87b3) has joined #ceph
[19:18] * steki-BLAH (~steki@ has joined #ceph
[19:18] * Cube1 (~Adium@ has joined #ceph
[19:19] * dmick (~dmick@2607:f298:a:607:2134:dd42:86b5:dc17) Quit (Ping timeout: 480 seconds)
[19:19] * tziOm (~bjornar@ti0099a340-dhcp0358.bb.online.no) has joined #ceph
[19:19] * loicd (~loic@ has joined #ceph
[19:22] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[19:24] <tziOm> How can i figure out why a osd is down?
[19:24] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[19:24] * Leseb (~Leseb@ Quit (Quit: Leseb)
[19:27] <joshd> tziOm: there's usually something at the end of the osd's log
[19:27] * amatter_ (~amatter@ Quit (Read error: Connection reset by peer)
[19:28] <dmick1> ceph -s gives you some status; examining the log gives more
[19:28] <dmick1> jinx
[19:28] * dmick1 is now known as dmick
[19:32] * gminks_ (~ginaminks@66-90-217-168.dyn.grandenetworks.net) Quit (Quit: gminks_)
[19:34] * miroslavk (~miroslavk@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[19:34] * nwatkins (~nwatkins@soenat3.cse.ucsc.edu) has joined #ceph
[19:34] * gminks_ (~ginaminks@66-90-217-168.dyn.grandenetworks.net) has joined #ceph
[19:35] <tziOm> Also ceph osd tree crashes here!
[19:36] <joshd> that's odd... do you have a custom crushmap, and an earlier version of ceph on the client than the server?
[19:37] <joshd> a pastebin of the crash would also be helpful
[19:39] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) Quit (Quit: Leaving.)
[19:39] * synapsr (~synapsr@72-18-242-2.static-ip.telepacific.net) has joined #ceph
[19:40] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:43] * amatter (~amatter@ has joined #ceph
[19:45] <tziOm> no
[19:46] <tziOm> it was the crushmap, yes
[19:46] <tziOm> Perhaps I changed the id's, could that be it?
[19:49] <joshd> possibly, I'm not sure what exactly depends on those
[19:53] * dty (~derek@129-2-129-152.wireless.umd.edu) Quit (Quit: dty)
[19:54] * dty (~derek@129-2-129-152.wireless.umd.edu) has joined #ceph
[19:54] * dty (~derek@129-2-129-152.wireless.umd.edu) Quit (Remote host closed the connection)
[19:55] * dty (~derek@testproxy.umiacs.umd.edu) has joined #ceph
[19:55] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[20:00] * synapsr_ (~synapsr@72-18-242-2.static-ip.telepacific.net) has joined #ceph
[20:00] * synapsr (~synapsr@72-18-242-2.static-ip.telepacific.net) Quit (Read error: Connection reset by peer)
[20:01] * synapsr (~synapsr@72-18-242-2.static-ip.telepacific.net) has joined #ceph
[20:06] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:06] <tziOm> so you want the logs?
[20:06] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:08] * synapsr_ (~synapsr@72-18-242-2.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[20:10] * Leseb__ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:10] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[20:10] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[20:10] * Leseb__ is now known as Leseb
[20:11] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[20:12] <joshd> for the osd tree problem, if you could attach the file generated by 'ceph osd getcrushmap -o /tmp/crushmap' to a bug that'd be great
[20:14] <sagewk> tziom: are you running latest master?
[20:15] * tziOm (~bjornar@ti0099a340-dhcp0358.bb.online.no) Quit (Ping timeout: 480 seconds)
[20:15] * tziOm (~bjornar@ti0099a340-dhcp0358.bb.online.no) has joined #ceph
[20:16] <sagewk> i just pushed a fix to the osd tree code yesterday that fixed a crash in certain cases
[20:18] * gregaf (~Adium@ has left #ceph
[20:19] * gregaf (~Adium@ has joined #ceph
[20:19] <Tv_> oh, automake
[20:19] <tziOm> this is git from ~7 hrs ago
[20:19] <Tv_> i can't get a new SUBDIRS entry to take effect.. i'm doing something simple wrong, but what
[20:19] <Tv_> it never gets a Makefile.in
[20:23] * gminks_ (~ginaminks@66-90-217-168.dyn.grandenetworks.net) Quit (Quit: gminks_)
[20:25] <tziOm> sagewk, have you looked at my logs?
[20:25] * Ryan_Lane (~Adium@ has joined #ceph
[20:26] * gminks_ (~ginaminks@66-90-217-168.dyn.grandenetworks.net) has joined #ceph
[20:27] <tziOm> this is c038c3f653389685d4f1f89527d72c15c6e62b50 and 3.6.1
[20:29] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[20:32] * buck (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[20:39] * synapsr (~synapsr@72-18-242-2.static-ip.telepacific.net) Quit (Remote host closed the connection)
[20:39] * mtk (jf8uae0jfM@panix2.panix.com) Quit (Remote host closed the connection)
[20:40] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[20:43] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:43] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[20:43] * Leseb_ is now known as Leseb
[20:50] * maelfius (~mdrnstm@ has joined #ceph
[20:57] <nhm> anyone around?
[20:58] <joao> I'm here
[20:58] <slang> *waves*
[20:58] <nhm> joao: any idea why rbd kernel client might start trhowing I/O errors?
[20:58] <joao> none
[20:59] <nhm> ok, looks like some OSDs went down.
[21:01] <tziOm> is anyone running a stable ceph cluster!?
[21:01] <tren> Define "stable"?
[21:02] <tziOm> something you don't have to babysit and that doesn't crash or behave unexpectedly 50% of the time
[21:02] <nhm> tziOm: Sure, I've run aging test clusters with a constant mix of nasty writes until the drives fill up, then do the whole process over again.
[21:02] <tziOm> can't be just me, is what I am thinking..
[21:03] <tziOm> I have been doing tests for a week or so, but it seems everything I touch breaks..
[21:03] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Ping timeout: 480 seconds)
[21:04] <tren> tziOm: Have you worked with tweaking any of the timeouts between pieces? Ceph in its default state is pretty fragile I've noticed for any large (greater than 60 osd) clusters :)
[21:04] <Fruit> all I've seen crash so far is the cephfs client, which isn't officially declared stable anyway
[21:05] <tziOm> seems I can't even get the cluster up and running without " health HEALTH_WARN 23 pgs degraded; 38 pgs stuck unclean; recovery 4/44 degraded (9.091%)"
[21:06] * gminks_ (~ginaminks@66-90-217-168.dyn.grandenetworks.net) Quit (Remote host closed the connection)
[21:06] <tren> tziOm: The initial startup of the cluster has replication that needs to happen to actually build the cluster (as I understand it)
[21:07] <tren> it's exacerbated if you modify the pool replica count (I usually keep 3 copies of data)
[21:08] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[21:13] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[21:19] <tziOm> this sounds like a chicken and egg problem to me..
[21:20] <dmick> tziOm: that sounds different from my experience as well
[21:21] <dmick> how's the time sync among your servers?
[21:22] <tziOm> ntp, so should be good
[21:33] <Tv_> hah!
[21:33] <Tv_> tziOm: ntp explicitly stops trying if it's too far out of sync
[21:33] <Tv_> tziOm: but that degraded/unclean stuff sounds like your osds are having issues talking to each other
[21:35] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[21:38] <tziOm> but how do I fix these problems? Seems undocumented..
[21:39] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[21:40] * aliguori (~anthony@ has joined #ceph
[21:40] <joshd> tziOm: normally degraded and unclean pgs recover by themselves
[21:41] <joshd> tziOm: but you made some crushmap changes, right? you could have accidentally created a crushmap that doesn't work the way you intended
[21:41] <joshd> what is your crushmap?
[21:42] <tziOm> I have no changes now, actually..
[21:43] <tziOm> http://pastebin.com/Ew0pGcGc
[21:44] <dmick> tziOm: the cluster should work at healing itself, but there will be problems without good clock sync. You might check the state of the ntpd's just to be sure
[21:46] <joshd> to be specific, the monitors need to have their time in synch, the rest of the cluster doesn't matter
[21:47] <tziOm> 9 Oct 21:46:47 ntpdate[29093]: adjust time server offset -0.013694 sec
[21:47] <tziOm> only one monitor.
[21:47] <joshd> tziOm: does 'ceph pg dump' show any empty lists ([])?
[21:47] <tziOm> no, lots of stuff
[21:57] <tziOm> any more ideas
[21:58] <tziOm> 0 -- >> pipe(0x3708b40 sd=29 :6805 pgs=2355 cs=1 l=0).fault with nothing to send, going to standby
[21:59] <joshd> if the degraded count isn't going down, enabled debugging on the osds and restart one of them so it re-peers
[22:00] <sagewk> josh, tziom: you can also identify which pg(s) are degraded and 'ceph pg <pgid> query' to see what's going on
[22:02] <tren> tziOm: That's not an error. It's informative. I think the wording with "fault with nothing to send" is a little misleading.
[22:04] <joshd> tziOm: if you have any pgs that aren't created yet that'd be a problem too. could you pastebin ceph pg dump?
[22:05] <sagewk> joao: around?
[22:05] <joao> here
[22:06] <sagewk> on the kerying::load thing.. what caller was passing in a comma separated list? i think the problem is there.
[22:06] <sagewk> you'll notice a few functions up from_ceph_contexts() does
[22:06] <sagewk> if (ceph_resolve_file_search(conf->keyring, filename)) {
[22:06] <sagewk> ret = load(cct, filename);
[22:06] <joao> oh
[22:06] <joao> that's neat
[22:06] <sagewk> which does the search through the list for a valid file and then calls load(). load() shouldn't need to do that part
[22:06] <joao> sagewk, it was on the monitor
[22:06] <joao> Monitor::mkfs
[22:07] <sagewk> ah
[22:07] <joao> sagewk, makes sense; I'll patch it up
[22:07] <sagewk> yeah, we should just copy that file search call there.
[22:07] <sagewk> we don't want the other stuff in from_ceph_context() in the mkfs case.
[22:07] <joao> and scrap this patch
[22:07] <sagewk> cool thanks!
[22:11] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[22:11] <tziOm> what does this mean, then? http://pastebin.com/raw.php?i=SzYQBVRu
[22:12] <sagewk> 3 3 ubuntu@plana26
[22:12] <sagewk> 3 3 ubuntu@burnupi64
[22:12] <sagewk> anyone know who those are?
[22:13] <joshd> tziOm: notice the up and acting set are both a single osd [0] - that's why it's stuck degraded
[22:14] <dmick> I'm betting plana26 is samf; just a hunch
[22:15] <dmick> burnupi64: good bets are Tamil or Ken
[22:17] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:18] <tziOm> joshd, hmm.. why is this, then?
[22:19] <dmick> how many OSDs do you think you have configured?
[22:19] <tziOm> 2
[22:19] <joshd> tziOm: well, I don't see any problem with your crush map... what does 'ceph osd dump' show?
[22:19] * gaveen (~gaveen@ has joined #ceph
[22:20] <dmick> are they both running? (osd dump and/or ps, even)
[22:20] <tziOm> http://pastie.org/5026075
[22:20] <tziOm> both running, yes
[22:24] <joshd> what about ceph pg dump?
[22:27] <tziOm> http://pastie.org/5026108
[22:31] <joshd> this is a strange case, since a bunch of those are remapped
[22:31] <tziOm> probably easier to read this: http://pastie.org/pastes/5026130/text
[22:31] <joshd> do you have a copy of your old crushmap that you'd modified?
[22:32] <tziOm> did not modify this, only added osd.1
[22:32] <joshd> did you use mkcephfs to create the cluster?
[22:33] <tziOm> ceph osd crush set 1 osd.1 1.0 root=default rack=unknownrack host=
[22:33] <joshd> ok, yeah, the second paste is clearer
[22:34] <sagewk> joao: you tested that with both a straight filename, a list, and a bad list?
[22:34] <joshd> all the degraded/remapped pgs seem to have never been created in the first place
[22:34] * jlogan1 (~Thunderbi@2600:c00:3010:1:9914:e2f:738e:1fb1) Quit (Quit: jlogan1)
[22:34] <tziOm> I used mkcephfs to create mon.a, mds.a and osd.0
[22:34] <joao> sagewk, you were quicker than me; I just pushed it!
[22:35] <joao> sagewk, yes, yes and yes
[22:35] <sagewk> joao: great thanks
[22:35] <joshd> tziOm: so running 'ceph pg force_create_pg <pgid>' for each pg in 'ceph pg dump | egrep "(degraded|remapped)"' should do the trick
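joshd's loop, sketched against a fake three-line excerpt of `ceph pg dump` output (real usage would pipe the actual dump, which has many more columns; here the force_create_pg commands are printed rather than executed):

```shell
# Stand-in excerpt of `ceph pg dump`: column 1 is the pgid, column 2 a
# simplified state.
pg_dump='1.20 degraded
1.21 active+clean
1.22 remapped'
# Emit one force_create_pg command per degraded/remapped pg.
echo "$pg_dump" | grep -E 'degraded|remapped' \
    | awk '{print "ceph pg force_create_pg " $1}'
```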
[22:35] <joao> tested with --keyring <existing-keyring-file>, --keyring <multiple-non-existent-files> and --keyring <multiple-non-existent-files+existing-file>
[22:35] <joao> hope that covers it
[22:37] <joshd> tziOm: I'm still not sure how your cluster avoided creating those pgs... did you originally install an older version? there was a bug that caused this to happen a while ago
[22:37] <tziOm> no, this is fresh cluster with just this version
[22:38] <tziOm> ok.. done, but I still have
[22:38] <tziOm> health HEALTH_WARN 38 pgs stuck inactive; 38 pgs stuck unclean
[22:38] <tziOm> osd.1 [ERR] mkpg 1.20 up [1] != acting [1,0]
[22:39] <tziOm> ..alot of them
[22:39] <joshd> did your osd.0 just go down?
[22:39] <gregaf> joshd: sounds like the bugs with CRUSH failing to map when the number of buckets and leaves is similar to the number of replicas
[22:39] <tziOm> joshd, no
[22:40] <joshd> gregaf: I thought that doesn't happen with the default crushmap and no down osds
[22:40] <gregaf> is it using chooseleaf, and are there nested buckets?
[22:40] <gregaf> mkcephfs tries to do that based on the number of OSDs it knows about
[22:40] <gregaf> but eg the Chef stuff doesn't; all my 2-node crowbar installs have degraded PGs (not uncreated ones, though)
[22:41] <gregaf> I wasn't tracking this whole conversation so I don't know how the cluster came into existence, but I believe I recall tziOm messing with CRUSH maps and placement
[22:41] <joshd> the crushmap in this case is using choose (the default), not chooseleaf
[22:41] <joshd> I thought that bug only happened with down osds though
[22:42] <tziOm> gregaf, I messed around with crushmaps earlier, but started from scratch again
[22:42] <joshd> gregaf: if it's any osds, regardless of up/down, we should change the default crushmap
[22:42] <tziOm> I just don't understand why you guys are saying "ceph is running stable..." while I seem to find every possible bug before I have even somehow gotten it up and running...
[22:44] <Tv_> tziOm: because one person having trouble doesn't mean the world is on fire
[22:45] <Tv_> e.g. http://dreamhost.com/cloud/dreamobjects/ has been running fine for quite a while now
[22:46] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:47] <joshd> tziOm: if gregaf is correct, you can fix the issue by changing each 'choose' in your crushmap to 'chooseleaf'
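The usual round-trip for joshd's crushmap edit is getcrushmap / crushtool decompile / edit / compile / setcrushmap; since those need a live cluster they are shown as comments, and only the edit itself runs here, as a sed over a sample rule line:

```shell
# Round-trip against a real cluster (comments only):
#   ceph osd getcrushmap -o crushmap
#   crushtool -d crushmap -o crushmap.txt
#   (edit crushmap.txt: change 'step choose' to 'step chooseleaf')
#   crushtool -c crushmap.txt -o crushmap.new
#   ceph osd setcrushmap -i crushmap.new
rule='step choose firstn 0 type host'
echo "$rule" | sed 's/step choose /step chooseleaf /'
```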
[22:48] <tren> Tv_: How many OSD's are in that cluster? *curious* :)
[22:48] <tziOm> Tv_, no but I follow guides on ceph.com
[22:49] <tziOm> Tv_, and I am not completely green.. odd things happen.
[22:49] * synapsr (~synapsr@72-18-242-2.static-ip.telepacific.net) has joined #ceph
[22:49] <gregaf> joshd: mkcephfs' CRUSH map is done differently; we expect that people building it through other methods know what they're doing
[22:50] <gregaf> eg it intelligently decides the number of levels to build in based on the total number of OSDs and their distribution
[22:51] <tziOm> I set now chooseleaf..
[22:51] <tziOm> does not seem to have any effect..
[22:52] <Tv_> tren: multiple petabytes is all i dare to say right now; i forget what's public, and it's not my baby
[22:53] <tren> Tv_: Thanks :)
[22:54] <gregaf> Tv_: I think anything we know is public
[22:54] <tziOm> joshd, thanks a lot for trying to help out
[22:55] <gregaf> the fact that it's 3PB raw is in one of the upcoming talk abstracts
[22:55] <dmick> tziOm: your problems are definitely unusual
[22:55] <dmick> it's unclear yet why, but this doesn't happen often
[22:56] <tziOm> but I guess you guys agree it's scary
[22:56] <tziOm> especially since you guys can't just say "do that" and it starts working..
[22:57] <Tv_> gregaf: i was move involved with dreamobjects in the early days... but yeah, another part is i can't remember the numbers off the top of my head
[22:57] <Tv_> 3PB with 1.5TB disks at 8 disks per server would come out to 256 servers, which sounds believable
[22:58] <joshd> tziOm: the kind of issues you've hit are rare, and still hard to diagnose sometimes
[22:58] <Anticimex> 256 10G ports? i can deliver that. :)
[22:58] <Tv_> s/move/more/
[22:59] <joshd> tziOm: if your pg dump has two osds for every pg's acting and up sets, restarting the osds will retrigger peering and should get things going again
[23:00] <joshd> tziOm: er, just 2 for the degraded pgs. the remapped ones will stay at 1 until recovery is restarted
[23:02] * The_Bishop (~bishop@2001:470:50b6:0:40cc:ebb6:79e3:a011) Quit (Ping timeout: 480 seconds)
[23:05] <tziOm> tried restarting now
[23:05] <tziOm> seems no diff
[23:07] <amatter> hi all. I'm building a new ceph cluster to host millions of small files (<1MB). I have 3x machines with an ssd in slot0 and 3x enterprise drives in other bays. each hdd is set up as its own osd running ext4. I am using cephfs. how should I adjust the system for best performance? adjust the striping for the data pool?
[23:10] * buck1 (~buck@bender.soe.ucsc.edu) has joined #ceph
[23:10] * buck (~Adium@soenat3.cse.ucsc.edu) Quit (Quit: Leaving.)
[23:10] * The_Bishop (~bishop@2001:470:50b6:0:4c8d:d50:b677:7da9) has joined #ceph
[23:12] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[23:12] * jlogan1 (~Thunderbi@2600:c00:3010:1:d8b3:92ee:c6dc:92c5) has joined #ceph
[23:12] * f4m8 is now known as f4m8_
[23:12] <rweeks> I'm not the expert, amatter, but I think you might be better off with btrfs or xfs rather than ext4.
[23:13] <gregaf> sounds like a pretty basic partition the SSD, use as journal for an OSD per HDD, set replication as desired
[23:13] <amatter> I've created individual partitions on the ssd for each osd to use as a raw device
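gregaf's layout (one SSD journal partition per OSD) as a ceph.conf sketch; the device names are hypothetical, and the fragment is written to a temp file rather than /etc/ceph/ceph.conf:

```shell
# One raw SSD partition per OSD journal; /dev/sda1..3 are placeholder
# partitions on the shared SSD.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[osd.0]
    osd journal = /dev/sda1
[osd.1]
    osd journal = /dev/sda2
[osd.2]
    osd journal = /dev/sda3
EOF
grep -c 'osd journal' "$conf"
```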
[23:14] <amatter> should I be concerned about the default 4MB stripe and object sizes?
[23:15] <joshd> if you have many small writes, you might want to use a smaller stripe size
[23:15] <gregaf> with the 4MB stripe size then each file is going to live on a single disk
[23:15] <Tv_> amatter: using cephfs?
[23:16] <amatter> tv_: yes
[23:16] <Tv_> amatter: good luck on those metadata operations ;)
[23:16] <gregaf> that's probably what you're after, but if you for some reason needed each file to have more IOPs within it you could reduce the stripe size and spread them out
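gregaf's point — that with the default 4MB stripe unit and stripe count of 1, a sub-4MB file lands in a single object, while a smaller stripe unit with a higher stripe count spreads a file's I/O across objects — comes down to layout arithmetic. This is an illustrative sketch of RADOS-style striping, not Ceph's actual implementation; the function name and defaults are assumptions:

```python
def object_for_offset(offset, stripe_unit=4 << 20, stripe_count=1,
                      object_size=4 << 20):
    """Map a file byte offset to (object_index, offset_within_object).

    With the defaults (4 MB stripe unit, stripe count 1), every byte of
    a file smaller than 4 MB maps to object 0 -- the whole file lives in
    one object on one disk, as discussed above.
    """
    su_per_object = object_size // stripe_unit
    stripe_no = offset // stripe_unit          # which stripe unit overall
    stripe_pos = offset % stripe_unit          # offset inside that unit
    set_width = stripe_count * su_per_object   # stripe units per object set
    objectset = stripe_no // set_width
    stripe_in_set = stripe_no % set_width
    obj_in_set = stripe_in_set % stripe_count  # round-robin across objects
    su_in_obj = stripe_in_set // stripe_count
    obj = objectset * stripe_count + obj_in_set
    return obj, su_in_obj * stripe_unit + stripe_pos
```

With a 64KB stripe unit and a stripe count of 3, consecutive 64KB chunks of one file round-robin across three objects, which is the "more IOPS within a file" trade-off gregaf mentions.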
[23:17] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[23:18] <amatter> thanks
[23:18] <rweeks> but seriously, isn't the recommendation to use btrfs and not ext4?
[23:19] <Tv_> rweeks: xfs
[23:20] <Tv_> rweeks: but that's largely because we haven't tested ext4 much lately; with the current code, it might actually be viable again
[23:20] <nhm> rweeks: tough to say right now. Once I can get back to performance testing, I should have some better answers.
[23:20] <Tv_> frankly, it wouldn't surprise me if ext4 won again
[23:20] <Tv_> (err s/again//)
[23:20] <Tv_> make that s/again/after all, for now/
[23:20] <nhm> rweeks: the short of it is that btrfs does really well on a fresh filesystem, but seems to degrade quickly. Ext4 tends to be in second place, but we have no idea how it degrades. XFS tends to be the slowest, but degrades more slowly than BTRFS.
[23:21] * danieagle (~Daniel@ has joined #ceph
[23:21] <Tv_> nhm: also, btrfs still seems to eat leveldbs :(
[23:21] <nhm> Tv_: saw that. Hopefully only with compression on?
[23:21] <Tv_> nhm: maybe..
[23:22] <Tv_> nhm: or maybe compression makes it more likely..
[23:22] * synapsr (~synapsr@72-18-242-2.static-ip.telepacific.net) Quit (Remote host closed the connection)
[23:22] <Tv_> wish we could repro :(
[23:23] <amatter> still not advised to run an mds and an osd on the same machine?
[23:23] <Tv_> amatter: depends on your hardware, but most people choose to have two kinds of boxes
[23:24] <Tv_> amatter: mds likes ram but doesn't use local disk; that's a bad fit for IO-optimized server hardware
[23:24] <Tv_> amatter: so most people put in a physically smaller box with enough ram, no extra disks
[23:25] * dok (~dok@static-50-53-68-158.bvtn.or.frontiernet.net) has joined #ceph
[23:25] * buck1 (~buck@bender.soe.ucsc.edu) Quit (Quit: Leaving.)
[23:25] <Tv_> amatter: if you insist on identical hardware, then waste some money on it not being optimal; it's not a question of anything breaking
[23:25] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[23:25] * dty (~derek@testproxy.umiacs.umd.edu) Quit (Ping timeout: 480 seconds)
[23:26] <tziOm> Ok, I will try to recreate this..
[23:30] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[23:31] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[23:36] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[23:39] <rweeks> I see, nhm, Tv_. Thanks for the clarification.
[23:41] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[23:42] <rweeks> so then since the OSDs sit on that, right now we don't have a better answer than xfs
[23:43] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Ping timeout: 480 seconds)
[23:49] <Fruit> I think I'll try zfs tomorrow :)
[23:50] <rweeks> let us know how that goes.
[23:50] <Fruit> will do
[23:51] <nhm> Fruit: a couple of other people have tried and have run into some issues.
[23:51] <nhm> Fruit: I don't remember if they were able to get around them.
[23:51] <Fruit> it's a test setup; no fun unless it goes boom
[23:51] <sjust> Fruit: zfs on linux?
[23:51] <Fruit> sjust: yeah
[23:51] <sjust> cool
[23:52] <nhm> Fruit: fuse or kernel client?
[23:54] <Fruit> for zfs? kernel
[23:54] <nhm> ok, the llnl implementation then
[23:54] <Fruit> it's not really a "client" though
[23:54] <nhm> fruit: yeah, muscle memory talking about our stuff. :)
[23:54] <Fruit> figured that :)
[23:54] <Fruit> had fun with that today
[23:56] * aliguori (~anthony@ Quit (Remote host closed the connection)
[23:58] * vata (~vata@2607:fad8:4:0:d15e:437d:f3bd:c559) Quit (Quit: Leaving.)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.