#ceph IRC Log


IRC Log for 2013-02-08

Timestamps are in GMT/BST.

[9:05] -solenoid.oftc.net- *** Looking up your hostname...
[9:05] -solenoid.oftc.net- *** Checking Ident
[9:05] -solenoid.oftc.net- *** No Ident response
[9:05] -solenoid.oftc.net- *** Found your hostname
[9:05] * CephLogBot (~PircBot@rockbox.widodh.nl) has joined #ceph
[9:05] * Topic is 'v0.56.2 has been released -- http://goo.gl/WqGvE || argonaut v0.48.3 released -- http://goo.gl/80aGP || performance tuning overview http://goo.gl/1ti5A'
[9:05] * Set by scuttlemonkey!~scuttlemo@d51A54392.access.telenet.be on Tue Feb 05 13:28:40 CET 2013
[9:10] * loicd (~loic@lvs-gateway1.teclib.net) has joined #ceph
[9:16] * ScOut3R (~ScOut3R@ has joined #ceph
[9:24] * leseb (~leseb@stoneit.xs4all.nl) has joined #ceph
[9:24] * leseb (~leseb@stoneit.xs4all.nl) Quit (Remote host closed the connection)
[9:25] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[9:29] * fghaas (~florian@91-119-222-199.dynamic.xdsl-line.inode.at) has joined #ceph
[9:39] * BManojlovic (~steki@ has joined #ceph
[9:49] * LeaChim (~LeaChim@b0faa140.bb.sky.com) has joined #ceph
[9:52] * KindOne (KindOne@h185.237.22.98.dynamic.ip.windstream.net) Quit (Remote host closed the connection)
[9:53] <absynth_> morning all
[9:55] * KindOne (KindOne@h185.237.22.98.dynamic.ip.windstream.net) has joined #ceph
[10:00] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) has joined #ceph
[10:10] <ScOut3R> morning absynth_
[10:10] <ScOut3R> do i remember right that you are using cachecade with your controllers? or was it someone else who mentioned it
[10:10] <absynth_> yes
[10:10] <topro> i feel like what I experience here is something related to http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg09703.html
[10:11] <topro> linux 3.2.35 + 0.56.2 on ceph servers, linux 3.7.3 on clients using kernel fs driver
[10:11] <absynth_> ScOut3R: it was us
[10:11] <topro> osds on xfs
[10:12] <absynth_> topro: without going into detail, the OSD kernels sound quite old, did you consider a kernel update?
[10:12] <topro> symptom: files get written successfully, after some time some ~4MB files get truncated to 2.0MB without getting write IO
[10:13] <ScOut3R> absynth_: we are planning to use it within our cluster and i couldn't find an answer for a specific scenario. here it is: what happens when all the SSDs in the cachecade array die? will the controller just ignore that function and the "real" arrays will be available or will the controller shut down the arrays affected by the cachecade layer outage? have you had such experience?
[10:13] <absynth_> topro: are you talking about 4.0MB inside the clients or the 4.0M object files on the OSD disks (as seen from the "outside")?
[10:13] <topro> absynth_: i thought the servers don't depend on the kernel version, apart from needing a recent kernel if osds use btrfs, am I wrong? is there some ceph server support in the kernel?
[10:14] <absynth_> ScOut3R: we asked ourselves the exact same question
[10:14] <topro> absynth_ 4mb on client (file size)
[10:14] <absynth_> ScOut3R: in theory, if an SSD dies (i.e. due to wear), its internal controller should switch it to read-only mode, effectively disabling it for caching
[10:14] <absynth_> topro: didn't you just say your OSDs are XFS?
[10:14] <absynth_> 10:11:52 < topro> osds on xfs
[10:14] <ScOut3R> absynth_: good to know :) the LSI docs are handling just one specific case where a RAID1 SSD array is used for read/write cache but there's no scenario for a complete outage regarding the SSDs
[10:15] <topro> absynth_ yes, they are on xfs, so I thought kernel version on servers shouldn't matter at all, right?
[10:15] <absynth_> we have just one SSD for caching, as far as i know
[10:15] <ScOut3R> absynth_: thanks, then i'll test it as soon as i get the hardware
[10:15] <absynth_> topro: my perception was that there are quite a few xfs-related patches in recent kernels
[10:16] <topro> absynth_: but i had osds on btrfs before showing the exact same problem
[10:16] <topro> that was the reason to switch to xfs in the first place
[10:17] <absynth_> well, we saw objects filled with NULL bytes, but not truncated
[10:17] <topro> can I help debug this issue?
[10:17] <absynth_> which timezone are you in?
[10:17] <absynth_> CET?
[10:17] <topro> GMT +1 so CET, yes
[10:18] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[10:18] <absynth_> k, then you should probably wait till about 5pm
[10:18] <topro> then, re-report?
[10:18] <absynth_> usually, gregaf, sage et al. are awake by then
[10:18] <absynth_> i seem to remember a mailing list thread, was that you too?
[10:18] <topro> no, no mailing list report by me
[10:19] <absynth_> ok, then i remembered wrong
[10:19] <absynth_> usually, with issues like this, someone from inktank (btw, i am not an inktank employee, i am a community member) will be interested in seeing logs or even accessing the cluster
[10:21] <absynth_> personally, i'd update my OSDs to have newer kernels, though, just in case
[10:21] <absynth_> i know it's a pain, but still
[10:22] <topro> with me not even being a real community member, just exploring whether ceph would be the right choice for me, I don't think that access to the cluster would be possible. but I will supply you with as much information as I can
[10:23] <topro> I would love to use the latest kernel, but e.g. the 3.7.3 I was testing didn't like my sas-controller, seems like there is a regression I couldn't sort out yet
[10:24] <absynth_> meh
[10:25] <absynth_> either way, a good idea would be to prepare a mailing list posting either on ceph-user or the old list
[10:25] <absynth_> the core people read everything.
[10:26] <topro> better chance than getting people interested in it on irc?
[10:27] <absynth_> yeah, i think so
[10:28] <absynth_> you can still include something along the lines of "you can find me on irc"
[10:28] <absynth_> so people can ping you if they want to get into the issue
[10:29] <trond> Hi, has anyone tested format 2 images with the rbd kernelmodule in 3.8-rc6?
[10:29] <topro> ok, I'll try to get as much information into a mailing list posting and stand by... ;)
[10:32] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) has joined #ceph
[10:36] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[10:37] * brambles (lechuck@s0.barwen.ch) has joined #ceph
[10:43] * brambles_ (lechuck@ec2-54-228-50-165.eu-west-1.compute.amazonaws.com) Quit (Ping timeout: 480 seconds)
[10:44] <joao> morning all
[10:45] <joao> absynth_, how's it going with your cluster?
[10:45] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[10:46] <morpheus__> so filed another bug report, very high memory usage :/ http://tracker.ceph.com/issues/4052
[10:47] <joao> morpheus__, any chance you can post a 'ceph -w' on that ticket?
[10:48] <joao> would be great to see what's happening
[10:48] <joao> also, I'm going to edit the ticket to make the stack trace readable ;)
[10:48] <morpheus__> yes currently some backfilling etc ( we removed two osds ), the high memory usage was there before though
[10:48] <absynth_> joao: hm, so-so i'd say. we made some headway yesterday, but we are still in "only make supervised changes" mode
[10:49] <joao> I see :\
[10:49] <joao> let me know if I can be of any help
[10:49] <joao> I'm not on your PS support though
[10:49] <absynth_> nah, it's cool
[10:50] <absynth_> i have cephs cell no. ;)
[10:50] <topro> absynth_: i'm not so used to mailing list good habits. would you mind having a look at http://paste.debian.net/232613/ Would that be ok or is anything missing?
[10:50] <absynth_> s/ceph/sage/
[10:50] <joao> just in case, if you have a panic attack or something of the sorts, for all intents I'm the one awake :p
[10:50] <absynth_> oooh, now i remember you topro
[10:50] <absynth_> the subdir question from a couple days ago
[10:51] <topro> yes
[10:51] <morpheus__> joao: posted ceph -w output and some more info
[10:51] <topro> got no answer
[10:51] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[10:51] <absynth_> just to be sure, you have one (1) client, right?
[10:51] <absynth_> and not numerous ones trying to mount the same cephfs
[10:51] <topro> now there are two, but it started happening when there was only one
[10:52] * hybrid512 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[10:54] <absynth_> posting looks good
[10:55] <topro> so I will give it a try. full real names preferred in mailing list posting?
[10:55] <absynth_> ifyes
[10:55] <absynth_> -if
[10:55] <joao> morpheus__, thanks
[10:55] * low (~low@ has joined #ceph
[10:55] <joao> morpheus__, 'ceph -s' as well?
[10:56] <joao> I wonder how many osds you are recovering at the same time
[10:57] * loicd1 (~loic@lvs-gateway1.teclib.net) has joined #ceph
[10:57] * loicd (~loic@lvs-gateway1.teclib.net) Quit (Read error: Connection reset by peer)
[10:57] <morpheus__> yes one moment
[10:57] <morpheus__> removed two osds via osd out
[10:57] <morpheus__> any chance to disable deep scrubbing during the backfill?
[10:58] <absynth_> yeah
[10:58] <absynth_> just set the scrub intervals to impossible values
[10:58] <absynth_> osd scrub min interval = 3000000
[10:58] <absynth_> osd scrub max interval = 3000000
[10:58] <absynth_> (that's seconds)
[10:58] <absynth_> you can inject those parameters during runtime too, afair
[11:00] <morpheus__> joao: ceph -s is in
[11:00] <morpheus__> absynth_: thx i'll take a look
[11:02] <absynth_> win 12
[11:03] <topro> what an honor, I'm the first one to use ceph-user
[11:03] <topro> maybe even the only subscriber :(
[11:04] <topro> finally decided to at least include my ceph.conf
[11:04] <morpheus__> i tried ceph osd tell \* injectargs '--osd-scrub-max-interval 10000000'
[11:04] <morpheus__> but config show shows osd_scrub_max_interval = 604800
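For orientation, the scrub interval values mentioned above are plain seconds, and a quick conversion shows why they act as an effective "never". The following is a minimal sketch that only does the arithmetic on the three numbers quoted in the log; the helper name `seconds_to_days` is made up for illustration, and nothing here talks to a cluster.

```python
# Convert the scrub interval values quoted in the log from seconds to days.
# Pure arithmetic; the function name is hypothetical and no Ceph API is used.
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_days(seconds: int) -> float:
    """Return the given number of seconds expressed in days."""
    return seconds / SECONDS_PER_DAY


intervals = {
    "suggested by absynth_": 3_000_000,
    "tried via injectargs": 10_000_000,
    "still reported by config show": 604_800,
}

for label, secs in intervals.items():
    print(f"{label}: {secs} s = {seconds_to_days(secs):.1f} days")
```

The value still reported by config show, 604800 seconds, is exactly one week, while the suggested 3000000 seconds is a little over a month.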
[11:05] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[11:11] <absynth_> topro: nah, i'm on there too
[11:11] <absynth_> so we can have a romantic little conversation
[11:12] <absynth_> did you see the change in ceph -w, morpheus__?
[11:12] <absynth_> also, maybe 10000000 is too much
[11:12] <morpheus__> yep
[11:12] <absynth_> or maybe it's a boot time parameter and not really honored at runtime. dunno, sam would probably know
[11:16] * ShaunR (~ShaunR@staff.ndchost.com) Quit (Read error: Connection reset by peer)
[11:16] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[11:16] <topro> absynth_: something just came to my mind, what I told you about only having had one cephfs client is not totally correct ;)
[11:17] <absynth_> then maybe what happens to the files is totally correct
[11:17] <absynth_> quid pro quo
[11:17] <absynth_> ;)
[11:17] <topro> I do also have a VM mounting the same cephfs, I had that from the beginning, so there have been two, working on the same directory
[11:18] <absynth_> had numerous of these issues the last days, including lengthy discussion why this is wrong :)
[11:18] <absynth_> so you might well have found the root cause
[11:18] * Morg (d4438402@ircip2.mibbit.com) has joined #ceph
[11:18] <absynth_> unmount cephfs on one client and see if the problem still exists
[11:19] <topro> no, it can not be correct. the vm compiles a medium-sized project on the ceph fs (different subtree) and copies the binary to the folder where truncation happens. my native machine then creates a .tar.bz2 including that binary. and that .tar.bz2 gets truncated later on without receiving additional IO
[11:22] <topro> i have been doing the exact same work for years with my home on fns, and the day I started using my test-ceph these problems started showing up
[11:22] <topro> s/fns/nfs/
[11:52] * gregaf1 (~Adium@2607:f298:a:607:d1d0:87ef:2023:ecab) has joined #ceph
[11:58] * gregaf (~Adium@ Quit (Ping timeout: 480 seconds)
[12:23] <morpheus__> cluster still seems to do deep scrubbing during backfill :/
[12:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:58] * pixel (~pixel@ has joined #ceph
[13:01] * capri_on (~capri@ has joined #ceph
[13:01] <capri_on> hey, i'm new to ceph and i want to know how to calculate how many nodes can crash before ceph stops working
[13:02] <capri_on> i know that with 3 nodes, 1 can fail, but what if i'm running 10 or 20 ceph nodes and so on?
[13:03] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:08] <Gugge-47527> capri_on: with replication=2, 1 node can fail
[13:08] <Gugge-47527> And maybe more than one :)
[13:08] <Gugge-47527> But the "wrong 2" will fail both replicas of some data
[13:09] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[13:09] <capri_on> Gugge-47527 --> is there any chance to change the replication number so more nodes could crash?
[13:09] <Gugge-47527> sure
[13:09] <Gugge-47527> you can set the replication level to 20
[13:10] <capri_on> oh cool :-)
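The rule of thumb in this exchange can be written down as a small calculation. This is a minimal sketch, assuming every replica of an object lands on a different node (the log does not state the placement rules in use), and the function name is hypothetical. It gives only the guaranteed figure: as Gugge-47527 notes, more nodes may fail without loss as long as they are not the "wrong" ones.

```python
# Guaranteed number of simultaneous node failures that cannot destroy all
# copies of any object, ASSUMING each replica sits on a distinct node.
# The function name is made up for illustration; this is not a Ceph API.
def guaranteed_tolerated_failures(replicas: int) -> int:
    if replicas < 1:
        raise ValueError("replication level must be at least 1")
    # With r copies on r distinct nodes, any r - 1 failures leave one copy.
    return replicas - 1


for level in (2, 3, 20):
    print(f"replication={level}: "
          f"{guaranteed_tolerated_failures(level)} node failure(s) always survivable")
```

Note that the figure depends on the replication level, not on the total number of nodes, which is why 10 or 20 nodes with replication=2 still only guarantee surviving one failure.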
[13:17] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:36] * jamespage (~jamespage@tobermory.gromper.net) Quit (Quit: Coyote finally caught me)
[13:40] * gaveen (~gaveen@ has joined #ceph
[13:41] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:44] * fghaas (~florian@91-119-222-199.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[13:45] * fghaas (~florian@91-119-222-199.dynamic.xdsl-line.inode.at) has joined #ceph
[13:51] * capri_on (~capri@ Quit (Quit: Leaving)
[13:56] <Morg> hello
[13:57] <Morg> anyone got problems with flapping osds on 3TB disks?
[13:59] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[14:00] * pixel (~pixel@ Quit (Remote host closed the connection)
[14:07] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:14] * SkyEye (~gaveen@ has joined #ceph
[14:17] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:18] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:21] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[14:33] * fghaas (~florian@91-119-222-199.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[14:42] <nhm> Morg: not strictly on 3TB disks, but there were some issues with flapping OSDs other folks ran into, and I think we just figured it out a day or two ago.
[14:47] <absynth_> GRMPF, YES.
[14:49] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[14:49] <absynth_> Morg: got a stacktrace?
[14:49] <noob2> wido: you around?
[14:49] <wido> noob2: Yes
[14:50] <noob2> you wrote the article on ceph over iscsi right?
[14:50] <wido> I wrote something a long time ago. The more recent one was from Florian Haas
[14:50] <wido> fghaas, but he's not online now
[14:50] <noob2> oh that's right
[14:50] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Quit: Leaving)
[14:50] <noob2> ok i was wondering if you've encountered this also
[14:50] <wido> RBD should be coupled into tgt and you could have fun
[14:51] <wido> some code has been written I saw recently
[14:51] <noob2> really?
[14:51] <wido> Yes, let me look it up
[14:51] <noob2> ok
[14:51] <noob2> i've found a bug in the lio utils that only occurs under really heavy load
[14:51] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[14:51] <wido> noob2: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11662.html
[14:51] <wido> dmick was actually the one who wrote it
[14:51] <noob2> if you tell 10-20 vmware clients to do bonnie++ tests and then bring down a couple of osd's the rbd client starts telling LIO to wait
[14:52] <noob2> lio triggers a scsi abort and then the kernel panics :(
[14:52] <wido> I haven't been running RBD with iSCSI in production, so I can't really tell
[14:52] <topro> absynth_: I unmounted cephfs on the vm, since then no more files truncated. but still this must not happen.
[14:52] <noob2> yeah i'm not in production yet either but only because of this bug
[14:52] <noob2> that's awesome about the rbd in tgt.
[14:53] <noob2> ah he's using librados. that's smart
[14:53] <absynth_> topro: and this is where discussion would kick in. general consensus yesterday was that the *multiple mounting* must not happen
[14:53] <topro> absynth_: they don't even write to the same files. could it be related that my host is using amd64 kernel and the vm is using i686?
[14:53] <absynth_> topro: this is similar to the issue yesterday
[14:53] <absynth_> lemme see IRC logs
[14:54] <wido> noob2: He's using librbd, that's what you meant?
[14:54] <absynth_> hm, no logs for yesterday and wed
[14:54] <noob2> yes
[14:54] <wido> I have to give this a try once :)
[14:54] <noob2> well i believe this is what i was looking for
[14:54] <wido> And it should go upstream then after we poke dmick to submit the patch
[14:54] <topro> absynth_: just to make sure I get you right: cephfs is not intended to get mounted by multiple clients simultaneously?
[14:55] <noob2> LIO's fibre code assumes that it will be able to quickly write to disk. if it can't the developers told me it gets into buggy code and dies
[14:55] <absynth_> topro: from what i know (i am not using cephfs, only following the discussion here), that is correct
[14:55] * wido has to admit that it has been a long time since he played with iSCSI
[14:55] <noob2> :)
[14:55] <absynth_> wido: can you look at topro's statement?
[14:56] <absynth_> Morg: did the OSD(s) just die or what was the flappy behavior exactly?
[14:57] <wido> absynth_: Well, CephFS is of course intended to be used on multiple clients, BUT
[14:57] <wido> it's not working that well right now, so things can break if you do so
[14:57] <topro> i don't like that BUT
[14:57] <topro> ;)
[14:57] <Morg> well, i created cluster nicely, and checking ceph -w shows a lot of "wrongly marked me as down"
[14:57] <wido> Have you tried FUSE yet topro?
[14:57] <wido> ceph-fuse, to see if that works out?
[14:57] <topro> wido: no, actually not
[14:58] <wido> I saw your post on the ml and read back the other conversation
[14:58] <topro> would that make any difference?
[14:58] <wido> the suggestion there was to try that to rule out a kernel bug
[14:58] <Morg> absynth_: i removed any additional settings in ceph.conf checked on both 0.48 and 0.56.2
[14:58] <topro> ok, so the most important question to me is: will cephfs eventually work in an environment with multiple (tens) of clients mounting it simultaneously?
[14:58] <Morg> absynth_: but end result is still the same
[14:59] <wido> topro: Yes, it will. CephFS will work with multiple clients and multiple active MDSes
[14:59] <wido> but it just didn't get the attention it should have gotten
[14:59] <wido> From what I've heard some great work is being done on it
[14:59] <wido> but didn't hit any repo yet
[15:00] <wido> You should be able to replace your NFS by CephFS at some point
[15:00] <topro> wido: for the moment i keep away from multiple active MDS as I read about that being an issue. shouldn't there be a similar statement in the docs to warn people against using multiple clients at the current state of implementation?
[15:00] <wido> topro: Might be a good suggestion. There should be a big BETA warning around the whole CephFS for now
[15:01] <absynth_> more like a "considered a data sink" warning
[15:01] <topro> wido: from the docs it's clear that cephfs is not production ready, so that's not the problem
[15:01] <topro> absynth_ :)
[15:01] <absynth_> we had someone here just yesterday who had the same idea and much of the same problems with cephfs, i think
[15:01] <wido> So you have it mounted on multiple clients? I couldn't find that in your e-mail
[15:01] <absynth_> Morg: can you paste ceph -s and one line from ceph -w?
[15:01] * cephalobot (~ceph@ds2390.dreamservers.com) Quit (Remote host closed the connection)
[15:02] <absynth_> wido: he noticed 2 mins after sending the mail ;)
[15:02] <Morg> sure thing, just a sec.
[15:02] * cephalobot (~ceph@ds2390.dreamservers.com) has joined #ceph
[15:02] <topro> wido: i just figured out that the whole problem could be related to multiple clients after writing that by further discussion with absynth_
[15:02] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[15:02] * fghaas (~florian@91-119-222-199.dynamic.xdsl-line.inode.at) has joined #ceph
[15:03] <topro> anyway, I will forever be the first one having posted an issue on ceph-users :P
[15:03] <Morg> absynth_: http://pastebin.com/66aUVrJL
[15:04] <Teduardo> So, everything I read in here scares the piss out of me reg'd Ceph.
[15:04] <absynth_> you will forever be the first person having posted an inaccurately described, potentially self-caused issue on ceph-users :P
[15:04] <Morg> absynth_: it goes on and on, failure, wrongly marked down, boot up
[15:04] * trond (~trond@trh.betradar.com) Quit (Read error: Connection reset by peer)
[15:04] <noob2> Teduardo: what in particular?
[15:04] <absynth_> Morg: some ceph osd tree pls, too
[15:05] * Teduardo (~DW-10297@dhcp92.cmh.ee.net) Quit ()
[15:05] <noob2> weird..
[15:05] <absynth_> some men just aren't cut out for Ceph
[15:05] <topro> absynth_: talking like this, isn't every bug self-caused somehow? if I wouldn't have given cephfs a try I wouldn't have posted that message
[15:05] <noob2> lol
[15:06] <xiaoxi> nhm:hi, are you online?
[15:06] <Morg> absynth_: there
[15:06] <Morg> http://pastebin.com/qsMMHGAg
[15:06] <absynth_> topro: true. hadn't you chosen a career in IT, you wouldn't have tried Ceph
[15:06] <absynth_> topro: so, yes, IT'S ALL YOUR FAULT.
[15:06] <joao> <absynth_> some men just aren't cut out for Ceph <- we're aiming at making it as painless as os x
[15:06] <noob2> :D
[15:07] <absynth_> joao: that is about as good a statement as "we would like you to enjoy our chinese water torture as much as you can" ;)
[15:07] * Teduardo (~DW-10297@dhcp92.cmh.ee.net) has joined #ceph
[15:08] <topro> absynth_: obviously
[15:08] <joao> there's a good side in anything, you just have to be open-minded enough :p
[15:08] <Morg> absynth_: every osd is 3TB disk with btrfs
[15:08] <absynth_> Morg: ok, so basically, "wrongly marked me down" is caused by issuing "ceph osd down X" for the OSDs in question. they complain about being marked down when in fact they aren't, and reboot
[15:08] <absynth_> do you have network/latency issues on the network connecting the OSDs and mons?
[15:08] <absynth_> packet loss?
[15:08] <Morg> none
[15:09] <absynth_> is it always the same OSDs?
[15:09] * loicd1 (~loic@lvs-gateway1.teclib.net) Quit (Ping timeout: 480 seconds)
[15:09] <Morg> it goes for all osd's
[15:09] <topro> I'm gonna write a short followup on my mailing list posting...
[15:09] <xiaoxi> Morg: have you got some "slow request" warning?
[15:09] <Morg> lemme see
[15:09] <absynth_> this is 0.56.2, right?
[15:10] <Morg> absynth_: yes, and no, no slow req, only this 2013-02-08 15:01:15.618235 mon.0 [INF] osd.14 failed (3 reports from 3 peers after 2013-02-08 15:01:17.618187 >= grace 20.000000)
[15:10] <xiaoxi> morg:you should look at osd.14's log to see if there are slow requests
[15:10] <absynth_> that means "we think osd.14 is broken", issued by (not sure) the other OSDs
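The failure line Morg pasted carries the OSD id, the number of reports, the number of reporting peers and the grace period, which is handy when counting how often each OSD gets flagged. Below is a minimal sketch of pulling those fields out with a regular expression. The pattern is derived only from the single line quoted in this log, so treating it as the general message format is an assumption, and the function name is made up.

```python
import re

# Pattern built from the one "failed" line quoted in the log above.
# ASSUMPTION: other Ceph versions may word this message differently.
FAILED_RE = re.compile(
    r"osd\.(?P<osd>\d+) failed "
    r"\((?P<reports>\d+) reports from (?P<peers>\d+) peers "
    r"after .* >= grace (?P<grace>[\d.]+)\)"
)


def parse_failed_line(line: str):
    """Return a dict with osd, reports, peers, grace, or None if no match."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    return {
        "osd": int(match.group("osd")),
        "reports": int(match.group("reports")),
        "peers": int(match.group("peers")),
        "grace": float(match.group("grace")),
    }


sample = (
    "2013-02-08 15:01:15.618235 mon.0 [INF] osd.14 failed "
    "(3 reports from 3 peers after 2013-02-08 15:01:17.618187 >= grace 20.000000)"
)
print(parse_failed_line(sample))
```

Feeding the output of ceph -w through such a parser and tallying by OSD id would show whether the failures really hit all OSDs evenly, as Morg reports.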
[15:11] <absynth_> if there were slow requests, he would see them in ceph -w
[15:11] <absynth_> and: i cannot think of any scenario where slow requests would cause OSD reboots
[15:11] <Morg> well, it starts just after cluster is built
[15:11] <Morg> no data inside
[15:12] <absynth_> which version is this?
[15:12] <Morg> 56.2
[15:12] <absynth_> hm
[15:12] <absynth_> do you feel adventurous?
[15:12] <Morg> and its the same on 0.48
[15:12] * fghaas (~florian@91-119-222-199.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[15:12] * aliguori (~anthony@cpe-70-112-157-151.austin.res.rr.com) has joined #ceph
[15:12] * fghaas (~florian@91-119-222-199.dynamic.xdsl-line.inode.at) has joined #ceph
[15:12] <Morg> absynth_: that depends on what to you have in mind ;]
[15:12] <absynth_> there is a new branch
[15:12] <Morg> *do
[15:12] <absynth_> wip-bobtail-f, i think
[15:13] <absynth_> it has a very recent fix for an issue that me, paravoid and a couple others have encountered
[15:13] <absynth_> which has to do with weird osd behavior
[15:13] <absynth_> maybe what you're seeing might be connected...?
[15:13] <Morg> i can try it on monday i guess
[15:14] <Morg> its from development-release or development testing?
[15:14] <absynth_> no idea, really, we have deb packages for it
[15:15] <absynth_> i overheard sage and paravoid today, and i think they were talking about the same issue
[15:15] <absynth_> let me scroll back
[15:15] <Morg> kk
[15:16] <absynth_> yes, paravoid has the same issue
[15:16] <absynth_> at least as far as i can tell
[15:16] <absynth_> and sage recommended he try out the new version
[15:16] <morpheus__> we also tried wip_bobtail_f after sage's advice, crashing here :/
[15:17] <Morg> i will do that then
[15:17] <absynth_> well, it's working OK here, but we're maybe a special case ;)
[15:17] <Morg> and let you know
[15:17] <absynth_> you just need to update the OSDs, not the mons
[15:17] <absynth_> if that is any consolation
[15:17] <Morg> absynth_: http://media.tumblr.com/tumblr_mdk6l9Dzlo1r6hxat.gif ;]
[15:18] <nhm> Morg: oh good, you got in contact with absynth_
[15:18] <morpheus__> still hoping for a another fix later ;)
[15:19] <absynth_> nhm: yeah, and paravoid is possibly testing the wip_bobtail_f release today, too
[15:19] <absynth_> as far as i can tell
[15:19] <absynth_> Morg: ;)
[15:20] <nhm> absynth_: oh good
[15:20] <absynth_> morpheus__: i think you shouldn't need to update if you don't have the same symptoms that were discussed here
[15:20] <absynth_> i.e. OSDs taking literally forever to boot et al.
[15:21] <absynth_> didn't you post a stacktrace on the list? i could swear i saw one earlier
[15:21] <absynth_> you are from fremaks, right?
[15:21] <morpheus__> both, yes :)
[15:21] <morpheus__> this stacktrace is from bobtail_f after sage mentioned it because of our problems
[15:21] <morpheus__> OSDs take ~8-12GB RAM on startup
[15:21] <absynth_> oh, yeah, that
[15:22] <absynth_> do you have lots of idle PGs?
[15:22] <absynth_> i.e. PGs of offline VMs that weren't touched in weeks, testing PGs, benchmarks, backup of RBD images, etc.?
[15:23] <morpheus__> yep there are some
[15:23] <morpheus__> we stopped a lot of vms some weeks ago
[15:23] <morpheus__> since then osds are growing
[15:25] <absynth_> ok
[15:25] <absynth_> if you can, delete them
[15:25] <absynth_> all of them
[15:25] <absynth_> on our cluster, we had about 300gb worth of idle PGs and that led to the immense bootup times
[15:26] <morpheus__> i'll try, cluster is currently backfilling, we had another problem last night.. :)
[15:26] <morpheus__> thx for sharing
[15:26] <absynth_> where do i send the invoice? ;)
[15:29] <morpheus__> maybe we should think about getting an inktank support contract
[15:39] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:39] * Teduardo (~DW-10297@dhcp92.cmh.ee.net) Quit ()
[15:41] <mikedawson> nhm: you mentioned flapping OSDs earlier, I've seen some, is there a link to discussion of what you guys found?
[15:42] * jskinner (~jskinner@ has joined #ceph
[15:43] <nhm> mikedawson: not sure. Sage just mentioned it to me yesterday or the day before. Basically a bunch of people have been having OSDs have various problems.
[15:44] <nhm> We knew about it before, but didn't know why. Apparently we know why now.
[15:45] * Teduardo (~DW-10297@dhcp92.cmh.ee.net) has joined #ceph
[15:45] * Morg (d4438402@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[15:47] * mattch (~mattch@pcw3047.see.ed.ac.uk) has joined #ceph
[15:47] <mikedawson> nhm: i have seen situations where the osds die and the backing device falls offline. I've blamed hardware, but the drives always test clean and when I put them back into production they run fine.
[15:48] <Teduardo> mikedawson: Dell raid controllers do that all the time if the disks don't have TLER
[15:48] <Teduardo> even in raid-0
[15:49] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[15:49] <mikedawson> Teduardo: I'm using JBOD straight to onboard SATA
[15:49] <nhm> mikedawson: I don't think this would cause that, but Sage may have input.
[15:51] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit ()
[15:52] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[15:52] <morpheus__> so another question, whats backfill_toofull: backfill reservation rejected, OSD too full
[15:53] <morpheus__> i dont see any OSD above ~85% disk usage
[15:53] <morpheus__> ah i see
[15:53] <morpheus__> The osd backfill full ratio enables an OSD to refuse a backfill request if the OSD is approaching its full ratio (85%, by default).
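The threshold morpheus__ quotes can be checked against per-OSD disk usage with a few lines. This is a minimal sketch under stated assumptions: the utilization figures are invented for illustration, the function name is hypothetical, and whether the real comparison is "at or above" rather than strictly "above" the ratio is a guess, not something the log confirms.

```python
# Flag OSDs whose disk utilization reaches the backfill full ratio.
# The 0.85 default comes from the documentation line quoted in the log;
# the sample utilization figures and the ">=" comparison are ASSUMPTIONS.
BACKFILL_FULL_RATIO = 0.85


def osds_refusing_backfill(utilization, ratio=BACKFILL_FULL_RATIO):
    """Return the sorted names of OSDs at or above the given ratio."""
    return sorted(osd for osd, used in utilization.items() if used >= ratio)


example = {"osd.0": 0.62, "osd.1": 0.851, "osd.2": 0.84}
print(osds_refusing_backfill(example))
```

An OSD sitting just under 85% passes this check but would cross the line after receiving only a little backfill data, which may be why the state shows up even when no OSD looks obviously full.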
[15:59] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[16:00] * PerlStalker (~PerlStalk@ has joined #ceph
[16:01] <absynth_> mikedawson: we had OSDs that were up&in, but the cluster was still degraded by those osds' pgs
[16:02] <absynth_> i think paravoid and morg had OSDs that were wrongly marked as down all the time and took forever to come back up
[16:02] <absynth_> and also, morpheus__' OSDs need a crazy amount of RAM during bootup and peering
[16:03] <absynth_> these issues might and/or might not be related, interconnected and/or fixed by last night`s update
[16:03] <mikedawson> absynth_: seen most of those issues over the course of my testing, too. This issue is different
[16:03] <absynth_> "the osds die and the backing device falls offline
[16:03] <absynth_> "
[16:03] <absynth_> yøu mean that?
[16:03] <absynth_> woah, where'd that letter come from?
[16:04] * vata (~vata@2607:fad8:4:6:91f4:d410:ab68:d13e) has joined #ceph
[16:05] <mikedawson> absynth_: yeah, most likely the backing device falls offline causing the OSD to die. So it may have nothing to do with Ceph. But it only seems to happen under Ceph re-balancing load
[16:05] <absynth_> yeah, but that points towards hardware, doesn't it?
[16:06] <absynth_> i mean, hardware dying under heavy stress isn't a very farfetched scenario
[16:06] <nhm> mikedawson: Our Dell controllers seem to randomly take down drives and mark them as foreign.
[16:06] <absynth_> might be worth playing with osd_max_recovery settings and nmon to see if your i/o subsystem on the failing machine is just overwhelmed
[16:07] <absynth_> morpheus__: one other thing, do you scrub?
[16:07] <nhm> mikedawson: Not just ceph, system drives too. ;(
[16:07] <mikedawson> absynth_: exactly. that hardware has been taken out of the dev lab and moved to the probably doesn't work pile
[16:08] <morpheus__> absynth_: usually yes, but i disabled it temporarily when i read the mailing list post about scrubbing memory leaks
[16:08] <absynth_> ok
[16:09] * leseb (~leseb@mx00.stone-it.com) Quit (Ping timeout: 480 seconds)
[16:09] <absynth_> wtf snow?!?
[16:10] <nhm> absynth_: yes, all over. Blizzard this weekend apparently.
[16:10] <nhm> absynth_: oh wait, you meant there. ;)
[16:11] <joao> lol
[16:12] <absynth_> hm, blizzard... i wanted to play diablo3...
[16:14] <nhm> still need to play that. I've been avoiding it until I can actually devote time to it.
[16:14] <absynth_> me too
[16:14] <absynth_> the good part about this is: it will probably be in the bargain bin before i have enough time to play
[16:14] <absynth_> still haven't played warcraft3, either. i am kinda backlogged...
[16:17] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[16:22] <joao> I've reached the point in which I buy the games with the intent to install windows one of these days in order to play them
[16:23] <joao> still waiting for that to happen, and I keep buying the games
[16:23] <joao> steam just makes it too easy
[16:23] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[16:24] <absynth_> steam is available for linux
[16:24] <joao> I know
[16:25] <joao> that doesn't mean the games are though
[16:25] <joao> :p
[16:25] <absynth_> right
[16:25] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[16:35] * low (~low@ Quit (Quit: Leaving)
[16:38] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[16:38] <phantomcircuit> joao, valve ported cs:s i immediately removed my last copy of win xp
[16:39] <joao> yeah, they've also ported (or promised to port) a couple of other games
[16:39] <joao> tf2 comes to mind
[16:40] <joao> not sure if they ported l4d2 though, but that would probably dissuade me of ever installing windows again
[16:40] <janos> i imagine with rumors of a steam console device that we'll see more linux games
[16:40] <joao> hope so
[16:41] <janos> (i say this from the warm, gamely comfort of my win7 machine)
[16:41] <janos> but i would love to be able to ditch windows support
[16:41] <joao> but I gotta say that the best thing that ever happened to my productivity was the beta steam client crashing my ubuntu
[16:41] <janos> haha
[16:42] <janos> i'm avoiding it, telling myself i'm on fedora so it's not first-rate supported yet
[16:42] * janos rocks back and forth
[16:46] <janos> i had to check
[16:46] <janos> "Blizzard will release a game for Linux this year"
[16:46] <janos> http://steamforlinux.com/?q=en/node/162
[16:46] <janos> grain of salt, etc
[16:46] <noob2> :D
[16:46] <noob2> i feel like i've been waiting a decade for this stuff to happen
[16:47] <noob2> instead of using wine and bitching when it dies right before the end boss
[16:47] <janos> haha
[16:48] * darktim (~andre@pcandre.nine.ch) Quit (Ping timeout: 480 seconds)
[16:51] * harybahh (~EddardSta@dsi-laptop-vh.univ-lyon1.fr) has joined #ceph
[16:51] <harybahh> hi
[16:51] <noob2> you guys have been using chef for awhile right? how do you like it
[16:56] <noob2> harybahh: hello
[17:02] * jlogan1 (~Thunderbi@2600:c00:3010:1:64a7:2a7d:3bc:c1b2) has joined #ceph
[17:07] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[17:08] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[17:10] * rturk-away is now known as rturk
[17:14] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:15] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[17:16] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[17:28] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:30] * BMDan (~BMDan@ has joined #ceph
[17:31] <mattch> nhm: I had another thought on the 'will these SSDs last long enough?' question... is it sane to assume that 'total data writes before failure / manufacturer's maximum sequential write speed per sec' will give you a MTBF?
[17:33] * harybahh (~EddardSta@dsi-laptop-vh.univ-lyon1.fr) Quit (Ping timeout: 480 seconds)
[17:34] <mattch> (albeit a 'shortest it should last' assuming maximum continuous writes to the device)
[17:34] <mattch> interestingly a value SSD claims 7000TB of writes before failure, and a write speed of 235MB/s - which works out at a shade over a year by this calculation.
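A quick sketch of the arithmetic mattch describes, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and nonstop writes at the rated sequential speed. The function name is made up for illustration, and the result is a worst-case lower bound on lifetime rather than a true MTBF:

```python
def min_lifetime_days(endurance_tb, write_mb_per_s):
    """Shortest lifetime in days if the drive is written at full speed nonstop."""
    total_bytes = endurance_tb * 10**12      # rated write endurance in bytes
    bytes_per_s = write_mb_per_s * 10**6     # rated sequential write speed
    seconds = total_bytes / bytes_per_s
    return seconds / 86400                   # 86400 seconds per day


days = min_lifetime_days(7000, 235)
print(f"{days:.0f} days")
```

Under these assumptions the quoted figures come out at about 345 days, so "about a year" is the right ballpark for continuous maximum-rate writes.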
[17:35] * KindOne (KindOne@h185.237.22.98.dynamic.ip.windstream.net) Quit (Read error: Connection reset by peer)
[17:39] * KindOne (KindOne@h185.237.22.98.dynamic.ip.windstream.net) has joined #ceph
[17:40] <xiaoxi> mattch: I think you are right
[17:40] <elder> Anybody else having trouble accessing host teuthology?
[17:40] <elder> (Inktank)
[17:40] * fghaas (~florian@91-119-222-199.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[17:41] * sleinen1 (~Adium@2001:620:0:26:e143:f11b:f2d8:50c2) Quit (Quit: Leaving.)
[17:41] <xiaoxi> mattch: 7000TB may be a reasonable value for the write endurance capability for enterprise ssd with HET
[17:41] * sleinen (~Adium@ has joined #ceph
[17:41] <slang1> elder: joao and I can get to it either
[17:42] <elder> can't
[17:42] <elder> OK
[17:42] <slang1> sorry - cannot
[17:42] <joao> yeah, that kinda screwed what I had planned for this morning
[17:42] <elder> Well, me too.
[17:43] <elder> I need a few nodes, because my others are busy.
[17:43] <mattch> xiaoxi: I was quite impressed that these were the 'value' SSDs that Dell offer...
[17:44] <xiaoxi> mattch: well, can you tell your ssd vendor?
[17:44] <xiaoxi> is it intel ?
[17:45] <mattch> xiaoxi: Samsung (sm825 drives)
[17:45] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[17:46] <xiaoxi> mattch: not that familiar with Samsung , but the data from Intel SSD with HET(High endurance technology) is 4PB for 400GB SSD
[17:47] <xiaoxi> since ceph's journal IO pattern is quite sequential ,you could expect more.
[17:48] <xiaoxi> What's more, some preliminary data show that if you overprovision 20% of your ssd capacity, the endurance will almost double
[17:48] <noob2> dmick: are you around?
[17:49] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[17:49] <rturk> dmick is usually in a bit later
[17:49] <noob2> ok
[17:50] * BMDan (~BMDan@ Quit (Quit: Leaving.)
[17:51] <mattch> xiaoxi: By overprovision - do you mean allocate a bigger journal size than expected?
[17:52] <xiaoxi> mattch: either you can do as you said ,or just leaving 20% unused(unpartitioned) space there
[17:54] <mattch> xiaoxi: Ahh, I assume the SSD will use that 'unused' space to do error corrections etc?
[17:57] <xiaoxi> mattch: not really, it just uses them as backup blocks. Actually every ssd already has some overprovisioned space.
[17:57] <nhm> xiaoxi: I'm still not sure if you are better with an S3700 at 200GB, or a 520 at 400GB with 2x more cells to spread writes over.
[17:57] <nhm> xiaoxi: sorry, 480gb for the 520
[17:58] * SkyEye (~gaveen@ Quit (Remote host closed the connection)
[17:59] * ScOut3R (~scout3r@540079A1.dsl.pool.telekom.hu) has joined #ceph
[18:02] <xiaoxi> nhm: some input from SSD team told me that MLC has 38x RBER(Raw bit error rate) than that of HET.
[18:02] <mattch> xiaoxi: For info, that samsung drives with 7PB total writes is an MLC drive
[18:02] <xiaoxi> but since most of the RBER is correctable by ECC and in-ssd-raid, I am not quite sure for the overall endurance
[18:03] <xiaoxi> mattch: the endurance is highly dependent on access pattern, you will get significantly different endurance if you do pure 4K random write
[18:04] <mattch> xiaoxi: Definitely - am only thinking for journalling which I believe is all sequential?
[18:05] <nhm> xiaoxi: hrm. Still hard to know which approach is better
[18:06] <xiaoxi> nhm:but s3700 is cheaper than 520,right?
[18:06] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[18:08] <nhm> xiaoxi: In $US, 200gb Intel S3700 and 480gb Intel 520 are both around $500.
[18:13] * rturk is now known as rturk-away
[18:14] <absynth_> re, everyone
[18:14] <absynth_> sjust or someone around?
[18:16] * jskinner (~jskinner@ has joined #ceph
[18:17] <nhm> absynth_: he's out sick today
[18:17] <nhm> absynth_: says he "picked up" something in Germany. :P
[18:18] <joao> are you sure it wasn't "someone"?
[18:20] <absynth_> lol
[18:20] <absynth_> "nah, it's fine, i don't need a jacket" </quote>
[18:22] <absynth_> to be fair, it was really chilly.
[18:22] <absynth_> at least by LA standards
[18:24] * alram (~alram@ has joined #ceph
[18:24] <joao> absynth_, these guys from LA have a surprisingly high-tolerance to cold weather
[18:25] <nhm> joao: you only think that because you have a low tolerance. :P
[18:26] * ScOut3R (~scout3r@540079A1.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[18:26] <joao> can't forget that while I was wearing 2 long-sleeve shirts, gloves, and a winter jacket in Amsterdam, Sage and Greg were wearing pretty much a t-shirt and a summer jacket
[18:26] <joao> nhm, that's true
[18:26] <absynth_> i think the portuguese i know are the people with the lowest tolerance to cold weather
[18:27] <yehudasa> joao: these guys are not native LA
[18:27] <yehudasa> they pretty much wear t-shirts all year around
[18:28] <joao> absynth_, that's because our notion of a bad winter is when we get 5C in Lisbon
[18:29] <joao> and that's a really bad winter imo
[18:29] <yehudasa> joao: a few weeks ago it was colder than that in LA
[18:29] <absynth_> i remember driving to an area with lots of snow with my ex and her sister. the terrified faces when we drove back in the dark on a snow-covered highway... priceless.
[18:29] <janos> haha
[18:29] <joao> yehudasa, that's tough :\
[18:29] <joao> thought you guys had warmer temperatures all year round
[18:30] * jskinner (~jskinner@ Quit (Ping timeout: 480 seconds)
[18:30] <yehudasa> false in advertisement
[18:30] <yehudasa> we also get some rain from time to time
[18:30] <absynth_> that's harsh
[18:31] <joao> absynth_, I enjoy snow, but it terrifies me to drive on it
[18:31] <absynth_> but compared to that, Germany still pretty much is Mordor.
[18:31] <janos> snow driving is fun!
[18:31] <ircolle> You guys need to come to the Rocky Mountains of Colorado :-)
[18:31] <absynth_> joao: that's the thing - driving on snow is completely safe
[18:31] <janos> just not fun around terrified other drivers ;)
[18:31] <absynth_> driving on snow that's thawed and re-frozen is dangerous
[18:32] <janos> 4wd + tire chains!
[18:32] <absynth_> you can drive on fresh snow with slicks, pretty much
[18:33] <elder> You guys are a bunch of babies.
[18:34] <absynth_> wanna discuss that over a nice, fresh knuckle sandwich?
[18:36] <elder> I've been known to jump in the water when it's -10C
[18:36] <joao> that's just wrong
[18:36] <absynth_> that is brave
[18:36] <elder> And icy roads are just a fun challenge.
[18:37] <elder> My wife might say that "brave" is a euphemism.
[18:38] <absynth_> for "foolish"?
[18:38] <elder> or worse
[18:38] <absynth_> i mean, ice-diving is kinda semi-popular here, too, but i just don't like it. after all, i'm not a seal
[18:38] <absynth_> (i mean the animal, not the elite soldier)
[18:46] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[18:52] <morpheus__> anyone else an idea how to disable deep scrubbing completely on a running cluster?
[18:52] * davidz2 (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[18:53] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[18:53] * jskinner (~jskinner@ has joined #ceph
[18:55] <absynth_> morpheus__: you cannot disable it completely, at least not for PGs that have never before been scrubbed
[18:55] <absynth_> AFAIK
[18:56] <morpheus__> okay... narv :) deep scrubbing during backfill
[18:59] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[18:59] * ScOut3R (~ScOut3R@540079A1.dsl.pool.telekom.hu) has joined #ceph
[19:00] * chutzpah (~chutz@ has joined #ceph
[19:00] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[19:01] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[19:03] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) Quit ()
[19:03] <absynth_> ok, guys, off for the weekend. l8r
[19:06] * jluis (~JL@ has joined #ceph
[19:08] * jskinner (~jskinner@ Quit (Ping timeout: 480 seconds)
[19:12] * joao (~JL@ Quit (Ping timeout: 480 seconds)
[19:12] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[19:17] <elder> Später
[19:20] <noob2> dmick: you in now?
[19:27] * benpol (~benp@garage.reed.edu) has joined #ceph
[19:29] <benpol> Yahoo! I just successfully doubled the number of PGs in a pool using the "ceph osd pool set <poolname> pg_num <numpgs> --allow-experimental-feature" command Sage mentioned in ceph-devel the other day. Seemed to go very smoothly.
[19:30] * jskinner (~jskinner@ has joined #ceph
[19:34] <dmick> noob2: I am now
[19:34] <dmick> and yay benpol
[19:36] <noob2> dmick: i was looking at your tgt rbd implementation
[19:36] <noob2> have you tried it out?
[19:36] <dmick> I have
[19:36] <noob2> i believe it solves a problem i am encountering
[19:36] <dmick> but only to do relatively-simple I/O. I think I did run an fio on top of an ext4 on it
[19:37] <noob2> when i map the drive and use that as a block backing store, LIO gets into weird buggy code when the rbd device tells it to wait
[19:37] <dmick> and I ran some random-block-I/O tester whose name I forget
[19:37] <dmick> oh?
[19:37] <noob2> bonnie++?
[19:37] <dmick> no, something block
[19:37] <noob2> oh ok
[19:37] <dmick> it seemed...relatively solid
[19:37] <noob2> yeah LIO's devs told me that if rbd doesn't respond fast enough to the scsi write it issues an abort and things get nasty. I'm able to reliably kernel panic it
[19:38] <noob2> i was thinking that if LIO understood rbd this wouldn't happen
[19:38] <noob2> seems you already beat me to it and wrote code :D
[19:38] <dmick> well I hacked up an existing system with a very small set of changes :)
[19:39] <dmick> the version I posted to the stgt ml is slightly different, and probably a better place to start if you want to start there, but it's a quick build from source
[19:39] * jskinner (~jskinner@ Quit (Read error: No route to host)
[19:39] <dmick> I can push that branch to my repo as well
[19:39] * jskinner (~jskinner@ has joined #ceph
[19:39] <noob2> so to use that you're building it against a vanilla kernel?
[19:40] <noob2> i've never done custom kernel drivers like this before
[19:40] <dmick> it's all userland
[19:40] <noob2> ok
[19:40] <noob2> so the backstore is pluggable then
[19:40] <dmick> that was kinda the point...avoid the kernel and its dependencies
[19:40] <noob2> right
[19:40] <dmick> there are built-in plugins
[19:40] <dmick> bs_XXX; now there's a bs_rbd
[19:41] <noob2> :)
[19:41] <noob2> were the lio guys responsive to this?
[19:41] <dmick> I don't think they're connected; stgt is a different project
[19:42] <dmick> I think lio is the kernel impl
[19:42] <noob2> right
[19:42] <noob2> ok
[19:42] <noob2> i'd imagine you'd need to work up some code for the targetcli also right?
[19:42] <dmick> the main reviewer on the stgt list responded, but hasn't yet reviewed or merged
[19:42] <dmick> actually it Just Worked
[19:42] <noob2> lol damn
[19:42] <noob2> when does that happen :D
[19:43] <dmick> see this commit message for usage:
[19:43] <dmick> https://github.com/dmick/tgt/commit/27e3338839b32c10ea57a35c7ad6751da2208987
[19:43] <dmick> that's on the original branch
[19:43] <noob2> you the man!
[19:43] <dmick> the latest code is on the bs_rbd-ml branch
[19:44] <dmick> so it's quick to clone that repo and build the -ml branch to play around if you want
[19:44] <noob2> yeah i really need to get better at building source
[19:45] <noob2> oh tgt.. isn't that the old scsi utils?
[19:50] <noob2> i was thinking you made a patch for this: https://github.com/torvalds/linux/tree/master/drivers/target
[19:51] <dmick> http://stgt.sourceforge.net/
[19:52] <noob2> yeah that's the old scsi stuff
[19:52] <noob2> target is the new kernel utils
[19:52] <dmick> I was specifically aiming at anti-kernel
[19:52] <noob2> gotcha
[19:53] <noob2> yeah that is one annoying thing about the new target
[19:53] <noob2> i need to update my kernel every time i want to try new changes out
[19:58] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[20:00] <noob2> dmick: when you look to modify things like this how do you figure out how they work? there's no docs in this code :)
[20:00] <dmick> oh, you know. experience, parallel structure, hints in names from configuration parameters vs source
[20:01] <dmick> I think I might have known that since there was a backend involving files, there were going to be open/read/write calls somewhere; it's pretty clear that all the bs_* files are parallel, and have ops vectors
[20:01] * benpol (~benp@garage.reed.edu) has left #ceph
[20:01] <noob2> i see
[20:01] <noob2> so you just know after awhile
[20:02] <dmick> then once I read through bs_rdwr.c it was clear that it was just a set of simple tweaks to that
[20:02] <noob2> https://github.com/torvalds/linux/blob/master/drivers/target/target_core_file.c :D
[20:02] * BMDan (~BMDan@ has joined #ceph
[20:02] <noob2> and you just wrap your rbd library calls with their stuff
[20:02] <noob2> not bad
[20:03] <noob2> i may have to fork the kernel and give this a try
[20:03] <noob2> lio already supports qlogic fibre channel so the battle is halfway finished
[20:05] <BMDan> http://ceph.com/docs/master/cephfs/fstab/ is very wrong. "noauto" means it won't mount at startup, and "nodiratime" is a strict subset of noatime.
[20:13] <dmick> I don't know if I'd say noauto is *wrong*, per se, but it's certainly questionable as a suggested usage
[20:14] <dmick> and I would think nodiratime is a subset, but now I'm not sure, and I'm not sure it's not interpreted by the filesystem itself; are you sure?
[20:15] <dmick> noob2: you can't call librbd from the kernel
[20:16] <BMDan> @dmick: One sec, will reference code for the latter question. For the former, it explicitly says, "the Ceph file system will mount automatically on startup", and then provides an option that undoes that statement.
[20:16] <cephalobot> BMDan: Error: "dmick:" is not a valid command.
[20:16] <dmick> you'd have to fuse it somehow with the existing rbd driver
[20:16] <dmick> BMDan: good point
[20:16] <dmick> kind of an odd level of doc there
[20:16] <dmick> and self-contradictory
[20:17] <BMDan> http://lxr.free-electrons.com/source/fs/inode.c#L1539
[20:17] <BMDan> w/r/t nodiratime vs. noatime
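The relationship BMDan points to can be sketched as a small truth-table check. This is a simplified model of the decision order in the linked kernel source, not the kernel code itself, and the function name is hypothetical:

```python
from itertools import product


def atime_updated(is_dir, noatime, nodiratime):
    """Simplified model: does an access update the inode's atime?"""
    if noatime:                  # noatime suppresses updates for every inode
        return False
    if nodiratime and is_dir:    # nodiratime only suppresses them for directories
        return False
    return True


# With noatime set, nodiratime never changes the outcome.
redundant = all(
    atime_updated(is_dir, True, nodiratime) is False
    for is_dir, nodiratime in product([False, True], repeat=2)
)
print(redundant)
```

In this model, adding nodiratime to a mount that already has noatime is a no-op, which is the sense in which it is a strict subset.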
[20:18] <dmick> looks pretty conclusive :)
[20:18] <BMDan> I only knew how to find it because I'd had to find it before. ;)
[20:18] <dmick> would you like to file a doc issue at ceph.com/tracker?
[20:18] <dmick> (or submit a patch)
[20:18] <BMDan> That sounds suspiciously like clicking a link. Which involves moving my mouse.
[20:19] <BMDan> I don't have time for these sorts of shenanigans!
[20:19] <dmick> I know. I ask a lot.
[20:19] <BMDan> I just came here to complain.
[20:19] <dmick> :D
[20:24] <jms_> So ... now that I'm actually where I can access my ceph testbed... Basic setup: Ceph 0.56.1; 1 with MDS + MON, and 2 with MON + 2xOSD .... 7x Clients with the 3.5.7 Kernel (OFED requirement) mounting cephfs directly.
[20:25] <jms_> Without setting the tunables I would get errors like: 2013-02-07 11:40:59.804544 7f626a385700 0 log [WRN] : slow request 30.724700 seconds old, received at 2013-02-07 11:40:29.079789: osd_op(client.4114.1:468 10000000bbb.000001d3 [write 0~4194304] 0.1b9cbc36 RETRY snapc 1=[]) currently waiting for sub ops
[20:26] <jms_> Setting the tunables removed those errors... 6 parallel writes of a 4.1GB file from nodes were done in ~4sec ...
[20:27] <jms_> After the writes finished I did an ls on the mounted cephfs ... took 4.5min before it returned
[20:27] <BMDan> dmick: http://tracker.ceph.com/issues/4058
[20:27] <dmick> thank you sir
[20:28] <BMDan> jms_: You're being undone by block cache, IMHO.
[20:28] <BMDan> That is, you won't be able to read from the FS until the caches have been flushed.
[20:28] <BMDan> You can try an explicit "sync" command (or mount the FS with -o sync) to get around this.
[20:29] <jms_> Now, with the tunables set there's dmesg errors: libceph: mon2 feature set mismatch, my 8a < server's 4008a, missing 40000 ... and I can't mount the FS ... but from what I heard this is expected because of the kernel version I'm running
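The "missing 40000" in that dmesg line is just the set of feature bits the server advertises and the client lacks. A minimal sketch of the bit arithmetic, using only the hex values from the message (variable names are illustrative):

```python
mine = 0x8a        # feature bits reported for the client ("my 8a")
server = 0x4008a   # feature bits reported for the server ("server's 4008a")

# Bits set on the server side that are not set on the client side.
missing = server & ~mine
print(hex(missing))

# Positions of the missing bits, counting from bit 0.
bits = [i for i in range(missing.bit_length()) if (missing >> i) & 1]
print(bits)
```

The only missing bit is bit 18. That would be consistent with the CRUSH tunables jms_ had just enabled, though mapping the bit to that particular feature is an assumption here rather than something stated in the log.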
[20:29] <BMDan> You'll note that performance will decrease greatly, but what you're really doing is exposing the *real* FS performance, rather than testing your memory subsystem's performance.
[20:31] <jms_> So mount the cephfs with '-o sync' or make sure sync is in the OSD filesystem mount options to have it at that point?
[20:32] <BMDan> You can use block cache on the OSD fs.
[20:33] <BMDan> You just can't use it when benchmarking from the client side.
[20:33] <BMDan> That is, mount -t ceph /mnt/ceph -o name=admin,secret=foobar,sync
[20:33] * jms_ nods
[20:34] * BMDan has been doing Ceph for all of a week at this point, though, so take it all with a heaping spoonful of salt.
[20:34] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[20:38] <jms_> Oh ... and why isn't space freed until the clients unmount the FS? I would delete the junk files but ceph -s/-w still showed for the pgmap multiple gigs in use, but once I unmounted it, the space in use went down ...
[20:38] <noob2> dmick: ok so that kinda bombs my idea to roll this into lio target
[20:49] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[20:50] <dmick> if you're gonna be kernel anyway, layering on rbd *ought* to work; if there are issues in that path, maybe we could make sure there are tracker issues about them
[20:50] <dmick> but really tgt is worth a try. non-kernel may be a win here because it's less likely to have deadlock/timeout problems
[20:51] <dmick> and it's just another network service, so it fits well
[20:51] <dmick> but not only that, it could really use some real-world testing
[20:52] <BMDan> What is the current feeling on the fuse client? I asked the other day and didn't get a lot of feedback.
[20:52] <BMDan> I'm trying to push this into production, so all else being equal, I'd really rather not trigger corruption.
[20:54] <noob2> dmick: yeah i think i'm behind you on this. being in the kernel is rough because my servers crash when i have a problem
[20:55] <dmick> BMDan: you mean ceph-fuse or rbd-fuse?
[20:59] * Cube (~Cube@ has joined #ceph
[21:07] <BMDan> dmick: Does rbd-fuse actually present a FS? If so, what's the difference? Is it just hitting a different API on the backend?
[21:09] <dmick> rbd-fuse allows access to rbd images as if they were files in a flat dir
[21:09] <dmick> ceph-fuse allows access to a shared POSIX filesystem
[21:09] <dmick> you really don't want to share RBD images
[21:11] <dmick> but if what you're really after is a chunk of storage with simple semantics, there's a lot less to pay for with rbd-fuse-exported simple images, but still with the cluster redundancy benefits.
[21:13] <BMDan> Well, ultimately, the goal is to roll this up with Cinder and use it for an OpenStack deployment.
[21:13] <BMDan> At which point, I *believe*, it's all rather academic, since Cinder speaks the requisite protocols directly through librados, yes?
[21:15] <BMDan> Okay, a bit of research yields Cinder->libvirt/qemu/librbd/librados, but the point stands.
[21:16] <dmick> right. as long as you use qemu-kvm VMs, you're in with OpenStack
[21:16] <dmick> rbd-fuse might be a way to support Xen VMs there, for example.
[21:17] <dmick> until/unless we do a native driver for Xen<->rbd
[21:17] <BMDan> I believe the intention is indeed to stick with KVM; it's what we're using on our other OpenStacks.
[21:17] <BMDan> Which, BTW, is a ridiculous word.
[21:17] <BMDan> But there it is.
[21:18] <BMDan> "OpenStacks", that is.
[21:18] <dmick> technology makes strange wordfellows
[21:20] <noob2> anyone think rhel7 will have a kernel recent enough to include ceph rbd? :D
[21:20] <ShaunR> so i see that btrfs they say is not ready for production use... how true is that?
[21:20] <ShaunR> should i be using xfs instead?
[21:20] <noob2> ShaunR: still pretty true
[21:20] <noob2> yes xfs
[21:21] <ShaunR> Has there been a lot of disastrous failures using btrfs?
[21:21] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:21] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:24] <janos> "openstacks" makes me hungry for pancakes
[21:24] * loicd (~loic@magenta.dachary.org) Quit ()
[21:25] <BMDan> ShaunR: Well, it's like driving an F1 car. You go very, very fast—much faster than other people. However, you may occasionally hit a brick wall at 200 mph, and nobody will feel the least bit sorry for you.
[21:25] <BMDan> If that is acceptable in your deployment environment, then go for btrfs, and enjoy the ride and let us know all about it.
[21:26] <BMDan> But if there's one thing that I avoid like the dickens, it's bleeding-edge kernel filesystems.
[21:26] <BMDan> Even (most) drivers have, as a worst-case common consequence, bricking some piece of (replaceable) hardware.
[21:27] <iggy> a few distros think it's good enough for production
[21:27] <gregaf1> the bigger problem is that your F1 car starts out at 200mph — but then on your fifth lap of the track you get passed by a Camry, and it's lapping you a little while later
[21:27] <BMDan> However, essentially every filesystem bug, even the "minor" ones, cause data corruption.
[21:29] <BMDan> I recognize that it's hardly the same experience, but going with ext4 for me was like strapping into the car, and then having a bunch of friends push me around the racetrack.
[21:29] <BMDan> XFS is the Camry option.
[21:30] <BMDan> Needless to say, I went with ext4, because I can always throw more hardware at it to get performance; there's no replacement for lost data.
[21:31] <jmlowe1> BMDan: DO NOT USE BTRFS without this patch https://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=d468abec6b9fd7132d012d33573ecb8056c7c43f
[21:31] <BMDan> That and I have enough background in ext* that I can literally poke bits on the disk if it comes down to it, whereas XFS is strange milk and btrfs is brainf*** territory.
[21:31] <BMDan> Thank you jmlowe1 for expressing exactly why I don't use btrfs. ;)
[21:32] <BMDan> I have a slide from a presentation I did a couple years ago with a screenshot of a Drupal module's page that basically said, "If you used version x.y.z, we may have irrevocably deleted all your data. Terribly sorry about that. Fixed in the new version!"
[21:32] * rturk-away is now known as rturk
[21:32] * rturk is now known as rturk-away
[21:33] <BMDan> Using alpha software in production is like using your Shih-Tzu to protect your junkyard. Strictly speaking, it fits the slot that needs filling, but you've missed the point.
[21:33] <nhm> Btw, with lots of OSDs on 1 node, I'm seeing ext4 pulling ahead of btrfs and xfs right now in some of our development branches.
[21:34] <nhm> at least for lots of large sequential writes.
[21:34] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:34] <nhm> 2.7GB/s in one node.
[21:34] <BMDan> You know what'd be a really easy piece of code to write that would be *hilarious*?
[21:35] <janos> 10 GOTO 10
[21:35] * janos tries
[21:35] <BMDan> hehe
[21:35] * loicd (~loic@magenta.dachary.org) Quit ()
[21:35] <BMDan> No, no. Enable OSDs to use librbd as a backend FS.
[21:35] * janos was reaching wayyyyy back on that one
[21:36] <dmick> BMDan: put down the pipe and go play with the kitties for a while. It's getting to you.
[21:37] <jmlowe1> osd's all the way down
[21:37] <gregaf1> you don't need to write any new code for that, just use rbd kernel mounts
[21:37] <BMDan> It's actually rather legit, if you think about it. Imagine twenty clusters of twenty-one machines apiece; one meta-OSD at the front, handling writes with a replica count of 1 into its children.
[21:38] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:38] <lightspeed> or run OSDs on rbd-backed VMs
[21:38] <gregaf1> you could construct a map where that didn't loop, but…oh god the turtles
[21:38] <gregaf1> and the latency
[21:38] <nhm> Or run ceph on virtual machines provided by a cloud provider using ceph. :P
[21:39] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[21:39] <BMDan> That's a lot of kernel we don't have to run versus implementing/calling an rbd client ourselves.
[21:42] * sleinen1 (~Adium@2001:620:0:25:7de3:1c10:3ee5:490d) has joined #ceph
[21:42] * loicd (~loic@magenta.dachary.org) Quit ()
[21:45] <jks> running 0.56.2, ceph went from status HEALTH_OK to suddenly: health HEALTH_WARN 4 pgs peering; 4 pgs stuck inactive; 16 pgs stuck unclean
[21:46] <jks> I have all osds up and in and they have been that way the whole time... how can I find out what happened?
[21:47] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[21:47] <jks> if I do a ceph pg dump, I see 4 pgs in peering state... but how would I get them rolling again?
[21:50] <jks> if I do a ceph pg query on them, I see they all have osds 6 and 12 in the "up" column... hmm
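For tracking a state like this over time, the health summary jks pasted can be broken into per-state counts. A minimal sketch, assuming the summary keeps exactly the "N pgs state; N pgs state" shape shown above (this is not a stable interface, and the function name is hypothetical):

```python
import re


def parse_health(line):
    """Split a health summary into its status word and per-state PG counts."""
    status, _, detail = line.partition(" ")
    counts = {}
    for part in detail.split(";"):
        m = re.match(r"\s*(\d+) pgs (.+)", part)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return status, counts


status, counts = parse_health(
    "HEALTH_WARN 4 pgs peering; 4 pgs stuck inactive; 16 pgs stuck unclean"
)
print(status, counts)
```

Logging these counts periodically would show whether the peering PGs are making progress or are genuinely stuck.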
[21:51] <ShaunR> so then are a lot of you using ext4 over XFS/BTRFS then?
[21:51] <ShaunR> i was about to test XFS but now i'm thinking maybe i should be testing ext4 :)
[22:00] <ShaunR> ok, so it looks like ceph really wants to be placed someplace other than /usr/local/ceph, what prefix are you guys passing configure?
[22:03] <dmick> by "you guys" you mean "in the official build"?
[22:04] <ShaunR> ya
[22:04] <dmick> things go in /usr/bin and /usr/lib (and /var/lib/ceph and /var/log/ceph)
[22:04] <ShaunR> well ./configure by default will put everything in /usr/local
[22:05] <dmick> do_autogen.sh is how I generally do it, and that sets /usr
[22:05] <ShaunR> adding --prefix=/usr/local/ceph puts everything there... but ceph still defaulting to locations like /var/lib/ceph and /etc/ceph/
[22:05] <dmick> yeah, I don't think it's fully prefix-able anymore
[22:06] <ShaunR> do_autogen.sh i don't see, do you mean autogen.sh?
[22:07] <ShaunR> ah, guess if i would have read INSTALL
[22:07] <ShaunR> it does say to use autogen
[22:08] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:08] <dmick> in the root directory
[22:08] <dmick> do_autogen.sh
[22:09] <dmick> (the root of the source)
[22:09] <ShaunR> ya, i dont have that in my source..
[22:09] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:09] <ShaunR> in mine it's named autogen.sh
[22:10] <dmick> ? oh maybe you got the tarball and the tarball renames it or something?
[22:11] <dmick> indeed. again, I've never used the tarball for anything
[22:11] <ShaunR> you're pulling from git?
[22:12] <dmick> ah. no, autogen.sh is just different
[22:12] <dmick> and yes, all git
[22:12] <dmick> I guess do_autogen.sh just isn't in the tarball.
[22:13] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:14] <ShaunR> f it, i'm gunna pull from git
[22:14] <dmick> I'm not saying it's more correct, it's just what I'm used to
[22:15] <ShaunR> can multiple filesystems be used with osd's? For example could one osd run ext4 and another btrfs?
[22:15] <dmick> sure, if you want
[22:16] <ShaunR> Just wondering from a standpoint of migrating to btrfs in the future once it may become stable.
[22:22] <ShaunR> wtf... now where did it get installed
[22:26] <ShaunR> dmick: where does yours end up?
[22:29] <dmick> I don't make install without setting DESTDIR
[22:29] <dmick> when I build/test, I just run from the source dir
[22:29] <dmick> if I want to move it to a different machine, typically I'll make debs
[22:36] * danieagle (~Daniel@ has joined #ceph
[22:41] <dmick> not sure why wget didn't complain for me when getting the new build key
[22:41] * wschulze1 (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[22:48] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[22:54] <ShaunR> dmick: so then your install still wants to use /etc/ceph and /var/lib/ceph?
[22:59] <dmick> yes. there are other config vars for that set in do_autogen.sh though; don't know if everything relies on them or not
[23:00] <dmick> git grep var/lib/ceph shows probably not
[23:00] <jks> anyone else seen pgs stuck peering? - should I consider restarting osds to get them unstuck?
[23:04] * rturk-away is now known as rturk
[23:06] <noob2> dmick: i posted my ceph rbd auto mounting code to github. https://github.com/cholcombe973/rbdmount It mounts ceph rbd devices and exports them over LIO utils
[23:08] <dmick> jks: it seems a little odd
[23:08] <jks> yeah :-|
[23:08] <dmick> noob2: ok, cool
[23:09] <noob2> :)
[23:09] <noob2> i need to edit the readme to show the config file format and how to use it
[23:10] <dmick> jks: can you pastebin ceph osd dump, ceph pg dump to start?
[23:10] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:11] <jks> dmick, you want all pgs or just the dump of the inactive ones?
[23:12] <dmick> let's start with all
[23:13] <jks> dmick, http://pastebin.com/1i9W1wQ8
[23:14] * junglebells (~bloat@CPE-72-135-215-158.wi.res.rr.com) has joined #ceph
[23:18] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[23:19] * The_Bishop (~bishop@2001:470:50b6:0:355c:a685:2329:ce22) Quit (Ping timeout: 480 seconds)
[23:20] * The_Bishop (~bishop@2001:470:50b6:0:edfe:497e:7390:6a2b) has joined #ceph
[23:22] <ShaunR> hmm, ceph health is returning...
[23:22] <ShaunR> 2013-02-08 06:28:25.432174 7f8c5feb1760 -1 unable to authenticate as client.admin
[23:22] <ShaunR> 2013-02-08 06:28:25.432658 7f8c5feb1760 -1 ceph_tool_common_init failed.
[23:22] <ShaunR> also /etc/ceph/keyring is empty
[23:24] <dmick> if that's where your client.admin looks for its keys, then that would explain the failure
[23:26] <jks> dmick, hmm, found out that the mds is logging an error: mds.0.cache.dir(1) mismatch between head items and fnode.fragstat!
[23:27] <jks> probably not related to the osds failing to get those pgs active though
[23:28] <absynth_> how innocent our cluster looks when it sleeps, err, doesn't throw errors
[23:28] <jks> absynth_, it will wake up eventually ;-)
[23:29] <dmick> jks: yeah
[23:29] <darkfader> that which eternal lies... ?
[23:29] <jks> hmm, I restarted osd.6 and now I'm back at HEALTH_OK
[23:29] <dmick> I see the 6,12 pattern too
[23:29] <dmick> is there anything suspicious in osd.6's log?
[23:29] <ShaunR> dmick: i
[23:30] <jks> dmick, basically just bunches of slow requests... and then this: 0 -- >> pipe(0x7ffe6c001a40 sd=52 :6807 s=2 pgs=372 cs=61 l=0).reader got old message 23 <= 196030 0x7ffe4c001e30 pg_notify(1.11(9),4.e(9) epoch 3327) v4, discarding
[23:30] <ShaunR> i'm trying to generate one using... ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' osd 'allow *' -o /etc/ceph/keyring
[23:30] <ShaunR> but i get the same error
[23:30] <ShaunR> ceph_tool_common_init failed. has me wondering if i'm missing something crucial
[23:30] <absynth_> whats the symptom, osd taking forever to come up / in?
[23:30] <absynth_> or not coming up/in at all after a flap?
[23:30] <absynth_> (@jks)
[23:31] <jks> absynth_, sorry?
[23:31] <dmick> 4 pgs in peering, not making progress
[23:31] <absynth_> oh
[23:31] <dmick> all on the same primary,secondary pair
[23:31] <jks> dmick, found the place in the logs right before it happened
[23:31] <absynth_> that sounds like the bug we were seeing yesterday, at least kind of
[23:31] <jks> 2013-02-08 16:51:53.039423 7ffebc2a7700 0 -- >> pipe(0x7ffe6c0025d0 sd=37 :6807 s=2 pgs=311 cs=9 l=0).reader got old message 18 <= 196030 0x7ffe5c10d580 pg_notify(0.12(14),2.10(9),1.11(9),4.e(9) epoch 3325) v4, discarding
[23:31] <jks> 2013-02-08 16:51:53.039517 7ffebc2a7700 0 -- >> pipe(0x7ffe6c0025d0 sd=37 :6807 s=2 pgs=311 cs=9 l=0).reader got old message 19 <= 196030 0x7ffe5c003840 osd_map(3326..3326 src has 2822..3326) v3, discarding
[23:32] <absynth_> do you have many idle PGs, i.e. pgs that haven't been touched in weeks or so?
[23:32] <jks> 2013-02-08 16:51:53.039674 7ffebc2a7700 0 -- >> pipe(0x7ffe6c0025d0 sd=37 :6807 s=2 pgs=311 cs=9 l=0).reader got old message 20 <= 196030 0x7ffe5c1d8950 pg_notify(0.12(14),2.10(9),1.11(9),4.e(9) epoch 3326) v4, discarding
[23:32] <jks> 2013-02-08 16:56:41.670401 7ffe965ec700 0 -- >> pipe(0x7ffe6c001320 sd=44 :6807 s=0 pgs=0 cs=0 l=0).accept connect_seq 72 vs existing 71 state standby
[23:32] <jks> stuff like that
[23:32] <jks> absynth_, I have no idea? :-)
[23:32] * BMDan (~BMDan@ Quit (Quit: Leaving.)
[23:32] <absynth_> let me look for the ml posting
[23:32] <jks> absynth_, it's been running an rsync to copy over data... been running for days
[23:32] <absynth_> oh, you are jens, right?
[23:32] <jks> absynth_, but I couldn't say if that hits all the pgs or something is left out
[23:33] <jks> absynth_, yep
[23:33] <absynth_> ok
[23:33] <absynth_> did you restart any OSDs after you made that posting?
[23:34] <jks> yes, just 5 minutes ago I restarted osd.6 and the system went back into HEALTH_OK
[23:34] <jks> my rsync is screwed though... timed out ;-)
[23:34] <absynth_> heh, figures
[23:34] * sleinen1 (~Adium@2001:620:0:25:7de3:1c10:3ee5:490d) Quit (Quit: Leaving.)
[23:34] <absynth_> it would have been interesting to see if a simple "ceph osd down 6" would have changed anything
[23:34] <jks> I'm trying to rsync over a copy of my data... I have been trying to achieve that for 4 months now, but keep hitting various bugs :-|
[23:35] <absynth_> so right now, you have an OK cluster?
[23:35] <jks> yep
[23:36] <dmick> 'cept for those 4 pgs
[23:36] <jks> nope, they're ok again
[23:36] <jks> pgmap v1484365: 1018 pgs: 1018 active+clean; 3603 GB data, 7407 GB used, 14091 GB / 22161 GB avail
[23:36] <dmick> oh?
[23:36] <dmick> hm
[23:37] <dmick> presumably you've looked at syslog on osd.6 and osd.12 for messages about bad disks, filesystem problems, etc.?
[23:37] <jks> yep, no such things
[23:37] <jks> and other osds on the same servers weren't affected
[23:38] <dmick> and outside-the-cluster reads on OSD data and journal seem to be OK?
[23:38] <dmick> (like ls -lR, dd of=/dev/null, that sort of thing)
[23:38] <jks> you mean just cat'ing a file?
[23:38] <jks> seems to be okay, yes
[23:38] <absynth_> dmick: this issue sounds a bit like the peering/osd boot/osd memleak complex we have been talking about the last days with sage et al.
[23:38] <dmick> how long were those pg's in that state?
[23:39] <absynth_> might still be something completely different, though
[23:39] * vata (~vata@2607:fad8:4:6:91f4:d410:ab68:d13e) Quit (Quit: Leaving.)
[23:39] <dmick> absynth_: could be, I wasn't paying strict attention there
[23:39] <jks> dmick, 27311.9 seconds to be exact ;-)
[23:39] <absynth_> he had slow requests in the 20-thousands of seconds, i.e. about 8 hours
[23:39] <ShaunR> anybody know why i can't generate a client.admin keyring using ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' osd 'allow *' -o /etc/ceph/keyring
[23:40] <dmick> ShaunR: how does it fail?
[23:40] <jks> hmmm... interesting
[23:40] <jks> from osd.6: osd.6 3308 heartbeat_check: no reply from osd.12 since 2013-02-08 15:47:42.725323 (cutoff 2013-02-08 15:47:44.830487)
[23:40] <absynth_> jks: if your issue and the issues we have read about in the last days are indeed connected, a fix for them should be available soon, from what i hear
[23:41] <jks> followed up by: -- submit_message osd_sub_op(unknown.0.0:0 0.85 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v7 remote,, failed lossy con, dropping message 0x7ffe8803f630
[23:41] <absynth_> jks: maybe a network issue?
[23:41] <jks> and then: osd.6 3322 from dead osd.12, dropping, sharing map
[23:41] <jks> absynth_, but I don't see how that could affect only 1 osd on a server, and not the rest
[23:41] <dmick> ShaunR: (and the usual way to create a keyring file is ceph-authtool, I think)
[23:41] <jks> and to my knowledge there should have been no problems with the network at all (it is all running on a dedicated switch in my test lab)
[23:41] <ShaunR> dmick: well right now i'm seeing monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
[23:41] <absynth_> maybe still check dmesg for network-related messages
[23:41] * yoshi (~yoshi@lg-corp.netflix.com) has joined #ceph
[23:42] <dmick> does the keyring file exist?
[23:42] <jks> absynth_, no dmesg messages for weeks
[23:42] <dmick> does it have a client.admin key in it?
[23:42] <ShaunR> because i removed the file, but before that it was unable to authenticate as client.admin
[23:42] <absynth_> weird...
[23:42] <ShaunR> the docs say to use that command above to get or create auth.
[23:42] <absynth_> well, i shall look at the insides of my eyelids for a couple hours.
[23:43] <absynth_> goog luck bug hunters
[23:43] <absynth_> -g+d
[23:43] <dmick> ShaunR: how did you originally deploy this cluster
[23:44] <jks> dmick: ah, the real problem seems to be with osd.12
[23:44] <jks> dmick, heartbeat_map is_healthy 'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15
[23:44] <ShaunR> dmick: I chose to use the rpms for this test.
[23:44] <dmick> ShaunR: OK. What did you do to create the cluster
[23:44] <jks> dmick, it logs that a minute before osd.6 complains... and it keeps logging that
[23:44] <dmick> once the rpms were installed
[23:45] <dmick> jks: sounds suspicious all right
[23:45] <jks> then after a few minutes it logs: log [WRN] : map e3309 wrongly marked me down
[23:45] <jks> and: 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15
[23:46] * jmlowe1 (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has left #ceph
[23:47] <dmick> ShaunR: ?
[23:47] <dmick> if you did mkcephfs, mkcephfs should have created a keyring with the client key with $BINDIR/ceph-authtool --create-keyring --gen-key -n client.admin $dir/keyring.admin
[23:47] <ShaunR> dmick: i think i got it figured out... the docs didn't place an absolute path on the keyring when i ran mkcephfs the first round
[23:47] <dmick> and then you would have copied that keyring.admin to the machine you want to be running as your client
[23:47] <ShaunR> so it placed the keyring in my current dir... which wasn't /etc/ceph
[23:48] <dmick> You mean this one:
[23:48] <dmick> cd /etc/ceph
[23:48] <dmick> sudo mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring
[23:48] <ShaunR> lol, ya
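Editor's note: the misplaced keyring follows from how a relative path is resolved against the current working directory. The sketch below is illustrative only. It assumes mkcephfs treats its -k argument like any ordinary file path, which is consistent with what ShaunR reports but is not confirmed in the log. The helper name resolve and the directory /root are hypothetical.

```python
import posixpath

def resolve(path, cwd):
    """Resolve path the way a process running in cwd would (hypothetical helper)."""
    return posixpath.normpath(posixpath.join(cwd, path))

# Relative keyring name: the file lands wherever the command was run from.
print(resolve("ceph.keyring", cwd="/root"))
print(resolve("ceph.keyring", cwd="/etc/ceph"))

# Absolute path: the working directory no longer matters.
print(resolve("/etc/ceph/ceph.keyring", cwd="/root"))
```

Running the documented "cd /etc/ceph" step first, or passing an absolute path to -k, both avoid the problem.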
[23:49] <ShaunR> HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery 21/42 degraded (50.000%)
[23:49] <dmick> do you have one OSD but pool size 2?
[23:49] <ShaunR> is that normal to see right off the bat? I mean i only have 1 server running ceph right now
[23:50] <ShaunR> I didn't specify a pool size anywhere
[23:51] <dmick> so it's 2
[23:51] <dmick> do you have one OSD?
[23:51] <ShaunR> ya probably
[23:51] <ShaunR> yep, only 1
[23:51] <dmick> that's why then
[23:51] <dmick> can't have two replicas on one storage daemon
[23:51] <ShaunR> makes sense...
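Editor's note: the 50% figure in the HEALTH_WARN line above follows from simple arithmetic. The sketch below assumes 21 objects in pools with replica size 2 (the default dmick mentions) and a single OSD that can hold only one copy of each object. The function name degraded_ratio is hypothetical, and this models the ratio only; it is not how Ceph computes it internally.

```python
def degraded_ratio(num_objects, pool_size, copies_placed):
    """Return (missing, wanted, percent) of object copies for a replicated pool.

    Illustrative model: a copy counts as degraded when it is wanted
    but cannot be placed on a distinct OSD.
    """
    wanted = num_objects * pool_size
    present = num_objects * min(copies_placed, pool_size)
    missing = wanted - present
    return missing, wanted, 100.0 * missing / wanted

# Assumed values: 21 objects, size 2, one OSD available.
missing, wanted, pct = degraded_ratio(num_objects=21, pool_size=2, copies_placed=1)
print(f"recovery {missing}/{wanted} degraded ({pct:.3f}%)")
```

Under these assumptions, adding a second OSD (or lowering the pool size to 1) brings the missing count to zero.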
[23:54] * loicd looking at http://gitbuilder.sepia.ceph.com/gitbuilder-quantal-i386/log.cgi?log=38dd59ba7cebe1941b864646246f98c75bb395c5
[23:55] <dmick> yes, that gitbuilder is not yet right
[23:55] <dmick> clearly one easy reason is boost_program_options :)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.