#ceph IRC Log


IRC Log for 2013-04-30

Timestamps are in GMT/BST.

[0:05] * dwt (~dwt@128-107-239-234.cisco.com) Quit (Quit: Leaving)
[0:05] <mikedawson> sage: I'm back
[0:05] <sage> care to try the latest wip-mon-compact? it tries to compact on trim.. can you see if that lowers the growth speed?
[0:06] <mikedawson> sage: will install it now, then be off for a few hours. Will pick it back up after bedtime for my kids
[0:07] <sage> k thanks
[0:07] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[0:07] * gmason (~gmason@hpcc-fw.net.msu.edu) Quit (Ping timeout: 480 seconds)
[0:08] <mikedawson> sage: should I manually compact when I start, or is that handled in the new version?
[0:08] <sage> it'll happen automatically
[0:08] <mikedawson> sage: nice!
[0:09] <Teduardo> is there a ceph driver for cinder integrated into ubuntu 13.04?
[0:10] <mikedawson> sage: I lost mon.c from quorum it looks like. My guess is it will come back after restarting with the new build
[0:10] <sage> yeah
[0:11] <mikedawson> it was 3.1GB whereas the other two are 7.2GB
[0:14] <mikedawson> sage: mon.a compacted. mon.b and mon.c are unchanged
[0:14] <mikedawson> no quorum in this state
[0:14] <sage> b and c got the new code too
[0:14] <sage> ?
[0:15] <mikedawson> all reporting ceph version 0.60-722-g0f7d951 (0f7d951003b09973b75b99189a560c3a308fef23)
[0:16] * BillK (~BillK@58-7-104-61.dyn.iinet.net.au) has joined #ceph
[0:17] <mikedawson> sage: all three also started quick (no delay like I saw before manually compacting). Plus they no longer show a compacting message on the cli
[0:17] * diegows (~diegows@host28.190-30-144.telecom.net.ar) Quit (Ping timeout: 480 seconds)
[0:18] <sage> you should see something like 'bootstrap -- triggering compaction' in the logs
[0:18] <sage> shortly after startup
[0:19] <mikedawson> sage: 2013-04-29 22:19:06.927673 7fc56de317c0 10 mon.c@2(probing) e1 bootstrap -- triggering compaction
[0:20] <sage> is there a 'bootstrap -- finished compaction'?
[0:20] <mikedawson> sage: not on that mon
[0:20] <sage> that would explain it :) probably in D state
[0:22] <mikedawson> I manually compacted mon.b, it logged
[0:22] <mikedawson> 2013-04-29 22:17:57.818577 7f96ef86b7c0 -1 compacting monitor store ...
[0:22] <mikedawson> 2013-04-29 22:18:09.686575 7f96ef86b7c0 -1 done compacting
[0:22] <mikedawson> to the cli, but it doesn't show a 'bootstrap -- finished compaction' either
[0:23] <sage> hrm
[0:27] <mikedawson> sage: so mon.a compacted by itself, and I had to manually compact mon.b and mon.c. Right now mon.a and mon.b are in quorum. mon.c is attempting to synchronize
[0:31] <mikedawson> sage: mon.b goes down to 29M at some point http://pastebin.com/raw.php?i=KKDNiJZ2 then seems to catch up, but it is stuck looping through that process
[0:32] <sage> making progress?
[0:32] <sage> oh, it goes down to 29M again and restarts the sync?
[0:32] <sage> could try putting 'mon compact on bootstrap = false' in ceph.conf.. that might be too slow
[0:33] <mikedawson> sage: twice now mon.c has segfaulted http://pastebin.com/raw.php?i=jDZ6eN64
[0:33] <sage> is that you running the mon_status command?
[0:35] <mikedawson> no, I'm doing du watching the file size, tailing the logs, and occasionally doing ceph -s
[0:39] <mikedawson> sage: 'mon compact on bootstrap = false' in mon.c's ceph.conf worked. it is back in the quorum
[0:39] <sage> ok cool, that's a bad idea it seems :)
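Spelled out, the workaround mikedawson applied is a one-line ceph.conf setting; placing it under a [mon] (or [mon.c]) section is an assumption here:

```ini
[mon]
    ; disable the (slow) compaction the wip-mon-compact branch
    ; triggers at startup; compaction on trim still happens
    mon compact on bootstrap = false
```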
[0:40] <sage> any indication of whether it is growing at the same rate as before?
[0:40] <sage> actually maybe best to wait a couple hours to see steady state
[0:40] <mikedawson> sage: I've seen it compact already
[0:41] <mikedawson> post-compact seems to be around 250MB
[0:41] <mikedawson> sage: I'll let it run and report back. how late will you be around?
[0:41] <sage> i'll check as late as 10pm
[0:42] <sage> with dinner/kids in between
[0:42] <mikedawson> I'll have results for you. Thanks!
[0:42] <sage> thanks for testing!
[0:43] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[0:44] <SpamapS> Question: is it possible to boot linux with an rbd root?
[0:44] <SpamapS> Follow up question: will rbd (kernel level) ever support clones?
[0:47] <gregaf> I think no, because you need to load the rbd support from somewhere local, right? and yes, elder's patches are in review and testing right now
[0:47] <SpamapS> so RBD is module-only ?
[0:47] <dmick> SpamapS: well, are you thinking about a PXE-booted machine?
[0:47] <SpamapS> dmick: can be initrd on a local disk too
[0:48] <SpamapS> This is for baremetal openstack nova
[0:48] <SpamapS> instead of booting a VM, we are booting real hardware
[0:48] <dmick> I *think* you can bind rbd into the kernel; I don't know if anyone does, but I think maybe we do for UML testing envs
[0:48] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[0:48] * coyo (~unf@00017955.user.oftc.net) Quit (Ping timeout: 480 seconds)
[0:49] <dmick> elder would know best
[0:49] * rustam (~rustam@ has joined #ceph
[0:49] <gregaf> oh, like with rbd as a root drive but loading linux from elsewhere first? can you do that?
[0:50] <SpamapS> gregaf: right
[0:50] <gregaf> I think that should be fine then, yes
[0:50] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:50] <gregaf> (I am obviously not up on all the ways to boot linux)
[0:51] <elder> I'm on my way out the door but... I have never tried to boot Linux with an rbd root.
[0:51] <gregaf> sage, joao: I'm still trying to figure out how #4837 happened following my election deferral commits, but once the behind leader got elected the reason it all went to hell was http://tracker.ceph.com/projects/ceph/repository/revisions/41987f380f4fc72762486ad7ddd0ab63173dc5e3, which bootstraps from handle_last() — and the pre-conditions there guarantee we are going to sync rather than start an election, unlike with the other bootstrap() c
[0:52] <gregaf> if one of you has time to look over that code and check my work again that'd be great
[0:52] <elder> Maybe if rbd and the necessary network drivers were present in initrd it would work. Sort of like a diskless NFS node. It would be cool I think.
[0:52] <gregaf> (I'll figure out the patches for not bootstrapping inappropriately)
[0:52] <dmick> elder: I meant whether you'd linked rbd into the kernel (although, is that necessary, SpamapS? can't initrd contain modules?)
[0:52] <SpamapS> dmick: right, we can load initrd from PXE too, so thats not really an issue.
[0:52] <elder> I test rbd with it linked directly into the kernel too, in a user-mode Linux (UML) environment.
[0:52] <dmick> ^bingo
[0:52] <dmick> ok
[0:53] <elder> But there too, I haven't tried that in a "normal" kernel build.
[0:53] <SpamapS> sounds like "probably" is the answer to my 1st question
[0:53] <dmick> I'm pretty sure I've linked it into a kvm boot
[0:53] <SpamapS> (since initrd is fine)
[0:53] <elder> I'd love to hear about it if you do it.
[0:53] <SpamapS> so, 2nd question, can I then boot from clones? The docs say no, that kernel rbd doesn't support clones.
[0:54] <elder> Well, get the first thing done first... To boot you need to map the image. I'm not sure how that will work in initrd but maybe... But yes, I expect you could boot from clones too if you got a layered image to work.
[0:55] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:55] <elder> Again, I'd love to hear about it, I think it's an interesting use case.
[0:55] <gregaf> SpamapS: patches in review, like I said — are they going to make the 4.0 window do you think, elder?
[0:55] <elder> I have to go though. I'll be back online again in a few hours.
[0:55] <SpamapS> elder: initrd will just run scripts interpreting kernel commandline to find the image, mon's, etc.
[0:55] <elder> 4.0? Is that the next release?
[0:55] <gregaf> well, I assume
[0:55] <SpamapS> gregaf: ah cool, missed that
[0:55] <gregaf> 3.9 just got pushed out yesterday or something, right?
[0:55] <elder> They should, unless Linus has a problem with these late-breaking patches.
[0:56] <elder> But 10 comes after 9.
[0:56] <SpamapS> Yeah I don't think it will be 4.0.
[0:56] <gregaf> ah, right, I forgot we counted that way
[0:56] <SpamapS> Linus just did 3 to commemorate 2 decades
[0:56] <SpamapS> (and, I suspect, to giggle at what broke)
[0:56] <lurbs> I heard he did it just to ruin all software that naively expected 2.x.x.
[0:56] <gregaf> huh? no, he got tired of the ridiculous numbers in the 2.6 series
[0:57] <gregaf> he made up a different reason each time anybody asked and wouldn't take that for an answer
[0:57] <SpamapS> Oh nice
[0:57] <elder> 2.6.39 -- 39 was "too high a number for humans to feel comfortable with" or something to that effect.
[0:57] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) Quit (Quit: leaving)
[0:57] <SpamapS> gregaf: maybe *that* answer was also one of the answers he made up...
[0:57] <gregaf> nah, otherwise he could have waited
[0:57] <gregaf> a bunch of the core people were bummed he didn't wait until after 2.6.42 ;)
[0:58] <SpamapS> Ok well it sounds like what I want to do (boot a box using an rbd root based on a clone) will at least be feasible "soonish"
[0:59] <sage> yup!
[0:59] <dmick> you could test half of it with a non-clone image today
[0:59] <SpamapS> Right, and that would probably be fine for the solution I have
[1:00] <SpamapS> as I could just have glance store images in a different store than cinder wants volumes in, and then cinder wouldn't even bother with a snapshot :)
[1:00] <dmick> although I'm not sure how you force rbd to map an image (create a device) before the kernel wants it for root-mounting
[1:01] <sage> that will be the trick
[1:01] <dmick> but I'd expect some of the software-raid stuff to have solved that already somehow
[1:01] <joao> gregaf, I'll be happy to take a look in the morning
[1:01] <joao> I'm in no shape to do it right now though
[1:01] <SpamapS> dmick: that would be done in initrd exactly
[1:02] <SpamapS> dmick: a tiny little collection of scripts and tools that can make magic happen :)
[1:02] <dmick> where does that go? is there some singleuser-shellscript mechanism?...
[1:02] <sage> gregaf: is the solution there to make the first bits of bootstrap() send cancellations for syncs in progress?
[1:03] <SpamapS> dmick: varies by distro. In Ubuntu and Debian, /usr/share/initramfs-tools is the root FS that gets squashfs'd
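A bash sketch of the kind of initramfs hook SpamapS describes. The kernel-cmdline parameter names here (rbdroot=, rbdmons=) are invented for illustration, and the actual map step is left as a comment since it needs a live cluster:

```shell
# Hypothetical initramfs hook: pull rbd root parameters off the
# kernel command line so the image can be mapped before root-mount.
parse_rbd_cmdline() {
    # $1: contents of /proc/cmdline
    for arg in $1; do
        case "$arg" in
            rbdroot=*) RBD_SPEC="${arg#rbdroot=}" ;;  # pool/image
            rbdmons=*) RBD_MONS="${arg#rbdmons=}" ;;  # mon addresses
        esac
    done
    RBD_POOL="${RBD_SPEC%%/*}"
    RBD_IMAGE="${RBD_SPEC#*/}"
}

# In a real initrd, once networking is up, mapping would follow, e.g.:
#   rbd map "$RBD_POOL/$RBD_IMAGE"   # with mon/key options as needed
# leaving the resulting /dev/rbd* device to hand to the root mounter.
```

The real mapping step would need rbd support (module or built-in) plus networking inside the initrd, per the discussion above.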
[1:03] <gregaf> no, it's going into bootstrap and starting a sync, but there's no "I'm not the leader!" notification that goes out to other people
[1:03] <gregaf> sage: ^
[1:04] <joao> gregaf, sage, fyi, I've been unable to grow the store significantly with just a couple of OSDs and rados bench while thrashing; going to fire a teuthology task over a couple of miras and hopefully that will cause havoc on the store overnight
[1:04] <gregaf> basically we really want to prevent it from ever even reaching that point, thus my elector changes
[1:04] <gregaf> but apparently they failed
[1:04] <gregaf> :/
[1:04] <sage> it seems like if we rely on that the design is broken.. ceph-mon can restart and go into bootstrap() at any time
[1:04] <gregaf> sage: right, but a restart changes everything, right?
[1:04] <gregaf> or are you saying bootstrap is allowed from anybody, anywhere?
[1:04] <sage> yeah
[1:04] <gregaf> in which case yes, the design is broken, but I think in all other cases this isn't a problem
[1:05] <dmick> SpamapS: OK, thanks for the pointer
[1:05] <gregaf> (I'm thinking band-aids, then surgery after release)
[1:05] <gregaf> (or else we need to agree to push back and do some surgery, I'm not sure which is a better idea)
[1:05] <sage> can't we just ignore any sync messages that are nonsense? that seems like an easier and safer bandaid
[1:06] <gregaf> what would your "nonsense" measurement be?
[1:06] <sage> hmm, no election epoch in the message. :(
[1:07] <joao> I thought we were already dropping sync messages whenever the right conditions weren't met
[1:07] <sage> browsing the sync code it looks like it's push based.. is that right?
[1:07] <joao> sage, 'providers' push chunks to 'requesters'
[1:08] <joao> but only upon an initial request
[1:08] <sage> seems like it would be simpler/safer as pull. then the leader/provider could just ignore or say 'go away' at any time
[1:08] <gregaf> joao: yeah, but the first obviously-broken thing we could check for here is when ex-but-nobody-knows-yet leader gets a sync message from itself
[1:08] <joao> and I believe it's the leader who gets to decide with whom a given monitor should sync in the first place
[1:08] <joao> gregaf, ah!
[1:09] <joao> that ought to be simple enough to handle: drop any messages if their target matches us
[1:09] <gregaf> sage: we can't be pure pull because we have to pause trimming, and other than that arrangement it actually basically is pull — but the wrapping could be simplified a lot more than it is
[1:09] <sage> i see
[1:10] <gregaf> joao: and then rely on timeouts to clean it all up? that might work
[1:10] <joao> gregaf, those timeouts are in place already
[1:10] <gregaf> yeah
[1:10] <gregaf> however, this actually becomes impossible if the code to prevent a too-far-behind monitor from being elected leader were working, and apparently I failed and it's not
[1:11] <joao> and in the meantime, the other monitors will eventually figure out that that monitor dropped out of quorum
[1:11] <gregaf> so we should fix that, and then we can put in place the "drop stuff from myself" band-aid as well
[1:11] <gregaf> or perhaps just assert out, which I actually added in already as a debugging measure, since I don't think we can get into this scenario any other way
[1:11] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[1:13] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[1:16] <sage> gregaf: 4858 looks good to me
[1:17] <gregaf> I need to check the test results — it doesn't look good but it also doesn't look like my fault ;)
[1:17] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit ()
[1:18] * vata (~vata@ Quit (Quit: Leaving.)
[1:19] <gregaf> okaaaayy…..
[1:19] <gregaf> 2013-04-29T15:31:09.385 DEBUG:teuthology.orchestra.run:Running []: 'sudo chmod 777 /var/log/ceph'
[1:19] <gregaf> 2013-04-29T15:31:09.393 INFO:teuthology.orchestra.run.err:chmod: cannot access `/var/log/ceph': No such file or directory
[1:19] <gregaf> that's on a bobtail backport and sounds a little familiar, but the tests are all from the ceph-qa-suite bobtail branch…do I need to do something with teuthology too?
[1:20] <gregaf> sjust: sage: one of you probably knows about that ^
[1:21] <sage> you probably skipped the - install: task?
[1:21] <gregaf> schedule_suite.sh ;)
[1:21] <gregaf> and no, I see it running shortly before that in the log
[1:21] * BManojlovic (~steki@fo-d- Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:22] <sage> was it the bobtail ceph-qa-suite?
[1:22] <sage> where is the full log?
[1:22] <gregaf> yeah, should have been
[1:22] <gregaf> ./a/gregf-2013-04-29_15:16:59-temp_test-wip-4858-reset-bobtail-testing-basic — any of those
[1:22] <sage> the bobtail suite doesn't have install tasks in the jobs.. it has to run against the bobtail branch of teuthology.
[1:23] <gregaf> okay; I can probably specify that somehow?
[1:23] <sage> the 6th or so arg to schedule_suite.sh :) can run with no args for usage
[1:23] <gregaf> k
[1:23] <davidz> sstan: sorry, missed your question. No not a common problem.
[1:24] <gregaf> once more with feeling! and someday my non-bobtail branch will be usable…
[1:25] <gregaf> maybe it's not pulling from where I think and the .gitignore thing is busting them; I'll rebase it I suppose
[1:27] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:29] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[1:30] * rturk is now known as rturk-away
[1:31] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) has joined #ceph
[1:33] * BillK (~BillK@58-7-104-61.dyn.iinet.net.au) Quit (Quit: Leaving)
[1:33] * BillK (~BillK@58-7-104-61.dyn.iinet.net.au) has joined #ceph
[1:50] * lofejndif (~lsqavnbok@1SGAAAHIR.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[1:55] <median> can i use a S3 client to connect to rados gw?
[2:00] <phantomcircuit> im trying to shrink the journal device being used
[2:01] <phantomcircuit> i shutdown the osd
[2:01] <phantomcircuit> flushed the journal
[2:01] <phantomcircuit> resized the partition
[2:01] <phantomcircuit> and created a new journal with --mkjournal
[2:01] <phantomcircuit> but now im getting 2013-04-30 02:03:32.155644 2f171e95780 -1 filestore(/var/lib/ceph/osd/ceph-0) mount failed to open journal /dev/sda2: (22) Invalid argument
[2:01] <phantomcircuit> any ideas?
[2:02] <phantomcircuit> /dev/sda2 is a partition on an actual sata device
[2:02] <phantomcircuit> so i doubt it's the O_DIRECT issue
[2:03] <dmick> strace show errors?
[2:05] <median> has anyone gotten a s3 client to work with radosgw?
[2:07] <phantomcircuit> dmick, not sure too many threads one sec
[2:09] <dmick> median: a whole bunch of people, yes
[2:10] * bergerx_ (~bekir@ Quit (Remote host closed the connection)
[2:10] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[2:11] <phantomcircuit> dmick, all the open() calls seem to succeed
[2:11] <phantomcircuit> indeed nothing seems to have returned -1
[2:11] <dmick> so it's probably something the osd is trying to verify
[2:12] <median> dmick: i'm doing the 5 minute install, i've gotten everything working and i've created a block device, i've also set up radosgw
[2:12] <mikedawson> sage: looks good so far (all three are in quorum and under 300MB)
[2:12] <sage> mikedawson: what distro are you on?
[2:12] <median> but im having trouble with the fcgi module i think
[2:12] <mikedawson> sage: raring
[2:12] <sage> what version of libleveldb is it?
[2:13] <phantomcircuit> dmick, any idea how to figure out what it is? the generic invalid argument error isn't super helpful :(
[2:13] <sage> i wonder if this is version specific.. we haven't seen it on precise
[2:13] <median> can you give me an example of an client that works so I can try?
[2:13] <dmick> phantomcircuit: read the source, like I am? .. :)
[2:13] <phantomcircuit> i am lol
[2:13] <mikedawson> sage: 1.9.0-1
[2:13] <dmick> it's possible some debug would help more too, but I don't remember which subsystem is responsible
[2:14] <phantomcircuit> dmick, how can i just turn debugging on for everything
[2:14] <dmick> oh you don't wanna do that
[2:14] <phantomcircuit> just wont reload the config for anything else for a minute..
[2:15] <phantomcircuit> i think it's failing early enough that the torrent of log data should still be parseable
[2:15] <phantomcircuit> maybe not
[2:15] <sage> mikedawson: performance seems ok?
[2:15] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) has joined #ceph
[2:15] <dmick> I would try setting debug filestore = 20
[2:15] <sage> it is potentially doing a bunch of work every few minutes to compact (each time it trims some old data)
[2:15] <dmick> which you can do on the osd commandline if you start the osd by hand
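As a ceph.conf fragment, the suggested debug setting would look like this (the [osd.0] scoping is an assumption; per dmick it can equally go on the command line):

```ini
[osd.0]
    ; verbose FileStore logging, as suggested above; other
    ; subsystems (e.g. "debug journal") follow the same pattern
    debug filestore = 20
```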
[2:16] <dmick> median: boto works, as does libS3
[2:16] <mikedawson> 16k writes with rados bench have been significantly slower than I was seeing last week, but I haven't tracked down why
[2:16] <dmick> those seem popular
[2:16] <dmick> I've also heard people using cyberduck
[2:17] <sage> hmm, that shouldn't be affected by this at least, tho.
[2:17] <sage> tho disconcerting
[2:17] <mikedawson> sage: with 3x replication I'm seeing about 20iops/osd where I was getting more like 70 before
[2:18] <mikedawson> I'll dig into that tonight/tomorrow
[2:19] <phantomcircuit> dmick, debug journal (heh seems obvious now...)
[2:19] <phantomcircuit> 2013-04-30 02:23:26.479980 2ee71be6780 3 journal journal_replay open failed with Invalid argument
[2:22] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) Quit (Quit: Leaving.)
[2:22] <dmick> there should have been more info before that?...
[2:23] <dmick> FileJournal::_open, maybe?
[2:23] <phantomcircuit> dmick, http://pastebin.com/raw.php?i=8s51Ziqi
[2:24] <phantomcircuit> problem is in FileJournal::open somewhere
[2:25] <dmick> open journal size 9744416768 > current 994050048
[2:25] <dmick> is the reason for EINVAL
[2:25] <phantomcircuit> huh that's weird
[2:26] <phantomcircuit> it's a block device with journal size = 0
[2:26] <phantomcircuit> possibly mkjournal is detecting a different size?
[2:26] <dmick> but size = 0 just means 'use the whole block dev'
[2:27] <dmick> presumably the block dev changed size; how you get the OSD to STFU is another question
[2:27] <dmick> did you --flush-journal before changing it?
[2:27] <dmick> yes, you said so
[2:27] <phantomcircuit> 2013-04-30 02:31:14.189948 3b969c1d780 10 journal header: block_size 4096 alignment 4096 max_size 9744416768
[2:27] <phantomcircuit> wat
[2:27] <phantomcircuit> dmick, yeah i did
[2:28] <phantomcircuit> dmick, i assume osd journal size is in MiB ?
[2:28] <dmick> so you believe your device should be larger than 9GB?
[2:29] <phantomcircuit> no the correct size is ~950 MB
[2:29] <phantomcircuit> it was previously ~10 GB
[2:29] <dmick> oh. so that's saying it still believes it ought to be the old size
[2:30] <phantomcircuit> oh god it started with debugging on
[2:30] <phantomcircuit> -rw-r--r-- 1 root root 349M Apr 30 02:35 /var/log/ceph/ceph-osd.0.log
[2:30] <phantomcircuit> lol
[2:32] <phantomcircuit> HEALTH_WARN 12 pgs recovering; 180 pgs recovery_wait; 192 pgs stuck unclean; recovery 2691/289454 degraded (0.930%)
[2:32] <phantomcircuit> ok much better
[2:32] <dmick> wait, what'd you do to fix it?
[2:33] <phantomcircuit> dmick, wrote zeros to the journal device and then did --mkjournal
[2:33] <phantomcircuit> i dont really want to know why it worked
[2:33] <phantomcircuit> but it did
[2:36] * sagelap (~sage@ Quit (Quit: Leaving.)
[2:38] <dmick> ah, so, probably the journal file itself has marks in it
[2:38] * rustam (~rustam@ Quit (Remote host closed the connection)
[2:38] <dmick> read_header()
[2:39] <dmick> surprised --mkjournal didn't do that. Wonder if that failed, but not loudly enough
[2:41] <dmick> or it looks like maybe it won't overwrite an existing journal
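Putting phantomcircuit's eventual fix together, the shrink sequence looks roughly like this. The OSD id, device, and init-script invocation are placeholders, and the destructive steps are guarded so the sketch is inert unless JOURNAL_DEV is set:

```shell
# Journal-shrink sketch. The crucial extra step is zeroing the old
# journal header: --mkjournal won't overwrite an existing journal, so
# the stale header keeps advertising the old max_size and the mount
# fails with EINVAL ("open journal size X > current Y").
OSD_ID=${OSD_ID:-0}
JOURNAL_DEV=${JOURNAL_DEV:-}   # e.g. /dev/sda2; left unset here so
                               # the destructive steps are skipped
if [ -n "$JOURNAL_DEV" ]; then
    service ceph stop osd."$OSD_ID"          # stop the OSD first
    ceph-osd -i "$OSD_ID" --flush-journal    # drain pending writes
    # ... shrink the partition here ...
    dd if=/dev/zero of="$JOURNAL_DEV" bs=1M count=4  # wipe stale header
    ceph-osd -i "$OSD_ID" --mkjournal        # write a fresh journal
    service ceph start osd."$OSD_ID"
fi
```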
[2:41] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) has joined #ceph
[2:42] <median> hi. i'm a little closer to getting the s3 gateway working
[2:42] <median> now i get a 403 forbidden when i connect to the radosgw, do i have to encode my secret key somehow for the client?
[2:43] <dmick> whatever client you're using should have instructions on how to set up keys
[2:43] <dmick> often it's env vars
[2:44] <median> i'm just wondering if the secret key has to be transformed somehow or do i use it as is?
[2:44] <dmick> you say "the secret key" like there's only one; there are many
[2:47] <median> i'm talking about the secret key for a s3 user in the rados-gw
[2:48] <dmick> boto talks about these:
[2:48] <dmick> AWS_ACCESS_KEY_ID - Your AWS Access Key ID
[2:48] <dmick> AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key
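And to answer the earlier "do I use it as is?" question: the key pair radosgw-admin prints is used verbatim, with no re-encoding. Values below are placeholders:

```shell
# The access/secret key pair from `radosgw-admin user create` is
# used exactly as printed -- no transformation needed by S3 clients.
export AWS_ACCESS_KEY_ID='PLACEHOLDER_ACCESS_KEY'
export AWS_SECRET_ACCESS_KEY='PLACEHOLDER_SECRET_KEY'
```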
[2:48] <median> i'll set those and see
[2:49] <dmick> Ceph has its own cephx keys that control access to the cluster by client; one client is the radosgw process itself
[2:49] <dmick> Ceph's doc talk about setting up those keys so that radosgw can talk to the cluster
[2:50] <dmick> have you used radosgw-admin to create a user?
[2:50] <dmick> if that worked, the radosgw-to-cluster auth is working
[2:51] <dmick> (actually, that might not even be true; radosgw-admin may use the client.admin key)
[2:51] <median> wow i created a fresh user and its working
[2:52] <median> using s3browser.com to connect to rados-gw
[2:53] <dmick> ah, a windows client. ok
[2:54] <median> also that, have python running as well
[2:56] * portante|ltp (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[2:59] <median> works great, thanks!
[2:59] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) Quit (Quit: Leaving.)
[3:01] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) has joined #ceph
[3:08] * mrjack (mrjack@office.smart-weblications.net) has joined #ceph
[3:08] <elder> dmick, I just updated my ceph tree and rebuilt, and am now restarting ceph. It's doing something different from what I'm used to.
[3:08] <elder> {2024} elder@speedy-> ./ceph health
[3:08] <elder> HEALTH_WARN
[3:08] <elder> {2024} elder@speedy->
[3:09] <dmick> not much there
[3:09] <dmick> ./ceph -s might say more
[3:09] <elder> I'm used to it saying something about percentage ready or something for a little bit, and then HEALTH_OK
[3:09] <elder> ./ceph health -s
[3:09] <elder> health HEALTH_WARN
[3:09] <elder> monmap e1: 3 mons at {a=,b=,c=}, election epoch 6, quorum 0,1,2 a,b,c
[3:09] <elder> osdmap e7: 2 osds: 2 up, 2 in
[3:09] <elder> pgmap v26: 24 pgs: 24 active+clean; 9518 bytes data, 389 GB used, 86950 MB / 474 GB avail
[3:09] <elder> mdsmap e5: 1/1/1 up {0=a=up:active}
[3:10] <dmick> hum. I'm not seeing the reason for the health warning
[3:10] <elder> Well, my naive eyes don't see anything that seems amiss either.
[3:11] <elder> Running client tests doesn't seem to show anything wrong, but then again it ought to be robust, right?
[3:12] <dmick> istr that maybe ./ceph health --format=json might show more?...
[3:13] <elder> "health_detail": "low disk space!"},
[3:13] <elder> Maybe that's it.
[3:13] <dmick> ah, there you go
[3:13] <dmick> probably a warn in the logs
[3:13] <elder> I'm just running in rbd, so I don't designate a lot of space for the devices.
[3:13] <elder> But how can it be "low" when it's empty?
[3:13] <elder> I think the definition of low should scale with the storage available, perhaps.
[3:14] <dmick> disk space in the filesystems holding the osd data
[3:14] <elder> 83%?
[3:14] <elder> (full)
[3:14] <elder> Ohhh.
[3:14] <dmick> } else if (stats.latest_avail_percent <= g_conf->mon_data_avail_warn) {
[3:14] <elder> Is it because of the space available relative to the size of my rbd image?
[3:14] <elder> OK, where do I get that global config value
[3:15] * Tamil (~tamil@ Quit (Quit: Leaving.)
[3:16] <dmick> OPTION(mon_data_avail_warn, OPT_INT, 30)
[3:16] <elder> Where did that come from?
[3:16] <dmick> common/config_opts.h
[3:16] <elder> OK I see that now.
[3:16] <elder> Thanks.
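For a deliberately cramped dev box, the threshold dmick found can be lowered in ceph.conf (15 is just an example value; the shipped default is 30):

```ini
[mon]
    ; warn when a mon's filesystem has less than this percent free
    mon data avail warn = 15
```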
[3:17] <dmick> and, yeah, I assume committed space in the rbd image counts against latest_avail_percent, but I dunno
[3:18] <dmick> ceph health detail probably shows it
[3:18] <elder> Yes it does.
[3:18] <elder> 2013-04-29 20:18:18.223746 mon.1 -> 'HEALTH_WARN' (0)
[3:18] <elder> mon.a addr has 17% avail disk space -- low disk space!
[3:18] <elder> mon.b addr has 17% avail disk space -- low disk space!
[3:18] <elder> mon.c addr has 17% avail disk space -- low disk space!
[3:18] <dmick> mon logs, maybe?...
[3:18] <elder> So you're right. And it doesn't seem to have to do with the space "reserved" for an rbd device.
[3:19] <elder> No, it's my main disk, it's probably filling up with stuff and I need to go prune.
[3:19] * rustam (~rustam@ has joined #ceph
[3:19] <dmick> right, but I think mon logs are a good pruning candidtate
[3:19] <elder> under teuthology archive?
[3:19] <elder> Oh.
[3:19] <elder> No.
[3:20] <elder> Yes I have over a GB of logs
[3:20] * rustam (~rustam@ Quit (Remote host closed the connection)
[3:20] <elder> of various kinds
[3:21] <elder> OK, is there a good way to prune things without bothering whatever I have running?
[3:22] <elder> I can stop it, it's no big deal, but I'd like to know anyway.
[3:22] <dmick> the usual logrotate method, which I can never remember
[3:22] <dmick> I *think* it's "rename the log file, kill -HUP the daemons, which will reopen the log path, creating a new file"
[3:22] <dmick> but I can never remember for sure
[3:22] <elder> That's OK.
[3:24] <elder> So are the log files just the things under ceph src/out/*.log?
[3:24] <elder> If so that's not what's consuming my space.
[3:24] <dmick> yes
[3:24] <dmick> you may enjoy baobab
[3:25] <dmick> at least that's what I think it's called
[3:26] <dmick> http://www.makeuseof.com/tag/visually-contents-folder-hard-drive-baobab-linux/
[3:26] <elder> I think I have that, or have tried it, and I didn't know how to interpret what it showed me intuitively.
[3:27] <elder> Oh, maybe it was a different tool. I'm scanning now.
[3:28] <elder> OK, now I get it, and it's nice.
[3:29] * jimyeh (~Adium@60-250-129-63.HINET-IP.hinet.net) has joined #ceph
[3:30] <elder> About 40GB worth of built kernels, etc.
[3:30] <elder> I guess I don't clean those up often enough...
[3:31] * coyo (~unf@pool-71-164-242-68.dllstx.fios.verizon.net) has joined #ceph
[3:31] <dmick> it's one of the nicer tools I've used for figuring this out
[3:32] <elder> Many many years ago I had a script "dudiff" that would compare yesterday's comprehensive "du | sort -n" with today's, and distill it down to just a delta summary.
[3:32] <elder> I'd run it nightly.
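A bash reconstruction of the "dudiff" idea elder describes. Snapshots are `du | sort -n` output (size, tab, path); this is an inner join, so paths present in only one snapshot are silently dropped:

```shell
# dudiff: given yesterday's and today's "du | sort -n" snapshots,
# print only entries whose size changed, as "delta<TAB>path",
# smallest delta first.
dudiff() {
    # $1: yesterday's snapshot, $2: today's snapshot
    join -j 2 <(sort -k2 "$1") <(sort -k2 "$2") |
        awk '$2 != $3 { print $3 - $2 "\t" $1 }' |
        sort -n
}
```

Run nightly from cron, saving today's snapshot for tomorrow's comparison.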
[3:33] <elder> OK, now I've doubled my free space to 35% free. That'll do for now.
[3:34] <dmick> and yeah, fwiw, I think that is the log rotation dance: rename, hup, wait for a bit, dispose of renamed files
[3:34] <dmick> if you don't care about the old info I *think* you can just hup
[3:35] <elder> OK, well I've made a mental note of that, but I think I'm OK for now.
[3:35] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[3:40] <elder> ./ceph health
[3:40] <elder> HEALTH_OK
[3:43] <dmick> w00t
[4:21] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[4:30] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[4:38] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[4:42] * wschulze1 (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[4:43] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) Quit (Remote host closed the connection)
[4:46] * dosaboy (~dosaboy@host86-161-203-141.range86-161.btcentralplus.com) has joined #ceph
[4:47] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[4:50] * dosaboy_ (~dosaboy@host86-161-164-218.range86-161.btcentralplus.com) Quit (Read error: Operation timed out)
[5:01] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:05] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Quit: killed (ChanServ (Quit Message Spam is off topic.)))
[5:07] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[5:15] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[5:17] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:29] <nigwil> we're looking at deploying a 30-node Openstack system with a similarly sized Ceph storage cluster, and it was suggested why not run the two together, that is for each server assign cores to run OSDs and use the remaining cores for Openstack, anyone done this? is it a good idea given the network requirements of Ceph?
[5:39] * rustam (~rustam@ has joined #ceph
[5:40] * rustam (~rustam@ Quit (Remote host closed the connection)
[5:44] <SpamapS> nigwil: what if you need a lot more compute than storage?
[5:44] * dwt (~dwt@128-107-239-233.cisco.com) has joined #ceph
[5:45] <nigwil> we could add compute nodes, beyond the initial deployment. I suppose the idea is that Ceph could occupy a subset of the compute nodes
[5:45] <nigwil> The attraction with the idea is that the split between compute and storage could be moved depending on need (not moved quickly but over time)
[5:46] <nigwil> the obvious drawback might be during recovery in that some VMs would be resource starved (somewhat) if they are co-habiting with OSDs receiving replicas
[5:49] * kfox1111 (~kfox@96-41-208-2.dhcp.elbg.wa.charter.com) has joined #ceph
[5:52] <SpamapS> nigwil: the failure profiles of ceph nodes and nova compute nodes are very different
[5:53] <SpamapS> nigwil: seems like a ceph cluster would just need to be RAM/IO/NET powered, where as compute also needs lots and lots of CPU
[5:54] <nigwil> ok, but if a node has both, then could they co-exist?
[5:55] <sage> there are lots of people who deploy that way, but probably not the majority
[5:56] <mikedawson> sage: its still working for me. thanks!
[5:56] <sage> excellent!
[5:58] * wschulze1 (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:01] <nigwil> anything we should do for an "ideal" configuration using this scenario? we're planning on 4 x 10GbE (2 frontend, 2 for cluster) for example
[6:02] <sage> things are a bit simpler if you combine it all into a flat network (instead of explicit front/back). depends on whether you really want/need the isolation or not.
[6:03] <nigwil> ok, we thought the official wisdom was that split was desirable, mainly to isolate replication activity from the client-side
[6:03] <janos> i thought there would be speed benefits to isolating the OSD traffic to a cluster network
[6:04] <nigwil> maybe my networking naivite overlooks some complexity in the split arrangement? it seems splitting is relatively easy to do too
[6:04] <janos> it is easy to do
[6:04] <nigwil> naivete (sp)
[6:04] <janos> heck i do it at home
[6:05] <nigwil> me too :-) (or I am about to now that my box of NICs has arrived at home)
[6:05] <janos> hehe
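[For reference, the front/back split discussed above boils down to two subnet options in ceph.conf. The subnets below are hypothetical placeholders, not values from this conversation:]

```ini
[global]
    ; client-facing traffic (hypothetical subnet)
    public network =
    ; OSD replication/heartbeat traffic (hypothetical subnet)
    cluster network =
```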
[6:07] * Cube1 (~Cube@ has joined #ceph
[6:10] * Cube1 (~Cube@ Quit ()
[6:13] * dwt (~dwt@128-107-239-233.cisco.com) Quit (Quit: Leaving)
[6:14] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[6:19] <nigwil> on a separate topic, is Paxos the perfect answer for consensus? I am wondering if Ceph could support a pluggable consensus implementation so something like Raft could be swapped in. https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raftDraftJan13.pdf
[6:21] <nigwil> I ask because of the two criticisms in that paper: "Unfortunately, Paxos has two significant drawbacks.
[6:21] <nigwil> The first drawback is that Paxos is exceptionally difficult to understand.
[6:21] <nigwil> The full explanation [9] is notoriously opaque;"
[6:22] <lurbs> Are there even an odd number of people who'd get votes as to which algorithm to use?
[6:22] <nigwil> :-) we could have the engines make the decision...
[6:26] * illuminatis (~illuminat@0001adba.user.oftc.net) Quit (Ping timeout: 480 seconds)
[6:29] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[6:32] <pioto> hi; i'm trying out the current chef stuff, and it seems to hang here for me on blank new server: [Tue, 30 Apr 2013 00:31:03 -0400] INFO: Processing ruby_block[get osd-bootstrap keyring] action create (ceph::mon line 79)
[6:32] <pioto> 2013-04-30 00:31:03.487852 7faf7306f700 0 -- :/3844 >> pipe(0x2eb2040 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
[6:33] <pioto> there are a fair number of ceph-mon threads, and a ceph-create-key process
[6:33] <pioto> the create-key process seems to be busy writing INFO messages to stderr
[6:33] <pioto> i wonder if, say, the ubuntu upstart jobs are starting something up too soon?
[6:35] <dmick> there was a problem until recently that create-key would keep trying to authenticate with the mons when it couldn't, because a key was missing
[6:35] <pioto> yeah. it seems like maybe that.
[6:35] <dmick> recently enough that unless you're on the nightly packages you'd probably see that
[6:35] <pioto> lemme nuke and try the 'dev' branch
[6:36] <pioto> instead of 'testing'
[6:36] <dmick> but it arises from a confusion of the install methods
[6:36] <dmick> you trying this on a machine/set that didn't have ceph installed?
[6:38] <pioto> yes
[6:38] <pioto> though now it does
[6:38] <pioto> as part of this process
[6:39] <pioto> i'm nuking /etc/ceph/*, /var/lib/ceph/*/*, and trying again with the dev branch...
[6:39] <pioto> let's see how it goes
[6:40] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[6:40] <dmick> if you get to that place again, the key was the client.admin.keyring in ceph-create-keys (it's a script)
[6:41] <dmick> bad use of 'key' there. "the issue was.."
[6:41] <pioto> ok
[6:41] <pioto> also, nuked /etc/apt/sources.list.d/ceph*
[6:41] <pioto> it seems odd that it will add 3 different files there if you switch branches
[6:41] <pioto> instead of just calling them all ceph.list
[6:41] <mikedawson> dmick, pioto: you need the right caps in /var/lib/ceph/mon/ceph-a/keyring
[6:42] <dmick> pioto: I think that's wget, probably
[6:43] <mikedawson> if 'caps mon = "allow *"' is omitted, ceph-create-keys is angry
[6:43] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[6:44] <pioto> mikedawson: well, the chef recipe should be doing all that for me, no?
[6:44] <dmick> it's that cap, I don't think it goes in that file tho
[6:44] <pioto> the processes i see when things are stuck are:
[6:44] * kfox1111 (~kfox@96-41-208-2.dhcp.elbg.wa.charter.com) Quit (Ping timeout: 480 seconds)
[6:44] <pioto> root 6250 0.4 0.7 43700 7384 ? Ss 00:41 0:00 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-test-libvirt
[6:44] <pioto> root 6792 0.0 0.5 129500 5336 ? Ssl 00:42 0:00 /usr/bin/ceph-mon --cluster=ceph -i ceph-test-libvirt -f
[6:44] <dmick> ah. yes, it's that keyring, but the key is named 'mon.', and may not be the only key
[6:45] <mikedawson> yep - it is under 'mon.'. those processes can also get stuck waiting for quorum
[6:45] <pioto> yeah. you generate a key for that already elsewhere, and store it in the chef environment you use here
[6:45] <dmick> pioto: in your ceph-create-keys file, in the get_key function, do you have if returncode == errno.EPERM or returncode == errno.EACCES:
[6:45] <pioto> along with the fsid
[6:46] <pioto> dmick: lemme see
[6:46] <pioto> ENOENT
[6:47] <dmick> uh, I'm assuming that means that string doesn't appear in the file?
[6:47] <pioto> but, i think that chef didn't upgrade my packages when i switched branches
[6:47] <pioto> yes, that string isn't there
[6:47] <pioto> no EPERM
[6:47] <dmick> ok. so that went in Friday the 19th
[6:47] <pioto> k, let's see
[6:47] <pioto> i'm still on ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
[6:47] <pioto> which i guess is the last testing release
[6:47] <dmick> which would have been v0.60-591-g1a8b30e
[6:47] <dmick> (or later)
[6:48] <mikedawson> pioto: http://tracker.ceph.com/issues/4752
[6:48] <dmick> that's the bug that v0.60-591-g1a8b30e fixed
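[A simplified sketch of the kind of errno classification dmick is asking about. This is not the actual ceph-create-keys code: the function name and the exact handling are hypothetical, illustrating only the point that pioto's older build treats ENOENT as the lone retryable case, while the fixed build also has a branch for EPERM/EACCES so a mis-capped keyring doesn't loop forever:]

```python
import errno

def get_key_outcome(returncode):
    """Classify the exit status of a hypothetical `ceph auth get-or-create`
    attempt. A sketch, not the real ceph-create-keys get_key logic:
    ENOENT means "key not there yet, keep retrying", while EPERM/EACCES
    (the codes the fix added a branch for) are treated as fatal so the
    script can't spin forever against a keyring with the wrong caps.
    """
    if returncode == 0:
        return "ok"
    if returncode == errno.ENOENT:
        return "retry"
    if returncode in (errno.EPERM, errno.EACCES):
        return "fatal"
    return "retry"
```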
[6:48] <dmick> but mikedawson, do you know if the chef installation needs ceph-create-keys to work?
[6:49] <mikedawson> pioto: and 0.60 has some serious monitor issues. if you want to test something new that may actually work, try "wip-mon-compact" it is the best thing since 0.58
[6:50] <pioto> well. i assume all this will be fixed in 0.61 in a few days?
[6:50] <pioto> i just wanted to get a better feel for using chef with ceph
[6:50] <pioto> not doing a production deployment this week
[6:50] <mikedawson> dmick: not sure, but before those fixes, I had success commenting ceph-create-keys out of the init scripts
[6:50] <dmick> pioto: you're using https://github.com/ceph/ceph-cookbooks ?
[6:51] <dmick> mikedawson: were you installing with chef?
[6:51] <mikedawson> pioto: yes, that will land in 0.61
[6:51] <mikedawson> dmick: i was not using chef
[6:51] <dmick> ISTR you had installed with mkcephfs; that will definitely trigger this bug
[6:51] <dmick> or rather
[6:51] <dmick> an older mkcephfs
[6:51] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[6:52] <mikedawson> dmick: yes, right
[6:52] <pioto> dmick: yes, using that
[6:52] <pioto> cookbook
[6:52] <dmick> k. so that *shouldn't* be pioto's problem, necessarily
[6:53] <pioto> k, twiddling my thumbs while things reinstall...
[6:53] <pioto> but i hope that the 'gitbuilder' branch of things will fix this.
[6:54] <dmick> chef recipes/mon.rb contains this: ceph-authtool "$KR" --create-keyring --name=mon. --add-key='#{node["ceph"]["monitor-secret"]}' --cap mon 'allow *'
[6:54] <dmick> which ought to have done that
[6:55] <pioto> yeah. unless something made the mon try to start first, maybe?
[6:55] <pioto> like, the apt install process?
[6:55] <pioto> *shrug*
[6:55] <dmick> anything's possible, but it shouldn't start without warning
[6:55] <pioto> right now, i have these ceph processes, from this most recent run:
[6:55] <pioto> root 8099 0.1 0.6 112628 6928 ? Ssl 00:54 0:00 /usr/bin/ceph-mon --cluster=ceph -i ceph-test-libvirt -f
[6:55] <dmick> at least not until it's got a filestore set up
[6:55] <pioto> root 8100 0.5 0.7 43692 7380 ? Ss 00:54 0:00 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-test-libvirt
[6:55] <pioto> root 8119 0.0 0.5 234036 5696 pts/1 Sl+ 00:54 0:00 ceph auth get-key client.bootstrap-osd
[6:57] <dmick> ok. what's in /var/lib/ceph/mon/ceph-*/keyring
[6:57] <pioto> last few chef-client lines: http://pastebin.com/sjTYVYSD
[6:57] <pioto> let's see what's in keyring files
[6:57] * illuminatis (~illuminat@0001adba.user.oftc.net) has joined #ceph
[6:58] <pioto> $ ls -l /var/lib/ceph/*/*/*keyring*
[6:58] <pioto> -rw------- 1 root root 77 Apr 30 00:54 /var/lib/ceph/mon/ceph-ceph-test-libvirt/keyring
[6:58] <pioto> $ sudo cat /var/lib/ceph/mon/ceph-ceph-test-libvirt/keyring
[6:58] <pioto> [mon.] key = AQAsQH9RmBqkKxAAs8tdOJBFlqJBplL5weBckw== caps mon = "allow *"
[6:58] <pioto> well, that's on 3 lines
[6:58] <pioto> irssi is being "helpful"
[6:58] <dmick> that ought to be enough to let ceph-create-keys work
[6:58] <dmick> (and yes)
[6:59] <dmick> what is ceph-create-keys logging?
[6:59] <pioto> let's see
[6:59] <dmick> (it'll be in the client.admin.log)
[6:59] <dmick> (I think)
[7:00] <pioto> i don't see such a log at all in /var/log/ceph
[7:00] <pioto> but, in the mon log, i see: 2013-04-30 01:00:00.514938 7fba0eecb700 1 mon.ceph-test-libvirt@-1(probing) e0 discarding message auth(proto 0 30 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[7:00] <pioto> a lot of lines like that
[7:01] <pioto> this is with... $ ceph -v
[7:01] <pioto> ceph version 0.56.4-56-ga8e7e9d (a8e7e9df61a7229d9e2b4b4dedc68b5c1bf15c38)
[7:01] <pioto> wait. huh?
[7:01] <pioto> why aren't we on 0.60....
[7:01] <dmick> hm
[7:01] <pioto> hrm. "branch": "dev" got me:
[7:01] <pioto> $ cat /etc/apt/sources.list.d/ceph-gitbuilder.list
[7:01] <pioto> deb http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/bobtail precise main
[7:01] <pioto> that seems wrong
[7:02] <dmick> indeed
[7:02] <pioto> lemme see... maybe i need to set 'version: next" or something?
[7:02] <dmick> is this in the chef?
[7:03] <pioto> yes
[7:03] <dmick> oh attributes/repo.rb
[7:03] <dmick> (I really don't know the chef layout)
[7:03] <pioto> i don't either
[7:03] <pioto> lemme see
[7:03] <dmick> I would try branch dev
[7:04] <pioto> ah, yhep
[7:04] <dmick> and version master
[7:04] <pioto> k
[7:04] * jimyeh1 (~Adium@42-72-90-192.dynamic-ip.hinet.net) has joined #ceph
[7:09] * jimyeh (~Adium@60-250-129-63.HINET-IP.hinet.net) Quit (Ping timeout: 480 seconds)
[7:12] <pioto> ok, well, now i have the right version: 0.60-815-g7d4c0dc
[7:12] <pioto> and still, only logging is for mon: 2013-04-30 01:12:30.414579 7f1b881d5700 1 mon.ceph-test-libvirt@-1(probing) e0 discarding message auth(proto 0 30 bytes epoch 0) v1 and sending client elsewhere
[7:13] <pioto> that seems to log, like, twice every 5 seconds
[7:15] <dmick> ceph-create-keys is looping
[7:15] <pioto> strace on the ceph-create-keys process says it's printing: write(2, "INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'\n", 60) = 60
[7:15] <pioto> yeah
[7:15] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[7:15] <dmick> ah. so you have a different issue
[7:16] <dmick> just one mon?
[7:17] <dmick> find his asok file (in /var/run/ceph......asok)
[7:17] <pioto> yes, just 1 mon, and 1 mds and 2 osds
[7:18] <pioto> on the same host (small scale test)
[7:18] <pioto> let's see
[7:18] <pioto> $ ls -ld /var/run/ceph/ceph-mon.ceph-test-libvirt.asok
[7:18] <pioto> srwxr-xr-x 1 root root 0 Apr 30 01:11 /var/run/ceph/ceph-mon.ceph-test-libvirt.asok
[7:19] <pioto> i guess i need to tell it something with ceph-mon --admin-socket /var/run/... --tell ?
[7:19] <pioto> or, something like that...
[7:20] <dmick> ceph --admin-daemon <that-path> mon_status
[7:22] <pioto> http://pastebin.com/2qtVJXbv
[7:22] <dmick> is that the local machine's IP addr?
[7:23] <pioto> yes
[7:23] <pioto> its public one
[7:23] <pioto> it has a cluster address too
[7:23] <pioto> but that won't matter for mon i guess
[7:23] <dmick> it shouldn't
[7:23] <dmick> but it's like it can't talk to itself
[7:23] <pioto> yeah
[7:24] <pioto> i haven't set up any firewall...
[7:24] <dmick> can you maybe strace it and see what it's trying to send where?
[7:24] <pioto> let's see
[7:24] <pioto> the part is odd
[7:24] <dmick> how did you configure its address etc.
[7:24] <pioto> in the mons []
[7:24] <pioto> let's see
[7:24] <dmick> (that may be because it hasn't answered itself)
[7:24] <pioto> the generated ceph.conf
[7:24] <pioto> http://pastebin.com/LtcMWtYK
[7:25] <pioto> lemme get the chef "environment"
[7:25] <pioto> http://pastebin.com/1PawpY8N
[7:25] <dmick> does ceph-test-libvirt.home.pioto.org work for ping or dig or stuff?
[7:25] <pioto> it ought to... confirming
[7:26] <pioto> yep
[7:26] <dmick> what about dig -x on the ipaddr?
[7:26] <pioto> yep
[7:26] <dmick> (or whtaever....the reverse lookup, I mean)
[7:26] <pioto> dns seems good
[7:27] <dmick> running out of ideas. maybe stop the ceph-mon proc, crank up mon debug, and start it agian
[7:27] <pioto> no firewall at all. all ALLOW policies, no other rules
[7:27] <dmick> debug mon = 20 or something
[7:27] <dmick> I don't know why it's in that state
[7:28] <pioto> let's see. still probing...
[7:28] <pioto> lemme paste some debug logs
[7:29] <pioto> http://pastebin.com/8tTAbtzU
[7:29] <pioto> that's just a screenful
[7:30] <dmick> yeah, afraid I don't know, and I'm out of time
[7:31] <dmick> if you feel up to it, summarize some of this to ceph-devel and maybe someone can help; if I see that tomorrow I'll bring it up with one of our monitor guys
[7:31] <pioto> ok, well, thanks anyways
[7:31] <dmick> but one of them is in Lisbon and will be up way earlier than me
[7:31] <pioto> it's also way too late for me to be up, but i'll try to get this tomorrow
[7:31] <elder> Me too.
[7:31] <pioto> it seems like it could be an edge case with a small deploy like this
[7:31] <pioto> and wouldn't show in a "real" cluster
[7:32] <dmick> perhaps, but I've run single-mon clusters a lot and never hit this. seems odd.
[7:33] <dmick> seems odd that mon host is in [global]
[7:33] <dmick> I would have expected to see a [mon.<name>] section
[7:33] <dmick> with a mon addr =
[7:33] <pioto> yeah.
[7:33] <dmick> maybe the chef does something "smart"
[7:34] <dmick> but since you can add another mon at any time, it seems not smart
[7:34] <pioto> well, maybe it nukes that line later when it builds the OSDs into the config?
[7:35] <pioto> anyways, it's late. good night, thanks for the help so far
[7:35] <dmick> in fact
[7:35] <pioto> if i get a chance to come up with a summary for the list, i'll do that tomorrow
[7:35] <dmick> why isn't there a [mon] section; the template seems to call for one
[7:35] <dmick> that could be key
[7:35] <dmick> <% if (! node['ceph']['config']['mon'].nil?) -%>
[7:35] <dmick> [mon]
[7:35] <dmick> <% node['ceph']['config']['mon'].each do |k, v| %>
[7:35] <dmick> <%= k %> = <%= v %>
[7:35] <dmick> <% end %>
[7:35] <dmick> <% end -%>
[7:35] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has joined #ceph
[7:35] <pioto> hrmmmm
[7:35] <dmick> I'd dig there first. (maybe mon host is not relevant)
[7:36] <pioto> well, no
[7:36] <pioto> i think that's just for optional mon settings
[7:36] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has left #ceph
[7:36] <dmick> could be
[7:36] <dmick> there's also 'ceph''config-sections'
[7:37] <dmick> hm
[7:37] <dmick> * Build an initial bootstrap monmap from the config. This will
[7:37] <dmick> * try, in this order:
[7:37] <dmick> *
[7:37] <dmick> * 1 monmap -- an explicitly provided monmap
[7:37] <dmick> * 2 mon_host -- list of monitors
[7:37] <dmick> * 3 config [mon.*] sections, and 'mon addr' fields in those sections
[7:37] * bergerx_ (~bekir@ has joined #ceph
[7:37] <dmick> would seem to indicate that mon_host should be doing the job. OK, back to I dunno
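[Option 3 in the comment dmick pasted corresponds to a per-daemon section like the one below. The monitor name and address are hypothetical placeholders, not taken from pioto's pastebin:]

```ini
[mon.a]
    host = ceph-test-libvirt
    mon addr =
```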
[7:47] * pconnelly (~pconnelly@71-93-233-229.dhcp.mdfd.or.charter.com) Quit (Quit: pconnelly)
[8:15] * capri (~capri@ has joined #ceph
[8:25] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[8:32] * Vjarjadian (~IceChat77@ Quit (Quit: OUCH!!!)
[8:33] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[8:44] * sileht (~sileht@2a01:6600:8081:d6ff::feed:cafe) has joined #ceph
[8:59] * jimyeh1 (~Adium@42-72-90-192.dynamic-ip.hinet.net) Quit (Read error: Connection reset by peer)
[8:59] * jimyeh (~Adium@60-250-129-63.HINET-IP.hinet.net) has joined #ceph
[9:17] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:20] * eschnou (~eschnou@ has joined #ceph
[9:22] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[9:27] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:30] * ScOut3R (~ScOut3R@ has joined #ceph
[9:31] * leseb (~Adium@ has joined #ceph
[9:38] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[9:40] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[9:46] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:55] * zoi (~oftc-webi@d24-141-198-231.home.cgocable.net) has joined #ceph
[9:55] * zoi (~oftc-webi@d24-141-198-231.home.cgocable.net) has left #ceph
[9:57] * l0nk (~alex@ has joined #ceph
[10:01] * loicd (~loic@magenta.dachary.org) has joined #ceph
[10:07] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[10:11] * LeaChim (~LeaChim@ has joined #ceph
[10:13] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:18] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[10:19] * BManojlovic (~steki@ has joined #ceph
[10:30] * dxd828 (~dxd828@ Quit (Quit: Leaving)
[10:31] * dxd828 (~dxd828@ has joined #ceph
[10:46] * Havre (~Havre@2a01:e35:8a2c:b230:80ab:a678:e1d9:96ac) has joined #ceph
[10:53] * v0id (~v0@91-115-228-70.adsl.highway.telekom.at) has joined #ceph
[10:55] * loicd (~loic@ has joined #ceph
[11:00] * vo1d (~v0@91-115-229-155.adsl.highway.telekom.at) Quit (Ping timeout: 480 seconds)
[11:14] * jimyeh (~Adium@60-250-129-63.HINET-IP.hinet.net) Quit (Quit: Leaving.)
[11:21] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[11:24] <benner> is it normal that in HEALTH_WARN status (recovery after new osd added) i'm getting error http://p.defau.lt/?v7NdYREzDnyN8GnVLusbVg
[11:24] <benner> ?
[11:24] <benner> ceph version 0.56.4
[11:25] * rustam (~rustam@ has joined #ceph
[11:25] <tnt> benner: nope
[11:29] * rustam (~rustam@ Quit (Remote host closed the connection)
[11:35] <benner> it seems that problem not in recovering process. i waited until HEALTH_OK state but problem persists. after io error i see (ceph -w) this error: http://p.defau.lt/?xA5suOwJ0xKNFmyT96l4pQ
[11:37] <tnt> what's your kernel version ?
[11:38] <benner> 3.2.0-40-generic on client
[11:38] <benner> 3.5.0-27-generic on osds
[11:39] <tnt> you need something more recent than 3.2 for rbd kernel client.
[11:40] <benner> actually it's strange. i was sure the kernels were the same because the whole setup is using 12.04.2 ubuntu (and all is up to date). doing investigation
[12:01] * goodbytes (~kennetho@2a00:9080:f000::58) has joined #ceph
[12:02] <goodbytes> Hi, can anybody tell me how to start e.g. an OSD manually through Upstart on Ubuntu?
[12:05] <tnt> service start ceph osd.1
[12:05] <goodbytes> I am running Ceph 0.56.4
[12:06] * goodbytes (~kennetho@2a00:9080:f000::58) has left #ceph
[12:06] * goodbytes (~kennetho@2a00:9080:f000::58) has joined #ceph
[12:07] <goodbytes> tnt, you mean "service start ceph osd.1" ? I tried that but it returns no output and it doesn't start the osd
[12:07] <goodbytes> *correction "service ceph start osd.1"
[12:18] * capri_on (~capri@ has joined #ceph
[12:20] * jefferai (~quassel@corkblock.jefferai.org) has joined #ceph
[12:23] * illuminatis (~illuminat@0001adba.user.oftc.net) Quit (Ping timeout: 480 seconds)
[12:23] * capri (~capri@ Quit (Ping timeout: 480 seconds)
[12:24] * ferai (~quassel@corkblock.jefferai.org) Quit (Ping timeout: 480 seconds)
[12:25] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[12:25] * diegows (~diegows@ has joined #ceph
[12:25] * loicd (~loic@ has joined #ceph
[12:31] * nolan (~nolan@2001:470:1:41:20c:29ff:fe9a:60be) Quit (Ping timeout: 480 seconds)
[12:35] * illuminatis (~illuminat@0001adba.user.oftc.net) has joined #ceph
[12:36] * nolan (~nolan@2001:470:1:41:20c:29ff:fe9a:60be) has joined #ceph
[12:50] <tnt> goodbytes: check your ceph.conf
[12:50] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:50] <tnt> goodbytes: if [osd.1] is defined and has a host= entry equal to the hostname of the machine, it should start it.
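[tnt's point as a ceph.conf fragment: the `host` value must match the machine's hostname (compare with `uname -n`, as noted later in this log). The hostname below is a hypothetical example:]

```ini
[osd.1]
    ; must equal the unqualified hostname of the machine
    ; that should start this daemon
    host = storage-node-1
```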
[12:54] * john_barbee_ (~jbarbee@c-98-226-73-253.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[13:03] * Sargun_ (~sargun@208-106-98-2.static.sonic.net) has joined #ceph
[13:05] * Sargun (~sargun@208-106-98-2.static.sonic.net) Quit (Ping timeout: 480 seconds)
[13:23] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Quit: Leaving)
[13:23] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[13:32] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:41] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[13:57] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:31] <goodbytes> tnt, thanks, it has, however the hostname in ceph.conf is written as the fqdn of the host. I will try to change it to the short representation instead
[14:41] <fghaas> goodbytes: compare with "uname -n"
[14:41] <fghaas> depends on your distro whether that is an fqdn, or an unqualified hostname
[14:43] * atb (~oftc-webi@d24-141-198-231.home.cgocable.net) has joined #ceph
[14:43] <tnt> mmm, using my rbd_bench utility I get 200 MB/s (R) and 100 MB/s (W), but inside the VM I'm still only at 71 MB/s / 35 MB/s, so ~3x slower. still not great.
[14:46] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:49] * atb (~oftc-webi@d24-141-198-231.home.cgocable.net) Quit (Quit: Page closed)
[14:54] * portante|ltp (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Ping timeout: 480 seconds)
[14:55] * atb (~chatzilla@d24-141-198-231.home.cgocable.net) has joined #ceph
[14:57] <goodbytes> fghaas: ok thanks. uname -n (net nodename) returns the unqualified hostname. Changing the hostnames in ceph.conf to unqualified names (having the right dns-search domain anyway) resolved my problem. Thank you all :)
[14:59] <scuttlemonkey> atb: what seems to be the confusion?
[15:02] <atb> just with the interaction between rbd and osd's and /
[15:03] <atb> do you initialize every disk as an osd and rbd then distributes across? or are you forced to have one for os?
[15:05] <scuttlemonkey> we recommend 1 osd per disk, yes
[15:05] <scuttlemonkey> but ultimately an osd is just a process that sits on top of a filesystem
[15:06] <scuttlemonkey> in the case of the 5-minute quickstart it just uses a directory structure to run a whole cluster on one machine
[15:07] <atb> and the PG's, are they like partitions?
[15:07] <scuttlemonkey> ah, here is my image...took me a min to find it
[15:07] <scuttlemonkey> http://db.tt/blR9UUsA
[15:08] <scuttlemonkey> PGs are small, logical chunks that your data is divided into for easy distribution across the cluster
[15:09] <scuttlemonkey> http://ceph.com/docs/master/rados/operations/placement-groups/
[15:11] <scuttlemonkey> or using the same slides: http://db.tt/NI2K9cB8
[15:12] <goodbytes> does anybody know of anyone who uses ceph rbd with openstack in production clouds? e.g. for public clouds?
[15:13] <scuttlemonkey> goodbytes: not sure about a "public cloud"...but there were a few people who were talking about ceph in their production openstack at the last ODS
[15:13] <scuttlemonkey> Bloomberg is the one that leaps to mind, but that is a private cloud
[15:13] <atb> so rbd is like lvm?
[15:14] * Vjarjadian (~IceChat77@ has joined #ceph
[15:14] <tnt> kind of. rbd creates block devices (i.e. disks) over a rados cluster.
[15:15] * BillK (~BillK@58-7-104-61.dyn.iinet.net.au) Quit (Quit: Leaving)
[15:16] <tnt> wrt rbd: I'm wondering if anyone ever thought of striping data over two objects. Currently each 4M chunk is mapped to one object. But imagine you mapped 4M chunks to 2 different objects in a striped fashion, where 0->64k, 128k->192k, ... is obj0 and 64k->128k, 192k->256k, ... is obj1.
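[tnt's proposed layout is easy to state as an offset calculation. The sketch below illustrates his example (64k stripe unit, round-robined across 2 objects within one 4M chunk); it describes his hypothetical scheme, not what rbd actually did at the time:]

```python
def stripe_map(offset, stripe_unit=64 * 1024, stripe_count=2):
    """Map a byte offset within one 4M chunk to (object index,
    offset within that object) under the striping scheme described
    above: consecutive 64k units alternate between obj0 and obj1."""
    unit = offset // stripe_unit              # which stripe unit this byte is in
    obj = unit % stripe_count                 # round-robin across the objects
    obj_offset = (unit // stripe_count) * stripe_unit + offset % stripe_unit
    return obj, obj_offset

# 0->64k lands in obj0, 64k->128k in obj1, 128k->192k back in obj0, ...
```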
[15:16] <goodbytes> scuttlemonkey: ok, I believe ceph rbd could be a viable alternative to iscsi storage as back-end storage for virtual machines. However my main concern is price/performance
[15:16] * tnt replaced an iscsi NAS with RBD last year.
[15:17] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[15:17] <goodbytes> I like the flexibility that ceph and storage clusters in general provides. But I'm not yet sure if the extra flexibility is worth the cost of having double the nodes, versus local storage for virtual machines.
[15:18] <scuttlemonkey> tnt: that would be a good question for joshd when he gets in
[15:18] <tnt> goodbytes: well, the big plus vs local storage is migration.
[15:18] <goodbytes> tnt, agree. Live migration is a huge plus, that may just be worth the added investment in hardware
[15:18] <fghaas> goodbytes: if you don't give a flip about availability, then your reasoning has merit. but I think. somehow, that you do give a flip :)
[15:18] * jtangwk (~Adium@2001:770:10:500:49cf:4204:1351:99c6) Quit (Quit: Leaving.)
[15:19] <atb> does cinder initialize ceph devices as fake iscsi?
[15:19] * jtangwk (~Adium@2001:770:10:500:39f1:f2be:6be0:794) has joined #ceph
[15:19] <fghaas> atb: nope
[15:19] <fghaas> cinder talks to nova, which instructs libvirt/kvm to talk to rbd volumes directly
[15:19] <fghaas> using the qemu-kvm rbd storage driver
[15:20] <fghaas> which, of course, requires that your libvirt and qemu-kvm all come with rbd support
[15:20] <fghaas> (which can be a pain on rhel/centos)
[15:21] <mikedawson> goodbytes: I'm in beta as a public cloud using ceph
[15:22] <atb> what about windows guests?
[15:22] <goodbytes> mikedawson, interesting :) How does your cluster currently look hardware wise?
[15:22] <goodbytes> and do you find the performance to be reasonable?
[15:24] <goodbytes> fghaas, availability wise, I would consider the cluster as a single point of failure as well. Do you know if it is possible to use several clusters? in case a whole cluster would fail
[15:24] <atb> they wouldn't be aware that their backend was persistent, would they?
[15:24] <fghaas> atb: same thing, qemu-kvm just presents windows with a block device
[15:25] <fghaas> and of course they would, that's what distinguishes cinder volumes from ephemeral disk storage
[15:26] <fghaas> goodbytes: fair enough, run several
[15:26] <fghaas> if you must :)
[15:27] <goodbytes> fghaas, just considering the possibilities ;) do you know if there would be any administrative headaches with running several ceph clusters in one openstack "cloud"?
[15:28] <mikedawson> goodbytes: my cluster is ~200 TB right now will grow significantly as we on-ramp workload. Performance is the tough part... ceph can be quite performant for the right workloads, but you have to know a bit about architecture and tuning to get it right.
[15:28] <mikedawson> goodbytes: or hire fghass!
[15:28] <tnt> goodbytes: you can do live migrate with an iSCSI NAS as well, but then you need at least two of them for redundancy and that becomes complicated as well.
[15:28] <fghaas> goodbytes: are you looking at rbd as backend storage for cinder, glance, or both?
[15:28] <mikedawson> fghaas, i mean
[15:28] <tnt> goodbytes: what virtualization solution do you use btw ?
[15:29] <goodbytes> fghaas, both
[15:29] <goodbytes> tnt, openstack kvm
[15:30] <fghaas> goodbytes: yeah, if you have several cinder-volume and/or glance-api instances, there's nothing that keeps you from pointing them at different ceph clusters
[15:30] * gmason (~gmason@hpcc-fw.net.msu.edu) has joined #ceph
[15:30] <fghaas> however, of course, cinder volumes served from a certain ceph cluster are only available to nova nodes configured to talk to that same cluster
[15:31] <goodbytes> mikedawson, nice. My current setup is still in the testing phase. I will be using 2U, dual socket servers with 12x3.5" disk bays for my final setup (my intention so far). My setup currently runs on 4 x Fujitsu BX920 blade servers. One disk per OSD and OSD journal in tmpfs (!)
[15:31] <fghaas> same limitations apply for snapshot-to-image copies between cinder and glance
[15:31] <goodbytes> fghaas, ah I see. Makes good sense. Thanks
[15:33] <goodbytes> mikedawson, network wise I am currently running separate 10GbE networks for data access and replication
[15:33] * goodbytes (~kennetho@2a00:9080:f000::58) has left #ceph
[15:33] * goodbytes (~kennetho@2a00:9080:f000::58) has joined #ceph
[15:37] <mikedawson> goodbytes: not so sure about tmpfs. Read the Dreamhost case studies for how they run their cluster. They have used 12 drive hosts, and they put their journals on a partition on each drive
[15:37] * bergerx_ (~bekir@ Quit (Remote host closed the connection)
[15:38] * bergerx_ (~bekir@ has joined #ceph
[15:38] <goodbytes> mikedawson, i'm not too sure either. It means I have to be really really careful power-wise. However I do have access to 4 individual power sources, with individual generators on each.
[15:38] <mikedawson> goodbytes: and fghaas has some opinions about journals as well http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals
[15:40] * Vjarjadian (~IceChat77@ Quit (Quit: A fine is a tax for doing wrong. A tax is a fine for doing well)
[15:41] <goodbytes> mikedawson, great, will read. However a big concern again is performance. To take advantage of my 10 GbE network I need approx. 12 disks (OSDs) per host, but I still want the advantage of fast writes that either tmpfs (ram) or SSDs provide. It will however be problematic and *costly* having 12 HDDs and 12 SSDs in one machine.
[15:43] <fghaas> no, you want 2 fast (high-bandwidth) SSDs plus 10 spinners in that chassis
[15:43] <fghaas> put 5 journals on each SSD
[15:43] <fghaas> filestores on the spinners
[15:43] <mikedawson> goodbytes: if you are going to use ssd, just divide throughput of the ssds by throughput of your spinners to get a ratio
[15:44] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[15:45] <nhm> goodbytes: I have a SC847a chassis that I do 24 spinning disks and 8 SSDs in. One solution for 2U servers might be to use 12 spinning disks in the front bays and 2 fast/reliable SSDs in the internal bays partition for OS and journals. It's a bit of compromise though since you have 6 journals per SSD and OS too.
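[mikedawson's rule of thumb above is simple arithmetic. The sketch below makes it explicit; the throughput figures in the example are made-up illustrations, not measurements from this discussion:]

```python
def journals_per_ssd(ssd_write_mbps, spinner_write_mbps):
    """Divide the sequential write throughput of the journal SSD by
    that of one spinner: roughly that many OSD journals can share the
    SSD before it becomes the bottleneck."""
    return ssd_write_mbps // spinner_write_mbps

# e.g. a 500 MB/s SSD in front of 100 MB/s spinners supports ~5
# journals, consistent with the 2 SSD + 10 spinner chassis suggested above.
```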
[15:46] <Kdecherf> fghaas: and making 10 osd in the same host?
[15:46] <fghaas> Kdecherf: yeah, kinda pushing it but doable
[15:47] <goodbytes> mikedawson, of course. I initially dropped that idea in my testing due to poor performing ssds acting as bottlenecks (4 disks outperforming one ssd)
[15:47] * rzerres (~ralf@bermuda.daywalker-studios.de) has joined #ceph
[15:47] * rustam (~rustam@ has joined #ceph
[15:47] <Kdecherf> fghaas: interesting
[15:48] <rzerres> i try to get rid of unfound objects in my cluster.
[15:49] * rustam (~rustam@ Quit (Remote host closed the connection)
[15:49] <rzerres> since this occurred from a stupid reconfiguration of my osd's the data are already gone.
[15:50] <goodbytes> mikedawson, so I take it you are already using your cluster in production? Do you perform ongoing performance measurements?
[15:51] <rzerres> the metadata still exists, and the documented command "ceph pg {pg-id} mark_unfound_lost revert" can't go through
[15:52] <rzerres> stdout: pg has 3857 objects but we haven't probed all sources, not marking lost
[15:52] <mikedawson> goodbytes: we're running non-critical workload as a beta. Yes, we monitor performance, but we're building out a more complete solution
[15:52] <rzerres> has anybody any clue to force the deletion of the oid's?
[15:52] <tnt> rzerres: what if you say the whole osd is lost ?
[15:53] <rzerres> i have moved 2 osd's from physical node 1 to physical node2. Replicate was set to a value of 2
[15:54] <rzerres> i did change the underlying fs of the osd's in charge. So data were overwritten.
[15:55] <rzerres> I'm not interested in that data. I just want to clean up the cluster .... so
[15:55] <tnt> if you damaged the data of an osd, you should probably just remove that osd all together.
[15:57] <jmlowe> goodbytes: I run a production cluster backing 78 vm's, I just do libvirt and manage things by hand
[15:57] <rzerres> tnt: i did remove the osd's and reintroduced them afterwards.
[15:58] <goodbytes> jmlowe, how large is your cluster? hosts, osds?
[15:58] * jskinner (~jskinner@ has joined #ceph
[15:59] <jmlowe> 8 vm hosts, 4 osd hosts, 18 osd's 82874 GB avail
[16:01] <jmlowe> http://software.xsede.org http:/spxx.org are two of the vm's I run
[16:03] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[16:04] <goodbytes> jmlowe, ok, how many disks do you run per osd?
[16:09] * kfox1111 (~kfox@96-41-208-2.dhcp.elbg.wa.charter.com) has joined #ceph
[16:10] <jmlowe> I recycled some older hp msa 60 enclosures and bought some new d2600 enclosures, 4x1TB or 3TB drives in hardware raid5 with xfs
[16:12] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Ping timeout: 480 seconds)
[16:12] <LeaChim> Hmm. All my MDSs have begun crashing on startup, when they try to initiate recovery. This is not good.. Anyone have any ideas on how I might recover from this?
[16:17] <goodbytes> nhm, how is your SC847a configured network-wise?
[16:18] <nhm> goodbytes: on-board 1G for management and a bonded 10G to a single client for testing.
[16:19] <goodbytes> Do you see network bandwidth as a bottleneck in your testing?
[16:19] <goodbytes> at least replication between osd's doesn't need to leave the server :)
[16:19] <nhm> goodbytes: I can saturate it depending on the situation at around 2GB/s, but only with no replication.
[16:20] <goodbytes> nhm, sweet jebus, thats fast. Any considerations of what you will do when/if adding more hosts, network wise
[16:21] * aliguori (~anthony@ has joined #ceph
[16:22] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[16:22] <goodbytes> the SC847a chassis seems pretty sweet. What kind of HBA's or Controllers do you use?
[16:22] <nhm> goodbytes: That machine is sitting in my basement and unfortunately probably won't ever be part of a bigger cluster. One of our customers is looking at using the system design as the basis for a cluster they are building though, so it may live on in some form.
[16:22] <nhm> goodbytes: Currently I've got 4 LSI SAS9207-8i controllers in it.
[16:23] <nhm> goodbytes: 6 spinnners and 2 SSDs attached to each.
[16:23] * dragonfly (dragonfly@ has joined #ceph
[16:24] * dragonfly (dragonfly@ Quit ()
[16:25] <goodbytes> oh are you the mark nelson who wrote the ceph blog post about performance tuning?
[16:25] <darkfaded> so did you find out due to the number controllers/disks or via the basement comment?
[16:25] <goodbytes> I like that the set-up is very thought through, and with no bottlenecks in terms of SAS.
[16:26] <nhm> goodbytes: yes :)
[16:28] <nhm> goodbytes: I'm pretty anti-expander at this point. I've seen a couple of systems with expanders that do well, but a lot of the systems I've tested with them seem to have strange performance limitations that ceph seems to bring out.
[16:29] <kylehutson> I posted this to the ceph-users list yesterday and still no responses. Any help here? http://www.mail-archive.com/ceph-users@lists.ceph.com/msg01028.html
[16:29] * MK_FG (~MK_FG@00018720.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:30] <goodbytes> nhm, nice to "meet" you. I am trying to look at possible bottlenecks for configuring my first cluster. Currently have one cluster made of four blade servers, before I settle on a more suitable 2U or 4U chassis
[16:31] * MK_FG (~MK_FG@00018720.user.oftc.net) has joined #ceph
[16:31] * siXy (~siXy@ has joined #ceph
[16:32] <siXy> hi
[16:32] <nhm> goodbytes: nice to meet you too. :) I like the 2U and 4U supermicro boxes. Sanmina makes a really interesting 4U chassis with 2 nodes and 60 drives which a couple of system integrators are using now too.
[16:32] <nhm> The sanmina one is kind of hard to get enough CPU in for that many disks right now, but it's a beast and actually performs pretty well despite using expanders.
[16:32] * BillK (~BillK@58-7-104-61.dyn.iinet.net.au) has joined #ceph
[16:32] <jmlowe> 60 drives in a 4U? wow
[16:33] <siXy> is there a limit to how many files I should expect to store on a single OSD?
[16:33] <nhm> jmlowe: yeah, and 2 nodes too. 30 drives per node.
[16:34] <jmlowe> if that ever finds its way to one of the tier 1 vendors I'll have to get a couple of them
[16:34] <nhm> HP's SL4540 seems to be a relatively good chassis too, especially when configured with multiple nodes. The Dell C8000 also looks pretty interesting in certain configurations.
[16:34] <sstan> what are the bottlenecks on your respective systems/
[16:35] <goodbytes> nhm, how are the drives divided? physically or by SAS zoning?
[16:35] <nhm> jmlowe: Yeah, I saw that AEON is selling them now and so is Warp Mechanics, but I'm not sure if any of the tier 1 vendors are.
[16:36] * capri_on (~capri@ Quit (Quit: Verlassend)
[16:36] <nhm> goodbytes: not sure. I only had a brief amount of time to do testing on one for a customer that was in a single-node 60-disk configuration.
[16:36] <nhm> never even saw the hardware, was all remote.
[16:39] <goodbytes> ok, I have a Supermicro SBB (Storage Bridge Bay) that works by expanders. Both nodes have access to all disks. I believe it was built with intention for iSCSI failover solution, e.g. like Sun Unified Storage
[16:39] <goodbytes> or Nexenta,
[16:40] <goodbytes> I ended up pulling the one node out of the box, storing it on the shelf as a spare part :(
[16:41] <nhm> goodbytes: Never played with one of those, sounds complicated. :)
[16:42] <nhm> goodbytes: one of the reasons I came to Inktank to work on ceph was because of my dislike of complicated hardware solutions.
[16:43] <goodbytes> the whole idea of using simple commodity hardware for enterprise solutions is very tempting and seems to be getting more widespread, with ceph and virtualization
[16:43] <nhm> So now I work on complicated software. ;) But at least that can be more easily diagnosed and fixed.
[16:43] <goodbytes> haha
[16:43] <nhm> easily is probably a relative term...
[16:44] <goodbytes> in general i try to keep away from supermicro. I have several supermicro servers, including several of their blade chassis'.
[16:44] * atb (~chatzilla@d24-141-198-231.home.cgocable.net) Quit (Max SendQ exceeded)
[16:44] <goodbytes> was basically a "sun" shop, before oracle bought them.
[16:44] <nhm> goodbytes: we had a bunch of sun gear at my last job. :(
[16:45] * kfox1111 (~kfox@96-41-208-2.dhcp.elbg.wa.charter.com) Quit (Ping timeout: 480 seconds)
[16:45] <nhm> goodbytes: Until Oracle changed the support contract from under us.
[16:45] <janos> wait what, oracle did that? no way
[16:45] <janos> </s>
[16:46] <nhm> goodbytes: btw, for the next round of testing I'm going to try and get ZFS on linux results in.
[16:47] <goodbytes> sweet :) any ideas on the practicality of using different filesystems among different OSDs? e.g. some XFS, BTRFS and ZFS
[16:48] <nhm> goodbytes: some people were proposing doing it to avoid failure due to FS bugs.
[16:49] <janos> i do it at home
[16:49] <janos> had one machine that didn't like to scan btrfs for some reason
[16:49] <jmlowe> whatever you do don't use btrfs with a linux kernel < 3.8, you will lose data especially with rbd
[16:49] <janos> so i have half and half btrfs/xfs
[16:49] <nhm> goodbytes: It works fine, we have people that transition to different filesystems by rotating OSDs in and out of the cluster slowly over time and reformat them with whatever FS they want.
[16:50] <nhm> jmlowe: Did something get fixed in 3.8?
[16:51] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:52] * vata (~vata@2607:fad8:4:6:221:5aff:fe2a:d1dd) has joined #ceph
[16:52] <jmlowe> nhm: the bug where if you write to the end of a sparse file then quickly write with a lower offset the file will be truncated to the end of the second write
[16:53] <jmlowe> nhm: checksums won't catch it they are calculated for the truncated file size
[16:53] <nhm> jmlowe: good to know!
[16:53] <jmlowe> nhm: only happens some times under load
[16:54] <jmlowe> nhm: as far as I can tell btrfs has always had this bug
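(The failure jmlowe describes is a write-ordering pattern, not anything ceph-specific. A hedged Python sketch of the triggering I/O sequence, run against a local file — on a correct filesystem the second, lower-offset write must not shrink the file; the offsets and sizes here are arbitrary illustration values:)

```python
import os
import tempfile

def sparse_then_lower_write(path, high_off=1 << 20, low_off=4096, chunk=b"x" * 512):
    """Write at the end of a sparse file, then immediately write at a lower
    offset -- the pattern reported to truncate files on btrfs with pre-3.8
    kernels under load. Returns the resulting file size, which should be
    high_off + len(chunk) on a correct filesystem."""
    with open(path, "wb") as f:
        f.seek(high_off)
        f.write(chunk)   # extends the sparse file to high_off + len(chunk)
        f.seek(low_off)
        f.write(chunk)   # must NOT truncate the file to low_off + len(chunk)
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as d:
    size = sparse_then_lower_write(os.path.join(d, "sparse.bin"))
    print(size)  # 1049088, i.e. (1 << 20) + 512, when the filesystem is correct
```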
[16:55] <goodbytes> jmlowe, thank you for the heads up
[16:59] <rzerres> any plans to build raring images? Like to test 3.8 with ceph
[16:59] * kfox1111 (~kfox@96-41-208-2.dhcp.elbg.wa.charter.com) has joined #ceph
[17:00] <jmlowe> oh, fyi, if anybody has an emulex card that takes the be2net driver, the be2net driver is broken in 3.8 and won't post the card
[17:01] * jtangwk (~Adium@2001:770:10:500:39f1:f2be:6be0:794) Quit (Quit: Leaving.)
[17:02] * jtangwk (~Adium@2001:770:10:500:39f1:f2be:6be0:794) has joined #ceph
[17:04] <goodbytes> typically, all the machines I have for testing ceph, run emulex nics (be2net)
[17:06] <jmlowe> it may not be all the cards, but the hp550sfp nic's I have are broken with the raring kernel
[17:07] <jmlowe> http://lkml.indiana.edu/hypermail/linux/kernel/1303.1/00140.html
[17:08] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[17:09] * atb (~chatzilla@d24-141-198-231.home.cgocable.net) has joined #ceph
[17:10] * bergerx_ (~bekir@ has left #ceph
[17:10] * tkensiski (~tkensiski@2600:1010:b020:1029:ad7f:ed44:dd4c:75fe) has joined #ceph
[17:11] * tkensiski (~tkensiski@2600:1010:b020:1029:ad7f:ed44:dd4c:75fe) has left #ceph
[17:11] * kfox1111 (~kfox@96-41-208-2.dhcp.elbg.wa.charter.com) Quit (Ping timeout: 480 seconds)
[17:15] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:16] * eschnou (~eschnou@ has joined #ceph
[17:19] * yehudasa (~yehudasa@2607:f298:a:607:e918:deb4:5e7:63ec) Quit (Ping timeout: 480 seconds)
[17:28] * yehudasa (~yehudasa@2607:f298:a:607:d499:a5eb:5688:9814) has joined #ceph
[17:30] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:30] * sagelap (~sage@2600:1012:b028:f194:5477:975b:508f:fafa) has joined #ceph
[17:31] <sagelap> joao: how does wip-mon-compact look to you?
[17:33] <joao> sorry, didn't notice you added new commits
[17:33] <joao> let me take a look
[17:33] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[17:36] <mikedawson> sage, joao: it is still working for me, but it seems like a clear workaround to the real leveldb problem. I'm seeing things like a monitor jumping from 296M -> 502M -> 285M in the span of a couple seconds (as fast as I can read the output, then hit up and enter)
[17:37] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[17:37] <joao> sagelap, looks great
[17:37] <mikedawson> sage, joao: http://pastebin.com/raw.php?i=WqsvKSAm
[17:37] <joao> sagelap, did you have a chance to look over the email I sent to dev?
[17:38] <joao> mikedawson, agree
[17:38] <sagelap> mikedawson: that fluctuation is normal (and will probably be a bit more erratic once leveldb is fixed)
[17:38] <sagelap> but yeah, this is definitely a workaround until the core issue is fixed
[17:38] <joao> mikedawson, I've seen that too, but on a far smaller scale
[17:38] <sagelap> joao: looking
[17:39] <joao> fwiw, the ##leveldb folks said that we probably are hitting this because we're using a default block size of 4MB, when sst's are 2MB at most
[17:39] <sagelap> joao: oh yeah
[17:39] <mikedawson> sage: when benchmarking 4K or 16K writes, I frustratingly see much more bandwidth consumed between monitors than I can push between clients and osds
[17:39] <sagelap> joao: oh right... yeah let's turn that down
[17:40] <joao> besides, found a dead stupid bug on the _pick_random_mon() function. testing it so I can proceed with testing
[17:40] <joao> err
[17:40] <joao> that was redundant
[17:40] <joao> testing the fix so I can proceed with leveldb testing :p
[17:40] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[17:41] <sagelap> joao: note that there was a small fix in that function committed to master
[17:41] <sagelap> tho harmless because the bool arg was never true
[17:41] <joao> iirc, that's the one about the assert?
[17:42] <joao> sagelap, I just noticed that we were comparing the monitor's name with the monmap name, and that's why we would eventually pick ourselves
[17:42] <joao> gregaf, ^
[17:42] <joao> 'a' != '0'
[17:42] <joao> err, actually the bug was on 'sync_timeout'
[17:43] <joao> changed it to pick based on rank instead of name
[17:54] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) Quit (Read error: Connection reset by peer)
[17:55] * sagelap1 (~sage@2607:f298:a:607:5141:c068:6fc6:8936) has joined #ceph
[17:56] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[17:57] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) has joined #ceph
[17:57] * sagelap (~sage@2600:1012:b028:f194:5477:975b:508f:fafa) Quit (Ping timeout: 480 seconds)
[18:02] * aliguori (~anthony@ Quit (Ping timeout: 480 seconds)
[18:03] * scooby2 (~scooby2@host81-133-229-76.in-addr.btopenworld.com) has joined #ceph
[18:03] * aliguori (~anthony@ has joined #ceph
[18:05] * scooby2 (~scooby2@host81-133-229-76.in-addr.btopenworld.com) Quit ()
[18:07] * leseb (~Adium@ Quit (Quit: Leaving.)
[18:08] * BillK (~BillK@58-7-104-61.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:14] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:15] * atb (~chatzilla@d24-141-198-231.home.cgocable.net) Quit (Max SendQ exceeded)
[18:17] * pconnelly (~pconnelly@71-93-233-229.dhcp.mdfd.or.charter.com) has joined #ceph
[18:23] * eschnou (~eschnou@252.94-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:23] * rzerres (~ralf@bermuda.daywalker-studios.de) has left #ceph
[18:23] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[18:24] * tnt (~tnt@ has joined #ceph
[18:32] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:34] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[18:38] * atb (~chatzilla@d24-141-198-231.home.cgocable.net) has joined #ceph
[18:39] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:41] * goodbytes (~kennetho@2a00:9080:f000::58) has left #ceph
[18:46] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) Quit (Read error: Connection reset by peer)
[18:46] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) has joined #ceph
[18:47] * eschnou (~eschnou@252.94-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[18:47] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) has joined #ceph
[18:49] * jmlowe (~Adium@173-15-112-198-Illinois.hfc.comcastbusiness.net) has joined #ceph
[18:50] * dwt (~dwt@128-107-239-234.cisco.com) has joined #ceph
[18:52] * aliguori (~anthony@ Quit (Ping timeout: 480 seconds)
[18:58] * siXy (~siXy@ Quit ()
[19:00] * madkiss (~madkiss@ has joined #ceph
[19:01] * BillK (~BillK@58-7-104-61.dyn.iinet.net.au) has joined #ceph
[19:03] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: my troubles seem so far away, now yours are too...)
[19:03] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[19:03] * ChanServ sets mode +o scuttlemonkey
[19:03] * madkiss1 (~madkiss@089144192030.atnat0001.highway.a1.net) has joined #ceph
[19:08] * madkiss (~madkiss@ Quit (Ping timeout: 480 seconds)
[19:11] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[19:15] * sjustlaptop (~sam@ has joined #ceph
[19:31] <cjh_> probably a silly question but would it be possible to create an ssh pipe with python librbd to ssh objects into rados?
[19:32] <cjh_> trying to think of easy ways to get data into the cluster without kernel recompiling or all kinds of craziness
[19:32] <dmick> not objects, but images. But you could use python librados to do something like that, conceivably. However, note that both rbd and rados CLIs exist, and can create images and/or objects from stdin
[19:34] * JohansGlock (~quassel@kantoor.transip.nl) Quit (Read error: Connection reset by peer)
[19:37] * rtek (~sjaak@rxj.nl) Quit (Read error: Connection reset by peer)
[19:38] * rtek (~sjaak@rxj.nl) has joined #ceph
[19:38] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[19:38] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[19:38] * JohansGlock (~quassel@kantoor.transip.nl) has joined #ceph
[19:39] <cjh_> dmick: awesome. that's what i was looking for
[19:39] <cjh_> so i could pipe data right into it from stdin
[19:39] <dmick> yeah. rbd import - or rados put -
[19:40] <cjh_> i'm finding that the mds server is just way too slow unfortunately :(. So maybe some other way works
[19:41] <dmick> not sure ssh'ing to a CLI is really the speed path (and, note, of course you don't need to ssh to a particular host to put things into a networked filesystem; the client can run on anything connected to the cluster)
[19:41] <cjh_> true, i guess i'm thinking about this backwards
[19:41] <cjh_> i don't see the rbd put command. i see create however
[19:41] <dmick> read that line again
[19:42] * jmlowe (~Adium@173-15-112-198-Illinois.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[19:42] <cjh_> oh sorry, yes i see import
[19:42] <cjh_> it's early haha
[19:43] <cjh_> i think this would work fine. i can use rbd import to just fire stuff into ceph as the backup server
[19:43] <dmick> what kind of stuff do you have?
[19:43] <cjh_> can i have multiple servers import to the same rbd dest-image?
[19:43] <dmick> what's the nature of the data?
[19:43] <cjh_> large gz objects
[19:44] <cjh_> sometimes millions of tiny 1MB gz objects
[19:44] <dmick> and, no, that doesn't make sense; think of an rbd image as a disk device
[19:44] <cjh_> depends on what system i'm backing up
[19:44] <cjh_> ok
[19:44] <dmick> seems like using plain old objects makes more sense then
[19:44] <cjh_> how would i do that?
[19:44] <dmick> note that it's a flat namespace, except for pool, so...
[19:44] <dmick> rados
[19:45] * sagelap1 (~sage@2607:f298:a:607:5141:c068:6fc6:8936) Quit (Quit: Leaving.)
[19:45] <cjh_> the gateway?
[19:45] <dmick> you should experiment a little with the CLIs if you haven't
[19:45] <dmick> /usr/bin/raods
[19:45] <dmick> *rados
[19:45] <cjh_> yeah i haven't done much of anything with them
[19:46] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:46] <dmick> rados mkpool mypool
[19:46] <dmick> rados -p mypool ls
[19:47] <dmick> echo "foobarbaz" | rados -p mypool put - myobj
[19:47] <dmick> rados -p mypool ls
[19:47] <dmick> myobj
[19:47] <dmick> rados -p mypool get myobj -
[19:47] <dmick> foobarbaz
[19:47] <dmick> er, sorry, switched the args on the put
[19:47] <dmick> echo "foobarbaz" | rados -p mypool put myobj -
[19:48] <cjh_> interesting
[19:48] <cjh_> i'll take a look at that
[19:48] * LeaChim (~LeaChim@ has joined #ceph
[19:53] <cjh_> dmick: so for the mds i'm a little confused why inktank didn't follow the gluster route and make the fuse client aware of everything and drop the need for a dedicated server
[19:55] <dmick> mds provides the POSIX abstraction. That doesn't exist if you don't run mds, as many don't need POSIX semantics at all
[19:56] * madkiss1 (~madkiss@089144192030.atnat0001.highway.a1.net) Quit (Quit: Leaving.)
[19:57] <cjh_> i see
[19:57] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[19:57] <cjh_> couldn't the fuse client emulate that though?
[19:59] <dmick> well, you can access cephfs-the-filesystem through FUSE or kernel modules, and from multiple clients. There needs to be a centrally-controlled datastore for filesystem semantics. mds isn't a single server, it's a collection of redundant servers, but that collection owns the data for the filesystem.
[19:59] <dmick> it's really a different architecture, and as such has tradeoffs
[19:59] <cjh_> ok
[20:00] <cjh_> just curious
[20:00] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[20:06] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[20:09] * gmason_ (~gmason@hpcc-fw.net.msu.edu) has joined #ceph
[20:11] * gmason (~gmason@hpcc-fw.net.msu.edu) Quit (Ping timeout: 480 seconds)
[20:16] * Cube (~Cube@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[20:18] * aliguori (~anthony@ has joined #ceph
[20:20] <kylehutson> I've posted here a couple of times and on the ceph-users mailing list, and haven't gotten any responses. Can somebody *please* help? I've got a class of grumpy grad students that can't complete a project now and I don't like being the object of their grumpiness. :-)
[20:20] <kylehutson> Here was my question to the list: http://www.mail-archive.com/ceph-users@lists.ceph.com/msg01028.html
[20:20] <janos> i like to tell grad students to go get a job to make them even happier
[20:21] * BillK (~BillK@58-7-104-61.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[20:22] <janos> how did you actually remove leviathan
[20:23] <janos> did you follow the procedure for removing OSD's, or was it basically a plug-pull
[20:23] <kylehutson> I followed the procedure.
[20:23] <jmlowe> why are your osd numbers not consecutive?
[20:24] * stxShadow (~Jens@ip-88-152-161-249.unitymediagroup.de) has joined #ceph
[20:24] <janos> when you removed one host, did you get a "stable" 50% degraded before adding the new host?
[20:24] <kylehutson> When I installed, it let me determine my own OSD numbers. I put them in an order I thought logical.
[20:24] <kylehutson> i.e., 100 series on one server, 200 series on the other.
[20:24] <janos> kylehutson: i tried that with osd numbers before. it's not recommended
[20:24] <jmlowe> that's not good from what I recall
[20:24] <janos> it's not. i never had a fully happy cluster doing that
[20:25] <janos> no red-flag errors, but defintiely funkiness and unpleasantness
[20:25] <kylehutson> I added the new host first (before removing the old one), and it said 'stable' then
[20:25] <jmlowe> one of the dev's can chime in, but I think it messes with some of the assumptions crush makes
[20:26] <dmick> the only bad effect I'm aware of is using more memory for sparse tables, but it's definitely not recommended
[20:26] <kylehutson> Well, then, is there a way to migrate those servers to consecutive numbers before going any further, or will that break crush also?
[20:26] <janos> sounds like aside from the osd numbers, you did some pretty standard procedure. i'd be wary of those osd numbers. i smashed my head against walls from oddball unstableness when i did that
[20:26] <kylehutson> The new server made me start with osd.0
[20:26] * madkiss (~madkiss@ has joined #ceph
[20:26] <jmlowe> listen to dmick, not me
[20:27] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[20:27] <jmlowe> did you omit the host line for aergia?
[20:27] * dwt (~dwt@128-107-239-234.cisco.com) Quit (Read error: Connection reset by peer)
[20:28] <jmlowe> -2 0 host leviathan
[20:28] <jmlowe> -4 8 host minotaur
[20:28] <jmlowe> ????
[20:29] <kylehutson> When I started the others, just having the 'host' line in the ceph.conf was enough. I was surprised it wasn't there when I started them on aergia. But yes, I did omit that.
[20:30] <mikedawson> kylehutson: it looks like the osds on the new host (aergia) were added, but never put in the right spot on the ceph osd tree (and like others have said, don't number osds yourself)
[20:30] <jmlowe> ok
[20:30] <kylehutson> *shrug* - that's what the output of the command says. I never used the 'osd tree' command before having problems here and trying to fix it
[20:31] <jmlowe> just looking for differences between your output and mine
[20:33] <kylehutson> mikedawson: lesson learned on numbering myself.
[20:34] <mikedawson> kylehutson: do "ceph osd getcrushmap -o crushmap && crushtool -d crushmap -o crushmap.txt" and paste crushmap.txt somewhere
[20:34] <kylehutson> Is there any way to find out if I have any live pages on aergia now? If not, I can take it back out of the tree and reinsert properly
[20:34] <janos> i always name my crush output files "clush" in honor of Conan
[20:35] <janos> like "clush your enemies"
[20:35] * Cube (~Cube@ has joined #ceph
[20:36] <kylehutson> http://pastebin.com/5vN0PdmW
[20:36] <jmlowe> janos: lol
[20:37] <kylehutson> The tunables I implemented recently after finding that sometimes replications didn't happen if they weren't installed
[20:37] <mikedawson> kylehutson: pretty sure there should be a host aergia section in there
[20:37] <jmlowe> mikedawson: yeah, I think his clushmap is suspect
[20:37] <janos> haha
[20:37] * eschnou (~eschnou@252.94-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:38] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[20:38] <mikedawson> kylehutson: take a look at all the bogus devices at the top ... symptom of your osd numbering scheme, i think
[20:38] <jmlowe> too soon to make a joke about the lamentations of the users?
[20:38] <kylehutson> Never too soon to joke, IMHO
[20:38] * madkiss (~madkiss@ Quit (Ping timeout: 480 seconds)
[20:39] <janos> hahah
[20:39] <mikedawson> kylehutson: the OSDs 0-8 are listed under unknownrack, that is wrong
[20:39] <kylehutson> OK - so how to fix?
[20:39] <jmlowe> so mikedawson is right on, aergia isn't getting any objects because it isn't in the tree
[20:40] <kylehutson> aergia isn't living up to his name (the spirit of laziness).
[20:41] <mikedawson> kylehutson: you could try something like "ceph osd crush move aergia root=default rack=unknownrack", but I think you may need to fix the host aegia part first
[20:41] <jmlowe> your map goes default->unknown rack -> leviathan or menotaur, leviathan is gone, oh shit, I can't meet the requirements of the rules
[20:41] <kylehutson> jmlowe: gotcha
[20:42] <jmlowe> might be as easy as adjusting the leviathan section to be aergia
[20:43] <jmlowe> dmick showed me the crushmap simulation tool a while ago, you should fix up your map, compile it, run it through simulation to see if it does what you think it should
[20:43] <jmlowe> then if all is well inject
[20:44] <kylehutson> OK, how do I do that?
[20:44] <mikedawson> kylehutson: jmlowe may be right, but make sure you know what you are doing. It's more involved than just s/lefiathan/aergia/g
[20:44] <jmlowe> well you dumped and decoded your map already, edit the text then -c for compile -o newcrushmap
[20:45] <jmlowe> I think crushtool has the simulator
[20:45] <kylehutson> mikedawson: I think I've already proved that I don't know what I'm doing. :-)
[20:46] <jmlowe> yeah, I didn't explicitly say adjust the osd numbers to match what's actually on aergia as well as change the name and id number
[20:46] <jmlowe> changing the id number may be crucial, I'm just a layman
[20:47] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:47] <kylehutson> I just ran 'ceph osd out osd.0' through .7 (the new ones on aergia)
[20:47] <kylehutson> Since there was nothing there anyway.
[20:48] <jmlowe> you'll want them in
[20:48] * stxShadow (~Jens@ip-88-152-161-249.unitymediagroup.de) Quit (Read error: Connection reset by peer)
[20:49] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[20:50] <pioto> hi, is there documentation somewhere of all the options you can pass to an rbd drive with qemu? the things that libvirt seems to pass (auth_supported, id, key, ...) don't seem to be documented here, at least: http://ceph.com/docs/master/rbd/qemu-rbd/
[20:50] <kylehutson> OK, they're back in
[20:51] <kylehutson> …but still not showing the host in the osd tree
[20:52] <mikedawson> kylehutson: I really don't know that this will work.... you'll likely lose all the data you haven't already... other disclaimers.... http://pastebin.com/raw.php?i=b7MN8zYr
[20:52] <kylehutson> I can deal with that.
[20:52] <jmlowe> mikedawson: looks about right to me
[20:53] <mikedawson> kylehutson: and when I say I don't know it will work, I really mean that. Thanks for the review jmlowe
[20:53] <joshd> pioto: it's any ceph option (of course only client side options will actually matter). the only special case is that name can't be used, it has to be set via id
[20:53] * eschnou (~eschnou@252.94-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[20:54] * CROS__ (~hworld@ has joined #ceph
[20:54] <jmlowe> crushtool -c thatfilemikedawsongaveyou -o mynewtestcrushmap && crushtool --test --output-csv -i mynewtestcrushmap
[20:55] <jmlowe> should dump out a ton of .csv files that you can review to get an idea of where things will be placed
[20:55] <kylehutson> jmlowe: the first part of the command gives me "tunables are NOT FULLY IMPLEMENTED; enable with --enable-unsafe-tunables to enable this feature" - three times
[20:55] <pioto> joshd: hm. ok, and 'some option' becomes some_option, i guess?
[20:56] <mikedawson> kylehutson: that's consistent with what I've seen
[20:56] <kylehutson> So do I put that on the crushtool command line?
[20:56] <jmlowe> um, I know very little about tunables other than I can't use them with my old clients
[20:56] * m0zes (~oftc-webi@dhcp251-10.cis.ksu.edu) has joined #ceph
[20:56] <kylehutson> I updated kernels just so I could use them - now I'm doubting the wisdom of that decision.
[20:57] <joshd> pioto: yeah, space, -, and _ are treated equivalently, but typically used in config file, ceph command line, and other places, respectively
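(The equivalence joshd describes can be sketched as a tiny normalization function; this is illustrative Python, not ceph's actual implementation:)

```python
def normalize_option(name: str) -> str:
    """Ceph treats ' ', '-', and '_' in option names as equivalent;
    canonicalize to the underscore form (illustrative sketch only)."""
    return name.replace(" ", "_").replace("-", "_")

# "auth supported" (config file style), "auth-supported" (command line
# style) and "auth_supported" all refer to the same option.
print(normalize_option("auth supported"))  # auth_supported
```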
[20:57] <kylehutson> …or do I take the tunables section out? Will that break things?
[20:58] <kylehutson> …or do I just ignore it?
[20:58] <mikedawson> kylehutson: I left them in there, but you shouldn't need them on a typical install. Don't believe it will let you ignore the error
[20:59] <pioto> joshd: k, thanks
[20:59] <kylehutson> mikedawson: I was surprised but the new crushmap did get created.
[21:01] <mikedawson> kylehutson: you haven't done the part where you set the crushmap. That would likely be the critical part.
[21:01] <kylehutson> I took the tunables out, ran the command jmlowe gave, and now have lots of CSVs
[21:02] * eschnou (~eschnou@252.94-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:03] * dwt (~dwt@wsip-70-166-104-226.ph.ph.cox.net) has joined #ceph
[21:03] <mikedawson> kylehutson: if you like the map, and it passes jmlowe's tests, you set it "ceph osd setcrushmap -i mynewtestcrushmap". That's the risky part, I suppose
[21:03] <kylehutson> mikedawson: I've got that, but how does it "pass jmlowe's tests"?
[21:04] * dontalton2 (~dwt@128-107-239-234.cisco.com) has joined #ceph
[21:04] <mikedawson> examine the output, sanity check, etc
[21:05] <jmlowe> take a look at the generated csv's they should tell you how a bunch of random objects will be distributed between osd's
[21:08] <kylehutson> just set the new crush map - osd tree now looks like I would have expected it.
[21:09] <kylehutson> Lots of data being moved now.
[21:09] <dmick> ceph -w will let you observe pg's becoming clean, if they are
[21:09] <jmlowe> you should see some things moving onto aergia
[21:10] <kylehutson> dmick: that's what I was doing to see data being moved
[21:11] <jmlowe> ceph pg dump | less and look for the osd numbers that belong to aergia in brackets, means they either are on there or will be eventually
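The same check can be done against the JSON form of pg dump instead of eyeballing brackets in `less`. A hedged sketch, assuming the `pg_stats`/`pgid`/`up`/`acting` field names of `ceph pg dump --format=json` from this era (verify against your version's output):

```python
import json

def pgs_touching_osd(pg_dump_json, osd_id):
    """Return pgids whose up or acting set includes osd_id.

    Field names (pg_stats, pgid, up, acting) are assumed to
    match `ceph pg dump --format=json`; check your version.
    """
    stats = json.loads(pg_dump_json).get("pg_stats", [])
    return [s["pgid"] for s in stats
            if osd_id in s.get("up", []) or osd_id in s.get("acting", [])]

# Tiny fabricated dump for illustration:
sample = json.dumps({"pg_stats": [
    {"pgid": "0.1", "up": [4, 2], "acting": [4, 2]},
    {"pgid": "0.2", "up": [1, 3], "acting": [1, 3]},
]})
print(pgs_touching_osd(sample, 4))
```

Feed it the real output of `ceph pg dump --format=json` and the OSD ids belonging to the new host to list the PGs that are, or will be, on it.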
[21:11] * dwt (~dwt@wsip-70-166-104-226.ph.ph.cox.net) Quit (Ping timeout: 480 seconds)
[21:13] <kylehutson> jmlowe: Yay! several there, including many active+clean
[21:14] <jmlowe> mikedawson did the hard work
[21:16] <jmlowe> which university are you at?
[21:16] <mikedawson> jmlowe: not sure he's out of the woods just yet, but thanks
[21:20] <kylehutson> jmlowe: Kansas State
[21:21] <kylehutson> We've got about 4 TB of data to move, so this won't happen fast.
[21:24] <jmlowe> <- IU
[21:26] * madkiss (~madkiss@ has joined #ceph
[21:27] <kylehutson> jmlowe: Nice - what area? I'm a sysadmin for the HPC cluster.
[21:28] <jmlowe> UITS - Research Technologies - High Performance Systems group
[21:28] <jmlowe> BigRed, BigRed II, Quarry, Mason
[21:29] <kylehutson> Going to SC13 in Denver? We'll have a booth there.
[21:29] <jmlowe> http://racinfo.indiana.edu/hps
[21:29] <jmlowe> I'm considering it, skipped sc 12, new baby
[21:29] <jmlowe> we always have a booth
[21:30] <kylehutson> Yeah, we're actually stealing your old booth.
[21:31] <jmlowe> I think you may have one of our alumni there, can't remember his name, possibly a director of hpc?
[21:32] <kylehutson> Rick McMullen at Arkansas had a connection at IU and knew your old booth was being retired, so "Great Plains Network" is taking it.
[21:32] <jmlowe> that's him
[21:32] <kylehutson> HPC director here came from UCSB.
[21:33] <kylehutson> He was at KU, and is now at Arkansas. Good guy.
[21:33] <jmlowe> yeah, was sorry to see him go
[21:33] <m0zes> he's much happier it seems at Arkansas
[21:33] <nhm> kylehutson: I used to be at the Minnesota Supercomputing Institute
[21:33] <m0zes> can't say I blame him. I wouldn't want to be at KU
[21:33] <kylehutson> m0zes is sitting about 5 feet from me.
[21:34] * m0zes is the other admin for our cluster
[21:34] * kfox1111 (bob@ has joined #ceph
[21:34] <kylehutson> Yay! Lots of HPC folks!
[21:34] <jmlowe> just can't get away from us
[21:35] <kfox1111> api question. I would like to reuse a connection in a long running process. If the connection dies for some reason though, I need to know how to recover from it. Is there a connection dropped error code or something that I can use for this?
[21:38] * atb (~chatzilla@d24-141-198-231.home.cgocable.net) Quit (Max SendQ exceeded)
[21:42] <dmick> kfox1111: when you say "a connection", you mean....
[21:48] * madkiss (~madkiss@ Quit (Quit: Leaving.)
[21:52] <kfox1111> err = rados_connect(cluster);
[21:53] <kfox1111> so, if I shut things down, that line gives me -ETIMEDOUT.
[21:53] <kfox1111> I'm just wondering if I shut down things in the middle of already having a connection, will it -ETIMEDOUT or do other things?
[21:57] <joshd> kfox1111: rados_connect() isn't setting up an individual connection, but an ongoing session with the monitors
[21:58] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[21:59] <joshd> it'll attempt to reconnect internally if it can't talk to the monitors or osds, other operations won't see the connection loss, they'll just block (unless they're aio of course)
[22:00] <joshd> everything that is in progress will be resent as needed, etc.
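joshd's point is that `rados_connect()` establishes a session rather than a single TCP connection, so a long-running process normally keeps one handle and librados handles reconnection itself; the only call kfox1111 saw fail was the initial connect. A hedged retry sketch for that case, written as a generic helper so it is not tied to any particular librados binding (the `rados.Rados(...).connect()` usage mentioned in the docstring is an assumption, not shown here):

```python
import time

def connect_with_retry(connect, attempts=5, delay=2.0):
    """Call connect() until it succeeds, backing off between tries.

    `connect` is any zero-argument callable that raises on
    failure, e.g. a lambda wrapping the Python librados
    bindings' rados.Rados(conffile=...).connect() (assumed).
    """
    last_err = None
    for i in range(attempts):
        try:
            return connect()
        except Exception as e:  # e.g. a timeout surfaced as an exception
            last_err = e
            time.sleep(delay * (i + 1))  # linear backoff
    raise last_err
```

Once connected, per joshd, in-flight operations are resent internally; synchronous calls block through an outage rather than returning a connection-dropped error, so this wrapper only matters at startup.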
[22:01] * pconnelly (~pconnelly@71-93-233-229.dhcp.mdfd.or.charter.com) has left #ceph
[22:01] <kfox1111> hmm.... ok. Thanks.
[22:01] * dontalton2 (~dwt@128-107-239-234.cisco.com) Quit (Read error: Connection reset by peer)
[22:05] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[22:06] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[22:07] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) Quit (Read error: Connection reset by peer)
[22:08] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) has joined #ceph
[22:12] * mega_au (~chatzilla@ Quit (Ping timeout: 480 seconds)
[22:13] * Vjarjadian (~IceChat77@ has joined #ceph
[22:15] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:17] * atb (~chatzilla@d24-141-198-231.home.cgocable.net) has joined #ceph
[22:20] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[22:32] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has left #ceph
[22:39] * john_barbee_ (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[22:39] * john_barbee_ (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) Quit ()
[22:41] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:50] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) Quit (Read error: Connection reset by peer)
[22:51] * doubleg (~doubleg@ Quit (Quit: Lost terminal)
[22:52] * median (~0x00@85-220-109-89.dsl.dynamic.simnet.is) has joined #ceph
[22:54] <gregaf> sagewk: wip-4837-election-syncing, less trouble than I thought but it's a little bit fiddly about reverting because we've exposed the message format externally
[22:54] <gregaf> I'm not aware of any reliable reproducers so I think I'm just going to run it through the monitor suites a couple times and check for any regressions
[22:54] <sagewk> k
[22:54] <gregaf> do you want me to just cherry-pick Joao's commit out of his compact-dbg branch?
[22:55] <sagewk> sure
[22:55] <sagewk> whichever ones are appropriate
[22:55] <sagewk> i think that rank stuff can wait for master
[22:55] <gregaf> you pointed out two to me, dropping the paxos_max_join_drift and fixing pick_random_mon
[22:56] <gregaf> but I don't remember the pick_random_mon history or if it's appropriate to grab or not
[22:56] <gregaf> I'll leave it out since we aren't using it
[22:57] * danieagle (~Daniel@ has joined #ceph
[22:58] <sagewk> yeah
[23:08] * JohansGlock_ (~quassel@kantoor.transip.nl) has joined #ceph
[23:08] <kfox1111> sagewk: I'm setting up a vm image that I want to distribute to people. In it, I setup a standalone ceph instance. During the build of the image, it formats things. I would like to have the script wait until the scrub is finished so that when I distribute the image, it is nice and happy, instead of coming up and then scrubbing right away. Know a way to do that?
[23:09] * CROS__ (~hworld@ Quit (Ping timeout: 480 seconds)
[23:09] <sagewk> kfox1111: you can look at ceph pg dump [--format=json] and wait until the scrub stamps are all non-blank...
[23:09] <kfox1111> ah. thanks. :)
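sagewk's wait-for-scrub-stamps suggestion can be sketched as a small polling script. Assumptions: the field is named `last_scrub_stamp` in `ceph pg dump --format=json` output of this era, and a never-scrubbed PG reports a zero stamp that starts with "0.000000" (check your version's output before relying on either):

```python
import json
import subprocess
import time

def all_pgs_scrubbed(pg_dump_json):
    """True when every PG reports a non-blank last_scrub_stamp.

    The field name last_scrub_stamp and the "0.000000" form of
    a blank stamp are assumptions about this era's pg dump.
    """
    for s in json.loads(pg_dump_json).get("pg_stats", []):
        stamp = s.get("last_scrub_stamp", "")
        if not stamp or stamp.startswith("0.000000"):
            return False
    return True

def wait_for_scrubs(poll=10):
    """Block until the initial scrub pass has stamped every PG."""
    while True:
        out = subprocess.check_output(
            ["ceph", "pg", "dump", "--format=json"])
        if all_pgs_scrubbed(out.decode()):
            return
        time.sleep(poll)
```

Calling `wait_for_scrubs()` at the end of the image-build script would hold the build until the cluster has scrubbed once, which is what kfox1111 wanted before snapshotting the VM.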
[23:10] * rturk-away is now known as rturk
[23:11] <sagewk> np
[23:15] * JohansGlock (~quassel@kantoor.transip.nl) Quit (Ping timeout: 480 seconds)
[23:19] <sagewk> elder: ping
[23:19] <elder> I'm here.
[23:23] <nhm> ooh, zfs on linux is doing pretty good for not having SSD ZILs here.
[23:27] <elder> sagewk, you pang?
[23:27] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[23:27] <lurbs> nhm: Speaking of which... Kernel 3.9 has dm-cache support, to do something vaguely similar.
[23:27] <lurbs> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/device-mapper/cache.txt
[23:29] <dmick> lol@pang
[23:30] <sagewk> i had pung?
[23:30] <sagewk> elder: there were 2 xfstest 139 hangs in a row
[23:30] <elder> It's first grade Spongebob.
[23:30] <elder> Last night?
[23:31] <sagewk> last 2 nights
[23:31] <sagewk> the paths are on the bug.. care to take a look?
[23:31] <elder> Interesting. Yes.
[23:31] <elder> Bug number?
[23:31] <sagewk> i forget.. it's in rbd project prio high
[23:31] <elder> OK
[23:32] <sagewk> cool
[23:32] <elder> http://tracker.ceph.com/issues/4661
[23:32] <nhm> lurbs: fun! :D
[23:35] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[23:37] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[23:39] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:39] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:44] <sagewk> elder: looks like the machines got cleaned up :(
[23:44] <elder> If it repeats like that I may be able to reproduce it anyway.
[23:44] <sagewk> we'll see what happens tonight
[23:44] * rturk is now known as rturk-away
[23:45] <tnt> joshd: scuttlemonkey_ suggested I ask you about this : "wrt to rbd. I'm wondering if anyone ever thought of striping data over two objects. Currently you have 4M chunks mapped to objects. But imagine you mapped 4M chunks to 2 different objects in a striped fashion where 0->64k,128k->192k,... is obj0 and 64k->128k, 192k->256k ... is obj1."
[23:46] * LeaChim (~LeaChim@ has joined #ceph
[23:46] <joshd> tnt: that's what stripingv2 with format 2 images lets you do
[23:47] <joshd> tnt: not supported by the kernel yet, but librbd understands stripe count and stripe unit to set up that kind of striping
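tnt's 64k-interleave example is exactly what stripe unit and stripe count express. A tiny sketch of the round-robin mapping with his numbers (stripe unit 64 KB, count 2); this shows only the within-object-set arithmetic he described, not the full stripingv2 layout:

```python
def object_index(offset, stripe_unit=64 * 1024, stripe_count=2):
    """Which object of the set a byte offset lands in, per the
    round-robin striping tnt described: obj0 gets stripes
    0, 2, 4, ... and obj1 gets stripes 1, 3, 5, ..."""
    return (offset // stripe_unit) % stripe_count

# 0-64k -> obj0, 64k-128k -> obj1, 128k-192k -> obj0 again
```

In librbd these show up as the stripe unit and stripe count set at format 2 image creation, per joshd; the exact CLI/API spelling of those options is left to the rbd docs of your version.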
[23:48] <sstan> kernel 3.9 has been released ... doesn't rbd support that now?
[23:48] * portante (~user@ has joined #ceph
[23:48] <tnt> joshd: Oh ok, nice. Were there any preliminary benchmarks or pro/con tests?
[23:48] <joshd> sstan: 3.9 doesn't support format 2; 3.10 should, but won't support stripingv2 yet, just layering
[23:48] <sstan> ah
[23:49] <tnt> I'm starting to move away from the kernel client in favor of a userspace one anyway.
[23:49] <cjh_> tnt: agreed
[23:49] <sstan> joshd : if I set tunables to be optimal, will rbd still work under kernel 3.9?
[23:50] <joshd> tnt: some basic tests show what you'd expect: it's good for sequential I/O, but doesn't affect random
[23:50] <joshd> tnt: haven't tried to determine the best parameters or anything yet though
[23:51] <tnt> joshd: ok. I need to add a bunch of statistics/histograms to my userspace driver to get a better understanding of the type of access done by my workload, see if it would help.
[23:52] <joshd> sstan: http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunables#which-client-versions-support-crush-tunables2 says yes
[23:53] <sstan> joshd: cool thanks
[23:53] * b1tbkt (~Peekaboo@68-184-193-142.dhcp.stls.mo.charter.com) has joined #ceph
[23:59] * eschnou (~eschnou@252.94-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.