#ceph IRC Log

IRC Log for 2012-08-23

Timestamps are in GMT/BST.

[0:05] * nhm (~nhm@65-128-137-98.mpls.qwest.net) has joined #ceph
[0:06] <nhm> mikeryan: so these intel 520s are doing ~500MB/s on a 2208 when configured in jbod mode. :D
[0:06] <nhm> At least writing out presumably highly compressible fio data.
[0:06] <Tv_> that sounds about what i can push my intel ssds to, without extra effort
[0:08] <nhm> Tv_: much better than the Dell branded samsungs.
[0:09] <gregaf> fghaas: dmick: that's correct -- once upon a time the monitors distributed the class .so's to the OSDs directly
[0:09] <nhm> Tv_: Goal here is to have 3 journals per SSD.
[0:10] <Tv_> nhm: yeah duh.. intel ssds have a history of being the best ones (apart from some very high-end competition)
[0:10] <gregaf> but doing that correctly across architectures, OSes, etc is difficult and so we just decided to offload the task to management software that should already have the capability
[0:10] <gregaf> and since the monitors aren't in that path now they can't really provide information about it
[0:11] <nhm> Tv_: yeah, though there are plenty of drives that are at least close to the Intels these days.
[0:11] <mikeryan> nhm sweet throughput, but it can't compare to my 100 GB of ramdisks *grin*
[0:11] <nhm> Tv_: The Dell drives are like 250MB/s.
[0:11] <nhm> mikeryan: speaking of which, what kind of throughput are you getting to that?
[0:12] <mikeryan> haven't benchmarked the disks directly, but smalliobenchfs is getting 800 MB/sec without breaking a sweat
[0:12] <nhm> wait, the Dell drives were more like 200MB/s.
[0:12] <fghaas> gregaf, dmick: thanks
[0:13] <mikeryan> sam and i think we've traced our throughput problem down to the messaging layer
[0:13] <mikeryan> http://lacklustre.net/typical_osd_op.png
[0:13] <mikeryan> in case you ever wondered where a typical op spends most of its time
[0:14] <mikeryan> y axis is time, seconds
[0:15] <nhm> mikeryan: I'm certainly willing to believe that at least one bottleneck is in the messenger. ;)
[0:15] <gregaf> ummm.....what?
[0:15] <gregaf> *stares at picture and tries to blame messenger*
[0:15] <elder> What is "throttled"?
[0:15] <mikeryan> this op took about 0.3 sec from the OSD's perspective
[0:15] <mikeryan> and about 0.8 from the client's perspective
[0:15] <Tv_> don't throttle the messenger!
[0:15] <nhm> lol
[0:16] <Tv_> *ba-dum tisch!*
[0:16] <elder> Nice sound effects.
[0:16] <gregaf> that message was throttled on its own for almost a tenth of a second (after the OSD read the header)
[0:16] <Tv_> http://www.youtube.com/watch?v=8eXj97stbG8
[0:16] <gregaf> if you want me to believe the messenger is the problem you'd better produce graphs where there's no throttling section
[0:17] <mikeryan> gregaf: care to explain where the extra 0.5 sec came from, if not the messenger?
[0:17] <mikeryan> this is wallclock time, btw
[0:17] <gregaf> because my response right now is "the OSD messenger stopped reading client data because it was at its limit, and so the tcp buffers filled up"
[0:17] <nhm> gregaf: we've certainly seen that happen.
[0:18] <gregaf> mikeryan: so the extra .5 seconds is effectively more throttling before the header even got read
[0:18] <gregaf> would be my guess
[0:18] <mikeryan> either that, or tacked on the other end
[0:18] <mikeryan> i haven't correlated client and OSD timestamps yet
[0:18] <gregaf> you can disable throttling entirely if you want to see it without that
[0:18] <nhm> mikeryan: 800MB/s to ramdisk actually kind of worries me.
[0:19] <mikeryan> nhm: got an io benchmaker i can use on the raw disk?
[0:19] <mikeryan> that's 800 MB through the FileStore
[0:20] <nhm> mikeryan: fio will give you lots of control over how you want to do writes.
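For reference, a minimal sketch of the kind of fio run nhm is pointing at (the device path, block size, and runtime here are placeholders, and writing to a raw device like this is destructive):

    # sequential large writes straight to a block device, bypassing the page cache
    fio --name=rawwrite --filename=/dev/sdX --rw=write --bs=4M \
        --ioengine=libaio --iodepth=16 --direct=1 --runtime=60 --time_based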
[0:21] <gregaf> mikeryan: set osd_client_message_size_cap to 0
[0:22] <gregaf> and maybe even ms_dispatch_throttle_bytes (which controls how much data the messenger will shove into the daemon's dispatch loop at once)
[0:22] <gregaf> run your tests again
[0:22] <gregaf> :)
[0:26] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:26] <mikeryan> gregaf: which section does ms_dispatch_throttle_bytes live in?
[0:27] <nhm> gregaf: btw, there has been a fair amount of discussion on the SimpleMessenger locking investigation that Andreas raised on the mailing list.
[0:27] * loicd1 (~loic@brln-d9bad56f.pool.mediaWays.net) Quit (Quit: Leaving.)
[0:27] * danieagle (~Daniel@177.43.213.15) Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[0:29] <gregaf> mikeryan: whichever daemon you want it to apply to
[0:29] <gregaf> ms_dispatch_throttle_bytes is in the SimpleMessenger constructor
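A minimal ceph.conf sketch of what gregaf is suggesting, assuming both options go under [osd] and that 0 is treated as "no limit" (values for experimentation only, not production):

    [osd]
        ; stop throttling client message data at the OSD
        osd client message size cap = 0
        ; and stop throttling bytes fed into the dispatch loop
        ms dispatch throttle bytes = 0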
[0:30] <mikeryan> gregaf: that shaved off ~0.05 sec
[0:30] <mikeryan> on each request
[0:31] <gregaf> both options or just one of them?
[0:31] <mikeryan> both
[0:32] <gregaf> okay -- Sam thinks maybe this picture is actually labeled wrong, hang on
[0:34] <mikeryan> sorry, that shaved off ~0.005 sec
[0:34] <gregaf> wait, what are the units on this picture?
[0:34] <mikeryan> order of magnitude
[0:34] <mikeryan> units are seconds
[0:34] <gregaf> okay
[0:34] <mikeryan> next graph will be labelled, sorry
[0:34] <gregaf> so it shaved off like 1/6 of the pictured time
[0:34] <gregaf> but the client's perspective of the time was still .08 seconds?
[0:34] <nhm> hrm, do we not have 3.5 kernel debs in gitbuilder?
[0:35] <mikeryan> shaved off 0.005 seconds from both the client and OSD perspective
[0:35] <gregaf> so now the client's at .075 seconds
[0:35] <mikeryan> yes
[0:35] <gregaf> okay
[0:35] <gregaf> and are the server and client actually on the same machine?
[0:35] <mikeryan> yep
[0:35] <mikeryan> all running off pure ramdisks
[0:36] <mikeryan> (not tmpfs)
[0:36] <mikeryan> all logging is to another ramdisk
[0:36] * didders (~didders@rrcs-71-43-128-65.se.biz.rr.com) Quit (Read error: Operation timed out)
[0:36] <gregaf> and Sam says that the green section is actually the time between when it got unthrottled and when it reached the next step (which is "all read" and the whole message is in memory)
[0:36] <gregaf> now I'm more interested in seeing what's going on inside the messenger and the network
[0:37] <mikeryan> i think that's where we should focus since the gross operation time from client's perspective is more than double what we're seeing in my graph
[0:39] <nhm> mikeryan: this was a single OSD with replication of 1?
[0:39] <mikeryan> yes
[0:39] <mikeryan> the logs and osd are actually on tmpfs
[0:39] <mikeryan> filestore is xfs on a ramdisk
[0:39] <mikeryan> log is a raw ramdisk
[0:39] <nhm> mikeryan: both were on the same switch?
[0:39] <mikeryan> 40gb/10gb there
[0:39] <mikeryan> same damn machine
[0:40] <nhm> was it going out over the network or just on loopback?
[0:41] <mikeryan> loopback
[0:41] <mikeryan> everything controllable was controlled..
[0:42] <mikeryan> brb
[0:42] <nhm> I doubt that, but I do believe you controlled the low hanging fruit. ;)
[0:43] <gregaf> mikeryan: wait, Sam is pointing out that according to this graph, using memory we're taking almost 10 milliseconds to read 4MB of loopback data
[0:43] <gregaf> which is actually about the bandwidth we can get out of the SimpleMessenger on this box
[0:45] <nhm> gregaf: 10ms per 4MB of data with or without the simple messenger?
[0:45] <nhm> gregaf: I'm confused what you mean...
[0:47] <gregaf> the green part in your picture
[0:48] <gregaf> is the time between "throttle_stamp" (which is set when the messenger gets the throttler allocation it needs to read the message into memory) and the time between ... (just a sec)
[0:48] <gregaf> it's the time between "throttle_stamp" and "recv_complete_stamp", which is the time at which the entirety of the message was read into memory
[0:49] <gregaf> ie, that is controlled largely by the network stack and interface underneath the SimpleMessenger
[0:49] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[0:50] <gregaf> and the bandwidth implied by that length of time is not very far off of what Sam is observing in SimpleMessenger throughput
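(For the arithmetic behind that: 4 MB in roughly 10 ms works out to about 4 MB / 0.010 s ≈ 400 MB/s, which is in the same range as the SimpleMessenger throughput Sam has been measuring on this box.)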
[0:50] <nhm> Ok
[0:52] <nhm> So I guess the plus side of all of this is that we should theoretically be able to get 800MB/s to one node.
[0:56] <nhm> Once the stupid PCIE x8 riser comes in that I need, I should be able to test 2x bonded 10G.
[0:56] <elder> shitty riser
[0:57] <nhm> elder: I'll reserve my ire until I know that it's shitty. How about shitty ordering process?
[0:57] <elder> It's shitty.
[0:59] <nhm> indeed.
[1:00] <nhm> on the plus side, my wireless is working better since someone tripped the outlet it's plugged into, the motherboard raid controller supports jbod mode, and the intel SSDs are fast
[1:01] <nhm> and the fans are marginally more tolerable since I was able to get into the web interface and switch them to "optimized" mode.
[1:04] <Tv_> nhm: i wonder who would make "liftoff" be the default mode...
[1:10] * Leseb_ (~Leseb@79.142.65.52) has joined #ceph
[1:11] * fghaas (~florian@91-119-204-193.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[1:14] * pentabular (~sean@adsl-71-141-232-146.dsl.snfc21.pacbell.net) has joined #ceph
[1:15] * pentabular (~sean@adsl-71-141-232-146.dsl.snfc21.pacbell.net) has left #ceph
[1:17] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[1:17] * Leseb_ is now known as Leseb
[1:19] <nhm> Tv_: apparently it wasn't. It was set to "standard" which meant only spin the drives at 6k rpm instead of 7.5k rpm. optimized is keeping them hovering around 4.5K RPM.
[1:20] <nhm> er, s/drives/fans
[1:20] <nhm> they are still insanely loud and high pitched.
[1:20] <mikeryan> friend of mine just picked up some old sun hardware at unix surplus
[1:20] <mikeryan> a 1u server
[1:20] <mikeryan> plugs it in, turns it on
[1:20] <mikeryan> afterburner fires up..
[1:20] <Tv_> bleh, the right default is to follow temperatures and just about any hardware(/bios with all the invisible "management mode" code it runs) knows how to do that these days
[1:20] <mikeryan> and 8 year old dust fills the room
[1:21] <Tv_> mikeryan: i've stood behind a rack full of ibm servers all powering on at the same time.. my eyes were tearing for a couple of minutes from the wind
[1:21] <nhm> grrr, drives reordered after kernel upgrade.
[1:21] <mikeryan> OSHA visited the CS dept server room at UCLA once
[1:21] <Tv_> nhm: uuids dude
[1:22] <mikeryan> and mandated hearing protection be worn
[1:22] <mikeryan> nhm: i won't dispute that we haven't exhaustively controlled all possible noise sources, but we've done a damn fine job reducing them as much as possible
[1:22] <nhm> Tv_: oh, it's using UUIDs for boot so that wasn't an issue, it's just annoying.
[1:23] <Tv_> nhm: /dev/disk/by-uuid/
[1:23] <Tv_> nhm: train fingers away from things that will eventually make you hurt your boot drive ;)
[1:23] <Tv_> (says the person who almost ejected /dev/sda instead of an SD Card, last night)
[1:23] <nhm> Tv_: yes, I just don't have a good mental mapping of what uuid corresponds to an SSD vs spinning disk.
[1:24] <Tv_> nhm: by-id/ ?
[1:26] <nhm> Tv_: interestingly the SSDs are missing from /dev/disk/by-id, but that's enough to let me know which ones they are.
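A quick, hedged way to check which persistent name belongs to which physical drive (the device name below is a placeholder):

    # persistent names (model/serial embedded) pointing at the kernel device nodes
    ls -l /dev/disk/by-id/
    # or query udev for one device directly
    udevadm info --query=property --name=/dev/sda | grep -E 'ID_MODEL|ID_SERIAL'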
[1:27] <nhm> alright, time to eat.
[1:28] * Leseb (~Leseb@79.142.65.52) Quit (Quit: Leseb)
[1:37] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[1:40] * adjohn (~adjohn@ip-64-139-43-230.dsl.sjc.megapath.net) has joined #ceph
[1:42] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[1:44] * Cube (~Adium@12.248.40.138) Quit (Ping timeout: 480 seconds)
[1:47] * adjohn (~adjohn@ip-64-139-43-230.dsl.sjc.megapath.net) Quit (Quit: adjohn)
[1:59] * adjohn (~adjohn@ip-64-139-43-230.dsl.sjc.megapath.net) has joined #ceph
[2:02] * adjohn (~adjohn@ip-64-139-43-230.dsl.sjc.megapath.net) Quit ()
[2:02] * BManojlovic (~steki@212.200.243.134) Quit (Remote host closed the connection)
[2:05] <nhm> nice, concurrent fios to the block devices yields about 460MB/s to each SSD and 110MB/s to each 7200rpm drive.
[2:06] <nhm> so aggregate throughput on this controller is ~1.6GB/s for 8 drives.
[2:06] <nhm> and this chassis supports 36 hotswap drives. :D
[2:08] <maelfius> nhm: nice.
[2:09] <darkfader> nhm: what controller? lsi 9260?
[2:09] <darkfader> mine tops out at 1.6GB/s so it smells like that
[2:09] <darkfader> (and yes, singular, i only have one of those lol)
[2:10] <darkfader> it seems for non-raided the lsi 9207 is the utmost perfect, but i didn't get one, the money will go towards a tori amos concert instead
[2:10] <nhm> darkfader: this is actually an integrated 2208 on a supermicro motherboard.
[2:10] <nhm> darkfader: I think it's basically the same thing that's in the 9265 and 9285.
[2:10] <darkfader> ah ok, so one cycle newer than 9260
[2:10] <darkfader> although i am still lost in their model numbers
[2:10] <nhm> darkfader: I've got 4 9207-8i cards sitting here too. :)
[2:11] * Tv_ (~tv@2607:f298:a:607:38b3:897f:20fd:72b9) Quit (Read error: Operation timed out)
[2:11] <darkfader> weee.
[2:11] <darkfader> but ok, on your background thats all small-scale hehe
[2:11] <darkfader> but keep us on here posted if you test-drive the 9207
[2:11] <nhm> also a highpoint marvell based card that can supposedly be configured to do jbod, and an Areca 1680 raid controller for good measure.
[2:11] <darkfader> eeek.
[2:11] <darkfader> as for highpoint
[2:12] <darkfader> i have a 1680 too, had to patch the firmware for 3TB drives and it'll not do SATA3 to the samsung ssds
[2:12] <nhm> darkfader: yeah, I grabbed it just because I figured someone is going to use one at some point so we might as well test it before that happens.
[2:12] <darkfader> yes!
[2:12] <darkfader> off the record, aren't all those >8 port controllers with their SAS expanders just a hoax?
[2:13] <darkfader> they can never beat the equivalent in 8-port controllers, can they?
[2:13] <nhm> darkfader: I have a feeling the 9207s are going to be the way to go, but the 2208 actually might not be a bad (though expensive) option if the jbod mode works out.
[2:14] <darkfader> and, no i hope noone ever uses a highpoint for live data. i used to know a guu who lost his company due to trusting those
[2:14] <darkfader> *guy
[2:14] <darkfader> sorry for the spelling.
[2:14] <nhm> darkfader: this chassis doesn't use expanders. It has 9 connectors between the two backplanes!
[2:14] <darkfader> i'm not sure, but i don't think you can work around this grade of hardware from the software side
[2:15] <darkfader> nhm: nice
[2:15] <nhm> darkfader: so to drive everything you need like 5 controllers.
[2:15] <darkfader> (i have a areca 1680-ix16, thats where the expander lives)
[2:15] <nhm> darkfader: I think LSI makes a card that has 16 real channels but it's a full height card.
[2:15] <darkfader> did you yet have a chance to look at my storage box layout?
[2:15] <nhm> ah, I think the one I have is just the x8.
[2:16] <nhm> darkfader: no, I'm sorry. I've been crazy swamped. Remind me where to look?
[2:16] <darkfader> np i'll dig for it, if you have a moment i'll take the chance
[2:16] <darkfader> http://www.deranfangvomen.de/floh/infiniband.png
[2:17] <darkfader> the upper box is the storage swamp thing
[2:17] <darkfader> i intend to spread osds out equally by storage capacity, with many replicas in the big box and some outside of it
[2:17] <nhm> oh wow
[2:17] <darkfader> which was interesting for you since it was un-equally distributed
[2:18] <darkfader> the storage box has the areca for spinning disks and 9260 for ssd. 9260 if i stick to raid5 "cache" or 9207 if i go single-ssd-passed-on-as-jbod
[2:19] <nhm> If you do a 9265 you might be able to do either raid5 or jbod if it's anything like the 2208 on this supermicro board.
[2:19] <darkfader> and i tested and am defintely NOT in favor of zfs
[2:19] <nhm> oh really? How did that go?
[2:20] <darkfader> nhm: the 9260 can't do jbod but it churns nicely on raid5
[2:20] <darkfader> not convincing
[2:20] <darkfader> 300MB/s sustained
[2:20] <darkfader> really, it seems not up to the job on linux
[2:21] <nhm> darkfader: yeah, we have a bunch of H700 cards that are basically 9260s that also can't do JBOD. That's the 2108 though. The 2208 appears to be able to do jbod.
[2:21] * The_Bishop (~bishop@e179002234.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[2:21] <nhm> darkfader: that was with or without ceph?
[2:21] <darkfader> nhm: i missed the chance for a 2208 box 2 weeks ago
[2:21] <darkfader> plain zfs
[2:21] <darkfader> no ceph
[2:22] <darkfader> i gave it first 1 two-ssd zil, then 2 two-ssd zils, and 4/2 as l2arc
[2:22] <nhm> darkfader: hrm, I wonder what the llnl guys are getting with it since they are sinking so much $$$ into their deployment.
[2:22] <darkfader> i think if you stuff in >100GB ram it'll do fine
[2:22] <darkfader> i only had 16GB
[2:22] <darkfader> so it caps off the l2arc somewhere
[2:22] <nhm> ah, I have heard big ZFS deployments are memory hungry.
[2:23] <darkfader> still with two ZIL pairs capable of 400MB/s i'd expect to be above 400MB/s
[2:23] <darkfader> the areca would keep up for that too
[2:23] <darkfader> it was completely unbalanced
[2:23] <nhm> yeah
[2:24] <nhm> Those arecas are pretty fast, even if they do die after like a year or two.
[2:24] <darkfader> hard to track since much of the zfs io passes normal io layers. and i really wont run blktrace unless someone pays :)
[2:24] <darkfader> hehe
[2:24] <darkfader> thats why it only sits in the "mirror box"
[2:25] <darkfader> what really surprised me was how much less cpu% you use up writing to the hw raid controllers as opposed to sw raid
[2:25] <darkfader> even for raid0
[2:25] <darkfader> i mean, it feels right, but with all the software raid buzz
[2:26] <nhm> darkfader: interesting. I haven't actually tested it. I wonder if those guys doing software raid on cuda ever made things stable.
[2:26] <darkfader> maybe, that could be interesting
[2:26] <darkfader> didn't know and <3
[2:27] <darkfader> i tried on my desktop, the lsi uses 1/2 the cpu power per data to move
[2:27] <nhm> http://unbeknownst.net/?p=408
[2:27] <darkfader> sw raid topped out at 1.2GB/s, with lsi + some more onboard SSDs i hit 2.4GB/s
[2:27] <darkfader> (but the cursor didn't really move then)
[2:28] <darkfader> hehe, geforce 8800
[2:28] <darkfader> now, thats an easy performance goal
[2:28] <darkfader> nhm: but no source download/devmapper target?
[2:29] <darkfader> mmh. but i like the idea, really.
[2:29] <darkfader> still, i think thats bullshit. you need to move the data on and off the gpu
[2:29] <darkfader> so a lot more bus bandwidth
[2:30] <nhm> darkfader: I think cuda 4.0 can do some kind of fancy DMA.
[2:30] <darkfader> ok, then maybe
[2:30] <darkfader> with some *bling* :)
[2:30] <nhm> darkfader: hehe
[2:30] <nhm> it'd still be crazy
[2:31] <darkfader> err, coming back to the picture for a second. how stupid do you think it is to limit myself to a 20Gb/s IB switch, and how much of a concern is locality of data in ceph right now
[2:31] <nhm> they should just stick minisas connectors on your graphics card. ;)
[2:31] <darkfader> yep!
[2:31] <darkfader> sata-over-hdmi? thunderbolt!
[2:32] <nhm> hrm, is the switch non-blocking?
[2:32] <darkfader> yes
[2:32] <darkfader> it's the blazingly ultimate ib switch cisco ever made
[2:32] <nhm> I think you'll be fine. ipoib isn't going to give you much better than DDR speeds anyway.
[2:32] <darkfader> the last of its kind and the only one ever made to make the open market
[2:33] <darkfader> ($50 on ebay)
[2:33] <nhm> that's scary, I thought you'd have voltaire or mellanox or something.
[2:33] <darkfader> no, i really have the worst beast they ever made
[2:33] <nhm> damn, that's cheap
[2:33] <darkfader> if it dies, i have a $5000 hole to fill.
[2:33] <nhm> that's much scarier than the fact that it's 20Gb/s. :)
[2:33] <darkfader> and that won't have 6 gigE bridge ports
[2:34] <nhm> Well, good luck... Better you than me. ;)
[2:34] <darkfader> yeah, thats why i'll file the feature request for rdma (who said ipoib) to ethernet failover some day
[2:34] <nhm> as for data locality, I think if your switch is non-blocking you should be fine?
[2:35] <nhm> darkfader: hopefully we'll have some RDMA support some day.
[2:35] <nhm> darkfader: gotta prove that we can get the kinds of per-node speeds that would justify it first though.
[2:35] <darkfader> the thing is if i should stuff 15K sas disks in all nodes (as fast as spinning goes) or have a low-latency linked bunch of SSDs in the other box and local "sata" as another replica
[2:35] <nhm> that's where some of this test hardware comes in. I'm going to be doing bonded 10G to start out.
[2:35] <darkfader> i think your #1 think should be rdma mds-mon-mds chatter
[2:35] <darkfader> yeah :)
[2:36] <darkfader> you can run all the rdma on ethernet these days i think
[2:36] <darkfader> it's fake but the same workings
[2:36] <darkfader> s/think/thing/ (sorry again)
[2:37] <nhm> yeah, I've heard the rdma over ethernet stuff is actually pretty good. Not IB good, but much closer than ip.
[2:37] <darkfader> my route was fun "oh, infiniband switch. oh, gluster. oh, gluster sucks. oh, ceph."
[2:37] <nhm> hopefully that won't be ceph sucks. ;)
[2:38] <darkfader> i tried to wrap my head around how iWARP does reliable transmission over ethernet. i still didn't understand, but i accept it
[2:38] <darkfader> lol... ceph is beyond doubt.
[2:38] <darkfader> dont you worry :)
[2:39] <darkfader> i'd not stand in front of 6 people in the storage class and say "oh well, it choked now, but it's still better than the REST of them"
[2:39] <darkfader> if i had any reason to worry ceph sucked
[2:40] <nhm> excellent. Ok, I gotta go put the kids to bed. Will try to be back later
[2:40] <darkfader> even if theres no rdma support that just means i don't need to worry about ib->ip failover at the moment
[2:41] <darkfader> and others promise that but are so non-deterministic that noone knows how long the failover will take :)
[2:41] <darkfader> oki, i'll drop off to bed. i still got the message that if there's no latency issue on the switch i'll be fine with some ssd-ish pool
[2:42] <darkfader> nite.
[3:08] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) Quit (Remote host closed the connection)
[3:38] * renzhi (~renzhi@180.169.73.90) has joined #ceph
[3:52] * pentabular (~sean@adsl-71-141-232-146.dsl.snfc21.pacbell.net) has joined #ceph
[3:53] * pentabular (~sean@adsl-71-141-232-146.dsl.snfc21.pacbell.net) has left #ceph
[3:53] * rz_ (~root@ns1.waib.com) has joined #ceph
[3:53] * rz (~root@ns1.waib.com) Quit (Read error: Connection reset by peer)
[4:27] * tightwork (~tightwork@142.196.239.240) has joined #ceph
[4:27] <tightwork> Hi
[4:30] <dmick> hello tightwork
[4:40] * chutzpah (~chutz@100.42.98.5) Quit (Quit: Leaving)
[4:42] * cattelan (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) Quit (Ping timeout: 480 seconds)
[4:46] <elder> dmick you still around?
[4:52] * cattelan (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) has joined #ceph
[4:57] * nhm (~nhm@65-128-137-98.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[5:14] <dmick> I am
[5:15] <dmick> sup elder?
[5:18] <elder> I'm getting this error when attempting to create a snapshot using rbd user space command: "rbd snap create --snap i1s1 image1"
[5:18] <elder> error opening image image1: (2) No such file or directory
[5:18] <elder> 2012-08-23 03:17:56.043319 40029de0 -1 librbd: Error reading mutable metadata: (2) No such file or directory
[5:19] <elder> I was assuming it had to do with running old code.
[5:19] <dmick> of course image1 actually exists?...
[5:20] <elder> My UML environment is using debian sid. So I had my apt sources.list point to ubuntu precise. But then I run into other troubles, like library versions not matching.
[5:20] <elder> I think I'm back where I started now though.
[5:20] <elder> Yes it exists.
[5:20] <elder> I'd just like to know how I can make sure I'm working with a good, current version of user space code.
[5:21] <elder> I can use apt to do it, if you know how. Or I can do it more manually. I just want to know I'm working with good goods.
[5:21] <dmick> do we even build sid anymore?
[5:21] <elder> I don't konw.
[5:21] <elder> By the way, I believe this is the error in the OSD log:
[5:21] <elder> 2012-08-22 22:17:56.047104 7f7538856700 1 -- 192.168.122.1:6803/19669 --> 192.168.122.101:0/1002625 -- osd_op_reply(5 rbd_header.104061d1949e [call rbd.get_size,call rbd.get_features,call rbd.get_snapcontext,call rbd.list_locks,call rbd.get_parent] = -2 (No such file or directory)) v4 -- ?+0 0x3721400 con 0x3bf2280
[5:21] <elder> But I'm not sure how to determine which of the several commands led to the -2 errno
[5:22] <dmick> perhaps rados -p rbd ls will be of use?
[5:22] <dmick> sounds like the header is missing, but the image is still noted in the directory
[5:22] <elder> cat /sys/bus/rbd/devices/1/image_id
[5:22] <elder> 104061d1949e
[5:23] <dmick> if you suspect past mismatches you can blow it away with rados rmpool rbd; rados mkpool rbd, too
[5:23] <elder> rados ls -p rbd
[5:23] <elder> ...
[5:23] <elder> rbd_header.104061d1949e
[5:23] <elder> ...
[5:23] <elder> The header is there.
[5:23] <elder> I can try that I suppose.
[5:23] <elder> But I think it's been happening with a brand new ceph instance.
[5:23] <dmick> hm
[5:24] <elder> Just a minute, let me clean up.
[5:24] <dmick> does rbd with no parms show "snap protect" and "snap unprotect" and "flatten" as commands?
[5:24] <elder> root@uml-1:~# rbd rm image1
[5:24] <elder> 2012-08-23 03:24:34.866319 40029de0 -1 librbd: Error reading mutable metadata: (2) No such file or directory
[5:24] <elder> Removing image: 100% complete...done.
[5:25] <elder> yes
[5:25] * Qten (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[5:25] <elder> My rados pool does have crap in it.
[5:26] <elder> Let me blow it away and start fresh. That's what the commands above do, right?
[5:26] <dmick> just the rbd pool,, yes
[5:26] <dmick> you might want to unmap first; I've never played with the kernel mappings
[5:26] <elder> root@uml-1:~# rados ls -p rbd
[5:26] <elder> root@uml-1:~#
[5:27] <elder> I thought there was at least something in there when I started.
[5:27] <dmick> not on a fresh cluster
[5:28] <elder> OK, created a new image, and I have:
[5:28] <elder> root@uml-1:~# rbd ls
[5:28] <elder> image1
[5:28] <elder> root@uml-1:~# rados ls -p rbd
[5:28] <elder> image1.rbd
[5:28] <elder> rbd_directory
[5:28] <elder> root@uml-1:~#
[5:28] <dmick> yeah, that's an old-style image
[5:29] <elder> Whoops.
[5:29] <elder> Let me do a new one, that's what I intended to do.
[5:29] <Qten> Hi all, anyone know if boot from RBD for openstack is working via dashboard?
[5:29] <elder> root@uml-1:~# rbd rm image1
[5:29] <elder> Removing image: 100% complete...done.
[5:29] <elder> root@uml-1:~# rbd create --new-format image1 --size=1024
[5:29] <elder> root@uml-1:~# rbd ls
[5:29] <elder> image1
[5:29] <elder> root@uml-1:~# rados ls -p rbd
[5:29] <elder> rbd_directory
[5:29] <elder> rbd_header.105f1e75cb
[5:29] <elder> rbd_id.image1
[5:30] <elder> root@uml-1:~#
[5:30] <elder> All good.
[5:30] <elder> Now, creating a snapshot:
[5:30] <elder> rbd snap create --snap i1s1 image1
[5:30] <elder> (Right? about to to that)
[5:30] <dmick> ought to work
[5:30] <dmick> I usually use image1@i1s1, but either should work
[5:30] <elder> root@uml-1:~# rbd snap create --snap i1s1 image1
[5:30] <elder> error opening image image1: (2) No such file or directory
[5:30] <elder> 2012-08-23 03:30:37.596890 40029de0 -1 librbd: Error reading mutable metadata: (2) No such file or directory
[5:30] <elder> root@uml-1:~#
[5:30] <dmick> uncool
[5:31] <elder> That's what I've been thinking all day.
[5:31] <dmick> $ rbd create --new-format image1 --size=1024
[5:31] <dmick> $ rbd snap create --snap i1s1 image1
[5:31] <dmick> $
[5:31] <dmick> so, yeah, should have worked
[5:32] <dmick> you built ceph?
[5:32] <elder> Right, of course Sage also had no problems. So there's something amiss with my config.
[5:32] <elder> I built it, yes.
[5:32] <dmick> you restarted the daemons after building it?
[5:32] <elder> But the user space is an unknown
[5:32] <elder> Yes.
[5:32] <dmick> hmm
[5:32] <elder> That's why I'm focusing in on the user space--how to make sure I have new stuff.
[5:32] <dmick> turning on class debug might give some hints
[5:32] <elder> I'm running in a UML environment so..
[5:33] <elder> It's on.
[5:33] <elder> What do you want to know?
[5:33] <dmick> anything useful with <cls> in it from the osd log?
[5:33] <elder> Whoops, no it's not on. How do I turn it on?
[5:34] <dmick> debug objclass = 20
[5:34] <dmick> in [osd]
[5:34] <elder> anywhere thereunder?
[5:34] <dmick> when you say unknown userland
[5:34] <elder> I have debug ms = 1 only
[5:34] <dmick> you mean "outside of the ceph CLI/libs", right?
[5:35] <elder> I mean I get it by "apt-get install ceph"
[5:35] <dmick> I guess I don't know what you mean by "I built it" then
[5:35] <elder> And I have my sources.list.d/ceph.list set up to point at gitbuilder... oneiric (or precise)
[5:37] <dmick> what exactly did you build then?
[5:38] <elder> I built ceph and am running ceph on my local machine.
[5:38] <elder> I then boot a UML kernel, which is basically a virtual machine on my machine.
[5:38] <elder> That kernel acts as a client for the ceph instance working on my local host.
[5:38] <elder> In that UML kernel, I have user space utilities (including the rbd user space command)
[5:38] <dmick> oh
[5:39] <elder> although I did build that stuff, it's not the stuff I built that I'm running.
[5:39] <elder> For user space in the uml kernel.
[5:40] <dmick> well so anyway, about the objclass debug: I have all sorts of things in my [osd] section, but that should be enough for objclass logging
[5:40] <dmick> and of course that's "on the cluster side"
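Putting the two debug settings mentioned here together, a sketch of the relevant ceph.conf section (assuming they simply sit under [osd] as dmick describes):

    [osd]
        debug ms = 1
        debug objclass = 20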
[5:40] <elder> I'm going to restart ceph now that I've updated the config file.
[5:40] <elder> Give me a minute, I think I'm going to rebuild a fresh uml root file system to make sure I'm working with something I understand.
[5:44] <dmick> so the other useful command for understanding this stuff is rados listomapvals
[5:45] <dmick> that'll show you the key/value pairs on objects
[5:45] <dmick> like rbd_directory
[5:45] <dmick> I never keep straight what goes where
[5:46] <dmick> and of course rados get
[5:46] <elder> OK, I'll try that, maybe then I'll know what you're talking about...
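For concreteness, the kind of inspection dmick is describing, using the object names from this session (output omitted):

    # key/value pairs stored on the directory and header objects
    rados -p rbd listomapvals rbd_directory
    rados -p rbd listomapvals rbd_header.105f1e75cb
    # raw object contents via rados get
    rados -p rbd get rbd_id.image1 /tmp/rbd_id.image1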
[5:48] <dmick> I think we mostly implemented what Josh wrote in the design doc
[5:48] <elder> I just updated my ceph tree and am rebuilding that too. I'm using ceph/next branch
[5:49] <dmick> I don't think next has the stuff I worked on
[5:49] <dmick> I could be high
[5:50] <elder> real 1m48.370s to build the ceph tree.
[5:51] <dmick> yeah, I don't think next is good enough
[5:51] <elder> Really?
[5:51] <elder> What should I be using then?
[5:51] <dmick> you want bfd046e003ba5729b3efd0997b364cab98d8cab5 or the equivalent
[5:52] <dmick> which is in master, but not next, AFAICT
[5:52] * ChanServ sets mode +o sage
[5:52] <elder> I guess I don't know how those branches are managed then.
[5:52] * ChanServ sets mode +o sagewk
[5:53] <elder> I would have thought "next" meant, well, something that master hopes to be someday.
[5:53] <dmick> I guess my impression is that next is largely frozen to become 0.51
[5:53] <dmick> master is the bleeding edge, I think
[5:53] * Tobarja1 (~athompson@cpe-071-075-064-255.carolina.res.rr.com) Quit (Quit: Leaving.)
[5:53] <elder> Oh.
[5:53] <dmick> next will start tracking master more closely once 0.51 goes out
[5:53] <elder> OK, let me reset to master and build again.
[5:53] <dmick> at least that's my theory
[5:54] <sage> yep
[5:54] * ChanServ sets mode +o elder
[5:54] * ChanServ sets mode +o gregaf
[5:54] * ChanServ sets mode +o dmick
[5:54] * ChanServ sets mode +o mikeryan
[5:54] <dmick> muah ha ha
[5:57] <elder> Yay!!!
[5:58] <elder> Now we can avoid Merdin if he/she/it ever returns.
[6:01] <elder> Do we build oneiric?
[6:04] <dmick> used to
[6:05] <dmick> I'm leaving in 15 min
[6:05] <elder> OK.
[6:05] <elder> Almost there.
[6:06] <dmick> oneiric last build at 6:27PM today
[6:06] <dmick> so yeah
[6:06] <dmick> sorry, 21:03 today, even better
[6:06] <elder> OK. Just thinking it might be more likely to match libraries, etc. for debian sid.
[6:06] <dmick> (was looking at upper dir modtime)
[6:07] * deepsa (~deepsa@115.241.163.39) has joined #ceph
[6:10] <elder> Crap.
[6:10] <elder> After all that I'm getting dependency problems again.
[6:11] <elder> I just want to install the ceph user space.
[6:11] <sage> dmick: looks like they fail pretty consistently.. so far 5 failed, none passed
[6:11] <sage> elder: make sure the distro is correct in the uml's apt sources.list?
[6:11] <elder> What should it say?
[6:12] <sage> what does 'lsb_release -sc' say?
[6:12] <elder> I had... I don't have lsb_release
[6:12] <elder> But I believe it was originally debian squeeze
[6:12] <sage> if you use the script it's squeeze
[6:12] <sage> yeah
[6:13] <elder> But then I changed it to sid, and then did "apt-get update; apt-get upgrade"
[6:13] <sage> oh.. that's asking for trouble, sid is a mess
[6:13] <elder> OK,
[6:13] <elder> I'll re-do it and won't update to sid.
[6:13] <elder> What about the ceph.list file?
[6:13] <sage> while you're at it, change the dist to wheezy instead of squeeze so it's more recent
[6:14] <elder> OK, that should be easy I think.
[6:14] <elder> But I'm not sure I need more variables right now...
[6:14] <sage> ideally it'd match the host distro, so you can run /host/home/elder/... binaries and the libraries will match
[6:14] <elder> I was thinking about that earlier today.
[6:14] <elder> But I am not sure where the binaries are after they're built.
[6:14] <sage> maybe precise is the way to go in that case
[6:14] <elder> Nor whether I need special libraries in a LD_LIBRARY_PATH or whatever.
[6:15] <sage> oh, for the rbd command etc. yeah, doesn't matter then.
[6:15] <sage> but i'd do precise probably, that has a deb gitbuilder for it. or wheezy. but not squeeze.
[6:15] <elder> I believe all the troubles I've had today have to do with the user space rbd command mismatching the ceph and kernel I'm building.
[6:15] <elder> OK.
[6:16] <elder> Let me back up and build a squeeze image
[6:16] <elder> I mean wheezy
[6:16] <dmick> sage: some nonsense happened with plana36; swapped drives so it had a boot drive, but no sdd, and I put it back into the pool, but then realized the tests need all 3 data drives. Might have poisoned the experiment
[6:17] <sage> elder: you're running the rbd command inside uml?
[6:17] <elder> Yes
[6:17] <sage> ah, didn't realize that. that makes sense.
[6:17] <sage> just run it on the host.
[6:17] <elder> Huh?
[6:17] <dmick> well that's true
[6:17] <elder> Ohhh.
[6:17] <sage> i only install the stuff inside uml so i can use 'rbd map'.. altho usually i don't even do that
[6:18] <elder> It won't matter where I run it, will it?
[6:18] <dmick> just becuase the kernel client is in uml doesn't mean the userland has to be
[6:18] <sage> right.
[6:18] <elder> The kernel client will get notified of changes and should be kept up-to-date.
[6:18] <sage> right
[6:18] <elder> Cool
[6:18] <sage> problem solved. ignore the sid/package brokenness.. doesn't matter.
[6:18] <elder> Again though, anything special I need to set up?
[6:18] <elder> Like LD_LIBRARY_PATH?
[6:19] <sage> for what?
[6:19] <sage> you don't need to run anything inside uml except to map the image
[6:19] <elder> I don't know, for rbd from my ceph build tree.
[6:19] <sage> automake handles that... ./rbd is actually a bash script with a zillion lines of crap to set LD_LIBRARY_PATH=.libs
[6:20] <sage> fwiw this si what i do:
[6:20] <sage> root@uml:~# cat r.sh
[6:20] <sage> #!/bin/sh
[6:20] <sage> ceph-authtool -p /host/home/sage/src/ceph/src/keyring > /tmp/p
[6:20] <sage> rbd map -m 10.3.64.22 foo --secret /tmp/p
[6:20] <elder> Wow.
[6:20] <elder> Now I see.
[6:20] <sage> i have a vaguely recent ceph package installed (<1 yr old should work) just for ceph-authtool and 'rbd map'.
[6:21] <elder> Wait a minute, I can cause a map on a client from a remote machine?
[6:23] <sage> that happens in uml.. but it can be anything vaguely recent
[6:23] <dmick> gotta run; gl elder, ask me earlier tomorrow :)
[6:23] * dmick is now known as dmick_away
[6:23] <elder> OK.
[6:23] <sage> the sysfs api is stable :)
[6:24] <elder> But I mean, mapping is setting up /dev, right?
[6:24] <elder> /dev/rbd1 maps to image1 for example
[6:24] <sage> yeah...
[6:24] <elder> So I can be running on the server, and it will cause that mapping to occur on a client?
[6:24] <sage> no
[6:24] <sage> rbd map runs inside uml
[6:24] <sage> but nothing else needs to
[6:24] <elder> OK,
[6:24] <elder> Got that.
[6:24] <elder> I've been using the sysfs interface directly for that.
[6:25] <elder> Whew
[6:25] <sage> that works too
[6:25] <elder> EXcept...
[6:25] <sage> the only reason i installed the package is because i used ceph-authtool to get the secret out of /host/home/sage/src/ceph/src/keyring, and my libs didn't match between uml and the host (so i couldn't run /host/home/sage/src/ceph/src/ceph-authtool)
[6:26] <elder> I have to rebuild my uml root image again... I can't get an rbd because of dependency mismatches.
[6:26] <elder> I just used awk.
[6:26] <sage> try just setting the ceph src to squeeze again and see if it'll go
[6:26] <elder> Don't need no stinking ceph-authtool to parse the secrets file
[6:26] <elder> I will.
[6:26] <sage> :)
[6:27] <elder> Doesn't look like it. It only takes a few minutes to build a new one. I'll try wheezy though.
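A rough sketch of the awk shortcut elder mentions, assuming the keyring has a single entry with the usual "key = ..." line (path and monitor address copied from sage's script above; adjust for your own tree):

    # pull the secret out of the keyring without ceph-authtool, then map the image
    awk '$1 == "key" { print $3 }' /host/home/sage/src/ceph/src/keyring > /tmp/p
    rbd map -m 10.3.64.22 foo --secret /tmp/p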
[6:31] * tightwork (~tightwork@142.196.239.240) Quit (Ping timeout: 480 seconds)
[6:33] <elder> Annoying. Wheezy does this at boot:
[6:33] <elder> [warn] CONFIG_SYSFS_DEPRECATED must not be selected ... (warning).
[6:33] <elder> [warn] Booting will continue in 30 seconds but many things will be broken ... (warning).
[6:33] <elder> Do you know why CONFIG_SYSFS_DEPRECATED is necessary in the config file?
[6:38] <sage> meh use squeeze then
[6:38] <sage> dunno
[6:40] <elder> OK, here's what seems to work. Build wheezy (or maybe building squeeze is better). Then add oneiric to the ceph.list file. Then apt-key add the key for the gitbuilder repository. Then apt-get update, and apt-get install ceph.
[6:40] <elder> (So far anyway)
[6:48] <elder> OK,
[6:48] <elder> so I have created and mapped my version 2 image.
[6:48] <elder> Now I want to run rbd user space over on my server machine.
[6:48] <elder> cd ceph/src
[6:48] <elder> ./rbd .... (shows me lots of info)
[6:49] <elder> ./rbd ls
[6:49] <elder> Oh, it shows it...
[6:49] <elder> PROGRESS!!!
[6:49] <elder> (it crashed)
[6:49] <elder> But that's PROGRESS
[6:50] <elder> Wait, not a crash, a WARN_ON()
[6:50] <elder> Even BETTER!
[6:51] <elder> Ahh, and I even know why. Very nice. I think I can go to bed now.
[6:51] <elder> Thanks all. Good night.
[6:55] * deepsa (~deepsa@115.241.163.39) Quit (Ping timeout: 480 seconds)
[6:57] * deepsa (~deepsa@101.63.3.209) has joined #ceph
[7:14] * Qten (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Read error: Connection reset by peer)
[7:16] * Qten (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[7:16] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[7:27] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[7:44] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[7:48] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[8:58] * verwilst (~verwilst@d5152D6B9.static.telenet.be) has joined #ceph
[9:10] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:19] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:23] * stass (stas@ssh.deglitch.com) Quit (Ping timeout: 480 seconds)
[9:24] * Leseb (~Leseb@2001:980:759b:1:e940:d632:4cce:c26d) has joined #ceph
[9:35] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[9:46] * fghaas (~florian@91-119-204-193.dynamic.xdsl-line.inode.at) has joined #ceph
[9:52] * EmilienM (~EmilienM@vau75-1-81-57-77-50.fbx.proxad.net) has joined #ceph
[9:54] * loicd (~loic@brln-d9bad56f.pool.mediaWays.net) has joined #ceph
[9:56] * loicd (~loic@brln-d9bad56f.pool.mediaWays.net) Quit ()
[9:56] * loicd1 (~loic@brln-d9bad56f.pool.mediaWays.net) has joined #ceph
[10:04] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[10:04] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:05] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[10:11] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[10:13] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[10:21] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[10:24] * loicd1 (~loic@brln-d9bad56f.pool.mediaWays.net) Quit (Ping timeout: 480 seconds)
[10:32] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[10:40] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Read error: Operation timed out)
[10:40] * exec (~defiler@109.232.144.194) Quit (Quit: WeeChat 0.3.8)
[10:41] * exec (~defiler@109.232.144.194) has joined #ceph
[10:43] * verwilst (~verwilst@d5152D6B9.static.telenet.be) Quit (Read error: Connection reset by peer)
[10:44] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[10:45] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[10:55] * loicd (~loic@p5B2C523C.dip.t-dialin.net) has joined #ceph
[11:10] * deepsa (~deepsa@101.63.3.209) Quit ()
[11:38] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:42] * loicd (~loic@p5B2C523C.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[11:47] * loicd (~loic@p5B2C523C.dip.t-dialin.net) has joined #ceph
[12:00] * fghaas (~florian@91-119-204-193.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[12:30] * loicd1 (~loic@p5B2C523C.dip.t-dialin.net) has joined #ceph
[12:30] * loicd (~loic@p5B2C523C.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[12:32] * loicd (~loic@p5B2C523C.dip.t-dialin.net) has joined #ceph
[12:38] * deepsa (~deepsa@101.63.84.169) has joined #ceph
[12:38] * loicd1 (~loic@p5B2C523C.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[12:51] * loicd (~loic@p5B2C523C.dip.t-dialin.net) Quit (Quit: Leaving.)
[13:30] * tightwork (~tightwork@142.196.239.240) has joined #ceph
[13:31] * fghaas (~florian@91-119-204-193.dynamic.xdsl-line.inode.at) has joined #ceph
[13:33] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[13:50] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:05] * tightwork (~tightwork@142.196.239.240) Quit (Ping timeout: 480 seconds)
[14:21] * Qu310 (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[14:21] * Qten (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Read error: Connection reset by peer)
[14:23] * fghaas (~florian@91-119-204-193.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[14:41] * Leseb_ (~Leseb@193.172.124.196) has joined #ceph
[14:45] * nhm (~nhm@174-20-15-49.mpls.qwest.net) has joined #ceph
[14:46] * loicd (~loic@brln-d9bad56f.pool.mediaWays.net) has joined #ceph
[14:48] * Leseb (~Leseb@2001:980:759b:1:e940:d632:4cce:c26d) Quit (Ping timeout: 480 seconds)
[14:48] * Leseb_ is now known as Leseb
[15:01] <kblin> hi folks
[15:04] <kblin> I'm a bit confused by an error I'm getting from mkcephfs..
[15:05] <kblin> http://cpaste.org/1209/ is the error I'm getting
[15:05] <kblin> /vol is an xfs mount, and in ceph.conf, "osd data = /vol/osd.$id" is set
[15:07] <kblin> the thing that confuses me is that it's the same error message I'd get if /vol/osd.0 really didn't exist, but it's there all right
[15:07] <kblin> however, if the directory doesn't exist, I don't get the error message from line 2
[15:09] <kblin> any idea what the problem could be?
[15:11] <kblin> oh, apparently I was missing my /var/lib/ceph/osd/ceph-0
[15:11] <kblin> dir
[15:19] <kblin> I was a bit confused because ceph health showed my pgs to be degraded, but I've only got one osd at the moment, so that's actually expected
[15:22] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) has joined #ceph
[15:40] <NaioN> kblin: you have to mount the xfs fs under /vol/osd.0
[15:40] <NaioN> is that from mkcephfs?
[15:41] <NaioN> I've spoken to you about this before kblin
[15:41] <kblin> yeah, but as I said, apparently I was just missing the default directories
[15:42] <NaioN> why don't you mount the xfs volume under /vol/osd.0?
[15:42] <kblin> well, from what I got to work yesterday, I didn't need to give the osd the full partition
[15:42] <NaioN> that's true
[15:42] <kblin> but sure, I can mount that way
[15:42] <NaioN> but a. it's not recommended and b. I don't know if mkcephfs can handle it
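As a sketch of what NaioN is recommending (the device name is a placeholder; the mount point matches the "osd data = /vol/osd.$id" setting quoted above):

    # give the OSD a dedicated XFS filesystem mounted at its data directory
    mkfs.xfs -f /dev/sdb1
    mkdir -p /vol/osd.0
    mount -o noatime /dev/sdb1 /vol/osd.0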
[15:43] <kblin> as I said, I was mainly irritated about the degraded status until I realized that of course I'd only have one copy when running just one osd
[15:44] <NaioN> kblin: indeed
[15:44] <NaioN> you could set the replication level to 1
[15:44] <kblin> nah, I've got another host that I can set up, as soon as I've rescued half a TB of data off that one partition and reformatted to xfs
[15:45] <NaioN> well you can change it back to 2 again
[15:53] <kblin> right, there's that
[15:54] <kblin> set this for the three pools, and now I'm all good again
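A sketch of the commands involved, assuming the three default pools (data, metadata, rbd) and the pool "size" setting NaioN is referring to:

    ceph osd pool set data size 1
    ceph osd pool set metadata size 1
    ceph osd pool set rbd size 1
    # and back to two replicas once the second OSD is in place
    ceph osd pool set data size 2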
[15:54] * ao (~ao@85.183.4.97) has joined #ceph
[15:56] <kblin> so, I take if I don't want to allow root logins via ssh, I can't use mkcephfs to deploy on multiple machines?
[16:03] <darkfader> if it works via root ssh and you turn root ssh off
[16:03] <darkfader> what exactly do you expect?
[16:03] <darkfader> elves? :)
[16:04] <darkfader> probably it's a good thing though if you manually need to enable something for the mkcephfs run
[16:12] <kblin> settings->language&input
[16:12] <kblin> er, wrong window, sorry
[16:13] <kblin> I need a focus-follows-brain
[16:20] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:28] <NaioN> kblin: keep in mind that mkcephfs is only a tool to start a new cluster
[16:28] <NaioN> after running it once you don't use it anymore
[16:28] <NaioN> you can't expand your cluster with mkcephfs
[16:29] <NaioN> they build mkcephfs to quickly deploy simple test clusters
[16:29] <NaioN> they've added chef for larger, more complex deployments
[16:30] <NaioN> or you can do it manually
[16:38] <kblin> NaioN: well, it'd be a new cluster, and three machines isn't what I'd call a large deployment
[16:39] <NaioN> nope true mkcephfs is good for this
[16:39] <kblin> it's just that all the machines are reachable from the net (yay for university IT), and I'm not too happy about allowing root logins for them
[16:39] <kblin> arguably, it'd still be key-based
[16:39] <kblin> so I guess I'm just being paranoid
[16:40] <NaioN> well you can't expect mkcephfs to function without root mounting and formatting
[16:40] <NaioN> but you don't need mkcephfs
[16:40] <NaioN> if you don't want to you can prepare all the osds by yourself
[16:52] * nhm (~nhm@174-20-15-49.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[16:53] <kblin> I think I'll be lazy and just allow root logins for the setup period
[16:54] * nhm (~nhm@174-20-15-49.mpls.qwest.net) has joined #ceph
[16:58] * tightwork (~tightwork@rrcs-71-43-128-65.se.biz.rr.com) has joined #ceph
[17:06] * nhm (~nhm@174-20-15-49.mpls.qwest.net) Quit (Read error: Operation timed out)
[17:27] * loicd1 (~loic@brln-d9bad602.pool.mediaWays.net) has joined #ceph
[17:27] * nhm (~nhm@174-20-15-49.mpls.qwest.net) has joined #ceph
[17:31] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:32] * loicd (~loic@brln-d9bad56f.pool.mediaWays.net) Quit (Ping timeout: 480 seconds)
[17:33] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[17:43] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) Quit (Remote host closed the connection)
[17:47] * nhm_ (~nhm@174-20-15-49.mpls.qwest.net) has joined #ceph
[17:47] * nhm (~nhm@174-20-15-49.mpls.qwest.net) Quit (Read error: Connection reset by peer)
[17:47] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:52] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) has joined #ceph
[17:55] * Tv_ (~tv@2607:f298:a:607:24:9854:b7ba:106e) has joined #ceph
[17:56] * ao (~ao@85.183.4.97) Quit (Quit: Leaving)
[17:59] <kblin> spiffy, now there's a reasonable ceph cluster running
[18:02] * maelfius (~mdrnstm@pool-71-160-33-115.lsanca.fios.verizon.net) Quit (Quit: Leaving.)
[18:06] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:13] * chutzpah (~chutz@100.42.98.5) has joined #ceph
[18:17] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[18:18] <sagewk> joao: there?
[18:31] * deepsa (~deepsa@101.63.84.169) Quit (Remote host closed the connection)
[18:36] * deepsa (~deepsa@115.241.132.79) has joined #ceph
[18:38] * aliguori (~anthony@cpe-70-123-140-180.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:38] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:47] <joao> sagewk, am now
[18:48] <sagewk> i have a couple mon patches that need review.. can you take a look?
[18:48] <sagewk> wip-mon-mkfs and wip-ph
[18:48] <joao> sure thing
[18:49] <sagewk> i think the mkfs thing is moving in the same direction the single-paxos branch does.. want to make sure tho
[18:49] <joao> how far back should I look into?
[18:50] <sagewk> for mkfs it's just 3 patches
[18:51] <joao> okay
[18:52] * bchrisman (~Adium@108.60.121.114) has joined #ceph
[18:54] <nhm_> WOOT
[18:55] <nhm_> 6 spinning disks, 2 intel 520ssds, rados bench to localhost with 1x replication. Was able to do 568MB/s. No ceph tuning at all.
[18:56] <nhm_> This was with the onboard lsi sas2208 with disks configured in JBOD mode.
[18:56] <nhm_> 300s test
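For reference, a hedged sketch of the kind of rados bench run being described (pool name and concurrency are placeholders; the 300 seconds matches the test length above):

    # 300-second write benchmark against a pool, 16 ops in flight
    rados -p data bench 300 write -t 16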
[18:57] <gregaf> SSDs for journal?
[18:57] <nhm_> gregaf: yep, each intel 520 has 3 journals on it.
[18:57] <gregaf> cool
[18:59] <nhm_> That's by far the best performance I've seen using spinning disks.
[18:59] <nhm_> and only 6 of them!
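A sketch of how "3 journals per SSD" might look in ceph.conf (device paths and OSD ids are hypothetical; each of the two SSDs would carry one journal partition per spinning-disk OSD):

    [osd.0]
        osd journal = /dev/sdg1
    [osd.1]
        osd journal = /dev/sdg2
    [osd.2]
        osd journal = /dev/sdg3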
[19:01] <joao> sagewk, the wip-mon-mkfs looks good
[19:01] <joao> the sync function, though, should go away when the new mon changes kick in
[19:04] <sagewk> joao: it should be changed to call the leveldb sync, right?
[19:04] <sagewk> er... well, i don't know enough about the sync model for leveldb there.
[19:04] <joao> hmm... I don't know
[19:04] <joao> I would assume that applying a transaction to the store would take care of that
[19:05] <joao> I will check that though
[19:05] <joao> btw, don't seem to be able to find a wip-ph branch
[19:05] * dmick_away is now known as dmick
[19:05] <sagewk> i think transaction gives atomicity, but here we need a barrier to ensure it's durable
[19:06] <sagewk> oh sorry, wip-mon-report i think
[19:06] <joao> oh yeah, that one exists ;)
[19:07] <dmick> sagewk: looks like 8/10
[19:07] <sagewk> k
[19:07] <sagewk> wip-sig
[19:07] <dmick> ok
[19:07] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:08] <sagewk> and set 'fatal signal handlers = false' in the config
[19:08] <dmick> reviewing code now
[19:08] <dmick> pretty straightforward :)
[19:09] <dmick> when I ask for -n 10, it wants to run them all at the same time; think I'll try for 5 just to not be so grabby?
[19:09] * Leseb (~Leseb@193.172.124.196) Quit (Quit: Leseb)
[19:09] <sagewk> 5 is enough yeah
[19:09] <dmick> k
[19:10] * aliguori (~anthony@32.97.110.59) has joined #ceph
[19:10] <dmick> lol, and I was trying to make that 'k' be a vi "last command" in a shell window, and didn't have focus, but it ... just worked out anyway :)
[19:11] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:12] <joao> sagewk, wip-mon-report also looks good
[19:12] <joao> aside from an indentation problem on Monitor.cc
[19:13] <joao> as it is, it should not clash with anything I did so far :)
[19:15] <sagewk> great thanks
[19:16] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:20] * tightwork (~tightwork@rrcs-71-43-128-65.se.biz.rr.com) Quit (Ping timeout: 480 seconds)
[19:25] * maelfius (~mdrnstm@66.209.104.107) has joined #ceph
[19:29] <elder> I'm going to be offline for a bit. I have to bring my son to school for a bit, then will be taking a look at some printed code. I'll check back periodically though.
[19:33] * andret (~andre@pcandre.nine.ch) Quit (Ping timeout: 480 seconds)
[19:42] * Tobarja (~athompson@cpe-071-075-064-255.carolina.res.rr.com) has joined #ceph
[19:44] <dmick> grr. 29 machines free and still it won't schedule me
[19:50] <dmick> ok, something must be wrong
[20:06] <dmick> me. I was wrong. tnx sage
[20:07] * Ryan_Lane (~Adium@216.38.130.164) has joined #ceph
[20:08] * nhm_ (~nhm@174-20-15-49.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[20:08] * pentabular (~sean@adsl-71-141-232-146.dsl.snfc21.pacbell.net) has joined #ceph
[20:27] <dmick> could someone do a quick review on wip-rbd-children pls
[20:34] <dmick> also: wip-2948
[20:36] * pentabular (~sean@adsl-71-141-232-146.dsl.snfc21.pacbell.net) Quit (Remote host closed the connection)
[20:56] <sagewk> dmick: lets kill the usage_exit() macro, otherwise looks good
[20:58] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:16] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[21:20] * EmilienM (~EmilienM@vau75-1-81-57-77-50.fbx.proxad.net) Quit (Remote host closed the connection)
[21:32] <dmick> sagewk: ok, fixed, squashed, pushed branch. what branches would you like this fix in?
[21:33] <sagewk> start with next
[21:33] <sagewk> then stable
[21:39] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) has joined #ceph
[21:49] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[21:51] * BManojlovic (~steki@212.200.243.134) has joined #ceph
[22:02] <sagewk> joao: still around?
[22:02] <joao> yeah
[22:02] <sagewk> joao: wip-id?
[22:02] <joao> looking
[22:02] <sagewk> gregaf: ^
[22:05] <joao> the two top commits seem right to me
[22:06] <sagewk> k thanks
[22:06] <sagewk> gregaf: https://github.com/ceph/ceph/commit/f7b30225b89684f5632276b6ad46f0621cf5c189
[22:06] <sagewk> when you have a few minutes
[22:08] <gregaf> I saw one version of that commit when going through email on my way in; the explanation made sense
[22:09] <gregaf> and the mechanics look fine here
[22:10] <gregaf> although the "reuse" nomenclature is a bit weird
[22:10] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) Quit (Quit: Leaving)
[22:11] <gregaf> sagewk: yeah, I think that's okay
[22:11] <gregaf> you've been running tests on that, right?
[22:11] <sagewk> msgr? yeah.
[22:12] <sagewk> thanks
[22:12] <gregaf> cool
[22:12] <sagewk> peek at wip-id too, since i think you opened the bug?
[22:12] <gregaf> yeah, it looked fine to me
[22:12] <gregaf> thanks!
[22:15] <dmick> anyone for wip-rbd-children? you know you want to
[22:18] * stass (stas@ssh.deglitch.com) has joined #ceph
[22:27] * tjpatter (~tjpatter@69.167.130.11) Quit (Quit: tjpatter)
[22:36] * nearbeer (~nik@mobile-198-228-198-114.mycingular.net) has joined #ceph
[22:47] <nearbeer> I've setup a basic rbd store w/ repl = 2 , that my VMs use and I've got 10ge links between the v
[22:48] <nearbeer> vm hosts and san. But this would require , I think , fat pipes as the number of VMs increase.
[22:49] <iggy> 2 choices.... scaling up or out... ceph should work either way
[22:50] <nearbeer> Is there a better way where one could somehow place the rbd store and VMs on the same box in two node sets ... Or three node sets. That way one would still have the rbd store across multiple machines?
[22:51] <nearbeer> Or , I think what I'm trying to ask is ... Is there a way to incorporate rbd without explicitly having to buy a 'San' per se.
[22:53] <nearbeer> no compute and storage nodes, just a node that has a physical characteristic somewhere in between?
[22:53] <iggy> nearbeer: depends how you are using rbd... the kernel clients aren't safe to use on OSDs
[22:53] <iggy> otherwise it should be safe
[22:53] <nearbeer> I've learned that the hard way
[22:53] <nearbeer> But qemu-KVM works
[22:53] <iggy> with the built-in librbd support, yeah
[22:54] <kblin> iggy: the rbd kernel module isn't safe in what respect?
[22:55] <iggy> isn't safe to use on the same machine as an OSD
[22:56] <kblin> ah, that's interesting information
[22:57] <iggy> neither is the kernel cephfs client
[22:58] <kblin> seeing how I didn't get that to work at all, I'm not too worried
[22:59] <kblin> I take it the fuse version also uses librbd and should be fine, too?
[22:59] <iggy> it doesn't use librbd, but yeah
[22:59] <kblin> fair enough, as long as I can use it ;)
[23:00] <gregaf> nearbeer: if you're willing to pay the configuration cost, you can set up different pools and CRUSH trees to try and keep data more local
[23:01] <kblin> I guess I'll just keep my vms on a different machine until I get them ported to libvirt, then
[23:01] <nearbeer> Hmmm, that sounds interesting. and better than trying to keep up with san/vm interconnect speeds as things scale
[23:03] <gregaf> at whatever scale you're interested in, you'd need to arrange your CRUSH hierarchy so that (for example) all of your OSDs are in a single host bucket
[23:04] <gregaf> and then have a different CRUSH rule for each pool that takes one replica from whatever host that pool is storing images for, and a second replica from the neighboring host
[23:04] <gregaf> and set the appropriate CRUSH rule for each of those pools, and then set up each KVM instance to be in the appropriate pool
[23:05] <gregaf> of course by doing this you'll lose your nice auto-balancing and stuff
[23:05] <gregaf> but you can adjust the granularity (eg, do it based on a rack and its neighbor, or a row and its neighbor) in relation to how much interconnect you have and how much balancing you want
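As a hedged sketch of the kind of rule gregaf is describing (bucket names, ruleset number, and pool are all hypothetical; this is the decompiled crushmap syntax, edited with crushtool and loaded back with ceph osd setcrushmap):

    rule vms_near_host_a {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        # first replica from the "local" host
        step take host-a
        step choose firstn 1 type osd
        step emit
        # remaining replicas from the neighboring host
        step take host-b
        step choose firstn -1 type osd
        step emit
    }

and then pointing the pool at it with something like "ceph osd pool set <pool> crush_ruleset 3".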
[23:05] <Tv_> nearbeer: you need to give up locality to one box anyway the moment you want replication...
[23:07] <nearbeer> I'm okay with giving up locality. If I can run a pool across rows or racks and then add more pools as needed that would seem to work based on what's been said here.
[23:08] <nearbeer> ... Add more pools as needed. That would seem to work.
[23:08] <Tv_> nearbeer: if you have <1 rack of servers, i don't see why you desire to configure crush&pools manually
[23:09] <Tv_> nearbeer: and even at <1 row of racks, you'd probably do pretty good with just one crush ruleset
[23:09] <SpamapS> https://launchpad.net/ubuntu/+source/ceph/0.48-1ubuntu4/+build/3737010 .. first builds of 0.48 with radosgw are hitting quantal. :)
[23:09] <SpamapS> https://launchpad.net/ubuntu/+source/ceph/0.48-1ubuntu4/+build/3737012 .. armhf build.. might take a while :)
[23:18] <gregaf> Tv_: he just doesn't want to have to buy bigger routers as his system grows, and if you keep the pools to a limited number of nodes you can actually do that
[23:22] <liiwi> well, I've been a networks maintainer..
[23:24] <liiwi> overselling is the way networks are working these days
[23:26] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[23:27] <liiwi> the question is just, how many times, and whose traffic.
[23:37] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[23:48] * nearbeer (~nik@mobile-198-228-198-114.mycingular.net) Quit (Quit: Colloquy for iPad - http://colloquy.mobi)
[23:54] <Tv_> gregaf: sure but that really only starts to get relevant at >5 racks or so
[23:54] <Tv_> and beyond that, you pretty much have to go for a "Clos network" ~ aka "fat tree", with redundant links etc

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.