#ceph IRC Log


IRC Log for 2011-02-05

Timestamps are in GMT/BST.

[0:04] * uwe (~uwe@ip-94-79-145-210.unitymediagroup.de) has joined #ceph
[0:06] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:07] <prometheanfire> is it preferred to run one storage node per server or to use SANs as backends?
[0:13] <johnl> gregaf: I'm back. you around?
[0:13] <gregaf> johnl: yep!
[0:13] <johnl> oh doh, just saw that sage answered me minutes after you left, but I'd gone!
[0:13] <johnl> doh!
[0:14] <johnl> he asked me to try a new version of the client
[0:14] <johnl> but I've not done anything yet
[0:14] <johnl> except restarting the cluster at the time of course :/
[0:14] <gregaf> johnl: yeah, the problem you're seeing isn't a result of what he thought it might be
[0:14] <sagewk> i was wrong though, different bug
[0:14] <johnl> ah ok
[0:14] <gregaf> wrong error message although it sounds similar when you described it :)
[0:15] <sjust> prometheanfire: are you asking whether it is better to have the osds store data locally?
[0:15] <gregaf> johnl: so you don't still have those tid timeout messages?
[0:16] <prometheanfire> sjust: yes
[0:16] <johnl> the client box is still stuck. processes in D state
[0:16] <johnl> wanna poke it?
[0:17] <prometheanfire> what I'm looking to do is have the osds be on VM servers (kvm with rbd)
[0:17] <sjust> prometheanfire: the osd daemon should generally use local storage
[0:17] <prometheanfire> ok
[0:17] <gregaf> hmm, maybe not if you're running on a distributed block device like rbd though!
[0:18] <sjust> prometheanfire: I may have misunderstood, did you mean that you want to run the osds on vm guests?
[0:18] <prometheanfire> no
[0:18] <sjust> ok
[0:18] <prometheanfire> osds on vmhosts
[0:18] <sjust> running it on the host could work
[0:18] <gregaf> johnl: sage says if it's in D state it's probably trying to write out dirty pages
[0:18] <gregaf> so that's probably because of the unfinished tids we were seeing earlier
[0:19] <johnl> k
[0:19] <gregaf> can you turn on OSD debugging via inject-args and see if you can find those tids in the logs?
[0:19] <prometheanfire> I'm thinking about a raid + cache ssd
[0:19] <prometheanfire> how cpu/ram intensive is ceph on osds?
[0:20] <cmccabe> prometheanfire: cosd (the OSD component of ceph) can always use additional resources
[0:20] <prometheanfire> the more nodes the better?
[0:21] <cmccabe> prometheanfire: even once we've optimized it more than it is now, it will always be good to have more memory on a cosd machine, since that will allow more page cache
[0:21] <prometheanfire> that's what I got from the wiki
[0:21] <johnl> gregaf: lemme give that a go.
[0:21] <gregaf> OSDs aren't too bad on cpu; I think it maxes out at around 1 core of a modern CPU
[0:21] <gregaf> usually much less
[0:21] <cmccabe> the monitors on the other hand don't tend to generate a lot of load
[0:21] <prometheanfire> I can't find on the wiki how ceph fails on network interrupt
[0:22] <gregaf> johnl: hopefully we can figure out why those tids are hanging; the root cause might be why the MDS requests were hanging too if the MDS can't commit to disk
[0:22] <gregaf> prometheanfire: how it fails on network interrupt?
[0:22] <gregaf> you mean if part of the network gets disconnected?
[0:22] <prometheanfire> yes
[0:23] <cmccabe> gregaf: yeah, I would expect memory and disk bandwidth to be more of a bottleneck on OSDs than CPU. But we haven't benchmarked it very much.
[0:23] <gregaf> so failed OSDs get marked as down, and a network interruption counts as a failure
[0:24] <gregaf> hopefully the data on that OSD is replicated on a still-available OSD
[0:24] <gregaf> and so the system continues to service all IO after a brief timeout period to declare it dead and rearrange the data placement map
[0:25] <prometheanfire> ok, so, the node is disconnected, then the data on that node cannot be recovered via cache?
[0:25] <gregaf> cache?
[0:25] <gregaf> like client cache you mean?
[0:25] <gregaf> no
[0:25] <prometheanfire> ya
[0:25] <gregaf> not from the client cache, no
[0:26] <prometheanfire> ceph+nbd+cache=none would be safest I think
[0:26] <gregaf> there wouldn't be any way to guarantee data permanence if you did that
[0:26] <gregaf> (allowed recovery from client cache, I mean)
[0:26] <prometheanfire> ok
[0:27] <prometheanfire> so the VMs on that host might just be unrecoverable given the nature of distributed file systems?
[0:27] <prometheanfire> (meaning that this is like a power cut)
[0:28] <gregaf> well you'd have to manage to lose all the OSDs holding their data first, and not be able to bring them back up
[0:28] <prometheanfire> and if an osd is always local it can pick up the slack?
[0:29] <gregaf> if you lose a portion of the OSDs replicating data, but at least one of them is up, you can continue doing IO on that data
[0:29] <gregaf> there's a brief (tunable!) period of unavailability while the system determines that the OSD is down
[0:29] <prometheanfire> ok, so an osd on each vmhost seems best for data security
[0:29] <gregaf> and updates the osd map to indicate that the remaining replicas are the ones to talk to
[0:29] <prometheanfire> a ttl basically
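The failure-handling flow gregaf describes (a tunable grace period, after which the down OSD is dropped from the map and clients talk to the surviving replicas) can be sketched as a toy model. The grace value, function, and OSD names below are illustrative, not Ceph internals:

```python
# Toy sketch of heartbeat-based failure detection: an OSD that has not
# been heard from for longer than a (tunable) grace period is treated
# as down, and only the surviving replicas are used for IO.

GRACE = 20.0  # seconds; illustrative value, the real period is tunable

def surviving_replicas(replicas, last_heartbeat, now):
    """Return the replicas still considered up at time `now`."""
    return [osd for osd in replicas
            if now - last_heartbeat[osd] <= GRACE]

heartbeats = {"osd0": 100.0, "osd1": 85.0, "osd2": 99.0}
print(surviving_replicas(["osd0", "osd1", "osd2"], heartbeats, now=107.0))
# ['osd0', 'osd2']  (osd1 missed the grace window)
```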
[0:30] <gregaf> depends on how many vmhosts you have and how much data you're storing ;)
[0:30] <prometheanfire> in the beginning 6ish hosts
[0:30] <prometheanfire> and just scaling from there
[0:30] <gregaf> well with 6 replicas your writes are going to take a while
[0:31] <prometheanfire> is that how it works, just like raid1?
[0:32] <gregaf> you have a primary osd that handles reads and writes, and once it receives a write from the client it replicates that data to all the replicas
[0:32] <gregaf> the client gets back an "unsafe" reply once the primary has committed to disk
[0:32] <gregaf> and a "safe" reply once all the replicas have committed to disk
[0:33] <gregaf> RBD, at least right now, ignores the unsafe and only returns once it's gotten a "safe" response
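The write path gregaf describes can be sketched as a toy model (not actual Ceph code; the names are illustrative): the primary commits locally and the client gets an "unsafe" ack, then the write fans out to the replicas and the client gets a "safe" ack once all have committed:

```python
# Toy model of the replicated-write ack sequence described above.

def replicated_write(data, osds):
    """Return the sequence of acks a client would observe."""
    acks = []
    primary, replicas = osds[0], osds[1:]
    primary.append(data)       # primary commits to its local disk
    acks.append("unsafe")      # ack once the primary has committed
    for r in replicas:         # primary fans the write out to replicas
        r.append(data)
    acks.append("safe")        # ack once every replica has committed
    return acks

osds = [[], [], []]            # 3x replication: one primary, two replicas
print(replicated_write("block-42", osds))   # ['unsafe', 'safe']
```

As noted in the log, RBD currently waits for the "safe" ack and ignores the "unsafe" one.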
[0:33] <prometheanfire> so I don't understand what makes ceph scale so well if it is replicating writes like that
[0:34] <gregaf> that's a constant latency, and 6 replicas would be pretty extreme...
[0:34] <Tv|work> prometheanfire: because the other operations went to other osds
[0:34] <gregaf> the latency doesn't go up as you add more OSDs
[0:34] <johnl> gregaf: right, debugging turned on for osds and mds. I've since seen the "timed out" messages on the client but the tids do not appear in the logs on any of the osds, or the mds
[0:34] <gregaf> johnl: what debugging did you turn on?
[0:35] <johnl> excellent question!
[0:35] <prometheanfire> what I mean is, to get a petabyte, you would need many hosts, not fully replicating, right?
[0:35] <gregaf> prometheanfire: in rados you set how many times you want data to be replicated
[0:35] <gregaf> you can add OSDs without changing the replication factor
[0:35] <prometheanfire> I can do the 'network raid1' with drbd
[0:36] <prometheanfire> ah, so you can set the amount of redundancy in the system, that's what I wanted :D
[0:36] <gregaf> so if you have 10 nodes with 3x replication, then every piece of data exists on 3 nodes
[0:36] <johnl> gah, I fscked it up. will redo.
[0:36] <gregaf> but none of the nodes have the same set of data
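The arithmetic gregaf sketches here can be made concrete: with N nodes and R-way replication, every object lives on R distinct nodes and usable capacity is raw capacity divided by R. The hash-ring placement below is an illustrative stand-in only (Ceph actually uses CRUSH), and the disk sizes are hypothetical:

```python
import hashlib

def place(obj_name, num_nodes=10, replicas=3):
    """Pick `replicas` distinct nodes for an object (toy hash placement)."""
    start = int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % num_nodes
    return [(start + i) % num_nodes for i in range(replicas)]

raw_tb = 10 * 2            # e.g. 10 nodes with 2 TB each (hypothetical sizes)
usable_tb = raw_tb / 3     # 3x replication divides raw capacity by 3
print(place("vm-disk-1"), usable_tb)
```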
[0:36] <prometheanfire> gregaf: that is what I was looking for
[0:37] <gregaf> sorry if I misled you; you seemed to want to put your data on every node
[0:38] <prometheanfire> nah, I just want every node to have access to the data + backup (the local osd)
[0:38] <prometheanfire> I didn't go with drbd a while ago because it didn't scale, but ceph does :D
[0:45] <prometheanfire> also, one last thing for now, the osds can be of differing sizes, correct?
[0:46] <sjust> prometheanfire: you can assign weights to the osds to allow for different sizes
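The weight idea sjust mentions can be illustrated with a toy weight-proportional placement: an OSD with twice the weight should receive roughly twice the data. This is a stand-in for CRUSH weights, not the actual algorithm, and the weights are made up:

```python
import random
from collections import Counter

# Toy demonstration that weight-proportional placement sends data to
# OSDs in proportion to their weights (here, osd1 has twice osd0's).
weights = {"osd0": 1.0, "osd1": 2.0}

random.seed(0)
osds, w = zip(*weights.items())
counts = Counter(random.choices(osds, weights=w, k=30000))
print(counts["osd1"] / counts["osd0"])  # ~2.0
```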
[0:46] <johnl> gregaf: bingo. found the tids in the osd logs (and then found some other lines for the same client): http://pastebin.com/NHEnCe0Q
[0:46] <prometheanfire> ok
[0:47] <gregaf> prometheanfire: just to be clear, if your network partitions somehow then having a local copy of the data will only let you keep going if you can still talk to the monitors
[0:48] <prometheanfire> ah, so if the network dies then VMs lose the disk, like a power outage
[0:48] <gregaf> well it won't be much help for your VMs to remain up if they can't talk with anybody anyway ;)
[0:49] <prometheanfire> right, I thought they could use the local osd as a sort of cache while the network was down
[0:49] <gregaf> to continue going they need an available copy of their data and network access to the monitors
[0:49] <gregaf> if you don't require access to the monitors then you could get divergent sets of data, unfortunately
[0:49] <prometheanfire> since their data might not be local, that means no network is bad
[0:50] <prometheanfire> ok, that's fine I guess, power outages might kill just about anything
[0:51] <gregaf> johnl: did you turn up debug_osd or just the messenger logging?
[0:51] <gregaf> it's good to see the tids exist but I'd like to look at why they aren't completing
[0:51] <gregaf> which will require access to the internal osd logs ;)
[0:53] <gregaf> prometheanfire: I'm having trouble imagining a network setup and crush map that would let your VMs keep talking to anybody at all that wouldn't let them talk to their data with 2x or 3x replication
[0:54] <prometheanfire> ya, I had a fantasy
[0:54] <gregaf> I think you may be overestimating the impact of network partitioning :)
[0:54] <gregaf> it's just such a sexy problem to think about
[0:54] <bchrisman> hmm.. wonder if it makes sense to 'ionice' a cosd...?
[0:54] <prometheanfire> it truly is
[1:00] <gregaf> johnl: be back in 20!
[1:01] <cmccabe> bchrisman: my understanding of ionice is that it's mostly for prioritizing certain processes above others with regard to IO activity
[1:01] <bchrisman> yeah… I guess if there's no contention on your system.
[1:01] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[1:01] <cmccabe> bchrisman: the recommended setup for ceph is that OSDs are separate from monitors and MDS nodes, so I don't know if it would help
[1:01] <bchrisman> yeah
[1:02] <cmccabe> bchrisman: I guess in theory there could be kernel threads that you are contending with for I/O
[1:02] <bchrisman> I'm running with 4 disks with a mirrored root across slices off each disk..
[1:02] <bchrisman> yeah.. but kernel threads would probably be important enough that you don't want to put them off I guess?
[1:02] <cmccabe> bchrisman: but from a practical point of view, ionice seems unlikely to help
[1:02] <bchrisman> yeah
[1:08] <johnl> ugh sorry, got distracted. late here, am tired.
[1:10] <johnl> gregaf: I did "ceph osd injectargs 0 '--debug_ms 1 --debug_osd 10'"
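The same debug levels johnl injected at runtime could also be set persistently in ceph.conf instead; a minimal sketch, with the exact section placement being an assumption:

```ini
[osd]
        ; messenger-level logging (matches --debug_ms 1 above)
        debug ms = 1
        ; internal OSD logging (matches --debug_osd 10 above)
        debug osd = 10
```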
[1:17] * greglap (~Adium@ has joined #ceph
[1:19] <greglap> johnl: np!
[1:20] <greglap> can you package up those logs and put them somewhere accessible? I think debug_osd 10 ought to be enough to find the problem
[1:20] * bcherian (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[1:20] <greglap> but I'll need more than just the messages that contain the client name :)
[1:22] * uwe (~uwe@ip-94-79-145-210.unitymediagroup.de) Quit (Quit: sleep)
[1:29] <johnl> yep, I'll package up all the logs then
[1:30] <johnl> worth me doing now? (i.e: will you likely do something on them today?) pretty tired, would be an effort to do it now :)
[1:30] <greglap> I think I'll have some time to look at them before Monday, but don't let me keep you from your sleep; I can keep myself occupied ;)
[1:32] <johnl> k. I'll do em tomorrow then. ta!
[1:32] <johnl> nnight.
[1:33] <greglap> night!
[2:04] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[2:24] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:44] * WesleyS (~WesleyS@ has joined #ceph
[2:50] * verwilst (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[2:57] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:59] <prometheanfire> I think what I was referring to earlier as a cache was the journal
[3:03] * cmccabe (~cmccabe@ has left #ceph
[3:18] <prometheanfire> does this make sense as a structure to start with? http://i.imgur.com/2JJj1.jpg
[3:25] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[3:32] <prometheanfire> are most people in the channel using ceph on racks? (I'm guessing so)
[3:33] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:38] * WesleyS (~WesleyS@ Quit (Quit: WesleyS)
[4:11] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[5:10] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[8:17] * baldben (~bencheria@cpe-76-173-232-163.socal.res.rr.com) has joined #ceph
[10:06] * greglap (~Adium@ has joined #ceph
[10:50] * uwe (~uwe@ip-94-79-145-210.unitymediagroup.de) has joined #ceph
[11:03] * uwe (~uwe@ip-94-79-145-210.unitymediagroup.de) Quit (Quit: sleep)
[11:52] * uwe (~uwe@ip-109-84-0-9.web.vodafone.de) has joined #ceph
[12:14] * uwe (~uwe@ip-109-84-0-9.web.vodafone.de) Quit (Quit: sleep)
[12:33] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[12:37] * uwe (~uwe@ip-109-84-0-9.web.vodafone.de) has joined #ceph
[14:31] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:36] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[14:49] * uwe (~uwe@ip-109-84-0-9.web.vodafone.de) Quit (Remote host closed the connection)
[14:50] * uwe (~uwe@ip-109-84-0-9.web.vodafone.de) has joined #ceph
[15:04] * uwe (~uwe@ip-109-84-0-9.web.vodafone.de) Quit (Quit: sleep)
[16:01] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[16:50] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[17:08] * uwe (~uwe@ip-94-79-145-210.unitymediagroup.de) has joined #ceph
[19:32] * uwe (~uwe@ip-94-79-145-210.unitymediagroup.de) Quit (Quit: quit)
[19:35] * uwe (~uwe@mb.uwe.gd) has joined #ceph
[19:49] <johnl> gregaf: filed a bug with the logs from last night
[20:04] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[20:39] * uwe (~uwe@mb.uwe.gd) Quit (Quit: sleep)
[20:45] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[20:52] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[22:04] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[22:37] * greglap (~Adium@ has joined #ceph
[22:39] * uwe (~uwe@mb.uwe.gd) has joined #ceph
[23:06] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[23:45] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[23:45] * greglap (~Adium@ has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.