#ceph IRC Log


IRC Log for 2010-11-19

Timestamps are in GMT/BST.

[0:10] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:12] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[2:00] * Jiaju (~jjzhang@ Quit (Read error: Connection timed out)
[2:01] * Jiaju (~jjzhang@ has joined #ceph
[2:44] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) Quit (Quit: Leaving.)
[2:48] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[2:49] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[2:53] * Jiaju (~jjzhang@ Quit (Ping timeout: 480 seconds)
[2:57] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[2:57] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[2:58] * Jiaju (~jjzhang@ has joined #ceph
[3:13] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[3:13] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[3:22] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[3:23] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[3:27] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:38] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[3:40] * Jiaju (~jjzhang@ Quit (Ping timeout: 480 seconds)
[3:47] * Jiaju (~jjzhang@ has joined #ceph
[3:50] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:55] * Jiaju (~jjzhang@ Quit (Ping timeout: 480 seconds)
[4:03] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[4:07] * Jiaju (~jjzhang@ has joined #ceph
[4:40] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[4:45] * Jiaju (~jjzhang@ Quit (Ping timeout: 480 seconds)
[4:57] * Jiaju (~jjzhang@ has joined #ceph
[4:59] * greglap (~Adium@ has joined #ceph
[5:38] * greglap1 (~Adium@ has joined #ceph
[5:45] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[6:02] * greglap1 (~Adium@ Quit (Read error: Connection reset by peer)
[6:16] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[6:55] * DLange (~DLange@dlange.user.oftc.net) Quit (Quit: lunch)
[7:07] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[7:12] * lidongyang (~lidongyan@ has joined #ceph
[7:46] <jantje> Hey
[7:46] <jantje> 2010-11-18 13:10:54.424589 7f431c0e6710 -- send_message dropped message osd_op_reply(3150116 10000014c6a.00000001 [write 229376~4096 [1@-1]] ondisk = 0) v1because of no pipe
[8:47] * allsystemsarego (~allsystem@ has joined #ceph
[8:59] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[9:41] * Devin (~ddiez@ has joined #ceph
[9:41] <Devin> hi
[9:47] <Devin> anyone know in what crash scenarios you won't be able to recover @ 0.24? I see that some features are planned for 0.24 and others for 1.0
[9:48] <Devin> we want to test 0.23/0.24 with non critical services, and are starting to "paint" our setup
[10:30] * Yoric (~David@ has joined #ceph
[12:44] <jantje> 2010-11-19 12:44:13.579107 pg v69: 1584 pgs: 1584 active+clean; 272 KB data, 1100 MB used, 1367 GB / 1380 GB avail
[12:45] <jantje> thats a lot of disk space used to host 272 KB !
[13:12] <Devin> :O
[13:44] <failboat> lol
[13:46] <jantje> sagewk / yehudasa : if you ever get the chance to run iozone -t 10 -s100M -i 0 -i 2 -i 8 on a cluster, take a look at the random workload, while running that test iostat on the OSD machines show no load at all
[13:47] <jantje> Children see throughput for 10 random readers = 1997.05 KB/sec with 6 OSD's, which comes down to 500KB/sec random read for each disk
[13:47] <jantje> Children see throughput for 10 random writers = 6493.89 KB/sec
[13:48] <jantje> writing is much faster, and that is with default replication
[13:48] <jantje> (well, unless it's faster because of this: 2010-11-19 13:46:04.217081 7fb4d2a47710 -- send_message dropped message osd_op_reply(1077196 1000000003b.00000014 [write 290816~4096 [1@-1]] ondisk = 0) v1because of no pipe
[13:53] <jantje> well, actually I don't know if thats slow, but it looks just weird that writing performance is better than reading
[13:55] <jantje> [15825.760593] libceph: tid 1076940 timed out on osd1, will reset osd
[13:55] <jantje> thats probably the cause of the pipe issue
[14:32] <Devin> I was going to test ceph, but math does not help me: if 272 kb == 1367 Gb, 3 tb = too much needed space for me ;)
[14:34] <jantje> no, it's just initially
[14:34] <jantje> probably of some empty groups or something like that
[14:34] <jantje> nothing to worry about
[14:35] <jantje> ceph does replication by default
[14:35] <jantje> so when you have a client writing 1gb, it actually uses 2gb of disk space
[14:35] <jantje> you can customize all that
[14:36] <jantje> (osd maps or something like that, don't ask me specifics :)
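The replication arithmetic jantje describes can be sketched as below. This is only an illustration: the default of two copies is taken from his "1gb uses 2gb" remark, the function name is made up for this example, and journal and filesystem overhead are ignored.

```python
GIB = 1024 ** 3


def raw_bytes_used(client_bytes, replicas=2):
    """Raw disk space consumed when every byte is stored `replicas` times.

    Assumption: plain full-copy replication with no journal or
    filesystem overhead counted.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return client_bytes * replicas


# 1 GiB written by a client with two copies, then with three copies
# (the 3-replica setup Devin mentions later in the log).
print(raw_bytes_used(1 * GIB) // GIB)
print(raw_bytes_used(1 * GIB, replicas=3) // GIB)
```

So the 3 TB Devin mentions would need roughly 6 TB of raw disk at two copies, or 9 TB at three.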
[14:37] <jantje> sagewk: I found some method to reproduce an issue with journals getting full
[14:37] <jantje> I mailed you with some logging
[14:38] <jantje> Life sucks when you live in europe :P Communication isn't that easy .. hehe
[15:10] <Devin> I love Europe ;)
[15:26] <failboat> I love life
[15:27] <failboat> but it sucks regardless
[15:31] <Devin> sometimes timezones sux :)
[15:31] <Devin> in a few hours i'm moving to -5 from my actual timezone
[15:32] <Devin> i need a calculator just to know when the plane will be landing :P
[16:15] * f4m8 is now known as f4m8_
[17:12] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[17:12] * Yoric (~David@ has joined #ceph
[17:27] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[17:29] * Yoric (~David@ has joined #ceph
[17:35] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[17:52] * greglap (~Adium@ has joined #ceph
[17:56] <greglap> jantje: if you've got other data on the disks besides the Ceph stuff, that's included in the space used :)
[17:59] <greglap> I don't think I've seen anybody actually hit that dropped message because of no pipe warning before
[18:00] <greglap> do you have an OSD down, or the cluster flapping somehow?
[18:03] <greglap> Devin: we aren't going to have any more crash recovery tools for v0.24, if that's what you're asking
[18:04] <Devin> so nothing until 1.0?
[18:05] <greglap> well, there will probably be some intermediary versions that might get new tools
[18:05] <Devin> that sounds like changing 1.0 date again , noooo :P
[18:05] <greglap> actual data loss on release versions is pretty rare from what I can remember, though
[18:06] <Devin> we want to test a basic setup, 3 nodes, 3 replicas
[18:06] <Devin> our nfs server is getting "too much load"
[18:07] <greglap> what with replication and everything most bugs end up causing assert failures before there's a chance to lose data
[18:07] <Devin> and we want to start exploring alternatives
[18:07] <Devin> so the basic features for us would be HA and a bit of LB
[18:08] <greglap> what kind of load balancing are you after?
[18:08] <greglap> 1.0 isn't going to have any active balancing, just the stuff that comes from striping files
[18:09] <Devin> round robin should be ok
[18:09] <Devin> just not serving the same file always from the same server
[18:09] <greglap> that's not quite applicable to how Ceph works
[18:10] <greglap> it chunks up files across objects which are pseudo-randomly distributed across the servers
[18:10] <greglap> files of any size are read/written to more than one server at the same time
[18:11] <Devin> well that's ok :D
[18:11] <Devin> we have more spins for the same file
[18:11] <greglap> load on the servers ought to be about the same across all your workloads, though :)
[18:11] <Devin> and more cpus :P
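A rough sketch of the striping greglap describes: a file is cut into fixed-size objects, and each object lands on a server chosen pseudo-randomly from its name. The 4 MiB object size and the SHA-1 based placement are assumptions made for illustration; Ceph's real placement algorithm is not shown here. The object naming (inode in hex, a dot, an 8-digit hex index) follows the names visible in the osd_op_reply lines earlier in this log.

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size (4 MiB); not stated in the log


def object_name(inode, offset, object_size=OBJECT_SIZE):
    """Name of the object holding byte `offset` of the file with this inode."""
    return f"{inode:x}.{offset // object_size:08x}"


def objects_for_range(inode, offset, length, object_size=OBJECT_SIZE):
    """All object names touched by reading or writing `length` bytes at `offset`."""
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return [f"{inode:x}.{i:08x}" for i in range(first, last + 1)]


def pick_server(name, servers):
    """Stand-in placement: hash the object name onto one server.

    This is NOT Ceph's placement logic, just a deterministic
    pseudo-random choice to show the idea.
    """
    digest = hashlib.sha1(name.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["osd0", "osd1", "osd2"]  # hypothetical 3-node setup
for name in objects_for_range(0x10000014C6A, 0, 10 * 1024 * 1024):
    print(name, "->", pick_server(name, servers))
```

A 10 MiB file spans three objects under these assumptions, so a single large read can be served by several servers at once, which is the effect Devin was hoping to get from round robin.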
[18:11] <Devin> so, do you think 0.24 could be a good starting point?
[18:12] <Devin> it will have a few nginx in front of it
[18:12] <greglap> I guess that depends what you want it for
[18:12] <greglap> for a 3-node setup it will probably be about as stable/recoverable as v0.23 was
[18:13] <Devin> ok, then we will start testing it on monday with 0.23 :D
[18:13] <greglap> maybe a little more since there were some OSD changes for v0.23 that we've had the chance to work some wrinkles out of
[18:14] <greglap> but it's not something I'd put in place for production use without backups
[18:14] <Devin> we will have nfs + ceph
[18:14] <Devin> live
[18:14] <Devin> we will write to our nfs server, sync to ceph
[18:14] <Devin> and serve from ceph
[18:14] <greglap> with some kind of failover if it breaks?
[18:15] <Devin> if we find any probs, then go back to nfs
[18:15] <Devin> that's it
[18:15] <greglap> well if you've got safeguard, then sure, go for it
[18:15] <greglap> file serving consists of reasonably well-tested code AFAIK
[18:15] <greglap> sagewk, you have any thoughts?
[18:18] <greglap> hmm, I guess his status is just a lie
[18:19] <Devin> hehe
[18:19] <Devin> idle time = 23 hours
[18:19] <Devin> 23 hours in a row coding ;)
[18:19] <Devin> polishing ceph :P
[18:20] <sagewk> something like that :)
[18:20] <greglap> I think he just got in to the office, give him a minute :)
[18:20] <sagewk> i would only try ceph in production if you have a very clear path to fall back to something else. if you're comfortable with that, i'd love to hear how it goes :)
[18:21] <Devin> the idea is to have a cron checking everything
[18:21] <Devin> if something goes wrong
[18:21] <Devin> umount /thatfuckingceph/whyididit; mount -t nfs /myoldbutlovely/nfs
[18:21] <Devin> just for "temp" content, like snapshots, etc...
[18:22] <Devin> someone has to be the first one :D
[18:22] <sagewk> or umount -l in this case...
[18:22] <Devin> yup, i got very used to -l lately
[18:22] <Devin> moving from nfs-001 to nfs-002, etc...
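The cron-driven fallback Devin outlines might look roughly like the sketch below. All paths and function names are hypothetical, the health check (a `stat` that must return within a few seconds) is an assumption rather than anything proposed in the channel, and the lazy unmount follows sagewk's `umount -l` suggestion.

```python
import subprocess


def mount_is_healthy(path, timeout=5):
    """Treat the mount as healthy if `stat` on it returns within `timeout` seconds.

    Best effort only: a process stuck in uninterruptible I/O may not die
    when killed, in which case this call can still block.
    """
    try:
        subprocess.run(
            ["stat", path],
            check=True,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except (subprocess.SubprocessError, OSError):
        return False


def fallback_commands(ceph_mountpoint, nfs_export, nfs_mountpoint):
    """Commands that swap a failed Ceph mount for the old NFS export."""
    return [
        ["umount", "-l", ceph_mountpoint],  # lazy unmount, as suggested in the log
        ["mount", "-t", "nfs", nfs_export, nfs_mountpoint],
    ]


def check_and_fall_back(ceph_mountpoint, nfs_export, nfs_mountpoint):
    """Run from cron: do nothing while Ceph answers, otherwise switch to NFS."""
    if mount_is_healthy(ceph_mountpoint):
        return False
    for cmd in fallback_commands(ceph_mountpoint, nfs_export, nfs_mountpoint):
        subprocess.run(cmd, check=True)
    return True
```

A cron entry would call `check_and_fall_back` with the real mount point and export; it needs root to run `umount` and `mount`.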
[18:23] <Devin> [2285223.247672] nfs: server nfs-001 not responding, still trying
[18:23] <Devin> [2285226.702340] nfs: server nfs-001 OK
[18:24] <Devin> that's getting our "day to day"
[18:24] <sagewk> yep
[18:25] <Devin> we haven't suffered a "oh-no-no" failure with those servers, but during peaks those 2-3 seconds can kill us
[18:25] <Devin> lots of php cgi processes waiting for the nfs
[18:25] <Devin> or nginx starts queuing a lot of requests
[18:25] <Devin> in 20-30 secs our graphs go up like crazy
[18:26] <sagewk> yeah
[18:28] <Devin> and as i know that you are nice ppl, i'm sure you are going to help me :D
[18:28] <sagewk> we'll try :)
[18:31] <failboat> sagewk:
[18:35] <wido> Devin: those NFS messages, are you running UDP or TCP?
[18:35] <wido> And are you sure you have enough nfsd daemons running for your workload?
[18:36] <greglap> wido: shhh, stop trying to take away our brave new tester! ;)
[18:37] <wido> greglap: ok, i'll sshh ;)
[18:38] <Devin> hehehe
[18:39] <Devin> wido, that's what I have been told, that we have enough, etc... :P
[18:39] * greglap (~Adium@ has left #ceph
[18:40] <wido> Devin: NFS can be a real pain in the ass, takes a lot of tuning to get it running well, Ceph is much easier ;)
[18:41] <sagewk> failboat: hi
[18:41] <Devin> as i'm sure you have no hidden interests, i will trust you ;)
[18:41] <sagewk> failboat: is this with the hard link heavy workload you mentioned the other day?
[19:00] * nolan (~nolan@ Quit (Remote host closed the connection)
[19:03] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:06] * nolan (~nolan@phong.sigbus.net) has joined #ceph
[19:08] <Devin> enough for today
[19:08] <Devin> c'u
[19:08] <Devin> :)
[19:08] <Devin> nice weekend!
[19:08] * Devin (~ddiez@ Quit (Quit: Devin)
[19:11] * sjust (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:16] * Yoric (~David@ Quit (Quit: Yoric)
[19:17] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[19:25] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[19:46] <cmccabe> wido: are you there?
[19:47] <wido> cmccabe: yes
[19:47] <cmccabe> wido: I'm going to take a look at 585 now
[19:48] <cmccabe> so first of all, how reproducible is it?
[19:48] <wido> well, just start my OSD's
[19:48] <wido> they will crash instantly
[19:48] <wido> I'm not sure how my cluster got in that state, it just happened
[19:50] <cmccabe> k
[19:50] <cmccabe> wido: don't change anything! :)
[19:51] <cmccabe> wido: hopefully we'll be able to get a fix and verify that it works on your cluster
[19:52] <cmccabe> wido: what's the configuration for this cluster
[19:52] <cmccabe> never mind, found it
[19:59] <wido> cmccabe: I won't touch a thing. I'm curious how my VM will respond to this, it's been without OSD's for almost a week now, hehe
[20:11] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[20:17] <failboat> sagewk: yeah. But currently there's no workload as it fails on ls -lR for example
[20:17] <failboat> (sorry, I'm occasionally on and off here)
[20:21] <cmccabe> wido: I'd like to try to run gdb on your cluster... or perhaps restart with more logging
[20:21] <wido> cmccabe: Sure, no problem!
[20:21] <cmccabe> wido: where should I login
[20:21] <wido> node01 / osd0 is already running with logging on 20 for osd and filestore
[20:21] <wido> from the logger machine: ssh root@node01
[20:22] <wido> from the logger machine you can access every node in the cluster
[20:22] <wido> gdb is present on all nodes
[20:23] <cmccabe> wido: how do you normally restart all the nodes
[20:24] <cmccabe> wido: I've seen a few different methods I guess
[20:24] <wido> oh, i use dsh
[20:24] <wido> dsh -g osd service ceph start/stop/restart
[20:25] <cmccabe> thanks
[20:25] <wido> node07 and node09 are dead
[20:25] <cmccabe> ic
[20:26] <cmccabe> basically, I'm going to change ceph.conf to have more OSD logging, and restart
[20:26] <wido> cmccabe: No problem, feel free to do what you want
[20:26] <cmccabe> now, how do you distribute your ceph.conf out to the cluster nodes
[20:27] <wido> I simply put it online somewhere and use wget
[20:27] <wido> dsh -g osd "wget http://...../ceph.conf"
[20:27] <cmccabe> k
[20:27] <wido> wget -O /etc/ceph/ceph.conf
[20:30] <wido> cmccabe: I'm going to be afk, it's Friday night here ;)
[20:31] <cmccabe> wido: ok...
[20:31] <wido> I might read the channel later on this evening, but I'll repeat myself, it's a test cluster, so feel free to test/change things
[20:31] <cmccabe> thanks
[20:31] <cmccabe> I really think if I get this logging, it will probably be enough to solve it
[20:33] <sagewk> cmccabe: dont think you need the logs; it looks like recover_primary() simply isn't checking unfound status before calling pull()
[20:33] <cmccabe> sagewk: I see some code that checks for unfound there
[20:33] <cmccabe> bool unfound = missing_loc.count(soid);
[20:33] <cmccabe> ...
[20:33] <cmccabe> } else if (unfound) {
[20:33] <cmccabe> ++skipped;
[20:33] <cmccabe> } else {
[20:33] <cmccabe> ....
[20:34] <sagewk> oh i see
[20:34] <cmccabe> sagewk: yeah, I think that's what we added the last time this came up.
[20:34] <cmccabe> sagewk: wait a second...
[20:35] <cmccabe> sagewk: err.... the test is reversed :(
[20:35] <sagewk> ah :)
[20:35] <cmccabe> heh
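One reading of the "test is reversed" exchange, sketched in Python rather than the C++ of the actual code. It assumes `missing_loc` maps an object id to the set of OSDs known to hold a copy, so an object is unfound when it has no usable entry. The data structures and function name are hypothetical, and this is not the patch that fixed 585.

```python
def plan_recovery(missing, missing_loc):
    """Split missing objects into those safe to pull and a count of skipped ones.

    missing      iterable of object ids the primary lacks
    missing_loc  dict: object id -> set of OSDs known to hold a copy (assumed)
    """
    to_pull = []
    skipped = 0
    for soid in missing:
        # The snippet pasted in the channel amounts to
        #     unfound = soid in missing_loc
        # which is true exactly when a location IS known, i.e. reversed.
        unfound = not missing_loc.get(soid)
        if unfound:
            skipped += 1  # nowhere to pull from yet, so skip it
        else:
            to_pull.append(soid)
    return to_pull, skipped


print(plan_recovery(["a", "b", "c"], {"a": {1, 4}, "c": set()}))
```

With the reversed test, objects with a known location would be skipped and objects with none would be handed on for pulling, which fits the instant crashes wido reported.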
[20:36] <cmccabe> wido: if you're still there... what machine do you usually compile on?
[21:14] <cmccabe> wido: 585 should be fixed now
[21:17] <wido> cmccabe: on node01, i've got "build_ceph_packages" there in /usr/local/sbin
[21:17] <wido> source is in /usr/src/ceph
[21:18] <wido> with "sync_ceph_packages" I distribute them to the other machines
[21:44] <wido> cmccabe: Indeed, fixed now, I've got 10 OSD's up again
[21:55] <sagewk> aon guys: add any package name to the list in flak:/root/packages and dist.sh will install it
[21:59] <joshd> cool
[22:12] <wido> sagewk: If my replication level is set to three and 2 OSD's are down, the cluster should recover, shouldn't it?
[22:12] <sagewk> yeah
[22:14] <wido> hmm, ok. After #585 was fixed by cmccabe the cluster stays at "120256/1410393 degraded (8.526%)" with 10 of the 12 OSD's up
[22:14] <wido> has been there for 30 min now
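The percentage in that status line is simply the degraded count over the total, which a few lines of Python can confirm. The function name is made up for this check.

```python
def degraded_percent(degraded, total):
    """Share of degraded entries as a percentage of the total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * degraded / total


# Figures quoted by wido: 120256/1410393 degraded
print(f"{degraded_percent(120256, 1410393):.3f}%")  # prints 8.526%
```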
[22:14] <sagewk> all active, but not recovering?
[22:15] <sagewk> hmm, looking
[22:16] <wido> for example "rados -p rbd ls" blocks
[22:16] <sagewk> a few pgs are stuck peering. do you mind turning on osd debugging across the cluster?
[22:17] <wido> sagewk: Not at all, do that in a few minutes
[22:18] <sagewk> thanks
[22:31] <wido> sagewk: debug osd or filestore?
[22:31] <sagewk> just osd
[22:31] <sagewk> and debug ms =1
[22:38] <wido> sagewk: Ok, done, but I think it might be some btrfs issues
[22:38] <wido> I'm seeing a lot of btrfs hangs lately, btrfs-transaction going into status D on various nodes
[22:39] <sagewk> any currently?
[22:41] <wido> no, had one on osd10, but when the OSD was killed by the init script, it became a zombie for a few minutes
[22:41] <wido> and then exited, while the btrfs processes were blocking
[22:42] <wido> but the cluster is back again at "120256/1410393 degraded (8.526%)", with the OSD's at the logging you wanted
[22:42] <sagewk> k
[22:43] <wido> and about the btrfs warning, they are still coming back
[22:45] <wido> I'm going afk for today, thanks again!
[22:46] <sagewk> ok!
[22:56] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.