#ceph IRC Log


IRC Log for 2011-01-26

Timestamps are in GMT/BST.

[0:07] <Tv|work> sagewk, rest of newdreamers: i have a libvirt-using "cloud imitation" helper in flak:~tv/repos/cheesy2.git, if you have a libvirt setup you might want to play with it, it gives you 100% automated installs & customizations of vms (ubuntu 10.10 only for now)
[0:08] <Tv|work> sagewk: ohhh and the "license" in there might be a big lie; i need a decision on what we want to do with that thing, it's completely generic and could happily live on e.g. ceph.newdream gitweb or github or something
[0:18] <cmccabe> tv: cool
[0:18] <cmccabe> tv: I think I picture myself mostly just starting tests with a "clean" image every time
[0:19] <cmccabe> tv: then when the test is over, just discard all changes (unless I need to manually inspect a failure)
[0:22] <cmccabe> tv: when I used ec2, we rarely ever ran the ubuntu installer. We just had sort of a base image with stuff pre-installed that we brought up machines with.
[0:24] <Tv|work> cmccabe: yup, re-using images is mostly a speedup
[0:25] <Tv|work> heh, that reply applies to both of your two last lines
[0:25] <Tv|work> both running next round of tests on the same image, and using a "golden master" image to clone
[0:25] <cmccabe> tv: I am really looking forward to being able to run tests while doing other work
[0:25] <Tv|work> autotest will definitely be taught to re-clone from a known good image
[0:26] <cmccabe> tv: the current vstart system is effectively one-per machine
[0:26] <Tv|work> run vstart in a vm?
[0:26] <cmccabe> tv: no effective way to install the code there
[0:26] <Tv|work> cmccabe: rsync..
[0:27] <cmccabe> tv: I thought autotest would handle that kind of setup work
[0:27] <Tv|work> sure, but it doesn't exist quite yet
[0:27] <cmccabe> :)
[0:28] <cmccabe> multiple tests at once isn't so important that I'm willing to do throwaway work to make it happen
[0:28] <cmccabe> but I'm just excited about seeing it happen, along with all the other good stuff
[0:31] <cmccabe> this test_unfound thing is really annoying... my bisect got confused because of an unrelated problem
[0:55] <Tv|work> yay found libvirt bug
[0:55] <Tv|work> don't clone the same vm to two destinations in parallel
[1:10] <Tv|work> http://ceph.newdream.net/git/?p=cheesy2.git;a=summary
[1:11] <Tv|work> clone url changed also
[1:11] <Tv|work> no more flak
[1:48] <Tv|work> cephbooter /images is 100% full
[1:49] <Tv|work> i would guess this has caused some trouble for the clusters involved
[1:51] <Tv|work> ceph & cosd both have 4.5GB of ceph logs, out of 28GB total for the /images partition
[1:51] <Tv|work> 22G /images/ceph-peon64
[1:51] * ajnelson (~Adium@dhcp-63-189.cse.ucsc.edu) Quit (Ping timeout: 480 seconds)
[1:52] <Tv|work> 11G /images/ceph-peon64/cosd/ceph4.2756
[1:53] <Tv|work> gregaf, sagewk, someone?
[1:53] <Tv|work> 11G /images/ceph-peon64/cosd/ceph4.2756/ffsb.sh/tmp
[1:53] <Tv|work> timestamps from dec 13..
[1:53] <Tv|work> ok that's gonna go unless somebody screams bloody murder within two minutes..
[1:54] <gregaf> I don't know how that system is set up at all
[1:54] <gregaf> (cephbooter et al, I mean)
[1:54] <gregaf> but I'm pretty sure the ffsb stuff is safe to delete, yes
[1:54] <Tv|work> there's a *lot* of gunk there
[1:55] <Tv|work> this is why i'm so stubborn about the "create fresh machines from scratch" thing ;)
[1:55] <cmccabe> tv: we should consider writing logs to a filesystem with transparent compression
[1:55] <gregaf> I don't think anybody's arguing with you about it
[1:55] <gregaf> just lazy ;)
[1:56] <cmccabe> tv: I think there was like a FUSE gzipfs developed that would be extremely effective on ASCII text
[1:56] <Tv|work> gregaf: hey i'm even doing the work! ideal for lazy
[1:56] <Tv|work> cmccabe: i'd hate to complicate something like that
[1:56] <Tv|work> cmccabe: copy logs off the machine, compress then..
[1:56] <cmccabe> tv: it's just a drop-in component. You do mount.gzipfs /foo /bar
[1:57] <Tv|work> cmccabe: autotest already has a "log-grabbing hook"..
[1:57] <cmccabe> tv: well, it's up to you. I'm just suggesting a way to save space.
[1:57] <Tv|work> oh btw the kvm cluster i'm hijacking some sepia nodes for is now called pseudomorph
[1:58] <cmccabe> tv: since logs are decompressed kind of infrequently
[1:58] <Tv|work> as in, the ink cloud cephalopods can squirt kinda looks like the animal itself, and confuses prey
[1:58] <Tv|work> totally applicable to virtual machines..
[1:58] <Tv|work> s/prey/predators/
[1:58] <cmccabe> tv: nice
[2:00] <cmccabe> tv: also you'll probably want to initialize /proc/sys/kernel/core_pattern to something sensible
[2:00] <cmccabe> tv: it starts off as just "core" which I think means if you get multiple processes that crash, the latest core file overwrites all the others
[2:01] <cmccabe> I have it set to /home/core/core.%s.%e.%p
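
To illustrate the core_pattern point above, here is a minimal sketch of how a template expands, assuming only the %s (signal number), %e (executable name), %p (PID) and %% specifiers. The function name expand_core_pattern is hypothetical, the kernel supports further specifiers that this sketch simply drops, and it ignores the separate kernel.core_uses_pid setting.

```python
def expand_core_pattern(pattern, signal, executable, pid):
    """Expand a core_pattern-style template (illustrative subset only)."""
    # Only these specifiers are modeled; anything else is dropped.
    values = {"s": str(signal), "e": executable, "p": str(pid), "%": "%"}
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            # A lone trailing '%' or an unmodeled specifier contributes nothing.
            if i + 1 < len(pattern):
                out.append(values.get(pattern[i + 1], ""))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Two different crashing processes under the bare "core" template
    # map to the same file name, so the later dump replaces the earlier one.
    print(expand_core_pattern("core", 11, "cosd", 2756))
    print(expand_core_pattern("core", 6, "cmon", 2801))
    # With the template from the log, each crash gets its own file name.
    print(expand_core_pattern("/home/core/core.%s.%e.%p", 11, "cosd", 2756))
```

The process names and PIDs in the usage lines are made up for the example.
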
[2:16] * Meths_ (rift@ has joined #ceph
[2:22] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[2:26] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Read error: Operation timed out)
[2:40] * Meths_ is now known as Meths
[2:55] * cmccabe (~cmccabe@ has left #ceph
[3:07] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:19] * pting (~pting@adsl-99-32-245-41.dsl.lsan03.sbcglobal.net) has joined #ceph
[3:20] * pting (~pting@adsl-99-32-245-41.dsl.lsan03.sbcglobal.net) has left #ceph
[3:20] * pting (~pting@adsl-99-32-245-41.dsl.lsan03.sbcglobal.net) has joined #ceph
[4:12] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[5:48] * pting (~pting@adsl-99-32-245-41.dsl.lsan03.sbcglobal.net) Quit (Quit: Leaving)
[6:08] * NoahWatkins (~NoahWatki@c-98-234-57-117.hsd1.ca.comcast.net) has joined #ceph
[6:59] * DeHackEd (~dehacked@dhe.execulink.com) Quit (Ping timeout: 480 seconds)
[7:35] * NoahWatkins (~NoahWatki@c-98-234-57-117.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[8:00] * Guest798 (quasselcor@ Quit (Remote host closed the connection)
[8:02] * bbigras (quasselcor@ has joined #ceph
[8:03] * bbigras is now known as Guest1671
[8:13] * allsystemsarego (~allsystem@ has joined #ceph
[8:16] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:20] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[10:10] * Yoric (~David@ has joined #ceph
[10:44] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[12:03] * jantje_ (~jan@paranoid.nl) has joined #ceph
[12:08] * jantje__ (~jan@paranoid.nl) has joined #ceph
[12:09] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[12:11] * jantje_ (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[15:27] * Yoric (~David@ Quit (Quit: Yoric)
[15:32] * Yoric (~David@ has joined #ceph
[15:57] <gregorg> Is Ceph stable enough for ~20GB and 2 servers ?
[17:11] * ajnelson (~Adium@dhcp-225-235.cruznetsecure.ucsc.edu) has joined #ceph
[17:26] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[17:38] * ajnelson (~Adium@dhcp-225-235.cruznetsecure.ucsc.edu) Quit (Quit: Leaving.)
[17:42] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:54] * ajnelson (~Adium@dhcp-225-235.cruznetsecure.ucsc.edu) has joined #ceph
[18:01] * joshd (~jdurgin@adsl-75-28-69-238.dsl.irvnca.sbcglobal.net) has joined #ceph
[18:03] * greglap (~Adium@ has joined #ceph
[18:10] <greglap> gregorg: depends what you mean by stable
[18:10] <greglap> Ceph runs very well on 2 servers and we have several people with 20GB of data in their installs (rather more than 20GB, actually)
[18:10] <greglap> but I wouldn't put it on a production system without backups
[18:15] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:20] <jantje__> hey guys
[18:21] * ajnelson (~Adium@dhcp-225-235.cruznetsecure.ucsc.edu) Quit (Quit: Leaving.)
[18:21] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[18:25] * sjusthm (~sam@ has joined #ceph
[18:28] <Tv|work> good morning
[18:28] <Tv|work> and fyi anyone using the sepia cluster, i'm repurposing sepia20 to pseudomorph01 Real Soon Now
[18:28] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[18:29] <greglap> sjust is probably the one that will want to know that
[18:29] <sjusthm> yup
[18:29] <Tv|work> sjust: ^
[18:29] <Tv|work> "hm"?
[18:29] <sagewk> tv: the fast kvm machines should be ready today too
[18:29] <sjusthm> but I'm using cosd at the moment :)
[18:30] <Tv|work> sagewk: in that case i might actually wait..
[18:30] <Tv|work> sagewk: i still have prep work i can do locally
[18:30] <sagewk> k
[18:30] <Tv|work> it's been good learning cephboot setup anyway ;)
[18:34] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:40] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:46] * Yoric (~David@ Quit (Quit: Yoric)
[18:47] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has joined #ceph
[18:54] * ajnelson (~Adium@dhcp-63-189.cse.ucsc.edu) has joined #ceph
[18:54] <sagewk> morning everyone! let's do 10:30 for standup meeting
[18:55] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[18:58] <Tv|work> yay a preannounced time so nobody needs to come poke me ;)
[18:59] <gregaf> I try and get people rounded up at 11 every day...
[18:59] <gregaf> but some of them are resistant
[19:00] <gregaf> </grumble>
[19:03] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[19:11] <wido> sagewk: To stick with the btrfs bugs, do you think this is Ceph related? http://pastebin.com/t9VAsNk5
[19:11] <sagewk> wido: don't think so
[19:11] <sagewk> but, it's hard to say :)
[19:12] <wido> sagewk: Ok, of course. I just saw it coming along. But I'll then head over to #btrfs
[19:12] <jantje__> hi guys
[19:13] <wido> hi jantje__
[19:14] <yehudasa> jantje__: did you get to test ino32?
[19:14] <jantje__> wido: i upgraded my cluster to all diskless nodes, 5x 2xOSDs + 1 MDS/OSD
[19:14] <wido> Ah, how is that working out for you?
[19:14] <jantje__> yehudasa: ow, i didn't know you already implemented that
[19:15] <jantje__> yehudasa: i'm currently on vacation, and i lose my wifi connection all the time, but i'll try to get my vpn going
[19:15] <jantje__> wido: great! I just hope I don't get any delays because of the slow NFS root
[19:15] <yehudasa> jantje__: oh, it can really wait till you get back
[19:15] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has left #ceph
[19:16] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:16] <jantje__> yehudasa: i'll see what I can do, It would be really great to get it working, so I finally can kick off a distributed build -)
[19:16] <jantje__> is it in testing?
[19:17] <yehudasa> oh, actually I put that change in unstable
[19:17] <jantje__> no problem
[19:18] <jantje__> sagewk: typo: monclinet: fix locking
[19:18] * Meths_ (rift@ has joined #ceph
[19:23] * Meths (rift@ Quit (Read error: Operation timed out)
[19:24] <wido> yehudasa: About the cephx issues, I'm still seeing those
[19:27] <wido> Somehow, and I don't know why, my 4 OSD's get "split" up, I have 2 OSD's which talk to each other, but not to the rest
[19:33] <yehudasa> wido: which version are you using?
[19:35] * Meths_ is now known as Meths
[19:37] <jantje__> yehudasa: i cant test today, but I was looking for the commit in unstable, which one is it? :)
[19:37] <wido> yehudasa: the latest unstable
[19:38] <wido> what I noticed, the OSD's get split up, cluster stalls. "rbd ls" works, but "rados -p rbd ls" spits errors like "failed to authorize client"
[19:38] <wido> restarting the OSD's fails, since one or more goes into Zombie state
[19:39] <wido> right now I'm downloading 1.1TB of data to a Qemu-RBD VM, that kept failing. Upgraded to the latest unstable today, see how that holds out
[19:40] <jantje__> wido: are you also using diskless nodes?
[19:42] <wido> jantje__: no, right now I'm back to one machine at the office
[19:42] <wido> I had to move my servers from DC to DC and 70% died in the process
[19:43] <wido> I'm in the process of getting budget for buying a good test cluster
[19:43] <wido> based on Atom CPU's for OSD's, see how that works out
[19:44] <jantje__> cool
[19:44] <jantje__> i currently have all quad core 2.6ghz with 4gb ram
[19:45] <yehudasa> wido: does it happen now, can I log in and see that?
[19:45] <jantje__> default corporate 1u servers, we have thousands of those :-)
[19:46] <wido> yehudasa: not yet, the rsync is running again. I'll keep an eye out and see if it happens again
[19:46] <wido> jantje__: Hehe, I'm lacking disks :( But I'm looking at: http://www.supermicro.com/products/motherboard/ATOM/ICH9/X7SPE-HF-D525.cfm
[19:46] <wido> Should be enough for 4 disks
[19:47] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[19:47] <wido> And you could store 44 in one 19" rack and not even use 16A at 230V
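
The rack figure above implies a per-board power budget, which can be worked out as a rough sketch. It assumes the full 16 A at 230 V is available to the 44 boards and treats volt-amperes as watts (power factor of 1); the function name is hypothetical and no measured power draw from the log is involved.

```python
def per_node_power_budget(amps, volts, nodes):
    """Return the average power (in watts) each node may draw on one circuit."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    # Total circuit capacity, split evenly across all nodes.
    return amps * volts / nodes


if __name__ == "__main__":
    budget = per_node_power_budget(16, 230, 44)
    print(f"{budget:.1f} W per node")
```

So staying under 16 A with 44 boards means each board, disks included, has to average below roughly 84 W.
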
[19:48] <jantje__> i'm wondering how good performance is compared to other cpu's
[19:51] <wido> jantje__: Me too, but I think it will be enough
[19:51] <wido> I'll order a few first to see how it works
[19:53] <jantje__> too bad supermicro doesn't have 4x LAN boards for atom cpus
[19:53] <wido> Yes, but it's the same with all their boards
[19:53] <wido> 2x LAN is just not enough
[19:54] <wido> and 10G is too expensive at the moment
[19:55] <jantje__> we have those LN4 boards for the core 2
[19:56] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[19:57] <wido> yes, they have some boards, but not all
[19:58] <wido> yehudasa: Yes, it is stuck again. But there are no cephx messages (yet)
[19:59] <wido> if you want, access logger.ceph.widodh.nl and then "ssh noisy"
[19:59] <wido> I see the btrfs bug again in the dmesg. And I think that is what's freezing the OSD's
[20:02] <yehudasa> yeah, the osd hangs due to btrfs bug
[20:02] <yehudasa> not really related to that old cephx problem
[20:02] <wido> yehudasa: True, but after a while I'm starting to get the cephx problems
[20:03] <wido> could it be because of keys which do not rotate anymore?
[20:03] <yehudasa> yeah, might be
[20:03] <wido> Ok, I'll leave it for now then and post this at the btrfs ml
[20:03] <yehudasa> probably the osd hangs at a point where it took some lock that prevents it from rotating
[20:13] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[20:20] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[20:29] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[20:31] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[20:55] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[20:56] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[20:57] <yehudasa> wido: we started implementing librbd
[20:59] <wido> yehudasa: that is great news!
[21:28] * fzylogic (~fzylogic@ has joined #ceph
[21:30] * dallas (~dallas@ has joined #ceph
[21:30] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[21:31] * dallas (~dallas@ has left #ceph
[21:31] * dallas (~dallas@ has joined #ceph
[21:34] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[21:46] * dallas (~dallas@ Quit (Read error: No route to host)
[21:51] * fzylogic_ (~fzylogic@70-0-245-189.pools.spcsdns.net) has joined #ceph
[21:53] <wido> yehudasa: If you have anything alpha for librbd, let me know, I'll start testing it and see if I can wrap something around it. A PHP extension and a nice web interface to manage your RBD images would be great
[21:56] * dallas (~dallas@ has joined #ceph
[21:58] * fzylogic__ (~fzylogic@ has joined #ceph
[21:58] * fzylogic (~fzylogic@ Quit (Ping timeout: 480 seconds)
[21:58] * fzylogic__ is now known as fzylogic
[22:05] * fzylogic_ (~fzylogic@70-0-245-189.pools.spcsdns.net) Quit (Ping timeout: 480 seconds)
[22:08] <yehudasa> wido: soon, but not yet
[22:16] <wido> yehudasa: Yes, no hurry. Just let me know when you got something
[23:33] * DJLee (82d8d198@ircip2.mibbit.com) has joined #ceph
[23:33] <DJLee> erm..
[23:34] <DJLee> referring to the current mailinglist atm, would the unstable (dated 20-jan) be more recent than 0.24.2?
[23:34] <DJLee> how'd I check out 0.24.2? I'm unsure how to use the tagging function in git..
[23:35] <DJLee> the current master log still shows 0.24, maybe just not updated..?
[23:36] <yehudasa> the unstable branch is always more recent than our releases
[23:37] <DJLee> cool~
[23:37] <DJLee> :)
[23:37] <sagewk> and the master branch is currently not updated.. fixing
[23:37] <yehudasa> to checkout: you can do something like 'git checkout v0.24.2'
[23:37] <sagewk> i also just updated master, so you can 'git pull'
[23:37] <yehudasa> git fetch
[23:37] <yehudasa> git checkout master
[23:38] <yehudasa> git pull
[23:38] <DJLee> argh, thats how i use the tag :)
[23:38] <DJLee> yep, i also agree with what Greg and Jim suggested,
[23:38] <DJLee> I haven't had any good experience with the master version somehow.. :(
[23:39] <DJLee> I think current unstable dated 20-jan is working so far great, for performance.
[23:40] <DJLee> and about the btrfs hanging in recent post, maybe that's the one I also noticed back when i posted the long 0.23.2 consolidated question, heh
[23:41] <DJLee> atm i've got some (another) largish benchmark (based on unstable 20-jan), and i'll post that up soon.
[23:43] <DJLee> about the current object chunk being 4mb, so this means that any file smaller than 4mb is guaranteed to have its own obj block, correct?
[23:49] <yehudasa> DJLee: right
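
The exchange above can be sketched as a small calculation, assuming the simplest layout: each file is cut into consecutive fixed-size objects and "4mb" means 4 MiB. The constant and function names are hypothetical, and custom striping settings are not modeled.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def object_count(file_size, object_size=OBJECT_SIZE):
    """Return how many objects a file of file_size bytes occupies."""
    if file_size < 0 or object_size <= 0:
        raise ValueError("sizes must be non-negative and object_size positive")
    # Ceiling division: a partially filled last object still counts as one.
    return -(-file_size // object_size)


if __name__ == "__main__":
    for size in (1, 3 * 1024 * 1024, OBJECT_SIZE, OBJECT_SIZE + 1):
        print(size, object_count(size))
```

Under these assumptions any non-empty file up to the object size maps to exactly one object of its own, and objects are never shared between files.
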
[23:56] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:58] <bchrisman> curious about this error message. I'm certain it's a misconfiguration: OSD id 8 != my id 0
[23:58] <bchrisman> failed: ' /usr/bin/cosd -i 0 -c /etc/ceph/ceph.conf '
[23:59] <bchrisman> that's what it says while trying to startup cosd for osd0… not sure how it's getting '8'

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.