#ceph IRC Log


IRC Log for 2011-02-25

Timestamps are in GMT/BST.

[0:00] <darkfader> can't judge on virtio, i didn't get to try it yet. it should not have much overhead (vmware said the same thing about their pv disk drivers though :)
[0:01] <Tv> yeah i don't see anything really horrible and the boxes are perfectly interactive still
[0:05] <prometheanfire> virtio disk I get 95% native
[0:05] <prometheanfire> network is better than native :D
[0:22] <cmccabe> I have to go get my car repaired, will be back shortly
[0:22] <cmccabe> new C++ api posted btw
[0:22] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has left #ceph
[0:27] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[0:37] * verwilst (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[0:44] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[0:46] <Tv> ok there's some funky things going on..
[0:46] <Tv> 2011-02-24 15:34:57.395916 7fb1af95e700 client4109 ms_handle_reset on
[0:47] <Tv> never mind the actual message, how did it come up with client4109?
[0:47] <Tv> that's cfuse ... --name=client.0 ...
[1:26] <yehudasa> Tv: client4109 is the unique client instance number
[1:26] <Tv> yehudasa: so effectively random?
[1:26] <yehudasa> not random at all
[1:27] <yehudasa> it's assigned by the monitor during authentication
[1:27] <Tv> well, dynamic in the sense that it doesn't match the key name or any such configuration value
[1:28] <yehudasa> right..
[1:40] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has joined #ceph
[1:58] * Jiaju (~jjzhang@ Quit (Ping timeout: 480 seconds)
[2:00] <sagewk> for librbd:
[2:00] <sagewk> Image *image_open(pool_t pool, const char *name);
[2:00] <sagewk> Image *image_open(pool_t pool, const char *name, const char *snapname);
[2:00] <sagewk> whereas the librados C++ one returns int for the error code and passes a Pool *, or something like that
[2:00] <sagewk> we should pick one approach and stick with it
[2:02] <sagewk> it should also use the new PoolHandle types (not librados::pool_t)?
[2:02] <joshd> yeah, colin's C++ changes are in another branch for now
[2:02] <cmccabe> sagewk: in fairness, I haven't merged that into his branch yet
[2:03] <cmccabe> I need to finish converting all the users
[2:03] <sagewk> yeah
[2:03] <Tv> returning pointers is worse because it allows only boolean success/fail, no extra info
[2:03] <Tv> not that errno would be *that* much, but still
[2:03] <cmccabe> well, I need to start converting all the users, now that we've agreed on the new api
[2:03] <sagewk> EPERM vs ENOENT. yeah
[2:03] <cmccabe> well, there is another way...
[2:03] <cmccabe> some say that it's a dark and evil way...
[2:03] <cmccabe> but it does exist...
[2:03] <Tv> no &err as arg please
[2:04] <joshd> I'm fine with returning ints instead
[2:04] <cmccabe> you return a small number as a pointer as an error
[2:04] <sagewk> ERR_PTR?
[2:04] <Tv> cmccabe: oh that evil.. yeah, please no
[2:04] <cmccabe> yes, that
[2:04] <sagewk> &ptr is fine i think
[2:04] <cmccabe> actually... as a life-long C programmer... I like ERR_PTR
[2:04] <sagewk> me too :)
[2:04] <cmccabe> but the other way might be less confusing to some
[2:04] <sagewk> yeah
[2:05] <sagewk> also: do we need Image::close(), or is that implicit in ~Image()?
[2:05] <sagewk> and probably the constructor should be private (only get an Image* from image_open())
[2:05] <sagewk> otherwise it looks sane
[2:06] <Tv> sagewk: can it fail?
[2:06] <sagewk> yes
[2:06] <Tv> then it should be explicit
[2:06] <sagewk> oh.. image_open() naming isn't consistent with list, create, remove, etc. they should all be image_ or none...
[2:06] <joshd> yeah, that's why I made them explicit instead of constructor/destructor
[2:07] <sagewk> close can't fail..
[2:07] <sagewk> and what's the point of new Image() if you open via image_open()? are we worried about new/delete mismatch in library vs user?
[2:07] <sagewk> (those can be explicitly set for the class in c++)
[2:08] <joshd> true, I guess we could remove close
[2:09] <cmccabe> a lot of times I like to have an explicit init/open/startup/whatever, but have the destructor call shutdown
[2:09] <cmccabe> usually it ends up being something like
[2:09] <cmccabe> if (not already shutdown) shutdown()
[2:09] <sagewk> at the very least ~Image would need to call close(). but then i'm not sure why you could close() without ~Image, since you can't reopen
[2:09] <cmccabe> that way you get the benefits of RAII, but you can still have your open function return error codes and so forth
[2:10] <cmccabe> another way is to have a factory method that creates the class, and have that return the error codes
[2:10] <cmccabe> then the constructor can be some private, trivial constructor that just copies arguments
[2:10] <sagewk> that's RBD::image_open()
[2:11] <cmccabe> factory methods constrain you to use heap allocation, but that's usually not a big deal
[2:13] <joshd> I wasn't considering a new/delete mismatch, but I guess that could be a problem - we do both alloc and free of the image this way
[2:14] <cmccabe> I think if you're trying to use the C++ API and there is a libc mismatch, it's probably game over
[2:15] * WesleyS (~WesleyS@ has joined #ceph
[2:15] <cmccabe> inline functions and templates probably ensure that. I mean half of bufferlist is in a header.
[2:15] * WesleyS (~WesleyS@ Quit ()
[2:15] <cmccabe> The C API is the one that could realistically be expected to work with a libc mismatch
[2:17] <cmccabe> I guess you could have a destruction function, which was an opposite to the factory function
[2:19] <joshd> the C api already works that way - open and close call new and delete for the image
[2:22] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:32] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[2:36] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:46] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:57] * MK_FG (~MK_FG@ Quit (Ping timeout: 480 seconds)
[3:06] * MK_FG (~MK_FG@ has joined #ceph
[5:24] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[7:45] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) Quit (Ping timeout: 480 seconds)
[8:22] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[9:38] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:38] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:52] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[9:52] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has left #ceph
[9:59] * allsystemsarego (~allsystem@ has joined #ceph
[10:49] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[11:06] * Yoric (~David@ has joined #ceph
[11:11] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[11:12] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[13:02] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[13:33] * squig (~bendeluca@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[13:45] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[14:02] * Yoric_ (~David@ has joined #ceph
[14:02] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[14:02] * Yoric_ is now known as Yoric
[14:05] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[14:08] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[14:16] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[14:31] * Yoric (~David@ has joined #ceph
[14:35] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:35] * squig (~bendeluca@soho-94-143-249-50.sohonet.co.uk) Quit (Read error: Connection reset by peer)
[14:41] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[15:06] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (Server closed connection)
[15:06] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[15:16] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[15:30] * greglap (~Adium@ has joined #ceph
[15:31] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[15:39] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[15:50] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[15:56] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:15] * greglap (~Adium@ Quit (Quit: Leaving.)
[17:01] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[17:29] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[17:54] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:00] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[18:01] * Yoric (~David@ has joined #ceph
[18:14] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has joined #ceph
[18:37] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:52] * Yoric (~David@ Quit (Quit: Yoric)
[19:10] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:20] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:38] <bchrisman> are pjd fstests currently included in your automated testing? I've just hooked them into our testing framework here and they return a different set of failures from run to run.
[19:38] <bchrisman> (generally numbering only about 6-7)
[19:38] <gregaf> bchrisman: we're still setting up our automated testing :/
[19:38] <gregaf> the last time we ran them manually they all passed, but I'm not sure how long it's been
[19:38] <cmccabe> tv did say he was running the default autotest suite on ceph
[19:38] <gregaf> (they are part of our test suite, it's just not very regular...)
[19:38] <gregaf> which tests are failing?
[19:38] <cmccabe> are pjd fstests part of that?
[19:38] <gregaf> and what's the setup?
[19:39] <Tv> i ran bonnie on cfuse yesterday, it practically hung
[19:39] <bchrisman> they're linked within the qa workunits directory, yeah.
[19:39] <Tv> somebody gets to debug at some point :-/
[19:39] <cmccabe> I remember you were saying the VM performance was weird
[19:39] <Tv> well, now i'm starting to believe it was a ceph bug
[19:39] <Tv> as the boxes were too idle for too long
[19:40] <gregaf> the workunits on cfuse are a bitch and there's not a lot we can do about it except 1) rewrite cfuse, and then 2) rewrite FUSE
[19:40] <Tv> yeah
[19:40] <Tv> next up for me: 1) moar tests 2) kernel client
[19:40] <Tv> cfuse was just easier to get going
[19:40] <gregaf> bchrisman: which tests are failing, and on what setup?
[19:40] <bchrisman> gregaf: I'll track down some of the errors and then see which ones are recreated reliably.
[19:40] <gregaf> okay
[19:41] <cmccabe> there are performance tweaks for fuse, like setting read size and write size
[19:41] <gregaf> we had them working not too long ago and I don't think we've done anything which would break them, but obviously something did
[19:41] <bchrisman> chmod.00.t27 is reliable… (among others)
[19:41] <bchrisman> but i'll check into it more.
[19:41] <Tv> i'm more interested in compliance than performance, right now
[19:41] <gregaf> ugh, I hate how these tests aren't numbered in the source
[19:42] <gregaf> bchrisman: do you know if that's "expect 0644 stat ${n1} mode" or "expect 0 unlink ${n0}"?
[19:42] <Tv> just wrote a wrapper for running fsx, waiting for it to run..
[19:45] <bchrisman> gregaf: that is the first failure yes.
[19:46] <bchrisman> gregaf: … which must mean that particular one is expected or indicative of a misconfiguration? :)
[19:52] <gregaf> sorry, got distracted
[19:52] <gregaf> no, I was just counting the "expect"s to see which one was test 27, and there's a branch so I'm not sure which one it is
[19:53] <bchrisman> ahh ok yeah… that's what I did too
[19:53] <bchrisman> grep expect … | grep -n expect heh :)
[19:55] <gregaf> unfortunately I don't think that'll work?
[19:55] <gregaf> because some of the expects are in branches based on available fs features
[19:55] <bchrisman> yeah.. it's a mess..
[19:55] <gregaf> thus my lamenting the lack of in-source numbering :(
[19:55] <bchrisman> approximations only :)
[19:55] <bchrisman> annoying yeah
[19:57] <bchrisman> I can run it with set -x to be certain… will do after current tests are finished
[19:58] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:00] <gregaf> cool
[20:29] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:29] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[20:32] <wido> Are you still seeing those blocking PG's in your dev setup?
[20:33] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[20:33] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:35] <sagewk> wido: my problem is fallout from an old bug
[20:36] <wido> Ok, I've been searching in the logs and found "unaffected with [4,5]/[4,5] up/acting" about the PG's which stay "active"
[20:38] <wido> Couldn't find those log entries about PG's which are active+clean
[20:39] * verwilst (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[20:41] <sagewk> what i usually do to debug these cases is rotate the logs and then restart both osds for the given pg so that we have the full repeering/recovery process in a more concise log.
[20:41] <sagewk> that's assuming it happens every time.. if it doesn't, then we need to go back to the old logs and figure out how it got stuck before
[20:42] <wido> Ok, tnx, I'll try. Will give me some data to practice on :)
[21:11] <cmccabe> ok, compiled rgw_admin with the new C++ API
[21:15] <wido> Oh, yeah, I see you guys are working on OpenStack integration/compatibility?
[21:15] <sagewk> yeah
[21:16] <sagewk> it's not clear exactly what the final picture will look like, but that's what we're looking at
[21:17] <wido> I've been looking into OpenStack too, tried to find out if RBD integration was possible, but that didn't seem that trivial
[21:17] <wido> It's leaning very heavily on the S3 backing storage; the VM gets copied to the local node and started from those disks
[21:17] <wido> The S3 is only used for long term storage of the VM, not real-time
[21:51] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[22:19] <Tv> wido: yeah the ec2 model.. but then, what's the equivalent of EBS?
[22:19] <Tv> *that's* something rbd would excel at (and then, perhaps you can start booting off of this "ebs replacement")
[22:20] <sagewk> rbd is in there in some form now.. ask joshd!
[22:21] <joshd> wido: we added rbd support for openstack volumes, which are like EBS
[22:23] <joshd> you're right that it would take more invasive changes to boot directly from rbd
[22:26] <wido> joshd: That sounds great, where did you add the support? OpenStack also uses libvirt in the background, so that was the easy part?
[22:27] <joshd> yeah, it's in nova as a volume driver, iirc
[22:27] <wido> ok, but to get it clear. The VM doesn't run directly from RBD yet?
[22:28] <joshd> correct
[22:28] <Tv> even if it forces you to boot off the local, a small initrd could mount the rbd volume as /
[22:28] <Tv> ugly but possible
[22:28] <wido> Tv: Yes, but you still have to get the RBD declaration in libvirt, to attach the disk
[22:29] <wido> or you want to setup RBD inside the VM?
[22:29] <wido> that would be ugly :) But OpenStack and Eucalyptus both do the same, they run the VM locally, imho that is bad, for performance and data safety
[22:30] <wido> Don't know if Amazon does it too, but who knows
[22:30] <wido> ?
[22:31] <joshd> last I looked, openstack did a bunch of local preparation for guests too, like key injection, by mounting the local images
[22:31] <sagewk> iirc its somewhat recently they added support for booting off ebs
[22:32] <cmccabe> In ec2, it used to be that the boot volume went away when you shut down the instance
[22:32] <cmccabe> they introduced a way to have the boot volume be on "permanent" storage recently, but according to an ars technica review (?) it's not perfect yet
[22:33] <wido> I never worked with EC2, so I don't know.
[22:33] <cmccabe> yeah, booting off ebs was the feature
[22:33] <wido> When OpenStack supports booting from EBS, RBD should then become easier to implement
[22:33] <wido> how is the I/O performance of EC2 btw, I heard it was horrible?
[22:34] <cmccabe> it varies a lot
[22:34] <cmccabe> the reality is that you are on a shared box running Xen
[22:34] <cmccabe> so if your neighbors aren't doing much with the disk you'll be fine, otherwise, not so much.
[22:35] <wido> that is always the trade-off with shared systems. But still, EC2 is used for computing power most of the time, not disk I/O
[22:36] <cmccabe> it was always hard to get actual specs out of amazon
[22:36] <cmccabe> they finally started saying what model of CPU was used on some of the higher end cluster, but the lower-end ones were all expressed in terms of "compute units" or some such imaginary thing
[22:37] <wido> I think they buy the CPUs which are cheapest at that moment, not sticking to one model probably
[22:38] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[22:39] <cmccabe> we also tended to use s3 a lot, and only got like 10 MB/s to that usually
[22:40] <cmccabe> we used EBS sometimes, but it was flaky. Like once, we tried to unmount an EBS volume and the command just hung. Then we couldn't access that volume for like 24 hours or something
[22:42] * alexxy (~alexxy@ has joined #ceph
[22:42] <wido> That's why it's used for dynamic capacity expansion of a cluster, not for real production I think
[22:42] <wido> But OpenStack and booting from EBS, I'm searching around and can't find any reference to it
[22:50] <Tv> i hear people saying they run RAID-1 over EBS to get the IO they need
[22:50] <Tv> which kinda tells you how bad it sucks ;)
[22:50] <darkfader> Tv: in germany we have this joke about tapeless backup, i think it also applies to raid1 over EBS
[22:50] <darkfader> two guys are falling off a huge skyscrapers roof
[22:51] <darkfader> after a while one says to the other
[22:51] <darkfader> "look, them idiots. 100ft and NOTHING went wrong"
[22:52] <wido> Tv: you mean RAID-0?
[22:52] <Tv> wido: err, yes i do
[22:54] <wido> hehe, ok.
[22:54] <wido> I'm going afk, ttyl
[22:55] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[22:55] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[22:56] <Tv> bah autotest is hanging in cleanup after my job
[22:57] <cmccabe> yeah, RAIDing ebs was considered a thing to do
[22:57] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[22:58] <cmccabe> I don't really know why... you would expect that at some point you'd be constrained by the bandwidth of the network interface and it wouldn't matter
[22:58] <cmccabe> but apparently ebs' actual performance was far below that level
[23:01] * alexxy (~alexxy@ has joined #ceph
[23:02] <Tv> IOError: [Errno 28] No space left on device
[23:02] <Tv> hrmph
[23:02] <Tv> and that would be why autotest is hanging
[23:29] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[23:33] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.