#ceph IRC Log


IRC Log for 2010-07-29

Timestamps are in GMT/BST.

[5:36] <MarkN1> patch for ceph.spec.in to rename rbdtool to rbd, also packages libcls_sync.so* http://pastebin.org/427013
[5:38] * eternaleye_ (~quassel@184-76-75-253.war.clearwire-wmx.net) has joined #ceph
[5:39] <sage> markn1: what's libcls_sync?
[5:40] <MarkN1> honestly not sure, but rpmbuild was not building because it was present but not in the files section
[5:41] <MarkN1> I am having a look now to see where it comes from
[5:41] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) Quit (Ping timeout: 480 seconds)
[5:51] <sage> oh.. cls_sync is new :)
[5:53] <MarkN1> is that the code in src/cls_sync.cc ?
[5:53] <sage> yeah
[5:54] <MarkN1> yeah I just confirmed by trying to rebuild and it fails if the libcls_sync stuff is not in place
[5:55] <MarkN1> btw this is for unstable as of about 2 hours ago
[5:57] <sage> ok cool.
[5:57] <sage> the rbd bit will go into 0.21, libcls_sync for 0.22
[5:59] <MarkN1> is .21 going to drop the same time as 2.6.35 ?
[6:00] <sage> tonight
[6:00] <sage> if i can get these lintian issues fixed :)
[6:01] <MarkN1> excellent, will rebuild over the weekend and play around with it
[6:22] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[6:24] * akhurana (~ak2@c-98-232-30-233.hsd1.wa.comcast.net) has joined #ceph
[6:43] * eternaleye_ (~quassel@184-76-75-253.war.clearwire-wmx.net) Quit (Ping timeout: 480 seconds)
[6:47] * allsystemsarego (~allsystem@ has joined #ceph
[7:12] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) has joined #ceph
[7:19] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[7:19] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[8:13] * mtg (~mtg@vollkornmail.dbk-nb.de) has joined #ceph
[9:06] <wido> sage: you still here? i see osd18 is running again, but can't find a commit in the unstable to fix this
[9:14] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) Quit (Ping timeout: 480 seconds)
[9:22] * eternaleye (~quassel@ has joined #ceph
[10:00] * eternaleye (~quassel@ Quit (Ping timeout: 480 seconds)
[10:30] * akhurana (~ak2@c-98-232-30-233.hsd1.wa.comcast.net) Quit (Quit: akhurana)
[12:39] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) has joined #ceph
[14:02] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[15:52] * f4m8 is now known as f4m8_
[16:35] * mtg (~mtg@vollkornmail.dbk-nb.de) Quit (Quit: Verlassend)
[16:58] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) has joined #ceph
[19:08] * fzylogic (~fzylogic@dsl081-243-128.sfo1.dsl.speakeasy.net) Quit (Quit: fzylogic)
[19:41] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has left #ceph
[19:41] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:39] <sagewk> wido: yeah, it was fine when i restarted it.
[21:02] <wido> sagewk: ok, cool. Any idea what caused it? I saw the crash again on another node, restart fixed it indeed, but did you get the root cause?
[21:05] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) has joined #ceph
[21:32] <sagewk> i'll probably need the full osd log on the crashed node _and_ the primary osd for whichever pg it was.. which basically means turning up all the logs on the osd cluster, i think.
[21:36] <wido> i'm not such a hero with C (yet!), but i'm seeing an issue with the rados gateway. Now i'm trying to find out what the content of an attribute is, but it's still giving me garbage: http://www.pastebin.org/428468
[21:36] <wido> (i'm used to PHP and Java)
[22:05] <wido> sagewk: i've found one issue with IPv6. Where an IPv4 address becomes available whether the link is up or not, IPv6 does not, might be something with the duplicate address detection. This causes the IPv6 address not to be available on boot, so a mon, mds or osd can't bind to IPv6 and then fails. Tried changing the start-up priority, didn't help either. Checking my kernel logs i see it takes about 10 seconds before the IPv6 address is there
[22:05] <wido> might a short loop where you try to bind be an idea? Bind? If it fails, sleep 3 seconds, try again (3 times)
[22:06] <sagewk> wido: that's pretty strange
[22:08] <wido> http://www.pastebin.org/428614 the syslog. It only takes 2 seconds for the link to come up and IPv6 to become available
[22:08] <wido> but just before that (21:58:59) the mon tried to start and bind, which failed
[22:09] <wido> "10.07.29_21:58:59.449246 7f6c75e0d720 -- :/0 accepter.bind unable to bind to [2001:16f8:10:2::c3c3:2e5c]:6789: Cannot assign requested address"
[22:10] <sagewk> that sounds like something that should be handled by the startup system (don't start services until network is fully configured)...
[22:11] <sagewk> e.g., the init script has # Required-Start: $remote_fs $named $network $time
[22:11] <sagewk> but the archaic sysv style initsystem debian is using probably doesn't wait for ipv6 links to come up?
[22:13] <wido> no, it simply starts. upstart might be useful here, but as far as i know, the link handling is done by the NIC driver
[22:14] <wido> Debian is switching to upstart too, but this will be in Squeeze
[22:16] <sagewk> this seems like something any service binding to an ipv6 address would have.
[22:16] <sagewk> well, i guess most things bind to :: and not a specific ip.. presumably that is ok before the link comes up?
[22:16] <wido> i'll give Apache a try, see what that does. But for now it's not that of a problem
[22:16] <wido> yes, indeed, because the link-local address is already online
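The retry idea wido floats at [22:05] (try to bind, on failure sleep 3 seconds, try again, 3 times in total) could look roughly like the following. This is only a minimal sketch and not how the Ceph messenger is implemented: the names bind_with_retry and bind_ipv6_with_retry are hypothetical, and the choice to retry only on EADDRNOTAVAIL (the errno behind "Cannot assign requested address" in the mon log line above) is an assumption.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: call bind() up to `attempts` times, sleeping
 * `delay_sec` seconds between tries. Only EADDRNOTAVAIL is retried,
 * because that is the error seen while the address is not yet usable.
 * Returns 0 on success, -1 on failure with errno set. */
int bind_with_retry(int fd, const struct sockaddr *addr, socklen_t len,
                    int attempts, unsigned delay_sec)
{
    if (attempts <= 0) {
        errno = EINVAL;
        return -1;
    }
    for (int i = 0; i < attempts; i++) {
        if (bind(fd, addr, len) == 0)
            return 0;
        if (errno != EADDRNOTAVAIL)
            return -1;              /* a different error: retrying will not help */
        if (i + 1 < attempts)
            sleep(delay_sec);       /* wait for the address to become usable */
    }
    errno = EADDRNOTAVAIL;          /* sleep() may have clobbered errno */
    return -1;
}

/* Hypothetical convenience wrapper: parse a textual IPv6 address and
 * bind to it on the given port, using the retry loop above. */
int bind_ipv6_with_retry(int fd, const char *ip, unsigned short port,
                         int attempts, unsigned delay_sec)
{
    struct sockaddr_in6 sa;

    memset(&sa, 0, sizeof(sa));
    sa.sin6_family = AF_INET6;
    sa.sin6_port = htons(port);
    if (inet_pton(AF_INET6, ip, &sa.sin6_addr) != 1) {
        errno = EINVAL;             /* not a valid IPv6 address string */
        return -1;
    }
    return bind_with_retry(fd, (const struct sockaddr *)&sa, sizeof(sa),
                           attempts, delay_sec);
}
```

With the numbers from the discussion, a caller would use something like bind_ipv6_with_retry(fd, "2001:16f8:10:2::c3c3:2e5c", 6789, 3, 3). As sagewk points out, making the init system wait for the network may be the cleaner fix; the loop only papers over the startup ordering.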
[23:02] <wido> where should i report a bug with qemu-kvm (rbd)? there is no project in the tracker
[23:06] <yehudasa> wido: report it on the rbd project
[23:07] <wido> ok, i think it's small, it also affects "rbd"
[23:08] <yehudasa> wido: about that garbage
[23:09] <yehudasa> there are a few things wrong in your C cod
[23:09] <yehudasa> C code
[23:10] <wido> yes, i expected...
[23:10] <yehudasa> sizeof(buf) will give you the architecture pointer size, so it's either 4 or 8, but not what you expected
[23:10] <yehudasa> so you should probably #define BUF_LEN 128 or something like that
[23:10] <yehudasa> then allocate, buf = malloc(BUF_LEN);
[23:11] <yehudasa> also, you allocate xbuf, but never assign any value into it
[23:11] <yehudasa> so you'd get garbage there anyway
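The two points yehudasa makes (sizeof on a pointer yields the pointer size, and an allocated buffer holds garbage until something is written into it) can be illustrated with a short sketch. wido's pasted code is not part of this log, so this is not a fix of it; copy_attr is a hypothetical helper, and BUF_LEN 128 is just the value suggested above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUF_LEN 128

/* Hypothetical helper: return a freshly allocated, NUL-terminated copy
 * of `value`, truncated to fit in BUF_LEN bytes. The caller must free()
 * the result. Returns NULL if the allocation fails. */
char *copy_attr(const char *value)
{
    assert(value != NULL);

    /* calloc zero-fills the memory, so the buffer never holds garbage,
     * even before anything is copied into it. */
    char *buf = calloc(1, BUF_LEN);
    if (buf == NULL)
        return NULL;

    /* sizeof(buf) would be the size of a char * (4 or 8 bytes), not the
     * size of the allocation, so the length has to come from BUF_LEN. */
    size_t n = strlen(value);
    if (n >= BUF_LEN)
        n = BUF_LEN - 1;
    memcpy(buf, value, n);
    buf[n] = '\0';
    return buf;
}
```

Had buf been declared as an array (char buf[BUF_LEN];), sizeof(buf) would have been 128; it is only for pointers to malloc'ed memory that the size must be tracked separately.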
[23:12] <wido> ah, that's pretty new when you come from Java or PHP, there is all that stuff "done for you"
[23:12] <wido> gave librados.hpp a try, was much easier, but also, C++ does more for you
[23:13] <wido> problem i see with the radosgw is that you can't modify the ACL of a bucket, but you can modify the ACL of an object, i'm looking into why. So i wanted to dump the value of the acl attribute to see if it gets modified
[23:14] <yehudasa> you can find out the actual object on the osd storage
[23:14] <wido> my log shows there is a new policy (xml dump shows the new grants), but it seems it isn't written to the object
[23:14] <yehudasa> oh, nevermind.. actually you'd get garbage, so forget it
[23:16] <yehudasa> it was working for me last week, so it should probably work
[23:16] <yehudasa> if it doesn't it's a new bug
[23:20] * paunchy (~jreitz@dsl253-098-218.sfo1.dsl.speakeasy.net) Quit (Quit: leaving)
[23:22] <yehudasa> wido: regarding #322, what do you get when running 'ceph class list'?
[23:24] <wido> yehudasa: http://www.pastebin.org/428986
[23:26] <yehudasa> wido: have you reinstalled your osds lately?
[23:26] <yehudasa> might be that you have a noexec mount option on /tmp?
[23:27] <wido> no, no re-install or noexec on /tmp, but could i track down on which OSD it fails?
[23:28] <yehudasa> are you installing from the debian package?
[23:29] <wido> no, building my own with dpkg-buildpackage from the unstable source
[23:29] <wido> not all the OSD's are running the exact same binary, could that be it?
[23:29] <yehudasa> you can add "debug ms = 1" in your ceph.conf on the kvm machine
[23:30] <yehudasa> I don't think it should really matter in this case
[23:30] <wido> ok, weird, somehow it just started to work?
[23:30] <wido> didn't touch anything
[23:31] <yehudasa> it sometimes takes time to propagate
[23:31] <yehudasa> or maybe you used a different object name?
[23:31] <wido> yes, but there was a long time between the times i tried
[23:31] <wido> no, same object name, even tried different pools
[23:32] <yehudasa> do you have /var/lib/ceph/tmp on the osd machines, btw?
[23:34] <wido> yes, on all the OSD's
[23:39] * anthony (~anthony@dhcp-v021-174.mobile.uci.edu) has joined #ceph
[23:40] <anthony> hello, i need some help with making a custom crushmap, can someone take a look at my configuration?
[23:41] <anthony> om
[23:41] <wido> anthony: do you have it only somewhere?
[23:42] <anthony> somewhere?
[23:43] <wido> uh, online
[23:43] <wido> pastebin or something
[23:43] <anthony> oh okay
[23:43] <anthony> one sec
[23:44] <anthony> http://pastebin.com/ZKNhGR1N
[23:45] <anthony> i'm getting a parse error with '# hosts'
[23:46] <wido> did you compile it? crushtool -c crushmap.txt -o crushmap
[23:46] <wido> you can't load the plain-text crushmap
[23:47] <wido> oh wait
[23:47] <wido> host host0 [
[23:47] <wido> that should be: host host0 {
[23:48] <anthony> ahhh
[23:48] <anthony> i see
[23:48] <anthony> -.-
[23:48] <anthony> dumb mistake argh
[23:48] <anthony> thanks a lot wido
[23:48] <wido> np
[23:48] <wido> i'm going afk, ttyl!
[23:49] <anthony> bye!

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.