#ceph IRC Log


IRC Log for 2012-06-22

Timestamps are in GMT/BST.

[0:00] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[0:06] <yehudasa_> joao: whoa, semifinals
[0:06] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[0:20] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:27] * brambles (brambles@ Quit (Quit: leaving)
[0:27] * brambles (brambles@ has joined #ceph
[0:27] * brambles_ (brambles@ has joined #ceph
[0:34] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[0:44] * lofejndif (~lsqavnbok@659AAA4AD.tor-irc.dnsbl.oftc.net) has joined #ceph
[0:48] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[0:50] <mgalkiewicz> hi my mon server has new ip address and when I try to start mon I got WARNING: 'mon addr' config option x.x.x.x:6789/0 does not match monmap file
[0:50] <mgalkiewicz> it tries to use old address and fails to start
[0:54] <mgalkiewicz> how to update such address in monmap?
[0:54] <tv_> mgalkiewicz: i think you need to handle that as removal+add, monitors aren't supposed to change addresses
[0:55] * The_Bishop (~bishop@2a01:198:2ee:0:1442:c2de:444c:a36a) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[0:58] <mgalkiewicz> tv_: ok I will try thx
[0:59] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) Quit (Remote host closed the connection)
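The removal+add approach tv_ suggests can be sketched roughly as below; the mon name `a`, the placeholder address, and the monmap path are all illustrative, so check the current Ceph docs before running anything like this against a real cluster:

```shell
# Grab the monmap (from a working mon, or the stopped mon's data dir),
# drop the entry carrying the stale address, and re-add it with the new one.
monmaptool --print /tmp/monmap                      # inspect existing entries
monmaptool --rm a /tmp/monmap                       # remove mon.a (old address)
monmaptool --add a <new-ip>:6789 /tmp/monmap        # re-add with the new IP

# Inject the edited map into the (stopped) monitor, then start it again.
ceph-mon -i a --inject-monmap /tmp/monmap
```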
[1:01] * cattelan_away_away_away is now known as cattelan
[1:02] <elder> So cattelan is apparently really really really here now.
[1:05] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[1:06] * brambles (brambles@ Quit (Quit: leaving)
[1:07] * gregaf (~Adium@aon.hq.newdream.net) Quit (Read error: Connection reset by peer)
[1:10] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[1:10] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Read error: Operation timed out)
[1:14] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[1:20] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[1:30] * The_Bishop (~bishop@e179010000.adsl.alicedsl.de) has joined #ceph
[1:40] <dmick> I dunno, it's not cattelan_here_here_here
[1:54] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:55] * loicd (~loic@magenta.dachary.org) has joined #ceph
[2:05] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[2:05] * loicd (~loic@magenta.dachary.org) has joined #ceph
[2:11] * tv_ (~tv@aon.hq.newdream.net) Quit (Quit: tv_)
[2:21] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[2:21] * loicd (~loic@magenta.dachary.org) has joined #ceph
[2:30] * joao (~JL@ Quit (Quit: Leaving)
[2:38] * yehudasa_ (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[3:08] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[4:21] * Ryan_Lane1 (~Adium@dslb-092-078-141-030.pools.arcor-ip.net) has joined #ceph
[4:24] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:27] * Ryan_Lane (~Adium@dslb-094-223-088-003.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[4:30] * mosu001 (~mosu001@en-439-0331-001.esc.auckland.ac.nz) has left #ceph
[4:33] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:36] * OutBackDingo (~quassel@rrcs-71-43-84-222.se.biz.rr.com) Quit (Remote host closed the connection)
[4:37] * OutBackDingo (~quassel@rrcs-71-43-84-222.se.biz.rr.com) has joined #ceph
[4:38] * OutBackDingo_ (~quassel@rrcs-71-43-84-222.se.biz.rr.com) has joined #ceph
[4:38] * OutBackDingo (~quassel@rrcs-71-43-84-222.se.biz.rr.com) Quit (Remote host closed the connection)
[4:38] * OutBackDingo_ (~quassel@rrcs-71-43-84-222.se.biz.rr.com) Quit (Remote host closed the connection)
[4:40] * OutBackDingo (~quassel@rrcs-71-43-84-222.se.biz.rr.com) has joined #ceph
[4:46] * lofejndif (~lsqavnbok@659AAA4AD.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[4:58] * elder_ (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[4:59] * elder_ (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit ()
[4:59] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[5:00] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[5:03] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit ()
[5:03] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[5:03] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit ()
[5:04] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[5:04] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit ()
[5:05] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[5:15] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[5:16] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[5:17] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit ()
[5:18] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[5:19] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit ()
[5:19] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[5:21] <elder> OK, I think I now have arranged to identify myself automatically using SSL to OFTC.
[5:24] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[5:24] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[5:30] * widodh_ (~widodh@minotaur.apache.org) Quit (Read error: Connection reset by peer)
[5:30] * widodh (~widodh@minotaur.apache.org) has joined #ceph
[5:35] <dmick> humph. that sounds painful
[5:39] <elder> It actually wasn't. They offered a mild sedative.
[5:40] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[5:41] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[5:41] <dmick> hum. well I connected with SSL too
[5:41] <dmick> so there
[5:42] <elder> Did you take the blue pill?
[5:42] <dmick> I don't know, but you look very cool in those shades
[5:43] <dmick> making my own certificate scares me tho
[5:43] <dmick> and I don't see Pidgin there in the instructions, so maybe it's a "won't do this" anyway
[5:43] <elder> You notice my shades have no temples?
[5:44] <elder> I don't even know what the certificate is doing I guess, other than saying I am the one that made the certificate.
[5:44] <elder> It doesn't really authenticate me at all.
[5:46] <dmick> yeah
[6:05] * dmick is now known as dmick_away
[7:53] * cattelan is now known as cattelan_away
[8:15] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:20] * Elotero (~Elotero@ has joined #ceph
[8:31] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[8:31] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) Quit (Read error: Connection reset by peer)
[8:38] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[8:47] * Elotero (~Elotero@ Quit (autokilled: This host violated network policy. Mail support@oftc.net if you have any questions (2012-06-22 06:47:20))
[8:49] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[8:51] * loicd (~loic@ has joined #ceph
[8:54] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[9:22] * BManojlovic (~steki@ has joined #ceph
[9:30] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:48] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) Quit (Ping timeout: 480 seconds)
[9:48] * loicd1 (~loic@ has joined #ceph
[9:48] * loicd (~loic@ Quit (Read error: No route to host)
[10:08] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit (Quit: adjohn)
[10:09] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:17] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:17] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[10:21] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:21] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[10:45] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:45] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[10:57] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:57] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[11:09] * loicd1 is now known as loicd
[11:09] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[11:56] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[11:59] * s[X]_ (~sX]@ has joined #ceph
[12:01] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:04] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[12:04] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[12:05] * s[X]_ (~sX]@ Quit (Read error: No route to host)
[12:05] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[12:15] * mgalkiewicz (~mgalkiewi@staticline58611.toya.net.pl) has joined #ceph
[12:16] <mgalkiewicz> is anybody home?
[12:21] * joao (~JL@89-181-150-156.net.novis.pt) has joined #ceph
[12:29] * mgalkiewicz (~mgalkiewi@staticline58611.toya.net.pl) Quit (Quit: Ex-Chat)
[12:31] * Ryan_Lane1 (~Adium@dslb-092-078-141-030.pools.arcor-ip.net) Quit (Quit: Leaving.)
[12:38] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:38] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[12:55] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[12:55] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[12:56] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[13:00] * s[X]____ (~sX]@ has joined #ceph
[13:02] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Operation timed out)
[13:04] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[13:15] * Ryan_Lane (~Adium@ has joined #ceph
[13:21] * s[X]____ (~sX]@ Quit (Remote host closed the connection)
[13:44] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[13:46] * The_Bishop_ (~bishop@e179004236.adsl.alicedsl.de) has joined #ceph
[13:53] * The_Bishop (~bishop@e179010000.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[14:32] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[14:32] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[14:53] * coredumb (~coredumb@ns.coredumb.net) has joined #ceph
[14:54] <coredumb> Hello
[14:56] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[15:00] <coredumb> i'm looking for a distributed/replicated filesystem able to handle live openVZ containers and/or KVM live VMs, is ceph able to handle that with good performance?
[15:29] <ninkotech> coredumb: rbd with kvm is imho good, but i didnt test it yet much
[15:29] <ninkotech> kvm (qemu) has good support for rbd
[15:30] <ninkotech> ceph (filesystem) is not yet very stable to be used for production, imho
[15:43] <coredumb> ninkotech: i didn't really get how this works?
[15:44] <coredumb> ceph block and ceph fs
[15:49] <ninkotech> coredumb: ceph is made of layers
[15:49] <ninkotech> filesystem -> block device -> objects
[15:49] <ninkotech> while the block device (rbd, on top of rados) is ~ ok, filesystem is not yet very stable
[15:50] <ninkotech> and kvm can use block device in rados as DISK
[15:50] <ninkotech> coredumb: i did oversimplify it a bit, but read something on the web :)
[15:50] <coredumb> ninkotech: ok
[15:51] <coredumb> and i guess i can also use block device on a host and store files on it ?
[15:51] <ninkotech> coredumb: searching keywords: rados block device kvm
[15:51] <ninkotech> coredumb: you have way too many choices :)
[15:51] <ninkotech> you can store filesystem on it
[15:51] <ninkotech> rather then files ;)
[15:52] <ninkotech> than*
[15:52] <coredumb> oh sounds like... well too awesome
[15:52] <ninkotech> coredumb: you do not store your files on disk
[15:52] <ninkotech> you store them on filesystem
[15:52] <ninkotech> usually :)
[15:53] <ninkotech> i am 1331, i store files on disk also
[15:53] <ninkotech> :)
[15:53] <coredumb> yeah that's what i meant actually ;)
[15:53] <ninkotech> its important to say what you really mean
[15:53] <ninkotech> :D
[15:54] <coredumb> yeah but indeed you usually store your file on a FS :D
[15:55] <coredumb> to tell you the truth i was trying to use glusterfs as a backend for my openVZ containers
[15:55] <coredumb> it has proved to be an utter failure
[15:59] <coredumb> ninkotech: btw if it's a block device, it can't be accessible by multiple hosts at the same time right?
[16:00] <ninkotech> coredumb: i would not do that :)
[16:00] <coredumb> :/ that's what i need
[16:00] <coredumb> so ceph FS, it's really not stable enough?
[16:00] <ninkotech> coredumb: i cant really say, i do not use it
[16:01] <ninkotech> coredumb: are you crazy? multiple hosts sharing writeable space? :)
[16:02] <ninkotech> at least if they are virtual machines - having that as root -- and RW --> that will be fail :))
[16:02] <coredumb> crazy as in... NFS ?
[16:02] <coredumb> :D
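For the rbd-under-kvm path ninkotech describes, the flow looks roughly like this; the pool and image names are made up, and the `rbd:` drive syntax shown is the 2012-era form, so verify against the qemu documentation for your version:

```shell
# Create a 10 GB block device image in the cluster's default 'rbd' pool.
rbd create vm-disk --size 10240

# Boot a KVM guest whose disk is that rbd image; qemu speaks to the
# cluster directly, so no kernel mount is needed on the host.
qemu-system-x86_64 -m 1024 \
    -drive format=raw,file=rbd:rbd/vm-disk
```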
[16:05] * lofejndif (~lsqavnbok@9YYAAHF88.tor-irc.dnsbl.oftc.net) has joined #ceph
[16:05] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[16:21] * ninkotech (~duplo@ Quit (Remote host closed the connection)
[16:28] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[16:29] * ninkotech (~duplo@ has joined #ceph
[16:29] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Remote host closed the connection)
[16:30] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[16:35] * s[X]__ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[16:37] <ninkotech> coredumb: booting all machines from one r/w nfs sure sounds crazy to me
[16:37] <ninkotech> :)
[16:38] <ninkotech> having homes is different -- but even than, you often expect a user to work only from one place :)
[16:45] <coredumb> we should count the number of ESX installation which have only NFS as datastore
[16:45] <coredumb> ;)
[16:50] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[16:51] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[17:04] * loicd (~loic@ has joined #ceph
[17:23] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[17:26] * lofejndif (~lsqavnbok@9YYAAHF88.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[17:28] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:29] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[17:33] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit ()
[17:40] * The_Bishop_ (~bishop@e179004236.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[17:40] * Tv_ (~tv@aon.hq.newdream.net) has joined #ceph
[17:40] * Tv_ is now known as tv_
[17:44] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:47] <elder> I'm getting teuthology run failures on workunits/suites/pjd.sh. The reason is that it needs http://ceph.com/qa/pjd.tgz which does not exist.
[17:48] <elder> Seems like a bad idea to rely on something like that. That pjd.tgz file ought to be under control of the ceph-qa git repository.
[18:06] * Ryan_Lane (~Adium@dslb-092-078-141-030.pools.arcor-ip.net) has joined #ceph
[18:14] * nolan (~nolan@2001:470:1:41:20c:29ff:fe9a:60be) Quit (Remote host closed the connection)
[18:17] * nolan (~nolan@2001:470:1:41:20c:29ff:fe9a:60be) has joined #ceph
[18:18] <nhm> elder: there are other examples of stuff like that (including some things I've written). I'd be in favor of moving all dependencies like that to some central location (like git).
[18:20] <joao> sagewk, around?
[18:21] * The_Bishop (~bishop@2a01:198:2ee:0:c43c:58a5:b240:aa03) has joined #ceph
[18:21] <gregaf> sagewk is off today
[18:21] <joao> kay
[18:32] * yehudasa_ (~yehudasa@aon.hq.newdream.net) has joined #ceph
[18:35] * bchrisman (~Adium@ has joined #ceph
[19:05] * chutzpah (~chutz@ has joined #ceph
[19:07] * loicd (~loic@ Quit (Quit: Leaving.)
[19:07] <tv_> elder: yeah that indirection is a legacy inherited from times before teuthology even existed
[19:07] <tv_> elder: though we've had issues with even github.com crapping out on the tests, etc
[19:07] <elder> Well I'm with nhm, we should work toward fixing things like that.
[19:08] <elder> Well at least github is a single place to collect failures.
[19:08] <elder> But that's another issue really.
[19:08] <tv_> i agree but haven't had time to devote to it
[19:08] <tv_> plus it's a race against sagewk adding more of those things -- you really need to convince him first ;)
[19:19] <elder> Are we having a meeting?
[19:19] <tv_> i see Mark at the vidyo screen, perhaps technical difficulties
[19:19] <elder> OK.
[19:19] <elder> I'm on too.
[19:22] <elder> I say we abandon the meeting if we don't hear from Aon in the next 3 minutes...
[19:42] * lofejndif (~lsqavnbok@9YYAAHGLR.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:00] * sdx23 (~sdx23@with-eyes.net) Quit (Remote host closed the connection)
[20:01] * jluis (~JL@89-181-155-203.net.novis.pt) has joined #ceph
[20:06] * joao (~JL@89-181-150-156.net.novis.pt) Quit (Ping timeout: 480 seconds)
[20:08] * jluis is now known as joao
[20:22] * cattelan_away is now known as cattelan
[20:26] * BManojlovic (~steki@ has joined #ceph
[20:28] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[20:30] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:31] * sagelap (~sage@142.sub-166-250-44.myvzw.com) has joined #ceph
[20:39] * sagelap (~sage@142.sub-166-250-44.myvzw.com) Quit (Ping timeout: 480 seconds)
[20:55] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit (Quit: adjohn)
[20:59] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[21:02] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit ()
[21:08] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[21:11] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit ()
[21:19] <nhm> elder: ping
[21:19] <elder> Here
[21:21] <nhm> elder: so I've been thinking a little bit recently about combining multiple small requests into a single transaction over the network. You know that whole side of the code way better than I do. Any thoughts?
[21:21] <elder> Only generally. Sounds like a good idea. What is the context? RADOS requests?
[21:22] <nhm> elder: it's one of the things that would be necessary I think to get our small request performance up for basically everything imho.
[21:23] <nhm> elder: even on SSDs our small request performance is really bad. I have a feeling it's probably because we are doing a single network transaction per request.
[21:25] <elder> Is there any way you can try to characterize what the distribution of elapsed time is between I/O submit on the client (application) and the completion?
[21:25] <elder> Time to kernel, time over the wire to server, time to get a response, etc.
[21:25] <elder> ?
[21:26] <elder> I've dealt with this sort of thing before--we always want to stick a tag on something so you can trace it all the way through the system. (And there's never a good way to do that.)
[21:28] <nhm> elder: with debug cranked up you get some timings, but it's not ideal. I think there may be more information (or plans to make information available) through the socket interface.
[21:28] <elder> This sort of breakout can help characterize what you can improve, and by how much, by changes (such as aggregating requests).
[21:29] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[21:29] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:29] <nhm> elder: I figure that so long as we are on tcpip, we'll probably eventually need to think about it regardless.
[21:30] <elder> This is RADOS requests though, right? You want to aggregate multiple things headed to a particular target host/socket?
[21:32] <elder> Aggregating things can add latency too--not the work of putting them together so much as the delay you might want to incur in order to actually have multiple requests to send at once.
[21:32] <nhm> yeah, I figure that would need to be tunable.
[21:32] <elder> It also could be quite complicated if we can't narrow it down to some sort of bottleneck/interface.
[21:33] <elder> I'm not that sure right now though. It would probably need to be a layer above the messenger.
[21:33] <elder> On the client side anyway.
[21:36] <nhm> elder: This is probably a down-the-road thing anyway. Lots of other things to worry about first.
[21:37] <elder> But good to be thinking through it nevertheless.
[21:38] <nhm> elder: I want to have good numbers to report for supercomputing. I think the primary goal will be to just have really good scaling numbers. Secondary is probably to show good small IO performance.
[21:39] <elder> Are you scheduled to go? It means you have about 4 months to get things humming.
[21:39] <elder> Where is it this time?
[21:40] * Ryan_Lane (~Adium@dslb-092-078-141-030.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[21:42] <nhm> elder: heh, it's in salt lake city this time. Going to be lots of disappointed alcoholics. ;)
[21:44] <elder> BYOB
[21:45] <nhm> And yes, I think 4 months is a good deadline. I need to dig into our scaling performance more, and once we've got it worked out, I figure we can try a big run on the majority of the burnupi nodes and a bunch of plana nodes.
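The tunable aggregation nhm and elder are weighing above can be sketched in a few lines; this is a toy illustration of the batching/latency trade-off (flush on either a size threshold or a configurable delay), not anything from the Ceph tree, and all names in it are invented:

```python
import time

class RequestBatcher:
    """Buffer small requests and send them as one batch, either when
    max_batch requests have accumulated or max_delay seconds have
    passed since the first pending request was queued."""

    def __init__(self, send_batch, max_batch=16, max_delay=0.005):
        self.send_batch = send_batch   # callable taking a list of requests
        self.max_batch = max_batch     # size threshold
        self.max_delay = max_delay     # latency bound (the tunable part)
        self.pending = []
        self.first_queued_at = None

    def submit(self, request, now=None):
        now = time.monotonic() if now is None else now
        if not self.pending:
            self.first_queued_at = now
        self.pending.append(request)
        # Flush immediately on size; the caller's event loop should also
        # call maybe_flush() periodically to enforce the latency bound.
        if len(self.pending) >= self.max_batch:
            self.flush()

    def maybe_flush(self, now=None):
        now = time.monotonic() if now is None else now
        if self.pending and now - self.first_queued_at >= self.max_delay:
            self.flush()

    def flush(self):
        if self.pending:
            batch, self.pending = self.pending, []
            self.send_batch(batch)
```

The `max_delay` knob is exactly the added latency elder warns about: the longer you wait, the more requests share one network transaction, at the cost of the first request in the batch.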
[21:48] * sagelap (~sage@196.sub-166-250-35.myvzw.com) has joined #ceph
[21:48] <elder> Take some time off, sagelap
[21:54] * dmick_away is now known as dmick
[21:57] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[21:57] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:59] * tv_ is now known as Tv_
[21:59] <dmick> nhm: do you have knowledge of just what happens on teuthology (the host) to pull things off the work queue and execute them?
[22:00] <nhm> dmick: do you mean the suite?
[22:00] <dmick> well, I think I mean "the thing that handles the queue", which I think is beanstalkd?...
[22:00] * mkampe (~markk@aon.hq.newdream.net) Quit (Remote host closed the connection)
[22:01] <Tv_> dmick: i'd be surprised if he did
[22:01] <Tv_> no offense
[22:01] <dmick> yeah, I know, but he's poked at it
[22:01] <Tv_> just.. too recent a hire
[22:01] <dmick> and talked to joshd and stuff
[22:01] <nhm> dmick: yeah, I don't know anything about that part of it. I cheated and run suites locally.
[22:01] <dmick> ok. no harm in asking
[22:01] <Tv_> dmick: there's a start_worker.sh that just seems to run N workers, and i think people just start the workers manually
[22:02] <Tv_> dmick: i don't see any tie into upstart/sysvinit/runit etc
[22:02] <Tv_> dmick: and i poked at their env and don't see anything special about ssh, still no clue on that..
[22:02] * sagelap (~sage@196.sub-166-250-35.myvzw.com) Quit (Ping timeout: 480 seconds)
[22:02] <dmick> -schedule definitely talks to beanstalkd
[22:02] <dmick> maybe that then farms things to -worker
[22:02] <dmick> haven't found that link yet
[22:02] <Tv_> dmick: yes
[22:03] <Tv_> dmick: but i don't see where -worker would be getting its ssh key from
[22:04] <Tv_> i guess i can grep the whole damn vm for the file ;)
[22:04] <Tv_> dmick: uhhh the key that's actually put on plana is called "teuthologyworker@metropolis"
[22:05] <Tv_> dmick: that's not what the teuthology vm is called!
[22:06] * jluis (~JL@ has joined #ceph
[22:06] * jluis (~JL@ Quit ()
[22:06] <Tv_> but i do see workers running on the teuthology vm
[22:06] <Tv_> this confuses me
[22:07] <dmick> yeah, worker is different from locker is different from "teuthology"
[22:11] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[22:14] * sagelap (~sage@2600:1010:b002:85d0:1ff:7ea1:67a6:7c63) has joined #ceph
[22:17] <Tv_> dmick: and i switched to teuthworker and tried ssh manually, and i couldn't log in to plana
[22:17] <Tv_> dmick: and there are no id_rsa id_dsa files on that vm
[22:21] * sagelap1 (~sage@ace.ops.newdream.net) has joined #ceph
[22:21] <sagelap1> grr fucking chef
[22:22] <sagelap1> dmick: today's let's-break-every-qa-job plana node is 87
[22:22] <Tv_> dmick: i grepped the whole vm for the teuthologyworker@metropolis and qa@metropolis keys, they are not there
[22:22] * sagelap (~sage@2600:1010:b002:85d0:1ff:7ea1:67a6:7c63) Quit (Ping timeout: 480 seconds)
[22:23] <sagelap1> i cleared out the queue, so every scheduled_* locked node needs to be mopped up with cleanup-user.sh and new runs scheduled
[22:24] <dmick> somehow elder's jobs seem to be running
[22:24] <elder> Really?
[22:24] <sagelap1> all of the runs in progress have so many solo failures they're basically useless
[22:24] <dmick> sagelap1: we're just trying to puzzle out how the queue can or cannot be running
[22:24] <dmick> and, awesome about chef solo, that's great
[22:24] * sagelap1 is now known as sagelap
[22:24] <dmick> sagelap: where does t-w get its ssh key from?
[22:25] <sagelap> if this happens one more time i say we stick the apt-get kludge in there. we haven't gotten a single useful qa run in a week
[22:25] <sagelap> t-w?
[22:25] <dmick> teuth-worker
[22:25] <dmick> tamil cannot submit a teuthology job from teuthology-the-vm
[22:25] <dmick> there is no key in its home, nor in teuthology-worker's
[22:26] <dmick> we are scratching our heads at how it gets access to planae
[22:26] <elder> Oooh, the plurality!
[22:26] <dmick> (tamil cannot submit a teuthology job from teuthology-the-vm as user teuthology)
[22:26] <Tv_> plural**2
[22:26] <Tv_> plana is already plural of planum, iirc
[22:26] <dmick> sorry, user teuthworker)
[22:26] <dmick> oh. sorry. then why aren't the machines called planum01 etc., eh?
[22:27] <elder> What's the plural of planae then?
[22:27] <sagelap> not sure about that part
[22:27] <Tv_> because the sepies is "sepia plana"
[22:27] <Tv_> *species
[22:27] <elder> Fingers too used to typing "sepia"?
[22:27] <nhm> elder: clearly you have a knack for scripting languages. From now on you are lead teuthology/chef/ruby/python czar. ;)
[22:28] <elder> How is that? Because my jobs are working?
[22:28] <nhm> elder: indeed. you have a gift!
[22:28] <elder> You guys just don't know how hard to hit the return key.
[22:28] <elder> You have to really mean it.
[22:28] <Tv_> for the record, i've put an axe through a commodore-64 once
[22:29] <elder> PC loadletter?
[22:29] <dmick> anyway: sagelap: do you know how the key is supposed to be obtained for teuthology-worker?
[22:29] <nhm> elder: that's what I was thinking. :D
[22:29] * mkampe (~markk@aon.hq.newdream.net) has joined #ceph
[22:31] <sagelap> never looked. it's not just ~/.ssh?
[22:31] <Tv_> dmick: one theory i have is pebkac deleted the key at some point... question: are new jobs working or not
[22:32] <dmick> yes, looking into if -schedule works
[22:32] <dmick> teuthology still does, but of course that's with my key
[22:32] <sagelap> they ar enot
[22:32] <sagelap> something broke :)
[22:32] <dmick> apparently "some things" got deleted yesterday, and josh restored them
[22:32] <dmick> some of them may have been the key
[22:32] <sagelap> raise SSHException('No authentication methods available')
[22:33] <dmick> yeah, that's what tamil is seeing
[22:33] <dmick> I think ~teuthworker/.ssh is missing id_*sa*
[22:33] <dmick> and I know where the pubkey is but I don't know where the priv might be; not finding it on metropolis
[22:34] <sagelap> just create a new one
[22:34] <sagelap> ?
[22:34] <dmick> well I suppose I could, but I thought I must be missing something
[22:35] <sagelap> dmick: that was probably the ony copy
[22:35] <Tv_> i hope it was the only copy; otherwise we're doing it even more wrong ;)
[22:35] <dmick> ok. a new one will let us continue anyway
[22:35] <dmick> so I'll get on that
[22:39] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:39] <sagelap> is someone looking into stefan's issue on the list?
[22:40] <dmick> yehuda just came to ask me about it
[22:40] <dmick> I will after this firedrill
[22:45] <dmick> new key created and (I think) distributed to all plana except 78 80 81 82
[22:48] <nhm> which key did you create?
[22:51] <sagelap> ok, working now.
[22:51] <dmick> nhm: It's called "teuthworker_dmick_<date>"
[22:51] <sagelap> scheduled a regression run
[22:52] <nhm> dmick: ah, cool
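Recreating the lost worker key, as done above, boils down to generating a fresh passwordless keypair and pushing the public half to every target node; the key comment, filename, and host list here are illustrative:

```shell
# Generate a new passwordless keypair for the worker account.
ssh-keygen -t rsa -N "" \
    -f ~/.ssh/id_rsa -C "teuthworker_dmick_$(date +%Y%m%d)"

# Append the public key to authorized_keys on each test node.
for host in plana01 plana02; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub "ubuntu@$host"
done
```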
[22:56] <dmick> sagelap: is sched_nightly.sh intended to be run as user teuthology?
[22:56] <sagelap> grr, more machines with broken apt
[22:56] <dmick> (I see it's currently commented out of user teuthology's crontab)
[22:56] <dmick> if so, it seems like /home/teuthology/.ssh is also missing keys
[22:56] <sagelap> obsolete, they now use the regular schedule script
[22:56] <sagelap> but yes, user teuthology
[22:56] <sagelap> it doesn't need keys
[22:57] <sagelap> it just schedules stuff
[22:57] <dmick> oh because it schedules and t-w does it
[22:57] <dmick> ok
[22:57] <nhm> sagelap: same thing that was breaking mencoder?
[22:57] <dmick> it needed permission to files owned by user teuthology
[22:57] <dmick> got it.
[22:57] <sagelap> dmick: ok, can you add the apt thing to the chef script? this is a huge waste of time
[22:57] <sagelap> yes
[22:57] <Tv_> can i please see one full instance of the mencoder problem?
[22:58] <sagelap> ubuntu@teuthology:/a/sage-2012-06-22_13:51:40-regression-next-testing-basic$ grep role 1086/teuthology.log
[22:58] <sagelap> 2012-06-22T13:51:43.024 INFO:teuthology.task.internal:roles: ubuntu@plana42.front.sepia.ceph.com - ['mon.a', 'mon.c', 'osd.0', 'osd.1', 'osd.2']
[22:58] <sagelap> 2012-06-22T13:51:43.024 INFO:teuthology.task.internal:roles: ubuntu@plana39.front.sepia.ceph.com - ['mon.b', 'mds.a', 'osd.3', 'osd.4', 'osd.5']
[22:58] <sagelap> 2012-06-22T13:51:43.024 INFO:teuthology.task.internal:roles: ubuntu@plana47.front.sepia.ceph.com - ['client.0']
[22:58] <sagelap> ubuntu@teuthology:/a/sage-2012-06-22_13:51:40-regression-next-testing-basic$ grep role 1089/teuthology.log
[22:58] <sagelap> 2012-06-22T13:51:43.717 INFO:teuthology.task.internal:roles: ubuntu@plana57.front.sepia.ceph.com - ['mon.a', 'mon.c', 'osd.0', 'osd.1', 'osd.2']
[22:58] <sagelap> 2012-06-22T13:51:43.717 INFO:teuthology.task.internal:roles: ubuntu@plana58.front.sepia.ceph.com - ['mon.b', 'mds.a', 'osd.3', 'osd.4', 'osd.5']
[22:58] <sagelap> 2012-06-22T13:51:43.718 INFO:teuthology.task.internal:roles: ubuntu@plana61.front.sepia.ceph.com - ['client.0']
[22:58] <Tv_> all i've heard is "apt-get failed, let's get a hammer"
[22:58] <sagelap> tv_ one in each of those sets is broken
[22:58] <dmick> Tv_: the problem is 'something breaks the dpkg state, we don't know what', and then later apt-get fails
[22:58] <sagelap> tv_: another set is 67 68 69
[22:59] <Tv_> can you point to something specific?
[23:00] <Tv_> "one in this huge pile" when i don't know the symptoms is kinda not helpful
[23:00] <dmick> I could, but I ssh'ed to one of the machines and then it went away
[23:00] <sagelap> try 'apt-get install mencoder' on those machines and see the failure
[23:01] <Tv_> sagelap: i see lots of already installed mencoders..
[23:01] <dmick> 58 has apparently been healed
[23:01] <Tv_> i see aborted dpkg runs
[23:01] <dmick> oh we don't know which from each of those sets
[23:01] <sagelap> dmick: right
[23:02] <Tv_> Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/s/samba/libwbclient0_3.6.3-2ubuntu2.2_amd64.deb 404 Not Found [IP: 80]
[23:02] <Tv_> hrrmph
[23:02] <nhm> sagelap: I did that yesterday and it failed because apt-get update needed to be run.
[23:02] <Tv_> apt-get update and retrying on 47
[23:02] <nhm> Tv_: that's exactly what I did yesterday, it will work.
[23:02] <dmick> out of 67 68 69, none of them report any half-installed pkgs from dpkg -l
[23:02] <Tv_> dmick: dpkg -C is the official check
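(Editor's sketch of the audit step being discussed: `dpkg -C` reports packages left half-installed or unpacked-but-unconfigured; empty output means the dpkg state is consistent. `dpkg -l`, used just above, only shows per-package status flags, which is why `-C` is called the official check here.)

```shell
# Audit the dpkg database for broken package states.
# Empty output => consistent; any output names packages needing repair.
sudo dpkg -C
```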
[23:03] <sagelap> tv_: the archive for that set is ubuntu@teuthology:/a/sage-2012-06-22_13:51:40-regression-next-testing-basic/1086
[23:03] <dmick> dpkg -C quiet
[23:03] <dmick> 69 offers to install it
[23:04] <dmick> but reports
[23:04] <dmick> Err http://us.archive.ubuntu.com/ubuntu/ precise-updates/main libwbclient0 amd64 2:3.6.3-2ubuntu2.2
[23:04] <dmick> 404 Not Found [IP: 80]
[23:04] <dmick> Err http://us.archive.ubuntu.com/ubuntu/ precise-updates/main libsmbclient amd64 2:3.6.3-2ubuntu2.2
[23:04] <dmick> 404 Not Found [IP: 80]
[23:04] <dmick> Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/s/samba/libwbclient0_3.6.3-2ubuntu2.2_amd64.deb 404 Not Found [IP: 80]
[23:04] <sagelap> dmick: for that set see ubuntu@teuthology:/a/sage-2012-06-22_13:51:40-regression-next-testing-basic/teuthology.log
[23:04] <dmick> Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/s/samba/libsmbclient_3.6.3-2ubuntu2.2_amd64.deb 404 Not Found [IP: 80]
[23:04] <dmick> E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
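(Editor's note: the 404s above are the classic stale-index symptom — the local package lists still reference a .deb version the mirror has since superseded, so the fetch fails. A sketch of the recovery being suggested in the session; `mencoder` is the package from this log, and the fix generalizes to any package:)

```shell
# Refresh the package indices so they match what the mirror currently serves,
# then retry the install that 404'd.
sudo apt-get update
sudo apt-get install mencoder
```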
[23:04] <Tv_> the one valid error i see is about the blktrace deb
[23:05] <dmick> sage: that file is gone
[23:05] <sjust> blktrace requires mencoder?
[23:05] <dmick> the movies do
[23:05] <sjust> yeah
[23:05] <Tv_> and that's immediately followed by a wget error, which makes me think the network crapped out
[23:05] <Tv_> unfortunately chef-solo puts debug output inside the temp dir, that is removed
[23:06] <dmick> ah, */teuthology.log
[23:06] <dmick> so yeah, the chef just reports apt-get returned 100, doesn't say why
[23:07] <dmick> I'm assuming that if I apt-get update on 69, the Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/s/samba/libwbclient0_3.6.3-2ubuntu2.2_amd64.deb 404 Not Found [IP: 80] will go away
[23:07] <Tv_> well it puts detail in a file, that we promptly remove
[23:07] <nhm> sjust: seekwatcher does for the movie generation. I'm trying to distribute the generation across all of the nodes to make it finish faster.
[23:07] <dmick> yes. I mean in the teuthology log
[23:07] <Tv_> dmick: that's not from apt-get update
[23:07] <sjust> ah
[23:07] <dmick> right. I'm saying a-g-u will fix it
[23:07] <Tv_> dmick: yeah that just means you're using stale package indices
[23:07] <nhm> dmick: that's what happened yesterday. apt-get update fixed the problem.
[23:08] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[23:08] <Tv_> 2012-06-22T13:52:19.640 INFO:teuthology.orchestra.run.out:[Fri, 22 Jun 2012 13:52:19 -0700] INFO: Processing execute[apt-get update] action run (ceph-qa::radosgw line 27)
[23:08] <dmick> so assuming chef really does apt-get update before it processes packages, that shouldn't be the error
[23:08] <nhm> dmick: this is why I was saying yesterday that I thought the update wasn't happening (or I suppose happening properly).
[23:08] <dmick> so we have an explicit update in radosgw.rb because that manipulates sources.list files
[23:08] <dmick> but we do not have an explicit one in default.rb, because chef supposedly does that for us
[23:09] * sagelap1 (~sage@2600:1010:b00c:afd9:1ff:7ea1:67a6:7c63) has joined #ceph
[23:09] <dmick> it is possible that supposition is in error
[23:09] * sagelap (~sage@ace.ops.newdream.net) Quit (Read error: Connection reset by peer)
[23:09] <Tv_> ahh that's from the radosgw recipe, which is triggered after the default recipe, which already tried to apt-get install
[23:09] <dmick> it's like there's an echo in here :)
[23:09] <Tv_> i don't see us triggering an apt-get update
[23:10] <Tv_> now, here's what bothers me
[23:10] <dmick> right. I was told "package" resources in Chef cause that to happen automagically
[23:10] <dmick> this may not be true
[23:10] <Tv_> 404 on a deb shouldn't get into a state where dpkg -C says anything
[23:10] <Tv_> apt won't run dpkg until it has a sufficient pile of debs
[23:10] <dmick> dpkg -C isn't saying anything; apt-get install said that
[23:10] <Tv_> we're not seeing what actually happened
[23:10] <Tv_> dmick: when run after the issue
[23:10] <dmick> I think apt-get install from chef's Package resource failed
[23:11] <dmick> in one case, it failed because of stale repo data
[23:11] <Tv_> ubuntu@plana47:~$ sudo dpkg -C
[23:11] <Tv_> The following packages have been unpacked but not yet configured.
[23:11] <Tv_> They must be configured using dpkg --configure or the configure
[23:11] <Tv_> menu option in dselect for them to work:
[23:11] <Tv_> blktrace utilities for block layer IO tracing
[23:11] <Tv_> ... etc
[23:11] <Tv_> dmick: blah, walking over
[23:11] <dmick> right. I saw something like that with "mencoder" yesterday.
[23:20] <Tv_> dmick: unattended-upgrades - automatic installation of security upgrades
[23:20] <Tv_> dmick: i do believe that has a mode where it doesn't actually install, just keeps indices up to date
[23:21] <Tv_> cron-apt - automatic update of packages using apt-get
[23:21] <Tv_> hmm
[23:21] <dmick> I believe that you are correct that Chef does not update
[23:21] <sagelap1> it's possible the mopping up itself is breaking other nodes... cleanup-user.sh does a nuke + reboot. and when a run goes bad i kill the teuthology procs which could leave the chef-client runs orphaned or aborted (not sure what they do when the ssh connection drops)
[23:22] <Tv_> dmick: former is canonical work, latter is in universe
[23:22] * sagelap1 is now known as sagelap
[23:22] <Tv_> oh god nuke
[23:22] <Tv_> there's a reason i insisted on that name
[23:22] <Tv_> nuke would definitely be able to leave dpkg in a bad state, and require manual recovery
[23:23] * dmick searches for a tranquilizer dart to fire in Tv_'s direction
[23:23] <Tv_> this is why the original tests did NOT use dpkg
[23:23] <Tv_> because this is one of the things i broke when i tried to use debs
[23:24] <dmick> we could break the chef runs into "a thing that happens on each chef task from teuthology" and "a thing that happens on install" without too too much effort. That might let us minimize things like this
[23:25] <Tv_> what part of ceph-qa-chef is valuable that does not run apt-get?
[23:25] <sagelap> or not run chef automatically on each run at all.. the problem there is we'll inevitably miss stuff when we unlock old machines
[23:25] <dmick> erm, now I have to look
[23:26] <Tv_> sagelap: which is why we need to improve the provisioning system as a whole
[23:26] <sagelap> tv_: agreed. but in the meantime, we need to actually test next prior to 0.48 release
[23:26] <dmick> ntp,grub, ssh keys, ttyS1 login
[23:26] <dmick> and we probably need to keep the conditional rgw apt stuff
[23:27] <dmick> I dunno. maybe it'll just leave us with other inconsistencies
[23:27] <dmick> and yes, keep working on the better provisioning, clearly
[23:28] <sagelap> dmick: can we just do a full sweep of chef-client on plana now so we can start scheduling jobs?
[23:29] <Tv_> dmick: make that dpkg --configure --pending && apt-get update
[23:29] <sagelap> this category of breakage is only going to come up when we adjust the chef cookbook to install new packages
[23:29] <dmick> yeah, just adding "apt-get update" to it
[23:29] <dmick> ok
[23:29] <Tv_> dmick: i mean, not automated, but for the sweep
[23:29] <Tv_> dmick: anything where "dpkg -C" outputs anything will NOT come back to health with just apt
[23:29] <dmick> yes
[23:29] <Tv_> (except perhaps apt-get -f install, which is like a rubber hammer..)
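(Editor's sketch combining Tv_'s suggestions for the manual sweep: the individual commands are the ones named in the session, though chaining them in this exact order is an assumption.)

```shell
# Finish configuring packages left unpacked-but-unconfigured
# (e.g. blktrace above), then refresh stale package indices.
sudo dpkg --configure --pending && sudo apt-get update
# The "rubber hammer" last resort: let apt attempt to repair
# broken/missing dependencies itself.
sudo apt-get -f install
```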
[23:30] <sagelap> cool
[23:31] <dmick> if 78 80 81 82 are sitting waiting for F1 again I am going to scream
[23:31] <dmick> dpkg -C is clean everywhere besides there
[23:31] <sagelap> cover your ears, everyone
[23:32] <nhm> heh
[23:32] <dmick> updating everyone
[23:32] <Tv_> not really for the loudness, but for the vocabulary
[23:32] <sagelap> heh
[23:32] <nhm> 78,80,81,82 are all part of the aging cluster anyway. You can just leave those alone.
[23:33] <dmick> as long as they're recheffed before replacing in the pool
[23:33] <Tv_> make that reinstalled
[23:34] <nhm> yeah, they need to be reinstalled before we release them.
[23:34] <nhm> they've been modified sufficiently that it would be best to start out with a clean slate.
[23:35] <dmick> and rewaxed and buffed while we're at it
[23:35] <sagelap> gotta go guys. once those are updated, cleanup-user.sh scheduled_sage@metropolis to nuke+unlock all those planas, and then schedule something new... regression next testing probably
[23:35] <dmick> k bai
[23:35] <nhm> sagelap: have a good weekend
[23:35] <sagelap> thanks! have a good weekend everyone.
[23:36] * sagelap (~sage@2600:1010:b00c:afd9:1ff:7ea1:67a6:7c63) Quit (Quit: Leaving.)
[23:54] <dmick> maybe he really meant cleanup-and-unlock.sh
[23:54] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.