#ceph IRC Log

IRC Log for 2011-10-25

Timestamps are in GMT/BST.

[0:08] * jmlowe (~Adium@mobile-198-228-227-233.mycingular.net) has joined #ceph
[0:08] * jmlowe (~Adium@mobile-198-228-227-233.mycingular.net) has left #ceph
[0:56] * adjohn (~adjohn@50.0.103.34) Quit (Quit: adjohn)
[1:50] * sjust (~sam@aon.hq.newdream.net) has left #ceph
[1:50] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[2:04] * jojy (~jojyvargh@108.60.121.114) Quit (Quit: jojy)
[2:32] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[2:37] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:41] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:06] * gohko (~gohko@natter.interq.or.jp) Quit (Quit: Leaving...)
[3:15] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[3:31] * bchrisman (~Adium@108.60.121.114) Quit (Quit: Leaving.)
[3:32] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:58] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[4:01] * interiorcrocodile (~bob@70.52.147.141) has joined #ceph
[4:44] * interiorcrocodile (~bob@70.52.147.141) Quit (Remote host closed the connection)
[4:44] * interiorcrocodile (~bob@70.52.147.141) has joined #ceph
[5:19] * Pjack (~IceChat77@60-251-132-28.HINET-IP.hinet.net) has joined #ceph
[5:29] * Pjack (~IceChat77@60-251-132-28.HINET-IP.hinet.net) Quit (Ping timeout: 480 seconds)
[6:19] * interiorcrocodile (~bob@70.52.147.141) Quit (Remote host closed the connection)
[6:31] * sandeen_ (~sandeen@sandeen.net) Quit (Quit: This computer has gone to sleep)
[6:49] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[7:21] * SpamapS (clint@xenclint.srihosting.com) Quit (Quit: Lost terminal)
[9:35] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[9:48] <chaos__> I'm wondering if it's safe to give other users write permission on the ceph administrative sockets. sagewk, is it possible to do something *wrong* through these sockets?
[10:56] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:25] * alexxy (~alexxy@79.173.81.171) Quit (Remote host closed the connection)
[12:29] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[13:44] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[13:44] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[13:58] <stingray> ифр
[13:59] <stingray> bah
[13:59] <stingray> now I have disk_tp timeoust
[13:59] <stingray> timeouts
[14:02] <stingray> spinning here:
[14:02] <stingray> Thread 131 (Thread 0x7ffd9a49a700 (LWP 11570)):
[14:02] <stingray> #0 find_entry (this=0x32912c8, v=<optimized out>) at osd/PG.cc:5053
[14:02] <stingray> #1 PG::build_inc_scrub_map (this=0x3291000, map=..., v=...) at osd/PG.cc:2904
[14:02] <stingray> #2 0x0000000000640a8b in PG::build_scrub_map (this=0x3291000, map=...) at osd/PG.cc:2879
[14:02] <stingray> #3 0x000000000064ceec in PG::replica_scrub (this=0x3291000, msg=0x3907d20) at osd/PG.cc:2989
[14:03] <stingray> #4 0x0000000000595272 in OSD::RepScrubWQ::_process(MOSDRepScrub*) ()
[14:09] <josef> man you guys really do a lot of things that are way slow on btrfs
[14:11] <josef> why is ceph-osd fsyncing a block dev?
[14:12] <stingray> this is on ext4 btw.
[16:11] * mark (~mark@penguin.msi.umn.edu) has joined #ceph
[16:25] <mark> anyone here tried using ceph as a backend for openstack's nova-volume with the libvirt driver?
[16:27] * mark is now known as nhm
[16:47] * ssedov (stas@ssh.deglitch.com) has joined #ceph
[16:53] * stass (stas@ssh.deglitch.com) Quit (Ping timeout: 480 seconds)
[16:56] * sandeen_ (~sandeen@sandeen.net) has joined #ceph
[17:00] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[17:27] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[18:24] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[18:26] * ognatortcele (~ognatortc@66.246.173.34) has joined #ceph
[18:29] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[18:40] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:45] <joshd> nhm: yeah, what's up?
[18:50] * The_Bishop (~bishop@port-92-206-76-12.dynamic.qsc.de) has joined #ceph
[18:50] <nhm> joshd: how's stability?
[18:52] * cp (~cp@md85636d0.tmodns.net) has joined #ceph
[18:53] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Quit: Leaving.)
[18:55] * aliguori (~anthony@32.97.110.59) has joined #ceph
[18:55] <joshd> nhm: rbd itself is fairly stable, but you may run into issues with the underlying fs on your osds (see the thread on ceph-devel titled 'ceph on non-btrfs file systems')
[18:55] <joshd> I have to go now - be back in a few hours
[18:55] <nhm> joshd: ok, cool. I'm doing testing now for an eventual production deployment in the spring/summer.
[18:56] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[19:07] * cp (~cp@md85636d0.tmodns.net) Quit (Quit: cp)
[19:08] * adjohn is now known as Guest14716
[19:08] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[19:08] * Guest14716 (~adjohn@50.0.103.34) Quit (Read error: Connection reset by peer)
[19:17] * adjohn is now known as Guest14718
[19:17] <stingray> Who is the local PG.cc expert?
[19:17] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[19:17] * Guest14718 (~adjohn@50.0.103.34) Quit (Read error: Connection reset by peer)
[19:20] <ognatortcele> Hi, I am testing Ceph for our company. What's better: investing in high-quality SSDs, or more OSDs with cheaper disks?
[19:25] * adjohn (~adjohn@50.0.103.34) Quit (Ping timeout: 480 seconds)
[19:26] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[19:33] * adjohn is now known as Guest14721
[19:33] * Guest14721 (~adjohn@50.0.103.34) Quit (Read error: Connection reset by peer)
[19:33] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[19:34] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:35] <NaioN> sagewk: we still have troubles with the osds where they kill themselves
[19:35] <NaioN> but before the osd kills itself I see a lot of these messages:
[19:35] <NaioN> 2011-10-25 15:49:35.330295 7fa5abf21700 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fa59ee06700' had timed out after 30
[19:36] <NaioN> 2011-10-25 15:49:35.330318 7fa5abf21700 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fa5a4712700' had timed out after 60
[19:37] <NaioN> and some of these:
[19:37] <NaioN> 2011-10-25 15:50:03.570892 7fa59bad6700 -- 10.20.0.225:6800/6678 >> 10.20.0.231:0/731394354 pipe(0x1114000 sd=22 pgs=0 cs=0 l=0).accept peer addr is really 10.20.0.231:0/731394354 (socket is 10.20.0.231:50560/0)
[19:37] <stingray> list<Entry>::iterator find_entry(eversion_t v) {
[19:37] <NaioN> 2011-10-25 15:50:03.570950 7fa59bad6700 -- 10.20.0.225:6800/6678 >> 10.20.0.231:0/731394354 pipe(0x1114000 sd=22 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
[19:37] <stingray> NaioN: I am debugging the same problem
[19:37] <NaioN> aha
[19:37] <stingray> apparently this find_entry goes into an infinite loop at some point
[19:37] <stingray> #0 find_entry (this=0x32912c8, v=<optimized out>) at osd/PG.cc:5053
[19:37] <stingray> #1 PG::build_inc_scrub_map (this=0x3291000, map=..., v=...) at osd/PG.cc:2904
[19:37] <stingray> #2 0x0000000000640a8b in PG::build_scrub_map (this=0x3291000, map=...) at osd/PG.cc:2879
[19:38] <stingray> #3 0x000000000064ceec in PG::replica_scrub (this=0x3291000, msg=0x3907d20) at osd/PG.cc:2989
[19:38] <stingray> #4 0x0000000000595272 in OSD::RepScrubWQ::_process(MOSDRepScrub*) ()
[19:38] <stingray> something like this
[19:38] <NaioN> ok you saw that in the osd log?
[19:38] <NaioN> because I have the log level low at the moment
[19:38] <NaioN> but I have the logs of a run with the loglevels high
[19:38] <NaioN> so I could search for that string
[19:39] <stingray> NaioN: the only log message you will see will be something like:
[19:40] <stingray> 2011-10-25 21:12:32.510581 7f109dbc8700 osd.2 6625 pg[0.159( v 2297'5163 (2297'5158,2297'5163] n=1548 ec=2 les/c 6624/6624 6617/6617
[19:40] <stingray> and then there will be no dout(10) << " done. pg log is " << map.logbl.length() << " bytes" << dendl;
[19:40] <NaioN> ok
[19:40] <stingray> (that's debug osd 10)
[19:40] <NaioN> I had 20
[19:40] <NaioN> but it seems I lost it...
[19:40] <stingray> I just gdb'd it and found the spinning thread
[19:40] <NaioN> could do a new run
[19:40] <stingray> now I'm trying hard to understand the logic behind all this
[19:41] <NaioN> hmmmm sagewk could help you
[19:41] <NaioN> If you found the spinning thread I think you found more than what we already knew
[19:41] <stingray> he is obviously not here. Either I'm fixing it myself or I guess I have provided enough info
[19:41] <NaioN> yes I think you provided a lot of information with it
[19:42] * adjohn (~adjohn@50.0.103.34) Quit (Read error: Connection reset by peer)
[19:42] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[19:46] <stingray> I think it branched into the first branch and then just keeps doing p--
[19:47] <NaioN> Well I'm not familiar with the code, so I'm no help there :)
[19:47] <stingray> #0 operator> (l=<optimized out>, r=<optimized out>) at osd/osd_types.h:387
[19:48] <stingray> while (p->version > v)
[19:48] <stingray> p--;
[19:48] <stingray> yes it's here
[19:48] <stingray> :(
[19:49] <stingray> if only somebody would tell me whether, for example, it is safe to add some kind of bounds checking to this thing
[19:49] <stingray> ...
[19:49] <stingray> (I would still like to have access to my data at some point)
[19:50] <NaioN> you could also mail to the mailinglist
[19:50] <stingray> yeah I could
[19:54] * adjohn is now known as Guest14723
[19:54] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[19:54] * Guest14723 (~adjohn@50.0.103.34) Quit (Read error: Connection reset by peer)
[20:00] * cp (~cp@75.103.61.58) has joined #ceph
[20:00] <psomas> :Q::Q
[20:00] <psomas> sorry :/
[20:02] * adjohn (~adjohn@50.0.103.34) Quit (Ping timeout: 480 seconds)
[20:02] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[20:07] * fronlius (~Adium@f054098210.adsl.alicedsl.de) has joined #ceph
[20:09] * adjohn (~adjohn@50.0.103.34) Quit (Read error: Connection reset by peer)
[20:09] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[20:22] * adjohn (~adjohn@50.0.103.34) Quit (Read error: Connection reset by peer)
[20:23] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[20:29] * adjohn is now known as Guest14726
[20:29] * Guest14726 (~adjohn@50.0.103.34) Quit (Read error: Connection reset by peer)
[20:29] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[20:36] <psomas> stingray: I'm not familiar with the code, but it seems sane to add some bounds checking, i.e. p--; if (p == log.begin()) break; or something like that
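Editor's sketch of the bounds check discussed above. This is not the Ceph code: Version, Entry, Log and find_entry_bounded are hypothetical stand-ins, and the only thing taken from the log is the backward walk "while (p->version > v) p--;" over a list of entries. The sketch assumes the entries are ordered oldest to newest and that the caller can handle a "not found" result; whether that is safe inside PG::build_inc_scrub_map is exactly the open question in the conversation.

```cpp
#include <cassert>
#include <cstdint>
#include <list>

// Hypothetical stand-in for a log version: ordered by epoch, then version.
struct Version {
    std::uint64_t epoch;
    std::uint64_t version;
};

inline bool operator>(const Version& l, const Version& r) {
    return l.epoch > r.epoch || (l.epoch == r.epoch && l.version > r.version);
}

// Hypothetical stand-in for a log entry; only the version matters here.
struct Entry {
    Version version;
};

using Log = std::list<Entry>;

// Walk backwards from the newest entry and return the first entry whose
// version is not newer than v. Unlike an unchecked "while (...) p--;",
// the loop stops at log.begin() and reports failure by returning log.end().
inline Log::const_iterator find_entry_bounded(const Log& log, const Version& v) {
    auto p = log.end();
    while (p != log.begin()) {
        --p;
        if (!(p->version > v)) {
            return p;
        }
    }
    return log.end();  // empty log, or every entry is newer than v
}

// Small usage check covering the normal case and both edge cases.
inline void check_examples() {
    const Log log = {Entry{Version{1, 1}}, Entry{Version{1, 2}}, Entry{Version{2, 1}}};

    auto hit = find_entry_bounded(log, Version{1, 2});
    assert(hit != log.end() && hit->version.epoch == 1 && hit->version.version == 2);

    // Every entry is newer than (0, 5): the search ends instead of spinning.
    assert(find_entry_bounded(log, Version{0, 5}) == log.end());

    const Log empty;
    assert(find_entry_bounded(empty, Version{1, 1}) == empty.end());
}
```

Testing the iterator against log.begin() before each decrement keeps the walk inside the list even when no entry satisfies the condition, which is the case the spinning thread appears to have hit.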
[20:39] * cp (~cp@75.103.61.58) Quit (Quit: cp)
[20:49] * adjohn (~adjohn@50.0.103.34) Quit (Quit: adjohn)
[20:54] <stingray> psomas:
[20:54] <stingray> yeah, that's what I'm trying to do
[20:54] <stingray> well, actually I'm swimming home
[20:55] <stingray> one of my computers is trying to compile what I just did
[21:24] * fronlius (~Adium@f054098210.adsl.alicedsl.de) Quit (Quit: Leaving.)
[21:27] * fronlius (~Adium@f054098210.adsl.alicedsl.de) has joined #ceph
[21:35] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[22:03] * cp (~cp@75.103.61.58) has joined #ceph
[23:04] * cp (~cp@75.103.61.58) Quit (Quit: cp)
[23:10] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[23:13] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:35] * ognatortcele (~ognatortc@66.246.173.34) Quit (Quit: ognatortcele)
[23:58] * aliguori (~anthony@32.97.110.59) Quit (Remote host closed the connection)
[23:59] * cp (~cp@75.103.61.58) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.