#ceph IRC Log


IRC Log for 2012-03-01

Timestamps are in GMT/BST.

[0:07] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[0:16] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[0:31] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[0:34] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[0:39] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) Quit (Quit: Who the hell is this peer? If I catch him, I'm going to reset his connection!)
[0:40] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[0:44] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[0:45] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[0:46] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[0:50] * adjohn is now known as Guest4581
[0:50] * _adjohn (~adjohn@ has joined #ceph
[0:50] * _adjohn is now known as adjohn
[0:52] * lofejndif (~lsqavnbok@83TAADQ4B.tor-irc.dnsbl.oftc.net) has joined #ceph
[0:54] * Guest4581 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Ping timeout: 480 seconds)
[1:00] * joao (~joao@ has joined #ceph
[1:02] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[1:10] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[1:36] <sagewk> tv|work, sjust: pushed json_spirit makefile fix to next
[1:44] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:45] * Enoria (~Enoria@albaldah.dreamhost.com) Quit (Remote host closed the connection)
[1:45] <sjust> yup
[1:47] * lofejndif (~lsqavnbok@83TAADQ4B.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[1:49] * raso (~raso@debian-multimedia.org) Quit (Ping timeout: 480 seconds)
[1:50] <darkfader> nhm: sorry :( i had to wipe the systems for the next class. i thought about it, and if you wanna i can reproduce it within 1-2 weeks
[1:50] <darkfader> it was a bit sad it topsided SO fast
[1:50] <darkfader> what i noted was only one mds came up instead of 2 or 3
[1:50] <darkfader> i didn't notice on time
[1:51] <darkfader> and i think our bottleneck on the network was part of the problems, which *is* a nice test caser
[1:51] <darkfader> -r
[1:52] * joao (~joao@ Quit (Quit: joao)
[1:53] <nhm> darkfader: yeah, I imagine once the floodgates open we'll have all kinds of crazy end-user setups so the more data we have now the merrier. :)
[1:53] <darkfader> hrhr
[1:54] <darkfader> it was actually funny
[1:54] <darkfader> during the intro to ceph i told them a typical stress test that could still fail would be a powerful stream write plus one more that fires off a few 100k small ops
[1:55] <darkfader> and that guy didn't relate that was *exactly* his bonnie++ run
[1:55] * raso (~raso@debian-multimedia.org) has joined #ceph
[1:55] <darkfader> since $boss was going cheap on the switches we had some 100mbit and some gbit segments
[1:56] <darkfader> so totally uneven osd connections
[1:56] <nhm> nice
[1:56] <darkfader> but if i can pick something: dedicate time on gceph, really
[1:56] <darkfader> because it's not easy to explain ceph to people who are not aware of scale-out object storage from scratch
[1:56] <darkfader> but gceph does that on an instant
[1:57] <darkfader> (the instant before the segfault)
[1:57] * lofejndif (~lsqavnbok@83TAADQ5N.tor-irc.dnsbl.oftc.net) has joined #ceph
[1:57] <nhm> darkfader: we'll see what they have me working on. Primarily I'm going to be looking at performance, but who knows?
[1:57] <nhm> anyway, gotta run and put kids to bed. ttyl
[2:01] <darkfader> nhm: performance was like this:
[2:01] <darkfader> 100MB/s line rate
[2:01] <darkfader> drop
[2:01] <darkfader> 23KB/s
[2:01] <darkfader> 5KB/s
[2:01] <darkfader> get better for a while
[2:01] <darkfader> then everything going to hell
[2:01] <darkfader> then 23% of PGs lost 3 OSD crash
[2:02] <darkfader> s/lost/degraded-peering/
[2:04] * adjohn is now known as Guest4593
[2:04] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[2:06] * tnt_ (~tnt@235.36-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[2:10] * Guest4593 (~adjohn@ Quit (Ping timeout: 480 seconds)
[2:11] * Tv|work (~Tv__@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:11] <darkfader> nite
[2:13] * adjohn is now known as Guest4594
[2:13] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[2:18] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:20] * Guest4594 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Ping timeout: 480 seconds)
[2:37] <nhm> darkfader: how much time would pass for each phase?
[2:43] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[2:56] * lofejndif (~lsqavnbok@83TAADQ5N.tor-irc.dnsbl.oftc.net) Quit (Quit: Leaving)
[3:26] * dmick (~dmick@aon.hq.newdream.net) has left #ceph
[3:37] * adjohn (~adjohn@50-0-92-115.dsl.dynamic.sonic.net) has joined #ceph
[3:41] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:04] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[4:05] * JohnbedoL (administra@ has joined #ceph
[4:05] * JohnbedoL (administra@ has left #ceph
[4:27] * Enoria (~Enoria@albaldah.dreamhost.com) has joined #ceph
[4:33] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:43] <lxo> do mon logs ever get trimmed? I've got 4 2GB log.* files in each mon's tree (monster-y :-)
[4:43] <lxo> this was after a long recovery past a crushmap reorg that caused tons of these new “delayed/starting/etc op” messages to be logged
[6:08] <imjustmatthew> lxo: I don't think there's any automatic trimming, you can run logrotate more often on the files in /var/log/ceph to keep things under control
[6:08] <imjustmatthew> lxo: or is it's one-time just gzip them...
[6:08] <imjustmatthew> *if
[6:10] <lxo> imjustmatthew, oh, I don't mean the /var/log logs, but rather the ones in the mon tree itself: log, log.debug, log.info and log.warn are the big ones, and they've been like that for over 24h, with all cluster members up and running
[6:14] <lxo> that filled up my primary monitor's filesystem; bringing it down and re-copying the data over from another monitor seems to have got much better compression (btrfs compress=lzo) and it all fit with some room to spare, but since it's also my / filesystem, I'm a bit concerned that it will grow unbounded again
[6:15] <imjustmatthew> lxo: yeah, I think there was an e-mail about that on the list, but I can't find it right now, try Wido den Hollander
[6:16] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[6:18] * adjohn (~adjohn@50-0-92-115.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[6:20] * imjustmatthew (~matthew@pool-96-228-59-130.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[6:23] * nyeates (~nyeates@pool-173-59-237-75.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[6:36] * imjustmatthew (~imjustmat@pool-96-228-59-130.rcmdva.fios.verizon.net) has joined #ceph
[7:00] * adjohn (~adjohn@50-0-92-115.dsl.dynamic.sonic.net) has joined #ceph
[7:08] * ameen (~ameen@unstoppable.gigeservers.net) Quit (Ping timeout: 480 seconds)
[7:40] * greglap (~Adium@ has joined #ceph
[7:40] <greglap> lxo: there's no trimming on them yet iirc, but you can do it manually if you like
[7:41] <greglap> there probably should be, but it's a silly little thing nobody's done yet and we might decide we'd rather do super-compression and keep them around for debugging purposes or something
[7:49] * greglap (~Adium@ Quit (Quit: Leaving.)
[7:51] * tnt_ (~tnt@235.36-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:48] <lxo> gregaf, you mean trimming manually as in just removing or truncating the files?
[8:48] <lxo> should the mon be down when I do that?
[8:52] * tnt_ (~tnt@235.36-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:04] * tnt_ (~tnt@office.intopix.com) has joined #ceph
[9:05] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:10] * stxShadow (~jens@p4FFFEB89.dip.t-dialin.net) has joined #ceph
[9:11] * stxShadow (~jens@p4FFFEB89.dip.t-dialin.net) Quit ()
[9:11] * stxShadow (~jens@p4FFFEB89.dip.t-dialin.net) has joined #ceph
[9:12] * stxShadow (~jens@p4FFFEB89.dip.t-dialin.net) Quit ()
[9:13] * stxShadow (~jens@p4FFFEB89.dip.t-dialin.net) has joined #ceph
[9:58] <lxo> gregaf, answering myself, just moving log and log.* aside and restarting the monitor works just fine
[10:02] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:20] * adjohn (~adjohn@50-0-92-115.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[10:53] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:12] * joao (~joao@ has joined #ceph
[12:00] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:00] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[12:00] * fronlius_ is now known as fronlius
[12:25] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:25] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[12:25] * fronlius_ is now known as fronlius
[13:04] * lofejndif (~lsqavnbok@tor.pm-ib.de) has joined #ceph
[13:33] * jks2 (jks@ has joined #ceph
[13:41] * jksM (jks@ Quit (Ping timeout: 480 seconds)
[13:49] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[13:53] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[13:57] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[14:49] * lofejndif (~lsqavnbok@83TAADRJH.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[14:54] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:03] * joao (~joao@ Quit (Ping timeout: 480 seconds)
[15:10] <nhm> good morning all
[15:24] * joao (~joao@ has joined #ceph
[15:44] <darkfader> nhm: dont know, i think 20 minutes then it seemed to block for ~3-4mins, then we were in the slow write phase + slow ls responses, then after 5 more ls became snap-fast again but write stayed slow
[15:45] <darkfader> i'll edit our system images to have a monitoring agent from the start, then i can get better data when i test stuff like that again
[15:45] <nhm> hrm, ok.
[15:45] <nhm> darkfader: great, thanks
[15:46] <nhm> darkfader: I'm doing some testing now with collectl running and also am playing around with Jeff Layton's strace analyzer. I'm probably going to have to modify it to support -f to make it useful for examining ceph though.
[15:47] <darkfader> an strace analyzer? sounds cool
[15:47] <darkfader> brb let me look at that
[15:47] <nhm> darkfader: his page seems to be down but you can get it off github here: https://github.com/joewilliams/strace_analyzer_ng
[15:47] <nhm> apparently there is a python version floating around too.
[15:49] <darkfader> thanks, a lot more tabs to read later :))
[15:49] <darkfader> i need to make a coffee and go back to my work now since $bed is trying to trick me into a nap
[15:49] <nhm> hehe
[15:56] <elder> Welcome Mark!
[15:56] <nhm> elder: thanks. :)
[16:05] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[16:16] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[16:30] * nyeates (~nyeates@pool-173-59-237-75.bltmmd.fios.verizon.net) has joined #ceph
[17:23] * prstshk (~dsf@triband-mum- has joined #ceph
[17:23] <prstshk> hi all, does ceph not add a new OSD to the crush maps?
[17:29] <stxShadow> no ... not automatically
[17:38] <nhm> elder: ping
[17:38] <elder> Yes?
[17:39] <nhm> elder: looks like I just saw unexpected state (4) on a test cluster I set up on sepia. Sounds like you saw something like that in 2099?
[17:39] <elder> Just a sec.
[17:39] <elder> Why yes it does look that way. What are you running?
[17:40] <elder> Or rather, are you running the kernel client master, or testing branch, or what?
[17:40] <nhm> elder: looks like the same thing you were running, 3.2.0-ceph-00164-gcc050a5
[17:41] <elder> You were running wip-messenger?
[17:41] <elder> Oh, nevermind.
[17:41] <elder> I see I updated it later.
[17:42] <stxShadow> elder: could you tell me when 1907 will be in the 3.3 rc series? this bug drives me crazy
[17:42] <elder> As Sage pointed out, it may not be an issue at all. Just make a note in that tracker entry that you saw it too I guess.
[17:43] <nhm> elder: guess I'll have to reboot that node or something. lsof is hanging and I can't unmount.
[17:43] <elder> Use the testing branch.
[17:43] <elder> For now.
[17:43] <elder> I'm almost ready to be cleaning all this up and will send out an announcement about it (I'm hoping before the end of this week).
[17:43] <elder> Basically, at the moment, the ceph-client/testing branch is the current best code we have.
[17:44] <elder> I have all that stuff out for review right now, and once it's been reviewed I'll update the testing branch, and also update the master branch.
[17:44] * Tv|work (~Tv__@aon.hq.newdream.net) has joined #ceph
[17:44] <elder> The master branch is going to become stable (which is different from how it has been used until now)
[17:46] <elder> Basically: wip --- (ready for review) ---> testing --- (reviewed) ---> master --- (ready for Linus) --> for-linus
[17:47] <stxShadow> hmm
[17:48] <joao> btw, what does 'wip' stand for?
[17:48] <elder> Work In Progress
[17:48] <joao> ah k
[17:48] <joao> thanks :)
[17:49] <elder> I can't state that's "official and settled" but I'm going to try to get agreement on it or something close, soon.
[17:50] <stxShadow> we use the /dev/rbdx to blow up our images ..... but sometimes something goes wrong .... and we have to reboot the machine
[17:51] <elder> Can you be more specific stxShadow?
[17:52] <elder> If we can narrow down what's happening (and record it in the tracker) we can start trying to fix the problem.
[17:52] <stxShadow> it's bug 1907 -> it is solved already
[17:52] <stxShadow> but not currently in 3.3.0-rc5
[17:52] <nhm> hrm... I wonder if my strace blew up the OSD.
[17:53] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[17:54] <elder> Ahh.
[17:55] <elder> The bug was fixed after a series of other changes, and it seemed like it might be tough to port the real fix back to the 3.3 baseline code. Now that I've got them out for review I'll look a bit more closely at getting it ported back, and if reasonable, send it to Linus for 3.3.
[17:56] <stxShadow> thanks !!
[18:06] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[18:15] * prstshk (~dsf@triband-mum- Quit (Quit: prstshk)
[18:18] <yehudasa__> nhm: welcome!
[18:18] * tnt_ (~tnt@office.intopix.com) Quit (Ping timeout: 480 seconds)
[18:18] <nhm> yehudasa__: thanks!
[18:24] * aliguori (~anthony@ has joined #ceph
[18:29] * tnt_ (~tnt@235.36-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:41] * nyeates (~nyeates@pool-173-59-237-75.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[18:48] * stxShadow (~jens@p4FFFEB89.dip.t-dialin.net) Quit (Remote host closed the connection)
[18:54] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:03] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[19:31] * aliguori (~anthony@ Quit (Ping timeout: 480 seconds)
[19:40] * aliguori (~anthony@ has joined #ceph
[19:41] * chutzpah (~chutz@ has joined #ceph
[19:55] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[20:00] * adjohn is now known as Guest4702
[20:00] * Guest4702 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Connection reset by peer)
[20:00] * _adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[20:00] * _adjohn is now known as adjohn
[20:04] * cattelan_away is now known as cattelan
[20:12] * nyeates (~nyeates@pool-173-59-237-75.bltmmd.fios.verizon.net) has joined #ceph
[20:13] * nyeates (~nyeates@pool-173-59-237-75.bltmmd.fios.verizon.net) Quit ()
[20:24] <wido> I was playing a bit with librados today
[20:25] <wido> I set "key" to a random string (not base64) and that caused a crash. searching a bit I found that MonClient::init doesn't handle it well when bufferlist::decode_base64 fails
[20:25] <wido> Now I'm not sure if that is on purpose or not
[20:26] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[20:33] <sjust> wido: by key you mean the librados "key" concept rather than the object name?
[20:40] <yehudasa__> wido: the key is supposed to be base64 encoded. The MonClient shouldn't crash though
[20:42] <wido> sjust: Uh, I meant the cephx key
[20:42] <wido> yehudasa__: Yes, I know. I was testing my libvirt storage driver and saw libvirt crash. Tracing it down showed that monclient was the problem since I didn't load a base64 encoded key
[20:44] <wido> yehudasa__: http://pastebin.com/Hv0q9YL8
[20:44] <wido> rados_conf_set(cluster, "key", "AABBCCDDEEFFGG");
[20:44] <wido> was a small test
[20:46] <yehudasa__> wido: yeah, I'll open a bug for it. Fix should be trivial, need to try/catch around the call to decode_base64
[20:47] <wido> Figured so indeed. I'm not that familiar with the code, better let you handle it
[20:49] <yehudasa__> wido: issue #2124
[20:49] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[20:54] <wido> yehudasa__: thanks
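The crash above comes from handing a non-base64 string to the cephx "key" option. Until the MonClient side catches the decode failure (issue #2124), a caller could guard the value itself before calling `rados_conf_set(cluster, "key", ...)`. What follows is a minimal sketch, assuming strict RFC 4648 base64 with `=` padding; Ceph's own decoder may differ in details, and the helper name `looks_like_base64_key` is hypothetical, not part of librados.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True for the 64 characters of the standard base64 alphabet (RFC 4648). */
static bool is_b64_char(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '+' || c == '/';
}

/*
 * Hypothetical pre-check: returns true if s has the shape of padded
 * base64 (non-empty, length a multiple of 4, at most two trailing '='
 * and only alphabet characters before them). It checks format only,
 * not whether the key is valid for the cluster.
 */
bool looks_like_base64_key(const char *s) {
    if (s == NULL)
        return false;
    size_t len = strlen(s);
    if (len == 0 || len % 4 != 0)
        return false;

    /* len >= 4 here, so s[len - 2] is in bounds. */
    size_t pad = 0;
    if (s[len - 1] == '=') {
        pad++;
        if (s[len - 2] == '=')
            pad++;
    }

    /* Everything before the padding must be a base64 alphabet character. */
    for (size_t i = 0; i < len - pad; i++) {
        if (!is_b64_char(s[i]))
            return false;
    }
    return true;
}
```

For example, the `"AABBCCDDEEFFGG"` test string from the log is rejected because its 14 characters are not a multiple of 4, while `"QUJDREVGR0g="` (the base64 encoding of `ABCDEFGH`) passes. Checking on the caller's side keeps a bad config value from ever reaching the decoder, but it complements rather than replaces the try/catch fix in MonClient.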
[21:09] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[21:15] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[21:18] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[21:23] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[21:26] * ^conner (~conner@leo.tuc.noao.edu) Quit (Read error: Operation timed out)
[21:41] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[21:44] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[21:46] * verwilst (~verwilst@dD57696BA.access.telenet.be) has joined #ceph
[22:00] * Tv|work (~Tv__@aon.hq.newdream.net) Quit (Quit: Tv|work)
[22:00] * Tv|work (~Tv__@aon.hq.newdream.net) has joined #ceph
[22:11] * verwilst (~verwilst@dD57696BA.access.telenet.be) Quit (Quit: Ex-Chat)
[22:12] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:12] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[22:14] * aliguori_ (~anthony@ has joined #ceph
[22:14] * fronlius (~fronlius@f054184025.adsl.alicedsl.de) has joined #ceph
[22:19] * aliguori (~anthony@ Quit (Ping timeout: 480 seconds)
[22:21] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[22:26] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[22:30] * BManojlovic (~steki@ has joined #ceph
[22:32] * lofejndif (~lsqavnbok@82VAAB41E.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:38] * stxShadow (~jens@ip-88-153-224-220.unitymediagroup.de) has joined #ceph
[22:45] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[22:53] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[23:03] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[23:06] * fronlius (~fronlius@f054184025.adsl.alicedsl.de) Quit (Quit: fronlius)
[23:09] * stxShadow (~jens@ip-88-153-224-220.unitymediagroup.de) Quit (Quit: bye bye !! )
[23:12] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[23:41] * pulsar (6a5be70dba@ Quit (Quit: bbl)
[23:45] * aliguori_ (~anthony@ Quit (Ping timeout: 480 seconds)
[23:46] * tnt_ (~tnt@235.36-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.