#ceph IRC Log


IRC Log for 2011-02-22

Timestamps are in GMT/BST.

[2:18] * Dantman (~dantman@S0106687f740dba3e.vc.shawcable.net) Quit (Ping timeout: 480 seconds)
[3:02] * ooolinux (~bless@ has joined #ceph
[3:03] * DJLee (82d8d198@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[3:03] <ooolinux> hi
[3:34] <ooolinux> do you know what paxos_service and paxos are? different?
[3:35] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[5:33] <greglap1> johnl: Ceph's monitors use a custom implementation of the basic leader-follower two-phase commit
[5:33] <greglap1> not sure if you want more than that
[5:33] <greglap1> ooolinux: the Paxos class implements the actual data replication/committal stuff
[5:34] <greglap1> PaxosService is a related class which provides the interface used by the different kinds of monitors to provide their own logic for updates and stuff
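The split greglap1 describes, one class doing replication and commit and a second class giving each monitor service a place for its own update logic, can be sketched roughly as below. This is a toy illustration only, not Ceph's code: the `Follower` class, the method names, the majority check, and the dictionary-shaped map update are all assumptions made for the example, and the real monitors are considerably more involved.

```python
class Follower:
    """Hypothetical follower: stages a value, then commits it when told to."""

    def __init__(self):
        self.pending = None
        self.committed = []

    def accept(self, version, value):
        # Phase 1: stage the proposed value and acknowledge it.
        self.pending = (version, value)
        return True

    def commit(self, version):
        # Phase 2: make the staged value durable.
        if self.pending is not None and self.pending[0] == version:
            self.committed.append(self.pending)
            self.pending = None


class Paxos:
    """Toy leader that replicates opaque values with a two-phase commit."""

    def __init__(self, followers):
        self.followers = followers
        self.committed = []

    def propose(self, value):
        version = len(self.committed) + 1
        # The leader counts itself as one acknowledgement.
        acks = 1 + sum(f.accept(version, value) for f in self.followers)
        cluster_size = len(self.followers) + 1
        if acks <= cluster_size // 2:
            return None  # no majority, nothing is committed
        self.committed.append((version, value))
        for f in self.followers:
            f.commit(version)
        return version


class PaxosService:
    """Interface a service implements to supply its own update logic."""

    def __init__(self, paxos):
        self.paxos = paxos

    def encode_pending(self, change):
        raise NotImplementedError

    def update_from_paxos(self, value):
        raise NotImplementedError

    def propose_change(self, change):
        value = self.encode_pending(change)
        version = self.paxos.propose(value)
        if version is not None:
            self.update_from_paxos(value)
        return version


class OSDMapService(PaxosService):
    """Hypothetical map service: every committed change bumps the epoch."""

    def __init__(self, paxos):
        super().__init__(paxos)
        self.epoch = 0

    def encode_pending(self, change):
        return {"epoch": self.epoch + 1, "change": change}

    def update_from_paxos(self, value):
        self.epoch = value["epoch"]


followers = [Follower(), Follower()]
paxos = Paxos(followers)
osdmap = OSDMapService(paxos)
for change in ["osd.0 up", "osd.1 up", "osd.2 up"]:
    osdmap.propose_change(change)
```

In this sketch the epoch only advances after the replication layer reports a commit, which is one plausible way to picture a map moving from one epoch to the next.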
[6:35] * bless_ (~bless@ has joined #ceph
[6:40] * ooolinux (~bless@ Quit (Ping timeout: 480 seconds)
[7:34] * darkfader (~floh@ Quit (Remote host closed the connection)
[7:34] * darkfader (~floh@ has joined #ceph
[7:56] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[7:59] * ooolinux (~bless@ has joined #ceph
[8:01] <ooolinux> take osdmap as example, how it update from small to large number epoch? like mon0/osdmap/1--->100. what is the process
[8:04] * bless_ (~bless@ Quit (Ping timeout: 480 seconds)
[8:09] <ooolinux> hi
[8:55] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:30] * allsystemsarego (~allsystem@ has joined #ceph
[9:36] * hijacker (~hijacker@ Quit (Quit: Leaving)
[10:14] * Yoric (~David@ has joined #ceph
[10:20] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[10:23] * ooolinux (~bless@ Quit (Quit: Leaving)
[10:26] * hijacker (~hijacker@ has joined #ceph
[10:29] * hijacker (~hijacker@ Quit ()
[10:29] * hijacker (~hijacker@ has joined #ceph
[10:49]
[11:10] * Yoric (~David@ Quit (Quit: Yoric)
[12:09] * darkfader (~floh@ Quit (Remote host closed the connection)
[12:09] * darkfader (~floh@ has joined #ceph
[13:14] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[13:29] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[13:49] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:09] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:29] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[15:52] * greglap1 (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[15:55] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Remote host closed the connection)
[15:57] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[16:31] * greglap (~Adium@ has joined #ceph
[17:21] * greglap (~Adium@ Quit (Quit: Leaving.)
[17:50] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:52] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[17:57] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:00] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:01] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:02] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:15] * Meths_ (rift@ has joined #ceph
[18:22] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[18:23] * Meths_ is now known as Meths
[18:35] <wido> sagewk: you there?
[18:38] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:44] <sagewk> wido: yep
[18:48] <wido> sagewk: http://pastebin.com/ajY5M5FS
[18:48] <wido> Don't know if you saw that yesterday
[18:49] <wido> National holiday I found out :)
[18:49] <sagewk> which branch, and
[19:05] <wido> For now the recovery seems to be working much better than before
[19:05] <sagewk> great
[19:30] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[19:37] <wido> A short update about the Atom. Currently I have 4 OSD's (2TB) running. 4GB Ram and right now ~700MB is in use
[19:37] <wido> recovery ops is set to 1 for all OSD's, right now about ~200Mbit of data is being written to the machine, load is about 1.3
[19:39] <prometheanfire> sounds good :D
[19:39] <prometheanfire> most of that is probably iowait too :D
[19:41] <wido> No, not really. CPU seems 60% idle. I/O wait is at 20%
[19:41] <wido> I think I could set recovery ops to 2 or maybe 3
[19:44] <prometheanfire> try it
[19:52] <wido> yes, when I get the rest of my HW
[19:59] <wido> hmm, my recovery seems to stall now
[19:59] <wido> pg v63698: 8240 pgs: 72 active, 8168 active+clean; 260 GB data, 534 GB used, 14355 GB / 14904 GB avail; 4445/134768 degraded (3.298%)
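The status line wido pasted can be checked with a little arithmetic. The snippet below is a minimal sketch that parses that exact line and recomputes the degraded percentage; the function name, the regular expressions, and the returned dictionary keys are assumptions for illustration and are not part of any Ceph tool. It treats the two numbers before "degraded" simply as a numerator and a denominator, without assuming what the denominator counts.

```python
import re

PG_LINE = (
    "pg v63698: 8240 pgs: 72 active, 8168 active+clean; "
    "260 GB data, 534 GB used, 14355 GB / 14904 GB avail; "
    "4445/134768 degraded (3.298%)"
)


def parse_pg_status(line):
    """Extract PG counts per state and the degraded ratio from a status line."""
    total_pgs = int(re.search(r"(\d+) pgs:", line).group(1))

    # The per-state counts sit between "pgs:" and the first semicolon.
    states_part = line.split("pgs:", 1)[1].split(";", 1)[0]
    states = {}
    for chunk in states_part.split(","):
        count, state = chunk.split()
        states[state] = int(count)

    match = re.search(r"(\d+)/(\d+) degraded", line)
    degraded, total = int(match.group(1)), int(match.group(2))

    return {
        "total_pgs": total_pgs,
        "states": states,
        "degraded": degraded,
        "total": total,
        "degraded_pct": 100.0 * degraded / total,
    }


status = parse_pg_status(PG_LINE)
```

Here the per-state counts add up to the PG total (72 + 8168 = 8240), and 4445 / 134768 rounds to the 3.298% shown in the line.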
[20:00] <wido> auto-scrub started doing its thing again. But I/O operations in the VM have started to stall
[20:01] <wido> "rados -p rbd ls" works though
[20:03] <sagewk> do you have osd logs?
[20:08] <wido> sagewk: Yes, but logging is low
[20:09] <wido> How do you do that? Since 8 OSD's with debugging on 20 will produce SO much information
[20:09] <wido> That I would need 2TB disks for just logging
[20:09] <wido> sagewk: I've got remote syslogging running: logger.ceph.widodh.nl:/srv/ceph/remote-syslog
[20:10] <wido> ceph-osd.log
[20:10] <sagewk> no worries. for our test clusters we usually have a dedicated disk for the logs
[20:10] <sagewk> i'm seeing something similar on one cluster.. let me diagnose that first, may be the same problem.
[20:11] <wido> sagewk: Ok. I went from 4 to 8 OSD's, so my degradation was ~45% some 6 hours ago
[20:14] <gregaf> wido: did you bump up your replication when you added the OSDs?
[20:18] <wido> gregaf: No, it was on 3. I added the OSD's and updated the crushmap
[20:18] <gregaf> and it became degraded after you added the OSDs?
[20:18] <wido> gregaf: Yes, I added 4 OSD's, went from 4 to 8. Then I went to 45% degraded
[20:19] <wido> started rebalancing the data, ran for 6 hours until it stalled at 3.298%
[20:19] <gregaf> oh, heh, I got my terminology wrong...
[20:19] <gregaf> was thinking that meant there weren't enough replicas but in our happy lingo land it just means that the objects aren't in all the places they should be ;)
[20:21] <wido> ah, ok :)
[20:22] <wido> I think my strategy was ok? 1: Format OSD's, 2: Add to keyring, 3: bump max osd, 4: start osd's, 5: add to crushmap
[20:22] <sagewk> yeah it's fine
[20:41] * allsystemsarego (~allsystem@ Quit (Ping timeout: 480 seconds)
[20:49] <Tv> i need to stop calling the mon key monkey.. it just makes me laugh too often
[20:56] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) Quit (Read error: Operation timed out)
[20:56] <bchrisman> heh yeah
[20:59] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[21:07] * cclien (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) has joined #ceph
[21:07] * cclien_ (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) Quit (Read error: Connection reset by peer)
[21:10] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[21:22] <wido> sagewk: I'm going afk. If you want to take a look in my cluster, you can access "noisy" and "atom" from logger.ceph.widodh.nl
[21:22] <wido> Those are the two machines right now
[21:53] <sagewk> ok thanks
[21:58] * verwilst (~verwilst@dD576FAAE.access.telenet.be) has joined #ceph
[23:30] * jantje_ (~jan@paranoid.nl) Quit (Read error: Connection reset by peer)
[23:31] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[23:36] * jantje (~jan@paranoid.nl) has joined #ceph
[23:45] * valtha (~valtha@ohmu.fi) Quit (Remote host closed the connection)
[23:45] * valtha (~valtha@ohmu.fi) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.