#ceph IRC Log


IRC Log for 2012-07-13

Timestamps are in GMT/BST.

[0:04] <mgalkiewicz> gregaf: from the one which causes election?
[0:04] <gregaf> from cc2, since it's the leader and the one calling them all
[0:05] <mgalkiewicz> k
[0:06] <mgalkiewicz> gregaf: it may disturb connected rbd clients?
[0:06] * theron (~Theron@ip66-43-220-25.static.ishsi.com) Quit (Read error: Connection reset by peer)
[0:06] <elder> Is this workunit expected to work: rbd/test_librbd.sh ?
[0:06] * theron (~Theron@ip66-43-220-25.static.ishsi.com) has joined #ceph
[0:07] <joshd> yes, it runs through some tests for the userspace stuff
[0:07] <gregaf> mgalkiewicz: it shouldn't cause them much more harm than a monitor election
[0:07] <mgalkiewicz> ok
[0:08] <elder> I hit some sort of problem, I'm going to try to narrow it down. I'm running stable branch
[0:08] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[0:09] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[0:09] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) Quit (Read error: Connection reset by peer)
[0:09] <elder> Maybe it was rbd/copy.sh
[0:10] <joshd> elder: copy.sh won't work on the stable branch right now due to a bug; I removed it from the stable branch in the qa suite a day or two ago
[0:10] <elder> O-key-doke.
[0:10] <mgalkiewicz> gregaf: ceph-mon.cc2.log on your ceph.com machine
[0:10] <joshd> elder: the problem is that workunits are always pulled from the master branch right now
[0:11] <elder> I've been using stable in an effort to avoid bumping into surprises, which has happened too many times.
[0:11] <elder> Maybe I'm just hitting a different sort of surprise this way.
[0:14] <joshd> this sort shouldn't happen, but it requires some work on teuthology to make the workunit task more flexible
[0:21] <mgalkiewicz> gregaf: please take a look I gotta go
[0:22] <gregaf> yep, been looking
[0:22] <gregaf> mgalkiewicz: well, I can see that mon.2 (n12c1) simply isn't talking to cc2
[0:22] <mgalkiewicz> k
[0:22] <gregaf> cc2 is sending it messages but it's not getting responses or other messages of any kind
[0:22] <mgalkiewicz> hmm
[0:22] <mgalkiewicz> I will check connectivity
[0:23] <mgalkiewicz> that's right, they cannot ping each other
[0:23] <elder> When data from a message is decoded, I've noticed the result is inconsistent--either return -ERANGE or in some cases, return -EINVAL. I'll make it all consistent someday, but for now, which is better?
[0:23] <gregaf> also look at cpu usage again and maybe do a rolling restart on them all with logs and see if they're doing anything
[0:23] <gregaf> okay, if you have monitor machines that can't ping each other that's going to do all kinds of bad stuff
[0:23] <elder> I think -ERANGE is better, but I'd like a second opinion.
[0:25] <joshd> elder: is that when there's not enough space to decode the message? in that case ERANGE tends to be used in many places in ceph
[0:25] <elder> Correct.
[0:25] <elder> But -EINVAL is sometimes used also.
[0:25] <elder> I'm going with ERANGE
[0:26] <joshd> sounds good to me. I'm guessing most callers don't check for EINVAL anyway.
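A minimal sketch of the convention agreed above: return -ERANGE whenever the buffer is too short to decode what the message claims to contain. It assumes a string encoded as a 32-bit little-endian length followed by that many bytes. The name extract_encoded_string and the (err, value, next_offset) return shape are hypothetical and are not the kernel's ceph_extract_encoded_string().

```python
import errno
import struct


def extract_encoded_string(buf: bytes, offset: int = 0):
    """Decode a u32-length-prefixed string starting at `offset`.

    Returns (0, data, next_offset) on success, or
    (-errno.ERANGE, None, offset) if `buf` is too short for either the
    length field or the payload it announces.
    """
    if offset < 0 or len(buf) - offset < 4:
        return -errno.ERANGE, None, offset
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    if len(buf) - start < length:
        return -errno.ERANGE, None, offset
    return 0, buf[start:start + length], start + length
```

Using one error code for every "ran out of bytes" case means callers only have to test for a negative result, which fits joshd's guess that most callers do not distinguish EINVAL anyway.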
[0:27] * dmick (~dmick@ has left #ceph
[0:27] <mgalkiewicz> gregaf: ok, problem fixed; there were some connectivity issues because of ipsec. thx!
[0:27] <mgalkiewicz> gregaf: and entries like mon.cc2@0(leader).log v15065 check_sub sub monmap not log type
[0:27] <mgalkiewicz> are normal
[0:27] <mgalkiewicz> ?
[0:27] <mgalkiewicz> and mon.cc2@0(leader).log v15075 check_sub sub osdmap not log type
[0:29] <gregaf> that's an inconsequential bug (it's just noise output)
[0:29] <gregaf> fixed for the next version, but not anything you need to worry about
[0:29] * BManojlovic (~steki@ has joined #ceph
[0:32] <mgalkiewicz> gregaf: ok thx
[0:32] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) Quit (Quit: Ex-Chat)
[0:41] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:52] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[0:57] <sagewk> wip-crush-encoding, for anyone who wants to see the crush word size encoding fix
[0:59] <elder> Sage, posted another edition of the ceph_extract_encoded_string() patch.
[0:59] <elder> I'm using ERR_*(). Please don't say that's bad :)
[1:00] <elder> Off to dinner. I'll be back in an hour or two.
[1:05] * theron_ (~Theron@ip66-43-220-25.static.ishsi.com) has joined #ceph
[1:05] * theron (~Theron@ip66-43-220-25.static.ishsi.com) Quit (Read error: Connection reset by peer)
[1:05] * theron_ is now known as theron
[1:16] * Tv_ (~tv@2607:f298:a:607:394a:5e1a:feb6:b166) Quit (Quit: Tv_)
[1:33] * theron (~Theron@ip66-43-220-25.static.ishsi.com) Quit (Quit: theron)
[1:34] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[2:03] * Cube (~Adium@ Quit (Quit: Leaving.)
[2:04] * JJ (~JJ@ Quit (Ping timeout: 480 seconds)
[2:20] * The_Bishop (~bishop@2a01:198:2ee:0:4da1:7b7b:4d6e:2594) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[2:22] <joao> elder, the ERR_*() macros are probably the most useful and awesome thing I have ever seen in my life
[2:22] <joao> when it comes to error handling, I mean
[2:55] <elder> I think they're a good thing too, very useful.
[2:55] <elder> But Sage suggested I return the length rather than the pointer in order to avoid using them, so I was afraid he might disagree.
[3:29] * tnt_ (~tnt@150.189-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[3:44] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[3:46] * renzhi (~renzhi@ has joined #ceph
[4:02] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:50] * nhmlap (~Adium@ Quit (Quit: Leaving.)
[5:19] * nhmlap (~Adium@ has joined #ceph
[6:20] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[7:34] * dpemmons (~dpemmons@ has joined #ceph
[7:39] <dpemmons> I'm new to ceph and have been reading through the docs a bit to learn its architecture. I haven't been able to figure out one thing though: does the filesystem client cache reads? If so, how aggressively does it read ahead?
[7:40] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Ping timeout: 480 seconds)
[7:44] <dpemmons> (and by cache I mean local disk cache, not memory cache which it appears to do)
[8:00] * tnt_ (~tnt@150.189-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:10] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[8:31] * nhmlap (~Adium@ Quit (Quit: Leaving.)
[8:52] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:19] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) Quit (Remote host closed the connection)
[9:27] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[9:29] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[9:39] * tjikkun_ (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[9:43] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[9:54] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[10:04] * BManojlovic (~steki@ has joined #ceph
[10:14] * loicd (~loic@ has joined #ceph
[10:15] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[10:25] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[10:37] * renzhi (~renzhi@ Quit (Quit: Leaving)
[10:44] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:54] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[10:54] * sdouglas (~sdouglas@c-24-6-44-231.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[10:58] * LarsFronius (~LarsFroni@95-91-243-243-dynip.superkabel.de) has joined #ceph
[11:02] * sdouglas (~sdouglas@c-24-6-44-231.hsd1.ca.comcast.net) has joined #ceph
[11:08] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[11:11] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[11:12] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[11:13] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[11:21] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[11:21] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[11:25] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[11:26] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[11:38] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[11:38] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[11:48] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[12:07] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[12:08] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:39] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) has joined #ceph
[12:45] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[12:52] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[14:00] <iggy> dpemmons: no, it doesn't
[14:03] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[14:05] <todin> joshd: I updated issue 2777, it works for me.
[14:30] * tnt_ (~tnt@150.189-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[14:37] * tremon (~aschuring@d594e6a3.dsl.concepts.nl) has joined #ceph
[14:37] * tnt_ (~tnt@212-166-48-236.win.be) has joined #ceph
[14:50] * LarsFronius (~LarsFroni@95-91-243-243-dynip.superkabel.de) Quit (Quit: LarsFronius)
[14:51] * LarsFronius (~LarsFroni@95-91-243-243-dynip.superkabel.de) has joined #ceph
[14:55] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[15:04] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[15:09] * theron (~Theron@ip66-43-220-25.static.ishsi.com) has joined #ceph
[15:10] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[15:17] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[15:22] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[15:24] * theron (~Theron@ip66-43-220-25.static.ishsi.com) Quit (Read error: Connection reset by peer)
[15:24] * theron (~Theron@ip66-43-220-25.static.ishsi.com) has joined #ceph
[15:24] * nhmlap (~Adium@ has joined #ceph
[15:27] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[15:32] * tnt_ (~tnt@212-166-48-236.win.be) Quit (Quit: leaving)
[15:36] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[15:38] * MK_FG (~MK_FG@ Quit (Ping timeout: 480 seconds)
[15:41] * MK_FG (~MK_FG@ has joined #ceph
[15:55] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[16:01] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[16:11] * James259 (~James259@ has joined #ceph
[16:28] * theron (~Theron@ip66-43-220-25.static.ishsi.com) Quit (Read error: Connection reset by peer)
[16:29] * theron (~Theron@ip66-43-220-25.static.ishsi.com) has joined #ceph
[16:55] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[17:05] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[17:14] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[17:15] * jluis (~JL@ has joined #ceph
[17:17] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[17:21] * joao (~JL@ Quit (Ping timeout: 480 seconds)
[17:24] * jsfrerot (~jsfrerot@charlie.mdc.gameloft.com) has joined #ceph
[17:25] * The_Bishop (~bishop@e179019194.adsl.alicedsl.de) has joined #ceph
[17:25] <jsfrerot> Hi, anyone here can help with debian packages ? Seems /usr/bin/ceph-mds is not in the ceph package
[17:25] * lofejndif (~lsqavnbok@09GAAGLAY.tor-irc.dnsbl.oftc.net) has joined #ceph
[17:27] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:27] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[17:30] <tremon> jsfrerot: are you sure? It is in ceph 0.43 and 0.47 from testing and sid
[17:36] <jsfrerot> yeah, I downloaded the package from http://ceph.com/debian/
[17:36] <jsfrerot> ii ceph 0.48argonaut-1~bpo60+1 distributed storage and file system
[17:41] <elder> sage, if you will offer a sign-off on this, I will commit the 16 patches recently posted to the "testing" branch. I've been testing it for several hours.
[17:41] <elder> [PATCH v4 04/16] libceph: define ceph_extract_encoded_string()
[17:41] <sage> looks good to me.
[17:41] <elder> OK.
[17:42] <sage> i still suspect that the callers will be happier if you return int, tho .. :)
[17:42] <elder> Just waiting for you to be around before I go ahead with it.
[17:42] <elder> Depends on the programmer more than the caller I think.
[17:42] <sage> otherwise you end up with
[17:42] <sage> struct_foo->str = ceph_decode_string....()
[17:42] <sage> if (IS_ERR(struct_foo->str)) {
[17:42] <sage> err = PTR_ERR(struct_foo->str);
[17:42] <sage> struct_foo->str = NULL;
[17:42] <sage> return err;
[17:42] <sage> }
[17:42] <sage> or similar?
[17:43] <elder> That's correct.
[17:43] <sage> well, unless labels are used for the initialization cleanup. it'd be fine then
[17:44] <sage> but i suspect this bike shed has had enough paint at this point :)
[17:44] * Ryan_Lane (~Adium@ has joined #ceph
[17:44] <elder> It's a pretty sturdy bike shed though.
[17:44] <elder> It can hold a lot of paint.
[17:45] * Ryan_Lane (~Adium@ Quit ()
[17:50] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:50] * Tv_ (~tv@ has joined #ceph
[17:53] * brambles (brambles@ Quit (Remote host closed the connection)
[17:53] * brambles (brambles@ has joined #ceph
[18:00] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[18:01] * sagelap (~sage@2600:1012:b002:b7ba:d942:1186:3b57:31fa) has joined #ceph
[18:02] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[18:08] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[18:09] <nhmlap> Here's a riddle: you have a 7-drive raid0 array that, with a single writer doing direct I/O and 256MB requests, can get 800MB/s. Add a second writer and now your throughput drops to 80MB/s. I'm leaning toward preemption issues. Any thoughts?
[18:09] * Cube (~Adium@ has joined #ceph
[18:10] <nhmlap> I'm thinking blktrace will provide hints.
[18:14] <elder> The OS and FS shouldn't be locking the concurrent writers for direct I/O. Is this on Ceph?
[18:15] <nhmlap> elder: this is with an FIO test since ceph was having issues.
[18:15] <nhmlap> first tested on xfs, then directly on the block device with the same results.
[18:16] <elder> I don't know what a "FIO test" is.
[18:16] <elder> for information only? file I/O?
[18:16] <nhmlap> elder: it's just a benchmarking tool that gives you a lot of options for how to do IO.
[18:17] <nhmlap> http://linux.die.net/man/1/fio
[18:17] <elder> Does the underlying block device have anything to do with rados?
[18:17] <elder> Or is it a direct-attached device?
[18:18] <nhmlap> elder: this is a raid0 array I created on one of the burnupi nodes to test the raid controllers in those machines. I've been having a ton of problems with throughput and I think I've tracked it down to this problem.
[18:19] <elder> Interesting.
[18:19] <joshd> jsfrerot: in 0.48 the mds was split into a separate package, ceph-mds, and client tools for it were moved to ceph-fs-common
[18:19] <elder> And you're using a "real" machine (not virtual)?
[18:19] <nhmlap> buffered IO was sucking, and I think it's because of the same problem I'm seeing here where if there are multiple writers the io wait time skyrockets and the throughput drops through the floor.
[18:19] <nhmlap> elder: yep, this is on a real machine.
[18:20] <elder> Sounds like a good problem to solve.
[18:20] <nhmlap> By default for buffered io do multiple flusher threads get spawned or is it 1 per device?
[18:21] <elder> I don't know off hand. I would expect one per device.
[18:27] <nhmlap> heh, I/O errors on the machine and it didn't come back up after a reboot. yay
[18:30] * sagelap (~sage@2600:1012:b002:b7ba:d942:1186:3b57:31fa) Quit (Ping timeout: 480 seconds)
[18:31] * nhmlap (~Adium@ Quit (Quit: Leaving.)
[18:46] <James259> HEALTH_WARN 8 pgs degraded; 24 pgs stuck unclean; recovery 1446/23434 degraded (6.171%)
[18:47] <James259> been stuck there for a few days.. anyone have any clue if I can somehow force them to update?
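As a quick check of the arithmetic in that status line, the percentage is simply the first recovery count divided by the second. The sketch below assumes the line keeps the "recovery N/M degraded" wording shown above; the helper name degraded_percent is hypothetical and not part of any ceph tool.

```python
import re


def degraded_percent(health_line: str) -> float:
    """Recompute the degraded percentage from a 'recovery N/M degraded' line."""
    match = re.search(r"recovery (\d+)/(\d+) degraded", health_line)
    if match is None:
        raise ValueError("no recovery counts found in line")
    degraded, total = (int(group) for group in match.groups())
    return 100.0 * degraded / total
```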
[18:49] * bchrisman (~Adium@ has joined #ceph
[18:53] <joshd> James259: are all your osds up, or did you permanently lose one? if not, you may have hit a bug
[18:55] <joshd> err, reverse that condition
[18:55] <elder> (I thought it was ambiguous, so, either way)
[18:59] * dmick (~dmick@ has joined #ceph
[19:02] <elder> joshd, does the client not need to keep track of snapshots for format 2? I'm looking at how to get them, and it looks like I need to fetch the snap context.
[19:04] <joshd> yes, the client still needs the snapcontext
[19:04] <elder> OK.
[19:04] <elder> How do I decode a vector?
[19:04] <joshd> it can still get all the snapshot metadata as well (name, size, etc)
[19:05] <elder> Is it just a 32-bit count followed by that many snapid's?
[19:05] <joshd> vector is encoded as (u32 count, encoding of each element)
[19:06] <elder> OK.
[19:06] <elder> I think I have it then. Thanks.
[19:06] <joshd> no problem
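The layout joshd describes (a u32 count, then each element's encoding) can be sketched as follows for a vector of snapshot ids. The sketch assumes little-endian fields and 64-bit ids; the function name decode_snapid_vector is hypothetical.

```python
import struct


def decode_snapid_vector(buf: bytes, offset: int = 0):
    """Decode (u32 count, count * u64 snapid), all little-endian.

    Returns (snapids, next_offset); raises ValueError if `buf` is too
    short for the count field or for the ids the count announces.
    """
    if len(buf) - offset < 4:
        raise ValueError("buffer too short for element count")
    (count,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    if len(buf) - offset < 8 * count:
        raise ValueError("buffer too short for %d snapids" % count)
    snapids = list(struct.unpack_from("<%dQ" % count, buf, offset))
    return snapids, offset + 8 * count
```

Checking the remaining length against the decoded count before reading the elements is what keeps a corrupt or truncated message from being read past its end.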
[19:08] <elder> Is there a maximum number of snapshots? I need to be able to provide space to receive the snapshot context.
[19:09] <James259> Hi Josh. osd.0 had been down and the whole cluster (3 osd's) got rebooted. Initially (after it settled down) it was showing 16 active+remapped. After rebooting all machines again its now stuck at the above. All of the pgs showing problems that are in the active+remapped state show some relation to osd.0 in the detail output. the additional 8 degraded ones just show osd 1/2. (only a single osd)
[19:09] <joshd> elder: no
[19:09] <joshd> elder: that's one reason we might want to stop reading metadata for all of them (currently they're all shown in sysfs)
[19:10] <elder> Regardless I need a way to receive a potentially limitless response from a request for information from the server.
[19:11] <elder> I could query the snap count somehow, allocate space, and then ask for the snapcontext. But by the time I get it the count may have changed.
[19:11] <gregaf> hurray for retry loops
[19:11] <elder> But what if it's gotten larger, so what gets sent is more than the space I provided buffer for?
[19:12] <elder> there's still crap on the wire that needs to be consumed.
[19:12] <gregaf> toss it out, reallocate space based on new data, and query again?
[19:12] <gregaf> (I forgot you had these issues with messaging in the kernel, sorry)
[19:13] <elder> I'm not sure the interface allows me to stop mid-receive and change the number of bytes I expect to get.
[19:14] <elder> If I could, then I could do what you say. I.e., "after receiving the data I asked for, discard this many more bytes."
[19:14] <elder> But right now I think I only have "receive this many bytes."
[19:14] <gregaf> I'm not very familiar with the kernel messenger, but I think you're right
[19:14] <elder> So if more arrive, they aren't discarded, they are just treated as the start of something new.
[19:15] <elder> Which will cause a connection reset.
[19:15] <elder> That's tolerable I guess, but not very nice.
[19:15] * chutzpah (~chutz@ has joined #ceph
[19:17] * theron (~Theron@ip66-43-220-25.static.ishsi.com) Quit (Read error: No route to host)
[19:17] <elder> I'll impose an arbitrary limit for now and defer solving this problem...
[19:17] * theron (~Theron@ip66-43-220-25.static.ishsi.com) has joined #ceph
[19:18] * theron (~Theron@ip66-43-220-25.static.ishsi.com) Quit ()
[19:18] * nhmlap (~Adium@ has joined #ceph
[19:20] <James259> Is there a doc anywhere that lists all the possible commands that can be given to the ceph tool? http://ceph.com/docs/master/control/ has a list but I see from various other pages that there are obviously lots more that are not on that page. (just wondering if there are some commands I can try and might help)
[19:23] <joshd> I don't think there's a complete list anywhere
[19:24] <joshd> you probably want to check where things are stuck in the recovery process though
[19:25] <jsfrerot> joshd: thx, going to try this ;)
[19:27] <James259> I am relatively new to ceph so sorry if I am asking noobish things. I am not sure how I check that. My only guess is looking at the recovery state section at the bottom of the output from 'ceph pg <id> query'. I do not see anything indicating a problem there though...
[19:28] <James259> "recovery_state": [
[19:28] <James259> { "name": "Started\/Primary\/Active",
[19:28] <James259> "enter_time": "2012-07-13 16:37:28.670046",
[19:28] <James259> "might_have_unfound": []},
[19:28] <James259> { "name": "Started",
[19:28] <James259> "enter_time": "2012-07-13 16:37:27.633618"}]}
[19:28] * LarsFronius (~LarsFroni@95-91-243-243-dynip.superkabel.de) Quit (Quit: LarsFronius)
[19:29] <joshd> can you pastebin the output for some pgs that have degraded objects? (ceph pg dump | grep degraded)
[19:29] <dmick> James259: don't worry about the noob factor, that's why we're here
[19:30] <James259> :) Thanks. I feel a little guilty taking up your time but really stuck.
[19:37] <James259> http://pastebin.com/3AATt6zk
[19:37] <elder> joshd, I'm going to go get some lunch but I'd like to discuss whatever you and Sage said about the snap context when I return.
[19:40] <joshd> elder: we just mentioned that the existing code for the old format retries reading the header in a loop if it doesn't have enough space for snapshots. a better way might be to allow allocating a larger receive buffer based on the message size, but sage would have a better idea of how that would work
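The retry approach mentioned here can be sketched as a loop: ask for the count, size the buffer from it, read, and start over if the count grew in between. Both callbacks (get_snap_count and read_snap_ids) are hypothetical stand-ins for the two requests, and the sketch leaves aside the kernel messenger problem elder raised, where excess bytes already on the wire still have to be consumed.

```python
def fetch_snap_context(get_snap_count, read_snap_ids, max_tries=5):
    """Size the buffer from a count query, then read; retry if the count grew.

    get_snap_count() -> int and read_snap_ids(capacity) -> (total, ids)
    are hypothetical callbacks: `total` is how many snapshots existed at
    read time, and `ids` holds at most `capacity` of them.
    """
    for _ in range(max_tries):
        capacity = get_snap_count()
        total, ids = read_snap_ids(capacity)
        if total <= capacity:
            return ids
        # Snapshots were added between the two requests: discard and retry.
    raise RuntimeError("snapshot count kept changing; giving up")
```

Bounding the number of attempts keeps a busy image from spinning the client forever, at the cost of an occasional failure the caller has to handle.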
[19:47] <joshd> James259: what about 'ceph osd dump' and 'ceph osd tree'? it's odd that several of your pgs are only mapped to one osd
[19:48] <joshd> also a query of some of the unclean pgs might tell us something
[19:48] * loicd (~loic@ Quit (Quit: Leaving.)
[19:53] * sagelap (~sage@2607:f298:a:607:d942:1186:3b57:31fa) has joined #ceph
[19:54] <James259> Josh: I think the last two of the queries I posted are from the unclean pg's. Running the other commands now.
[19:56] <joshd> yeah, you're right. those don't look unusual to me though... any thoughts sjust?
[19:57] <sjust> yeah, something odd is happening on those osds
[19:57] <sjust> ***pgs
[19:57] <sjust> how many osds do you have?
[19:58] <James259> http://pastebin.com/pUeAxh0w
[19:58] <James259> 3
[19:58] <sjust> can you post the output of ceph osd tree?
[19:58] <James259> ^^ :)
[19:58] <sjust> ah there we go
[19:58] <James259> Josh asked for it a second ago
[19:58] <sjust> sorry, just noticed :)
[19:59] <sjust> yeah, that's a funny looking map
[19:59] <sjust> can you post your crushmap?
[19:59] <James259> I have only recently started playing with ceph so I have probably done something stupid.
[19:59] <sjust> actually
[19:59] <sjust> ceph osd getmap -o <filename>
[19:59] <sjust> and post <filename>
[20:00] <James259> looks like a binary file?
[20:00] <joshd> it looks like your second two hosts aren't in a rack, so your crushmap might be acting funny as a result
[20:00] <joshd> crushtool -d <filename>
[20:00] <James259> ooh, it didn't like that.
[20:01] <sjust> actually, osdmaptool --export-crush <filename>
[20:01] <sjust> I think
[20:01] <James259> http://pastebin.com/HcikiFVJ
[20:01] <sjust> and run crushtool -d on that
[20:01] <joshd> oh yeah, sorry, I thought you'd said getcrushmap
[20:01] <James259> np. :)
[20:01] <James259> sec
[20:01] <sjust> joshd: yeah, sorry
[20:03] * The_Bishop (~bishop@e179019194.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[20:04] * The_Bishop (~bishop@e179019194.adsl.alicedsl.de) has joined #ceph
[20:06] <sjust> oops, it's osdmaptool <filename> --export-crush <filename2>
[20:06] <sjust> and then crushtool -d <filename2>
[20:07] <James259> scp is not doing a great deal for some reason. here is a pastebin. http://pastebin.com/0Q7nqF5T
[20:09] <James259> i can give you ssh if it makes life easier.. although the crushmap is a place where i was confused for a long time, so may be the source. Also, I have to admit I have no idea what the racks and shelves are. (Other than I saw them mentioned very briefly in the docs as parameters to a command)
[20:10] <joshd> I don't think ssh will be necessary - the crushmap is probably the problem
[20:11] <James259> I see the weight on osd.0 looks to be set to 0. I was trying to do this at one point but wasn't sure what I was doing. I wanted to stop files being added to osd.0
[20:11] <joshd> that's the way to do it, but you may have unintentionally triggered some bad crush behavior
[20:11] <James259> is zero a bad weight value by any chance?
[20:12] <joshd> no, zero is fine
[20:12] <James259> ahh, kk.
[20:12] <joshd> it's equivalent to 'ceph osd out 0'
[20:13] <James259> ahh, I did try that at one point.. then did ceph osd in 0 again afterwards.
[20:14] <James259> would it be better to use that method instead of messing with the crushmap?
[20:14] <joshd> since you just have three osds, and you presumably don't care which of them store things, you can put them all directly in the pool rule
[20:15] <joshd> it's harder to mess up 'ceph osd out' than a crushmap change
[20:16] <joshd> so your default pool rule should have 'item Control weight 0.000' instead of 'item unknownrack weight 0.000', and you can remove unknownrack
[20:17] <James259> yes, that's right. The plan is that each ceph cluster will be on one switch and we just want to let it distribute across about 20 osds. (it's being used to house KVM images) for some reasons completely unrelated to ceph, I need the osd numbers to start at 1 (not 0) but other than that, it can distribute evenly.
[20:18] <James259> I see that. :)
[20:18] <James259> I will go figure out how to edit that and let you know if it fixes the problem. Many thanks for taking the time to help me.
[20:19] <joshd> you're welcome :)
[20:20] <James259> should I delete the 'rack unknownrack { }' section too?
[20:21] <joshd> yeah
[20:29] <James259> what's the opposite of 'ceph osd getmap'?
[20:29] <James259> I have compiled the crush, imported it into the map. Just need to put the map back now.
[20:36] <James259> I found it. ceph osd setmap -i <file> :)
[20:38] <tremon> is more documentation available on the ceph-authtool permissions? the man page describes how to use it but not what each permission means. I'm looking to answer questions like "does a client need mon r access to locate osd's" or "what does it mean to have osd x permission"
[20:39] <James259> Josh: err.. unknown command setmap. ceph osd setcrushmap -i <file> seems to work though. (just adding this for the benefit of anyone reading logs later really.)
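Pulling the commands from this exchange together, one fetch / decompile / edit / recompile / inject cycle for the crushmap might look like the sketch below. It only builds the command lines and runs nothing. The file names are hypothetical, and the crushtool -c and -o options are assumed from the tool's usual usage rather than taken from the conversation.

```python
def crushmap_edit_commands(binary="crush.bin", text="crush.txt",
                           rebuilt="crush.new"):
    """Return the command lines for one crushmap edit cycle, in order."""
    return [
        # Fetch the compiled crushmap from the cluster.
        ["ceph", "osd", "getcrushmap", "-o", binary],
        # Decompile it to editable text.
        ["crushtool", "-d", binary, "-o", text],
        # (edit `text` by hand at this point)
        # Recompile the edited text.
        ["crushtool", "-c", text, "-o", rebuilt],
        # Inject the new crushmap into the cluster.
        ["ceph", "osd", "setcrushmap", "-i", rebuilt],
    ]
```

Keeping the original binary file around gives a way back: injecting it again with setcrushmap restores the previous placement rules.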
[21:01] * Ryan_Lane (~Adium@ has joined #ceph
[21:07] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[21:12] <elder> joshd, does the rbd server side do anything with snapshot names?
[21:19] <Cube> Ceph apply_role_pre_chef_call: entering ["d52-54-00-14-9c-7c.crow.sepia.ceph.com"]
[21:19] <Cube> Ceph ceph-mon elements: ["d52-54-00-14-9c-7c.crow.sepia.ceph.com"]
[21:19] <Cube> !
[21:19] <Cube> wrong room :)
[21:19] <James259> lol. :)
[21:19] <dpemmons> cephFS cache question: how aggressively does the filesystem client cache read data? eg. how well does it handle the case of a reader doing lots of little seeks within a large file?
[21:20] * JJ1 (~JJ@ has joined #ceph
[21:21] <James259> Josh, should I be seeing the degraded/unclean count dropping yet? I rebooted all the servers a few minutes after making the changes to ensure they took effect. Still seeing 8 degraded, 24 unclean from ceph -s
[21:22] <elder> joshd, nevermind, found what I need.
[21:22] <jsfrerot> exit
[21:22] * jsfrerot (~jsfrerot@charlie.mdc.gameloft.com) Quit (Quit: leaving)
[21:23] * lofejndif (~lsqavnbok@09GAAGLAY.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[21:23] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:29] * BManojlovic (~steki@ has joined #ceph
[21:41] <joshd> James259: yes, I had hoped changing the crushmap would fix it
[21:41] <joshd> James259: can you verify that 'ceph osd tree' looks like the new crushmap?
[21:44] <elder> joshd, format 1 snapshots had no features, so representing that with a 0 features field is reasonable, isn't it?
[21:46] <joshd> elder: yes, and for parent info the defaults of (poolid -1, image_name "", snapid CEPH_NOSNAP, overlap 0) make sense for format 1
[21:46] <elder> Wow, I better write that down...
[21:47] <elder> Actually, that makes sense, so I don't have to.
[21:49] <James259> Josh: I will check it shortly. I did try setting replication size to 1 and running scrub (which seemed to clear the duplicates) but it still shows the 8 degraded + 16 remapped. I have set size back to 2 now and it's working away making all the duplicates again.
[21:50] <James259> Josh: I think the tree looks okay.
[21:50] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[21:50] <James259> http://pastebin.com/jH4du7dq
[21:55] <joshd> yeah, looks good
[21:56] <joshd> as long as it's continuing to make progress, it's ok. it may take a while since you only have two active osds
[21:57] * lofejndif (~lsqavnbok@04ZAAEIEE.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:58] <dspano> Does anyone configure fencing for their ceph clusters, or does the design itself handle that via ceph-mon?
[21:59] <Tv_> dspano: there's no shared disks in the architecture, so no old school fencing
[22:00] <Tv_> dspano: if you're talking about rbd, we're adding locking into it as a first-class feature
[22:03] <dspano> Tv_: That's amazing. I'm messing with cephfs as well. When two clients have mounted the filesystem and are potentially modifying the same files/directories, does it follow the same logic?
[22:03] <Tv_> dspano: the distributed filesystem has very near full POSIX semantics; it's like having a local filesystem, as far as two concurrent processes are concerned
[22:04] <dspano> Tv_: This is the answer to all my prayers. Lol.
[22:04] <Tv_> dspano: well, it's not stable yet..
[22:04] <Tv_> dspano: also, http://ceph.com/docs/master/appendix/differences-from-posix/
[22:06] <dspano> Tv_: I'm mainly using RBD with Openstack compute and glance, they work great so far, even just using the default ubuntu 12.04 ceph packages.
[22:07] <dspano> Tv_: For testing, I've been using cephfs to host the Openstack database and configuration files for a very small test cloud. So far, the only issues I've had were when I mounted the fs on the same server my OSDs were on.
[22:08] <Tv_> nice
[22:08] <Tv_> and the loopback problem would affect nfs too, etc -- it's not really a ceph problem
[22:09] <joshd> dspano: see http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/7160 since those packages have fiemap enabled by default
[22:11] <dspano> joshd: Thanks.
[22:12] <dspano> Tv_: And not using GFS or OCFS is worth convincing my boss to buy more hardware to avoid that loopback problem.
[22:15] <tremon> another question: what's the use case for changing the rbd stripe size? I'm thinking of creating two rbd pools, one with mostly static source images and the other with regular kvm images
[22:16] <tremon> would the (mostly) read-only store benefit from changed defaults?
[22:31] <joshd> dpemmons: readahead settings are configurable, as are caching settings for the userspace (fuse) client. the kernel client uses the page cache
[22:34] * dmick (~dmick@ has left #ceph
[22:34] * dmick (~dmick@ has joined #ceph
[22:36] <joshd> tremon: I'm not sure how much being read-only would affect it, but generally I'd expect larger stripe sizes to be more useful for higher throughput backend devices on the osds
[22:37] <tremon> thx. If the primary effect would be throughput, that's something I can measure :)
[22:42] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[22:42] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:56] <sagewk> aie: teuthology kernel task hasn't been installing kernels for a couple weeks :(
[22:58] <elder> Is this only if it's not specified? I've been having luck.
[22:58] <sagewk> it was broken if the kernel: section had more than just branch/tag/sha1 (e.g., if you put kdb: true in there, like the nightlies now do)
[22:58] <elder> I have that too.
[22:59] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) Quit (Quit: Leaving)
[23:02] <dmick> what is a "crope"?
[23:02] <dmick> cf. include/buffer.h
[23:03] <joshd> probably referring to something like this: http://www.sgi.com/tech/stl/Rope.html
[23:04] <dmick> ah. cute. string -> rope
[23:08] <elder> "cable" is for unbelievably large strings.
[23:09] <elder> joshd, I'm doing "get_snapcontext" and got back 0 for a sequence number and then garbage for the count of snapshot ids that follow.
[23:09] <elder> Should I be interpreting the response differently?
[23:10] <elder> dmick, maybe you could help here too.
[23:13] <joshd> elder: did the operation return success? if it failed, the contents of the return buffer are undefined (in this case it won't be filled in at all)
[23:13] <elder> I'm working on re-running it with debug on.
[23:14] <elder> Wait a sec, I had a messed up printf. My bad I'm sure.
[23:15] <elder> [ 17.410000] rbd_dev_header_v2: seq = 0, snap_count = 0
[23:15] <elder> All better.
[23:16] <joshd> cool
[23:24] <elder> joshd, for snapshot_list(), do I need to supply a vector of snapshot ids as input parameter? I.e., __le32 count followed by that many __le64 snapid vals?
[23:24] <elder> Wait, now.
[23:24] <elder> now.
[23:25] <elder> I have to do that manually. Send one snapid, get that snapshot's name. and so on. Nevermind.
[23:25] <joshd> yeah, userspace has a convenience function to do that for all snapshots in one transaction
[23:26] <joshd> and that convenience function takes the snapids from the snapshot context as input
[23:26] <elder> I see it now. And this sort of thing is why we want to fire off multiple ops in a single request.
[23:27] <joshd> yup
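
The exchange above describes a reply that carries a sequence number, a count, and then that many snapshot ids, with names fetched one id at a time. A minimal sketch of decoding such a buffer follows. The 32-bit count and 64-bit ids come from the chat; the 64-bit width of the sequence number, the function names, and the `get_name` callback are assumptions made for illustration and are not the rbd API.

```python
import struct

# Assumed header layout: 64-bit little-endian sequence number, then a
# 32-bit little-endian count. Only the count width comes from the chat.
HEADER = struct.Struct("<QI")

def decode_snap_context(buf):
    """Return (seq, snap_ids) decoded from a reply buffer."""
    if len(buf) < HEADER.size:
        raise ValueError("reply too short for header")
    seq, count = HEADER.unpack_from(buf, 0)
    # A garbage count shows up here as a length mismatch.
    if len(buf) < HEADER.size + 8 * count:
        raise ValueError("count exceeds the ids actually present")
    snap_ids = list(struct.unpack_from("<%dQ" % count, buf, HEADER.size))
    return seq, snap_ids

def snapshot_names(snap_ids, get_name):
    """Look up names one id at a time; get_name is a hypothetical callback."""
    return [get_name(snap_id) for snap_id in snap_ids]

reply = struct.pack("<QIQQ", 7, 2, 4, 6)
seq, ids = decode_snap_context(reply)
print(seq, ids)
print(snapshot_names(ids, {4: "before-upgrade", 6: "nightly"}.get))
```

Checking the count against the buffer length before reading the ids is what would catch a bogus count early; an image with no snapshots simply decodes to a sequence number and an empty list.
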
[23:28] <joshd> I'll be back in an hour or so
[23:28] <elder> Remind me what a format 2 snapshot name will look like.
[23:28] <elder> (approximately)
[23:29] <dmick> isn't that user-defined?
[23:29] <elder> Oh yeah.
[23:29] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[23:30] <elder> Do you suppose 128 would be a reasonable starting limit on the length of such a name?
[23:30] <elder> Or 512?
[23:30] <dmick> some parts of the code were bandying about 96
[23:30] <elder> It used to be 32...
[23:30] <dmick> not sure why
[23:31] <dmick> let me investigate that a sec
[23:31] <elder> That may have been due to concatenating the object name.
[23:31] <elder> Maybe my question wasn't so dumb after all...
[23:31] <elder> I need to pull up Josh's design document.
[23:33] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Quit: Ex-Chat)
[23:36] <elder> I'm going with 256. Linux hard-codes that as the maximum length of a file name.
[23:36] <dmick> still looking
[23:36] <elder> Or rather, 255. In any case, I'll use NAME_MAX
[23:40] <ninkotech> 640kb should be enough for everybody
[23:44] <dmick> I'm having a hard time finding anywhere any length is enforced, actually
[23:45] <elder> Maybe it's not. But Linux appears to give it a practical limit.
[23:45] <dmick> because the name shows up as a path somewhere?....
[23:46] <elder> Yes.
[23:47] <elder> Under /sys/bus/rbd/something
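
If the snapshot name does end up as a path component under sysfs, as suggested above, the practical limit is the one for a single file name. A rough sketch of such a check follows, assuming NAME_MAX is 255 bytes (its value in Linux's limits.h); the function name is hypothetical, and nothing in the log confirms that rbd enforces this anywhere.

```python
NAME_MAX = 255  # assumed: Linux's limit on one path component, in bytes

def snap_name_fits(name):
    """Check whether name could serve as a single path component."""
    encoded = name.encode("utf-8")
    if not 0 < len(encoded) <= NAME_MAX:
        return False
    # A path component cannot contain a slash or a NUL byte.
    return b"/" not in encoded and b"\0" not in encoded

print(snap_name_fits("before-upgrade"))
print(snap_name_fits("x" * 256))
```

The length is measured in bytes after encoding, not in characters, so a name with multi-byte characters reaches the limit sooner than its character count suggests.
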

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.