#ceph IRC Log


IRC Log for 2011-10-24

Timestamps are in GMT/BST.

[1:19] * votz (~votz@pool-108-52-121-23.phlapa.fios.verizon.net) has joined #ceph
[2:18] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:24] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[2:55] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[3:10] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[3:48] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[3:56] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[4:54] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[5:15] * Nightdog (~karl@190.84-48-62.nextgentel.com) Quit (Ping timeout: 480 seconds)
[5:20] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[6:11] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[6:35] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[6:54] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[8:17] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (Quit: leaving)
[8:18] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[8:27] <chaos__> sagewk, is protocol that you are using in collectd plugin documented somewhere?
[9:29] <chaos__> when i'm sending 0, 1, 2 though i don't get anything.. no response at all (0 should give me at least version, right?), if i send something random (1000) i have "unknown request code" in node logs
[9:30] <NaioN> sage: it seems that the patch you refer to is already applied in rc10
[10:24] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[10:52] * alexxy (~alexxy@ has joined #ceph
[11:48] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:20] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Read error: No route to host)
[12:21] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[13:56] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[14:15] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[14:26] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[15:06] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[15:22] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:31] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[15:56] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[16:03] * Nadir_Seen_Fire (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[16:39] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[16:45] * alexxy (~alexxy@ has joined #ceph
[16:55] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[16:58] * alexxy (~alexxy@ has joined #ceph
[17:34] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[17:41] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:49] * fronlius1 (~Adium@testing78.jimdo-server.com) has joined #ceph
[17:49] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[18:03] <sagewk> chaos__: you're sending a packed uint32_t right?
[18:06] <sagewk> chaos__: not documented, no. it's pretty simple, tho.. should go in common/admin_socket.cc at least.
[18:08] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[18:12] * cp (~cp@ has joined #ceph
[18:21] * cp (~cp@ Quit (Quit: cp)
[18:30] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:34] * cp_ (~cp@ has joined #ceph
[18:40] * jojy (~jojyvargh@ has joined #ceph
[18:41] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:44] <sagewk> josef: around?
[18:45] * bchrisman (~Adium@ has joined #ceph
[18:58] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[18:58] * fronlius1 (~Adium@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[19:00] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[19:01] <jmlowe> I see that the 3.1 kernel has been released, does anybody know off the top of their heads how many of the closed kernel client bugs made it in?
[19:04] <yehudasa> jmlowe: 9
[19:05] * wido_ (~wido@rockbox.widodh.nl) has joined #ceph
[19:07] * wido (~wido@rockbox.widodh.nl) Quit (Ping timeout: 480 seconds)
[19:21] * adjohn (~adjohn@ has joined #ceph
[19:22] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Quit: Leaving.)
[19:34] * cp_ (~cp@ Quit (Ping timeout: 480 seconds)
[19:47] <sagewk> jmlowe: the only bug fixes i didn't send for 3.1 were in the readahead code (which isn't upstream yet). or insignificant memory leaks.
[19:48] <chaos__> well sagewk i was trying with ruby (there is no uint here;p), i didn't write anything in c/c++ in long time but i think i understand your code
[19:48] <chaos__> maybe u can provide some examples in some scripting language? or socat?
[19:48] <sagewk> any comments on https://github.com/NewDreamNetwork/ceph/commits/wip-pools ?
[19:49] <sagewk> chaos__: we should really have a command line tool to interact with the socket
[19:49] <gregaf1> yehudasa: s3tests.functional.test_headers.test_object_create_bad_contentlength_mismatch_above isn't passing for me on master with sepia, but it's not marked fails_on_dho
[19:50] <chaos__> sagewk, it would be nice;) i suppose u won't write it now;p
[19:50] <gregaf1> do you know if it was ever good for you?
[19:50] <yehudasa> gregaf1: checking
[19:51] <yehudasa> gregaf1: yeah, it passes
[19:52] <yehudasa> gregaf1: could be wrong version of the fastcgi module
[19:52] <gregaf1> hmm, okay
[19:53] <gregaf1> what do we need to do to upgrade that on sepia?
[19:53] <gregaf1> (and argh why isn't this automated)
[19:53] <gregaf1> ;)
[19:54] <sagewk> wget -q -O- https://raw.github.com/NewDreamNetwork/ceph-qa-chef/master/solo/solo-from-scratch | sh
[19:54] <sagewk> that's from https://github.com/NewDreamNetwork/ceph-qa-chef/blob/master/solo/solo-from-scratch
[19:55] <sjust> sagewk: would it make sense to allow the default_data_pool_replay_window to be used for pools other than data?
[19:56] <sagewk> the file system client is the only thing that looks at acks before commits (when in sync io mode). so that's the only place it's (normally) useful
[19:56] <sagewk> rbd and fs metadata pools don't request/ignore ack
[19:57] <sagewk> a user might set up new pools for file system data manually.. currently they'd be responsible for setting up the pool properly
[19:57] <sagewk> oh, we do need a monitor command to adjust it
[19:58] <gregaf1> sagewk: can't we look at the pool ID for setting the data pool replay interval?
[19:58] <gregaf1> the rbd, data, metadata pools are all defined numbers
[20:01] <sagewk> gregaf1: that's basically what OSDMap::build_simple() does
[20:02] <gregaf1> I'm looking at the commit which tries to guess if it's the data pool based on crush_ruleset and auid
[20:02] <sagewk> gregaf1: i just didn't want to hardcode 60 seconds in that function. and maybe in the future we'll have a higher-level command for creating new data pools for the file system (and the directory layout policy to use them), and at that point that command would use this setting too
[20:02] <sagewk> oh, that's just to handle old cluster upgrades. and pg_pool_t doesn't include the pool id.
[20:03] <wido_> hi
[20:03] <gregaf1> ah
[20:03] <wido_> never mind my kick on that old issue
[20:03] <wido_> going from 0.27 -> 0.37 was a real no go...
[20:03] <sagewk> (which should maybe also change, now that the encoding is conditional on the target's features)
[20:03] * wido_ is now known as wido
[20:03] <sagewk> wido: :(
[20:03] <gregaf1> oh crap
[20:04] <gregaf1> disk format change troubles?
[20:04] <wido> yeah, never mind. But things broke down pretty hard
[20:04] <wido> yes. The convert of the disk format seemed to go OK, but afterwards the OSDs were missing osdmaps, recent ones
[20:05] <wido> so I started scp'ing them the new maps, but as soon as I did that, the new OSDmap (since they joined the cluster again) didn't propagate
[20:05] <NaioN> sagewk: the patch you mentioned was already included in rc10
[20:05] <NaioN> but I saw another patch on the mailinglist
[20:05] <wido> so I got a no such file crash, since the OSD tried to open a non-existing osdmap
[20:05] * Nightdog (~karl@190.84-48-62.nextgentel.com) has joined #ceph
[20:05] <wido> Also saw a current/_temp map which should have been empty, but wasn't
[20:06] <wido> I saw a lot of crashes
[20:07] <sagewk> naion: which other patch?
[20:07] <sagewk> wido: if you have logs leading up to new maps (post-upgrade) being missing that would be helpful.. definitely haven't seen that
[20:07] <NaioN> the mail from David on the mailing list with the subject: Re: kernel BUG at fs/btrfs/inode.c:1163
[20:08] <NaioN> but it's more to pinpoint the bug
[20:08] <sagewk> oh i see. yeah
[20:08] <NaioN> as far as i can see
[20:08] <sagewk> any help narrowing down the btrfs issue would be great
[20:08] <NaioN> so we are going to include it and give it a try
[20:08] <NaioN> but it wouldn't patch clean against a rc10
[20:09] <wido> sagewk: Yes, but it's all my syslog
[20:09] <NaioN> so we will give it a try in the morning
[20:09] <wido> I can compress them for you, but I didn't have full logging enabled, especially not debug ms = 20
[20:09] <wido> some were running with osd, filestore at 20 so I could figure out what was happening
[20:09] <sagewk> its just the osd level debug that would be helpful. cool
[20:11] * sagelap (~sage@aon.hq.newdream.net) has joined #ceph
[20:11] <chaos__> sagewk, so what about this sockets?:p if i get it work, i'll have to write plugin for munin and probably my company will make me to write blog post about it ;p so it's for greater good
[20:12] <chaos__> now after sending uint 1, osd respond with max_int
[20:12] <chaos__> erm.. after sending 0
[20:13] <sagelap> chaos__: you're talking to common/admin_socket.cc do_accept()
[20:14] <sagelap> which means you need to write a 4 byte big-endian value as the command (1 for the perfcounter json), and you'll read a 4 byte length (big endian) and then that many bytes for the payload
[20:15] <wido> sagewk: gzipping the stuff now, want me to open an issue or just send you an e-mail with a download link?
[20:15] <sagelap> wido: issue is better, thanks!
[20:17] <chaos__> ok, it isn't big endian, switching now
[20:17] <sagelap> it's a socket, so it's network byte order.
[20:17] <chaos__> ok ;) now i'm getting somethink
[20:17] <chaos__> something*
[20:18] <chaos__> yea.. it's network order now...
[20:19] <chaos__> response after sending 1 is amount of bytes?
[20:24] <sagelap> yeah
[20:24] <sagelap> and then the actual json
[20:25] <chaos__> and i should talk to osd socket?
[20:25] <sagelap> yeah
[20:25] <chaos__> and it's working;)
[20:26] <chaos__> even with ruby
[20:26] <chaos__> yey \o/ ;)
[20:26] <chaos__> thanks sage
[20:26] <sagelap> np
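
The exchange sagelap describes at [20:14] (a 4-byte big-endian command, then a 4-byte big-endian length followed by that many payload bytes) can be sketched in a scripting language. This is only a rough illustration, not code from the Ceph tree: it assumes the admin socket is a Unix domain stream socket, and the socket path in the usage comment is a made-up placeholder. Command code 1 is the perfcounter JSON request mentioned in the log.

```python
import socket
import struct


def encode_command(code: int) -> bytes:
    """Pack a command code as a 4-byte unsigned int in network (big-endian) order."""
    return struct.pack("!I", code)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the full reply arrived")
        buf.extend(chunk)
    return bytes(buf)


def request(sock: socket.socket, code: int) -> bytes:
    """Send one command, then read the length-prefixed reply."""
    sock.sendall(encode_command(code))
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def query_admin_socket(path: str, code: int = 1) -> bytes:
    """Connect to a Unix socket at `path` (assumed transport) and run one command."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        return request(sock, code)


# Usage (the path is a hypothetical placeholder, adjust to your setup):
#   import json
#   counters = json.loads(query_admin_socket("/path/to/osd.asok", 1))
```

Sending the integer in host byte order on a little-endian machine produces a different command code on the wire, which fits the odd replies chaos__ saw before switching to network order.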
[20:26] * sandeen_ (~sandeen@sandeen.net) has joined #ceph
[20:27] <sandeen_> sage, I found the ceph vs. ext4 problem, now just trying to decide how best to fix it.
[20:27] <sagelap> sandeen_: awesome.
[20:28] <sandeen_> and in the process reminded myself just how much I looove ext4's bolted on delalloc handling ;)
[20:28] <chaos__> sagelap, maybe small wiki art about connecting to sockets using ruby?
[20:28] <chaos__> it may help someone;)
[20:28] <sagelap> sandeen_: speaking of bolts, do you know what it might take to get the lustre large xattr patches upstream?
[20:28] <sagelap> chaos__: go for it
[20:28] * adjohn (~adjohn@ Quit (Quit: adjohn)
[20:29] <sandeen_> ted doesn't seem to have met a feature he doesn't like lately ;) I can't remember when/if they were last submitted...
[20:29] <chaos__> sagelap, k when i'm done with munin and mandatory blog post i'll write something for ceph wiki
[20:30] * sandeen_ tests ugly fix, and it works, now for the more holistic approach :(
[20:30] <yehudasa> sandeen_: +1 for large xattrs
[20:31] <sandeen_> tell ted :) (and andreas, I suppose)
[20:31] <sagelap> sandeen_: does that bug lead to actual corruption, or just an incorrect block count in the inode + fsck warning/fix?
[20:31] <sandeen_> I'd be more enthused with large xattrs than some of the other features proposed lately
[20:31] <sandeen_> 1 block is a rather arbitrary restriction
[20:31] <sandeen_> sagelap, the latter I think
[20:31] <sandeen_> there is this nasty flag-passing saying whether blocks should be accounted at IO time or at delalloc/flush time
[20:32] <sandeen_> but it seems we have a race; one thread is flushing delalloc blocks, so has the "don't account now" flag set; another thread is setting an xattr, and really SHOULD be counting it (I think) - but the inode flag gets set in the middle
[20:32] <sandeen_> anyway, I think it's just a question of when/if inode_add_bytes gets called, so it's just i_blocks that is wrong.
[20:33] <sandeen_> so nothing should be lost, but stat will be wrong
[20:33] <sandeen_> and fsck should properly fix it.
[20:34] <sagelap> sandeen_: cool
[20:35] <sandeen_> so a bit of a hack to not re-test that state in the xattr setting thread makes the bug go away but I need to think a little more about what is supposed to be going on here to fix it right
[20:39] * adjohn (~adjohn@ has joined #ceph
[20:48] <chaos__> sagelap, is json schema documented somewhere? (most of perf counters i can guess by name, but some are mystery for now)
[20:50] <chaos__> oh.. found it
[20:50] <chaos__> osd/OSD.cc
[21:08] * cp (~cp@ has joined #ceph
[21:10] <cp> Question: why are there bucket types in the crunch definition file. Apart from devices being special, it seems redundant
[21:10] <cp> s/crunch/crush
[21:13] <josef> sagewk: yeah whats up?
[21:14] <joshd> cp: you can have a deeper hierarchy than just devices - you could have types for room, row, rack, and host, for example, to spread replicas across failure domains
[21:14] * Nightdog (~karl@190.84-48-62.nextgentel.com) Quit (Ping timeout: 480 seconds)
[21:15] <Tv> gregaf: The content-length related s3tests need to be taken with a grain of salt; I tried to fix some issues, but I think they're still submitting invalid HTTP, and having our load balancer behave the way it did was completely justified.
[21:18] <cp> joshd: why give them different _types_ though, rather than just: bucket BUCKETNAME {...}?
[21:19] <cp> Is there any enforcement, eg, that type 3s must contain only type 2s?
[21:19] <joshd> you enforce that with the rules you specify
[21:20] <joshd> and the composition of the buckets - see http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
[21:21] <Tv> cp: when you say bucket types, do you mean the uniform / list / tree / straw thing?
[21:21] <cp> TV: I mean # types
[21:21] <cp> type 0 device
[21:21] <cp> type 1 host
[21:21] <cp> type 2 rack
[21:21] <cp> type 3 root
[21:22] <Tv> cp: oh yeah, my understanding is that those are just labels, so you can conveniently say "pick a device, then pick something in a different rack"
[21:23] <cp> Tv: You mean lines like "step chooseleaf firstn 0 type rack"?
[21:23] <Tv> cp: yes
[21:24] <cp> Tv: What exactly does that line mean btw?
[21:24] <Tv> cp: personally, i don't like the DSL for specifying the crush rules, but it's a low priority thing to fix right now..
[21:24] <Tv> step take root
[21:24] <Tv> step chooseleaf firstn 0 type rack
[21:24] <Tv> step emit
[21:25] <cp> Tv: Yes. What does chooseleaf mean vs choose and where does the type come in?
[21:25] <Tv> i forget the details but goes like this.. 1. start with the root 2. pick N racks 3. pick a device in each rack
[21:25] <cp> Tv: and how would that differ from
[21:25] <cp> step take root
[21:25] <Tv> but i really can't explain that right now without re-reading a lot of stuff
[21:25] <cp> step chooseleaf firstn 0 type host
[21:25] <cp> step emit
[21:26] <cp> ok
[21:26] <Tv> you need to read the thesis / sc06 paper to understand crush
[21:26] <Tv> i wish that weren't true ;)
[21:26] <cp> Ah - didn't think of that
[21:26] <cp> Thanks
[21:26] <gregaf1> and then you need to remember it, and write a few rules yourself…
[21:26] <gregaf1> I'm pretty fuzzy on it myself
[21:27] <gregaf1> but I believe if you're doing a chooseleaf straight to the bottom level you're skipping over intermediate domains, so you might get all your replicas in the same rack
[21:28] <Tv> as i recall, N replicas means it's effectively running N parallel copies of the virtual machine, and enforcing them to make different decisions; saying "rack" there means each copy picks a different rack
[21:28] <Tv> but i'm very fuzzy on the details without a half-an-hour refresher
[21:29] <Tv> and the difference between "choose" and "chooseleaf" is that "choose" picks something on the next level (for further processing), "chooseleaf" picks a leaf node no matter how deep it is
[21:29] <Tv> so you don't need to hardcode a bunch of "choose" "choose" "choose" repetition
[21:33] <cp> so "chooseleaf firstn 0 type rack " will choose different racks and then take any device below them?
[21:34] <Tv> that's my understanding
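
The difference discussed above between "chooseleaf ... type rack" and "chooseleaf ... type host" can be illustrated with a toy model. This is not the CRUSH algorithm (no weights, bucket algorithms, or retries); it is a minimal sketch that ranks candidates by a hash, and the hierarchy and all names in it are made up. `n` stands in for the number of replicas.

```python
import hashlib

# Hypothetical hierarchy (root -> rack -> host -> device); all names are made up.
TREE = {
    "rack1": {"host1": ["osd0", "osd1"], "host2": ["osd2", "osd3"]},
    "rack2": {"host3": ["osd4", "osd5"], "host4": ["osd6", "osd7"]},
    "rack3": {"host5": ["osd8", "osd9"]},
}


def pick(key, items, n):
    """Deterministically pick n distinct items for a given object key."""
    def rank(item):
        return hashlib.sha256(f"{key}/{item}".encode()).hexdigest()
    return sorted(items, key=rank)[:n]


def locate(device):
    """Return the (rack, host) pair that holds a device."""
    for rack, hosts in TREE.items():
        for host, devices in hosts.items():
            if device in devices:
                return rack, host
    raise KeyError(device)


def chooseleaf_rack(key, n):
    """Toy 'chooseleaf ... type rack': n distinct racks, one device under each."""
    leaves = []
    for rack in pick(key, TREE, n):
        devices = [d for devs in TREE[rack].values() for d in devs]
        leaves.append(pick(key, devices, 1)[0])
    return leaves


def chooseleaf_host(key, n):
    """Toy 'chooseleaf ... type host': n distinct hosts, racks are not considered."""
    hosts = {h: devs for rack in TREE.values() for h, devs in rack.items()}
    return [pick(key, hosts[h], 1)[0] for h in pick(key, hosts, n)]


if __name__ == "__main__":
    for name in ("obj-a", "obj-b"):
        print(name, [locate(d) for d in chooseleaf_rack(name, 2)])
        print(name, [locate(d) for d in chooseleaf_host(name, 2)])
```

In this toy, the rack variant always spreads replicas over different racks, while the host variant only guarantees different hosts, so two replicas may share a rack, which is the case gregaf1 warns about.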
[21:34] <Tv> there's a simulation mode that'll run e.g. 1000 object placements and see what comes out; that's a good tool to debug your mental model of crush rules
[21:35] <cp> Ah. Where can i find the simulation tool?
[21:35] <Tv> cp: crushtool
[21:35] <cp> Thanks
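
The idea behind the simulation mode Tv mentions (run many object placements and look at what comes out) can be sketched without crushtool. The code below is only an illustration of the approach, not crushtool's behaviour or interface: `place` is a stand-in hash-based placement function, not CRUSH, and the device names are hypothetical.

```python
import hashlib
from collections import Counter

# Hypothetical device list, for illustration only.
DEVICES = [f"osd{i}" for i in range(6)]


def place(object_id, replicas=2):
    """Stand-in placement (not CRUSH): rank devices by hash, keep the first few."""
    def rank(device):
        return hashlib.sha256(f"{object_id}/{device}".encode()).hexdigest()
    return sorted(DEVICES, key=rank)[:replicas]


def simulate(placement, count=1000):
    """Place `count` synthetic objects and tally how often each device is chosen."""
    tally = Counter()
    for i in range(count):
        tally.update(placement(f"object-{i}"))
    return tally


if __name__ == "__main__":
    for device, hits in sorted(simulate(place).items()):
        print(device, hits)
```

Swapping a different placement function into `simulate` and comparing the tallies is a cheap way to check whether a rule does what you expect before trusting it with data.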
[21:37] * cp (~cp@ Quit (Quit: cp)
[21:44] * cp (~cp@ has joined #ceph
[21:53] <josef> sagewk: saw the email and replied
[21:53] <josef> sorry we're both at kernel summit atm
[22:14] * in__ (~n0de@ has joined #ceph
[22:29] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[22:29] * jmlowe (~Adium@140-182-134-224.dhcp-bl.indiana.edu) has joined #ceph
[22:37] * sagelap (~sage@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[22:38] * jmlowe (~Adium@140-182-134-224.dhcp-bl.indiana.edu) Quit (Ping timeout: 480 seconds)
[22:39] * cp (~cp@ Quit (Quit: cp)
[22:40] * cp (~cp@ has joined #ceph
[23:02] * cp (~cp@ Quit (Ping timeout: 480 seconds)
[23:24] * Iribaar (~Iribaar@ Quit (Ping timeout: 480 seconds)
[23:27] <sandeen_> crud I think maybe this ceph corruption is actually my fault
[23:27] <sandeen_> hi josef :)
[23:31] * Iribaar (~Iribaar@ has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.