#ceph IRC Log

IRC Log for 2011-08-31

Timestamps are in GMT/BST.

[0:00] <gregaf> I'm a little annoyed at myself about it, actually :x
[0:00] <bchrisman> where's that implemented?
[0:00] <gregaf> it got moved at some recent point to mds/flock.h
[0:00] <bchrisman> heh.. in a header file? okie..
[0:01] <gregaf> don't remember where it was before, server.cc maybe?
[0:01] <gregaf> *glares at sagewk*
[0:01] <gregaf> ;)
[0:01] <gregaf> jojy: oh, that's why it's in a different debug option, actually
[0:02] <gregaf> bchrisman: yeah, mds/flock.h
[0:03] <gregaf> been there since January, looks like I wrote it in mds/mdstypes.h before that
[0:03] <gregaf> *glare at myself*
[0:03] <sagewk> gregaf: it used to be mdstypes.h
[0:03] <bchrisman> the bug actually in get_last_before?
[0:03] <bchrisman> oh
[0:03] <sagewk> gregaf: you should move it to flock.cc or similar
[0:03] <bchrisman> lotta ugly in there.. yeah.
[0:03] <gregaf> bchrisman: yeah
[0:04] <gregaf> I couldn't come up with a way to implement all their edge cases cleanly :/
[0:04] <bchrisman> yeah… it's a massive intersection operation.
[0:04] <gregaf> suggestions are welcome, although I think it's probably just because the original locking implementation had some strange internal data types
[0:08] <gregaf> 1 "locked_by": "dev.newdream.net/issues/11124",
[0:08] <gregaf> 1 "locked_by": "kylem@flak",
[0:08] <gregaf> 13 "locked_by": "sage@metropolis",
[0:08] <gregaf> 22 "locked_by": "scheduled_sage@metropolis",
[0:08] <gregaf> 28 "locked_by": "scheduled_teuthology@teuthology",
[0:08] <gregaf> 4 "locked_by": "stephon@flak",
[0:08] <gregaf> 27 "locked_by": null,
[0:08] <gregaf> scheduled_teuthology?
[0:09] <gregaf> joshd: sagewk: Tv: ?
[0:09] <Tv> scheduled_ is an ugly hack i blame on Josh ;)
[0:10] <gregaf> yeah, but I don't know who to blame scheduled_teuthology on :(
[0:10] <joshd> yup, and scheduled_teuthology is me too
[0:10] <Tv> that's the centralized runner i believe
[0:10] <gregaf> and I can't get 3 machines to run my stuff on
[0:10] <joshd> I should have some more unlocked soon
[0:10] <jojy> gregaf: maybe interval trees ?
[0:11] <joshd> the reason so many are locked is that a bunch of tests failed (I've been unlocking when it wasn't due to a bad machine)
[0:13] <gregaf> well that's nice but it's a shared cluster and we shouldn't all need to hold unused machines locked just so we can grab THREE when we need them
[0:13] <gregaf> jojy: hmm, don't think I've seen those before, nifty
[0:13] <jojy> i can try getting an impl
[0:13] <joshd> gregaf: if you're testing a lot, keep 3 locked
[0:14] <gregaf> jojy: now my question becomes if there are any good C++ implementations ;)
[0:14] <bchrisman> yeah.. that's the solution I was looking for...
[0:14] <gregaf> I see some but nothing in boost :(
[0:16] <sagewk> do you actually need overlapping intervals here?
[0:17] <gregaf> sagewk: read locks
[0:17] <gregaf> or I'd have done it with interval_set or a thin shim around it to handle 0 lengths
[0:19] <sagewk> got it
[0:37] * verwilst (~verwilst@dD576F5B5.access.telenet.be) Quit (Quit: Ex-Chat)
[0:57] <gregaf> bchrisman: jojy: actually, I think it's not too bad
[0:57] <gregaf> try http://pastebin.com/Q0x66xdL and see if that works?
[0:57] <gregaf> I'll see if I can write some tests here that hit this bug and commit it then :)
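
For context on the interval discussion above, here is a minimal sketch of the kind of overlap query involved in POSIX-style byte-range locks, where a length of 0 means "to end of file". The HeldLock type and its field names are hypothetical; this is not the contents of the pastebin link or of mds/flock.h, just an illustration of the operation being discussed.

    // Held byte-range locks keyed by start offset; a multimap allows several
    // shared (read) locks to begin at the same offset.
    #include <cstdint>
    #include <limits>
    #include <map>
    #include <vector>

    struct HeldLock {
      uint64_t start;
      uint64_t length;      // 0 means "to end of file"
      uint64_t client;      // hypothetical owner id
      bool exclusive;
      uint64_t end() const {
        return length ? start + length : std::numeric_limits<uint64_t>::max();
      }
    };

    typedef std::multimap<uint64_t, HeldLock> LockMap;

    // Return every held lock that overlaps [start, start+length), with
    // length 0 again meaning "to end of file".
    std::vector<HeldLock> overlapping(const LockMap &locks,
                                      uint64_t start, uint64_t length) {
      const uint64_t end =
          length ? start + length : std::numeric_limits<uint64_t>::max();
      std::vector<HeldLock> out;
      for (LockMap::const_iterator it = locks.begin();
           it != locks.end() && it->first < end; ++it) {
        if (it->second.end() > start)
          out.push_back(it->second);
      }
      return out;
    }

A real implementation also has to split and merge ranges as locks are downgraded or released, which is where the edge cases gregaf mentions come from; an interval tree mainly helps once the number of held locks gets large.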
[1:22] <bchrisman> hmm.. if I have faster disks on one osd than the other, and I adjust the weights in the crushmap to reflect that, will reads prefer that 'weightier' disk?
[1:23] <bchrisman> (if those two osds are in a pool, for example)
[1:24] <gregaf> bchrisman: weights in the crushmap influence data placement, so the faster disks will get more data, and thus more reads, but it won't try to preferentially read from them instead of the slower disks when the slow ones are the primaries
[1:24] <bchrisman> ahh okay…
[1:25] <bchrisman> also wanted to check.. is there a way to get primary info out of the objects underlying an rbd device?
[1:25] <Tv> bchrisman: they're just objects named a certain way in the rbd pool, as far as i know
[1:25] <gregaf> hmm, not sure what the rbd interface looks like
[1:25] <Tv> bchrisman: i don't think the naming is *exposed* anywhere, on purpose
[1:26] <Tv> bchrisman: but it's something like the name of the rbd image and then which chunk it is
[1:26] <gregaf> yeah
[1:26] <bchrisman> okay.. I was hoping for a way to setup an rbd pool such that it would preferentially read from a faster osd...
[1:27] <Tv> bchrisman: where "faster" differs based on the object?
[1:27] <Tv> i mean, based on the rbd image
[1:27] <bchrisman> well.. if I can create a pool that will prefer to read from a particular osd, then I can use that pool for an rbd device.. that would do what I want.
[1:27] <Tv> bchrisman: if nothing else, you can always put your images in more than one pool, and do that logic on the pool level
[1:28] <gregaf> bchrisman: can't you just build the crush map so that primaries are always chosen from the set of faster disks and replicas from the set of slower ones?
[1:28] <gregaf> or are you trying to actually set up locality from rbd?
[1:28] <bchrisman> gregaf: yeah.. that's pretty much what I was trying to figure out… it's the same thing as far as I can tell.
[1:29] <Tv> didn't we do the whole "read local if possible" logic for hadoop integration?
[1:29] <Tv> that sounds like it could help
[1:29] <bchrisman> that would've been libceph?
[1:29] <gregaf> Tv: we just expose the layout info there, and it's via libceph
[1:29] <Tv> ah
[1:29] <gregaf> I don't think it's exposed via librbd at all, although it probably could be
[1:30] <gregaf> but Hadoop grants us some assumptions that rbd users don't
[1:30] <gregaf> bchrisman: are you actually trying to set up tiered storage or is it like you've got 5 nodes, and you want all reads to go on the local node?
[1:31] <gregaf> I'm pretty sure CRUSH lets you specify a group to pick the primary from, although I'm less familiar with it than I should be so I think we need sagewk to answer that
[1:31] <bchrisman> gregaf: ahh cool.. I think I can look through crush stuff a little more and probably figure that out.
[1:32] <bchrisman> gregaf: and your assumption was correct… looks like the functionality should be there.
[1:32] <sagewk> the read from local replica decision actually happens in the rados client layer
[1:32] <sagewk> but the switch to turn it on isn't currently exposed, iirc.
[1:33] <bchrisman> but if I setup a pool correctly.. for one rbd device say??? that should be doable?
[1:33] <bchrisman> to specify 'this osd/device primary', 'that osd/device replica'?
[1:34] <sagewk> well, if you want, you can always create pools and then explicitly map each pg to the devices you want. it just doesn't scale if you don't do placement using a function
[1:35] <gregaf> sagewk: but say you have a set of 5 devices which we'll call SSD, and another set of 5 devices which we'll call HDD… isn't it possible with CRUSH to say "choose one primary from SSD, choose one replica from HDD"?
[1:36] <gregaf> I mean, it has to be, or you wouldn't be able to do rack placement
[1:36] <sagewk> gregaf: yeah that too. and that scales indefinitely.
[1:37] <gregaf> but I was thinking for some reason that there was a way to do that which didn't require you to divorce it completely from failure domains
[1:58] <bchrisman> In the example crushmap (wiki), the data rule specifies 'step chooseleaf firstn 0 type rack'.. I'm guessing there's a way to say, for example, "for the primary, choose from rack1, for the replica, choose from rack2" ?
[1:59] <bchrisman> right now, I think the example would select one from rack 1 and one from rack 2, but which is primary and which is replica is not explicit
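
As a sketch of what bchrisman is after (assuming the crush map defines separate root buckets, here called ssd and hdd; the bucket names and ruleset number are illustrative, not from an actual map), a rule can emit its first choice from one tree and the remaining replicas from the other, since the first device emitted becomes the primary:

    rule ssd-primary {
        ruleset 4
        type replicated
        min_size 1
        max_size 10
        # primary from the fast tree
        step take ssd
        step chooseleaf firstn 1 type host
        step emit
        # remaining replicas from the slow tree
        step take hdd
        step chooseleaf firstn -1 type host
        step emit
    }

Whether a rule like this keeps sensible failure domains is exactly the caveat gregaf raises above, so treat it as a starting point rather than a recipe.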
[2:03] <gregaf> yehuda_hm: around?
[2:04] <gregaf> I'm wondering about your commit "osd: fix osd reply message"… I don't think __s32 should have broken anything, and the result is getting stuffed into a 32-bit value so if switching from __s32 to int is fixing a problem I think that's probably a problem
[2:10] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:11] <cmccabe> gregaf: switching from __s32 to int didn't fix anything, but replacing "result = result" with "result = r" did
[2:16] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:33] * huangjun (~root@61.184.205.45) has joined #ceph
[2:39] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[2:44] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:46] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[2:47] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[2:59] <jojy> gregaf: the flock fix looks to have fixed the issue
[2:59] <jojy> i will run a few tests tomorrow also to verify
[3:00] <jojy> gregaf: good one! ty
[3:00] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[3:00] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:30] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[3:53] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[3:56] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[3:58] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[3:58] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit ()
[4:31] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Read error: Connection reset by peer)
[4:32] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[5:37] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[5:39] * lxo (~aoliva@9KCAAARPL.tor-irc.dnsbl.oftc.net) has joined #ceph
[8:12] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[9:05] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[9:40] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[11:39] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[13:58] * gregorg (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[14:10] * mtk (nrt2niVPD1@panix2.panix.com) Quit (Remote host closed the connection)
[14:10] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[14:36] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[14:48] * huangjun (~root@61.184.205.45) Quit (Quit: Lost terminal)
[15:07] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[15:12] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[16:17] <damoxc> I appear to have done something to expose a bug in the OSDs that's causing them to be unable to start
[16:18] <damoxc> Issue #1471 on redmine
[16:26] <damoxc> I was wondering if anyone would know of a way to get them starting again?
[17:34] * lxo (~aoliva@9KCAAARPL.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[17:40] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:09] <sagewk> damoxc: is this btrfs or extN?
[18:09] <damoxc> sagewk: btrfs
[18:09] <damoxc> sagewk: linux 3.0
[18:19] <Tv> the wiki has database issues again
[18:21] * sagewk (~sage@aon.hq.newdream.net) Quit (Remote host closed the connection)
[18:23] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[18:24] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[18:28] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:37] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[18:43] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[18:44] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[18:45] * cmccabe (~cmccabe@69.170.166.146) has joined #ceph
[18:52] <gregaf> sagewk: looks like the tests were all on ae552539e338b6399a909a0f8e53dc3ceed0a3cf
[18:52] <gregaf> "client: additional sanity checks on link/unlink"
[18:55] <gregaf> sagewk: which I think included the bad OSD message, but I'm not sure how many of those failed tests it actually mattered for
[18:57] <joshd> there might still be a bad message - http://tracker.newdream.net/issues/1462 was reproducible on master at the end of the day yesterday
[18:58] <gregaf> joshd: did that include Yehuda's fix commit?
[18:59] <joshd> yeah
[19:00] <joshd> wait, maybe not
[19:01] <joshd> ok, that did fix it
[19:02] <joshd> I'm guessing that fixes the rgw_admin crash the suite had yesterday as well
[19:03] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[19:04] <sagewk> joshd: 6180c2cccf3e910cb78a139909b19b2333b79144 is probably responsible for a bunch of those failures.. basically all osd failure codes were getting masked
[19:05] <joshd> yeah, that would fix the rados_api_tests and probably rgw_admin crashing
[19:13] <joshd> sagewk: your suite run is locking a bunch of machines - and all the ones run didn't include that fix
[19:14] <sagewk> we can clobber them all
[19:14] <joshd> ok
[19:14] <damoxc> sagewk: WRT #1471 is cosd -f meant to output logs to stdout?
[19:14] <sagewk> i think it just avoids the fork(), but still logs to the log file
[19:14] <sagewk> but the system's assert goes to stderr, so you should still see that
[19:15] <sagewk> iirc -d will also redirect logs to stderr
[19:15] <damoxc> http://dpaste.com/606144/
[19:15] <damoxc> that's what I get
[19:15] <gregaf> hmm, is it -F that logs to stdout?
[19:15] <damoxc> logs going to the file
[19:15] <cmccabe> damoxc: -d will output all logs to stderr and also not fork
[19:16] <cmccabe> gregaf: you can log to stderr with --log-to-stderr=2 or with -d
[19:16] <sagewk> looks like -f isn't working?
[19:16] <sagewk> back in ~20 min
[19:16] <damoxc> sagewk: I think it's running until it crashes
[19:17] <damoxc> sagewk: for when you get back, http://dpaste.com/606146/
[19:19] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[19:39] <sagewk> damoxc: ah right. hmm.. can you cherry-pick/apply 5ae3e13617c9a63d12d12c8506daefd2be14677d , rebuild cosd, and then re-run? that'll make the assert message in the log include useful information
[19:43] <cmccabe> sagewk: -f and -d seem to work fine for me, did you see something strange?
[19:43] <sagewk> cmccabe: no false alarm
[19:43] <cmccabe> sagewk: ok
[19:46] <wido> ceph mon injectargs \* '--mon_osd_down_out_interval 1800'
[19:46] <wido> should that still work? I couldn't find an injectargs anymore for the mon subsystem
[19:46] <wido> those darn WD Green disks, they keep dying on me...
[19:47] <ajm> wido: we had that happen, i hate them with a passion.
[19:47] <cmccabe> wido: I think you don't need to write 'mon' in that command
[19:47] <cmccabe> wido: since all cephtool commands go to the monitor anyway that is the default target
[19:48] <gregaf> cmccabe: wido: pretty sure you do need to specify the subsystem now
[19:48] <gregaf> I think the original command is correct
[19:48] <wido> cmccabe: gregaf: this works now: ceph injectargs \* '--mon_osd_down_out_interval=1800'
[19:48] <cmccabe> cmccabe@metropolis:~/ceph/src$ ./ceph injectargs '--mon_osd_down_out_interval 1800'
[19:48] <cmccabe> 2011-08-31 10:34:15.205294 mon <- [injectargs,--mon_osd_down_out_interval 1800]
[19:48] <cmccabe> 2011-08-31 10:34:15.206175 mon2 -> 'parsed options' (0)
[19:48] <cmccabe> cmccabe@metropolis:~/ceph/src$ ./ceph mon injectargs '--mon_osd_down_out_interval 1800'
[19:48] <cmccabe> 2011-08-31 10:34:30.117493 mon <- [mon,injectargs,--mon_osd_down_out_interval 1800]
[19:48] <cmccabe> 2011-08-31 10:34:30.118791 mon1 -> 'unknown command injectargs' (-22)
[19:48] <gregaf> hmm, okay
[19:48] <wido> with "mon" you get a unknown command
[19:49] <wido> btw, if I do: ceph injectargs \* '--mon_osd_down_out_interval 1800' I get: 'must supply options to be parsed in a single string'
[19:49] <cmccabe> gregaf: anyway, it was never 'ceph foo injectargs', but always 'ceph osd tell 0 injectargs' or 'ceph mds tell 0 injectargs'
[19:50] <cmccabe> wido: for some reason your text shows up as ceph injectargs \* '--mon_osd_down_out_interval 1800'
[19:50] <cmccabe> wido: you're not really putting an asterisk in there are you?
[19:50] <cmccabe> wido: must be my IRC client
[19:51] <wido> uh yeah, but that was old I think.. I was googling a bit, found an old post of Sage ;)
[19:52] <cmccabe> wido: I'm not sure why you would put an asterisk in there...
[19:52] <wido> cmccabe: Old habit I guess, for telling it's for every monitor
[19:53] <wido> without the asterisk it fails btw
[19:54] <cmccabe> wido: I can't seem to find any place that parses an asterisk in mon/Monitor.cc
[19:55] <wido> cmccabe: Ok, I don't know where it came up, I just used: ceph injectargs '--mon_osd_down_out_interval 1800'
[19:55] <wido> that works fine
[19:55] <wido> swapping disk now
[19:55] <cmccabe> cmccabe@metropolis:~/ceph/src$ ./ceph injectargs \* '--debug-ms 0'
[19:55] <cmccabe> 2011-08-31 10:43:08.002266 mon <- [injectargs,*,--debug-ms 0]
[19:55] <cmccabe> 2011-08-31 10:43:08.002881 mon1 -> 'must supply options to be parsed in a single string' (-22)
[19:55] <cmccabe> cmccabe@metropolis:~/ceph/src$ ./ceph injectargs '--debug-ms 0'
[19:55] <cmccabe> 2011-08-31 10:43:10.201687 mon <- [injectargs,--debug-ms 0]
[19:55] <cmccabe> 2011-08-31 10:43:10.202679 mon0 -> 'parsed options' (0)
[19:56] <cmccabe> so I'm guessing if the asterisk ever had any function, it doesn't now?
[19:56] <wido> cmccabe: http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/342
[19:56] <sagewk> the top level injectargs doesn't exist anymore
[19:56] <sagewk> do 'ceph <daemontype> tell <id number or \*> injectargs 'foo bar'
[19:58] <cmccabe> sagewk: just out of curiosity. what was the function of the asterisk in days of old
[19:58] <gregaf> cmccabe: it means "all daemons"
[19:58] <cmccabe> gregaf: ok
[19:58] <gregaf> check out MonmapMonitor::preprocess_command
[19:58] <sagewk> still does mean that :)
[19:59] <sagewk> (or don't, it's not pretty)
[19:59] <cmccabe> so I guess you could still use that syntax with 'tell'
[20:00] <gregaf> cmccabe: that's how you have to
[20:00] <gregaf> ceph mon tell \* injectargs 'bla'
[20:01] <Tv> does "<id number or \*>" mean "<name or \*>"? like, osd.42, not just 42...
[20:01] <gregaf> Tv: nope, it's '42'
[20:01] <Tv> or is it "ceph mon tell", "Ceph osd tell", etc
[20:01] <wido> I just cleanly unmounted the btrfs filesystems of my remaining OSD's (3) on that machine, halted it, booted again: btrfs: open_ctree failed
[20:01] <Tv> ok so each mon/osd etc has its own tell
[20:01] <gregaf> yeah
[20:01] <wido> It was a clean unmount. Are you guys seeing that as well?
[20:04] <Tv> umm, inside cmon, where does cout << "foo" go?
[20:05] * lxo (~aoliva@9YYAAA8W6.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:07] <cmccabe> tv: nowhere
[20:07] <cmccabe> tv: unless you happen to be running cmon in the foreground, or you haven't yet called daemonize()
[20:07] <cmccabe> tv: (more technically, it's rerouted to /dev/null)
[20:13] <Tv> cmccabe: so how would one go about making CrushWrapper.cc debug output nicer...
[20:14] <cmccabe> tv: you could add a CephContext to the class and use ldout / lderr
[20:14] <Tv> just using dout there makes libceph/librados linking fail, undefined reference to `g_ceph_context'
[20:15] <cmccabe> tv: well, those globals aren't defined for library code. You need to use ldout and pass in the library user's own ceph context
[20:15] <Tv> cmccabe: and crush has no such thing
[20:16] <cmccabe> tv: for an example, you might want to check out common/Finisher.cc
[20:17] <cmccabe> tv: basically, you add CephContext *cct as a data member of the class, then initialize it in the class constructor, then use ldout(cct, 1) instead of dout(1), etc
[20:17] <cmccabe> tv: you also need to add a configuration option to set the crush debug level in common/config.cc and common/config.h
[20:18] <cmccabe> tv: probably something like OPTION(debug_crush, OPT_INT, 0),
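
Putting those steps together, a rough sketch of the pattern cmccabe describes (header paths approximate, method name illustrative, and not the actual CrushWrapper change):

    // Hold a CephContext in the class so library consumers supply their own
    // context, instead of relying on the global g_ceph_context that isn't
    // defined for libceph/librados.
    #include "common/ceph_context.h"
    #include "common/dout.h"

    class CrushWrapper {
    public:
      explicit CrushWrapper(CephContext *cct_) : cct(cct_) {}

      void dump_rules() {
        // was: dout(1) << ... << dendl;  (breaks library linking)
        ldout(cct, 1) << "crush: dumping rules" << dendl;
      }

    private:
      CephContext *cct;   // passed in by the caller
    };

    // plus a debug level knob in common/config.h / config.cc, roughly:
    //   OPTION(debug_crush, OPT_INT, 0)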
[20:19] <slang> hello
[20:20] <cmccabe> slang: hi
[20:20] <slang> latest release seems fairly stable
[20:20] <slang> I'm seeing periodic client hangs though
[20:21] <cmccabe> slang: I believe there was an OSD message that had a bug in how it was handled...
[20:21] <slang> they aren't specific to an operation, I've seen it hang on the client in get_caps, write, read, ...
[20:22] <gregaf> cmccabe: master branch, not release
[20:22] <cmccabe> gregaf: ah
[20:22] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Read error: No route to host)
[20:23] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[20:24] <slang> anyone have a commit id for that?
[20:24] <gregaf> slang: I believe we've fixed a few such bugs in our current sprint, but of course if you've collected any data on them you should submit it in the tracker :)
[20:26] <gregaf> the fix is in 6180c2cccf3e910cb78a139909b19b2333b79144; I think it got broken in fbeafdf9c385af6eb4858b7227862039f2ea5a4d which was merged to master yesterday or Monday
[20:28] <sagewk> tv: it's just CrushWrapper (or the methods that spew output) that need a cct pointer.
[20:29] <sagewk> tv: ha, what colin said
[20:29] <damoxc> sagewk: just re-building, I'll update the logs on the ticket when it's done
[20:29] <slang> did the policy for the stable branch change?
[20:30] <slang> I see commits there now that aren't coming from next
[20:30] <slang> as part of a release
[20:30] <slang> has that always been the case?
[20:30] * lxo (~aoliva@9YYAAA8W6.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[20:33] * The_Bishop (~bishop@port-92-206-251-64.dynamic.qsc.de) has joined #ceph
[20:33] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[20:34] <cmccabe> slang: sagewk might be able to clarify more, but I think we've always cherry-picked fixes into stable from other branches if it seemed like they would increase the, um, stability
[20:35] <cmccabe> slang: I mean some things are obvious small bug fixes
[20:36] <slang> cmccabe: yep
[20:37] <slang> cmccabe: just trying to figure out what to track :-)
[20:39] <cmccabe> slang: if I understand correctly, stable should be fine since the bug I was speaking about was never introduced to there
[20:39] <cmccabe> slang: so the fix didn't need to be introduced either :)
[20:40] <slang> cmccabe: sounds like it won't fix the hangs I'm seeing then either
[20:40] <cmccabe> slang: I didn't realize you were using the stable branch or else I wouldn't have mentioned it... I thought you might be on master
[20:40] <cmccabe> slang: yeah I'm not sure what the hangs you're describing are all about
[20:40] <cmccabe> slang: do you have a bug id for this?
[20:40] <slang> no I was trying to see if I could make some process on irc
[20:41] <slang> (and verify that it hadn't already been filed/fixed)
[20:41] <slang> s/process/progress/
[20:41] <cmccabe> slang: so when does it happen
[20:42] * lxo (~aoliva@83TAAC4D7.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:42] <slang> seems to happen at random
[20:42] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[20:42] <cmccabe> slang: it shows up as client hangs?
[20:42] <cmccabe> slang: are you using the kernel client or the fuse client?
[20:44] <slang> I noticed that an osd will just ignore a request (in OSD::handle_op) if the map epoch from the client is old
[20:45] <slang> cmccabe: yeah hangs of processes doing I/O
[20:45] <cmccabe> slang: using the kernel client?
[20:45] <slang> cmccabe: (and the entire mount point)
[20:45] <slang> cmccabe: cfuse
[20:45] <cmccabe> slang: ok
[20:45] <cmccabe> slang: question one: is cfuse still alive?
[20:45] <slang> cmccabe: yes
[20:46] <slang> #0 0x00007f1fea859bac in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
[20:46] <slang> #1 0x00000000006a4c91 in Cond::Wait (this=0x7fff55c46880, mutex=...) at ../../src/common/Cond.h:46
[20:46] <slang> #2 0x000000000067a183 in Client::wait_on_list (this=0x19e0380, ls=...) at ../../src/client/Client.cc:2305
[20:46] <slang> #3 0x00000000006773a6 in Client::get_caps (this=0x19e0380, in=0x4983000, need=2048, want=1024, got=0x7fff55c46a44, endoff=-1)
[20:46] <slang> at ../../src/client/Client.cc:1973
[20:46] <slang> #4 0x00000000006909d8 in Client::_read (this=0x19e0380, f=0x58fadc0, offset=0, size=4096, bl=0x7fff55c46b20)
[20:46] <slang> at ../../src/client/Client.cc:5044
[20:46] <slang> #5 0x00000000006a0045 in Client::ll_read (this=0x19e0380, fh=0x58fadc0, off=0, len=4096, bl=0x7fff55c46b20)
[20:46] <slang> at ../../src/client/Client.cc:6723
[20:46] <slang> that's the stack trace for one such hang
[20:46] <slang> here's another:
[20:46] <slang> #0 0x00007fb1cf155bac in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
[20:46] <slang> #1 0x00000000006a4c91 in Cond::Wait (this=0x7fff617d78e0, mutex=...) at ../../src/common/Cond.h:46
[20:46] <slang> #2 0x000000000066f58a in Client::make_request (this=0x1dff380, request=0x1ead600, uid=1005, gid=1006, ptarget=0x0,
[20:46] <slang> use_mds=-1, pdirbl=0x0) at ../../src/client/Client.cc:1081
[20:46] <slang> #3 0x0000000000687df0 in Client::_getattr (this=0x1dff380, in=0x1eafb00, mask=341, uid=1005, gid=1006)
[20:46] <slang> at ../../src/client/Client.cc:3904
[20:46] <slang> #4 0x0000000000696eb4 in Client::ll_getattr (this=0x1dff380, vino=..., attr=0x7fff617d7ad0, uid=1005, gid=1006)
[20:46] <slang> at ../../src/client/Client.cc:5857
[20:46] <cmccabe> slang: are you running multiple MDS servers?
[20:47] <slang> cmccabe: yes, but only one active
[20:47] <cmccabe> slang: you mean the others are 'out'?
[20:47] <slang> I have another that's from _read_async
[20:47] <slang> no
[20:47] <slang> I mean the others are standby
[20:47] <cmccabe> slang: or you're running a standby mds
[20:47] <cmccabe> gregaf: are you allowed to have more than one standby for a single MDS?
[20:47] <slang> mds e8: 1/1/1 up {0=delta=up:active}, 1 up:standby-replay, 2 up:standby
[20:48] <slang> cmccabe: I think you can -- I've been running this way for a while now
[20:48] <cmccabe> slang: I think greg is at lunch now
[20:49] <slang> cmccabe: my impression is that this isn't just an mds issue
[20:49] <slang> cmccabe: two of the requests are read requests, which should be going to osds
[20:49] <cmccabe> slang: well, caps are definitely an MDS issue
[20:49] <cmccabe> slang: so it's possible that you're waiting for caps?
[20:49] <cmccabe> slang: brb
[20:50] <slang> again, the hangs are not always in the same place
[20:50] <slang> I guess I'll just file a bug
[20:53] * lxo (~aoliva@83TAAC4D7.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[21:01] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[21:02] * lxo (~aoliva@83TAAC4EP.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:03] <cmccabe> slang: back
[21:03] <cmccabe> slang: I think the best thing to do is file a bug, and maybe sagewk or gregaf will be able to correlate this with a recent change
[21:04] <cmccabe> slang: btw, can you reproduce this with debug_ms at 1 and debug_mds at 20?
[21:05] <cmccabe> slang: I understand that it might be hard to reproduce with that debugging turned on, but it's always helpful
[21:08] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[21:09] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[21:17] <damoxc> sagewk: much more informative error this time
[21:18] * lxo (~aoliva@83TAAC4EP.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[21:19] <damoxc> sagewk: information added to the issue
[21:26] * slang (~slang@chml01.drwholdings.com) has left #ceph
[21:27] * bchrisman (~Adium@64.164.138.146) has joined #ceph
[21:53] <sagewk> damoxc: can you do an ls -al in /srv/osd7/current/0.2a5_f ?
[21:53] <gregaf> cmccabe: slang: I don't know of any changes in particular that should cause hangs in v0.34… do you think they're actually new?
[21:53] * slang (~slang@chml01.drwholdings.com) has joined #ceph
[21:54] <damoxc> sagewk: http://dpaste.com/606225/
[21:56] <Tv> what are the consequences of having max_osd >> actual num of osds?
[21:59] <sagewk> tv: there's a vector<> that's wasting some memory
[21:59] <sagewk> and the osdmap encodes larger than it needs to
[21:59] <Tv> sagewk: is max_osd purely an implementation detail? could we get rid of it, or auto-size, or something?
[22:00] <sagewk> it also affects the encoding.
[22:00] <sagewk> we can certainly autosize, though.
[22:00] <Tv> i'm just looking to simplify ops
[22:01] <sagewk> damoxc: not sure how it got there, but it's easy to clean up.. remove that file from the newest snap_ directory and restart cosd
[22:01] <sagewk> tv: yeah we can make it autosize.
[22:01] <sagewk> tv: i think the create command already bumps it up as needed.
[22:02] <Tv> it did not ;)
[22:02] <Tv> just sent the email to emerg
[22:02] <damoxc> sagewk: wash, rinse and repeat for all osds that are failing to start?
[22:02] * bchrisman (~Adium@64.164.138.146) Quit (Quit: Leaving.)
[22:02] <sagewk> tv: were there parallel creates by chance?
[22:03] <Tv> sagewk: nope
[22:03] <sagewk> damoxc: yeah
[22:03] <damoxc> sagewk: am I looking for the exact filename?
[22:05] <sagewk> it's getting ENOTEMPTY on the rmdir for the dir. just delete whatever files are still in there and it'll succeed the next time around.
[22:10] <sagewk> tv: if you can reproduce that, dump the osdmap while you're getting the max_osds error.. it sounds like the osd id wasn't created yet
[22:12] <damoxc> sagewk: sorry I'm being really stupid but I'm not 100% sure what you mean and don't want to go deleting things I'm not supposed to be, http://dpaste.com/606241/ which one of those should I be removing?
[22:13] <sagewk> damoxc: ls -al ./snap_3733783/0.2a5_head should show several files... delete those (but not the directory itself)
[22:21] <damoxc> sagewk: 218 files to be exact
[22:21] <damoxc> that's what threw me off, I was only expecting to find one
[22:21] <sagewk> hrm that is strange
[22:22] <damoxc> http://dpaste.com/606245/ <-- ls -al
[22:22] <damoxc> well, ls -l
[22:23] <sagewk> what version are you running?
[22:24] <damoxc> 0.34
[22:24] <damoxc> with the previous patch applied
[22:28] * bchrisman (~Adium@64.164.138.146) has joined #ceph
[22:38] <sagewk> damoxc: if you are paranoid about losing data, make sure that PG is on other nodes before removing the files from this one
[22:38] <damoxc> sagewk: what's the easiest way to check that, ceph pg dump -o - ?
[22:39] <sagewk> no, you need to do an ls on the other nodes to see if the files are there :/
[22:40] <sagewk> i'm assuming the osd meant to remove the pg but failed to delete the files first. if the collection_remove itself is wrong, though, that's another story.
[22:40] <damoxc> I imagine it's correct
[22:40] <damoxc> I deleted a bunch of snapshots and data around that time
[22:43] <damoxc> should the same snap_ directories be on other osds?
[22:54] * lxo (~aoliva@82VAADKKR.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:57] <sagewk> the numbers will be different.. always use the most recent
[22:57] <sagewk> at least for changes.. if you're just looking, current/ is fine
[22:57] <damoxc> basically, I'm not concerned about any snapshots
[22:57] <damoxc> so should I only check current/ ?
[22:59] <sagewk> the snap_ stuff is unrelated to snapshots.. they're consistency points internal to the osd/filestore. on startup current/ is ignored, and it rolls back to the last consistent snap_ dir.
[23:01] <damoxc> oh okay
[23:01] <damoxc> foolish assumption on my part there
[23:02] <sagewk> not at all, the snap_ naming is confusing
[23:03] <johnl_> hi
[23:03] <sagewk> johnl_: hi
[23:04] <johnl_> re #1470, Samuel Just asked me for the "contents of the meta collection". What does that mean? how can I get it?
[23:04] <damoxc> sagewk: the data for 0.2a5 is on 2 other osds from what I can see, and is also marked as active+clean
[23:05] <sagewk> johnl_: ls -al $osd_data/current/meta
[23:05] <johnl_> ta
[23:05] <sagewk> johnl_: actually, find $osd_data/current/meta, so we see subdirs too
[23:06] <sagewk> damoxc: perfect, safe to remove then.
[23:06] <johnl_> it's empty!
[23:07] <sagewk> sjust: hmm?
[23:09] <sjust> johnl_: I just reproduced it
[23:09] <sjust> all of my collections are also empty
[23:11] <damoxc> sagewk: I've removed all the files in ./snap_3733783/0.2a5_head but it's still erroring :-s
[23:15] <sagewk> damoxc: that's the newest snap? and it's still giving you ENOTEMPTY when trying to remove it?
[23:17] <damoxc> sagewk: still the same as http://dpaste.com/606241/
[23:17] <damoxc> sagewk: log - http://dpaste.com/606269/
[23:18] <sagewk> does 0.2a5_f even exist in the newest snap_ dir?
[23:26] <damoxc> no
[23:26] <damoxc> only 0.2a5_head
[23:33] <johnl_> sjust: great! (or not great if you lost data :s)
[23:34] <sjust> johnl_: ok, I think I found the bug, it's probably hosed your filestore, I'm sorry about that
[23:34] <gregaf> sjust: what'd it do?
[23:35] <sjust> FlatIndex fails to set exists in the normal case causing collection_add to erroneously return -ENOENT
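
To make that failure mode concrete, here is a generic illustration (not the actual FlatIndex code) of an out-parameter that is only assigned on some paths, so the caller reads an indeterminate value and spuriously returns -ENOENT:

    #include <cerrno>

    // Buggy: 'exists' is set when the object is created, but the normal
    // lookup path forgets to assign it at all.
    static int lookup_or_create(bool create, bool *exists) {
      if (create) {
        *exists = true;
        return 0;
      }
      return 0;   // bug: *exists left untouched here
    }

    int collection_add_like() {
      bool exists;                          // uninitialized
      int r = lookup_or_create(false, &exists);
      if (r < 0)
        return r;
      if (!exists)                          // may fire even when the object exists
        return -ENOENT;
      return 0;
    }

The fix is simply to assign *exists on every return path (or initialize it at the call site).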
[23:37] <slang> sagewk: second crash isn't happening again once the server is restarted (with debugging enabled)
[23:39] <sagewk> damoxc: ..and it still generates that error? that's good news, actually.. i think it means it's creating the collection that it's failing to remove, which means a full log should be enough to find/fix the bug.
[23:39] <sagewk> can you reproduce and post the full log?
[23:43] <damoxc> sagewk: attached to the issue
[23:56] * bchrisman (~Adium@64.164.138.146) Quit (Quit: Leaving.)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.