#ceph IRC Log


IRC Log for 2011-04-12

Timestamps are in GMT/BST.

[0:06] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:29] * verwilst (~verwilst@dD576FAAE.access.telenet.be) Quit (Quit: Ex-Chat)
[1:00] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:33] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:41] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Operation timed out)
[1:55] <cmccabe> can someone explain the concept of a tmap
[1:57] <Tv> we need to do something about ceph log verbosity in autotests :(
[1:57] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[1:58] <Tv> it's continually filling up a 200GB disk, and there's not much more to give it on the physical box
[1:58] <cmccabe> I'm guessing that most people will want to be able to specify the logging settings for a particular run
[1:58] <Tv> that's kinda unrelated; i'm talking about defaults
[1:59] <Tv> especially the problem case where it hangs & keeps writing a huge pile of logs
[1:59] <Tv> easy first step is to plug in compression, somewhere..
[1:59] <cmccabe> tv: I thought you had logrotate doing tar.gz
[1:59] * Tv plays eeny meeny miny moe with autotest.git
[2:00] <Tv> cmccabe: there's not much rotation going on, and what is is the ceph internal funky symlink dance
[2:00] <cmccabe> heh
[2:01] <cmccabe> tv: to a first approximation, the symlink dance is intended to make it easier to run vstart.sh
[2:01] <Tv> yeah i'm actually thinking of disabling it for autotest tests, so that the log file names will be 100% predictable
[2:02] <Tv> no rotation while the test is running etc
[2:02] <cmccabe> tv: go for it
[2:02] <cmccabe> tv: well, rotation is not exactly what the symlink dance does
[2:02] <Tv> yeah i know, but that also seems to happen, somewhere
[2:02] <cmccabe> tv: more precisely, it does rotation exactly once-- when the program starts
[2:02] <Tv> ahh
[2:03] <cmccabe> tv: aren't there FSes with transparent encryption out there?
[2:03] <Tv> cmccabe: ecryptfs?
[2:03] <cmccabe> tv: that might be the best option. Just make /var/log/ceph into one of those.
[2:03] <Tv> you mean compression?
[2:03] <Tv> i'd very much like to avoid such complexity
[2:04] <cmccabe> tv: well, it's complexity that presumably we won't be debugging, since ecryptfs already exists
[2:04] <Tv> yet when it fails, it'll be even more mysterious
[2:04] <Tv> the less code that runs, the happier i am
[2:04] <Tv> also i don't want to deal with performance implications etc
[2:05] <cmccabe> tv: If you want, you can rotate logfiles manually by changing g_conf.log_file and sending SIGHUP
[2:05] <Tv> no i want to disable all such, so it's just a single predictable pathname
[2:05] <cmccabe> tv: ok, just set log_file
[2:06] <Tv> will do, once i figure out the rest of the picture
[2:06] <cmccabe> tv: there will be no rotation, and we always open that file with O_APPEND
[2:07] <cmccabe> tv: you were talking about compression before
[2:07] <Tv> i've seen ceph logs take ~200GB per run
[2:07] <cmccabe> tv: you could use zlib to write the logfile if you really want
[2:07] <Tv> that's just insane
[2:07] <Tv> nono not at that time
[2:07] <Tv> outside ceph
[2:07] <cmccabe> tv: however that would have the performance implications you were talking about as well
[2:07] <Tv> yeah compress after the run
[2:07] <cmccabe> tv: 7z might be interesting
[2:08] <cmccabe> tv: I'm not sure what the best zip for textfiles out there is now
[2:08] <Tv> i'll take browser support for ungz-ing on the fly above 2% savings
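
Editor's note: the "compress after the run" idea with plain gzip could look roughly like the sketch below. It is only an illustration, not anything from autotest or Ceph. The helper name `compress_log` and the example path are hypothetical, and it assumes the test has finished so nothing is still appending to the file.

```python
import gzip
import os
import shutil


def compress_log(path):
    """Gzip a finished log file and remove the uncompressed original.

    Hypothetical helper: meant to run after the test has ended, so no
    daemon is still appending to `path`. Returns the compressed file name.
    """
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Stream in chunks so a very large log is never held in memory.
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return gz_path
```

A post-run hook might call `compress_log("/var/log/ceph/osd.0.log")`, where that path is purely an example.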
[2:08] <cmccabe> tv: anyways, I suspect most devs will just ask you for logfiles before even thinking about debugging anything
[2:08] <Tv> hell no
[2:09] <cmccabe> tv: so the usefulness of unlogged runs might be kind of limited
[2:09] <Tv> self-service or i will burn out fast
[2:09] <cmccabe> tv: anyway, I don't have the answers for what default logging should be.
[2:10] <Tv> for real installs, on anything debian-based, probably /var/log/ceph/ and rotation via /etc/logrotate.d/
[2:10] <cmccabe> tv: when I was working with the OSD more, I never ran without a lot of OSD logging
[2:11] <Tv> but i think that's gonna change a lot before we're done
[2:11] <cmccabe> tv: you can ask sam and josh if anything has changed on that front since then
[2:11] <Tv> for example, what users what daemons run as, etc
[2:11] <cmccabe> I'm sure greg wants a lot of MDS logging
[2:11] <cmccabe> sometimes messenger logging is set oddly high... like on the rgw test machines. But there's reasons for that too.
[2:19] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:27] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:43] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:49] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:50] * sjustlaptop (~sam@ has joined #ceph
[3:06] * sjustlaptop (~sam@ Quit (Quit: Leaving.)
[4:57] * hijacker (~hijacker@ Quit (Read error: Connection reset by peer)
[4:57] * hijacker (~hijacker@ has joined #ceph
[5:08] * lxo (~aoliva@ Quit (Read error: Operation timed out)
[5:08] * lxo (~aoliva@ has joined #ceph
[6:00] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[6:36] * sjustlaptop (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) has joined #ceph
[6:39] * sjustlaptop (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) Quit ()
[8:06] * Yoric (~David@ has joined #ceph
[8:51] * hijacker (~hijacker@ Quit (Quit: Leaving)
[9:01] * Yoric (~David@ Quit (Quit: Yoric)
[9:06] * hijacker (~hijacker@ has joined #ceph
[9:20] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[9:46] * chraible (~chraible@blackhole.science-computing.de) has joined #ceph
[9:46] <chraible> hi @all
[9:46] <chraible> I'm searching for the right tcmalloc package to configure Ceph on CentOS 5.5 x86_64
[9:47] * allsystemsarego (~allsystem@ has joined #ceph
[9:48] <chraible> I installed google-perftools-1.3-3.el5.kb.x86_64, google-perftools-1.7-1.el5.i386.rpm, google-perftools-devel-1.7-1.el5.i386.rpm
[9:48] <chraible> but none of those packages are working...
[9:49] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) has joined #ceph
[10:23] <chraible> is someone here?
[10:36] <stefanha> chraible: Don't ask to ask, just ask. Then wait on the channel for a while.
[12:38] * Yoric_ (~David@did75-14-82-236-25-72.fbx.proxad.net) has joined #ceph
[12:38] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[12:38] * Yoric_ is now known as Yoric
[12:51] * cockroach (~stefan@80-218-0-190.dclient.hispeed.ch) has joined #ceph
[12:53] <cockroach> hi there! just a quick question: is the "Ceph is under heavy development, and is not yet suitable for any uses other than benchmarking and review" statement still true?
[14:13] <wido> cockroach: yes, that is still true
[14:19] * chraible (~chraible@blackhole.science-computing.de) has left #ceph
[14:28] * chraible (~chraible@blackhole.science-computing.de) has joined #ceph
[14:46] <cockroach> wido: ok, thanks
[14:47] <cockroach> wido: sucks though, now we'll have to go with some other file system. they all suck. :)
[15:38] <chraible> i got the solution for my problem :D
[15:39] <chraible> i forgot the devel package on version 1.3.3 . I have to install google-perftools-devel-1.3-3.el5.kb.x86_64 too ... after that all seems to work fine :D
[16:10] * chraible (~chraible@blackhole.science-computing.de) Quit (Quit: Verlassend)
[16:12] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[16:47] <lxo> with 0.26 I'm experiencing relatively frequent mds assertion failures at Journaler::_trim_finish:963: assert(to > trimmed_pos); with to == trimmed_pos and waitfor_trim.empty(). should the assertion be relaxed to >=, or is that really meant to catch the == condition?
[16:51] * cap_ (~cap@yaydoe.nsc.liu.se) has joined #ceph
[17:23] <sage> lxo: i'll take a look
[17:25] <cap_> how do you keep the kernel client up to date? can it be built out-of-tree or are you simply expected to use the latest kernel?
[17:26] <sage> lxo: do you have an mds in up:standby-replay mode?
[17:27] <sage> lxo: it's not obvious how that is happening. if you can reproduce with 'debug journaler = 20' that should tell us exactly what's going on
[17:39] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:49] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:52] * greglap (~Adium@ has joined #ceph
[17:55] <lxo> sage, I did. I still have the core files and debug info for the failing mdses, but I'm already running a patched version now
[17:55] <lxo> would any information from the core files be useful?
[17:55] <greglap> cap_: you can build it as a module with ceph-client-standalone; it should work with any reasonably recent kernel
[17:56] <cap_> greglap, is 2.6.15 (fc14) reasonably new in this context?
[17:56] <cap_> sorry
[17:56] <cap_> 2.6.35
[17:56] <greglap> cap_: yeah
[17:57] <greglap> there might be some newer patches that aren't in the backports branch yet, but it will keep you roughly in line
[17:57] <cap_> greglap, thanks, just what I was looking for (alternatives for me would be fuse or fedora-rawhide kernel)
[17:59] <lxo> sage, err, sorry, I lied, the core files are gone now :-(
[18:03] <lxo> anyhow, there are two possible scenarios that lead to the problem. I mentioned one in e-mail (deadlocked (?) master mds restarted, standby-replay crashes while taking over). the other, that just occurred to me, also involves problems in the master mds: sometimes they'd restart, without apparent reason, and then the standby-replay that was to take over might run into this scenario.
[18:03] <lxo> the latter is less likely IMHO, because MDSes have failed like that under intense activity, so it's unlikely that the journal is empty
[18:07] * cockroach (~stefan@80-218-0-190.dclient.hispeed.ch) Quit (Quit: leaving)
[18:10] * cap_ (~cap@yaydoe.nsc.liu.se) Quit (Quit: .)
[18:16] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:30] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[18:35] * cmccabe (~cmccabe@ has joined #ceph
[18:41] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:41] <Tv> msg/SimpleMessenger.cc: In function 'int SimpleMessenger::Pipe::accept()', in thread '0x7f1e9f15670
[18:41] <Tv> 0'
[18:41] <Tv> msg/SimpleMessenger.cc: 810: FAILED assert(existing->state == STATE_CONNECTING || existing->state =
[18:41] <Tv> = STATE_STANDBY || existing->state == STATE_WAIT)
[18:41] <Tv> my mds crashed with a fairly simple-looking error..
[18:41] <Tv> sadly, no core
[18:41] * MK_FG (~MK_FG@ Quit (Quit: o//)
[18:42] * MK_FG (~MK_FG@ has joined #ceph
[18:44] * morse_ (~morse@supercomputing.univpm.it) has joined #ceph
[18:46] * morse (~morse@supercomputing.univpm.it) Quit (charon.oftc.net magnet.oftc.net)
[18:46] * wonko_be (bernard@november.openminds.be) Quit (charon.oftc.net magnet.oftc.net)
[18:46] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (charon.oftc.net magnet.oftc.net)
[18:47] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:47] * wonko_be (bernard@november.openminds.be) has joined #ceph
[18:47] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[18:48] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (Ping timeout: 480 seconds)
[18:48] * morse (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[18:49] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[18:50] <wido> sjust: you there?
[18:51] <wido> did you take a look at the replay_queued_ops? I've got some logs with debug osd = 20
[18:53] <sjust> wido: yeah, I am working on it
[18:53] <sjust> wido: the problem is that do_peer seems to be called on an active pg
[18:53] <sjust> I'd like to see the logs, though
[18:54] <wido> I'll pastebin them, one moment
[18:55] <sjust> cool, thanks
[18:55] <wido> Oh, I'll create an issue for it, for my own time-tracking, if you don't mind
[18:55] <sjust> cool, thanks
[18:59] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:01] <wido> sjust: see #1000
[19:03] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[19:06] <sjust> wido: ok
[19:08] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:08] <gregaf1> Tv: do you have any more details on that mds messenger crash?
[19:09] <Tv> gregaf1: sadly, no -- all i can think of is that it was on a machine suspended overnight
[19:09] <gregaf1> ah
[19:09] <Tv> gregaf1: maybe that made tcp time out unexpectedly, or something
[19:10] <gregaf1> the assert is partly a check on the other end's behavior, here
[19:10] <Tv> gregaf1: i just got a reproducible journal replay assert
[19:10] <Tv> 2011-04-12 10:09:23.449217 7f37e6d27700 mds2.journal EMetaBlob.replay missing dir ino 100000030b9
[19:10] <gregaf1> we can discard the messenger one as not a problem
[19:10] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:10] <Tv> i don't know how the journal got into this state, but now that happens every time
[19:11] <gregaf1> ungh
[19:11] <gregaf1> is it on an EOpen?
[19:11] <Tv> well it still shouldn't *crash*, no matter what the socket does
[19:11] <Tv> 1: (EMetaBlob::replay(MDS*, LogSegment*)+0x4103) [0x4ca923]
[19:11] <Tv> 2: (EOpen::replay(MDS*)+0x99) [0x4cd6a9]
[19:11] <Tv> that's the top of the stack trace
[19:11] <gregaf1> yeah, there's an issue in the tracker about that
[19:11] <Tv> gregaf1: do you need a broken journal?
[19:12] <gregaf1> there are problems with no, we know what's happening
[19:12] <gregaf1> err, that was supposed to be two
[19:12] <gregaf1> there are problems with how we journal EOpen on non-auth MDSes when you replay
[19:12] <Tv> ok discarding mah data
[19:12] <gregaf1> and we know what's happening but we tabled it as the problem is a little complicated and Sage wants to think about how best to resolve it
[19:16] * gregaf1 (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[19:16] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:20] * Yoric (~David@did75-14-82-236-25-72.fbx.proxad.net) Quit (Quit: Yoric)
[19:21] * sageph (~yaaic@mac0436d0.tmodns.net) has joined #ceph
[19:21] <sageph> ill be late...lets shoot for 1130
[19:25] * sageph (~yaaic@mac0436d0.tmodns.net) Quit (Read error: Connection reset by peer)
[19:26] * neurodrone (~neurodron@dhcp215-236.wireless.buffalo.edu) has joined #ceph
[19:55] * neurodrone (~neurodron@dhcp215-236.wireless.buffalo.edu) Quit (Quit: neurodrone)
[19:57] <sjust> wido: wip_bug1000 branch should fix the problem I think you are having, if you'd like to give it a try
[20:05] <wido> sjust: Sure!
[20:05] <sage> done with the kindergarten orientation thing. working from home, though (sick)
[20:05] <sage> we can do the skype whenever you guys are ready
[20:06] <cmccabe> I'm ready whenever.
[20:07] <Tv> i don't see greg from here but everyone is right here
[20:07] <Tv> +else
[20:07] <Tv> gregaf: ?
[20:07] <gregaf> whenever's good
[20:08] <gregaf> is conference room open now?
[20:08] <Tv> yrds
[20:08] <Tv> yes
[20:08] <gregaf> let's go
[20:08] <Tv> this keyboard thing is challenging this morning
[20:08] <sjust> ok
[20:12] <bchrisman> weird bug we're seeing.. we've got one node which fails to mount btrfs filesystems for ceph with an EIO (can't read superblock)… fails regularly with the -o noatime… take that flag away, and it mounts with no problem… however.. the same bits works on several other clusters without fail… weird… nothing odd in dmesg/messages either.. ;?)
[20:19] <sage> bchrisman: i think there was an old old version of btrfs that would fail on noatime (unrecognized mount option)?
[20:19] <sage> or some other similarly generic mount flag
[20:23] <bchrisman> sage: interesting… it actually shows up (only on that system) if we put any option in… well.. we'll see what priority to have for tracking it further.
[20:37] <sage> gregaf, bchrisman: pushed the mds mydir rstat fixes. these are fallout from the CDir commit changes that do _commit_partial (which only saves dirty dentries). i'm guessing the stray issue is probably something similar...
[20:39] <bchrisman> sage: cool… we'll have another build on that in a couple hours and I'll verify the messages aren't getting pumped out (and spamming the logs)
[20:39] <sage> cool
[20:39] <gregaf> this is only for an MDS that gets restarted though, right?
[20:40] <gregaf> I'm pretty sure that's not your problem, bchrisman
[20:40] <bchrisman> ahh okay
[20:40] <sage> my bug was easily triggered by restart, but could also trigger after some cache trimming.
[20:41] <bchrisman> this was during/after initial filesystem create/mount with no MDS restarting
[20:41] <sage> ..but it was on the ~mdsN dir content, not stray dirs.
[20:41] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:42] <cmccabe> so I see some code like this in create_bucket
[20:42] <cmccabe> librados::ObjectOperation op;
[20:42] <cmccabe> op.create(true);
[20:42] <cmccabe> root_pool_ctx.operate(bucket, &op, &outbl);
[20:43] <gregaf> sage: bchrisman: looking through this again the issue is definitely something to do with projecting/popping fnodes
[20:43] <cmccabe> is that intended to create an object named 'bucket' in the .rgw pool?
[20:43] <cmccabe> if so, that is what is failing with the 251-character long bucket-name problem
[20:43] <sage> yeah. that object has the bucket acl on it
[20:43] <cmccabe> so this works for short bucket names, but fails for the longer one
[20:44] <cmccabe> no error code is returned either
[20:44] <gregaf> the stray dir is happy with one entry, and then 5 more get added to it via link_primary_inode, but it only pops the fnode twice and then the rstats think there are only two more inodes
[20:44] <gregaf> gotta run though, pizza's here
[20:44] <cmccabe> based on rados -p .rgw ls
[20:45] <sage> no error code is returned? the osd_op_reply has result code 0 but the bucket isn't created?
[20:45] <cmccabe> the bucket is created, but not .rgw/<bucket>
[20:45] <sage> look at the osd log for that object create and see what's up?
[20:45] <cmccabe> more precisely, it seems the bucket is represented by a pool
[20:46] <cmccabe> that pool is created fine
[20:46] <cmccabe> but it's also represented by an object in the .rgw pool, which is not successfully created by "operate"
[20:46] <Tv> cmccabe: len(".rgw/") = 5, so 250 still works, right?
[20:47] <cmccabe> let me check
[20:47] <Tv> cmccabe: 251 is the first one that fails, right
[20:47] <cmccabe> yes
[20:47] <Tv> what we have is two separate layers enforcing max name len, and something trying to add a prefix when going between the layers
[20:47] <Tv> and really shitty error handling, on top of that
[20:47] <cmccabe> tv: so far, everything I've found is using std::string. But I'm digging deeper.
[20:48] <Tv> i'd recommend step 1: fix the error handling
[20:48] <Tv> otherwise we'll just bury that bug
[20:48] <cmccabe> in soviet russia, bug buries *you*.
[20:48] <sage> the bucket object is in its own pool, right? do we need the .rgw/ prefix?
[20:49] <cmccabe> .rgw is a pool which has objects for each pool
[20:49] <cmccabe> also .rgw has some other special objects that don't represent pools
[20:49] <cmccabe> er, I mean .rgw is a pool with objects for each bucket
[20:50] <cmccabe> I didn't think we had max name length on pool objects, but I'll check
[21:01] <lxo> help! monitors won't switch any standby[-replay] mds to active, even though there isn't any active mds
[21:01] <lxo> I tried restarting all monitors, all mdses, and even disabling their standby replay configuration, to no avail
[21:03] <lxo> not the first time this happens to me, but the first time with 0.26. the other times it happened, I couldn't find a way out, so I ended up re-creating the entire filesystem
[21:03] <lxo> but this is the first time I'm very close to completing uploading the 1TB of data I intend to keep in the cluster, so I'd rather not throw it all away
[21:03] <lxo> any ideas of how to debug this?
[21:04] <greglap> lxo: you must have originally had an active mds, what happened?
[21:06] <lxo> I don't know. MDSes started to disappear from the “ceph mds dump -o -”, although their processes were still running. eventually the list got empty
[21:07] <greglap> what does ceph -s give you?
[21:07] <lxo> then I restarted the MDSes, and they were brought up in standby-replay state -2
[21:08] <greglap> and what commit are you on?
[21:08] <lxo> nothing unusual. 3 mdses in standby, 2 osdes up
[21:08] <lxo> pg v70750, if that's what you're asking
[21:08] <greglap> no, what version of Ceph
[21:09] <greglap> there is a setting now where it will promote standby-replay MDSes into active
[21:09] <greglap> but that's pretty new, it used to be that if you only had standby-replay MDSes they wouldn't go active unless they were following an MDS that failed
[21:10] <lxo> c494689062 (stable) plus the patch I posted today that relaxes the assert
[21:10] <lxo> yeah, I know about that, but I even tried to disable standby-replay, and it didn't help
[21:10] <greglap> how'd you disable it?
[21:11] <lxo> mds standby replay = false for all mdses, restarted all monitors and mdses
[21:12] * neurodrone (~neurodron@dhcp213-175.wireless.buffalo.edu) has joined #ceph
[21:12] <greglap> I suspect that the monitor has the MDS in standby-replay mode in the map and won't let it switch itself out
[21:12] <lxo> (after trying restarting less than that)
[21:12] <greglap> the best solution is probably to add another MDS that doesn't have any of the standby settings
[21:12] <greglap> I'd really like to know how your cluster lost its active MDS though
[21:13] <sage> lxo: can you pastebin the output from 'ceph mds dump -o -'
[21:15] <lxo> sage, it's IP/port 'name' mds-1.0 up:standby seq 1 now (just restarted them all)
[21:15] <lxo> for the 3 mdses
[21:16] <lxo> err IP:port/id
[21:17] <lxo> any other info that would be useful before I bring up a fresh MDS without any standby settings that the monitors might still remember?
[21:18] <greglap> you didn't change anything before the MDSes started to disappear?
[21:19] <lxo> the only change was that I started a -j18 GCC build on one of the machines that holds one of the active OSDs (in a separate filesystem) and monitor #1 (on the same filesystem as the build)
[21:23] <lxo> greglap, http://pastebin.com/BSPzwxJJ shows the last bits of activity from the cluster, and then, next thing I know, all mdses are down
[21:24] <lxo> in the time window that shows no activity, I ran ceph mds dump -o - a few times and saw the mdses being kicked out one by one, although their logs showed they were fine
[21:24] <lxo> (and their processes were active)
[21:26] <greglap> lxo: okay, according to that output the MDSes died as best anybody else could tell
[21:26] <lxo> (you see that one of the 3 OSDs is down, so all pgs are degraded. they have been like this since i started uploading data onto this filesystem)
[21:27] <lxo> before that, the cluster had some 15 minutes of activity in which all it logged were new PG versions
[21:27] <greglap> I'm trying to think of how the MDS could stop sending beacons or whatever and still be alive but I'm not coming up with anything, hmm
[21:27] <cmccabe> I'm going to shut down rgw-1; is anyone using it?
[21:28] <lxo> here's one possibility: they *were* sending beacons, but the disk holding monitor 1 was busy so the monitors couldn't advance
[21:28] <cmccabe> it was something yehuda used I think
[21:28] <greglap> lxo: hmmm, that seems unlikely but it's possible
[21:28] <greglap> how many monitors do you have?
[21:29] <sage> lxo: can you pastebin the whole thing?
[21:29] <greglap> oh, you do have 3 — was mon0 the one with the busy disk?
[21:29] <greglap> n/m, you said mon1
[21:29] <greglap> although by that time the MDSes were all declared down
[21:31] <lxo> greglap, 3 monitors
[21:32] <lxo> sage, what “whole thing”? this was ceph -w output
[21:32] <sage> ceph mds dump -o -
[21:33] <lxo> http://pastebin.com/R5Gir0jW
[21:34] <sage> do you by chance have logs from mon0?
[21:35] <sage> up {0=4477} is inconsistent w/ the rest of the map (4477 isn't there)
[21:35] <sage> (btw you can probably get out of this with 'ceph mds fail 0')
[21:36] <lxo> nice. should I try that now, or would it help to keep the cluster down?
[21:36] <sage> go ahead
[21:36] <lxo> I do have logs for mon0. in fact, i was looking at them
[21:37] <sage> if you have logs that's helpful, otherwise no worries
[21:37] <sage> excellent, can you post them somewhere (in their entirety)?
[21:37] <sage> a tarball of $mon_data/mdsmap/ would also help
[21:41] <lxo> you want the whole 880MB of logs for mon.0, or want me to trim it starting say one hour before the failure or so?
[21:41] <sage> whichever is easier for you
[21:41] <sage> if you trim, make sure it is well before the first mds failure
[21:45] <lxo> is there any privacy issue re: posting mon logs? I see filenames in mds logs, which is why I ask
[21:45] <sage> ips. maybe the crypto keys, if debug auth is turned up. you can msg me the url privately if you prefer
[21:46] <lxo> I was thinking of uploading the logs to a new bug
[21:46] <lxo> lemme see if I have crypto keys in the logs
[21:48] <lxo> nope. good
[21:55] <lxo> sage, thanks for the tip, the “ceph mds fail 0” command brought the cluster back up!
[21:56] <lxo> now, how could you tell that 4477 was inconsistent? now it says 5139, but that's not the /id of any of the MDSes
[21:56] <lxo> or was it just because none of the MDSes was active in the first place?
[21:57] <lxo> mdsmap and trimmed log compress down to 2MB. is that kosher to attach to the bug tracker?
[21:57] <sage> after the 0=<gid> mapping you should see that <gid> instance listed below
[21:58] <lxo> you mean <gid>=number, right? 'cause now I have 0=5139, and an mds0.## active, but this mds doesn't have 5139 anywhere in the mds dump
[21:58] <lxo> oh
[21:59] <lxo> except for before the :, where I wasn't looking
[21:59] <lxo> :-)
[21:59] <lxo> grep coloring is great, isn't it? :-)
[21:59] <sage> it is.. i wish debian had it on by default
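
Editor's note: the consistency rule sage describes (every gid in the `up` mapping should also appear as an instance listed in the map) can be sketched as below. This is not Ceph code and does not parse real `ceph mds dump` output. It assumes the dump has already been reduced to a hypothetical dict of rank to gid and a set of listed gids. Only 4477 and 5139 come from the log; the other gid values are made up for illustration.

```python
def dangling_ranks(up, listed_gids):
    """Return {rank: gid} for ranks whose gid has no instance entry.

    `up` is a hypothetical dict mapping rank -> gid (the "up {0=4477}" part),
    `listed_gids` is the set of gids that have an entry further down the map.
    """
    return {rank: gid for rank, gid in up.items() if gid not in listed_gids}


# The broken state from the log: rank 0 points at gid 4477, which is not listed.
# The listed gids here are invented placeholders.
broken = dangling_ranks({0: 4477}, {5001, 5002, 5003})

# After "ceph mds fail 0": rank 0 points at 5139, which is listed.
healthy = dangling_ranks({0: 5139}, {5139, 5140, 5141})
```

With these inputs `broken` holds the stale rank 0 entry and `healthy` is empty.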
[22:08] * RickB17 (~rbreidens@pat.recoverynetworks.com) Quit (Remote host closed the connection)
[22:10] <lxo> it's bug 1001
[22:10] <lxo> thanks for the help!
[22:11] <sage> got it, thanks
[22:40] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[22:47] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[22:49] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[22:53] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:58] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[23:25] * neurodrone (~neurodron@dhcp213-175.wireless.buffalo.edu) Quit (Quit: neurodrone)
[23:30] <cmccabe> so I wrote a test that does the same thing as the call to IoCtx::operate in rgw
[23:30] <cmccabe> but it succeeds in the test.
[23:31] <sage> by succeed you mean the osd doesn't return an error?
[23:31] <sage> and from rgw it fails by returning an error?
[23:31] <cmccabe> no, I mean it creates the object.
[23:31] <cmccabe> I didn't reproduce the op.setxattr stuff in the test though
[23:31] <sage> and from rgw it returns success (0), but doesn't create the object?
[23:31] <cmccabe> yep
[23:31] <cmccabe> either that or something is removing the object later
[23:32] <sage> do you have an osd log of that happening?
[23:32] <cmccabe> yeah
[23:32] <sage> it may be #963
[23:33] <cmccabe> I'm not sure if my logging was turned up enough to have anything interesting
[23:33] <sage> vstart.sh -d ?
[23:34] <cmccabe> yep
[23:35] <cmccabe> the setxattr stuff doesn't include the long bucket name though
[23:35] <sage> where can i take a look? metropolis?
[23:35] <cmccabe> it's on rgw-cmccabe
[23:35] <cmccabe>
[23:36] <sage> do you have an object or bucket name i can grep for?
[23:36] <cmccabe> ggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg
[23:36] <cmccabe> that's the only bucket
[23:37] <cmccabe> actually that bucket worked
[23:37] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[23:37] <sage> it's not in any of the logs...
[23:38] <cmccabe> ok... weird
[23:38] <sage> gregaf: can you review the mon_mds branch? includes fix for #1001 and some cleanups
[23:38] <cmccabe> I can now ls the buckets with long names again
[23:38] <cmccabe> just created hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
[23:38] <cmccabe> it works fine
[23:38] <sage> that's only 152 chars
[23:39] <cmccabe> doh
[23:39] <gregaf> sage: yeah, will do
[23:40] <cmccabe> ok, new bucket jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
[23:40] <cmccabe> has the error
[23:40] <cmccabe> so I don't know when/if these bucket/pool names appear in the osd log
[23:40] <sage> the cosd doesn't seem to be logging, grep jjj out/* turns up nothing?
[23:41] <cmccabe> sage: well, you can look in /var/log/radosgw
[23:41] <cmccabe> sage: that is the logs from apache
[23:41] <cmccabe> also, you need to look in /home/cmccabe/src/ceph-old/src/out for the new logs
[23:42] <sage> ah found it
[23:42] <cmccabe> so I do see the pool names mentioned in those logs
[23:43] <cmccabe> sage: also I deactivated the symlink dance and used log_file = /my/path/to/log/file/$name.log
[23:43] <cmccabe> sage: that seems to work great; I recommend we move the symlink dance into vstart.sh and remove it from the logging code as soon as feasible
[23:44] <Tv> cmccabe: yay!!!!!
[23:44] <cmccabe> :)
[23:45] <cmccabe> ah, I do see something fishy here: filestore(/data/osd0) write couldn't open /data/osd0/current/5.7_head/2011-04-12-jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj_head flags 65 errno 36 File name too long
[23:47] <Tv> ext4.h:1409:#define EXT4_NAME_LEN 255
[23:47] <Tv> and that's 267 bytes
[23:48] <Tv> but by that logic, it should start failing earlier, not at 251
[23:48] <Tv> unless the date prefix isn't really there
[23:48] <sage> that's the log object (with the date prefix)
[23:48] <Tv> len("_head") == 5, that'd explain the 250 works
[23:49] <cmccabe> I think we talked about this earlier and came to the conclusion that the PATH_MAX was for the path as a whole, but 255 was the limit for each component on ext3/4
[23:49] <cmccabe> anyway, I'm using btrfs, if that matters
[23:49] <Tv> ah let's grep those constants then..
[23:49] <sage> it's probably still 255 in btrfs
[23:50] <Tv> ctree.h:131:#define BTRFS_NAME_LEN 255
[23:50] <cmccabe> I'm pretty sure that PATH_MAX is 4k
[23:50] <Tv> same thing
[23:50] <cmccabe> not that that matters really
[23:50] <cmccabe> tv: k
[23:50] <sage> 2011-04-12 17:39:57.797442 7f137f892700 filestore(/data/osd0) touch /data/osd0/current/4.2_head/jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj_head = -36
[23:50] <Tv> sage: ok so the date isn't really there, as i expected
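
Editor's note: the arithmetic behind "250 works, 251 fails" can be checked with a small sketch. The limit of 255 and the `_head` suffix are taken from the log. The function name is hypothetical, and the sketch assumes the on-disk file name is exactly the object name plus that suffix, with no other escaping.

```python
NAME_LEN = 255  # per-component limit quoted in the log (EXT4_NAME_LEN)


def fits_in_component(object_name, suffix="_head", limit=NAME_LEN):
    """True if object_name plus suffix fits in one path component.

    Length is measured in bytes, since that is what the filesystem limits.
    """
    return len((object_name + suffix).encode("utf-8")) <= limit


ok_250 = fits_in_component("j" * 250)   # 250 + 5 = 255, at the limit
bad_251 = fits_in_component("j" * 251)  # 251 + 5 = 256, one byte over
```

This matches the observed boundary: a 250-character bucket name fits, and 251 is the first length that produces ENAMETOOLONG (errno 36).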
[23:51] <cmccabe> so we have a dilemma I guess
[23:51] <Tv> now, who wants to write a mapping layer ;)
[23:51] <cmccabe> it might be better to forbid those long names for now
[23:51] <sage> bleh..
[23:51] <cmccabe> I can easily imagine other parts of the osd slamming together oid with ".foo" to make all kinds of other side files
[23:52] <cmccabe> so having really really long pool or object names is risky
[23:52] <Tv> cmccabe: that means we can't even be sure what length is still ok..
[23:52] <sage> the _<snapid> is used throughout. but that's the only suffix.
[23:52] <Tv> especially if it fails this weirdly
[23:52] <cmccabe> but how long can snapid be?
[23:52] <sage> it's a uint64
[23:52] <sage> in hex, iirc
[23:53] <Tv> that's 20 in base-10, 16 in hex without decorations
[23:53] <cmccabe> 16 chars then
[23:53] <sage> we could use the pool id (int) instead of bucket name for these objects.. it's a 1:1 mapping
[23:53] <cmccabe> but we also have an underscore of course
[23:53] <sage> plus _
[23:53] <Tv> yeah
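
Editor's note: the suffix budget discussed here can be worked out as below. This is a sketch under the assumptions stated in the chat: the only suffix is `_` plus a uint64 snapid printed in hex without a `0x` prefix, and the component limit is 255. The function names are hypothetical, and the real OSD may format or escape names differently.

```python
NAME_LEN = 255          # per-component limit quoted in the log
UINT64_MAX = 2**64 - 1  # largest possible snapid value


def max_suffix_len():
    """Length of "_" plus the widest uint64 rendered in bare hex."""
    return len("_" + format(UINT64_MAX, "x"))


def max_safe_name_len(limit=NAME_LEN):
    """Longest object name that still fits with the widest snapid suffix."""
    return limit - max_suffix_len()


decimal_digits = len(str(UINT64_MAX))      # digits needed in base 10
hex_digits = len(format(UINT64_MAX, "x"))  # digits needed in hex
```

Under those assumptions the suffix is at most 17 bytes, leaving 238 bytes for the name itself.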
[23:54] <Tv> also
[23:54] <Tv> what do you do if OpenStack says 4096 byte bucket names are ok ;)
[23:54] <sage> yeah.
[23:55] <cmccabe> I really doubt that openstack allows bucket names of that size...
[23:55] <cmccabe> I guess the place to start looking is in Nova?
[23:55] <sage> let's use the pool id instead... rados_pool_lookup()
[23:56] <sage> that changes the backend format, unfortunately.
[23:56] <cmccabe> from nova/objectstore/bucket.py:
[23:56] <cmccabe> def _object_path(self, object_name):
[23:56] <cmccabe> fn = os.path.join(self.path, object_name)
[23:57] <cmccabe> I wonder if Nova is using an FS backend themselves?
[23:57] <Tv> http://bazaar.launchpad.net/~hudson-openstack/swift/trunk/view/head:/swift/common/constraints.py#L39
[23:58] <cmccabe> 256 is a weird choice given that EXT4_NAME_LEN/BTRFS_NAME_LEN = 255
[23:59] <sage> yeah
[23:59] <cmccabe> I wonder if we just found an openstack bug

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.