#ceph IRC Log


IRC Log for 2011-04-08

Timestamps are in GMT/BST.

[0:19] * sagelap (~sage@m1f0436d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[0:32] * pombreda (~Administr@ has joined #ceph
[1:17] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[1:20] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[1:21] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit ()
[1:26] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) has joined #ceph
[1:34] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:42] * midnightmagic (~midnightm@S0106000102ec26fe.gv.shawcable.net) has joined #ceph
[1:52] * samsung (~samsung@ has joined #ceph
[1:57] * pombreda (~Administr@ Quit (Quit: Leaving.)
[2:04] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[2:20] * pombreda (~Administr@ has joined #ceph
[2:23] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:35] * sjustlaptop (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) has joined #ceph
[2:42] * pombreda (~Administr@ Quit (Quit: Leaving.)
[2:45] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) Quit (Ping timeout: 480 seconds)
[2:58] * greglap (~Adium@ has joined #ceph
[3:07] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:13] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[3:32] * WesleyS (~WesleyS@ has joined #ceph
[3:32] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[3:35] * WesleyS (~WesleyS@ Quit ()
[3:38] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[4:15] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[4:26] * sjustlaptop (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) Quit (Quit: Leaving.)
[4:42] * sjustlaptop (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) has joined #ceph
[5:00] * pombreda (~Administr@adsl-71-142-66-118.dsl.pltn13.pacbell.net) has joined #ceph
[5:08] * sjustlaptop (~sam@adsl-76-208-183-201.dsl.lsan03.sbcglobal.net) Quit (Quit: Leaving.)
[5:09] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[5:27] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Ping timeout: 480 seconds)
[5:30] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[6:03] * pombreda (~Administr@adsl-71-142-66-118.dsl.pltn13.pacbell.net) Quit (Quit: Leaving.)
[6:06] * pombreda (~Administr@adsl-71-142-66-118.dsl.pltn13.pacbell.net) has joined #ceph
[6:31] <pombreda> gregaf: howdy, FWIW, the playground is not behaving, but you may be already aware of that. two nodes down, afaik per sage?
[6:31] <pombreda> gregaf: writing new things there sometimes hangs
[6:48] * hutchint (~hutchint@c-75-71-83-44.hsd1.co.comcast.net) has joined #ceph
[6:55] * pombreda1 (~Administr@adsl-71-142-66-118.dsl.pltn13.pacbell.net) has joined #ceph
[6:59] * pombreda (~Administr@adsl-71-142-66-118.dsl.pltn13.pacbell.net) Quit (Ping timeout: 480 seconds)
[7:24] * pombreda1 (~Administr@adsl-71-142-66-118.dsl.pltn13.pacbell.net) Quit (Quit: Leaving.)
[7:52] * hutchint (~hutchint@c-75-71-83-44.hsd1.co.comcast.net) Quit (Quit: Leaving)
[8:04] * allsystemsarego (~allsystem@ has joined #ceph
[9:24] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[9:59] * Administrator_ (~samsung@ has joined #ceph
[10:04] * samsung (~samsung@ Quit (Read error: Operation timed out)
[11:58] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) Quit (Ping timeout: 480 seconds)
[12:06] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) has joined #ceph
[12:57] * hubertchang (~hubertcha@ has joined #ceph
[13:01] <hubertchang> I want to use ceph filesystem as the enterprise git central repository. Does it make sense?
[13:27] * hubertchang (~hubertcha@ Quit (Ping timeout: 480 seconds)
[14:12] * Psi-Jack_ is now known as Psi-Jack
[15:02] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:17] * lxo (~aoliva@ Quit (Read error: Connection reset by peer)
[16:18] * lxo (~aoliva@ has joined #ceph
[16:23] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[16:46] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:50] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[16:58] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[17:02] * alexxy (~alexxy@ has joined #ceph
[17:20] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[17:24] * alexxy (~alexxy@ has joined #ceph
[17:38] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[17:48] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[17:48] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit ()
[17:51] * greglap (~Adium@ has joined #ceph
[17:55] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:16] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:39] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:48] * cmccabe (~cmccabe@ has joined #ceph
[18:51] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[19:00] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:00] <sagewk> meet now
[19:00] <sagewk> ?
[19:01] <cmccabe> ok
[19:03] * Administrator__ (~samsung@ has joined #ceph
[19:08] * Administrator_ (~samsung@ Quit (Read error: Operation timed out)
[19:13] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:13] <cmccabe> so, in the interest of resolving #903
[19:13] <cmccabe> tv: I'm getting this. Any clues?
[19:13] <cmccabe> tv: INFO ---- ---- timestamp=1302282689 job_abort_reason=[Errno 2] No such file or directory: '/usr/local/autotest/results/382-cmccabe/testvm-001.ceph.newdream.net/testvm-001.ceph.newdream.net/status.log' localtime=Apr 08 10:11:29 [Errno 2] No such file or directory: '/usr/local/autotest/results/382-cmccabe/testvm-001.ceph.newdream.net/testvm-001.ceph.newdream.net/status.log'
[19:14] <cmccabe> preceded by the message: "machine can not be tupled"
[19:16] <Tv> looking at the job..
[19:17] <Tv> cmccabe: you can't run a 6-node job with 1 node
[19:17] <cmccabe> tv: I don't remember adding more than 1 host...
[19:17] <cmccabe> tv: oh, I see.
[19:17] <cmccabe> tv: in the control file.
[19:18] <cmccabe> tv: if I try to put everything on the same node, is that going to work?
[19:18] <cmccabe> tv: because for the purposes of this test, that would be fine.
[19:18] <Tv> cmccabe: when in doubt, compare against 354
[19:19] <Tv> cmccabe: if you edit the control file, you can do all kinds of things
[19:19] <joshd> cmccabe: if you rebase on master, you can set the client type to 'rados' or something and not have ceph mounted
[19:19] <cmccabe> joshd: I will try that, thanks
[19:25] <cmccabe> joshd: I'm afraid there are no current control files using rados in the ROLES array.
[19:25] <cmccabe> joshd: so it would be something like this: ROLES = [ [ ...., rados.0 ] ]
[19:26] <joshd> cmccabe: no, you want to have a CLIENT_TYPES dict that says {'client.0': 'rados'}
[19:26] <cmccabe> joshd: are there docs for CLIENT_TYPES somewhere
[19:27] <joshd> cmccabe: check out skeleton.CephTest
[19:27] <joshd> the only types that matter in master are 'kclient' and 'cfuse', with 'kclient' as the default
[19:28] <joshd> in my branches, there's also an rbd type, which does class activation etc.
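
A minimal sketch of the control-file setting joshd describes, assuming a single-node ROLES layout. The role names in ROLES and the `client_type_for` helper are hypothetical and not taken from ceph-autotests; only the `CLIENT_TYPES` dict shape and the 'kclient' default come from the discussion above.

```python
# Hypothetical control-file fragment; the role names are illustrative only.
ROLES = [
    ['mon.0', 'mds.0', 'osd.0', 'client.0'],
]

# Per joshd: a dict mapping a client role to its type.
# 'rados' is the value suggested above for "do not mount ceph".
CLIENT_TYPES = {'client.0': 'rados'}

# Per joshd, 'kclient' is the default type in master.
DEFAULT_CLIENT_TYPE = 'kclient'


def client_type_for(role, client_types=CLIENT_TYPES):
    """Hypothetical helper: return the role's client type, or the default."""
    return client_types.get(role, DEFAULT_CLIENT_TYPE)


print(client_type_for('client.0'))
print(client_type_for('client.1'))
```

A role that is missing from the dict falls back to the default, so only the clients that differ need to be listed.
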
[19:29] <cmccabe> joshd: cmccabe@metropolis:~/src/ceph-autotests$ grep -rI kclient *
[19:29] <cmccabe> joshd: no results?
[19:29] <joshd> re-pull?
[19:30] <cmccabe> joshd: oh, yeah.
[19:30] <cmccabe> ok now I am seeing the different client types
[19:30] <joshd> see 6ffda8ee7a69d7140cae03d2f64a2fa2dcf0e661
[19:34] <cmccabe> joshd: so is there a problem if I try to run a control file that's too far "behind the times"
[19:35] <cmccabe> joshd: so tests/ceph_pybind/ceph_pybind.py will get placed on the web server by git, and fetched through that TEST_URL link.
[19:35] <joshd> cmccabe: yes
[19:35] <cmccabe> joshd: but something is executing the control file I paste in
[19:35] <cmccabe> joshd: I guess it's the autotest server itself
[19:35] <joshd> cmccabe: correct
[19:36] <cmccabe> joshd: so is that code changing at all when we change the skeleton, etc.
[19:36] <Tv> cmccabe: yeah that's why you're mostly supposed to use the control file that lives in the source right next to the test; they'll be updated at the same time
[19:37] <Tv> taking an old test run and re-running it might not work
[19:37] <cmccabe> tv: unless I'm on a branch... which I'm always supposed to be
[19:37] <Tv> cmccabe: if your developing a test, you're developing the control file too
[19:37] <Tv> and you just need a s/master/branchname/ in it
[19:38] <Tv> not the worst wart we have
[19:39] <Tv> s/your/you're/hatethat;
[19:41] <cmccabe> tv: I know I've been harping on this, but we really do need to find some way to avoid that copy-paste step
[19:41] <Tv> cmccabe: so much not the worst wart
[19:41] <cmccabe> tv: especially since the control file is already in git, it should be able to just fetch it
[19:42] <Tv> cmccabe: trust me i run more autotests than you do, and that is not the thing i end up cursing
[20:11] <cmccabe> tv: you said earlier there was a way to get stdout from a test, what might that be?
[20:12] <Tv> cmccabe: there's functions in autotest's utils module to run commands, many of them even default to that.. they called it TEE_TO_LOGS or something like that
[20:12] <Tv> utils.system is the simple one
[20:14] <cmccabe> looks like it's just a passthrough to os.system
[20:15] <cmccabe> tv: it looks like the python bindings are not installed on the cluster nodes.
[20:16] <Tv> cmccabe: the librados ones? they need to be in the binary tarball, you need to use them from there
[20:16] <cmccabe> tv: so just to be clear, we never run make install on the cluster nodes
[20:16] <Tv> cmccabe: every file created goes in a temp directory
[20:16] <Tv> cmccabe: the binary tarball is unpacked; there's no Makefile at that point anymore
[20:17] <cmccabe> why not make install DESTDIR=/tmp/foo ?
[20:17] <joshd> cmccabe: look at how skeleton.CephTest calls our binaries with the correct directory -- you probably need to set LD_LIBRARY_PATH as well, for librados
[20:17] <Tv> cmccabe: because it's not compiled there
[20:19] <cmccabe> on the build machine: make -j 16 && make install DESTDIR=/tmp/foo && tar cvzf files.tar.gz /tmp/foo
[20:19] <cmccabe> on the host machine: tar xvzf
[20:19] <cmccabe> sigh
[20:20] <Tv> cmccabe: that's a somewhat decent description of what the binary tarballs are
[20:23] <cmccabe> so how are we generating the tarballs
[20:23] * pombreda (~Administr@adsl-71-142-66-118.dsl.pltn13.pacbell.net) has joined #ceph
[20:23] <cmccabe> web.py?
[20:24] <Tv> teuthology.web yes, "web.py" the framework no
[20:25] <cmccabe> so how can we get the rados bindings into that tarball
[20:26] <Tv> oh wrong tarball sorry, you mean the ceph binary tarball
[20:26] <Tv> that comes from gitbuilder
[20:26] <cmccabe> but gitbuilder should just be doing make install?
[20:26] <Tv> you can use ceph_bin_url to point to a non-master branch
[20:26] <Tv> yes
[20:26] <cmccabe> ok. then the bindings should already be there.
[20:27] <cmccabe> so the problem is that we never actually install on the host machines
[20:28] <cmccabe> I do understand why you chose to do that-- buggy uninstalls don't always do what they're supposed to
[20:28] <cmccabe> probably the best solution is to add to the python library path
[20:29] <Tv> cmccabe: you shouldn't need anything beyond LD_PRELOAD and maybe PYTHONPATH
[20:29] <Tv> err LD_LIBRARY_PATH
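
A sketch of the LD_LIBRARY_PATH / PYTHONPATH approach suggested above, assuming the binary tarball has been unpacked under a temp directory. The function name `env_with_paths` and the example paths are hypothetical; the real directory layout depends on how the tarball was built.

```python
import os


def env_with_paths(lib_dir, python_dir, base_env=None):
    """Return a copy of the environment with lib_dir prepended to
    LD_LIBRARY_PATH and python_dir prepended to PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    for var, new_dir in (('LD_LIBRARY_PATH', lib_dir),
                         ('PYTHONPATH', python_dir)):
        old = env.get(var)
        # Prepend so the unpacked copy wins over any system-wide install.
        env[var] = new_dir + os.pathsep + old if old else new_dir
    return env


# Hypothetical locations inside a per-test temp directory.
env = env_with_paths('/tmp/foo/usr/local/lib',
                     '/tmp/foo/usr/local/lib/python2.6/dist-packages',
                     base_env={'PATH': '/usr/bin'})
print(env['LD_LIBRARY_PATH'])
print(env['PYTHONPATH'])
```

The returned dict can be passed as `env=` to `subprocess.Popen` or a similar call, so nothing outside the temp directory has to be modified.
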
[20:33] * pombreda (~Administr@adsl-71-142-66-118.dsl.pltn13.pacbell.net) Quit (Quit: Leaving.)
[20:50] <wido> joshd: I've revived osd 0,1,2,3 today, the PG's all became clean again, but, the fs is still blocking, seems a kernel client issue to me
[20:51] <wido> a df still works, but even cd'ing to the mount point freezes
[20:51] <wido> mds is still online btw
[21:02] <cmccabe> tv: so again, how is this bundle generated?
[21:02] <cmccabe> tv: by make install?
[21:04] <wido> the kclient doesn't seem to recognize the OSD change, the osdmap in /sys/kernel/debug/ceph still says that osd0,1,2,3 are down, but they are up again
[21:11] <Tv> cmccabe: yes
[21:12] <cmccabe> tv: so we can't know exactly where the python files will end up
[21:12] <Tv> cmccabe: huh
[21:12] <cmccabe> tv: because that will depend on the version of python on the machine running make install
[21:12] <Tv> oh yeah, if autotest & gitbuilder don't agree on python version, it'll fail
[21:12] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:12] <Tv> so let's not do that, then
[21:13] <cmccabe> tv: for example on my computer it is /usr/lib/python2.6/dist-packages/
[21:13] <Tv> i hereby solemnly promise to keep them running the same version
[21:13] <cmccabe> tv: why not just run make install on the target?
[21:13] <Tv> because running the compile on the target was too slow
[21:14] <cmccabe> tv: not the compile, just the make install
[21:14] <cmccabe> tv: never mind... that would assume that they have the same architecture
[21:14] <cmccabe> tv: at least you could untar a tar file under /, to put things where they belong
[21:14] <Tv> i don't want to contaminate anything outside the temp dir
[21:15] <cmccabe> tv: it's a test node.
[21:15] <cmccabe> tv: its goal in life is to be contaminated.
[21:15] <Tv> cmccabe: i don't want to make future tests less trustworthy
[21:15] <cmccabe> tv: anyway, as a workaround, I wrote a script to recurse through a directory and add any subdirectory that has a .py file.
[21:15] <Tv> huh
[21:15] <cmccabe> tv: add to PYTHONPATH
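
cmccabe's script is not shown in the log. The following is one way such a workaround might look, assuming it only needs to collect directories that directly contain a .py file; the function names are hypothetical.

```python
import os


def dirs_with_python_files(root):
    """Return, sorted, every directory under root (root included)
    that directly contains at least one .py file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.endswith('.py') for name in filenames):
            found.append(dirpath)
    return sorted(found)


def pythonpath_for(root):
    """Join the matching directories into a PYTHONPATH-style string."""
    return os.pathsep.join(dirs_with_python_files(root))
```

One caveat with this approach: it also puts the insides of packages on the path, so a module inside a package could shadow a top-level module of the same name.
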
[21:16] <Tv> what's so hard about figuring out the right python path?
[21:16] <cmccabe> cmccabe: tv: because that will depend on the version of python on the machine running make install
[21:16] <Tv> cmccabe: and i already promised that and the test machine will run the same version of python
[21:16] <cmccabe> tv: I don't even have a login on the gitbuilder machine; how can I even know what version of python that is?
[21:17] <cmccabe> tv: it will also depend on the flags you passed to configure on gitbuilder
[21:18] <Tv> $ wget -q http://ceph.newdream.net/gitbuilder/tarball/ref/origin_master.tgz -O-|tar tvzf -|grep python|head -1
[21:18] <Tv> drwxr-xr-x autobuild-ceph/autobuild-ceph 0 2011-04-08 11:07 ./usr/local/lib/python2.6/
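
The same check as Tv's `tar tvzf - | grep python | head -1` pipeline can be sketched with Python's standard tarfile module. This assumes the tarball has already been fetched into memory as bytes; the function name `first_python_member` is hypothetical.

```python
import io
import tarfile


def first_python_member(tar_bytes):
    """Return the name of the first member of a gzipped tarball whose
    path contains 'python', or None if there is no such member."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode='r:gz') as tar:
        for member in tar:
            if 'python' in member.name:
                return member.name
    return None
```

The returned path shows which pythonX.Y directory the build installed into, which is what PYTHONPATH on the test node has to match.
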
[21:26] * sjustlaptop (~sam@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[21:28] <cmccabe> tv: perhaps we should set LD_LIBRARY_PATH and PYTHONPATH for all tests
[21:28] <cmccabe> tv: I don't think this will be a unique problem, and I'd like to avoid a lot of copypasta
[21:30] <Tv> cmccabe: we can refactor it out once it becomes too commonplace
[21:30] <Tv> i always prefer that to trying to guess things in advance
[21:32] <sjust> wido: huh, at least recovery finished...
[21:37] <bchrisman> Is this a known osd crash? http://pastebin.com/KFyaM5dv
[21:38] <bchrisman> I included a few lines above it with the 'wrong node!' and 'presumably this is the same node!' messages.. but it's not clear that they're related
[21:38] <bchrisman> this is on startup after a cluster reboot
[21:38] <bchrisman> (soon after startup)
[21:39] <sjust> bchrisman: that may be a bug Greg fixed yesterday
[21:39] <bchrisman> ah..
[21:40] <bchrisman> okay.. will pull latest and rebuild
[21:40] <sjust> specifically 24caedc8f549eeeba48b2d4a44927ee16e65c42a
[21:42] <cmccabe> tv: http://autotest.ceph.newdream.net/results/390-cmccabe/testvm-001.ceph.newdream.net/testvm-001.ceph.newdream.net/status has "completed successfully"
[21:43] <Tv> cmccabe: hold on
[21:43] <cmccabe> tv: but the web interface says job 390 is "incomplete"
[21:49] <Tv> cmccabe: never seen that before
[21:51] <Tv> cmccabe: maybe it's related to how you're running the job on just one node, that's a somewhat untested codepath
[21:55] <cmccabe> tv: well, let me know what you want to do
[21:55] <cmccabe> tv: I guess I can try again with multiple nodes and see if that helps
[22:07] <Tv> cmccabe: in case of doubt, be less different
[22:11] * WesleyS (~WesleyS@ has joined #ceph
[22:11] * WesleyS (~WesleyS@ Quit ()
[22:53] <Tv> does this dmesg make sense to anyone: http://pastebin.com/raw.php?i=arv7273E
[22:53] <gregaf> cmccabe: looks like 6966c3eda74064e766e22c21203ea30f97910f32 broke cfuse
[22:54] <Tv> it's a sepia machine hanging on sync
[22:54] <Tv> ceph -s claims the cluster is ok
[22:54] <cmccabe> gregaf: ok, I'll take a look at it
[22:54] <cmccabe> gregaf: any particular test setup you suggest?
[22:54] <gregaf> cmccabe: killing cfuse no longer cleans up the mount properly
[22:55] <gregaf> just vstart.sh; sudo ./cfuse mnt; ls mnt; killall cfuse; ls mnt
[22:55] <gregaf> you have to sudo fusermount -u mnt to clean it all up again
[22:55] <gregaf> I bisected it down to that commit
[22:56] <cmccabe> gregaf: it could be that we need our signal handler to call FUSE's
[22:56] <gregaf> well the commit strips the signal handling out of the messenger, so it's something to do with that
[22:56] <gregaf> I'm not very familiar with our signal handling though and I don't want to bust it more ;)
[22:57] <cmccabe> it's all right, I'll get it.
[22:57] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:57] <gregaf> Tv: if a sync is hanging there's a very good chance the problem is on the server side
[22:58] <gregaf> the kclient is probably just waiting for a message to come back like it's supposed to
[22:58] <Tv> gregaf: well with anything else than osdtimeout=0 i see a kclient-side bug with repeating timeouts
[22:58] <Tv> gregaf: but this is with osdtimeout=0, so that particular one is out
[22:58] <gregaf> what do you mean bug with repeating timeouts?
[22:59] <Tv> gregaf: a big write sent in chunks never completes on time, gets restarted from scratch, doesn't complete in time, gets restarted from scratch, ...
[22:59] <Tv> but *any* help on debugging sync issues is welcome -- this is currently making most of our autotests useless
[22:59] <Tv> ceph claims to be healthy
[23:00] <gregaf> I'm not real familiar with how the kclient handles error conditions like timeouts
[23:00] <gregaf> but do you mean it sends the messages off and then waits x time and says something like "no reply, resending"?
[23:00] <gregaf> something like that would again just mean that the request is hanging on the server
[23:01] <gregaf> most hangs turn out to be a request that gets put on a server wait queue and doesn't get woken up for one reason or another
[23:01] <Tv> gregaf: i just stumbled on this code an hour ago, so i'm no expert, but it looks like a big write is sent to osd in page-size chunks, and e.g. bonnie causes a near 300-page write
[23:01] <Tv> gregaf: and the *whole thing* isn't fast enough, so gets started from scratch, over and over again
[23:01] <gregaf> but the server itself is still active and otherwise happy so it's considered healthy
[23:01] <Tv> but as i said, now i have osdtimeout=0, so that particular reason no longer applies
[23:02] <gregaf> okay
[23:02] <gregaf> so I'd look and see what outstanding requests remain for the MDS and the OSD
[23:02] <Tv> gregaf: can you be more explicit please?
[23:02] <gregaf> you can look at files in /proc to get a listing, though I confess I don't remember their names
[23:02] <Tv> cat /sys/kernel/debug/ceph/54b777db-4d51-f05a-35c2-3f89e28356bb.client4110/osdc
[23:02] <Tv> is empty
[23:03] <Tv> so is mdsc
[23:03] <Tv> monc says
[23:03] <gregaf> hrm
[23:03] <Tv> have mdsmap 18
[23:03] <Tv> have osdmap 5
[23:03] <Tv> so not useful
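
The manual check Tv runs above can be sketched as a small helper that reads the osdc, mdsc and monc files from a kernel client's debugfs directory. The file names come from the log; the function name is hypothetical, and reading debugfs typically requires root.

```python
import os


def pending_requests(client_debug_dir, names=('osdc', 'mdsc', 'monc')):
    """Read each named file in the client's debugfs directory and
    return {name: list of non-empty lines}."""
    result = {}
    for name in names:
        path = os.path.join(client_debug_dir, name)
        with open(path) as f:
            result[name] = [line.rstrip('\n') for line in f if line.strip()]
    return result
```

An empty list for osdc and mdsc corresponds to what Tv reports: the client has no outstanding OSD or MDS requests, so the hang is not explained by a request stuck on a server.
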
[23:03] <gregaf> so probably not a server bug then, how odd
[23:03] <gregaf> this is on a sync?
[23:03] <Tv> if it were a server bug, i'd expect to reproduce it with cfuse; i've never seen this with cfuse
[23:04] <Tv> yeah there's a sync that's been sitting there for >15min
[23:04] <gregaf> I did see one issue recently where it looked like the client was waiting on file caps that the server wasn't handing out, but I don't think a sync should trigger that
[23:06] <gregaf> sorry, I don't know the kclient that well, I'm out :/
[23:07] <Tv> i guess next week i'm just gonna start reading code... that seems to be the only way to climb this mountain
[23:08] <gregaf> yeah, pretty much
[23:08] <gregaf> :/
[23:37] <Tv> this might be a osd bug after all
[23:38] <Tv> i see outgoing tcp socket full of stuff
[23:38] <Tv> or maybe not
[23:38] <Tv> digging into kernelspace tcp....

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.