#ceph IRC Log


IRC Log for 2011-06-20

Timestamps are in GMT/BST.

[1:26] * MarkN (~nathan@ has joined #ceph
[1:57] * nolan (~nolan@phong.sigbus.net) has joined #ceph
[3:10] * yoshi (~yoshi@p24092-ipngn1301marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[5:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:08] * Nadir_Seen_Fire (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[6:15] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[8:33] * lidongyang (~lidongyan@ Quit (Remote host closed the connection)
[8:55] * Nadir_Seen_Fire (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Remote host closed the connection)
[9:13] * Yulya_ (~Yu1ya_@ip-95-220-159-77.bb.netbynet.ru) has joined #ceph
[9:37] * lidongyang (~lidongyan@ has joined #ceph
[11:54] * jbd (~jbd@ks305592.kimsufi.com) has left #ceph
[11:54] * jbd (~jbd@ks305592.kimsufi.com) has joined #ceph
[12:37] * allsystemsarego (~allsystem@ has joined #ceph
[12:39] * Juul (~Juul@ has joined #ceph
[12:43] * yoshi (~yoshi@p24092-ipngn1301marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[13:30] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:01] * jbd (~jbd@ks305592.kimsufi.com) Quit (Ping timeout: 480 seconds)
[14:06] * jbd (~jbd@ks305592.kimsufi.com) has joined #ceph
[14:38] * Juul (~Juul@ Quit (Quit: Leaving)
[14:50] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[14:51] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[14:59] * aliguori (~anthony@ has joined #ceph
[16:21] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[16:24] <stingray> sage: I have a directory, which has size 0 although it's not really 0. Is there a way to tell mds to recalculate stats for directories etc. ?
[16:29] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[17:33] * aliguori (~anthony@ Quit (Ping timeout: 480 seconds)
[17:52] * lx0 (~aoliva@ Quit (Read error: Connection reset by peer)
[17:53] * lx0 (~aoliva@9YYAABKDG.tor-irc.dnsbl.oftc.net) has joined #ceph
[17:55] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:04] <sagewk> stingray: not yet
[18:05] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:06] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[18:07] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[18:09] * aliguori (~anthony@ has joined #ceph
[18:10] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[18:12] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[18:15] * slang (~slang@chml01.drwholdings.com) has joined #ceph
[18:18] <yehudasa> swndel: hi
[18:19] <yehudasa> swendel_: you had some question?
[18:28] <stingray> sagewk: okay
[18:29] * stingray is writing ceph configs for systemd
[18:29] <sagewk> stingray: nice
[18:30] <sagewk> slang: around?
[18:30] <stingray> sagewk: the basic stuff is easy, I'll post it somewhere
[18:31] <slang> yep
[18:31] <Tv> const fullbit operator=(const fullbit& other);
[18:31] <Tv> sagewk: shouldn't that be fullbit& operator=(const fullbit& other)
[18:32] <sagewk> yeah, tho it doesn't matter here; the prototype is there to force a link error if something tries to use it.
[18:32] <Tv> ahhhh ok i see what you mean in the email
[18:32] <Tv> intentional link error
[18:33] <sagewk> common c++ trick. turns out that didn't solve the problem, though; you can't reclaim a raw pointer from a shared_ptr bc you don't (necessarily) know how many refs there are. I pushed another branch with a somewhat different solution
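
Note: the "intentional link error" trick discussed above can be sketched roughly as follows. This is a minimal illustration, not the actual Ceph code: `fullbit_like` and `name_length` are hypothetical names, and the copy operations are made private here, which is an assumption about one common form of the idiom.

```cpp
#include <string>

// Hypothetical stand-in for the type under discussion; not the real Ceph class.
class fullbit_like {
public:
    explicit fullbit_like(const std::string& n) : name(n) {}
    const std::string& get_name() const { return name; }

private:
    // Declared but never defined. Outside callers that try to copy get a
    // compile error (private access); members or friends that try to copy
    // compile, but fail at link time with an undefined reference.
    fullbit_like(const fullbit_like& other);
    fullbit_like& operator=(const fullbit_like& other);

    std::string name;
};

// Passing by const reference never copies, so this compiles and links.
std::string::size_type name_length(const fullbit_like& f) {
    return f.get_name().size();
}
```

Because the declarations are never called by well-behaved code, their return types have no effect, which matches sagewk's remark that the prototype only has to exist.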
[18:33] * Juul (~Juul@tahoe0.imm.dtu.dk) has joined #ceph
[18:35] <slang> sagewk: patch I can try?
[18:35] <sagewk> mds_metablob_fix
[18:38] <Tv> sagewk: i think the copyable requirement is there because internally map is a sorted data structure and does something like rb-tree rotations to keep itself sorted
[18:45] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:47] <Tv> network to sepia is down again :(
[18:53] * alexxy (~alexxy@ has joined #ceph
[18:54] <Tv> oh sepia32 works but 70 doesn't.. hurmmm
[18:59] * Juul (~Juul@tahoe0.imm.dtu.dk) Quit (Quit: Leaving)
[19:00] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:02] <sagewk> tv: i'd hope it could just shuffle the pointers and not copy the containing nodes. the errors i was hitting were just with inserting items, but who knows what happens after that
[19:08] * verwilst (~verwilst@dD576F64D.access.telenet.be) has joined #ceph
[19:10] <Tv> sagewk: seems like they decided to make the using pointers part a decision to be made by the caller
[19:13] * cmccabe (~cmccabe@ has joined #ceph
[19:14] <gregaf1> well the maps default construct items and then use in-place copies with our typical use pattern, maybe we could get around that using insert
[19:16] <sagewk> standup!
[19:16] <Tv> well one question to ask, does anything need it to be ordered?
[19:17] <cmccabe> I don't know if there is any difference between std::map::insert and std::map::operator[] as far as constructors go
[19:17] <cmccabe> well, I guess there is one
[19:17] <cmccabe> insert does not require a default no-args constructor
[19:18] <cmccabe> in both cases, you are invoking the copy constructor, though
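
Note: the difference cmccabe describes between `insert` and `operator[]` can be sketched like this. `Tracked` and `copies_for_insert` are hypothetical names introduced for illustration; the exact number of copies depends on the standard library and language version, so the sketch only reports the count.

```cpp
#include <map>
#include <utility>

// Hypothetical value type: no default constructor, and it counts its copies.
struct Tracked {
    static int copies;
    int v;
    explicit Tracked(int x) : v(x) {}
    Tracked(const Tracked& o) : v(o.v) { ++copies; }
};
int Tracked::copies = 0;

// Inserts one element with insert() and returns how many times the copy
// constructor ran. This works even though Tracked cannot be default-constructed.
int copies_for_insert() {
    std::map<int, Tracked> m;
    Tracked::copies = 0;
    m.insert(std::make_pair(1, Tracked(42)));
    return Tracked::copies;
}

// By contrast, `m[1] = Tracked(42);` would not compile here, because
// operator[] must be able to default-construct the mapped value first.
```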
[19:19] <sagewk> bchrisman: there?
[19:30] <sagewk> cmccabe: i seem to recall this being in the c++ faq, been a while tho. in any case, i would be surprised if it copies nodes around _after_ insertion, but who knows.
[19:31] <cmccabe> I think it is allowed to copy as much as it wants
[19:31] <Tv> if you force it to always use pointers (behind your back), you slow the simple int->int etc cases
[19:31] <cmccabe> I know that's true with std::vector at least
[19:31] <Tv> making the pointers explicit makes that cost optional and controllable by the caller
[19:31] <Tv> as far as i can see
[19:32] <cmccabe> there are some operations in vector that actually trigger a cascade of copy constructors-- I think sort is one
[19:32] <bchrisman> sagewk: I have a monday morning planning mtg that'll likely run into those on mondays
[19:32] <cmccabe> I mean-- it makes sense
[19:33] <sagewk> tv: for vector, yeah, but not for an rbtree.. in that case inserting new elements relinks pointers between nodes; other items shouldn't get copied. internally there's some struct foo { mytype val; foo *left, *right; } or whatever, and once foo is allocated it isn't copied (unless you copy the whole container or something)
[19:33] <sagewk> but anyway... :)
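
Note: whatever the underlying structure is, the C++ standard does guarantee that inserting into a `std::map` leaves references and iterators to existing elements valid, so an element already in the map is not moved by later insertions. A small check of that guarantee, using the hypothetical helper name `element_address_stable`:

```cpp
#include <map>

// Records the address of one mapped value, inserts many more elements,
// then confirms the address and the value are unchanged.
bool element_address_stable() {
    std::map<int, int> m;
    m[0] = 100;
    const int* before = &m[0];
    for (int i = 1; i < 1000; ++i) {
        m[i] = i;
    }
    return before == &m[0] && *before == 100;
}
```

The same check on a `std::vector` would not be safe, since growing a vector may reallocate and copy its elements.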
[19:34] <Tv> sagewk: well i don't know what data structure STL map actually uses.. i saw some conversations talk about "rotations"
[19:35] <sjust> Tv: that sounds rather tree like to me :)
[19:35] <sagewk> presumably the red/black node pointer shuffling
[19:35] <cmccabe> if you want to get pedantic, and I know I do, std::map is not a data structure, it's an interface
[19:35] <Tv> yeah it does, but the api doesn't reveal that
[19:35] <sagewk> hehe
[19:35] <cmccabe> don't assume random things about the implementation
[19:36] <sagewk> in any case, using map<mykey, boost::shared_ptr<myval> > does the trick.
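
Note: a minimal sketch of the `map<mykey, shared_ptr<myval> >` approach. It uses `std::shared_ptr` from `<memory>`, which assumes a C++11 or later compiler; the conversation itself refers to `boost::shared_ptr` and `std::tr1::shared_ptr`. `Blob`, `BlobMap` and `use_count_after_insert` are hypothetical names.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical non-copyable value type.
struct Blob {
    explicit Blob(const std::string& d) : data(d) {}
    std::string data;
private:
    Blob(const Blob&);             // declared, never defined
    Blob& operator=(const Blob&);  // declared, never defined
};

typedef std::map<int, std::shared_ptr<Blob> > BlobMap;

// The map stores and copies only the shared_ptr handle; the Blob itself
// is allocated once and never copied.
long use_count_after_insert() {
    BlobMap m;
    std::shared_ptr<Blob> p(new Blob("payload"));
    m[7] = p;
    return p.use_count();  // two owners: p and the map entry
}
```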
[19:36] <Tv> well yes, i'm just saying assuming it's an rbtree and thus can do pointer shuffle is not safe either
[19:36] <sagewk> there may be a lower-overhead variant of shared_ptr, but it's probably not worth worrying about?
[19:36] <Tv> it might be e.g. a skip list
[19:36] <cmccabe> sagewk: tr1::shared_ptr is the new standardized version of boost::shared_ptr
[19:37] <sagewk> oh ok
[19:37] <sagewk> assuming it works, i need to clean up a bit before putting in master anyway
[19:37] <cmccabe> you need #include <tr1/memory> to get std::tr1::shared_ptr
[19:37] * swendel (~swendel@p578D18A2.dip.t-dialin.net) has joined #ceph
[19:37] <cmccabe> but you don't need any libraries... it is part of the base C++ now
[19:39] * swendel_ (~swendel@p5DC6D0F7.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[19:41] <cmccabe> so std::map does give complexity formulae for its operations
[19:41] <cmccabe> I think all the STL classes do
[19:41] <cmccabe> so you can know that insertion is O(log n)
[19:42] <cmccabe> but you can't actually know that it's using a red-black tree as opposed to an AVL tree
[19:46] <cmccabe> also, there's a replacement for the old GNU hash table thing called std::tr1::unordered_set
[19:47] <cmccabe> it's probably not worth replacing the old usages, but for new code, it would be nicer to use the standard thing.
[19:48] <cmccabe> and the tradeoffs are pretty much what you would expect. std::map is slower, but more space-efficient than std::tr1::unordered_set.
[19:48] <cmccabe> https://hbfs.wordpress.com/2009/10/13/cargo-cult-programming-part-1/#more-1534
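
Note: on Tv's earlier question of whether anything needs the container to be ordered, the practical difference can be sketched as below. The sketch uses `std::unordered_map` (C++11) rather than the `std::tr1::unordered_set` named above, purely for symmetry with `std::map`; `ordered_keys` and `hashed_contains` are hypothetical helpers.

```cpp
#include <map>
#include <unordered_map>
#include <vector>

// std::map iterates in key order regardless of insertion order.
std::vector<int> ordered_keys() {
    std::map<int, char> m;
    m[3] = 'c';
    m[1] = 'a';
    m[2] = 'b';
    std::vector<int> keys;
    for (std::map<int, char>::const_iterator it = m.begin(); it != m.end(); ++it) {
        keys.push_back(it->first);
    }
    return keys;
}

// The hash-based container supports lookup but promises no iteration order.
bool hashed_contains(int k) {
    std::unordered_map<int, char> h;
    h[3] = 'c';
    h[1] = 'a';
    h[2] = 'b';
    return h.find(k) != h.end();
}
```

Code that only ever looks keys up can use either container; code that walks the keys in sorted order depends on `std::map`.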
[19:56] <stingray> okay...
[20:01] <sagewk> tv: those sepia nodes should be back up?
[20:02] <Tv> sagewk: looks so
[20:07] <yehudasa> stingray: I pushed a couple of fixes for the rbd offsets to the master branch
[20:07] <yehudasa> sadly the fix is on both the osd and on the rbd client
[20:07] <yehudasa> rbd client <-- librbd
[20:12] <swendel> hi guys
[20:12] <yehudasa> hi swendel
[20:12] <swendel> one question about btrfs and ceph
[20:12] <yehudasa> ok
[20:13] <swendel> i'm trying to create an osd with mkcephfs on a btrfs subvolume and can't find much info about it
[20:13] <swendel> is that scenario possible?
[20:16] <swendel> in my ceph.conf i have set btrfs path = /datastore
[20:17] <swendel> but when i try mkcephfs -c /etc/ceph/ceph.conf --allhosts --mkbtrfs -k /etc/ceph/admin.keyring it fails
[20:17] <swendel> any idea??
[20:17] <swendel> ?
[20:17] <yehudasa> swendel: it should work, what does it fail on?
[20:18] <swendel> it just prints btrfs --help
[20:18] <swendel> umount: /datastore/cosd: not mounted
[20:18] <swendel> failed: '/sbin/mkcephfs -d /tmp/mkcephfs.8z50PTz80f --prepare-osdfs osd.0'
[20:19] <yehudasa> what version of the btrfs utils are you using?
[20:19] <swendel> Btrfs Btrfs v0.19
[20:19] <yehudasa> what kernel?
[20:19] <swendel> Linux hv 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11 03:49:04 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
[20:23] <yehudasa> swendel: can you add -v to the command line?
[20:24] <swendel> for mkcephfs?
[20:24] <yehudasa> yes
[20:25] <swendel> it doesnt change anything
[20:25] <swendel> sudo mkcephfs -c /etc/ceph/ceph.conf --allhosts --mkbtrfs -k /etc/ceph/admin.keyring -v
[20:26] <yehudasa> does it dump more stuff at least?
[20:26] <swendel> no :(
[20:26] <yehudasa> hmm.. actually it wouldn't
[20:27] <swendel> ceph Version: 0.29.1-1natty
[20:28] <yehudasa> can you run 'sudo btrfsctl -a'
[20:28] <yehudasa> ?
[20:29] <swendel> yes, Scanning for Btrfs filesystems
[20:29] <swendel> /dev/sda 7.3T 3.7G 7.3T 1% /datastore
[20:30] <yehudasa> ok, brb
[20:30] <yehudasa> you can try checking to see on what line it fails exactly in mkcephfs in the mean time
[20:31] <swendel> ok
[20:31] <swendel> umount: /datastore: not mounted
[20:31] <swendel> --- hv# /sbin/mkcephfs -d /tmp/mkcephfs.y17IfMsBjk --prepare-osdfs osd.0
[20:32] <swendel> /usr/bin/cconf -c /etc/ceph/ceph.conf -n osd.0 "user"
[20:32] <swendel> failed: '/sbin/mkcephfs -d /tmp/mkcephfs.y17IfMsBjk --prepare-osdfs osd.0'
[20:45] <swendel> yehudasa: if i use "btrfs path" and not want file-based journal, i should comment out "osd journal =", or?
[20:47] <swendel> if yes, the output is now different, but didnt create the fs still
[21:04] <Tv> cmccabe: can you answer swendel? i think you have the best clue on the config behavior
[21:06] <yehudasa> swendel: in addition to the 'set -e' in mkcephfs, can you add 'set -x'?
[21:06] <cmccabe> swendel: if you don't want a journal, you should put "osd journal = " in your configuration
[21:06] <yehudasa> that'll give us better idea on what fails exactly
[21:06] <cmccabe> swendel: I believe no journal is also the default
[21:06] <yehudasa> cmccabe: I think he wants a journal
[21:07] <cmccabe> swendel: in that case, you should set it to a path that the daemon can open and write to
[21:07] <cmccabe> swendel: I think mkcephfs does cosd --mkjournal at some point
[21:07] <cmccabe> I don't really understand why you would want both a journal and btrfs
[21:08] <cmccabe> based on my understanding, the journal was implemented because ext3/4/etc didn't support transactions
[21:09] <cmccabe> that would actually be a good question for sage... does the journal provide any benefits except transactionality
[21:09] <yehudasa> cmccabe: it's orthogonal
[21:09] <yehudasa> cmccabe: yes
[21:10] <cmccabe> so why would I want a journal on btrfs then?
[21:10] <cmccabe> not that I don't appreciate terse answers :)
[21:10] <yehudasa> cmccabe: yes, it does provide extra benefits, e.g., turns random writes into serial writes
[21:11] <cmccabe> so in other words, it could have performance advantages
[21:11] <yehudasa> cmccabe: the btrfs transaction is just a way to implement the journal
[21:12] <swendel> hey guys, yes, confusing. i have to say after a few tries i think i messed it up a bit. ok, now i'm clear that you use the journaling of btrfs
[21:12] <yehudasa> we're not doing it via transactions anymore (iirc), but via snapshotting
[21:13] <cmccabe> I actually very much doubt that our userspace stuff is going to outperform the kernel's own write coalescing
[21:13] <cmccabe> but I guess you're welcome to try and see
[21:14] <swendel> i get the example config and have one more try
[21:16] <cmccabe> yehudasa: yes, it does seem like the code in FileStore.cc is using btrfs snapshots rather than transactions
[21:21] <yehudasa> cmccabe: yeah, transactions were awfully awkward and didn't work well.
[21:22] <cmccabe> you would think that snapshots would be more expensive than transactions
[21:23] <cmccabe> but I guess hardly anyone creates huge btrfs transactions, whereas lots and lots of people use snapshots
[21:23] <cmccabe> so maybe that code has gotten more optimization and work than the transaction stuff
[21:26] <yehudasa> cmccabe: actually transactions are used internally in btrfs, we just exposed it to userspace
[21:27] <cmccabe> I think that ioctl is still privileged, though?
[21:27] <yehudasa> yeah, but that didn't mean much
[21:28] <cmccabe> yehudasa: so the often-quoted example of transactionally updating configuration files can't really be done with btrfs transactions, unless something is sudo
[21:28] <yehudasa> I'm not sure about often-quoted
[21:29] <cmccabe> well, there have been lots and lots of discussions about fsync and configuration files.
[21:30] <yehudasa> you do need some privileges for user transactions there anyway
[21:30] <cmccabe> not for fsync
[21:30] <cmccabe> or for sync either
[21:30] <yehudasa> you can't really control when the filesystem is going to sync
[21:31] <cmccabe> actually a malicious user could probably kill I/O performance pretty effectively by running while true; do sync; done
[21:31] <yehudasa> anyway, we had trouble with recovering from when an operation failed mid-transaction
[21:32] <cmccabe> yeah, there was a long thread on fsdevel about checking error codes in the transaction code
[21:33] <yehudasa> we actually submitted something to group system calls, but the snapshots seemed a better approach
[21:33] <cmccabe> apparently a lot of spurious BUG_ONs got added and removing them turned out to be very challenging
[21:33] <cmccabe> a good example of why you should build in error handling from day 1
[21:34] <cmccabe> yeah, I was reading briefly about the grouped system calls thing
[21:35] <cmccabe> BTRFS_IOC_USERTRANS?
[21:39] <yehudasa> yeah
[22:05] <stingray> well
[22:05] <stingray> despite one systemd bug, seems to be working
[22:06] * stingray -> home
[22:26] <slang> sagewk: that branch fixes the crashes I was seeing with the mds
[22:26] <slang> sagewk: no more errors in valgrind (for the mds) either
[22:29] * Yulya__ (~Yu1ya_@ip-95-220-242-20.bb.netbynet.ru) has joined #ceph
[22:36] * Yulya_ (~Yu1ya_@ip-95-220-159-77.bb.netbynet.ru) Quit (Ping timeout: 480 seconds)
[22:49] <sagewk> slang: excellent, thanks for testing!
[22:50] <slang> sagewk: sure thing
[23:32] <Tv> pushed two teuthology commits: fix bug that marked all test runs failed due to seeing phantom core dumps, archive syslog
[23:33] <Tv> $ bzcat p/remote/ubuntu@sepia70.ceph.dreamhost.com/syslog/kern.log.bz2 |head -1
[23:33] <Tv> 2011-06-20T13:30:56.292878-07:00 sepia70 kernel: imklog 4.2.0, log source = /proc/kmsg started.

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.