#ceph IRC Log


IRC Log for 2012-05-23

Timestamps are in GMT/BST.

[0:04] * mkampe (~markk@aon.hq.newdream.net) has joined #ceph
[0:05] * yehudasa_ (~yehudasa@aon.hq.newdream.net) Quit (Quit: Ex-Chat)
[0:06] * ThoughtCoder (~ThoughtCo@60-240-78-43.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[0:12] * sjust1 (~sam@aon.hq.newdream.net) Quit (Remote host closed the connection)
[0:13] * gregaf1 (~Adium@aon.hq.newdream.net) Quit (Remote host closed the connection)
[0:16] * Tv_ (~tv@aon.hq.newdream.net) Quit (Quit: Tv_)
[0:24] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[0:24] * Ryan_Lane (~Adium@ has joined #ceph
[0:33] * MarkDude (~MT@ip-64-134-236-53.public.wayport.net) Quit (Quit: Leaving)
[0:39] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[1:01] * joao (~JL@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:07] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:09] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[1:10] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[1:11] <CristianDM> I have an error with openstack
[1:11] <CristianDM> global name 'rados' is not defined
[1:11] <CristianDM> Any idea?
[1:12] * joao (~JL@aon.hq.newdream.net) has joined #ceph
[1:13] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[1:13] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[1:15] * Theuni (~Theuni@p57A088AF.dip0.t-ipconnect.de) has joined #ceph
[1:23] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[1:24] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[1:29] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[1:29] * Tv_ (~tv@aon.hq.newdream.net) has joined #ceph
[1:31] * Theuni (~Theuni@p57A088AF.dip0.t-ipconnect.de) Quit (Quit: Leaving.)
[1:31] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[1:33] <joshd> CristianDM: that's from glance, right? sounds like it's not finding the rados python bindings
[1:36] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:37] <CristianDM> Wops
[1:37] <CristianDM> What package do I need?
[1:37] <joshd> python-ceph
[1:38] <CristianDM> Wops...I am a stupid :(
[1:39] <CristianDM> Good, works.
[1:40] <CristianDM> joshd: Does the cache work well in 0.47.1? Can I use it now?
[1:43] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[1:45] * jtang (~jtang@089-101-195084.ntlworld.ie) Quit (Quit: Leaving.)
[1:50] <Qten> gregaf: thanks for that :)
[1:57] * jtang (~jtang@089-101-195084.ntlworld.ie) has joined #ceph
[1:58] * jtang (~jtang@089-101-195084.ntlworld.ie) Quit ()
[2:08] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Quit: Ex-Chat)
[2:13] <joshd> CristianDM: there's one more bug fix for it that should be released soon (11030793fae4226352b67b1c806beae51e88150a)
[2:14] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[2:22] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[2:25] * Tv_ (~tv@aon.hq.newdream.net) Quit (Quit: Tv_)
[2:27] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:53] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[3:24] * nhm (~nh@ Quit (Ping timeout: 480 seconds)
[3:30] * CristianDM (~CristianD@host217.190-230-240.telecom.net.ar) Quit (Ping timeout: 480 seconds)
[3:30] * CristianDM (~CristianD@host217.190-230-240.telecom.net.ar) has joined #ceph
[3:47] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:53] * yoshi (~yoshi@p3167-ipngn3601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:53] * joao (~JL@aon.hq.newdream.net) Quit (Quit: Leaving)
[4:13] * nhm (~nh@ has joined #ceph
[4:16] * cattelan (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[4:25] * cattelan (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[4:59] * renzhi (~renzhi@ has joined #ceph
[5:00] <renzhi> hijacker, I'm writing an app using librados to connect to the cluster. The app has to run as root. How do I run as normal user instead?
[5:01] <renzhi> sorry, I meant to say hi, but don't know the text became hijacker...
[5:03] * cattelan (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[5:05] <dmick> (There's a hijacker in this room, your chat client autocompleted the name as if it were a shoutout to that user)
[5:05] <dmick> is your cluster using cephx?
[5:06] <dmick> either way, the issue is going to be reading the .conf file or the keyring files, probably
[5:07] <dmick> I was, just this second, interacting with librados through rados.py as a normal user
[5:08] <dmick> renzhi: ^ make sense?
[5:10] <renzhi> dmick: sorry, just walked away
[5:10] <renzhi> dmick: yes, the cluster is using cephx
[5:10] <renzhi> but I have my conf file and keyring for the user, the user can read all these files
[5:10] <dmick> hm
[5:11] <dmick> what error do you get running as a normal user?
[5:11] * cattelan (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[5:12] <renzhi> when I do : err = rados_connect(cluster);
[5:12] <renzhi> I get : No such file or directory
[5:12] <renzhi> I have to run as root, and everything just works
[5:13] <dmick> you may get a quick clue by running the application under strace to see which file it can't find
[5:13] <dmick> something like strace -e trace=open <your application> should say interesting things
[5:14] <renzhi> I got it, thanks. It's trying to get the keyring under /etc/ceph by default.
[5:15] <renzhi> I need to load it from another directory.
[5:15] <dmick> it should be using whatever your ceph.conf file says to use
[5:15] <dmick> (I think)
[5:16] <renzhi> I had a mistake in my ceph.conf file, I missed one line when I did my adaptation. Thanks :)
[5:17] <dmick> yw
[5:18] <renzhi> got it to work now :D
[5:19] <dmick> good show!
[5:21] <elder> Too right!
[5:21] <dmick> Pip pip!
[5:22] <dmick> elder do you happen to know offhand why root solved this?
[5:22] <elder> Sorry, wasn't paying attention, old bean.
[5:23] <elder> But it clearly is permission based if root got through.
[5:23] <elder> I presume no selinux is in place or anything.
[5:23] <dmick> renzhi: was the 'wrong' keyring file readable only by root, maybe?
[5:23] <renzhi> dmick: yeah
[5:23] <dmick> ah. so the same ring was in two places, but with different perms. that makes sense
[5:24] <renzhi> I had a typo in my keyring param in ceph.conf
[5:24] <elder> Well there you go.
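A quick way to confirm a permissions problem like the one above, alongside strace, is to check which of the configured files the current user can actually read. The following is a minimal sketch; the helper name and the example paths are hypothetical and not part of librados.

```python
import os


def unreadable_paths(paths):
    """Return the paths that the current user cannot read.

    A path that does not exist is also reported, which matches the
    "No such file or directory" symptom seen from rados_connect().
    """
    return [p for p in paths if not os.access(p, os.R_OK)]


if __name__ == "__main__":
    # Hypothetical locations; substitute whatever your ceph.conf names.
    candidates = ["/etc/ceph/ceph.conf", "/etc/ceph/keyring"]
    for path in unreadable_paths(candidates):
        print("cannot read:", path)
```

Running this as the normal user and again as root should show the difference if a keyring is readable only by root.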
[5:28] * chutzpah (~chutz@ Quit (Quit: Leaving)
[5:30] <dmick> elder: do you happen to know offhand how to translate config file section-and-name into name?
[5:31] <elder> Ceph config?
[5:31] <elder> If so, no.
[5:31] <dmick> I know I can get [global] conf vars by replacing spaces with underscores (and yes)
[5:32] <elder> Have you looked at ceph_conf?
[5:32] <elder> (Just grasping at straws here)
[5:33] <dmick> .cc? yeah, looking now
[5:33] <elder> Yes.
[5:39] <dmick> ah ok. I think when I create a rados handle, I specify who I am "client-wise"
[5:39] <dmick> and that controls what options I see
[5:39] <dmick> the default is 'client', so I can see things in [client] or [global]
[5:40] <elder> Sounds reasonable. So you were seeing only a subset, and you expected to see everything or something?
[5:40] <dmick> well I didn't know how to express "show me [osd.1]'s 'host' key"
[5:41] <dmick> and I think the answer is "you can't get there from here"
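The behaviour dmick describes (a handle created as 'client' sees options from [client] and [global], with spaces in option names replaced by underscores) can be modelled roughly as below. This is a toy sketch using Python's configparser, not the lookup code in librados or config.cc; the sample file contents, the function names, and the exact search order are assumptions.

```python
import configparser

# Hypothetical ceph.conf contents, for illustration only.
SAMPLE_CONF = """
[global]
auth supported = cephx
keyring = /etc/ceph/keyring

[client]
keyring = /home/user/ceph/client.keyring

[osd.1]
host = node1
"""


def normalize(key):
    # "auth supported" and "auth_supported" are treated as the same option.
    return key.strip().replace(" ", "_")


def load_conf(text):
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    return {
        section: {normalize(k): v for k, v in parser.items(section)}
        for section in parser.sections()
    }


def lookup(conf, key, name="client"):
    """Look up key as seen by entity `name`.

    Assumed search order: the exact name (e.g. "osd.1"), then its type
    (e.g. "osd"), then "global". Other sections are never consulted.
    """
    sections = [name]
    if "." in name:
        sections.append(name.split(".", 1)[0])
    sections.append("global")
    for section in sections:
        value = conf.get(section, {}).get(normalize(key))
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    conf = load_conf(SAMPLE_CONF)
    print(lookup(conf, "keyring"))               # the [client] value
    print(lookup(conf, "host"))                  # None: [osd.1] is not visible
    print(lookup(conf, "host", name="osd.1"))    # node1
```

In this model, "show me [osd.1]'s 'host' key" only works when the lookup is made under the name osd.1, which is consistent with the "you can't get there from here" conclusion for a default client handle.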
[5:41] <elder> ceph-conf -s osd.1 --lookup <key> ?
[5:42] <dmick> yeah, except I'm trying to do that from librados/rados.py
[5:43] <elder> Where is librados/rados.py?
[5:43] <dmick> heh, sorry
[5:43] <dmick> librados is the C/CC version of the library
[5:43] <dmick> rados.py is the python binding to it
[5:44] <dmick> pybind/rados.py
[5:44] <dmick> and librados/*
[5:44] <elder> OK got it
[5:46] <elder> OK, that's a pretty big file. What portion are you working in?
[5:46] <dmick> I'm playing with simple cluster operations
[5:46] <dmick> right now, was playing with conf_get()
[5:58] <elder> Well the best clue I've found is there's something called get_val_from_conf_file(), in src/common/config.cc, but I haven't really gotten any farther than that.
[5:59] <elder> I'm headed to bed though.
[5:59] <dmick> yeah, it's late, I should quit too
[5:59] <dmick> sorry, didn't mean to send you on a chase, just asking if you might know offhand
[6:00] <elder> No, I was waiting for a build and then a test to complete anyway.
[6:00] <dmick> not a big deal, off experimenting with read() and stat() and stuff now
[6:00] <elder> So it gave me something more interesting to do than scan my bookmarks for something I haven't looked at in a while.
[6:00] <dmick> ;)
[6:00] <elder> Talk to you tomorrow. Good luck.
[6:01] <dmick> tnx
[6:21] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[6:51] * goedi (dfdsav@ Quit (Ping timeout: 480 seconds)
[6:52] * goedi (goedi@ has joined #ceph
[6:54] * nhm (~nh@ Quit (Ping timeout: 480 seconds)
[7:02] * cattelan is now known as cattelan_away
[7:33] * Theuni (~Theuni@p57A088AF.dip0.t-ipconnect.de) has joined #ceph
[7:46] * Ryan_Lane (~Adium@c-98-210-205-93.hsd1.ca.comcast.net) has joined #ceph
[7:49] * Theuni (~Theuni@p57A088AF.dip0.t-ipconnect.de) Quit (Quit: Leaving.)
[7:52] * Theuni (~Theuni@p57A088AF.dip0.t-ipconnect.de) has joined #ceph
[8:15] * Theuni (~Theuni@p57A088AF.dip0.t-ipconnect.de) Quit (Quit: Leaving.)
[8:56] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[9:11] * yanzheng (~zhyan@ has joined #ceph
[9:12] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[9:15] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:24] * Theuni (~Theuni@i59F73B1F.versanet.de) has joined #ceph
[9:24] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) Quit (Remote host closed the connection)
[9:33] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:37] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[9:40] * MK_FG (~MK_FG@ Quit (Remote host closed the connection)
[9:44] * MK_FG (~MK_FG@ has joined #ceph
[9:55] * BManojlovic (~steki@ has joined #ceph
[9:56] * s[X]_ (~sX]@60-241-151-10.tpgi.com.au) has joined #ceph
[10:01] * steki-BLAH (~steki@bojanka.net) has joined #ceph
[10:04] * yanzheng (~zhyan@ Quit (Remote host closed the connection)
[10:04] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[10:04] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[10:05] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[10:07] * gohko_ (~gohko@natter.interq.or.jp) has joined #ceph
[10:07] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[10:21] * yanzheng (~zhyan@ has joined #ceph
[10:37] * s[X]_ (~sX]@60-241-151-10.tpgi.com.au) Quit (Remote host closed the connection)
[10:37] * ao (~ao@ has joined #ceph
[10:45] * BManojlovic (~steki@ has joined #ceph
[10:50] * steki-BLAH (~steki@bojanka.net) Quit (Ping timeout: 480 seconds)
[10:52] * s[X]_ (~sX]@60-241-151-10.tpgi.com.au) has joined #ceph
[10:53] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[10:57] * yanzheng (~zhyan@ Quit (Remote host closed the connection)
[11:02] * s[X]_ (~sX]@60-241-151-10.tpgi.com.au) Quit (Remote host closed the connection)
[12:09] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[12:32] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[12:43] * raw (~raw@ has joined #ceph
[12:47] <raw> on the wiki i can read "Creating a new file system on an ext4 partition that already contains data, will invoke rm -rf to delete the data." does that mean it will wipe the whole partition or just everything below the osd data path?
[12:48] <raw> i have a ext3 mounted at /home and for testing i want to use /home/ceph/$name as storage space
[12:51] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[12:53] * renzhi (~renzhi@ Quit (Quit: Leaving)
[12:53] <darkfader> raw: just don't, please
[12:53] <darkfader> but you can look inside mkcephfs
[12:54] <darkfader> you're testing a distributed filesystem and it would be really good to test that in dedicated space
[12:54] <darkfader> one can always try
[12:54] <darkfader> i often do
[12:54] <darkfader> and i often lose :)
[12:55] <raw> ok, i better start testing with two virtual machines then :)
[12:56] <darkfader> yes :))
[12:59] * Theuni (~Theuni@i59F73B1F.versanet.de) Quit (Quit: Leaving.)
[14:03] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[14:31] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[14:31] * yoshi (~yoshi@p3167-ipngn3601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[15:22] * nhm (~nh@ has joined #ceph
[15:33] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[15:35] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[15:38] <nhm> good morning #ceph
[15:38] <filoo_absynth> morning nhm
[15:42] <nhm> no btrfs in 3.4 seems to have much less performance degradation when using -l 64k and -n 64k
[15:43] <nhm> s/no/so
[15:48] * Theuni (~Theuni@i59F73B1F.versanet.de) has joined #ceph
[15:48] * CristianDM (~CristianD@host217.190-230-240.telecom.net.ar) Quit (Ping timeout: 480 seconds)
[15:49] * CristianDM (~CristianD@host217.190-230-240.telecom.net.ar) has joined #ceph
[16:32] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[16:35] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[16:37] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[16:38] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[16:41] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[16:48] <elder> sagewk, not that it's very important, but when a client gets a connection response from a server, the server *will* know its own IP address, right?
[16:49] <elder> So there's no need for the client to handle the possibility that the server supplied a blank address in its response?
[16:56] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[16:57] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[17:12] <nhm> elder: any idea how I'd recognize XFS metadata operations in blkparse output?
[17:13] * ao (~ao@ Quit (Quit: Leaving)
[17:14] * sbohrer (~sbohrer@ has joined #ceph
[17:14] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:15] <elder> I don't remember what's in it.
[17:15] * sagewk (~sage@aon.hq.newdream.net) Quit (Quit: Leaving.)
[17:15] <elder> But I believe XFS does a thing that marks all metadata as metadata when it passes requests down to the block layer.
[17:15] <elder> Or it did at one time.
[17:16] <elder> Let me look so I can be more specific with what I'm talking about...
[17:16] * joao (~JL@aon.hq.newdream.net) has joined #ceph
[17:20] <elder> The bio is submitted with REQ_META set in its rw flags. I don't know if that comes through or is recorded in blkparse output.
[17:21] <elder> In other words, it will include: WRITE or WRITE_SYNC and possibly FUA and possibly FLUSH and
[17:21] <elder> Sorry.
[17:21] <elder> (WRITE or WRITE_SYNC) and possibly FUA and possibly FLUSH
[17:21] <elder> or READ or READA
[17:21] <elder> And in all three cases, also possibly META
[17:22] <elder> Does that make sense?
[17:22] <nhm> elder: maybe, I'm not yet familiar enough with blkparse to know how that will come through.
[17:22] <elder> I'm looking at the man page now.
[17:23] <elder> Looks like it ought to show up (if at all) in the two-character "rwbs description"
[17:23] <nhm> elder: I'm going back and investigating the blktrace results from those animations I posted a while back. I want to know what's going on when the seeks skyrocket.
[17:23] <elder> So {R,W,D}{B,S}
[17:24] * Theuni (~Theuni@i59F73B1F.versanet.de) Quit (Ping timeout: 480 seconds)
[17:24] <elder> WS would be write synchronous
[17:24] <elder> I imagine it would thus need to be a third character, so it might not show up (or I have an out-dated man page)
[17:24] <elder> I.e., WSM for synchronous metadata write
[17:26] <nhm> I'm seeing a fair number of things like "WM 1491103890 + 256 [swapper/0]"
[17:26] <nhm> ah, here we go:
[17:26] <nhm> WSM 976770466 + 128 [xfsaild/sdb1]
[17:26] <elder> Sounds like that's your answer then.
[17:27] <elder> Synchronous requests ought to be unusual.
[17:27] <elder> You should verify with a current man page, but I presume "M" means metadata.
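The RWBS strings discussed above (WM, WS, WSM) can be expanded letter by letter. The sketch below is only a reading aid: the letter meanings in the table are an assumption based on common blkparse documentation and should be verified against a current man page, as elder suggests.

```python
# Assumed meanings of blkparse RWBS letters; verify against your man page.
RWBS_FLAGS = {
    "R": "read",
    "W": "write",
    "D": "discard",
    "B": "barrier",
    "S": "synchronous",
    "M": "metadata",
    "A": "readahead",
    "F": "flush/FUA",
    "N": "none",
}


def decode_rwbs(rwbs):
    """Expand an RWBS field such as "WSM" into a list of descriptions."""
    return [RWBS_FLAGS.get(ch, "unknown(%s)" % ch) for ch in rwbs]


if __name__ == "__main__":
    for field in ("WM", "WS", "WSM"):
        print(field, "->", ", ".join(decode_rwbs(field)))
```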
[17:31] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:33] <nhm> elder: heh, tough to find what all of the possible RWBS values are.
[17:33] <elder> You shouldn't really have to.
[17:33] <nhm> elder: yeah, I just wanted to verify that M is metadata.
[17:33] <elder> But synchronous (S) would be a sign of trouble... I.e., it'll hold everything up.
[17:34] <nhm> the blktrace manpage seems to only list the ones you mentioned. Yeah, the asynchronous ones are much more common. Going to track down all of the synchronous ones and how long they lasted.
[17:36] <elder> Also the barriers might take a little longer. I believe it's intended to be a bit lighter weight than a synchronous request, meaning the drive will ensure ordering without the host having to stop making new requests awaiting completion of a synchronous one.
[17:39] <nhm> ooh, ceph-osd did something WSM during both of the bad times.
[17:40] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[17:41] <elder> I think that will cause the entire AIL to get forced to disk, and maybe all I/O buffered on that disk to be written, and completed, before anything else can go.
[17:41] <nhm> hrm, yeah, ceph-osd has some WS in there too.
[17:41] * Tv_ (~tv@aon.hq.newdream.net) has joined #ceph
[17:42] <elder> Find out more and if you want I think I can dig a little further into what a WS or WSM implies--what it means must be going on in XFS.
[17:45] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[17:54] <nhm> elder: yeah, I would appreciate that. It definitely looks like there's a lot more WM and WSM operations going on during the times when there are high amounts of seeks.
[17:55] <elder> OK. I want to get to a good stopping point in what I'm doing before losing my train of thought. See if you can do any further characterization of what's going on--maybe at a higher level--in the mean time and I'll try to spend a little time on it this afternoon.
[17:56] <nhm> elder: cool, thanks. Whenever you have time. I've got like three different performance issues I'm juggling so I can always work on one of the other ones. :)
[17:59] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[18:02] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[18:11] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[18:12] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[18:13] * adjohn (~adjohn@50-0-164-218.dsl.dynamic.sonic.net) has joined #ceph
[18:34] * aliguori (~anthony@ has joined #ceph
[18:36] * cattelan_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) has joined #ceph
[18:47] * BManojlovic (~steki@ has joined #ceph
[18:51] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:52] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[18:58] * Ryan_Lane (~Adium@c-98-210-205-93.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:01] * bchrisman (~Adium@ has joined #ceph
[19:08] * chutzpah (~chutz@ has joined #ceph
[19:09] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[19:09] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[19:28] * rturk (~rturk@aon.hq.newdream.net) has joined #ceph
[19:32] * raw (~raw@ Quit (Remote host closed the connection)
[19:35] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[19:35] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[19:39] * cattelan_away is now known as cattelan_away_away
[19:44] <sagewk> that *was* ken in the black box
[19:45] * Ryan_Lane (~Adium@ has joined #ceph
[19:45] <dmick> I thought I heard a join when it appeared
[19:52] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[19:58] * cattelan_away_away is now known as cattelan_away
[19:59] <nhm> elder: I'm seeing a lot of xfsbufd asynchronous metadata writes during the problem times.
[20:00] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[20:05] <nhm> elder: at second 89 in http://nhm.ceph.com/movies/wip-throttle/4m-noflusher-2threads-xfs-ag4-osd0.mpg, there were 889 WM lines attributed to xfsbufd.
[20:06] <nhm> close to 1590 total WMs for that second.
[20:07] <nhm> exactly 1590
[20:08] <nhm> So it looks like a bunch of metadata writes get issued, and then it spends the next 5-6 seconds doing all of them (grinding throughput to a halt), and then things start clearing up and throughput goes back up.
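A count like "889 WM lines attributed to xfsbufd at second 89" can be reproduced with a small script over blkparse text output. This is a sketch, not the method nhm used: it assumes the default blkparse line layout (device, CPU, sequence, timestamp, PID, action, RWBS, sector + length, [process]) and counts only one action type so that the same request is not counted at several stages; the function name and sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed default blkparse layout:
#   dev  cpu  seq  timestamp  pid  action  rwbs  sector + nblocks [process]
LINE_RE = re.compile(
    r"^\s*\d+,\d+\s+\d+\s+\d+\s+(?P<ts>\d+\.\d+)\s+\d+\s+"
    r"(?P<action>\S+)\s+(?P<rwbs>\S+)\s+\d+\s+\+\s+\d+\s+\[(?P<proc>[^\]]+)\]"
)


def count_by_second(lines, rwbs="WM", action="Q"):
    """Count matching events per (whole second, process name)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # summary lines, other formats, etc.
        if m.group("rwbs") != rwbs or m.group("action") != action:
            continue
        second = int(float(m.group("ts")))
        counts[(second, m.group("proc"))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines in the assumed layout.
    sample = [
        "  8,16   1   1    89.000100000  1234  Q  WM 1491103890 + 256 [xfsbufd/sdb1]",
        "  8,16   1   2    89.500000000  1234  Q  WM 1491104146 + 256 [xfsbufd/sdb1]",
        "  8,16   0   3    89.700000000   987  Q WSM 976770466 + 128 [xfsaild/sdb1]",
        "  8,16   0   4    90.100000000  1234  Q  WM 1491104402 + 256 [xfsbufd/sdb1]",
    ]
    for (second, proc), n in sorted(count_by_second(sample).items()):
        print(second, proc, n)
```

Passing rwbs="WSM" instead isolates the synchronous metadata writes that elder flags as the more worrying case.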
[20:10] <sagewk> nhm, elder: my guess is syncfs(2) writing the inodes..
[20:10] <sagewk> yehudasa: can you look at wip-mon-auth?
[20:12] <nhm> sagewk: these tests were done on oneiric, so it wouldn't even be syncfs would it?
[20:13] <sagewk> oh, yeah. triggers the same code, though
[20:14] <sagewk> i don't think many people rely/expect sync(2) to be efficient
[20:16] <nhm> sagewk: would logbsize and/or delaylog have any effect here you think?
[20:17] <sagewk> don't know enough about xfs internals to know
[20:18] <nhm> hrm, looks like delaylog was set on by default in 2.6.39.
[20:22] <elder> I was off eating lunch.
[20:23] <nhm> elder: np, I'm going to eat some lunch soon.
[20:26] <elder> sagewk, nhm, syncfs would explain issuing a ton of metadata I/O.
[20:27] <elder> logbsize simply affects how big log buffers are. Bigger means bigger chunks of I/O.
[20:27] <elder> delaylog has a huge performance impact. I'll explain it a bit.
[20:28] <elder> In the olden days (pre-2011 I think) XFS would generate log entries for every update to important data structures (like metadata).
[20:29] <elder> If another operation came along and happened to affect the same data structure, it would simply generate yet another log entry describing the change.
[20:29] <elder> Certain operations--such as file truncations--would involve a long sequence of updates to the same data structure. Directory operations are also that sort of thing.
[20:30] <elder> As a result, the log would fill with lots of updates to the same data structure, and the log traffic would be therefore larger than it maybe needed to be as a result.
[20:31] <elder> Furthermore, anything with updates about it in the log is held pinned in memory, not allowed to be written out.
[20:31] <elder> Anyway, delaylog is a set of changes to improve all that.
[20:31] <elder> Basically, if a second or subsequent update to the same data structure is scheduled to be logged, instead of adding new entries to the log, the updates are sort of aggregated in place.
[20:32] <elder> Then there's various stuff that is tracked in order to ensure things are done in an appropriate sequence.
[20:32] <elder> But in the end, updates are accumulated and log traffic is reduced--significantly in some cases.
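The difference elder describes, one log entry per update versus updates to the same structure being aggregated in place, can be shown with a toy model. This is only an illustration of the idea; it does not reflect the actual XFS delayed-logging data structures, and the object names and changes are made up.

```python
def log_every_update(updates):
    """Pre-delaylog model: every update produces its own log entry."""
    return [(obj, change) for obj, change in updates]


def log_aggregated(updates):
    """delaylog-style model: repeated updates to one object share one entry.

    The pending entry for an object is overwritten with its latest state,
    so the amount logged depends on how many distinct objects changed,
    not on how many times each one was touched.
    """
    pending = {}
    for obj, change in updates:
        pending[obj] = change
    return list(pending.items())


if __name__ == "__main__":
    # A truncate-like sequence: the same inode is updated repeatedly.
    updates = [
        ("inode 7", "size=4096"),
        ("inode 7", "size=2048"),
        ("dir 3", "add entry"),
        ("inode 7", "size=0"),
    ]
    print(len(log_every_update(updates)), "entries without aggregation")
    print(len(log_aggregated(updates)), "entries with aggregation")
```

The same model also hints at the side effect mentioned next in the discussion: a log of a given size can represent changes to many more distinct objects, all of which must be written when the log is flushed.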
[20:33] <elder> One side-effect is that you can essentially have much more "change" represented in log that has the same size as previous versions.
[20:33] <nhm> elder: ok, that makes sense. So assuming that delaylog is on (it should be since we are running 2.6.39+), that would mean all of these metadata writes we are seeing are to different data structures?
[20:33] <elder> And a consequence of that is that, when the log is flushed out, there's potentially a lot more data structures that need to be written.
[20:34] <elder> It also means that a crash potentially involves a longer history of activity that needs to get replayed. And since XFS doesn't log user data, that may have user data consequences in the event of a crash.
[20:34] <elder> Not necessarily all different nhm, but yes, probably mostly different data structures.
[20:35] <elder> In fact, writes to the same data structure would show up as writing to the same offset on disk.
[20:35] <elder> Log writes, however, are going to be the log portion of the disk.
[20:36] <elder> So like Sage said, metadata writes all over the place are more likely the result of syncing the filesystem, which requires that all data (and metadata) actually be committed to disk, leaving a clean log.
[20:39] <elder> We should maybe start playing around with XFS tracepoints.
[20:40] <nhm> elder: sure. The systems now have syncfs, kernel 3.4, and are on SSDs instead of spinning rust, so the results may be slightly different. ;)
[20:41] <elder> Slightly.
[20:41] <elder> Are you using "perf"?
[20:42] <nhm> elder: wish I was. I need to get the packages built for this kernel.
[20:43] <nhm> say, that sounds like a good intern job. :D
[20:44] <elder> Sure.
[20:44] <elder> When the kernel module is loaded, there are a shitload of XFS tracepoints that become available.
[20:44] <elder> See /sys/kernel/debug/tracing/events/xfs
[20:44] <nhm> elder: I'm going to grab lunch, will be back later...
[20:44] <nhm> ok
[20:44] * danieagle (~Daniel@ has joined #ceph
[20:44] <elder> Sounds good.
[20:54] * adjohn (~adjohn@50-0-164-218.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[21:17] <todin> have any of you guys seen the tcp.c 1610 warning with 3.4.0?
[21:19] <ajm-> ceph-specific or in general?
[21:20] <todin> ajm-: it's a ceph osd node, but I don't think ceph is the problem
[21:20] <todin> I think it is a regression in the kernel, but not sure yet
[21:21] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[21:49] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[21:50] <yehudasa> sagewk: gregaf: want to take a look at wip-2465?
[21:50] <sagewk> yehudasa: sure
[21:50] <sagewk> yehudasa: can you look at wip-mon-auth?
[21:51] <yehudasa> sagewk: sure
[21:53] <sagewk> yehudasa: the .c_str() isn't needed with dump_string()
[21:53] <yehudasa> oh, right
[21:53] <yehudasa> ok, I'll rework that
[21:53] <sagewk> otherwise looks fine
[21:54] * adjohn (~adjohn@50-0-164-218.dsl.dynamic.sonic.net) has joined #ceph
[21:54] <sagewk> there are still some dump_format(..., "%d", ...) that can be dump_int or _unsigned
[21:54] <sagewk> not that its a bug, just unnecessary
[21:56] <yehudasa> sagewk: we can work on that in master
[21:56] <sagewk> yeah
[21:57] <yehudasa> sagewk: opened issue #2469 for that
[22:13] * lofejndif (~lsqavnbok@9YYAAGBRA.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:14] <yehudasa> sagewk: pushed updated fix to github/dho
[22:14] <sagewk> k
[22:14] <yehudasa> sagewk: we can push that to the dho github
[22:16] <sagewk> ok, go for it!
[22:18] <yehudasa> sagewk: also cherry pick into stable?
[22:18] <sagewk> yeah
[22:19] <sagewk> it's tested?
[22:20] <yehudasa> yes
[22:22] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:39] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[22:43] <Tv_> sagewk: uh.. i wonder why/how mkcephfs ever ended up in /sbin and not /usr/sbin
[22:44] <sagewk> iirc it was at the request of clint or laszlo
[22:44] <Tv_> huh
[22:44] <sagewk> probably laszlo, clint probably doesn't care
[22:45] * lofejndif (~lsqavnbok@9YYAAGBRA.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[22:46] <liiwi> need nodes up before /usr gets up?
[22:46] <Tv_> b502be7a8df4526875b91e6ad3ede72dec1c1b75 has no justification
[22:46] <Tv_> liiwi: mkcephfs is a cluster creation time, not a boot-time, thing
[22:46] <liiwi> right
[22:48] <elder> sagewk, if ceph_fault() gets called (socket closed from below, or error in try_read() or try_write()) there is a backoff.
[22:48] <elder> Each time that happens, it gets doubled.
[22:49] <sagewk> tv_: can check the ML. it was either laszlo, or a lintian check. don't remember.
[22:49] <elder> But the backoff delay never gets reset again unless/until the connection gets closed and reopened.
[22:49] <sagewk> elder: right.
[22:50] <elder> Which means that if there is a set of errors, then everything is copasetic for a very long time, the next error could cause a backoff of as much as 5 minutes.
[22:50] <sagewk> elder: yeah, it could be reset sooner, if we can figure out where. the problem is there are lots of things that can feed into fault.. from connection errors to disconnects to negotiation errors (e.g. bad auth)
[22:50] <elder> Shouldn't it somehow decay again back to a short one?
[22:50] <elder> OK.
[22:51] <sagewk> maybe something like TAG_READY resets it.. that should sufficiently imply happiness
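The backoff behaviour under discussion (the delay doubles on every fault, and the proposal is to reset it once the connection is known to be healthy) can be sketched as follows. This is a toy model, not the kernel messenger code: the class and method names and the base delay are hypothetical, and the 5-minute cap is taken from elder's remark above.

```python
class Backoff:
    """Exponential backoff with an explicit reset hook."""

    def __init__(self, base=0.25, max_delay=300.0):
        self.base = base            # hypothetical first delay, in seconds
        self.max_delay = max_delay  # cap of 5 minutes, per the discussion
        self.delay = 0.0

    def on_fault(self):
        """Called on each fault; returns how long to wait before retrying."""
        if self.delay == 0.0:
            self.delay = self.base
        else:
            self.delay = min(self.delay * 2, self.max_delay)
        return self.delay

    def on_ready(self):
        """Called when the connection is healthy again (e.g. on TAG_READY)."""
        self.delay = 0.0


if __name__ == "__main__":
    b = Backoff(base=1.0)
    print([b.on_fault() for _ in range(10)])  # doubles until it hits the cap
    b.on_ready()
    print(b.on_fault())                       # back to the base delay
```

Without the on_ready() call, a single error after a long healthy period would still pay the largest delay reached earlier, which is the concern elder raises.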
[22:51] <elder> Maybe.
[22:51] <Tv_> ok so automake/autoconf doesn't even know how to talk about /usr/sbin in the first place
[22:51] <elder> I'll keep it in mind.
[22:51] <Tv_> i always forget how miserable the thing is
[22:51] <sagewk> tv_: :)
[22:58] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) has joined #ceph
[23:10] * Theuni (~Theuni@p57A089C9.dip0.t-ipconnect.de) has joined #ceph
[23:15] * Theuni (~Theuni@p57A089C9.dip0.t-ipconnect.de) Quit ()
[23:20] <gregaf> sagewk: Tv_: looks like maybe the gitbuilders ran out of space?
[23:20] <gregaf> or at least my build is failing and one of them has a completely empty compile log
[23:20] <gregaf> http://ceph.newdream.net/gitbuilder-i386/
[23:20] <sagewk> the oneiric one did.. i wiped some random commits to make space. sigh.
[23:21] <sagewk> that one i cleared up yesterday i think? just redo your build and it should be ok
[23:21] <Tv_> 2.4GB free on that one
[23:22] <elder> sagewk, what is the purpose of the STANDBY connection state?
[23:22] <sagewk> elder: the connection state is preserved, and we will reconnect later when a message is queued.
[23:22] <elder> It is entered when a fault occurs and there's no outgoing activity pending.
[23:22] <sagewk> yeah
[23:23] <elder> But there's no coordination with the other side.
[23:24] <Tv_> 868MB free now
[23:25] <sagewk> there is.. the other side will also preserve state, and when they reconnect, they'll have the same seq #'s and pick up where they left off
[23:26] <sagewk> the server side will assume the client will reconnect and queue messages. until a client session timeout triggers and the client is kicked out
[23:27] <elder> But I mean, there is no protocol to say "I'm going into standby." Or is that just that the server will see a bumped connection sequence number and know that the client *was* in standby?
[23:28] <elder> Is the main thing about not having to re-authorize for a momentary connection loss?
[23:33] <elder> If there was no outbound activity pending, and a connection in standby will undergo a complete connection sequence with the server, I don't understand what "preserved state" there is that could be of benefit.
[23:34] <gregaf> elder: untransmitted messages, in the case of a network issue that recovers quickly
[23:34] <gregaf> is the main one
[23:34] <sagewk> elder: there is a message seq # for the "session" (which may consist of many connect/error/reconnect cycles) and within that session message delivery is lossless and ordered
[23:36] <elder> gregaf, the reason I questioned it is that the client messenger only enters standby if there *are* no untransmitted messengers.
[23:36] <elder> messages
[23:37] <gregaf> ah, so you were talking about standby specifically (I presume that if there are messages it tries to reconnect instantly)
[23:37] <gregaf> I missed that, sorry
[23:37] <elder> If there are messages it does the backoff dance
[23:37] <sagewk> elder: that sounds right..
[23:38] <elder> This is only in the "fault" case, which means the socket got closed (other end) or an error on send or receive occurred.
[23:38] <sagewk> yeah
[23:38] <elder> So what benefit is there in keeping the message sequence number intact, if there is nothing outstanding anyway?
[23:39] <elder> And if a reconnect will reset everything else?
[23:39] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[23:39] <sagewk> the seq # doesn't reset on reconnect, and the client and server are blissfully unaware a reconnect happened, and also certain that nothing was lost
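The idea sagewk describes, a per-session message sequence number that survives reconnects so delivery stays lossless and ordered, can be sketched generically. This is not the Ceph messenger protocol; the class names and the reconnect handshake (the peer reporting the last sequence number it received) are assumptions made for illustration.

```python
class Sender:
    def __init__(self):
        self.next_seq = 1
        self.unacked = []  # (seq, payload) pairs not yet known to be received

    def queue(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked.append((seq, payload))
        return seq

    def ack(self, seq):
        """Drop everything the peer has confirmed up to and including seq."""
        self.unacked = [(s, p) for s, p in self.unacked if s > seq]

    def reconnect(self, peer_last_seq):
        """On reconnect the peer reports the last seq it saw; resend the rest.

        next_seq is deliberately left alone: the session continues.
        """
        self.ack(peer_last_seq)
        return list(self.unacked)


class Receiver:
    def __init__(self):
        self.last_seq = 0
        self.delivered = []

    def receive(self, seq, payload):
        if seq <= self.last_seq:
            return False  # duplicate from before the reconnect; ignore it
        self.last_seq = seq
        self.delivered.append(payload)
        return True


if __name__ == "__main__":
    s, r = Sender(), Receiver()
    for msg in ("a", "b", "c"):
        s.queue(msg)
    r.receive(1, "a")
    r.receive(2, "b")
    # The connection drops before "c" arrives; both sides keep their state.
    for seq, payload in s.reconnect(r.last_seq):
        r.receive(seq, payload)
    print(r.delivered)
```

Because both ends keep their counters across the fault, the layers above never need to know that a reconnect happened.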
[23:40] <mgalkiewicz> sagewk: hi
[23:40] <elder> sagewk, OK, I think that may have answered my question.
[23:40] <sagewk> mgalkiewicz: hi
[23:41] <mgalkiewicz> sagewk: regarding bug #2379 how can I provide mon data directory for u (118MB)
[23:53] * nhm (~nh@ Quit (Ping timeout: 480 seconds)
[23:57] * nhm (~nh@ has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.