#ceph IRC Log


IRC Log for 2012-07-20

Timestamps are in GMT/BST.

[0:00] * aliguori (~anthony@ Quit (Remote host closed the connection)
[0:00] * loicd1 (~loic@ Quit (Quit: Leaving.)
[0:15] * MarkN (~nathan@ has joined #ceph
[0:16] * MarkN (~nathan@ has left #ceph
[0:19] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[0:19] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[0:19] * Leseb_ is now known as Leseb
[0:27] * LarsFronius (~LarsFroni@95-91-243-240-dynip.superkabel.de) Quit (Quit: LarsFronius)
[0:28] * loicd (~loic@ has joined #ceph
[0:32] * Tv (~tv@cpe-24-24-131-250.socal.res.rr.com) Quit (Quit: Tv)
[0:38] * loicd (~loic@ Quit (Quit: Leaving.)
[0:42] * loicd (~loic@ has joined #ceph
[0:52] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[0:52] * Tv_ (~tv@ has joined #ceph
[1:04] * joebaker (~joebaker@ has joined #ceph
[1:05] * lxo (~aoliva@9KCAAGV1R.tor-irc.dnsbl.oftc.net) Quit (Read error: No route to host)
[1:05] <joebaker> So I'm thinking about setting up a storage box with Ceph on it. Could I then connect to it with OpenStack/StackOps?
[1:06] <joebaker> Or is Ceph already integrated into the install of OpenStack?
[1:06] <lurbs> joebaker: http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/
[1:06] <lurbs> Best intro to that that I've seen thus far.
[1:06] <joebaker> thanks!
[1:06] <lurbs> Short version: Wait for folsom.
[1:07] <lurbs> Until then a bunch of stuff can't be done through the dashboard, etc.
[1:07] <joebaker> So I gather that Ceph competes with ZFS.
[1:07] <joebaker> I like that Ceph is GPL
[1:09] <Fruit> ceph doesn't compete with zfs
[1:09] <Fruit> ceph could use zfs as a backend
[1:10] * Leseb_ (~Leseb@ has joined #ceph
[1:11] <joebaker> Oh, I see. I'm new to this stuff... So maybe Ceph compares to samba in that it presents object-oriented data over some protocol to machines on other boxes?
[1:11] <joebaker> Like samba uses the smb protocol to allow file and printer sharing.
[1:12] <joebaker> ...reading the introduction link.
[1:17] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[1:17] * Leseb_ is now known as Leseb
[1:19] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:19] * loicd (~loic@ Quit (Quit: Leaving.)
[1:20] * loicd (~loic@ has joined #ceph
[1:23] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[1:24] * Cube (~Adium@ Quit (Quit: Leaving.)
[1:25] <nhm_> joao: are you still awake?
[1:28] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:28] <joao> nhm_, just got back from a coffee with friends
[1:28] <joao> what's up?
[1:32] <nhm_> joao: would it make sense that even with "--test-suppress-ops ocl" I'd still see a lot of setxattr calls?
[1:32] <joao> not sure
[1:32] <joao> but I think so
[1:33] <nhm_> they do seem to drop off over time.
[1:33] <joao> there must be something in the filestore doing xattrs
[1:33] <joao> the suppress ops option will only suppress the generator's explicit set xattrs
[1:34] <joao> if there are any being done internally, those would not be suppress
[1:34] <joao> *suppressed
[1:34] <nhm_> ah
[1:34] <joao> let me check if I find evidence in the filestore to support my theory
[1:34] <joao> (otherwise it could be some unfortunate bug, although I don't think so)
[1:37] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[1:38] <joao> nhm_, does that happen every now and then, or do they happen once you begin running the test and then drop being set?
[1:38] <joao> or something else entirely?
[1:39] <joao> ah!
[1:39] <joao> the replay guards are xattr based
[1:40] <nhm_> joao: they start out very high and slowly drop off over time
[1:40] <joao> okay, at the start it's probably the _detect_fs() function
[1:40] <nhm_> joao: well, drop quickly at first, then mostly level out after a while.
[1:40] <joao> or some log replaying
[1:41] <joao> nhm_, what I see that will potentially generate a larger amount of xattrs is the replay guards
[1:41] <nhm_> ok
[1:41] <joao> whenever a non-idempotent operation is made, we will set some xattr guards on the object
[1:41] <joao> so we can guarantee a correct replay of the journal if we fail
[1:42] <joao> so that should generate some setxattrs
[1:43] <nhm_> joao: I'm working on a little tool to generate per second counts of completed operations from strace and also per second operation latencies.
[1:44] <nhm_> Not sure if it will end up helping in the end or not, but I'm hoping I can do some comparisons versus things that write quickly.
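The kind of tool nhm_ describes could look roughly like the sketch below. It is not nhm_'s actual tool. It assumes the trace was captured with strace's -tt (wall-clock timestamps) and -T (time spent in each call) options, optionally with -f, and the names LINE_RE and summarize are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed input format (strace -tt -T, optionally -f), for example:
#   12:00:01.000100 writev(5, [...], 1) = 8192 <0.000200>
#   [pid  4242] 12:00:02.300000 fsync(7) = 0 <0.010000>
# Lines that do not carry a completed call with a <duration> are skipped.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"           # optional pid prefix from -f
    r"(?P<hms>\d{2}:\d{2}:\d{2})\.\d+\s+"      # timestamp, truncated to the second
    r"(?P<call>\w+)\(.*\)\s+=\s+\S+.*"         # syscall name, arguments, return value
    r"<(?P<secs>\d+\.\d+)>\s*$"                # time spent in the call
)


def summarize(lines):
    """Return {(hh:mm:ss, syscall): (count, mean_latency_seconds)}."""
    durations = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # unfinished/resumed calls, signals, exit markers
        key = (match.group("hms"), match.group("call"))
        durations[key].append(float(match.group("secs")))
    return {key: (len(vals), sum(vals) / len(vals)) for key, vals in durations.items()}
```

Feeding it the lines of a saved trace gives one entry per second and syscall, which makes it easy to see whether setxattr, writev and friends all slow down together or one of them leads.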
[1:44] <joao> that seems a cool idea
[1:44] <joao> at least you'd have more infos on what's going on
[1:45] <nhm_> joao: yeah. The problem so far is that it seems like everything slows down at the same time. It's not like writev slows down because of some crazy increase in stats or setxattrs or something.
[1:46] <joao> could it be cache related?
[1:46] <joao> this is the only thing I can come up with now that would affect everything at the same time, all of a sudden
[1:47] <joao> or memory consumption
[1:47] <nhm_> joao: or the controller just spazzing
[1:47] <joao> as if both options weren't generic enough :p
[1:47] <dmick> eh, the cluster just needs a rest every now and again :)
[1:47] <joao> lol
[1:48] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[1:49] <nhm_> dmick: It's a special feature to help make sure the drives reach their MTTF.
[1:51] * Leseb (~Leseb@ Quit (Ping timeout: 480 seconds)
[1:52] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[1:58] <sjust> joao: that shouldn't happen on btrfs though
[1:58] <nhm_> sjust: this is xfs
[1:58] <sjust> k
[1:59] <sjust> we could probably hack the filestore to turn that stuff off
[1:59] <nhm_> yay, I think my latency stuff is working
[2:01] * nhm (~nh@174-20-8-72.mpls.qwest.net) has joined #ceph
[2:05] * nhm__ (~nh@174-20-12-175.mpls.qwest.net) has joined #ceph
[2:07] * nhm_ (~nh@184-97-241-232.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[2:08] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[2:09] * nhm (~nh@174-20-8-72.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[2:16] * Leseb_ (~Leseb@ has joined #ceph
[2:18] * loicd (~loic@ Quit (Quit: Leaving.)
[2:23] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[2:23] * Leseb_ is now known as Leseb
[2:27] * Leseb (~Leseb@ Quit (Quit: Leseb)
[2:36] * Tv_ (~tv@ Quit (Quit: Tv_)
[2:53] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[2:54] * Tv (~tv@cpe-24-24-131-250.socal.res.rr.com) has joined #ceph
[3:04] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[3:13] * ryann (~chatzilla@ has joined #ceph
[3:45] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[3:47] <ryann> Looking through the mailing lists, however, one on here may be able to point me to a link more quickly. Running Argonaut, 4 osd nodes (4 disks in node), 2 mds nodes, 1 mon (currently), I wish to split the disks on each osd node into 2 different cephfs's. Possible? Good example anywhere? Is rados mkpool the best option here?
[3:48] <gregaf> you can't presently run two different filesystems in one cluster
[3:49] <gregaf> you can run multiple clusters on the same hardware, though -- there's a cluster config option you can set in newish versions, and as long as you set everybody up properly the first time they shouldn't collide
[3:52] <ryann> Help me, newish versions... Which holidays do they celebrate? I MEAN...where might i find more information on that? Sorry, I had to :)
[3:53] <gregaf> I don't remember how new it is -- but if you aren't running something new enough, you should be upgrading anyway :)
[3:54] <ryann> It seems this is exactly what I need to do, as, I have a faster set of disks I wish to cluster, and a slower, larger set, for longer term storage items...
[3:54] <ryann> I have ceph-0.48argo built and deployed at this point.
[3:55] <gregaf> yeah, that'll include it
[3:55] <gregaf> I haven't actually run this myself so I don't remember what it will do for you (Tv or sage should, among others)
[3:56] <gregaf> but I think you just need to create a ceph.conf with (obviously) different daemons and IP addresses
[3:56] <gregaf> and then point new daemons at the new ceph.conf
[3:57] <gregaf> and if you specify the cluster it will do a better job of naming things so they're easier to separate, and be a bit more intelligent when you turn on fresh ones
[3:58] <ryann> When you say specify cluster, you mean name the object cluster1-osd.0 and the like? or are talking about another configuration option I haven't found?
[3:58] <gregaf> "cluster" is a thing you can specify in the ceph.conf and/or when starting up daemons
[3:58] <Tv> ryann: multiple cluster thingie is the --cluster= option to just about everything, config is read from /etc/ceph/$cluster.conf, default cluster name is "ceph"
[3:59] <gregaf> you could for instance have "ceph-osd --cluster=fast-cluster" and "ceph-osd --cluster=big-cluster"
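As a rough illustration of the naming convention Tv and gregaf describe (config read from /etc/ceph/$cluster.conf, default cluster name "ceph"), here is a small sketch. The helper names cluster_conf_path and osd_command are hypothetical, and the command line is only assembled as a list, not executed.

```python
import posixpath


def cluster_conf_path(cluster="ceph", conf_dir="/etc/ceph"):
    """Config file a daemon would read for the given cluster name."""
    return posixpath.join(conf_dir, f"{cluster}.conf")


def osd_command(osd_id, cluster="ceph"):
    """Argument list for starting one OSD in the named cluster (not executed here)."""
    return ["ceph-osd", f"--cluster={cluster}", "-i", str(osd_id)]


for name in ("fast-cluster", "big-cluster"):
    print(name, cluster_conf_path(name), " ".join(osd_command(0, name)))
```

Two clusters on the same hardware would then differ only in the name passed to --cluster and in the contents of their respective config files.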
[3:59] <nhm__> man, it's tough to find free profilers that will let you break down performance of functions over time.
[3:59] <Tv> ryann: not sure that's the best way; perhaps you'd be satisfied with just mounting separate subtrees of the filesystem?
[3:59] * nhm__ is now known as nhm
[3:59] <gregaf> oh, yeah -- probably just having separate pools and moving data between them would make more sense *blush*
[3:59] <Tv> nhm: that sounds like multiple separate runs, while the process keeps running -- perf should be able to do that, right?
[4:00] <nhm> Tv: yeah, I could probably hack something up to run perf every 10s or something.
[4:00] <nhm> Tv: though right now perf isn't picking up our symbols, possibly related to the thing Sage was talking about earlier.
[4:01] <Tv> nhm: yeah never used it on userspace..
[4:01] <ryann> Ah! ok how would I do that? I'm open to either; building separate clusters WAS my original plan, but i wanted to do it as designed, not how I wished to hack it to work. But, if pools do what i need, then I just need to be pointed to info on how to set that up.
[4:01] <gregaf> nhm: I haven't used it, but I bet that google's profiler will do that -- I know it does for the heap profiler
[4:01] <gregaf> ryann: have you looked into how crush maps work?
[4:02] * joshd (~joshd@ Quit (Quit: Leaving.)
[4:02] <nhm> gregaf: interesting, I'll look into it. I've never used google's profiler either.
[4:02] <Tv> ryann: do you really need the strict separation of storage?
[4:02] <gregaf> you'd want to create a new "fast-data" pool and place it using a CRUSH rule that only looks at the OSDs with fast drives
[4:02] <Tv> ryann: or just two mounts that have different files in them
[4:02] <gregaf> Tv: sounds like he wants to segregate SSDs and long-term magnetic storage
[4:03] <gregaf> not the data segregation
[4:03] <Tv> oh i see the fast/slow mention
[4:03] <gregaf> err, not the user segregation
[4:03] <Tv> sorry, lots of backscroll
[4:03] <nhm> ok, time to go watch some TV before my head explodes
[4:03] <Tv> ryann: oh hey, even better
[4:03] <Tv> ryann: you can tell individual subtrees where to store their dat
[4:03] <Tv> a
[4:03] <ryann> nhm: nice :P
[4:03] <Tv> ryann: no need for two mounts in the first place!
[4:03] <Tv> ryann: http://ceph.com/docs/master/man/8/cephfs/
[4:04] <nhm> joao: btw, if you've got some free time, it looks like there may be some memory leaks in the workload generator
[4:04] <gregaf> okay guys; I'm out -- later
[4:05] <Tv> ryann: come up with a crushmap that says pool "slow" is stored on the osds using the slow disks (everything else on just the fast disks), then do cephfs set_layout to make subtrees get stored in the slow pool
[4:05] <Tv> ryann: only affects new files, but sounds *better* than two separate filesystems
[4:06] <ryann> Cool. thanks! Well, i'm building out both FS's tonight, so there's no data on them yet. I can clobber away. I just need to get my head wrapped around CRUSH, It seems there's info in like 3 different places and I have to look at all three to get it. :-/
[4:06] <Tv> it's replicated ;)
[4:06] <ryann> HAHA!
[4:07] <ryann> Yeah that's it the ceph.com/docs fs must have lost a few osd's... "TO DO: Write this doc..." :P
[4:07] <Tv> ryann: you know, whenever you see that in the docs, i'm the guy who put it there ;)
[4:08] * joao (~JL@ Quit (Remote host closed the connection)
[4:08] <nhm> joao: specifically "c = new C_OnReadable(this, t)" and "stat_state = new C_StatState(this, now)". There may be a couple of others too.
[4:08] <ryann> Yeah, well i figured if I get these servers off the ground, may be I can help you. Not sure how much external help you guys take for you docs...
[4:08] <nhm> ok, really out now.
[4:08] <Tv> ryann: well these days we at least have one person full time on docs
[4:09] <Tv> ryann: that helps a lot already.. but all input is always welcomed
[4:09] <Tv> we're especially bad at figuring out what makes/doesn't make sense to you
[4:09] <ryann> If I wasn't full time keeping a TV station on the Air, I would apply - I looked at the of the people you're looking for :(
[4:09] <ryann> It looks like a lot of fun.
[4:09] <Tv> Inktank & DreamHost are awesome companies
[4:10] <Tv> and we're hiring like crazy
[4:16] * joao (~JL@89-181-150-156.net.novis.pt) has joined #ceph
[4:16] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:46] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[4:52] <ryann> Tv: after reviewing the crushmap that is online here, would I just adjust the map to include a 'pool fast {...item rack_with_fast_stuff}' and include those osd's, and do similar for a 'pool slow {...'? Would I need to change the rules any at this point? (I'm still getting up to speed)
[4:53] <Tv> ryann: i'm sorry, i've never been able to keep the crushmap grammar in my head for more than 15 minutes at a time
[4:53] <Tv> ryann: i think you need two rulesets and then point at the different pools at different rulesets, but i don't have much more than that
[4:54] <Tv> ryann: if ceph.com/docs and mailing list archive fail you, email the mailing list
[4:54] <Tv> ryann: i'll make sure someone responds
[4:55] <ryann> Tv: Understood. Thanks for you time! ....you, uh...might see a random resume show up, at some point.....:P
[4:55] <ryann> (your time)
[4:55] <Tv> hehe
[4:56] <Tv> ryann: as long as you understand the distributed file system is still not stable, we're happy to help
[4:56] <Tv> and i really hope to see that resume
[4:57] <ryann> Thanks!
[5:00] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[5:21] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[5:48] * adjohn (~adjohn@ Quit (Quit: adjohn)
[6:04] * ryann (~chatzilla@ has left #ceph
[6:05] * jluis (~JL@ has joined #ceph
[6:10] * joao (~JL@89-181-150-156.net.novis.pt) Quit (Ping timeout: 480 seconds)
[6:31] <lurbs> I've got a test setup with 0.48, with three Ceph nodes - each acting as a mon and and osd, with the journal on an SSD. Getting an odd thing when using RBD for the backing device for KVM instances, though. Any given VM has decent read and write performance, but a heavy write slows a read *way* down, from ~70 MB/s to around 1 MB/s, even if it's a separate RBD device that's being read from (vda vs vdb in the VM). A read on a separate VM isn't affected.
[6:31] <lurbs> s/and and/and an/
[6:31] <lurbs> Anyone seen anything similar?
[6:36] <lurbs> Backed with XFS, and with rbd_cache on, if that makes any difference.
[7:39] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[7:41] * dmick (~dmick@ Quit (Quit: Leaving.)
[7:48] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) has joined #ceph
[7:50] * adjohn (~adjohn@50-0-133-101.dsl.static.sonic.net) Quit ()
[7:54] * Tv (~tv@cpe-24-24-131-250.socal.res.rr.com) Quit (Quit: Tv)
[8:05] * Cube (~Adium@ has joined #ceph
[8:35] * MK_FG (~MK_FG@ Quit (Ping timeout: 480 seconds)
[9:05] * BManojlovic (~steki@ has joined #ceph
[9:06] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:26] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) Quit (Ping timeout: 480 seconds)
[9:28] * Leseb (~Leseb@ has joined #ceph
[9:38] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:52] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:16] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[10:33] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:48] * MK_FG (~MK_FG@ has joined #ceph
[11:01] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[11:13] * EmilienM (~EmilienM@ has joined #ceph
[11:17] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[11:30] <loicd> Hi, I'm trying to assert the impact of not being able to dynamically change the number of placement group http://ceph.com/docs/master/dev/placement-group/ after the initial setup. Let say I am to grow a cluster from 100TB to 500TB , would that be an inconvenience ?
[11:31] * andret (~andre@pcandre.nine.ch) Quit (Quit: Verlassend)
[11:31] * andret (~andre@pcandre.nine.ch) has joined #ceph
[12:18] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[12:20] * morse (~morse@supercomputing.univpm.it) Quit (Read error: Connection reset by peer)
[12:20] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[12:45] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:49] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[13:23] <newtontm> Hi,
[13:23] <newtontm> anyone know what is the debian package to install to be able to do this: mount -t ceph -o name=admin,secretfile=/etc/ceph/secret ceph01:/ /mnt/ceph
[13:51] * _benoit_ (~benoit@paradis.irqsave.net) Quit (Remote host closed the connection)
[14:13] <newtontm> nvm I didn't have the right kernel, it works now !
[14:18] * lofejndif (~lsqavnbok@1RDAAC5V3.tor-irc.dnsbl.oftc.net) has joined #ceph
[14:28] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[14:28] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[14:29] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[14:40] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[14:48] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Read error: Connection reset by peer)
[14:57] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[15:06] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[15:42] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[15:54] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:03] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:27] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[16:31] * lofejndif (~lsqavnbok@1RDAAC5V3.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[16:51] <loicd> Leseb: regarding benchmarking, what do you mean by "simulating SSD" ?
[16:51] <loicd> I assume you'll be using https://github.com/ceph/teuthology right ?
[16:51] * Leseb (~Leseb@ has left #ceph
[16:51] * Leseb (~Leseb@ has joined #ceph
[16:52] <loicd> Leseb: regarding benchmarking, what do you mean by "simulating SSD" ?
[16:52] <loicd> I assume you'll be using https://github.com/ceph/teuthology right ?
[16:52] <Leseb> I just thought about using ram block device to store the journal
[16:52] <Leseb> yes I will
[16:52] <Leseb> pop I mean ramdisk
[16:53] <loicd> nice idea :-)
[16:53] <Leseb> I don't have SSDs at my disposal... but it's only for testing purpose of course :)
[16:54] <loicd> The problem with SSD for storing journals is that they burn too quickly.
[16:56] <loicd> During Ceph: object storage, block storage, file system, replication, massive scalability, and then some! ( http://www.oscon.com/oscon2012/public/schedule/detail/26463 ) it was suggested to use a 32GB SSD for an 8GB journal by creating a single 8GB partition and not using the rest at all
[16:56] <Leseb> indeed, at least you will opt for raid 1 SSDs if you want to store the journal in it
[16:56] <Meyer__> get some ZeusRAM ;)
[16:56] <loicd> that would transparently multiply by 4 the lifetime of the SSD
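The "multiply by 4" figure follows from a very idealized wear model, sketched below. It assumes the controller spreads writes evenly over all flash on the device, including unpartitioned space, and it ignores write amplification. The function names, the program/erase cycle count and the daily write volume are hypothetical, not values given in the discussion.

```python
def days_until_worn(flash_gb, pe_cycles, journal_gb_per_day):
    """Days until every cell has used its program/erase budget, assuming even wear."""
    if journal_gb_per_day <= 0:
        raise ValueError("journal_gb_per_day must be positive")
    return flash_gb * pe_cycles / journal_gb_per_day


def lifetime_multiplier(device_gb, partition_gb):
    """Lifetime gain from spreading a partition's writes over the whole device."""
    if not 0 < partition_gb <= device_gb:
        raise ValueError("partition must fit on the device")
    return device_gb / partition_gb


# Hypothetical workload: 500 GB of journal writes per day, 3000 P/E cycles per cell.
confined = days_until_worn(8, 3000, 500)     # wear confined to 8 GB of flash
spread = days_until_worn(32, 3000, 500)      # wear spread over all 32 GB
print(confined, spread, lifetime_multiplier(32, 8))
```

Whether a real controller actually uses the unpartitioned space this way is exactly what the rest of the conversation is unsure about, so the ratio should be read as an upper bound under these assumptions.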
[16:57] <loicd> Meyer__: how much does it cost ?
[16:58] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:58] <Leseb> $2,485.95
[16:58] <Meyer__> loicd: they are quite expensive
[16:58] <Leseb> http://www.neobits.com/stec_z4rzf3d_8uc_zeusram_6gb_sas_8gb_high_iops_low_p960255.html?atc=gbs
[16:58] <Meyer__> ah, Leseb had the answer for you
[16:58] <nhm> leseb: what I'd *really* like to see is a ssd backed ramdrive for journal use.
[16:59] <Leseb> does someone say commodity hardware? xD
[16:59] <dspano> Leseb: Most of my servers are cheaper than that. Lol.
[16:59] <Meyer__> But they are very quick
[17:00] <Leseb> for this price, I hope so!
[17:00] <nhm> leseb: something like the old gigabyte i-ram cards but with some flash that can be written to when the battery is invoked.
[17:02] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:02] <Meyer__> nhm: the zeusram works like that iirc, ram and then some capacitors that writes stuff to flash if the power runs out
[17:03] <Leseb> loicd: could you be more specific when you said that SSDs burn too quickly? matter of months? years?
[17:03] <nhm> Meyer__: yeah, it's just pricey and still on SAS rather than PCIE.
[17:03] <Meyer__> nhm: true
[17:03] <Meyer__> i wonder how those ddrive-x1 or whatever they are called works
[17:04] <loicd> In a matter of a few months, that's what I understood. But it stands to reason that it will happen real quick because the journal is all about writing and SSD burn when you write.
[17:04] <jluis> gregaf, around?
[17:04] <loicd> It's a matter of burning by the number of writes rather than calendar days ;-)
[17:04] <nhm> Meyer__: yeah, Kype was talkign about those the otehr day. No BBU afaik so you better have a UPS. :)
[17:05] <nhm> er Kyle
[17:05] <Leseb> ok stuff like cell's life cycle
[17:06] <loicd> One possible strategy would be to have a LVM VG with two SSD , when one is about to give up, migrate the LV to the next SSD , replace the used SSD etc.
[17:06] <nhm> I figure with SSDs for journal you should probably buy a pretty big one and stick smaller partitions on it for better wear levelling.
[17:06] <nhm> that'll at least buy you a bit more time.
[17:07] <loicd> nhm: yes
[17:07] <Meyer__> nhm: It seems like the ddrive x1 also has some flash that it backups to in case of power failure
[17:07] <nhm> Meyer__: interesting!
[17:07] <nhm> I didn't know that.
[17:08] <Meyer__> and they are pci-e
[17:09] <loicd> Leseb: I bet you'll get about 2x better performances on write when using a XFS based OSD with a journal in a ramdisk.
[17:10] <nhm> Meyer__: yeah, I wonder if they'll come out with a shorter version that can fit in 2U.
[17:11] <Meyer__> nhm: I have no clue, just knew they existed and have not really done any research into them or their future plans.
[17:11] <Leseb> what prevents the ssd controller using the entire ssd/having the same lifetime if you make a partition using the entire drive?
[17:11] <nhm> if they could do 2 sodimm version and move the flash along the side...
[17:12] <Meyer__> I wonder how much they cost
[17:13] <nhm> Leseb: I think if it's unpartitioned space it can utilize the free space for wear levelling.
[17:13] <nhm> Leseb: and if it's partitioned but empty it can not. Don't quote me on that though.
[17:14] <Leseb> afaik the controller ought to be able to figure out the wear leveling itself...
[17:14] <loicd> sileht: updated https://labs.enovance.com/issues/434 with input from florian regarding methods to benchmark the ceph installation
[17:15] <dspano> My new favorite log message: kernel: [138108.068622] VFS: Busy inodes after unmount of ceph. Self-destruct in 5 seconds. Have a nice day...
[17:15] <nhm> Leseb: I could be entirely wrong.
[17:15] <loicd> (except cephfs that's out of scope for now)
[17:17] <loicd> sileht: did you figure out if it would be appropriate to publish your benchmark results on the http://ceph.com/wiki ? I assume that would be a good place but since it is not very active, maybe there is another wiki better suited for sharing this kind of information.
[17:18] <nhm> Leseb: I haven't read it yet, but apparently there is a very detailed paper about all of this here: http://domino.research.ibm.com/library/cyberdig.nsf/papers/50A84DF88D540735852576F5004C2558/$File/rz3771.pdf
[17:19] <Leseb> nhm: cool I will read that
[17:20] <sileht> loicd, Yes it is
[17:21] <sileht> loicd, I will provide the benchmark procedure, in case I miss something or misunderstand a result, and the result of it on our platform.
[17:26] * stxShadow (~Jens@ip-78-94-238-69.unitymediagroup.de) has joined #ceph
[17:27] <Leseb> using a journal means writing 2 times, it's safer obviously but it could be interesting to perform some test with a disabled journal...
[17:31] <loicd> fc: could you please remind me the URL of the page that discuss placement groups resizing on the wiki ?
[17:33] <loicd> oh, it was not on the wiki
[17:33] <loicd> http://ceph.com/docs/master/dev/placement-group/ :-)
[17:33] <fc> loicd: http://ceph.com/docs/master/ops/manage/grow/placement-groups/
[17:33] <loicd> simultaneous paste
[17:34] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:35] <loicd> fc: florian suggests that setting it initially to the maximum size of the cluster is the only way and has no performances impact. He is not sure about it though and it is to be confirmed.
[17:35] <newtontm> Hi, I'm currently testing ceph with the kernel module, everything was working pretty fine until I decided to test failures. So I killed one of the 3 servers running the osd/mon and it's been like an hour and the mount is still not available. How is the recovery supposed to work ?
[17:36] <jluis> newtontm, output of ceph -s ?
[17:37] <newtontm> ceph -s
[17:37] <newtontm> health HEALTH_WARN 513 pgs degraded; 63 pgs down; 15 pgs recovering; 63 pgs stuck inactive; 576 pgs stuck unclean; recovery 6234/12480 degraded (49.952%); 20/6240 unfound (0.321%); 1 mons down, quorum 0,1 a,b
[17:37] <newtontm> monmap e1: 3 mons at {a=,b=,c=}, election epoch 16, quorum 0,1 a,b
[17:37] <newtontm> osdmap e25: 3 osds: 1 up, 1 in
[17:37] <fc> loicd: I agree this is the only way, and I have totally no clue about the performance impact but it should be checked
[17:37] <newtontm> pgmap v2609: 576 pgs: 63 down, 15 active+recovering+degraded, 498 active+degraded; 24875 MB data, 25939 MB used, 66210 MB / 92150 MB avail; 6234/12480 degraded (49.952%); 20/6240 unfound (0.321%)
[17:37] <newtontm> mdsmap e53: 1/1/1 up {0=a=up:active}
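The percentages in the status output above can be recomputed from the raw counts. The sketch below only parses the "N/M degraded (P%)" and "N/M unfound (P%)" fragments as they appear in this paste; RATIO_RE and ratios are made-up names, and the format of other ceph versions is not assumed.

```python
import re

# Matches fragments such as "6234/12480 degraded (49.952%)".
RATIO_RE = re.compile(r"(\d+)/(\d+) (degraded|unfound) \((\d+\.\d+)%\)")


def ratios(status_text):
    """Recompute the degraded/unfound percentages from their raw counts."""
    result = {}
    for num, den, kind, _reported in RATIO_RE.findall(status_text):
        result[kind] = round(100.0 * int(num) / int(den), 3)
    return result


health = ("recovery 6234/12480 degraded (49.952%); "
          "20/6240 unfound (0.321%); 1 mons down, quorum 0,1 a,b")
print(ratios(health))
```

One reading of the numbers, offered as a guess: 12480 is twice 6240, which would be consistent with two copies of each object, and with only one OSD up roughly half of those copies are missing.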
[17:38] <jluis> newtontm, the cluster is still recovering; give it a bit more time
[17:38] <newtontm> jluis: I have logs like that in my osd logs: 2012-07-20 15:36:52.901072 7fe1faf31700 0 log [WRN] : 303 slow requests, 1 included below; oldest blocked for > 1457.119941 secs
[17:38] <loicd> fc: agreed, it deserves a question on the mailing list
[17:39] <newtontm> jluis: and my mount is still unavailable... is this normal behaviour ?
[17:39] <newtontm> jluis: also it's been stuck at 49% for a while now
[17:40] <jluis> newtontm, maybe some one else can take a look at why the pgs are still recovering
[17:42] <jluis> oh, btw, the 49% is probably because you only have one of the osds up and in the cluster
[17:42] <jluis> not sure though
[17:42] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:42] <jluis> you did have 3 osds up, didn't you?
[17:42] <newtontm> jluis: that is what ceph -s is reporting, only one osd up, but I have 2 osd running...
[17:44] <Leseb> guys I have to leave! cheers!
[17:44] <newtontm> jluis: humm checking the logs of my first OSD it seems it dumped some stuff, i'll give you the logs in pastebin
[17:45] <newtontm> jluis: http://pastebin.com/E9e5Nj6y
[17:46] * stxShadow (~Jens@ip-78-94-238-69.unitymediagroup.de) Quit (Read error: Connection reset by peer)
[17:46] <jluis> newtontm, I think I saw this issue last week, not sure if it is the same
[17:47] <jluis> you should talk to sjust when he comes around
[17:47] <newtontm> k
[17:51] <loicd> fc: regarding placement groups, if you are to scale to 20PB and 1PB requires 1000 OSD (maybe a little less but it's the order of magnitude) then you would need to configure the initial number of placement groups to 20 * 1000 * 100 = 2 millions pg
[17:52] <newtontm> jluis: btw, I can't kill -9 the osd process... it won't die...
[17:52] <jluis> is it stuck in IO?
[17:52] <newtontm> probably, rebooting the server
[17:53] <jluis> what does dmesg say?
[17:53] <jluis> anything worth mentioning?
[17:53] <newtontm> [13530.188024] libceph: tid 3016611 timed out on osd1, will reset osd
[17:53] <jluis> no kernel oops, no nothing?
[17:53] <newtontm> nope
[17:54] <newtontm> the ceph mount is not available
[17:54] * Leseb_ (~Leseb@ has joined #ceph
[17:54] <newtontm> i was writing to the mount while testing the crash scenario on a different node
[17:57] * Leseb_ (~Leseb@ Quit ()
[18:00] <fc> loicd: I'm just asking myself - if, as you say, a ceph pool is set up with 2 millions PG (to handle 1000 OSDs) - how will behave this ceph pool if it starts with for example only 100 OSDs, but still with 2 million placement groups ? (that would be 1000 placement groups per OSD)
[18:02] * Leseb (~Leseb@ Quit (Ping timeout: 480 seconds)
[18:03] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[18:04] <loicd> fc: that's precisely the question that should be asked on the list, I think. 2 million pg to handle 20,000 osds.
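The placement group arithmetic in this exchange can be written out as below. The figures (1000 OSDs per PB, 100 PGs per OSD, 20 PB target) are loicd's rough estimates from the conversation, and the function names are illustrative only.

```python
def osd_count(capacity_pb, osds_per_pb=1000):
    """Number of OSDs needed for the target capacity (loicd's rough ratio)."""
    return capacity_pb * osds_per_pb


def initial_pg_count(capacity_pb, osds_per_pb=1000, pgs_per_osd=100):
    """Placement groups to configure up front for the final cluster size."""
    return osd_count(capacity_pb, osds_per_pb) * pgs_per_osd


def pgs_per_osd_at(total_pgs, current_osds):
    """Placement groups each OSD carries while the cluster is still small."""
    return total_pgs / current_osds


total = initial_pg_count(20)
print(osd_count(20), total)              # OSDs and PGs at the 20 PB target
print(pgs_per_osd_at(total, 100))        # load per OSD when starting with 100 OSDs
print(pgs_per_osd_at(total, 20000))      # load per OSD at full size
```

With these inputs a 100-OSD starting cluster would carry 20,000 placement groups per OSD, well above the 1000 mentioned earlier, which makes the open question about behaviour at small scale all the more relevant.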
[18:09] * lofejndif (~lsqavnbok@28IAAF49V.tor-irc.dnsbl.oftc.net) has joined #ceph
[18:16] * Tv_ (~tv@ has joined #ceph
[18:25] * LarsFronius (~LarsFroni@95-91-243-240-dynip.superkabel.de) has joined #ceph
[18:39] * bchrisman (~Adium@ has joined #ceph
[18:43] <gregaf> newtontm: that crash means that the OSD ran a sync and the filesystem didn't return after 30 seconds -- ie, hardware is too slow
[18:45] <gregaf> you can tune that default by setting "osd op thread timeout = 60" or something
[18:46] <gregaf> it's possible you also ran into a kernel client bug that we discovered recently, but the main issue is that you lost 2 out of 3 OSDs (one too slow and suicided, one you killed) and the cluster couldn't recover from that ;)
[18:47] * Tv (~tv@ has joined #ceph
[18:54] * MarkDude (~MT@ has joined #ceph
[19:01] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[19:03] * joshd (~joshd@ has joined #ceph
[19:05] * dmick (~dmick@ has joined #ceph
[19:09] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[19:12] * joao (~JL@ has joined #ceph
[19:13] * chutzpah (~chutz@ has joined #ceph
[19:14] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[19:18] * jluis (~JL@ Quit (Ping timeout: 480 seconds)
[19:19] * loicd (~loic@ has joined #ceph
[19:22] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[19:40] <nhm> sjust: fyi Sage said some of those 8k writes (along with some other strange writes) might be the journal header getting written too often. I'm going to do some tests with debug journal 15. I'll let you know what I find.
[19:40] <sjust> k
[19:45] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: No route to host)
[19:54] * Ryan_Lane (~Adium@ has joined #ceph
[19:57] * loicd1 (~loic@ has joined #ceph
[19:57] * loicd (~loic@ Quit (Quit: Leaving.)
[19:58] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:21] * lofejndif (~lsqavnbok@28IAAF49V.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[20:21] <newtontm> gregaf: hi, i'll try to do another fail test. In an ideal world what would be the impact of killing a node to a running cluster ?
[20:26] <gregaf> newtontm: IO for which that node is the primary will hang for the duration of the timeouts (if you have 3 OSDs then under the defaults it will be 25 seconds; adjustable of course), and then the replica will take over
[20:28] <newtontm> ok but what about the problem I encountered with the OSD that crashed ?
[20:28] * LarsFronius (~LarsFroni@95-91-243-240-dynip.superkabel.de) Quit (Quit: LarsFronius)
[20:28] <gregaf> well, that's a node failure; same thing
[20:29] <gregaf> the problem you ran into was that you had one node die and killed the second out of three OSDs and you didn't have 3x replication so some of the data was no longer available
[20:29] <newtontm> ok, but the file system became unresponsive for quite a while
[20:30] <gregaf> did you get one of the nodes back online?
[20:30] <newtontm> I basically had to kill my server because I wasn't able to kill the OSD process with kill -9
[20:30] <newtontm> yeah, it's online right now
[20:30] <gregaf> so what's ceph -s report?
[20:30] <gregaf> if the OSD process wasn't killable it was stuck waiting on disk IO
[20:31] <gregaf> this sounds to me like you're having trouble with your virtualization again, but I dunno :)
[20:31] * MarkDude (~MT@ Quit (Read error: Connection reset by peer)
[20:31] <newtontm> could be but I fixed my mtu problems, network is working fine
[20:32] * gregphone (~gregphone@ has joined #ceph
[20:32] <newtontm> I tested 2 more kills of 1 node and I had no issue so far, only 1 osd node (the one i killed) was down, and everything else works fine...
[20:33] <newtontm> So I may have run into a bug
[20:33] <newtontm> is the kernel module to mount ceph considered stable ?
[20:33] * gregphone_ (~gregphone@ has joined #ceph
[20:33] <newtontm> or is there any options like with nfs i should put for network timeout &
[20:33] <newtontm> ?
[20:35] * gregphone (~gregphone@ Quit (Read error: Connection reset by peer)
[20:35] * gregphone_ (~gregphone@ Quit (Read error: Connection reset by peer)
[20:35] * gregphone (~gregphone@ has joined #ceph
[20:36] <gregphone> newtontm: well, the filesystem in general isn't production-ready yet
[20:38] <gregphone> and things are going to take longer to clean up if you lose enough OSDs to lose all access to data for a while
[20:39] <newtontm> I guess we can configure the replication to be able to lose more than one node right ?
[20:40] * gregphone_ (~gregphone@66-87-131-227.pools.spcsdns.net) has joined #ceph
[20:41] <gregphone_> yep; you set the number of copies of data you want
[20:42] <gregphone_> and then I'd you don't lose that many nodes at once you don't lose access
[20:42] <gregphone_> *if you don't
[20:42] <newtontm> ok can I ask you more questions about the functionality of ceph, like I would be interested in knowing how ceph handles split brain?
[20:42] <newtontm> is there any wiki/doc that explains this ?
[20:43] <gregphone_> ask away!
[20:44] <gregphone_> I'm eating lunch though so I'll be slow at the moment
[20:44] <newtontm> np :)
[20:45] <gregphone_> so Ceph is a strongly consistent system, so it needs a majority of monitors to make progress
[20:45] <gregphone_> if your network partitions, the partition that has a strict majority of the total monitors will be able to keep working
[20:46] <gregphone_> you can also check out ceph.com/wiki, which is often out of date on the particulars but describes the architecture and failure handling fine
[20:47] <gregphone_> and ceph.com/docs for better info in some areas
[20:47] <newtontm> so that's why it's recommended to have an odd number of mon processes running right?
[20:47] * gregphone (~gregphone@ Quit (Ping timeout: 480 seconds)
[20:47] * gregphone_ is now known as gregphone
[20:48] <gregphone> you need a attic majority of moms to make progress
[20:48] <gregphone> three mons can keep running after one fails; four mons can keep running after one fails
[20:49] <newtontm> right
[20:49] <gregphone> having an even number of monitors has more failure points but doesn't add resiliency compared to the odd number right beneath it
[20:49] <gregphone> sorry, that was "strict majority"
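The strict-majority rule described above can be written out as arithmetic. A minimal sketch, assuming only that a quorum means more than half of the configured monitors (the function names are illustrative, not Ceph API):

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a strict majority still remains."""
    return monitors - quorum_size(monitors)


# An even count tolerates no more failures than the odd count just below it.
for n in range(1, 7):
    print(f"{n} mons: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Three and four monitors both tolerate a single failure, which is the point made above: the fourth monitor adds a failure point without adding resiliency.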
[20:50] * loicd1 (~loic@ Quit (Quit: Leaving.)
[20:51] * loicd (~loic@ has joined #ceph
[20:52] <newtontm> ok another question, like you know now I mounted my partitions with the kernel module as follow in my fstab: ceph02,ceph03:/ /mnt/ceph ceph name=admin,secretfile=/etc/ceph/secret,noexec,nodev,noatime 0 2
[20:53] <newtontm> now I tried to test the locking system, I wanted to know if client 1 can write to a file that client 2 is writing to. I found out that it is possible. However if I try writing to the same file within the ceph fs on the same client, the lock works.
[20:54] <newtontm> So is this "normal" or will the locking feature come later on ?
[20:54] <gregphone> I'm not sure what you men's
[20:54] <gregphone> *mean
[20:54] <nhm> sjust: ping
[20:54] <gregphone> files are
[20:55] <gregphone> locked in the sense that one client writes at a time
[20:55] <nhm> gregphone: was that issue you were talking about earlier regarding the reads the same one as on the mailing list?
[20:55] <gregphone> but it doesn't prevent multiple clients from accessing the files
[20:55] * loicd (~loic@ Quit (Read error: Connection reset by peer)
[20:55] * loicd (~loic@ has joined #ceph
[20:56] <gregphone> nhm: loicd mentioned it g
[20:56] <gregphone> *here last night
[20:56] <gregphone> I didn't see it on the list
[20:56] <gregphone> joshd thinks it was a QEMU issue
[20:57] <nhm> gregphone: Ok. I think I remember that now.
[20:57] <newtontm> gregphone: i tried this on 2 different client: flock test -c 'while true; do echo 1 >> test; sleep 1; done'
[20:57] <newtontm> and flock test -c 'while true; do echo 2 >> test; sleep 1; done'
[20:57] <nhm> gregphone: Ok, I figure this is probably Josh territory since I haven't even started looking at RBD yet.
[20:57] <newtontm> and the 2 files got written and I had "121212121212" in the file
[20:58] <newtontm> but if I ran this on the same server/client then the second process is waiting for the first to release the lock
[20:58] <newtontm> then I get '1111111112222222222'
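The shell test above relies on flock(1), which takes an exclusive flock(2) lock on the file before running the command. A minimal sketch of the same primitive in Python, assuming a Linux host and a local temporary file (the helper name `try_exclusive_lock` is hypothetical). It only shows the single-host behaviour newtontm describes, where a second locker is refused while the first holds the lock; it says nothing about how CephFS behaves across clients.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path: str):
    """Open path and try a non-blocking exclusive flock.

    Returns the open file object if the lock was taken, or None if
    another open file description already holds it.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Lock is held elsewhere (EWOULDBLOCK); give up without waiting.
        handle.close()
        return None
    return handle


# Two separate opens of the same file stand in for the two competing writers.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
first = try_exclusive_lock(demo_path)
second = try_exclusive_lock(demo_path)
print("first got lock:", first is not None)
print("second got lock:", second is not None)
first.close()  # closing the file releases the lock
os.remove(demo_path)
```

Running the same helper from two different machines against a shared mount would be one way to reproduce the cross-client test with a blocking or non-blocking choice made explicit.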
[21:00] <gregphone> I'm not familiar with flock test?
[21:01] * BManojlovic (~steki@ has joined #ceph
[21:04] * glowell (~glowell@c-98-210-226-131.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[21:04] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[21:06] <newtontm> man flock in bash
[21:08] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[21:09] * gregphone (~gregphone@66-87-131-227.pools.spcsdns.net) Quit (Quit: BRB! Reconnecting ...)
[21:10] * Tv (~tv@ Quit (Remote host closed the connection)
[21:10] * Tv (~tv@2607:f298:a:607:3c8e:5803:fd18:393b) has joined #ceph
[21:10] * gregphone (~gregphone@ has joined #ceph
[21:13] <gregphone> I'll have to look at what it's doing, but we probably have a bug in our flock implementation
[21:16] <newtontm> ok, so it's supposed to lock it so only 1 process can write to a file at the same time when write lock is applied
[21:21] <gregphone> yes
[21:21] <newtontm> would you know by chance the test the devs did for that situation, maybe i'm not testing it correctly
[21:21] <gregphone> newtontm: if you have something similar for fcntl locks, those are better tested
[21:25] <newtontm> i'll try to see if I can test this with fcntl
[21:26] <newtontm> oh, and i'm using xfs, does this require btrfs ?
[21:26] <gregphone> no; posix filesystem locking doesn't go to the OSDs at all
[21:27] * loicd (~loic@ Quit (Quit: Leaving.)
[21:27] <newtontm> k
[21:29] * gregphone_ (~gregphone@66-87-131-227.pools.spcsdns.net) has joined #ceph
[21:33] <nhm> sjust: ping
[21:34] * gregphone (~gregphone@ Quit (Ping timeout: 480 seconds)
[21:34] * gregphone_ is now known as gregphone
[21:35] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[21:39] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) Quit (Read error: Connection reset by peer)
[21:39] * loicd1 (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[21:40] <sjust> nhm: hang on
[21:41] * gregphone (~gregphone@66-87-131-227.pools.spcsdns.net) Quit (Quit: BRB! Reconnecting ...)
[21:41] * gregphone (~gregphone@ has joined #ceph
[21:41] * gregphone (~gregphone@ Quit ()
[21:48] <newtontm> Is there anything done to monitor the status of ceph with monitoring system like nagios/icinga/opennms?
[21:50] <gregaf> the ceph health command is intended for use by systems like that
[21:51] <gregaf> I think there are some though, eg https://github.com/dreamhost/ceph-nagios-plugin
[21:51] <gregaf> Tv might know if there are more
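A monitoring check built on `ceph health` mostly needs to map the leading status word to the exit codes Nagios-style systems expect (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch, assuming the output starts with HEALTH_OK, HEALTH_WARN or HEALTH_ERR; the function names are hypothetical and this is not the plugin linked above:

```python
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

EXIT_CODES = {
    "HEALTH_OK": OK,
    "HEALTH_WARN": WARNING,
    "HEALTH_ERR": CRITICAL,
}


def health_to_nagios(output: str):
    """Translate `ceph health` output into (exit_code, message)."""
    text = output.strip()
    status = text.split(None, 1)[0] if text else ""
    # Anything unrecognised, including empty output, is reported as UNKNOWN.
    return EXIT_CODES.get(status, UNKNOWN), text or "no output from ceph health"


def check() -> int:
    """Run `ceph health`, print one status line, and return the exit code."""
    try:
        result = subprocess.run(
            ["ceph", "health"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"UNKNOWN: could not run ceph health: {exc}")
        return UNKNOWN
    code, message = health_to_nagios(result.stdout)
    print(message)
    return code
```

A wrapper script would end with `sys.exit(check())` so the monitoring system sees the exit code.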
[21:52] <newtontm> any plans to have this info available through snmp ?
[21:53] <gregaf> we're not working on writing that sort of thing that I'm aware of
[21:53] <newtontm> k
[21:53] <newtontm> thx
[21:53] <gregaf> we want to enable community members to do it with the tools we provide, not to write all the interfaces ourself ;)
[21:58] <Tv_> newtontm: we're looking at StatsD more than SNMP
[21:59] <Tv_> newtontm: the low-level interface is a unix domain socket per daemon that dumps a lot of counters as json
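Once a daemon's counters are in hand as JSON, forwarding them to StatsD is mostly a matter of flattening the nesting into metric names. A minimal sketch, assuming the dump is a nested JSON object with numeric leaves; the sample payload, the `ceph.osd0` prefix and the function names are made up for illustration, and reading from the socket itself is left out:

```python
import json


def flatten_counters(prefix: str, node):
    """Yield (dotted_name, value) for every numeric leaf under node."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten_counters(f"{prefix}.{key}", child)
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix, node
    # Non-numeric leaves (strings, lists, booleans, null) are skipped.


def to_statsd_gauges(prefix: str, payload: str):
    """Turn a JSON counter dump into StatsD gauge lines (name:value|g)."""
    counters = json.loads(payload)
    return [f"{name}:{value}|g" for name, value in flatten_counters(prefix, counters)]


# Hypothetical dump; real counter names and layout will differ.
sample = '{"osd": {"op": 12, "op_latency": {"avgcount": 3, "sum": 0.5}}}'
for line in to_statsd_gauges("ceph.osd0", sample):
    print(line)
```

Each resulting line can be sent as a UDP datagram to a StatsD server; gauges are used here because the sketch treats every counter as a point-in-time value.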
[22:03] <newtontm> Tv_: thx
[22:03] <newtontm> well guys thanks for answering all my questions.
[22:05] * newtontm (~jsfrerot@charlie.mdc.gameloft.com) Quit (Quit: leaving)
[22:26] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:49] * danieagle (~Daniel@ has joined #ceph
[22:53] * EmilienM (~EmilienM@ Quit (Quit: Leaving...)
[23:10] * chutzpah (~chutz@ Quit (Quit: Leaving)
[23:26] <elder> sagewk, I'm about to push the testing branch.
[23:26] <sagewk> hold on, need to rebase it on what i'm pushing
[23:26] <sagewk> there was a problem with a patch i put in there a few days ago
[23:27] <elder> Damn!
[23:27] <sagewk> sorry :)
[23:27] <sagewk> testing-sage
[23:27] <elder> I wanted to win the race.
[23:27] <elder> I should have just gone for it :)
[23:27] <sagewk> hehe
[23:27] <sagewk> well, i would have had to rebase all your stuff then
[23:27] <nhm> elder: that'll teach you to communicate. ;)
[23:27] <sagewk> anyway, just rebase everything that was on top of testing before onto that.
[23:28] <sagewk> i squashed a few patches, pulled one out, added in another couple things.
[23:28] <elder> Which patch are you testing?
[23:28] <elder> Oh.
[23:28] <sagewk> i fixed a bio iter bug when there are socket errors
[23:28] <elder> Is there any reason to believe the testing I did on my branch is invalidated?
[23:28] <elder> They all passed--everything I ran.
[23:28] <sagewk> but am running up against a very reproducible crash (with no helpful debug crash info) on rbd + msgr faults :(
[23:28] <elder> Including xfstests over rbd.
[23:29] <sjust> sagewk: merging next into master as well, if that's ok
[23:29] <sagewk> don't think so
[23:29] <sagewk> sjust: ok
[23:29] <sagewk> this is mostly dealing with messenger and fault handling
[23:29] <elder> OK. I can wait. Please let me know when you've pushed it so I can get busy on rebasing.
[23:29] <sagewk> it's pushed
[23:29] <elder> OK.
[23:29] <sagewk> testing-sage
[23:30] <elder> Oh.
[23:30] * glowell (~glowell@dhcp201-19.nersc.gov) has joined #ceph
[23:30] <sagewk> actually, maybe we should a few these out still
[23:30] <sagewk> let me reorder, one sec
[23:32] <sagewk> meh, nevermind, rebase on all of them.
[23:35] <elder> Are you going to push to testing?
[23:36] <sagewk> oh, sure
[23:36] <elder> It's up to you. But I'm prepared to update the testing branch.
[23:36] <sagewk> pushed, go ahead.
[23:38] <elder> OK, well a rebase isn't going to work.
[23:38] <elder> That's OK.
[23:38] <elder> I'll fix it manually.
[23:38] <elder> Your changes conflicted with the previous testing branch
[23:39] <sagewk> rebase --onto origin/testing oldtesting
[23:40] <sagewk> it shouldn't conflict with any of your changes
[23:40] <elder> Too late.
[23:40] <elder> I'm almost done
[23:41] <elder> But I'll look into your fancy-pants git command for doing what I just did.
[23:42] <sagewk> git rulez
[23:42] <sagewk> good news is the crash i was seeing isn't appearing with the same workload on cephfs, so it's probably the msgr bio stuff
[23:42] <sagewk> or a random rbd thing
[23:43] <elder> I see it's in the bio read path, right?
[23:45] <sagewk> i can't tell where... it only backtraces half the time, and it has no useful context. all i have to go on is recent debug output.. which isn't super enlightening
[23:47] <elder> Well, you were resetting the bio_iter, and I was a little concerned it was code I touched recently. To my relief it was not that particular spot that was affected. (But I am still more than ready to take the blame.)
[23:47] <elder> The stuff I touched was in the write side.
[23:48] <sagewk> which patch(es)? i'll take a look
[23:49] <elder> All of them.
[23:49] <elder> :)
[23:49] <elder> Anyway, the one I was concerned about was abdaa6a849af1d63153682c11f5bbb22dacb1f6b
[23:50] <elder> Which was preceded by 572c588edadaa3da3992bd8a0fed830bbcc861f8
[23:50] <elder> But I think it's OK.
[23:51] <sagewk> abdaa6a849af1d63153682c11f5bbb22dacb1f6b was added back by yan's patch in mainline
[23:52] <sagewk> well, the null assignment part
[23:52] <elder> Interesting.
[23:52] <elder> Well it's not needed any more.
[23:53] <sagewk> yeah
[23:53] <elder> Wait.
[23:54] <elder> abdaa6a849af1d63153682c11f5bbb22dacb1f6b was my change, which avoided the problem that Yan's fixed.
[23:54] <elder> ...which rendered Yan's patch unnecessary (it replaced it)
[23:54] <sagewk> not sure how they got ordered..
[23:54] <sagewk> in any case, that hunk came back when linus/master was merged back into our testing branch
[23:56] <elder> I'm not so sure.
[23:57] <elder> The thing you fixed with 7749c500 affected read_partial_message().
[23:57] <elder> The thing I was talking about has to do with putting the init in prepare_write_message_data()
[23:58] <elder> That's at the beginning of an attempt to write a message. Previously it happened every time we tried to write a part of a message.
[23:58] <elder> So like I said before, your fix affects the read path. I think mine was in the write path.
[23:58] <elder> And I don't believe it "came back" with the merge with Linus' branch.
[23:58] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[23:59] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) Quit (Remote host closed the connection)
[23:59] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[23:59] <elder> I don't mean to be confusing, but you got me interested, so I wanted to try to sort it out...

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.