#ceph IRC Log


IRC Log for 2013-08-13

Timestamps are in GMT/BST.

[0:00] <sjust> type level?
[0:00] <sjust> and make sure to use the crush map validator before injecting it
[0:00] <mjevans> Am I not trying to construct a logical tree of the storage network, while specifying at what level of redundancy each node in the tree is?
[0:00] <sjust> yes, but you only have 1 host on that root
[0:00] * sprachgenerator (~sprachgen@130.202.135.205) Quit (Quit: sprachgenerator)
[0:01] <sjust> and therefore requiring 2 hosts of redundancy isn't working
[0:01] <sjust> actually, I think I misparsed what you typed, can you elaborate?
[0:01] <mjevans> sjust: Yeah, I'm setting it up so that maybe later I could add a different rack (probably will never happen in this case but that's the idea); Logically ssd_all is every possible SSD in the 'cluster', while ssd_rackN is every ssd in a given rack, etc.
[0:02] <sjust> and ssd_rack0_osds is what?
[0:03] <mjevans> When I say 'step chooseleaf firstn 0 type host', is it not searching through the rule chain until it reaches N nodes of type host? (which it seems I designated in name, but not with an actual /type/ setting)
[0:03] <sjust> that root contains exactly 1 host, ssd_rack0_osds
[0:03] <mjevans> sjust: That is supposed to be all ssds in the 0th rack
[0:03] <mjevans> That isn't a host, it's a rack
[0:03] <sjust> it's labeled a host
[0:03] <mjevans> A rack being a collection of hosts; yeah, I didn't set the type properly
[0:04] <sjust> in any case, you are asking for redundancy across hosts, right?
[0:04] <sjust> so 2 replicas, 1 on each host?
[0:04] <mjevans> Yes
[0:04] <sjust> you have to actually specify which osd is in which host
[0:04] <mjevans> The example I was copying from simply didn't have a setting
[0:04] <sjust> crush won't figure that out
[0:04] <mjevans> Where can I do that?
[0:04] <sjust> well, actually it will, but not with a custom crush map
[0:04] <sjust> are osd.0 and osd.1 in the same host (in real life)?
[0:05] <mjevans> sjust: No, osd.EVEN and osd.ODD are on hosts for EVEN and ODD
[0:05] <sjust> so you have 1 rack and 2 hosts
[0:05] <mjevans> yes
[0:05] <sjust> each has 1 ssd and 3 hdds?
[0:05] <mjevans> Yes
[0:05] <sjust> hdd_all is set up correctly
[0:06] <sjust> probably osd.1 is in host ssd_rack0_host1 and osd.0 is in ssd_rack0_host0
[0:06] <mjevans> Oh I see, I need... that's yeah
[0:07] <sjust> also, annoyingly, you can't make the ssd-primary rule work with the current crush language
[0:07] <sjust> it doesn't have a way of ensuring that it didn't choose the same host twice
[0:07] <mjevans> Yes, the way I wrote it I am trying to get redundancy N on hdds and caching 1 on SSDs
[0:07] <sjust> since ssd_all and hdd_all are really separate roots
[0:07] <sjust> yeah, it's a reasonable setup, we just can't do it right now
[0:08] <sjust> it'll work if the hosts with ssds and the hosts with hdds don't overlap, but not with the setup you have
[0:08] <sjust> mjevans: again, make sure to validate the crush map before injecting it, you can mess up your cluster badly if the map is buggy enough
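For reference, the hierarchy being worked out above would look roughly like this in a decompiled crush map (ids, names and weights are illustrative; the key point is that each osd sits in a bucket of type host, so that "chooseleaf firstn 0 type host" can pick distinct hosts):

    host ssd_rack0_host0 {
        id -10                  # illustrative negative bucket id
        alg straw
        hash 0                  # rjenkins1
        item osd.0 weight 1.000
    }
    host ssd_rack0_host1 {
        id -11
        alg straw
        hash 0
        item osd.1 weight 1.000
    }
    rack ssd_rack0 {
        id -12
        alg straw
        hash 0
        item ssd_rack0_host0 weight 1.000
        item ssd_rack0_host1 weight 1.000
    }
    root ssd_all {
        id -13
        alg straw
        hash 0
        item ssd_rack0 weight 2.000
    }
    rule ssd {
        ruleset 3               # illustrative ruleset number
        type replicated
        min_size 1
        max_size 10
        step take ssd_all
        step chooseleaf firstn 0 type host
        step emit
    }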
[0:08] * mschiff (~mschiff@tmo-103-66.customers.d1-online.com) Quit (Read error: Connection reset by peer)
[0:09] <mjevans> sjust: I'm getting things setup here first before putting any data I care about on it; which if things go well will be later this afternoon.
[0:09] <sjust> ok
[0:09] * mschiff (~mschiff@tmo-103-66.customers.d1-online.com) has joined #ceph
[0:09] <mjevans> Changing brain-cell blender configurations with live data inside seems... beyond risky
[0:09] <sjust> heh
[0:11] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[0:13] * alram (~alram@208.86.100.62) Quit (Quit: leaving)
[0:13] * wonkotheinsane (~jf@jf.ccs.usherbrooke.ca) Quit (Quit: WeeChat 0.3.7)
[0:14] * mozg (~andrei@host109-151-35-94.range109-151.btcentralplus.com) Quit (Quit: Ex-Chat)
[0:14] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[0:14] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Read error: Connection reset by peer)
[0:16] <mjevans> sjust: crush map validate method/tool/documentation?
[0:17] <sjust> crushtool --help
[0:17] <sjust> you want --test with --simulate, I think
[0:19] * mschiff (~mschiff@tmo-103-66.customers.d1-online.com) Quit (Remote host closed the connection)
[0:20] <dmick> sjust: I generally use --test with --output-csv
[0:20] <dmick> I think --simulate is something...different
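A sketch of the kind of validation run being discussed (file names and rule/replica numbers illustrative):

    crushtool -c crushmap.txt -o crushmap.bin                        # compile the text map
    crushtool -i crushmap.bin --test --show-statistics --rule 3 --num-rep 2
    crushtool -i crushmap.bin --test --output-csv                    # dump per-rule placements to CSV files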
[0:20] <sjust> ah
[0:21] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[0:23] <mjevans> I think it's gacking on the ssd_primary rule that sjust mentioned earlier; I'm going to have to adjust the number of desired replicas for that group to 3 and change how I write the rule a little.
[0:24] <mjevans> It doesn't seem like I can tune that at the rule level; just the pool level...
[0:26] * sjustlaptop (~sam@38.122.20.226) has joined #ceph
[0:29] * codice_ (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) has joined #ceph
[0:29] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) Quit (Read error: Connection reset by peer)
[0:30] <mjevans> Does ceph rely on the default pools still existing? (can I just delete the data, metadata and rbd pools?)
[0:30] <joshd> yes you can delete the default pools
[0:32] * doxavore (~doug@99-7-52-88.lightspeed.rcsntx.sbcglobal.net) Quit (Quit: :qa!)
[0:34] * AfC (~andrew@2407:7800:200:1011:7905:7fe3:6df8:c4c0) has joined #ceph
[0:38] <mjevans> This is slightly frustrating but I'll take a break and look at the map again; there's probably something small I'm overlooking.
[0:39] <loicd> I collected coverage information from teuthology in /tmp/teuthology ( teuthology --archive=... ) but I'm not sure how to invoke teuthology-coverage to generate a report. Is there example somewhere that I could follow ?
[0:40] * sjustlaptop (~sam@38.122.20.226) Quit (Ping timeout: 480 seconds)
[0:42] * alram (~alram@199.66.166.55) has joined #ceph
[0:44] <joshd> loicd: see https://github.com/ceph/teuthology/blob/master/teuthology/queue.py#L92 which ends up calling it https://github.com/ceph/teuthology/blob/master/teuthology/suite.py#L227
[0:48] * davidzlap (~Adium@cpe-75-84-249-188.socal.res.rr.com) has joined #ceph
[0:49] * rturk-away is now known as rturk
[0:49] <loicd> joshd: thanks :-)
[0:53] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[0:53] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) Quit (Quit: ...)
[1:01] <mjevans> So I'm trying to validate/test my crush map... and it doesn't behave as expected (it never stores anything to the SSD osds) http://pastebin.com/J39yF4Ug
[1:04] <mjevans> None of the examples I've seen cover having just a single osd per host...
[1:04] <sjust> what is the output of the validation tool?
[1:05] * tnt_ (~tnt@91.176.3.64) Quit (Ping timeout: 480 seconds)
[1:06] * buck1 (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has joined #ceph
[1:07] <mjevans> sjust: http://pastebin.com/eJ9mgty0
[1:08] <mjevans> Pretty much it's just never ever touching the SSDs
[1:11] * AfC (~andrew@2407:7800:200:1011:7905:7fe3:6df8:c4c0) Quit (Quit: Leaving.)
[1:11] * AfC (~andrew@2407:7800:200:1011:7905:7fe3:6df8:c4c0) has joined #ceph
[1:13] <sjust> you may have to specify the pool
[1:14] <mjevans> sjust: you mean put them together and use the choose firstn 0 type osd rule?
[1:14] <mjevans> That would technically work for the current case; but I'd have to revisit things if they ever change.
[1:16] * jantje (~jan@paranoid.nl) has joined #ceph
[1:17] <sjust> I mean to the tester, I think it's testing the wrong pool
[1:17] <mjevans> It's testing all of them though; and when I made the change I just suggested it worked as expected.
[1:17] <sjust> oh
[1:19] <mjevans> It looks like the crushmap just can't cope with chooseleaf when the tree has an ending sequence of root->{1 branch}->{2 branch}->{1 leaf} (for a total of 2 leaves)
[1:20] <mjevans> I don't know what it's doing exactly, but from the other files it seems it often results in choosing a null-set since nothing fulfills the criteria
[1:20] <sjust> root ssd_rack0_host0 {
[1:20] <sjust> should be host ssd_rack0_host0?
[1:21] <mjevans> What's the difference between those?
[1:21] <sjust> it's a host, not a root
[1:21] <sjust> chooseleaf 0 type host
[1:21] <sjust> means to choose hosts
[1:21] <sjust> not roots
[1:21] <mjevans> Oh THAT's where you specify the type!
[1:23] <mjevans> sjust: Well, I fixed that, annnd it still doesn't work.
[1:23] <sjust> post the crushmap again.
[1:23] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[1:23] <mjevans> I'll just post it all
[1:24] <sjust> ssd_rack0 is also probably not a root
[1:26] <joao> NelsonJeppesen, still around?
[1:27] <mjevans> http://pastebin.com/Bjuvekqa
[1:27] <mjevans> sjust: yeah, I fixed that too
[1:27] * NelsonJeppesen (~oftc-webi@199.181.135.135) Quit (Remote host closed the connection)
[1:28] <mjevans> The numbers seem correct if I use the ssd_all_direct root
[1:28] <sjust> ok, now you are asking for hosts in a tree with no hosts
[1:28] <sjust> step take ssd_all_direct
[1:28] <mjevans> *sigh* forgot to change that back
[1:29] <mjevans> Yup, that works perfectly now
[1:29] <sjust> ok then
[1:30] <mjevans> I suppose editing a crush map while feeling worn down isn't such a great idea.
[1:30] <mjevans> and the health is once again OK
[1:31] <mjevans> Now to see if ganeti blows up when I try to create a cluster...
[1:45] * buck1 (~buck@c-24-6-91-4.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[1:45] * glowell1 (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[1:47] <joshd> joao: what log levels would you like to debug a pool creation issue? mon = 20, ms = 1?
[1:47] <joao> add paxos = 10 to the mix and I'll be a happy man
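In ceph.conf terms, the debug levels being asked for would be roughly:

    [mon]
        debug mon = 20
        debug ms = 1
        debug paxos = 10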
[1:48] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[1:51] <joshd> hmm, actually it could just be separate userspace and kernel bugs with the same result
[2:01] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[2:13] * rturk is now known as rturk-away
[2:18] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Quit: Leaving.)
[2:19] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[2:24] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[2:32] <mjevans> Looks like ganeti 2.7.1 doesn't die instantly... but also doesn't yet support userspace (kvm direct) rbd access; that... is a bit risky. I -think- if I make sure to reserve a few GB of ram overhead I should be ok until a newer version of ganeti is released. Though I will delay moving critical things as long as I can...
[2:38] * huangjun (~kvirc@59.173.200.119) has joined #ceph
[2:42] * guppy_ (~quassel@guppy.xxx) has joined #ceph
[2:42] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[2:43] * guppy (~quassel@guppy.xxx) Quit (Read error: Connection reset by peer)
[2:45] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[2:47] * nerdtron (~kenneth@202.60.8.252) has joined #ceph
[2:51] * yy (~michealyx@122.233.47.137) has joined #ceph
[2:55] * devoid1 (~devoid@130.202.135.246) Quit (Quit: Leaving.)
[3:03] * joao (~JL@2607:f298:a:607:9eeb:e8ff:fe0f:c9a6) Quit (Quit: Leaving)
[3:15] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[3:20] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[3:22] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[3:23] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[3:28] <wrencsok1> did quorum rules change from 56.6 to 61.7?
[3:29] <wrencsok1> i did an update, everything actually went smoothly, but while all 3 mons came back up, I seem to be missing one from the quorum: e3: 3 mons at {a=10.30.177.4:6789/0,b=10.30.177.5:6789/0,c=10.30.177.6:6789/0}, election epoch 552, quorum 0,1 a,b
[3:32] <dmick> are all the monitors really upgraded?
[3:32] <wrencsok1> yes
[3:32] <dmick> (and restarted)
[3:32] <wrencsok1> yes
[3:32] <dmick> ceph daemon mon.<id> version would prove it. If so, no, they obviously should talk, still
[3:33] <dmick> maybe check logs for whining
[3:33] <wrencsok1> that's the odd part no whiny logs, but here's the version info: root@hpbs-c01-m03:~# dsh -g mon ceph-mon --version
[3:33] <wrencsok1> ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
[3:33] <wrencsok1> ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
[3:33] <wrencsok1> ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
[3:33] <wrencsok1> root@hpbs-c01-m03:~#
[3:34] <wrencsok1> i restarted them twice. early in the process all 3 were in the quorum list, later 'c' seemed to disappear, tho it still thinks it's there.
[3:34] <wrencsok1> thinks and talks fine
[3:35] <wrencsok1> whoops, now i have some noise.
[3:35] <wrencsok1> 1 mon.c@2(synchronizing sync( requester state chunks )) e3 discarding message auth(proto 0 27 bytes epoch 3) v1 and sending client elsewhere
[3:36] <dmick> ceph daemon mon.<id> mon_status might be interesting to c
[3:37] * LeaChim (~LeaChim@176.248.81.121) Quit (Ping timeout: 480 seconds)
[3:39] <dmick> that is, ceph daemon mon.c mon_status
[3:39] <wrencsok1> that's coming up as an unrecognized command.
[3:39] <dmick> ?
[3:39] <wrencsok1> root@hpbs-c01-m03:/var/log# ceph daemon mon.c mon_status
[3:39] <wrencsok1> unrecognized command
[3:40] <dmick> replace mon_status with help?
[3:40] * sagelap (~sage@76.89.177.113) has joined #ceph
[3:41] <wrencsok1> no love, also unrecognized. is that a command to access ceph --admin-daemon /var/run/......asok
[3:42] <dmick> yes.
[3:42] <dmick> is your ceph command itself old?
[3:42] <dmick> ceph -v?
[3:43] <wrencsok1> that one's not working for me, the old method. i think so on the ceph -v thing, let me double check, i noticed that discrepancy when updating. checking
[3:43] <wrencsok1> yes, that update didn't happen. still have 56.6 of that
[3:43] <dmick> ah. that would explain the unrec commands
[3:44] <wrencsok1> root@hpbs-c01-m03:/var/run/ceph# ceph -v
[3:44] <wrencsok1> ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
[3:44] <wrencsok1> root@hpbs-c01-m03:/var/run/ceph#
[3:44] <dmick> probably best to upgrade that as well
[3:44] <wrencsok1> ok so. question then. should that have come along for the ride when we synched mirrors for 61.7 and ceph
[3:45] <dmick> not sure what you mean by 'synched mirrors'
[3:45] <dmick> but you need to actually upgrade the package that it's in, whatever the process is
[3:45] <wrencsok1> it's not a straightforward process, i have to get an internal mirror updated to pull in your code changes to my lab.
[3:45] <wrencsok1> thought we grabbed them all from deb.
[3:45] <dmick> do you mean "is it part of the repo"? the answer is yes
[3:46] <dmick> ceph-common is the package
[3:46] <wrencsok1> yeah, i'll have to check, i am not sure he grabbed everything. ok thanks. will look into that in the morning.
[3:46] <wrencsok1> thanks
[3:47] <dmick> well that may not be the problem, but
[3:47] <dmick> at least it's getting in the way of diagnosis.
[3:49] <dmick> I suspect what's happening is that that last monitor is trying to sync state with the others before joining quorum. How long has this been happening?
[3:50] <wrencsok1> well, yeah if you want me to run that command, it's also like half an upgrade, imho. i can pull stuff from the admin socket, quite familiar with it and using it for custom scripts
[3:50] <wrencsok1> well initially. before updating 168 osd's
[3:50] <wrencsok1> all 3 mon's were in the quorum
[3:50] <dmick> with --admin-daemon is an ok way to get it
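The older admin-socket form referred to here would be along these lines (assuming the default socket path):

    ceph --admin-daemon /var/run/ceph/ceph-mon.c.asok mon_status
    ceph --admin-daemon /var/run/ceph/ceph-mon.c.asok quorum_status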
[3:50] <wrencsok1> somewhere around 100 or so osd's c dropped out
[3:50] <dmick> hm.
[3:50] <dmick> that sounds odd.
[3:51] <wrencsok1> was also getting some really nice, tho crazy bandwidth spikes during the update.
[3:51] <wrencsok1> 33GB/s type of cluster throughput, which i have to say is very nice to see, after playing with bobtail a few months, and all its bugs
[3:52] <wrencsok1> i'll put this on the plate til morning, babysitting the update and attempting to script it up can wait til the morrow when i have a fresher mind to dig in to our internal mirror and make sure everything was pulled in.
[3:53] <wrencsok1> thanks tho, probably ask a few questions tomorrow.
[3:54] * davidzlap1 (~Adium@cpe-75-84-249-188.socal.res.rr.com) has joined #ceph
[3:54] * davidzlap (~Adium@cpe-75-84-249-188.socal.res.rr.com) Quit (Quit: Leaving.)
[3:56] * alram (~alram@199.66.166.55) Quit (Quit: leaving)
[4:10] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[4:20] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Quit: Leaving.)
[4:20] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[4:22] * AfC (~andrew@2407:7800:200:1011:7905:7fe3:6df8:c4c0) Quit (Quit: Leaving.)
[4:23] * clayb (~kvirc@proxy-ny2.bloomberg.com) Quit (Read error: Connection reset by peer)
[4:27] * AfC (~andrew@2407:7800:200:1011:7905:7fe3:6df8:c4c0) has joined #ceph
[4:34] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Quit: Leaving.)
[4:34] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[4:38] <nerdtron> hi all!
[4:39] <nerdtron> how do you perform speed tests for your ceph cluster? i have created a ceph osd pool and the write speed is only 20MB to 30MB per sec
[4:39] <nerdtron> all my interfaces and switches are gigabit
[4:53] * yanzheng (~zhyan@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[4:56] * haomaiwang (~haomaiwan@117.79.232.149) has joined #ceph
[4:57] * rongze_ (~quassel@117.79.232.218) has joined #ceph
[5:03] * haomaiwa_ (~haomaiwan@117.79.232.201) Quit (Ping timeout: 480 seconds)
[5:03] * rongze (~quassel@notes4.com) Quit (Read error: Operation timed out)
[5:06] * fireD_ (~fireD@93-142-199-217.adsl.net.t-com.hr) has joined #ceph
[5:07] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[5:07] * fireD (~fireD@93-139-146-165.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:17] <davidzlap1> nerdtron: You could try using the rados bench command. See "rados --help"
[5:24] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Read error: Connection reset by peer)
[5:24] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[5:37] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90.1 [Firefox 22.0/20130618035212])
[5:56] * AfC (~andrew@2407:7800:200:1011:7905:7fe3:6df8:c4c0) Quit (Quit: Leaving.)
[5:58] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:59] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Remote host closed the connection)
[6:01] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[6:06] * Cube (~Cube@c-98-208-30-2.hsd1.ca.comcast.net) has joined #ceph
[6:10] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:13] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[6:17] * Cube (~Cube@c-98-208-30-2.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[6:28] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[6:30] * AfC (~andrew@2407:7800:200:1011:a17b:e651:3b26:70d2) has joined #ceph
[6:52] <sage> yanzheng: for that trim patch, would it be simpler to make d_prune_aliases() call d_prune sooner?
[6:55] <sage> ceph is the only d_prune user, and it was added to deal with exactly this complete flag problem, so we are free to change it as needed
[7:04] * KrisK (~krzysztof@213.17.226.11) has joined #ceph
[7:07] <yanzheng> sage, it's not convenient to do that
[7:09] <sagelap> why not just http://fpaste.org/31686/70582137/ ?
[7:11] <yanzheng> d_prune will be called twice
[7:13] <sage> where?
[7:13] * nigwil (~idontknow@174.143.209.84) Quit (Quit: leaving)
[7:15] <yanzheng> once in d_prune_aliases, once in dentry_kill
[7:16] <yanzheng> I think it's not clean, although it's ok to our use case
[7:16] <sage> hmm, the dentry_kill one could be conditional on whehter the dentry is in fact hashed?
[7:17] * nigwil (~idontknow@174.143.209.84) has joined #ceph
[7:18] <yanzheng> yes, it could
[7:18] <yanzheng> I will try sending a patch to fsdevel
[7:19] <sage> that seems nicer given the d_prune intention anyway (to not get a call on an already-unhashed dentry on dput)
[7:20] <sage> i just pushed the first 2 patches to the testing branch, btw. both look good (i just did an s/logical/logic/... logical is the adjective, logic is the noun)
[7:22] <yanzheng> another issue. I think the caller of d_prune() need to hold parent dentry's d_lock
[7:23] <sage> you mean dentry->d_parent->d_lock, or the dentry->d_lock?
[7:23] <yanzheng> dentry->d_parent->d_lock
[7:23] <sage> in case it races with something that reattaches it somewhere else?
[7:24] <yanzheng> because we access dentry->d_parent->d_inode in ceph_d_prune
[7:24] <yanzheng> yes
[7:24] <sage> oh
[7:24] <sage> d_parent is stable in this case. question is whether d_inode can get cleared..
[7:24] <yanzheng> yes
[7:25] <sage> ooh. comment is
[7:25] <sagelap> /*
[7:25] <sagelap> * we hold d_lock, so d_parent is stable, and d_fsdata is never
[7:25] <sagelap> * cleared until d_release
[7:25] <sagelap> */
[7:25] <sagelap> ceph_dir_clear_complete(dentry->d_parent->d_inode);
[7:25] <sage> *that* is why the flag used to go in the dentry info and not the ceph_inode. we changed it back recently
[7:26] <sagelap> a8673d61ad77ddf2118599507bd40cc345e95368
[7:27] <yanzheng> yes
[7:27] <sage> i see. true for dentry_kill, but not for d_prune_aliases
[7:28] * Jedicus (~user@108-64-153-24.lightspeed.cicril.sbcglobal.net) has joined #ceph
[7:28] <yanzheng> if we revert that patch, how to handle the locking order issue
[7:28] <sage> not sure if it is true for shrink_dcache_for_unmount_subtree, the other d_prune caller
[7:29] <sage> can't we just use an atomic bit op for it? then there's no lock
[7:29] <yanzheng> who cares umounting fs
[7:30] <sage> as long as d_inode doesn't get cleared
[7:32] <sage> hmm, the original version of those helpers didn't take a dentry ref and didn't call dput
[7:35] <sage> as long as the callers hold i_lock, the inode->i_dentry will be stable and we can skip the get/put.
[7:35] <sage> or know that we won't actually sleep in dput because we hold one of the refs?
[7:39] * davidzlap1 (~Adium@cpe-75-84-249-188.socal.res.rr.com) Quit (Quit: Leaving.)
[7:40] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[7:41] <sage> yeah, seems to me like the dput() sleep issue isn't really there?
[7:41] <yanzheng> i think we can skip the get/put when holding the i_lock
[7:41] <sage> or even export __d_find_any_alias(), since we know the dput won't sleep if we hold a ref. either way..
[7:41] <yanzheng> what do you mean "because we hold one of the refs"
[7:41] <sage> oh
[7:42] <yanzheng> how do we hold a ref
[7:42] <sage> i'm thinking that the presence on the i_dentry list is a ref, but i guess it's not
[7:42] <yanzheng> dentry holds a reference to inode, but inode does not have reference to dentry
[7:43] <sage> yeah ok
[7:43] <sage> but another racing dput will block on i_lock when unlinking, so we'd still be okay
[7:48] <sage> off to bed, ttyl!
[7:48] <yanzheng> you mean rewrite ceph_dir_set_complete(), set the flag while holding the i_lock?
[7:53] * tnt (~tnt@91.176.3.64) has joined #ceph
[8:02] * Jedicus (~user@108-64-153-24.lightspeed.cicril.sbcglobal.net) Quit (Quit: Leaving)
[8:06] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[8:13] * allsystemsarego (~allsystem@5-12-241-157.residential.rdsnet.ro) has joined #ceph
[8:13] * Mithril (~Melkor@208.175.141.7) Quit (Read error: Connection reset by peer)
[8:13] * Mithril (~Melkor@208.175.141.7) has joined #ceph
[8:23] * Mithril (~Melkor@208.175.141.7) Quit (Read error: Connection reset by peer)
[8:23] * Mithril (~Melkor@208.175.141.7) has joined #ceph
[8:25] * odyssey4me (~odyssey4m@165.233.71.2) has joined #ceph
[8:25] * AfC (~andrew@2407:7800:200:1011:a17b:e651:3b26:70d2) Quit (Quit: Leaving.)
[8:38] * mschiff (~mschiff@85.182.236.82) has joined #ceph
[8:40] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[8:40] * ChanServ sets mode +v andreask
[8:44] * mschiff (~mschiff@85.182.236.82) Quit (Remote host closed the connection)
[8:49] <nerdtron> any commands on how to benchmark the rd/wr speed of ceph?
[9:00] <wogri_risc> rados bench
[9:01] <nerdtron> wogri_risc how do you use rados bench?
[9:04] <odyssey4me> nerdtron: Here're some examples - http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/ and http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[9:04] <wogri_risc> nerdtron: rados bench <seconds> write|seq|rand [-t concurrent_operations] [--no-cleanup]
[9:04] <wogri_risc> default is 16 concurrent IOs and 4 MB ops
[9:04] <wogri_risc> default is to clean up after write benchmark
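Putting that together, a typical run might look like this (pool name illustrative):

    rados -p testpool bench 60 write -t 16 --no-cleanup    # 60s write test, keep the objects
    rados -p testpool bench 60 seq                         # sequential reads of those objects
    rados -p testpool bench 60 rand                        # random reads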
[9:07] <odyssey4me> I have a challenge I'd really like to overcome. Most of my ceph cluster clients are also cluster members, but I need a common shared storage across almost all of them. The common shared storage needs to be mounted to a folder.
[9:08] <nerdtron> cephfs sound fit here
[9:08] <odyssey4me> I tried using the kernel module, but I seemed to get hangs from time to time - especially when working with large files in the mapped folder.
[9:08] <nerdtron> sudo ceph-fuse /path/to/folder
[9:08] <nerdtron> although this will result in one huge shared folder
[9:09] <odyssey4me> I'm told that CephFS isn't ready for production usage... that there are issues... but I'd like to know whether this can work for my use-case, assuming I accept the lack of HA for the MDS?
[9:09] <nerdtron> I'm using cephfs for opennebula
[9:09] <odyssey4me> nerdtron - yes, it needs to be one large shared folder per cluster
[9:09] <nerdtron> lack of HA for the MDS - what does this mean?
[9:10] <odyssey4me> I'm wanting to use it for the instance storage for an Openstack compute cluster.
[9:10] <nerdtron> try ceph-fuse..
[9:11] <odyssey4me> nerdtron: http://ceph.com/dev-notes/cephfs-mds-status-discussion/
[9:11] <nerdtron> or create a big RBD and mount it on one machine then share it to others... but that would slow down when big files are written
[9:12] <nerdtron> ahhh..yes, only one mds is active on my cluster
[9:13] <odyssey4me> nerdtron - yeah, I want to avoid having all file access go over the network...
[9:13] <odyssey4me> nerdtron - you're using it at opennebula? would you mind sharing the use-case?
[9:14] <nerdtron> sure do.. and it's quite easy
[9:14] <odyssey4me> (and some info about the config - ie how you've done the mapping, etc)
[9:14] <odyssey4me> and what's the size of your installation, and what's the setup being used for
[9:14] <nerdtron> 3 nodes for ceph and 3 nodes for opennebula
[9:15] <nerdtron> you don't have to worry about mapping... opennebula will map the RBD on the fly for you
[9:16] <odyssey4me> I see that the one issue is the MDS HA, but that you can do an active/passive config.
[9:17] <odyssey4me> I also see that another issue is scale, but there's not much info on when the scale problem kicks in.
[9:19] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:20] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[9:21] <nerdtron> you don't have to worry about mds when you do open nebula
[9:22] <nerdtron> because you don't have to use cephfs when you need shared storage for all of them
[9:22] <nerdtron> I assume that you need a big storage for your vm images in openstack?
[9:22] <odyssey4me> nerdtron - but I won't be using opennebula :)
[9:26] * tnt (~tnt@91.176.3.64) Quit (Ping timeout: 480 seconds)
[9:29] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[9:31] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[9:34] * aciancaglini (~quassel@5-157-112-19.v4.ngi.it) Quit (Remote host closed the connection)
[9:37] * mschiff (~mschiff@p4FD7F08D.dip0.t-ipconnect.de) has joined #ceph
[9:38] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:38] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:39] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[9:39] * capri_on (~capri@212.218.127.222) has joined #ceph
[9:39] * capri (~capri@212.218.127.222) Quit (Ping timeout: 480 seconds)
[9:44] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) has joined #ceph
[9:45] <odyssey4me> joao - any comments on using CephFS in production?
[9:45] <odyssey4me> It would seem to me that it's simple enough to setup the MDS in an active-passive configuration.
[9:46] <yanzheng> it's unlikely that your data gets corrupted, but you may experience some hangs
[9:47] <yanzheng> in most cases, restarting the mds can resolve the hang
[9:49] <odyssey4me> yanzheng - ok, so monitoring, alerting and self-healing (auto-restart) is very important
[9:50] <nerdtron> how do you set the replication from 2 to 3?
[9:50] <odyssey4me> I think I'm going to give it a try - my only concern is that my test environment isn't big enough to test it to scale.
[9:51] <wogri_risc> nerdtron: you have to change the size of the corresponding pool.
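Concretely, something like (pool name illustrative):

    ceph osd pool set <poolname> size 3        # number of replicas
    ceph osd pool set <poolname> min_size 2    # optional: minimum replicas needed to accept writes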
[9:57] <yanzheng> odyssey4me, it's unlikely you will encounter problems with data IO. the data IO path for cephfs is similar to rbd
[9:58] <yanzheng> for metadata operations, a single mds is the bottleneck.
[9:59] <yanzheng> it doesn't scale
[10:02] <nerdtron> that is why you shouldn't use it for large I/O operations, but cephfs is still good if you want storage with replication
[10:06] <yanzheng> large data I/O operations are OK. metadata intensive operations aren't
[10:10] <topro> yanzheng: that's true for the time being until multiple active MDS are fully supported, right?
[10:10] <nerdtron> well according to the docs it is not recommended
[10:11] <topro> ^^ not yet
[10:11] <topro> as this feature is not sable enough yet. but eventually will. at least that is my understanding
[10:11] <topro> s/sable/stable/
[10:12] <yanzheng> actually single mds still has lots of room for optimization
[10:12] <topro> well, looking at the performance of my single-mds cephfs cluster I believe thats true ;)
[10:14] <nerdtron> actually mds will not be "stable enough" in the near future http://ceph.com/dev-notes/cephfs-mds-status-discussion/
[10:15] <topro> nerdtron: does that mean that in the meantime it could even become worse than it actually is?
[10:16] * LeaChim (~LeaChim@176.248.81.121) has joined #ceph
[10:16] <nerdtron> nope.. I'm currently using it in my cluster
[10:17] <nerdtron> as emergency storage when i need to hold gigabytes of data when no servers are available :)
[10:20] <topro> well, seems like i'm one of the "brave" ones, using it in production. its concepts are great and it already works ok once you learned what works and what doesn't, and how to fix things like getting stalled mds to work again
[10:21] <nerdtron> how many mds do you run? i don't want to add more mds, i haven't experimented with it yet
[10:21] <loicd> ccourtaut: I think you'll be able to run teuthology by the end of the day :-) It turns out to be not so complicated after all.
[10:24] * dosaboy_alt (~dosaboy_a@faun.canonical.com) has joined #ceph
[10:24] * dosaboy_alt (~dosaboy_a@faun.canonical.com) Quit ()
[10:25] <topro> nerdtron: 3 nodes with 1MON, 1MDS and 3OSD each. but only one MDS is active at a time, the others get started as "hot standby" MDS automatically (unless you configure ceph to have more than one active MDS, but thats what is explicitly NOT RECOMMENDED a.t.m.)
[10:25] * dosaboy (~dosaboy@faun.canonical.com) has joined #ceph
[10:26] <nerdtron> 1MON for three nodes???
[10:26] <nerdtron> what if the mon goes down??
[10:26] <nerdtron> topro anyway, how do you configure a hot standby mds? i think i like that idea
[10:26] <topro> read again: 3 nodes with 1mon(each) 1mds(each) and 3osd(each). so a total of 3mon 3mds and 9 osd
[10:26] <nerdtron> sorry my bad
[10:27] <topro> NM
[10:27] <nerdtron> mine is 3nodes, 1 mon each 2 osd each, and only one mds is configured...
[10:28] <topro> don't know about ceph-deploy as i deployed manually. there you have ceph.conf. and in mds section you can configure how many active mds are allowed, 1 is default if not configured (at least with cuttlefish)
[10:29] <topro> see http://ceph.com/docs/master/cephfs/mds-config-ref/ parameter+ "max mds"
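A rough ceph.conf sketch of the layout topro describes, assuming mds daemons named a/b/c on three nodes (hostnames illustrative); any daemons beyond the "max mds" active count come up as standby:

    [mds]
        max mds = 1
    [mds.a]
        host = node1
    [mds.b]
        host = node2
    [mds.c]
        host = node3

The active count can also be changed at runtime with something like 'ceph mds set_max_mds <n>', though running more than one active MDS is the part that is explicitly not recommended.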
[10:30] <topro> actually I have problems with 0.61.3+ with MDS restart (cache size problem preventing the MDS from getting back to normal operation) maybe that's related to me having standby MDS. so if you don't have those issues, maybe stay away from standby MDS for the moment
[10:31] <nerdtron> hmm... okay i'll have a read and see if it's worth the risk.. I have about 700GB of data on the cephfs and i don't wanna lose it
[10:31] <nerdtron> if my MDS node fails, what will happen to my cephfs??
[10:32] * athrift (~nz_monkey@203.86.205.13) Quit (Remote host closed the connection)
[10:32] <topro> cephfs I/O stalls until you got MDS back up again. normally clients can continue I/O then without trouble
[10:33] <topro> at least that is what I am experiencing here (0.61.7)
[10:33] * athrift (~nz_monkey@203.86.205.13) has joined #ceph
[10:34] * shimo (~A13032@122x212x216x66.ap122.ftth.ucom.ne.jp) Quit (Read error: Operation timed out)
[10:34] * AfC (~andrew@2001:44b8:31cb:d400:e11d:6d64:9ea7:85f4) has joined #ceph
[10:35] <odyssey4me> yanzheng - what counts as a metadata operation? access times and such? my thinking is that I'
[10:35] <yanzheng> rename, link, unlink, atime .....
[10:35] <odyssey4me> I'll likely mount without most attributes as I don't need them. I pretty much just need access to the shared storage from a cluster member.
[10:35] <guppy_> almost anything except reading/writing data is a metadata operation
[10:38] * KindTwo (KindOne@50.96.224.178) has joined #ceph
[10:38] * brzm (~medvedchi@node199-194.2gis.com) has joined #ceph
[10:39] <brzm> hello
[10:39] <brzm> does anybody use http://github.com/ceph/ceph-cookbooks?
[10:41] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[10:41] <odyssey4me> brzm - yes, I've been adding some stuff to them recently
[10:41] * athrift (~nz_monkey@203.86.205.13) Quit (Remote host closed the connection)
[10:41] * KindTwo is now known as KindOne
[10:42] <brzm> odyssey4me: well, i have some issues on debian 7. with same config ubuntu precise works flawlessly
[10:43] <odyssey4me> brzm - hmm, I'm using Ubuntu... where're the problems coming in?
[10:44] * athrift (~nz_monkey@203.86.205.13) has joined #ceph
[10:44] <brzm> odyssey4me: the monitors say: http://pastebin.com/N20puMpt
[10:44] * bergerx_ (~bekir@78.188.101.175) has joined #ceph
[10:44] <brzm> odyssey4me: not so good logging stuff. i cannot even determine which lines are errors and which are not :(
[10:46] <brzm> mon_status output: http://pastebin.com/a0PC8Tnw
[10:47] <odyssey4me> ok, let's take a step back
[10:47] <odyssey4me> what's the symptom?
[10:47] <brzm> not deploying
[10:47] <odyssey4me> is the chef-client run completing on all nodes successfully?
[10:47] <brzm> no
[10:47] * jcfischer (~fischer@peta-dhcp-3.switch.ch) has joined #ceph
[10:47] <brzm> it stucks on bootstrap
[10:47] <odyssey4me> ok - paste me a log from the chef run
[10:48] <brzm> a sec...
[10:48] <brzm> * ruby_block[get osd-bootstrap keyring] action run[2013-08-13T15:48:13+07:00] INFO: Processing ruby_block[get osd-bootstrap keyring] action run (ceph::mon line 96)
[10:48] <odyssey4me> pastebin the whole run so that I can see the context
[10:48] <jcfischer> morning - coming back from holidays (after replacing 4 OSD disks just before I left), my cluster has on pg with two unfound objects. How can I determine what file/image/... they belong to?
[10:49] <brzm> odyssey4me: ok
[10:51] * athrift (~nz_monkey@203.86.205.13) Quit (Remote host closed the connection)
[10:52] * KindTwo (~KindOne@h97.55.186.173.dynamic.ip.windstream.net) has joined #ceph
[10:52] <brzm> odyssey4me: http://pastebin.com/9jR7r2Lk
[10:53] * athrift (~nz_monkey@203.86.205.13) has joined #ceph
[10:53] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[10:55] <odyssey4me> brzm - is this the first server in your cluster, or do you have an established cluster?
[10:55] <brzm> odyssey4me: first server
[10:56] <brzm> odyssey4me: creating new clean cluster
[10:57] <brzm> wow, "0.0.0.0:6789\/0" appears as a hint. wat?
[10:57] <odyssey4me> brzm - ok, I take it you've set the ['ceph']['config']['fsid'] and ['ceph']['config']['mon_initial_members'] attributes?
[10:58] <odyssey4me> brzm - and if you run 'ceph auth get-key client.bootstrap-osd' on that server from the shell, what do you see?
[10:58] * KindTwo (~KindOne@h97.55.186.173.dynamic.ip.windstream.net) Quit (Read error: Connection reset by peer)
[11:01] * shimo (~A13032@122x212x216x66.ap122.ftth.ucom.ne.jp) has joined #ceph
[11:02] * yanzheng (~zhyan@jfdmzpr05-ext.jf.intel.com) Quit (Quit: Leaving)
[11:07] <odyssey4me> brzm - why would the chef-client receive SIGINT... that's odd. Stop the service and any monitoring daemons you have running (eg: monit, etc) if you can.
[11:07] <brzm> odyssey4me: cannot reproduce; chef-client refuses to generate the config file and everything broke
[11:08] <odyssey4me> brzm - pastebin the new run log
[11:09] <brzm> http://pastebin.com/0f4FtF1V
[11:12] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[11:12] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[11:12] <odyssey4me> brzm - access the node's data (via web interface or via 'knife node edit') and delete the 'ceph' xml node with all its sub-attributes. Only do this if you're happy to wipe this cluster, which I assume you are?
[11:13] <odyssey4me> then also purge the ceph packages before re-running the chef client
[11:21] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[11:21] * KindOne (KindOne@198.14.201.231) has joined #ceph
[11:23] <brzm> found it. if the file /var/run/ceph-mon.*.asok exists everything is broken )
[11:23] <brzm> even if there is no ceph at all
[11:24] * athrift (~nz_monkey@203.86.205.13) Quit (Remote host closed the connection)
[11:24] <odyssey4me> brzm - yup, the library is checking for that file... which was created as part of the initiation...
[11:25] * athrift (~nz_monkey@203.86.205.13) has joined #ceph
[11:26] * toMeloos_ (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) has joined #ceph
[11:27] * athrift (~nz_monkey@203.86.205.13) Quit (Remote host closed the connection)
[11:28] <loicd> it looks like 4GB is not enough for a teuthology target
[11:28] <loicd> s/4GB/2GB/
[11:28] <loicd> trying 4GB
[11:28] * athrift (~nz_monkey@203.86.205.13) has joined #ceph
[11:29] <loicd> terminate called after throwing an instance of 'std::bad_allo\
[11:29] <loicd> c'
[11:29] <loicd> INFO:teuthology.orchestra.run.err:[10.145.3.7]: what(): std::bad_alloc
[11:29] * toMeloos_ (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[11:30] * toMeloos_ (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) has joined #ceph
[11:31] <brzm> odyssey4me: ok, got old situation here. SIGINT because of me: chef-client hangs.
[11:31] <brzm> # ceph auth get-key client.bootstrap-osd
[11:31] <brzm> 2013-08-13 16:31:45.318741 7fe96781c700 0 -- 10.100.254.102:0/127005 >> 10.100.254.104:6789/0 pipe(0x1d75000 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault
[11:32] <brzm> and it's obvious: mon in probing state
[11:32] <brzm> only bootstrap commands works
[11:33] <odyssey4me> brzm - try purging the packages, check that nothing's left, change the fsid to a new uuid, then re-run
[11:33] * jcfischer (~fischer@peta-dhcp-3.switch.ch) Quit (Read error: Connection reset by peer)
[11:34] * jcfischer (~fischer@peta-dhcp-3.switch.ch) has joined #ceph
[11:34] <brzm> odyssey4me: how can it help? maybe there is some method to diagnose why the monitors don't want to interact?
[11:37] <odyssey4me> brzm - there may be, but in this case (where there is only the monitor on the server you're running this) it would seem that the mon didn't complete its initial setup
[11:40] <brzm> odyssey4me: ok, i'll try bootstrap mon by hand and find out why chef not working
[11:40] <brzm> odyssey4me: is there way to start up with one mon?
[11:40] <odyssey4me> brzm - chef isn't working because the mon setup isn't working
[11:41] <odyssey4me> brzm - it may be best to try and setup your cluster by hand first to make sure that everything can work on the platform you're doing it with
[11:41] <odyssey4me> once you've confirmed that, then try with the automation
[11:41] <brzm> odyssey4me: well, i said exactly this )
[11:42] <brzm> odyssey4me: and again: is there a way to work with one mon?
[11:42] * nhm (~nhm@184-97-255-87.mpls.qwest.net) Quit (Read error: Operation timed out)
[11:42] <odyssey4me> brzm - yes, a cluster can work with one mon
[11:43] <odyssey4me> and the chef recipes also work with one mon
[11:43] <brzm> odyssey4me: when i build the cluster by hand it works only when the second mon appears
[11:44] <brzm> odyssey4me: probing... probing... probing... second mon appears. ok. quorum. nice. everything works.
[11:44] <odyssey4me> brzm - that's not right... it may be some of the other attributes you've set, for instance your quorum minimum and first mon
[11:45] <brzm> odyssey4me: no.
[11:46] * yy (~michealyx@122.233.47.137) has left #ceph
[11:46] <brzm> odyssey4me: how to set quorum minimum?
[11:47] * AfC (~andrew@2001:44b8:31cb:d400:e11d:6d64:9ea7:85f4) Quit (Quit: Leaving.)
[11:47] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[11:47] * kyann (~kyann@did75-15-88-160-187-237.fbx.proxad.net) has joined #ceph
[11:50] <brzm> It works!
[11:51] <odyssey4me> brzm :) I take it you changed the quorum minimum?
[11:51] <brzm> jeeez, so stupid mistake: mon initial members in fqdn
[11:51] <odyssey4me> aha
[11:51] <brzm> odyssey4me: no, only two monitors works
[11:51] <brzm> two or more
[11:52] <brzm> odyssey4me: ran chef-client. it hangs. wait. wait. ran second server. and both started up.
[11:52] <brzm> odyssey4me: i don't know why, but single mon just not working :(
[11:52] <odyssey4me> odd, I haven't seen that behaviour
[11:52] <odyssey4me> not with Ubuntu
[11:53] <brzm> odyssey4me: sorry for wasting your time :(
[11:53] <brzm> odyssey4me: many thanks anyway )
[11:53] <odyssey4me> brzm - no problem at all, happy to help by being a sounding board
[11:55] * yeled (~yeled@spodder.com) Quit (Quit: meh..)
[11:59] * yeled (~yeled@spodder.com) has joined #ceph
[12:06] * brzm (~medvedchi@node199-194.2gis.com) has left #ceph
[12:10] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[12:11] <verdurin> Is an RPM of the ceph-deploy 1.2 release available?
[12:15] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[12:15] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[12:21] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[12:25] <loicd> ccourtaut: teuthology works :-)
[12:25] <loicd> INFO:teuthology.run:Summary data:
[12:25] <loicd> {duration: 571.2752001285553, flavor: basic, owner: ubuntu@teuthology, success: true}
[12:25] <loicd> INFO:teuthology.run:pass
[12:26] <loicd> I'll cleanup the walkthru so that you can give it a try
[12:32] * huangjun (~kvirc@59.173.200.119) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[12:35] * guppy_ (~quassel@guppy.xxx) Quit ()
[12:37] * guppy (~quassel@guppy.xxx) has joined #ceph
[12:39] * pressureman (~pressurem@62.217.45.26) has joined #ceph
[12:39] * sha (~kvirc@81.17.168.194) has joined #ceph
[12:41] <pressureman> i have a couple of format-1 RBD images that i want to convert to format-2... is it a reasonable approach to export to stdout, and import from stdin into a format-2 image?
[12:41] <pressureman> e.g. rbd -p libvirt-pool --no-progress export OLD_IMAGE - | rbd -p libvirt-pool --image-format 2 import - NEW_IMAGE
[12:43] * Almaty (~san@81.17.168.194) has joined #ceph
[12:43] <sha> how do we mark an osd in if we marked it down by mistake? we wanted osd.5 and osd.6 got pressed(((( degradation has started
[12:44] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:48] <sha> ceph osd in osd.5 helped us
[12:49] * sha (~kvirc@81.17.168.194) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[12:49] * Almaty (~san@81.17.168.194) Quit (Quit: Ex-Chat)
[13:05] * sha (~kvirc@81.17.168.194) has joined #ceph
[13:05] * mathlin (~mathlin@dhcp2-pc112059.fy.chalmers.se) Quit (Read error: Connection reset by peer)
[13:08] * mathlin (~mathlin@dhcp2-pc112059.fy.chalmers.se) has joined #ceph
[13:18] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[13:18] * ChanServ sets mode +v andreask
[13:27] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[13:36] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[13:36] <ccourtaut> loicd: great!
[13:57] <jcfischer> Is there a way to determine which files are affected by unfound pieces of a pg?
[13:57] <absynth> err
[13:57] <absynth> there was something like ceph pg dump or whatnot
[13:58] <absynth> don't remember sorry
[13:58] <jcfischer> ceph pg dump shows me which OSD something is on, but I can't find out what this belongs to (I think)
[13:59] <Gugge-47527> you want to list the objects in a given pg?
[13:59] <jcfischer> yep
[13:59] <Gugge-47527> only way i know of is to list the files in the pg dir on the osd
[14:00] <ofu_> yes, find helps
[14:00] <Gugge-47527> and unfound objects wont be there i think :)
[14:00] <jcfischer> that is kind of the problem :)
[14:01] <Gugge-47527> if your locate db is indexing the osd dirs, and you havent updated it lately, i guess that could help you :)
[14:01] <ofu_> good idea. not.
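For what it's worth, the on-disk listing mentioned above would look roughly like this on the OSD host (osd id and pg id illustrative, default data path assumed):

    ls /var/lib/ceph/osd/ceph-3/current/5.1ab_head/
    find /var/lib/ceph/osd/ceph-3/current/5.1ab_head/ -type f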
[14:01] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:01] <Josh_> Every night before I leave I will have 4 OSDs running initctl shows all 4 (on ubuntu), then when I come in the next morning and run the command again I only see 1 running. When looking at the logs of one of the OSDs not listed I see lock_fsid failed to lock /var/lib/ceph/osd/ceph-0/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable. Anyone have any idea what might be causing this and a workaround I can use? T
[14:02] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:03] <jcfischer> I guess I will have to mark the two unfound pieces lost and hope that it wasn't something critical
[14:07] <liiwi> (wom 12
[14:07] <liiwi> erp
[14:12] <verdurin> alfredodeza: any news on RPMs of ceph-deploy 1.2?
[14:12] <alfredodeza> verdurin: let me check
[14:12] <alfredodeza> a build was triggered last night
[14:13] <alfredodeza> there was nothing ready until I logged off, so I did not ping you
[14:14] <verdurin> alfredodeza: thanks. I tried myself but it appears some of the deps. aren't in EPEL 6
[14:16] <verdurin> alfredodeza: actually, it's EPEL 5 that has the problem. A local build for EPEL 6 works.
[14:16] <alfredodeza> I see
[14:16] <alfredodeza> verdurin: I am looking here btw: http://ceph.com/packages/ceph-extras/rpm/
[14:17] <verdurin> alfredodeza: thanks - I didn't know about that one
[14:17] * markbby (~Adium@168.94.245.4) has joined #ceph
[14:19] <alfredodeza> verdurin: so I see 1.2 for redhat http://ceph.com/packages/ceph-extras/rpm/rhel6/noarch/
[14:22] <jcfischer> hmm - I tried to mark the unfound pieces lost, but ceph says: "pg has 2 objects but we haven't probed all sources" - how do I force it to requery the OSDs? (all are available)
[14:24] <verdurin> alfredodeza: I think you need to update the yum repo files for those extra packages
[14:24] <jcfischer> this is the result of the query command: http://pastebin.com/NfQuXUs8
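The usual sequence for this case is roughly (pg id illustrative):

    ceph pg 2.5 list_missing                 # show the unfound objects and the probed sources
    ceph pg 2.5 query                        # check "might_have_unfound" for osds not yet probed
    ceph pg 2.5 mark_unfound_lost revert     # only once all possible sources have been probed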
[14:24] <verdurin> otherwise, the older ceph-deploy release in the main Ceph repo overrides them
[14:25] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[14:25] <alfredodeza> I am not sure I am following, what do you mean by updating the yum repo?
[14:26] <mozg> hello guys
[14:26] <mozg> i am having an issue with some pgs which are not becoming active+clean for some reason
[14:26] <mozg> is anyone here who could help me with debugging the issue?
[14:27] <verdurin> The release file e.g. http://ceph.com/packages/ceph-extras/rpm/centos6/noarch/ceph-deploy-release-1-0.noarch.rpm
[14:27] <verdurin> points at the main repo, not the extras one
[14:27] <mozg> i have 19 active and 25 peering pgs
[14:27] <mozg> which are stuck like that for about 5 hours or so
[14:28] <mozg> not sure if this is the cause, but my cluster is running terribly slow
[14:28] <kraken> ≖_≖
[14:28] <mozg> and i also had slow requests during this period
[14:33] * pressureman (~pressurem@62.217.45.26) Quit (Remote host closed the connection)
[14:34] <verdurin> alfredodeza: when I try to run the 1.2.1 release, it complains that the version of python-pushy is too old
[14:34] <alfredodeza> ok
[14:34] <alfredodeza> verdurin: I am creating a ticket so we can fix this
[14:34] <alfredodeza> thanks for noting
[14:36] <verdurin> alfredodeza: traceback here: http://ur1.ca/f1bpe
[14:46] <alfredodeza> verdurin: I have created issue 5947
[14:46] <kraken> alfredodeza might be talking about: http://tracker.ceph.com/issues/5947
[14:48] <alfredodeza> thanks for letting me know verdurin
[14:49] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[14:52] * yanzheng (~zhyan@101.83.115.166) has joined #ceph
[14:52] * stepan_cz (~Adium@2a01:348:94:30:91c7:e981:2e4c:e96a) has joined #ceph
[14:53] * stepan_cz (~Adium@2a01:348:94:30:91c7:e981:2e4c:e96a) has left #ceph
[14:53] * stepan_cz (~Adium@2a01:348:94:30:91c7:e981:2e4c:e96a) has joined #ceph
[14:55] <Josh_> Would anyone happen to have any thoughts on my previous post? Thank you
[14:55] * KrisK (~krzysztof@213.17.226.11) Quit (Quit: KrisK)
[14:56] <mozg> does anyone know why I am getting these types of slow requests:
[14:56] <mozg> slow request 60.411805 seconds old, received at 2013-08-13 07:34:44.985532: osd_op(client.1794319.0:501409 rbd_data.1b558279e2a9e3.000000000000072e [write 2543616~4096] 5.729378d0 e14384) v4 currently no flag points reached
[14:56] <kraken> mozg might be talking about: https://jira.cmgdigital.com/browse/-13
[14:56] <mozg> and these: slow request 61.172531 seconds old, received at 2013-08-13 07:34:43.121862: osd_op(client.1805084.0:1295576 rbd_data.1b74c32ae8944a.0000000000000567 [write 1130496~98304] 5.5a5513ba e14383) v4 currently reached pg
[14:56] <kraken> mozg might be talking about: https://jira.cmgdigital.com/browse/-13
[14:56] <mozg> i've noticed that several of my osds are showing these
[14:57] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[14:57] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:58] <mozg> kraken: i get an error message trying to access your link
[15:03] <stepan_cz> Hi guys, again here for help regarding Ceph… I've been playing with it and all looked good (minor issues but managed to solve these) and when I've decided to give it a go for out testing Open Nebula cluster I've dumped everything and created new cluster from scratch (using pupet), but I have following issue: my PGs are assigned only to one OSD and although the pool have size set to 2, it never gets replicated properly, therefore all of
[15:04] <stepan_cz> and I have already set tunables to optimal "ceph osd crush tunables optimal"
[15:05] <stepan_cz> my problem is that I have no idea why the PGs are not assigned to 2 OSDs, tried recreating OSD, MONs… no luck
[15:06] <stepan_cz> any help/ideas appreciated ;), thanks
[15:07] <janos> stepan_cz - is the health of the cluster OK?
[15:07] <janos> or 50% degraded
[15:07] <stepan_cz> this is output "HEALTH_WARN 400 pgs degraded; 400 pgs stuck unclean; 1 mons down, quorum 1,2 0,1"
[15:08] <stepan_cz> note that I have 3 mons, currently one of them is down…
[15:08] <janos> can all the OSD's and mon's see each other?
[15:08] <janos> in the past that ended up being my issue. one bone-headed routing issue
[15:09] <stepan_cz> I believe so, they're all on same network, there shouldn't be any catch in between
[15:09] <janos> mine were all same network too ;)
[15:09] <stepan_cz> ceph osd tree will show both OSDs as up and in
[15:10] <janos> can you pastebin "ceph osd tree" result?
[15:11] <stepan_cz> http://pastebin.com/uEsrNP8W
[15:11] * nerdtron (~kenneth@202.60.8.252) Quit (Remote host closed the connection)
[15:12] <janos> that looks ok, sparse but ok
[15:12] <janos> the weight of zero on both is concerning
[15:13] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) Quit (Remote host closed the connection)
[15:13] <janos> i'd give those some weight
[15:13] <stepan_cz> yep, just basic… I was thinking about some issue with crushmap, because even if the OSDs would have some trouble with connectivity, the PGs still should have OSDs assigned as acting, no?
[15:14] <stepan_cz> ok will try that, thanks for tip, I just kept the defaults there
[15:14] <janos> sorry i have to head out - some database administration needs my attention
[15:14] <janos> best of luck!
[15:14] <stepan_cz> thanks!
[15:18] * huangjun (~kvirc@58.51.149.217) has joined #ceph
[15:20] <stepan_cz> janos: many thanks! looks like that was it, I've assigned some weights to all buckets in my crush-map and it looks like it's working now! they're getting to active+clean state… wow! many many thanks!
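The reweighting itself can be done without hand-editing the map, e.g. (ids and weights illustrative):

    ceph osd crush reweight osd.0 1.0
    ceph osd crush reweight osd.1 1.0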
[15:20] <Josh_> What does ceph-deploy osd activate do? Prepare seems to do everything and put the osd in
[15:22] <alfredodeza> Josh_: it calls ceph-disk-activate with a few args like --mark-init and --mount
[15:22] <alfredodeza> I *think* it basically activates the disk on the remote host
[15:23] <Josh_> So if, after issuing prepare, the OSD becomes up and IN, does that mean I have an issue? I removed all of my OSDs and now I am re-adding them and placing the journal on an SSD
[15:24] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[15:30] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Read error: Operation timed out)
[15:32] * nhm (~nhm@67-220-20-222.usiwireless.com) has joined #ceph
[15:35] * clayb (~kvirc@199.172.169.97) has joined #ceph
[15:35] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[15:46] * capri_on (~capri@212.218.127.222) Quit (Read error: Connection reset by peer)
[15:48] * Mithril (~Melkor@208.175.141.7) Quit (Quit: Leaving)
[15:49] * yanzheng (~zhyan@101.83.115.166) Quit (Ping timeout: 480 seconds)
[15:52] * aliguori (~anthony@32.97.110.51) has joined #ceph
[15:56] * nhm_ (~nhm@65-128-168-138.mpls.qwest.net) has joined #ceph
[16:01] * allsystemsarego (~allsystem@5-12-241-157.residential.rdsnet.ro) Quit (Quit: Leaving)
[16:01] * mschiff (~mschiff@p4FD7F08D.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[16:01] * mschiff (~mschiff@p4FD7F08D.dip0.t-ipconnect.de) has joined #ceph
[16:03] * nhm (~nhm@67-220-20-222.usiwireless.com) Quit (Ping timeout: 480 seconds)
[16:14] <huangjun> does the monmap include client records?
[16:18] * aliguori (~anthony@32.97.110.51) Quit (Remote host closed the connection)
[16:18] * markbby (~Adium@168.94.245.4) Quit (Remote host closed the connection)
[16:19] * loopy (~torment@pool-72-64-182-94.tampfl.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[16:19] * markbby (~Adium@168.94.245.4) has joined #ceph
[16:19] <huangjun> where should i look to get info about clients that are mounted on ceph or have mounted it before?
[16:19] * BillK (~BillK-OFT@124-169-72-15.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[16:22] * aliguori (~anthony@32.97.110.51) has joined #ceph
[16:23] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[16:25] <Kioob> Is there any news about recovery on cuttlefish? I have OSDs which "hang" (no IO, no network, no information in logs, just "IO" which never commits, and other OSDs are waiting on it); is it the same problem?
[16:27] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has left #ceph
[16:27] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[16:28] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Quit: sprachgenerator)
[16:28] <huangjun> Kioob: does the osd process still exist?
[16:29] <Kioob> yes
[16:29] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[16:29] <huangjun> what about the 'ceph health detail' output?
[16:29] <Kioob> and if I mark it as down (ceph osd down XX), it will recover after a long time (15 minutes, for example)
[16:30] * stepan_cz (~Adium@2a01:348:94:30:91c7:e981:2e4c:e96a) has left #ceph
[16:30] <huangjun> if you use the cmd to mark the osd down, it will come back up a few minutes (or seconds) later,
[16:32] <huangjun> ls
[16:32] <kraken> huangjun, this ain't your shell
[16:33] <huangjun> you can restart the osd and see what happens
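(A minimal sketch of the commands under discussion, assuming osd.12 on a sysvinit-managed host; the exact restart invocation varies with the distro and init system:)

    ceph osd down 12                   # mark osd.12 down in the osdmap; a healthy daemon reports back in shortly after
    /etc/init.d/ceph restart osd.12    # restart the daemon on the host that owns osd.12
    ceph -w                            # watch cluster events while it peers and recovers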
[16:34] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit ()
[16:37] * JM (~oftc-webi@193.252.138.241) has joined #ceph
[16:38] <Josh_> Using KVM, what is the best way to speed up access to rbd other than cache=true? When using RBD for vm images I get guest access at around 120MB/s. When I mount the rbd image on the virt host and place the raw image in there I get well over 300MB/s. How can I even this out?
[16:38] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[16:40] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[16:42] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) has joined #ceph
[16:45] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[16:48] <Kioob> I have a lot of noise in "ceph health detail", because of an other (old) problem, I will try to isolate that
[16:48] * Cube (~Cube@c-98-208-30-2.hsd1.ca.comcast.net) has joined #ceph
[16:49] * bandrus (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[16:49] * matt_ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[16:51] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[16:55] <Kioob> huangjun: but it's hard to restart the OSD... it's a production cluster. While the OSD is restarting, writes will be blocked
[16:59] <Gugge-47527> you dont have replication on your cluster?
[17:01] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:06] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[17:13] * huangjun (~kvirc@58.51.149.217) Quit (Ping timeout: 480 seconds)
[17:20] <sage> Josh_: flashcache or bcache beneath osds works well, if you have some ssds
[17:23] <Josh_> I don't think it is an issue with the OSDs, I think it is how they are accessed. Networks are all bonded 10Gb but throughput seems really slow when using rbd
[17:24] * sprachgenerator (~sprachgen@130.202.135.224) has joined #ceph
[17:28] * sagelap (~sage@76.89.177.113) Quit (Ping timeout: 480 seconds)
[17:28] * sagelap (~sage@2600:1012:b00a:7efe:884c:7afd:50fa:14a3) has joined #ceph
[17:30] * DarkAceZ (~BillyMays@50.107.55.36) Quit (Ping timeout: 480 seconds)
[17:30] * devoid (~devoid@130.202.135.227) has joined #ceph
[17:37] * jcfischer (~fischer@peta-dhcp-3.switch.ch) Quit (Quit: jcfischer)
[17:38] * alram (~alram@208.86.100.62) has joined #ceph
[17:45] <mozg> Josh_: are you having issues with slow performance?
[17:45] <mozg> me too
[17:48] <mozg> josh_: I've been doing some tests and it seems to me that there is an issue with single-thread performance
[17:48] <mozg> it is even worse when using 4k block size
[17:49] <mozg> however, try running multiple tests concurrently from the vm
[17:49] <mozg> you would get better performance figures
[17:49] <mozg> i've been testing and could easily achieve between 1.2GB/s and 1.6GB/s on my IP over infiniband link
[17:50] <mozg> however, single-thread performance, as you've pointed out, is around 120MB/s
[17:50] <mozg> is anyone here from Ceph/Inktank who could comment on the bottleneck of a single thread? Why is it limited to 120MB/s per thread?
[17:51] <mozg> has anyone investigated this issue?
[17:51] <bandrus> how fast is your network?
[17:51] <Gugge-47527> my guess is latency :)
[17:51] <Gugge-47527> a single read will never be as fast as a local disk
[17:51] <Gugge-47527> and a single thread does one read at a time
[17:51] <bandrus> sorry I see you answered that
[17:52] <mozg> gugge-47527: looking at the server performance with iostat i can see that a single thread dd for example would read from multiple osds
[17:52] <mozg> not sure if it is multiple osds at a time
[17:52] <kraken> ≖_≖
[17:52] <mozg> or just one after another
[17:53] <mozg> but my single osd could handle 160MB/s sequential
[17:53] <Josh_> mozg that is exactly correct. I have 2 10Gb adapters in bond mode 4, yet it runs at the same speed as if I had a single 1Gb/s adapter connected
[17:53] <Gugge-47527> mozg: a singlethreaded read is always one read at a time (unless there is some readahead in the driver)
[17:53] <mozg> so, should I not be getting speeds of around 160MB/s minus some small overhead?
[17:54] <mozg> Josh_: have you tried concurrency?
[17:54] <mozg> can you saturate your link with let's say 10 concurrent dd tests?
[17:54] <mozg> or 20 tests?
[17:54] <Josh_> I will try now
[17:55] <mozg> Gugge-47527, how would I check for the latency and what numbers are considered to be reasonable latency figures and what latencies would give me issues?
[17:55] <mozg> if my osds are 7k SAS disks
[17:56] <Gugge-47527> you could time how long it takes to read a single 4k block :)
[17:56] <Josh_> Single = 2147483648 bytes (2.1 GB) copied, 18.8246 s, 114 MB/s
[17:56] * JM (~oftc-webi@193.252.138.241) Quit (Remote host closed the connection)
[17:56] <Josh_> concurrent = 2147483648 bytes (2.1 GB) copied, 2.72399 s, 788 MB/s
[17:57] <mozg> josh_: what command do you use?
[17:57] <Josh_> dd if=/dev/zero of=output bs=8k count=256k; on multiple
[17:57] <mozg> try using 4k please?
[17:57] <mozg> i just want to check what you get
[17:58] <mozg> also, it might be a good idea to use iflag=direct
[17:58] <mozg> to make sure you are not reading from cache
[17:58] <mozg> also, you could do flash cache on your osd servers to check what is the disk to client speed
[17:59] <Josh_> dd if=/dev/zero of=output2 bs=4k count=256k iflag=direct; ?
[18:01] <joelio> oflag=direct, no?
[18:01] <Josh_> yea thats better
[18:02] <mozg> yeah
[18:02] <mozg> no oflag
[18:02] <mozg> sorry
[18:02] <mozg> oflag if you are writing
[18:02] <mozg> and iflag if you are reading
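(A rough sketch of the direct-I/O dd tests being traded here; file names, sizes and the number of parallel writers are arbitrary:)

    dd if=/dev/zero of=testfile bs=4k count=256k oflag=direct     # single-threaded direct write test
    dd if=testfile of=/dev/null bs=4k iflag=direct                # single-threaded direct read test
    for i in 1 2 3 4; do                                          # crude concurrency test: four direct writers at once
        dd if=/dev/zero of=testfile.$i bs=4k count=64k oflag=direct &
    done; wait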
[18:02] * tnt (~tnt@212-166-48-236.win.be) Quit (Read error: Operation timed out)
[18:04] <mozg> Josh_, by the way, do you have rbd cache enabled?
[18:04] <Josh_> 335544320 bytes (336 MB) copied, 14.876 s, 22.6 MB/s
[18:04] <mozg> also, what is your kvm cache settings when you start the vm?
[18:04] <Josh_> yes to rbd_cache
[18:04] <Josh_> writeback
[18:04] <mozg> okay
[18:04] <mozg> how many osds do you have?
[18:04] <mozg> and how many servers?
[18:05] * joao (~JL@2607:f298:a:607:9eeb:e8ff:fe0f:c9a6) has joined #ceph
[18:05] * ChanServ sets mode +o joao
[18:05] <Josh_> 20 OSDs over 5 servers; each OSD is a 1TB 10K drive with its journal on a 100K IOPS SSD
[18:05] <mozg> that is a good setup )))
[18:05] <mozg> your concurrent throughput should kick ass!!!
[18:05] <mozg> anyway, let me test it with my kit
[18:06] <Josh_> thanks, thats why I am really hoping to get this speed issue resolved
[18:06] <mozg> what kind of workloads are you expecting to have on this cluster?
[18:07] <joelio> using virtio for bus too I take it?
[18:07] <joelio> rather than emulated
[18:08] <Josh_> yes virtio and on windows using red hat virtio scsi. On the last version of ceph I was getting around 200MB/s average use with nothing big going on; under heavy loads I was seeing 1.4GB/s streaming for maybe 20 min at a time
[18:09] <joelio> wasn't there some issue with rbd caching on windows?
[18:09] <mozg> Josh_: i was getting crazy speeds with crystal mark on windows
[18:09] <mozg> with small file sizes
[18:09] <mozg> like 100mb
[18:10] <mozg> switching to 4gb file size i had reasonable performance
[18:10] <mozg> but hoping to get more
[18:10] <Josh_> let me test now on a win server
[18:10] <mozg> my problem is slow 4k reads/writes
[18:10] <mozg> especially random reads/writes
[18:11] <Josh_> Yea and it is worse on openstack, I am waiting for the next release because you cannot use rbd cache.
[18:11] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[18:11] <mozg> Josh_, by the way, what is your replica count?
[18:11] <joao> kyann, around?
[18:11] <mozg> Josh_, i am using cloudstack, which had support for rbd for about 6-8 months already
[18:11] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:12] * bergerx_ (~bekir@78.188.101.175) Quit (Quit: Leaving.)
[18:12] <Josh_> 2. I was looking at cloudstack but it can only bind to 1 mon
[18:12] <mozg> Josh_, here is what i have when writing single thread: 419430400 bytes (419 MB) copied, 15.0576 s, 27.9 MB/s
[18:13] <mozg> Josh_, that is not true
[18:13] <mozg> you could have multiple mons
[18:13] <mozg> via dns round robin
[18:13] <mozg> this is what i've got
[18:13] <Josh_> how fast is failover?
[18:14] * mschiff_ (~mschiff@p4FD7F08D.dip0.t-ipconnect.de) has joined #ceph
[18:14] * mschiff (~mschiff@p4FD7F08D.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[18:14] <mozg> Josh_, i think it's not down to CS
[18:14] <mozg> but it's an rbd thing
[18:15] <mozg> from what i recall it takes about 5 secs if one of the mons is down before it switches to the second one
[18:15] <Josh_> Fast speeds when using CephFS
[18:15] <mozg> however, you do not have vm downtime when the mon goes down
[18:15] <mozg> it's not affecting the vm
[18:15] <Josh_> I may have to look back into cloudstack, and you can use cache?
[18:16] * tnt (~tnt@91.176.3.64) has joined #ceph
[18:16] * DarkAceZ (~BillyMays@50.107.55.36) has joined #ceph
[18:16] <mozg> Josh_, by default caching is set to cache=none
[18:16] <mozg> however, there is a ticket in the pipeline to change that
[18:16] <mozg> but i've been running some tests
[18:16] <mozg> and i do not see a great deal of performance difference between cache=none and cache=writeback
[18:16] <mozg> perhaps it's only me
[18:17] * sagelap (~sage@2600:1012:b00a:7efe:884c:7afd:50fa:14a3) Quit (Read error: No route to host)
[18:17] <Josh_> correct and you can change default under libvirt to issue writeback. Can cloudstack give rbd cache = true though?
[18:17] <mozg> Josh_, forgot to ask, what kvm server os are you on
[18:17] <Kioob> Josh_: for reads, you can try to increase readahead. In my case I get between 2x and 3x the throughput by using 512k instead of the default 128k
[18:17] <mozg> and what qemu/libvirt do you use?
[18:17] <joelio> is there a way you can increase the concurrency for rbd in libvirt - afaik there are 16 threads per vm image - would like to increase that and test performance
[18:17] <Josh_> Ubuntu 13.04
[18:17] <Josh_> latest libvirt
[18:17] <mozg> that's good
[18:17] <Josh_> Yea how do I do that?
[18:18] <mozg> joelio, i would like to know that answer to that too
[18:18] <joelio> if I'm not mistaken of course!
[18:18] <Josh_> also Kioob how do you do readahead?
[18:18] <mozg> joelio, how do you change that?
[18:19] <joelio> mozg: no idea mate, hence why I'm asking if it's possible - or if I'm confused
[18:19] <Kioob> to check current value : cat /sys/block/rbd*/queue/read_ahead_kb
[18:19] <Kioob> to modify it : echo 512 > /sys/block/rbd22/queue/read_ahead_kb
[18:19] <Josh_> result 128
[18:19] <Josh_> do I need to shutdown all vms before modify?
[18:19] <Kioob> no
[18:20] <Kioob> but I do that *inside* the VM
[18:20] <mozg> kioob: i get an error message: no such file or directory when trying cat /sys/block/rbd*/queue/read_ahead_kb
[18:20] <mozg> what am i missing?
[18:20] <Josh_> not on the host?
[18:20] <joelio> that's for kernel mapped rbd right, not libvirt
[18:20] <Kioob> mozg: that example is for kernel client, sorry
[18:21] <mozg> kioob: like when you mount with mount -t ceph?
[18:21] <joelio> rbd map
[18:21] <joelio> not cephfs
[18:21] <Kioob> yes, rbd map
[18:21] <Josh_> <source protocol='rbd' name='rbd/Kaspersky:rbd_cache=true'/> <----------- will it help this?
[18:22] <joelio> Josh_: you need to add writeback/through too to ensure writes are flushed
[18:22] <joelio> depending on you rbd cache type
[18:22] <Josh_> correct already set to writeback
[18:23] <joelio> one of my libvirt xml defs.. https://gist.github.com/anonymous/d9a9bc08f77210eb8d18
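(Not the contents of that gist, but a minimal sketch of a libvirt rbd disk definition along the lines being compared; the monitor host, auth secret UUID and target device are placeholders:)

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source protocol='rbd' name='rbd/Kaspersky:rbd_cache=true'>
        <host name='mon1.example.com' port='6789'/>
      </source>
      <auth username='libvirt'>
        <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
      </auth>
      <target dev='vda' bus='virtio'/>
    </disk>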
[18:24] * KindTwo (KindOne@h51.215.89.75.dynamic.ip.windstream.net) has joined #ceph
[18:24] <Josh_> correct that is how mine look
[18:25] <joelio> cool
[18:25] <Josh_> echo 512 > /sys/block/rbd22/queue/read_ahead_kb <_----- It is not liking rbd22
[18:25] <mozg> Josh_, what performance do you get with the following command: dd if=output2 of=/dev/null bs=4k iflag=direct
[18:25] * jeff-YF (~jeffyf@67.23.117.122) has joined #ceph
[18:25] <mozg> I am only getting 2.7MB/s (((
[18:26] <mozg> which sucks donkey balls (((
[18:26] <Josh_> standby
[18:26] <mozg> thanks
[18:26] <joelio> I'd really like to get the speed of an individual vm up - I can mitigate by spreading at the application layer and throwing more VMs at it. My middleware doesn't support kernel mapping (native libvirt instead - which is fine), so I can create raid-0 (for example) from multiple RBDs
[18:26] <Gugge-47527> is 2.7MB/s that bad on 4K reads?
[18:26] <Josh_> 56782848 bytes (57 MB) copied, 6.89899 s, 8.2 MB/s
[18:27] <Gugge-47527> its under 2ms pr read :)
[18:27] <joelio> 17.1 MB/s on 4k for me
[18:27] <Kioob> Josh_: the 22 here is an example... you need to adapt to your real device
[18:27] <nhm_> joelio: more readers/writers per VM may help.
[18:27] <Kioob> for example (again), in Xen I use devices xvda, xvdb, etc
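(Inside a KVM guest whose virtio disk shows up as vda — an assumed device name — the same readahead tweak would look roughly like this; the setting is per-device and does not survive a reboot:)

    cat /sys/block/vda/queue/read_ahead_kb        # current value, typically 128
    echo 512 > /sys/block/vda/queue/read_ahead_kb # raise readahead to 512 KB
    blockdev --setra 1024 /dev/vda                # equivalent, but this knob is counted in 512-byte sectors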
[18:27] <Josh_> oh so that only helps things under rbd showmapped. I am accessing from libvirt xml
[18:27] <nhm_> joelio: also, multiple volumes per VM.
[18:27] <joelio> nhm_: where do I find such magic?
[18:28] <nhm_> joelio: I mean at your application layer, have more threads doing reads/writes
[18:28] <joelio> ahh, ok
[18:28] <mozg> Gugge-47527, yeah, it is so bad!!!
[18:28] <joelio> nhm_: what I'm after is increasing per-VM concurrency though
[18:28] <mozg> i was getting far better speeds when running nfs on the same hardware
[18:29] <nhm_> mozg: that's entirely possible depending on what you are doing and how your hardware is setup.
[18:29] <mozg> i was getting around 65-75k iops @ 4k
[18:29] <joelio> nhm_: via libvirt native. I've tested multiple volumes using a raid-0 and performance is definitely better
[18:29] <Josh_> Yea and I did read the article about hosting NFS on flashcache, however I was hoping to stay away from NFS
[18:30] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[18:30] * KindTwo is now known as KindOne
[18:30] <joelio> nhm_: seems silly though as all you're doing is increasing the concurrency - if this could be done 'upstream' it'd make management a lot easier
[18:30] <nhm_> mozg: verified coming to/from disk?
[18:30] <mozg> nhm_, the nfs tests were done on the same hardware (not identical, but the same).
[18:30] <Josh_> those are faster I think because it is using the FS cache
[18:30] <nhm_> mozg: I just mean, did you verify that it wasn't coming from cache?
[18:31] <mozg> nhm_, i had zfs + ssds for arc/zil, so i can't say that it was coming from the disk
[18:31] <mozg> however,
[18:31] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[18:31] <mozg> the 2.7mb/s speed which i've tested
[18:31] <mozg> didn't come from disk either
[18:31] <mozg> there was no disk activity on the storage server when i ran the test
[18:31] <mozg> as i've previously read this file
[18:31] <mozg> it cached it
[18:32] <mozg> so, the 4k reads were coming from the osd server's ram
[18:32] <Josh_> What type of speed do you see from a traditional SAN? I hate that traditional setup, that is why I like ceph so much, but I am wondering if this is related to libvirt drivers
[18:32] <nhm_> mozg: was this direct IO or buffered?
[18:33] <mozg> not really sure. but i was running the same dd tests as i am now - using iflag=direct
[18:33] <mozg> the difference is that i previously had nfs
[18:33] <mozg> which was running over zfs
[18:33] <mozg> and now i am using ceph + rbd
[18:35] <mozg> Gugge-47527, sorry, how did you get 2ms from 2.7mb/s 4k speed?
[18:36] <Gugge-47527> how many 4k reads is required for 2.7MB ?
[18:36] <mozg> 2.7*1024/4 ?
[18:36] <mikedawson> joshd: Do you think the bugfix in wip-librados-aio-flush could be beneficial to my issue? You mentioned the Bobtail branch, but it looks like Oliver upgraded to 0.56.6, so I'm confused. We are running 0.61.7 currently.
[18:36] <Gugge-47527> yes
[18:36] <Josh_> Be back in 20
[18:37] <mozg> so, i get 691 reads per second
[18:37] <Gugge-47527> if no readahead is used, yes
[18:38] <mozg> so, that comes to 1.45ms per read
[18:38] <mozg> is this good or bad?
[18:38] <mozg> i think the spinners are giving around 10ms for random, right?
[18:38] <Gugge-47527> around that yes
[18:39] <mozg> but my reads are seq, so i can't really compare it
[18:39] <Gugge-47527> and each layer disk+osd+network+rbd adds latency
[18:40] <mozg> i think all my reads are coming from osd server's ram
[18:40] <mozg> as i've ran the test many times
[18:41] <Gugge-47527> then you only have the osd+network+rbd latency :)
[18:42] <mozg> so, is it safe to assume that 1.45ms latency is not causing the issue with throughput? is the bottleneck something else?
[18:42] <Gugge-47527> latency is a bitch on singlethreaded reads
[18:42] <mozg> or not?
[18:43] <Gugge-47527> lets assume 128K reads, and still 691 reads pr second, what is your max speed then? :)
[18:44] <mozg> 86.4MB/s
[18:44] * davidzlap (~Adium@cpe-75-84-249-188.socal.res.rr.com) has joined #ceph
[18:44] <grepory> :/
[18:44] <Gugge-47527> and then lets assume you are not gonna have them in the osd host's ram, can you get 691 reads pr second then? :)
[18:45] <mozg> single thread - no
[18:45] <mozg> i think spinners can do around 120 / s
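(Spelling out the back-of-envelope math in this exchange, taking the 2.7MB/s figure above as given:)

    2.7 MB/s of 4 KB reads    ->  2.7 * 1024 / 4 ≈ 691 reads per second
    691 reads per second      ->  1000 / 691 ≈ 1.45 ms per read, as seen by a single thread
    691 reads/s at 128 KB     ->  691 * 128 / 1024 ≈ 86 MB/s
    a spinner at ~10 ms/read  ->  roughly 100-120 reads/s, i.e. about 0.4-0.5 MB/s at 4 KB for one thread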
[18:45] <mozg> have to go guys
[18:45] <mozg> back in about 2 hours
[18:45] <mikedawson> Josh_: you can enable cache=writeback using OpenStack, just have to update a py file, then relaunch the instance.
[18:53] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[18:59] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Quit: Leaving.)
[19:04] * dosaboy (~dosaboy@faun.canonical.com) Quit (Read error: Connection reset by peer)
[19:10] <loicd> sjust: running teuthology with valgrind on the stable branch on one OSD gives me a single error ( Syscall param msync(start) points to uninitialised byte(s) , occurs 5 times ) . Is it what is expected ? I'm trying to establish a baseline and see how much wip-5510 deviates from it ;-)
[19:10] <Josh_> So am I reading this right though that no matter what you do, libvirt will never be able to go faster than 120MB/s?
[19:11] <sjust> loicd: not sure, where is the msync happening?
[19:12] <loicd> sjust: http://pastebin.com/uhsyAD28 is the full remote/ubuntu@XXX/log/valgrind/osd.0.log.gz
[19:12] <loicd> stack trace included
[19:12] <Gugge-47527> Josh_: i would expect it to go faster with big enough reads and/or big queue :)
[19:12] <loicd> sjust: my bad : branch is master
[19:14] <loicd> sjust: and the teuthology yaml file is http://pastebin.com/adcs8yyN
[19:15] * davidzlap1 (~Adium@cpe-75-84-249-188.socal.res.rr.com) has joined #ceph
[19:15] * davidzlap (~Adium@cpe-75-84-249-188.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[19:18] <sagewk> loicd: probably leveldb
[19:21] * sagelap (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) has joined #ceph
[19:21] * sagelap (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) Quit ()
[19:23] * sjustlaptop (~sam@2607:f298:a:607:745d:e4ac:d953:5689) has joined #ceph
[19:30] <Kioob> Josh_: did you try to modify readahead inside the VM ?
[19:32] <mtl> Hi, just curious to see if there's any kind of a window for the dumpling release. I have a cluster running 0.67-rcX, and I'm waiting to upgrade to the official release before going live with things.
[19:32] * alfredodeza is now known as alfredo|afk
[19:33] <Josh_> I was not able to. I am not mounting the rbd, instead I call it in the source of my xml
[19:33] * clayb (~kvirc@199.172.169.97) Quit (Ping timeout: 480 seconds)
[19:34] * jcfischer (~fischer@peta-dhcp-3.switch.ch) has joined #ceph
[19:39] * sjustlaptop (~sam@2607:f298:a:607:745d:e4ac:d953:5689) Quit (Ping timeout: 480 seconds)
[19:40] <Josh_> in KVM what if I set the IO mode to threads?
[19:40] * clayb (~kvirc@69.191.241.59) has joined #ceph
[19:41] <Kioob> Josh_: *inside* the VM, it doesn't really matter
[19:42] <Kioob> from the VM point of view, it's just a blockdevice.
[19:44] <sjust> so the 4k/6-7k ios were on the osd data disk?
[19:45] <sjust> nhm_: ^
[19:45] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Read error: Operation timed out)
[19:45] <nhm_> sjust: Yes, but let me see if there is any overlap with the journal disks.
[19:45] <sjust> overlap?
[19:46] <nhm_> sjust: let me see how consistently that's data disks vs journal disks getting the 6-7k IOs.
[19:53] <nhm_> sjust: interesting. Many of those disks (but not entirely all) that have 4k vs 6-7k writes are journal disks. Many of the OSD disks are seeing ~120-150k writes. With 0.61.7, there are just more of them.
[19:54] <nhm_> makes sense I suppose since the journal is all direct IO.
[19:54] <sjust> nhm_: yeah, but we aggregate it, so it actually should be bigger I would think
[19:54] <sjust> still, that's secondary, nothing changed in the journal
[19:55] <sjust> what do you mean there are more of them
[19:55] <sjust> ?
[19:55] <sjust> the distribution is skewed towards larger writes?
[19:55] <sjust> or the same size writes at a lower rate?
[19:56] * yanzheng (~zhyan@101.82.231.181) has joined #ceph
[19:56] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[19:56] <nhm_> sjust: haven't looked into it deeply, but it appears that in both cases writes are aggregated to an extent on the data disks, but that more aggregated writes are being written to disk per second for 0.61.7 vs next-rc3.
[19:56] <nhm_> even if they are roughly the same size.
[19:57] * rturk-away is now known as rturk
[19:57] <sjust> that suggests that the flusher may be forcing a suboptimal write pattern?
[19:58] <sjust> if so, you could try increasing the smaller number to 4k
[19:58] <nhm_> sjust: maybe. I think I need to graph out the io sizes and writes to each disk over time in all cases.
[19:58] <sjust> that would prevent it from forcibly flushing until 4k ios
[19:58] * davidzlap1 (~Adium@cpe-75-84-249-188.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[19:58] <nhm_> sjust: meeting now, will try to play with it and get back to you later.
[19:58] <sjust> k
[20:00] <joshd> mikedawson: no, the code that fixes won't be run when caching is turned on
[20:01] <joshd> mikedawson: I've had another user report similar stalls with cache=writethrough, when they had hangs with no caching, so they could be hitting two bugs, or it could be a problem at a higher level than I thought that affects both with and without caching
[20:02] <nhm_> joshd: you have that under control btw?
[20:04] <grepory> nhm: would you say that, in terms of performance, a small ceph cluster in any way resembles a large ceph cluster? let's say, for example, 15 osd's + 3 mons vs 300 osd's + 10 mons
[20:04] <joshd> nhm_: yeah
[20:05] <nhm_> grepory: I would make a general statement that small storage systems tend to be easier to tune than large storage systems. :)
[20:05] <grepory> heh
[20:05] <nhm_> grepory: regardless if you are talking about ceph or not.
[20:05] <mikedawson> joshd: I've seen similar hangs during recovery/backfill, with writeback, and with cache=none. But I am having a tough time getting towards a root cause
[20:05] <grepory> fair enough.
[20:06] * Schelluri (~Sriram@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[20:08] <joshd> mikedawson: I'll add a bunch more tracing in a branch for librbd/librados, since the current stuff isn't enough for this case. That'll probably have to wait until tomorrow unfortunately
[20:09] <mikedawson> joshd: I'll be ready!
[20:09] <joshd> mikedawson: thanks!
[20:14] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[20:16] <Josh_> would it speed it up if I had less osds? what if I combined osds into raid pairs leaving me with only 5 osds?
[20:16] * yanzheng (~zhyan@101.82.231.181) Quit (Ping timeout: 480 seconds)
[20:17] * yanzheng (~zhyan@101.83.205.27) has joined #ceph
[20:18] * Schelluri (~Sriram@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit ()
[20:20] * Schelluri (~Sriram@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[20:20] * alfredo|afk is now known as alfredodeza
[20:24] * Sriram (~SChelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[20:24] <mjevans> Josh_ I am by no means an expert, but if you have only 10 OSDs I don't really see how combining them would speed things up. There are -lots- of other things you can inspect/tune. I would suggest you attempt to gain a better understanding of where your bottlenecks are and what you can do to adjust those parameters
[20:25] * mschiff_ (~mschiff@p4FD7F08D.dip0.t-ipconnect.de) Quit (Remote host closed the connection)
[20:30] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[20:30] <Josh_> Combining the osds was thought of earlier when it was suggested that each osd latency will contribute to the slowdown. The only thing it could be is rbd
[20:30] * Sriram (~SChelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Quit: Nettalk6 - www.ntalk.de)
[20:31] <joelio> Josh_: what kind of rados benches are you pressing?
[20:31] <mjevans> If that is the case it sounds more like a design flaw in ceph; latency from additional osds should be the least of your problems; though the bound of the greatest latency among the devices could be an issue.
[20:32] <mjevans> In that case it might be better to limit your performance pool to only your better OSDs and use the others for less important storage (EG: bulk / backups)
[20:32] <joelio> Josh_: http://ceph.com/w/index.php?title=Benchmark#RADOS_benchmark
[20:32] <kraken> php is just terrible
[20:33] <mjevans> kraken: it is both a 'fractal of bad design' and a clear example of scope creep / language cruft
[20:38] <Josh_> min lat: 0.048949 max lat: 4.6892 avg lat: 0.24874
[20:41] <joelio> Josh_: (only just got back from work, so not scrolled up completely) Is latency the issue for you? In what way if so?
[20:42] <Josh_> The issue is when I use rbd for virtual guests, the speed is very slow. rbd cache is on and using writeback. Earlier we were trying to figure out ways to speed up kvm machines using rbd
[20:50] <clayb> Sorry I'm rather ignorant of my Ceph stack but using KVM setup per (http://github.com/bloomberg/chef-bcpc) I can get ~9MB/sec using urandom as a source via dd if=/dev/urandom of=/tmp/t bs=4k count=40960 (167772160 bytes (168 MB) copied, 18.2135 s, 9.2 MB/s); it does look like the 9.2MB/sec is a quota (maybe a cgroup imposition) as 1M block size also limits at 9.2MB/sec.
[20:50] * mschiff (~mschiff@85.182.236.82) has joined #ceph
[20:51] <Kioob> clayb: isn't that just a limit of /dev/urandom itself ?
[20:52] <clayb> Yeah, it might be the rate Ubuntu 12.04 is happy letting urandom spew; a plain copy is ~714MB/sec:
[20:52] <clayb> ubuntu@vm:~$ dd if=/tmp/t of=/tmp/t2 bs=1M count=40960
[20:52] <clayb> 265+0 records in
[20:52] <clayb> 265+0 records out
[20:52] <clayb> 277872640 bytes (278 MB) copied, 0.388929 s, 714 MB/s
[20:52] <grepory> nhm_: do you have an updated version of http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/ or is that still largely relevant?
[20:59] <nhm_> grepory: no updated version, and we've made a bunch of changes since then. :D
[20:59] <nhm_> grepory: Some of it may be relevant though.
[21:05] * User3687 (~user3687@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[21:07] * User3687 (~user3687@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit ()
[21:09] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[21:13] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[21:13] * davidzlap (~Adium@cpe-75-84-249-188.socal.res.rr.com) has joined #ceph
[21:21] <joelio> Josh_: how slow exactly? You won't see the same speed as you get from RADOS benches by default. I don't get fantastic performance either (perfectly acceptable mind) but it all depends on my type of workload. Could you be more specific about what you're testing for
[21:23] <joelio> clayb: /dev/urandom? why not /dev/zero? urandom is slow
[21:26] <joelio> clayb: btw you'll probably still be caching in that last dd example
[21:27] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[21:27] * ChanServ sets mode +o scuttlemonkey
[21:31] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[21:31] <Josh_> Do you still put vms that need fast disk speed on local storage?
[21:32] * davidzlap (~Adium@cpe-75-84-249-188.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[21:33] <clayb> @joelio: /dev/zero is also ~300MB/sec for a 4GB file doing a 1MB block size.
[21:33] <cephalobot> clayb: Error: "joelio:" is not a valid command.
[21:33] * odyssey4me (~odyssey4m@165.233.71.2) Quit (Ping timeout: 480 seconds)
[21:34] <clayb> I just wanted something non-trivial, but indeed urandom is not a fast provider (thus the copy; but yes, a 100MB file is easily cached these days)
[21:35] * davidzlap (~Adium@cpe-75-84-249-188.socal.res.rr.com) has joined #ceph
[21:40] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) has joined #ceph
[21:41] <mjevans> clayb: you can use tr to replace /dev/zero's output with an arbitrary character fairly cheaply.
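(A rough sketch of that trick — a constant non-zero byte stream instead of urandom's expensive output; the byte value and sizes are arbitrary:)

    tr '\000' '\125' < /dev/zero | dd of=/tmp/t bs=1M count=256 iflag=fullblock   # 0x55 bytes: avoids zero-detection shortcuts without urandom's CPU cost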
[21:45] * yanzheng (~zhyan@101.83.205.27) Quit (Ping timeout: 480 seconds)
[21:47] <joelio> Josh_: I'm not sure what you mean by fast disk speeds. Most of my VM's are stuff like ci platform (so jenkins and slaves), web apps, machine learning etc ($WORK ~ R&D). They function fine using their local fscache, rbd caching and the general rbd performance. That's not to say it can't be improved though. Where I can I split across the application layer (so more jenkins slaves, more web server frontends etc..) this means there is a cumulative effect of multiple
[21:48] <joelio> increasing per rbd concurrency would be welcome (if indeed, that is possible and I'm not dreaming it)
[21:49] <joelio> I've had better speeds with striping via mdadm several rbds into one raid-0
[21:50] <joelio> but that doesn't help in practice, as my middleware layer doesn't use kernel mapping but libvirt userland
[21:50] * wonkotheinsane (~jf@jf.ccs.usherbrooke.ca) has joined #ceph
[21:51] <joelio> now.. we could add more volumes per VM, but something is telling me that being able to tune the rbd concurrency across a pool/cluster from ceph tools might work?!? (dev input appreciated!)
[21:51] * rturk is now known as rturk-away
[21:52] * rturk-away is now known as rturk
[21:56] <joelio> I think I need to dive into libvirt's code, see what it does
[22:00] <joshd> sagewk: librados fix looks good
[22:00] <grepory> nhm_: okay. i'll start there and dig around. thanks!
[22:01] <sagewk> k thanks
[22:03] <nhm_> grepory: cool, post to the mailing list if you discover anything interesting!
[22:06] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[22:07] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[22:07] * Schelluri (~Sriram@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[22:07] * Schelluri (~Sriram@108-225-16-176.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[22:11] * mozg (~andrei@host109-151-35-94.range109-151.btcentralplus.com) has joined #ceph
[22:12] <mozg> hello guys
[22:12] * codice_ (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) Quit (Read error: No route to host)
[22:16] * davidzlap (~Adium@cpe-75-84-249-188.socal.res.rr.com) Quit (Quit: Leaving.)
[22:17] * codice (~toodles@75-140-71-24.dhcp.lnbh.ca.charter.com) has joined #ceph
[22:20] <mjevans> For the moment I'm happy enough to have something stable that works; fine-tuning can come later.
[22:23] * danieagle (~Daniel@177.99.135.56) has joined #ceph
[22:24] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[22:24] * ChanServ sets mode +v andreask
[22:24] <dmick> loicd: http://tracker.ceph.com/issues/5954
[22:25] * loicd looking
[22:26] <loicd> dmick: it's funny that you should write Löic instead of Loïc. Makes me feel swedish or something ;-)
[22:27] <dmick> sigh. I couldn't remember; sorry :) Will fix!
[22:27] <loicd> I usually don't bother adding the " ;-)
[22:28] <dmick> I have a thing about getting people's names right
[22:28] <dmick> it's a respect thing
[22:29] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Remote host closed the connection)
[22:30] * tserong_ (~tserong@124-171-113-175.dyn.iinet.net.au) has joined #ceph
[22:30] <loicd> dmick: thanks
[22:31] * Rom (~Rom@c-107-3-156-152.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[22:31] <dmick> I'll remember it by remembering "naïve", which I also had to look up yesterday. Not that there's any correlation. :)
[22:34] <sjust> sagewk: wip-backport-recovery-ops
[22:34] <sjust> I think it's responsible for stefan's thing (mailing list)
[22:36] * tserong (~tserong@124-171-119-22.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[22:38] * markbby (~Adium@168.94.245.4) Quit (Quit: Leaving.)
[22:39] <sagewk> hmm, these patches came from me trying to fix his issue, and iirc they were originally against cuttlefish and didn't help
[22:39] <sagewk> but maybe he had other issues too at the time
[22:39] <loicd> dmick: I updated with a link to https://github.com/ceph/ceph-qa-chef/blob/f1c0aa840f26a9c1dbdbe9a7a07bcc34202ca900/cookbooks/ceph-qa/recipes/default.rb#L13 . Next time I have that kind of problems I'll look there. I did not realize this repository was used to prepare the teuthology machines. Thanks for the hint :-)
[22:44] <dmick> yep
[22:48] <loicd> I'm seeing
[22:48] <loicd> {duration: 3400.712238073349, failure_reason: '"2013-08-13 12:59:21.636893 osd.0 10.214.136.132:6800/31256
[22:48] <loicd> 1 : [WRN] 5 slow requests, 5 included below; oldest blocked for > 30.678902 secs"
[22:48] <loicd> in cluster log', flavor: basic, owner: loic@dachary.org, success: false}
[22:49] <loicd> when running teuthology + valgrind. Is it to be expected sometimes or does that indicate a problem ? I assume valgrind can badly slow down things ;-)
[22:49] * loicd running again to see if it repeats
[22:49] <sjust> with valgrind I would add it to the white list and ignore it
[22:55] <loicd> sjust: ok. is "- slow request" added after "log-whitelist:" in http://pastebin.com/jYFPrStj enough ? It is just a wild guess on my part but I can dig to figure it out.
[22:56] <sjust> log-whitelist:
[22:56] <sjust> - wrongly marked me down
[22:56] <sjust> - objects unfound and apparently lost
[22:56] <sjust> - slow requests,
[22:56] <sjust> - clocks
[22:56] <sjust> is what I had in one of my files
[22:56] <loicd> wonderful
[22:56] <loicd> thanks
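(For context, a sketch of where that whitelist sits in a teuthology job file, following the layout of typical configs from this era; the surrounding task entries are illustrative and not loicd's actual yaml:)

    tasks:
    - install: null
    - ceph:
        log-whitelist:
        - wrongly marked me down
        - objects unfound and apparently lost
        - slow request
        - clocks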
[22:59] * rturk is now known as rturk-away
[23:00] * rturk-away is now known as rturk
[23:11] * rturk is now known as rturk-away
[23:24] <sagewk> mikedawson: ping
[23:24] <mikedawson> sagewk: howdy
[23:26] <sagewk> just followed up on the qemu/rbd email
[23:26] <sagewk> i see one weirdness.. can you try with wip-5955 and add in debug finisher = 20?
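(A minimal sketch of where that option would go for a qemu/librbd client, assuming it is set in ceph.conf on the hypervisor; the log path is illustrative:)

    [client]
        debug finisher = 20
        log file = /var/log/ceph/qemu-client.$pid.log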
[23:27] * wusui (~Warren@2607:f298:a:607:a007:f7cd:efab:fe20) has joined #ceph
[23:28] <mikedawson> sagewk: thanks! Really appreciate it! Do you care if this is with qemu cache=none (like my logs) or cache=writeback (what we normally run)
[23:32] <sagewk> writeback or writethru
[23:32] <sagewk> the cache=none bug is different
[23:32] <sagewk> (i think we have a fix for that too, fwiw)
[23:34] <mozg> sagewk: is there a bug when using qemu and cache=none?
[23:34] <mozg> this is what I am using at the moment
[23:34] <mikedawson> sagewk: Just to confirm, wip-5955 is based on Cuttlefish, right?
[23:34] <mozg> should I worry about it?
[23:35] * BillK (~BillK-OFT@124-169-72-15.dyn.iinet.net.au) has joined #ceph
[23:35] * loopy (~torment@pool-72-64-192-91.tampfl.fios.verizon.net) has joined #ceph
[23:36] <sagewk> mikedawson: it is.. do you need it against cuttlefish?
[23:37] <sagewk> mozg: not unless you're seeing hangs. and i think it will be fixed shortly anyway :)
[23:38] <mozg> do you mean hangs from the osd server side, or hangs from the vm side?
[23:38] <mozg> i am seeing hung tasks often
[23:38] <mozg> especially when i run fio with direct=1 and bs=4k
[23:38] <mozg> with a bit of load i start getting hung tasks
[23:38] * jeff-YF (~jeffyf@67.23.117.122) Quit (Ping timeout: 480 seconds)
[23:38] <mikedawson> sagewk: yes, I am 0.61.7 right now. Just wanted to make sure it didn't pull me up to master/Dumpling
[23:38] <mozg> and sometimes vm just panics
[23:38] <kraken> http://i.imgur.com/WS4S2.gif
[23:39] <sagewk> mikedawson: let me repush,
[23:39] <sagewk> sorry i mean it was against master :) will repush
[23:40] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Remote host closed the connection)
[23:42] <mikedawson> sagewk: excellent! I'll test it at home tonight.
[23:42] <sagewk> repushed
[23:42] <sagewk> thanks
[23:44] * toMeloos_ (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[23:44] <mikedawson> sagewk: I don't see wip-5955 on the gitbuilder-ceph-deb-raring-amd64-basic line in gitbuilder.cgi, but I do for other targets. Can I assume it'll be there in a few hours?
[23:45] <sagewk> i just pushed.. it will take a bit to build
[23:45] <mikedawson> sagewk: ty
[23:46] * schelluri2 (~schelluri@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[23:46] * rturk-away is now known as rturk
[23:46] * Schelluri (~Sriram@108-225-16-176.lightspeed.sntcca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[23:47] <sagewk> np
[23:48] <mozg> sagewk, regarding the qemu hangs with cache=none, what kind of hangs do you mean?
[23:49] <sagewk> mozg: http://thread.gmane.org/gmane.comp.file-systems.ceph.user/2982
[23:51] * davidzlap (~Adium@ip68-5-239-214.oc.oc.cox.net) has joined #ceph
[23:52] * danieagle (~Daniel@177.99.135.56) Quit (Quit: inte+ e Obrigado Por tudo mesmo! :-D)
[23:54] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:54] <joelio> sagewk: are i/o operations inside libvirt's rbd storage implementation governed by a fixed number of concurrent connections to the cluster, or are they native i/o operations inside the vm abstracted via libvirt/ceph and therefore dependent upon the concurrency of the native i/o operation inside the vm?
[23:56] * alfredodeza is now known as alfredo|afk
[23:57] * davidzlap (~Adium@ip68-5-239-214.oc.oc.cox.net) Quit (Quit: Leaving.)
[23:57] * davidzlap (~Adium@ip68-5-239-214.oc.oc.cox.net) has joined #ceph
[23:59] <zjohnson> best cost:performance ssds to get for ceph-journaling and ceph-metadata?
[23:59] <mozg> sagewk, thanks
[23:59] <zjohnson> considering Samsung 840(pro?), Intel 530, and Crucial M500
[23:59] <zjohnson> all consumer level.
[23:59] * aliguori (~anthony@32.97.110.51) Quit (Quit: Ex-Chat)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.