#ceph IRC Log


IRC Log for 2012-12-30

Timestamps are in GMT/BST.

[0:04] * nwl (~levine@atticus.yoyo.org) Quit (Ping timeout: 480 seconds)
[0:05] * NightDog (~Karl@ti0131a340-dhcp0997.bb.online.no) Quit (Quit: Leaving)
[0:14] * nwl (~levine@atticus.yoyo.org) has joined #ceph
[0:17] <Psi-jack> When I do "ceph health", what could it say, other than "HEALTH_OK"?
[0:21] <janos> "Good Job, ol' buddy ol' pal. HEALTH_OK_PEACHY"
[0:21] <janos> (don't let me near the source)
[0:21] <Psi-jack> Being serious. Trying to write a zabbix monitoring agent.
[0:21] <Vjarjadian> what would the agent do?
[0:21] <Psi-jack> Report and track.
[0:22] <janos> not sure what else. there's 100% ok, or some grade of not ok
[0:22] <Psi-jack> Exactly. I need to know what the possible non-OK ones are.
[0:22] <Psi-jack> If more than 1 especially. :)
[0:22] <Vjarjadian> could such an agent produce a visual output of the cluster?
[0:22] <Psi-jack> Yep.
[0:23] <janos> well here's some non-ok output, because i have not had a healthy cluster in forever
[0:23] <janos> HEALTH_WARN 896 pgs peering; 608 pgs stuck inactive; 896 pgs stuck unclean
[0:26] <Psi-jack> Okay, so, HEALTH_OK, HEALTH_WARN, so far. :)
[0:28] <Kioob> Psi-jack: for example :
[0:28] <Kioob> $ ceph health
[0:28] <Kioob> HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean; recovery 8244/1757910 degraded (0.469%)
[0:28] <Psi-jack> Hmm
[0:28] <janos> ah, if you're not parsing the warn bits, then there's really just two i know of
[0:29] <janos> WARN and OK
[0:29] <Psi-jack> According to http://ceph.com/docs/master/rados/operations/monitoring/ there's only two things that health reports. WARN and OK
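[Editor's note: a Zabbix item along these lines can map the first token of `ceph health` output to a numeric severity. A minimal sketch; `health_severity` is a hypothetical helper name, and `HEALTH_ERR` also exists as a state even though the docs page cited above only discusses OK and WARN:]

```python
# Hypothetical helper for a Zabbix agent: map the first token of
# "ceph health" output to a numeric severity (0 = OK, 1 = WARN, 2 = other).
def health_severity(output):
    tokens = output.split()
    status = tokens[0] if tokens else ""
    if status == "HEALTH_OK":
        return 0
    if status == "HEALTH_WARN":
        return 1
    return 2  # HEALTH_ERR or anything unexpected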
[0:29] <Vjarjadian> can the cluster itself monitor the usage on OSDs, such as CPU/DiskIO/Network Bandwidth? that would be good info for locating bottlenecks
[0:29] <Psi-jack> Vjarjadian: Yes. :)
[0:29] <Psi-jack> Vjarjadian: Zabbix is the swiss army knife of monitoring. :)
[0:30] <Vjarjadian> :)
[0:32] <Kioob> Vjarjadian: for my "reweight" problem, by looking in the code I saw that the priority of recovery was changed in the next version. So I applied the same priority (10 instead of 30), and it looks better. But I'm not sure I can try a "+0.2" weight on 8 OSDs again ;)
[0:38] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Quit: Leaving.)
[0:38] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[0:41] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[0:45] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[0:45] * ChanServ sets mode +o scuttlemonkey
[0:46] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[0:54] <Psi-jack> Ahh, blasted..
[0:54] <Psi-jack> Now I gotta figure out how to get the zabbix user itself access to check the health. LOL
[0:55] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[0:57] * Cube (~Cube@c-38-80-203-117.rw.zetabroadband.com) Quit (Quit: Leaving.)
[0:59] <Kioob> Psi-jack: on my system any unprivileged user can call "ceph health"
[1:01] <Kioob> I suppose you just need access to the /etc/ceph/ceph.conf file, to know how to contact the MON instances
[1:01] <Psi-jack> Heh. How'd you manage that?
[1:01] <Psi-jack> And with, or without cephx in use?
[1:01] <Psi-jack> I can't find the command-line arguments to point ceph to the keyfile.
[1:03] <Kioob> mmm you're right, in fact the user needs access to the client.admin keyring!
[1:03] <Psi-jack> heh
[1:03] <Psi-jack> Exactly...
[1:03] <Kioob> I tested from an account in the "root" group... silly
[1:03] <Psi-jack> hehe
[1:04] <Psi-jack> Now, I could get zabbix able to run /usr/bin/ceph health as root without a password, and solve that issue.. But that would be a bit more than I really want to have to do, if avoidable.
[1:04] <Kioob> but a dedicated keyring, with only "r" privilege on MON should be enough, no ?
[1:08] <Psi-jack> Well, I'm having all 3 of my storage servers, which also run MON, handle checking, so I can see the status from every server, in case one's unable to communicate properly over the SAN network.
[1:13] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[1:14] * The_Bishop (~bishop@2001:470:50b6:0:c965:2a01:9176:a308) Quit (Ping timeout: 480 seconds)
[1:15] <Psi-jack> hmmm
[1:16] <Psi-jack> strange.. ceph's using a crapton of memory.
[1:16] <Psi-jack> Like, each OSD is using almost 1GB each.
[1:17] <Psi-jack> heh, 1.6GB on another server.
[1:18] <Kioob> it depends on PG count, no ?
[1:19] <Psi-jack> Not sure.. I have 3 servers, 3 OSDs each, and all three are overcommitting memory by 2GB, which of course is causing them to swap.
[1:19] <Psi-jack> And just noticed.. One of my three servers has 1 OSD not even running for some odd reason.
[1:21] * dxd828 (~dxd828@host217-43-125-241.range217-43.btcentralplus.com) Quit (Quit: Textual IRC Client: www.textualapp.com)
[1:22] <Psi-jack> heh, well, on that note, at least I determined my systemd units are working now. :D
[1:23] <Kioob> here with 3 servers, 8 OSDs each, ceph uses between 3 and 4GB per host. Don't know if that's normal or not
[1:23] <Psi-jack> Heh.
[1:23] <Psi-jack> yeah. each OSD of mine is literally reporting to be consuming ~950 MB each.
[1:24] <Psi-jack> Since I brought back up the OSD that was down on one, it looks like it's auto-recovering it to the OSD.
[1:24] <Psi-jack> HEALTH_WARN 22 pgs backfill; 10 pgs backfilling; 6 pgs recovering; 9 pgs recovery_wait; 13 pgs stuck unclean; recovery 991/451430 degraded (0.220%)
[1:27] <Kioob> do you use 0.48 or 0.55 ?
[1:27] <Psi-jack> 0.55
[1:27] <Kioob> 0.55 fixes some memory leaks
[1:27] <Kioob> how :/
[1:27] <Psi-jack> heh
[1:28] <Psi-jack> All my ceph storage servers use Arch, and I customized the ceph-git package from AUR to actually work with the more current ceph versions in git's master branch.
[1:28] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) has joined #ceph
[1:34] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: slang)
[1:35] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: This computer has gone to sleep)
[1:37] <Psi-jack> heh. I should be able to simply send a SIGTERM to a ceph-{mon,osd,mds}, relatively sanely, and restart it back up under systemd rather than the original method I was using, the crappy rc.d method. :)
[1:37] <iggy> my rule of thumb has always been 1G memory per 2TB OSD
[1:38] <Psi-jack> iggy: I don't even have /that/ much in total OSD storage. Each server has 1TB, 500GB, and 320GB HDD's for each OSD.
[1:39] <Psi-jack> With 4GB physical RAM per server.
[1:40] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[1:40] * ChanServ sets mode +o scuttlemonkey
[1:42] <iggy> ohhh
[1:43] <iggy> which one got removed?
[1:43] <Psi-jack> A 500 GB on the third node.
[1:46] <Psi-jack> Well, cool, ceph-{mon,osd,mds} is now all running under systemd on one node. :D
[1:55] <Psi-jack> Apparently my mds.c also got stopped on the same node. LOL
[1:57] <Psi-jack> But, right now, I'm stopping every ceph-* service, and re-enabling them as a systemd service after the auto-recovery kicks in.
[1:57] <Psi-jack> Well, finishes. :)
[1:58] <Psi-jack> converting processes like: /usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c /tmp/ceph.conf.4017 -- into /usr/bin/ceph-osd -f -i 5
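[Editor's note: the conversion Psi-jack describes (running `ceph-osd -f -i <id>` in the foreground under systemd instead of the rc.d wrapper) could be captured in a template unit along these lines. The file name, paths, and options here are illustrative assumptions, not his actual unit:]

```ini
# Hypothetical /etc/systemd/system/ceph-osd@.service; instantiate per
# OSD id, e.g. "systemctl start ceph-osd@5".  Assumes /usr/bin binaries
# and the default /etc/ceph/ceph.conf.
[Unit]
Description=Ceph object storage daemon osd.%i
After=network.target

[Service]
ExecStart=/usr/bin/ceph-osd -f -i %i
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Matching mon and mds templates would look the same with the respective daemon binary, which also gives the journald logging and auto-restart behavior mentioned later in the log.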
[2:00] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[2:09] <Psi-jack> Coolness.. All mon, mds, and osd's are now systemd-ized. :D
[2:12] <Psi-jack> And 0% downtime on any of my RBD VM guests, either. Ceph is AWESOME.
[2:13] <Psi-jack> Killing and restarting the OSD's also seems to have resolved my vmcom issues, too. None of them are overcommitting now.
[2:13] * loicd (~loic@2a01:e35:8aa2:fa50:82c:6216:f50d:ce01) Quit (Quit: Leaving.)
[2:14] <Vjarjadian> psi-jack, you using iSCSI?
[2:14] <Psi-jack> Why the heck would I be using iSCSI when I'm using Ceph? ;0
[2:14] <Vjarjadian> which hypervisor you using that can go direct to ceph?
[2:14] * Leseb (~Leseb@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[2:15] <Psi-jack> KVM, using Proxmox VE 2.2 which supports RBD Ceph with CephX authentication and provisioning. :)
[2:15] <Vjarjadian> nice
[2:15] <Psi-jack> It's still minimal support, but it works. :)
[2:16] <Vjarjadian> got your proxmox in a cluster too?
[2:16] <Psi-jack> Yep.
[2:16] <Psi-jack> 3 Ceph storage servers, and 4 Proxmox VE servers.
[2:16] <Vjarjadian> nice
[2:17] <Psi-jack> Yep. It's a good solid setup.
[2:17] <Vjarjadian> could you run Ceph on the proxmox server as a VM with RDM disk mapping?
[2:17] <Psi-jack> I'm also using CephFS within some VM's as well, which my webservers and mail servers use subsets of for /var/www and /home (for Maildir mail)
[2:18] <Psi-jack> hmm?
[2:18] <Vjarjadian> then you could have 7 proxmox nodes with 7 ceph nodes... performance might degrade a little on the storage, but the VMs would have a lot more headroom
[2:19] <Vjarjadian> more room for more VMs
[2:19] <Psi-jack> Eh, Proxmox VE's base OS is Debian 6.0, and Proxmox VE has its own Ceph packages, which are currently based on Argonaut.
[2:20] <Vjarjadian> but if it's in a container, that doesnt matter
[2:20] <Psi-jack> i'm eventually planning to finish upgrading all my hypervisor hardware to newer AMD FX 8-core CPU's with 60~120GB SSD's for the boot drive.
[2:20] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) has joined #ceph
[2:21] <Psi-jack> 1 of the 4 is already upgraded to the AMD FX 8-core CPU, so migrating to/from it, live, seems to fail. :/
[2:21] <Vjarjadian> why AMD CPUs? Intel generally has the advantage in the high end
[2:21] <Psi-jack> AMD has MUCH better virtualization than Intel, by a long shot.
[2:22] * Leseb (~Leseb@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Quit: Leseb)
[2:22] <Psi-jack> And, still, a lot cheaper, overall. :)
[2:22] <Vjarjadian> got links for benchmarks for that? all the benches i've seen say 1 intel core is the same as 2 AMD cores on their new cpus...
[2:22] <Vjarjadian> certainly cheaper
[2:23] <Psi-jack> Not yet, but I will probably end up putting up some of my own benchmarks, eventually.
[2:23] <Vjarjadian> i run AMD E-350s.. theyre low power cpus, but if i wanted more grunt i would go with one of the low power Xeons...
[2:24] <Vjarjadian> still... add a few 4TB HDDs to that octocore... and run ceph as a VM and you have an even more powerful setup
[2:24] * Psi-jack chuckles.
[2:25] <Psi-jack> The main reason I have 3 dedicated storage servers at all is because I wanted to ensure easy maintenance. I can take a storage server offline, work on it, and bring it back up to re-join the cluster.
[2:25] * roald (~Roald@ Quit (Quit: Leaving)
[2:26] <Psi-jack> Each of them has a balanced setup. 120GB SSD for base OS+ceph-journal+xfs-logdev, 1TB SATA3 HDD, 500GB SATA2 HDD, 320GB SATA2 HDD
[2:26] <Psi-jack> Eventually, will replace the 500 and 320 HDD's to match the 1TB. :)
[2:27] <Psi-jack> And, with ceph involved, that'll be eaaaasy. Stop the ceph-osd, sync the data from the old disk to the new, relabel the GPT labels, and adjust mounts, remove old disk, bring OSD back online.
[2:28] <Vjarjadian> or just take the OSD offline... and make a new one for the new disk... it would self heal anyway
[2:28] <Psi-jack> Since I'm facilitating GPT fully, and using partition labels for everything, that will be very easy. :D
[2:28] <Psi-jack> True... But it would take longer to self-heal. :)
[2:28] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) Quit (Read error: Connection reset by peer)
[2:30] <Psi-jack> Heck, I was just totally impressed by the fact i could shut down mon's and osd's, one at a time, and convert them to systemd units being run, without any downtime, and minimal re-balancing needed. I haven't really had time to test the high availability nature of ceph yet, until today. :)
[2:31] <Vjarjadian> im just slightly annoyed it won't work in my network until the next version with the geo-replication feature... it will be unbearable over my slow WAN at the moment
[2:32] <Psi-jack> heh
[2:32] <Psi-jack> Hey, be glad.
[2:32] <Psi-jack> Ceph will be the ONLY clustered filesystem that will DO geo-replication correctly.
[2:33] <Vjarjadian> and open source... with support... what more could someone ask for?
[2:36] <Psi-jack> Heh
[2:42] <Vjarjadian> there been many discussions on how they'll work the geo-replication out yet?
[2:44] <Psi-jack> Dunno. :)
[2:47] <Psi-jack> Okay.. Another question.
[2:48] <Psi-jack> When I do "ceph osd stat", what does "e209: 9 osds: 9 up, 9 in" all mean? Specifically, I know 9 osds means there's 9 configured osd's, but what does 9 up, and 9 in mean?
[2:48] <Psi-jack> I'm assuming 9 up means there's 9 actually up and running/ready; 9 in, however, I'm not sure about.
[2:53] <Psi-jack> Want to monitor that properly as well, so I don't have another incident where 8 of 9 osd's are actually up and running, yet HEALTH is still OK.
[2:54] <Vjarjadian> then you need to catch the OSD failure as it occurs
[2:55] <Psi-jack> Well, systemd is now managing that, and controlling auto-restarts.
[2:55] <Psi-jack> And thus, logging, is also handled by journald.
[2:55] <Vjarjadian> you tried testing performance when 1 OSD is down... during recovery
[2:55] <Psi-jack> Not yet.
[2:56] <Psi-jack> I never would've known the OSD was even down, without randomly doing an osd stat, to see 8 up, 8 in.
[2:56] <Vjarjadian> virus scanner and disk IO monitor should handle that easily enough :)
[2:58] * ScOut3R (~ScOut3R@catv-86-101-215-1.catv.broadband.hu) has joined #ceph
[3:01] <Psi-jack> Oh well, for now, I'll just write my mon agent to track the expected osd's, and up and in will be based off the expected. ;)
[3:01] <Psi-jack> If they're < expected, then trigger. :)
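[Editor's note: the check Psi-jack describes can be sketched as a small parser for the `ceph osd stat` line quoted above, compared against an expected count. The function names are hypothetical:]

```python
import re

# Hypothetical monitoring check: parse "ceph osd stat" output such as
# "e209: 9 osds: 9 up, 9 in" and compare the counts to what we expect.
def osd_counts(stat_line):
    m = re.search(r"(\d+) osds: (\d+) up, (\d+) in", stat_line)
    if not m:
        raise ValueError("unrecognized osd stat line: %r" % stat_line)
    total, up, osds_in = (int(g) for g in m.groups())
    return total, up, osds_in

def osds_ok(stat_line, expected):
    # Trigger when any of total/up/in falls below the expected count.
    return osd_counts(stat_line) == (expected, expected, expected)
```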
[3:17] <Psi-jack> There we go. Now even my OSD counts are monitored. :D
[3:18] <Vjarjadian> lol
[3:18] <Psi-jack> And now... Time for pizza and movie. :)
[3:25] * ScOut3R (~ScOut3R@catv-86-101-215-1.catv.broadband.hu) Quit (Remote host closed the connection)
[3:54] * calebamiles1 (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[3:58] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[4:29] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[4:44] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[4:46] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) Quit (Ping timeout: 480 seconds)
[4:58] * Etherael (~eric@node-4si.pool-125-24.dynamic.totbb.net) has left #ceph
[5:07] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) has joined #ceph
[5:43] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Remote host closed the connection)
[5:55] * Cube (~Cube@c-38-80-203-117.rw.zetabroadband.com) has joined #ceph
[6:40] * renzhi (~renzhi@ has joined #ceph
[6:46] <renzhi> hi, can someone tell if there is any easy way to read and parse the pgmap directly?
[7:04] <renzhi> or as a matter of fact, I'd like to know how ceph figures out where an object is stored (osd host, etc) on the fly. That information is changing all the time, and it's something I'd like to understand.
[7:18] <renzhi> specifically, I'd like to find out what are the objects in the pg, and where they are located (physically), really quick.
[7:44] <Vjarjadian> go on the website. if i remember correctly it's called CRUSH
[7:45] <Psi-jack> CRUSH, is correct.
[7:46] <Psi-jack> As I understand it though, the CRUSH map just handles balancing of the pg data objects. How to actually examine where specific data is located would be a totally different thing entirely.
[8:08] <renzhi> yeah, what we know is, the location of the object is mapped by the pg map (or at least, that's what I understand), from this page here: http://ceph.com/docs/master/rados/operations/placement-groups/
[8:09] <renzhi> The other question is, is to tell ceph not to do remapping and rebalancing of the data, whenever I add new osds? or at least, still make all objects available for reading while it's doing this, and make the cluster still writable at the same time?
[8:10] <renzhi> I guess everyone is on vacation
[8:12] <renzhi> our data are growing fast, so we need to add disk space all the time. But this remapping and rebalancing is killing us.
[8:12] <Vjarjadian> data should be available during rebalancing
[8:13] <Vjarjadian> someone earlier was complaining their system was very slow... but they did add an extra 50% to their cluster
[8:14] <dweazle> the way i understand it, ceph calculates where an object is expected to be in realtime according to the crush map.. it doesn't keep an in-memory map of that
[8:15] <Vjarjadian> it uses that so the client can calculate what OSDs to contact without excessive contact or processing on one of the MONs
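[Editor's note: the shape of what dweazle and Vjarjadian describe can be illustrated with a deliberately simplified stand-in. This is not the real CRUSH algorithm (CRUSH walks a weighted device hierarchy); it only shows that placement is a pure function of the object name and the map, so any client can compute it without a central lookup. In practice the `ceph osd map <pool> <object>` command asks the cluster to do this calculation for you.]

```python
import hashlib

# Toy stand-in for CRUSH-style placement: deterministic, client-side,
# no directory lookup.  NOT the real algorithm -- illustration only.
def toy_place(obj_name, pg_num, osd_ids, replicas=2):
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    pg = h % pg_num                    # object name -> placement group
    start = pg % len(osd_ids)          # pg -> ordered list of OSDs
    return [osd_ids[(start + r) % len(osd_ids)] for r in range(replicas)]
```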
[8:15] <renzhi> well, that's what we were expecting, but it's not. We had 30 osds, and we just added 7, and ceph has been down for more than 2 days now; a lot of the objects are still not accessible, and writing new objects is barely working for us.
[8:16] <renzhi> dweazle: do you know where in the code it's doing that? just asking
[8:17] <dweazle> i don't know, i'm not familiar with ceph internals
[8:17] <Vjarjadian> 2 days is certainly excessive... especially for a system thats self healing and without a single point of failure....
[8:17] <renzhi> we had another bigger cluster with 76 osds that we had to bring down due to remapping. At the time, basically none of the data was accessible.
[8:17] <renzhi> Vjarjadian: yes, and we aren't even sure when this is going to finish.
[8:17] <Vjarjadian> you monitoring network and disk traffic?
[8:18] <renzhi> quite low.
[8:18] <Vjarjadian> maybe it's due to adding so many OSDs at once... it would have to rebalance a lot of data...
[8:19] <Vjarjadian> i dont have the hardware to create a cluster with more than 4 or 5 virtual hosts on my system at the moment...
[8:20] <renzhi> well, that's scary, you know. Adding 7 osds to a cluster with already 30 osds is causing this. We still have 30 machines, with 10 disks each, which would add 300 osds to the cluster if we can do it now.
[8:20] <Vjarjadian> lol
[8:20] <renzhi> there's no way we can grow fast with that. But our data are growing fast.
[8:21] <Vjarjadian> how many replicas are there on your data?
[8:21] <renzhi> 2
[8:22] <renzhi> we like ceph, when it's healthy, it's fast, nice, and working. But with the current way of remapping and rebalancing, we just don't have the luxury of having osds go down, or adding new osds, etc.
[8:23] <renzhi> It's supposed to be a distributed system, and it's supposed to handle this better than that.
[8:24] <Vjarjadian> had you considered using software raid5/6 and having 1 OSD per host?
[8:24] <Vjarjadian> instead of 300 OSDs you'd have 30...
[8:25] <renzhi> well, in that case, we'd probably better off going with NAS/SAN or whatever solution.
[8:25] <Vjarjadian> it might be more graceful under disk failure... if 1 disk failed it would just be rebuilding the local raid array and if one host failed it would just be rebalancing from 1 OSD
[8:25] * Cube (~Cube@c-38-80-203-117.rw.zetabroadband.com) Quit (Quit: Leaving.)
[8:26] <Vjarjadian> how full is your cluster?
[8:26] <renzhi> right now? 40%
[8:26] <renzhi> for the 30 osd
[8:28] <Vjarjadian> maybe inktank can do something for you...
[8:30] <renzhi> yeah, but the recommendation right now is to set the weight of the new osds to 0, so that the remapping goes really slowly.
[8:31] <Vjarjadian> and slowly increment it by 0.2 untill the desired amount
[8:31] <renzhi> I don't think this is a solution for us. We are browsing and hacking the code right now, but I'm just wondering if anyone can give more hints on that.
[8:32] <renzhi> well, in that case, how are we going to grow our business, if we have to wait for ceph to do the work very very slowly?
[8:33] <Vjarjadian> i dont know enough about ceph to help :) your infrastructure makes my E-350 look like a raspberry Pi :)
[8:34] <renzhi> we can't buy all the hardware we need up front; that's why we chose something that's supposed to let us grow without pain. But it has been a painful experience.
[8:40] <Vjarjadian> one of the people who know ceph inside out will probably know something to help you... as with anything in development there will be minor problems...
[8:42] <renzhi> yes, we understand. We actually ran into bug after bug, which caused a lot of trouble for us, but we could tolerate that just fine. But right now, with the current remapping scheme in ceph, we just can't see how we are going to grow with it.
[8:43] <Vjarjadian> in your tests, how has it coped with OSD failure?
[8:47] <renzhi> we did, it was quite smooth. we added and removed osds, mons, etc
[8:48] <renzhi> but I guess we didn't test a big cluster. We had only 10 osds for testing.
[9:08] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Quit: WeeChat 0.3.2)
[9:21] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[9:28] * loicd (~loic@rom26-1-88-170-47-165.fbx.proxad.net) has joined #ceph
[9:38] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Quit: WeeChat 0.3.2)
[10:14] * Cube (~Cube@c-38-80-203-117.rw.zetabroadband.com) has joined #ceph
[10:16] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[10:31] * `gregorg` (~Greg@ has joined #ceph
[10:32] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[10:32] * Cube (~Cube@c-38-80-203-117.rw.zetabroadband.com) Quit (Quit: Leaving.)
[10:57] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[11:05] * madkiss (~madkiss@ has joined #ceph
[11:06] * madkiss (~madkiss@ Quit ()
[11:09] * Leseb (~Leseb@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[11:09] * Leseb (~Leseb@bea13-1-82-228-104-16.fbx.proxad.net) Quit ()
[11:32] * madkiss (~madkiss@ has joined #ceph
[11:34] * madkiss1 (~madkiss@ has joined #ceph
[11:34] * madkiss (~madkiss@ Quit (Read error: Connection reset by peer)
[11:34] * madkiss1 (~madkiss@ Quit ()
[11:48] * paravoid (~paravoid@scrooge.tty.gr) Quit (Ping timeout: 480 seconds)
[11:52] * paravoid (~paravoid@scrooge.tty.gr) has joined #ceph
[12:03] <Kioob> (08:13:07) Vjarjadian: someone earlier was complaining their system was very slow... but they did add an extra 50% to their cluster <== No :) I added 50% more OSDs yes, but reweighted them from 0 to 0.2 only. So, I added 10% at a time... and ceph became nearly unusable
[12:04] <Kioob> now, I'm reweighting by intervals of 0.01, with a script. I also reduced the priority of "recovery" messages; the ceph cluster is usable, but slow. It's been running for 10 hours now... The current weight of the new OSDs is 0.8
[12:05] <Kioob> But my OSD are backfilling at 25MB/s only, because of slow/old SSD
[12:12] <Kioob> I also tried to move journal from SSD to the same HDD partition as the OSD. Backfilling is faster, but the cluster become really slow
[12:16] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) has joined #ceph
[12:20] * yo (~yo@140.Red-81-32-148.dynamicIP.rima-tde.net) has joined #ceph
[12:22] * yo (~yo@140.Red-81-32-148.dynamicIP.rima-tde.net) Quit ()
[12:26] * sleinen (~Adium@2001:620:0:26:e5b0:70e:ab95:c597) has joined #ceph
[12:28] * The_Bishop (~bishop@2001:470:50b6:0:3dc2:53c2:436c:6670) has joined #ceph
[12:33] <renzhi> Kioob: so you add new osds, and then your cluster becomes unusable and very slow too?
[12:53] <Kioob> yes renzhi
[12:53] <Kioob> during 3 or 4 hours
[12:54] <Kioob> so now, I reweight by step of 0.01
[13:04] <renzhi> how many osd did you add? how many did you have before adding new ones?
[13:28] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) has joined #ceph
[13:28] <Kioob> there was 16 OSD running (2 hosts), I added 8 OSD (1 new host) with weight 0. Then I reweight all 8 OSD to 0.2
[13:31] <Kioob> now, by steps of 0.01 on 4 OSDs the cluster is still usable, and it takes between 10 and 20 minutes for each step
[13:32] <Kioob> so, 100 * 2 steps of 15 minutes average, it will take 50 hours... but without causing a DoS
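[Editor's note: Kioob's gradual-reweight script can be sketched as below, generating one batch of commands and estimating wall time from his ~15 min/step figure. Whether he used `ceph osd reweight` or `ceph osd crush reweight` isn't stated in the log, and the OSD ids are hypothetical:]

```python
# Sketch of an incremental reweight plan: bump a batch of OSDs from
# `start` to `target` in small steps, emitting one command per OSD per
# step.  Command string and ids are illustrative assumptions.
def reweight_plan(osd_ids, start, target, step=0.01):
    cmds, w = [], start
    while w < target - 1e-9:
        w = min(w + step, target)
        cmds.extend("ceph osd reweight %d %.2f" % (i, w) for i in osd_ids)
    return cmds

plan = reweight_plan([16, 17, 18, 19], 0.2, 1.0)   # one batch of 4 OSDs
steps = len(plan) // 4                             # 80 steps of 0.01
hours = steps * 15 / 60.0                          # ~15 min/step -> ~20 h/batch
```

Two such batches of 4 OSDs at roughly 15 minutes per step lands in the tens of hours, in line with the 50-hour estimate above.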
[13:33] <Kioob> And I haven't got a lot of data in my cluster : 2522 GB data, 7015 GB used
[13:49] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.89 [Firefox 17.0.1/20121128204232])
[14:14] * Leseb (~Leseb@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[14:38] <Kioob> in fact, I suppose I have a problem with at least one OSD: now the cluster is ok (HEALTH_OK), and a VM shows 36% IOWait without doing a lot of work (less than 50 tps, 1MB/s)
[14:56] <nhm> Kioob: how long are IOs waiting in the device queues?
[14:57] <Kioob> how can I see that ?
[14:58] <nhm> Kioob: iostat or collectl -sD -oT should tell you (I like the collectl interface myself)
[15:01] <Kioob> on the VM or on each OSD host ?
[15:02] <nhm> oh, I misread your last message. I thought it was 36% iowait on the OSD hosts.
[15:02] <nhm> that's where I was thinking of checking the IO queue wait times.
[15:03] <Kioob> well on OSD I have 14% :)
[15:04] <Kioob> it's "avgqu-sz" ?
[15:05] <Kioob> (iostat)
[15:05] * Leseb (~Leseb@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Quit: Leseb)
[15:05] <nhm> hrm, that's the request size. closest thing in iostat is svctm I think.
[15:06] <Kioob> avgrq-sz is request size
[15:06] <nhm> oops, thats the queue size
[15:06] <nhm> yeah, read it wrong again.
[15:06] <Kioob> ok ;)
[15:06] <nhm> it's early morning here. ;)
[15:06] <Kioob> oh sorry :D
[15:08] <nhm> Kioob: A little while back I remember there was some talk about trying to implement some kind of QOS for rebalancing.
[15:09] <Kioob> so, on a host with 8 OSDs, the avgqu-sz over an interval of 10 seconds is 25 to 50 for some OSDs
[15:09] <nhm> Kioob: ie try to have high priority for ensuring replication but lower priority for rebalancing.
[15:09] <Kioob> yes, it will be great
[15:10] * Cube (~Cube@c-38-80-203-117.rw.zetabroadband.com) has joined #ceph
[15:10] <nhm> Kioob: ok, sounds like there's a fair amount of stuff hitting the disks then.
[15:10] <Kioob> mmm
[15:11] <nhm> what's the svctm like?
[15:11] <Kioob> 8 for near all osd
[15:12] <nhm> ok, that's good. how's avgrq-sz (and what's your sector size?)
[15:13] <Kioob> avgrq-sz is between 14 and 24
[15:13] <Kioob> sector size is the default one... don't know
[15:13] * Cube (~Cube@c-38-80-203-117.rw.zetabroadband.com) Quit ()
[15:14] <Kioob> but all OSD are RAID0 of 1 drive on LSI MegaRAID, with a chunk size of 64k
[15:14] <Kioob> (old card, JBOD is not supported)
[15:15] <nhm> Kioob: I think you can just cat /sys/block/<device>/queue/hw_sector_size
[15:15] <Kioob> the ioscheduler is CFQ... should I switch to deadline ?
[15:15] <Kioob> you're right, so it's 512b
[15:16] <Kioob> 512B*
[15:16] <nhm> Kioob: Maybe. I've sometimes seen an improvement. Going to do some more proper tests soonish.
[15:17] <nhm> Ok, so like 7-12KB average request size. That might explain the low throughput
[15:17] <nhm> And your cluster slowing to a crawl
[15:17] <Kioob> yes, Xen doesn't help here
[15:17] <nhm> Kioob: do you have WB cache?
[15:18] <Kioob> on journal only
[15:18] <nhm> Ah, so kernel RBD?
[15:18] <Kioob> yes kernel RBD
[15:18] <nhm> Kioob: if you can enable it for the OSDs it may help
[15:18] <Kioob> xen (so max request size is 22k), over kernel rbd
[15:18] <Kioob> ok
[15:19] <nhm> Kioob: We've gotta figure out some way to improve performance for folks that can't use KVM.
[15:19] <nhm> Kioob: I suspect that the lack of cache is really hurting.
[15:19] <Kioob> or don't want :p
[15:20] <nhm> Kioob: that too. I'm actually going to start looking into RBD performance in a couple of weeks. Trying to get parametric sweeps of underlying OSD configuration options done first
[15:20] <Kioob> yes... maybe I should add "flashcache" or a similar hack on the VM hosts :/
[15:20] <Kioob> ok thanks
[15:22] <nhm> Kioob: np. Honestly I suspect that XEN on cephfs would actually be faster right now, but it's totally unsupported from an inktank perspective.
[15:22] <Kioob> the avgqu-sz is now 0.0X. and iowait drop to 0.1X%
[15:22] <Kioob> thanks :)
[15:23] <nhm> Kioob: did you do something to make that happen, or did it just happen on its own?
[15:23] <Kioob> I suppose I didn't fully understand the role of the journal
[15:23] <Kioob> I enabled WB on all OSD
[15:23] <Kioob> like you said :)
[15:23] <nhm> Ah, good!
[15:23] <nhm> That'll let the controller reorder and queue up the small writes in hopefully smarter ways.
[15:24] <nhm> The journal writes are already ordered nicely since writes just go there sequentially as they are received.
[15:25] <nhm> The real nastiness is when they have to eventually go to the underlying OSD filesystem.
[15:26] <Kioob> but, once data are written in journal, the response is sent to the client, no ?
[15:26] <Kioob> so, moving data from journal to OSD, is handled in background ?
[15:27] <nhm> Yep, and that means the request can be acknowledged quickly so long as there isn't too much data sitting in the journal. Once the journal fills up too much, though, you have to wait for the existing data in the journal to drain, and that's where things slow way down.
[15:29] <Kioob> ok, it's probably why when a balancing is running I had horrible performance
[15:30] <nhm> yep
[15:30] <Kioob> great great :)
[15:30] <Kioob> so
[15:30] <Kioob> I should try an other balancing :D
[15:31] <nhm> Yeah. The controller cache should help.
[15:31] <nhm> Not sure how much, but I've seen some nice improvements in some circumstances.
[15:32] <nhm> Just make sure the batteries are all working. :)
[15:40] <Kioob> yes, batteries are monitored ;)
[15:41] <Kioob> here, I start a reweight from 0.8 to 0.81, and I see no slow down at all
[15:41] <Kioob> avgqu-sz of all OSD still lower than 1
[15:42] <nhm> good deal
[15:42] <nhm> Are journals on the same disks as the data?
[15:42] <Kioob> no
[15:42] <Kioob> journals are on SSD, but on same hardware card, sharing the same writeback cache
[15:43] <nhm> Ok. I was wondering if you were thrashing previously between journal writes and data writes, but it sounds like that wasn't the case
[15:45] <ron-slc> I've seen cases with argonaut where a simple 4kb write caused almost 30MB of back-end disk-writes. I was using btrfs, and I'm about to start some tests to see if disabling btrfs-snaps in the ceph config helps this.
[15:46] <Kioob> I'm not sure I can start a bigger reweight, by jumping to 0.9 for example, but here I reweight without slowing down the production, so it's good for me. And now I know where to look, big thanks nhm !
[15:46] <ron-slc> effectively a 4kb write becomes a DDoS; the actual work-to-do was amplified thousands of times.
[15:46] <Kioob> (here I use 0.55, over xfs)
[15:47] <ron-slc> yea, that was another thing I wanted to test.. XFS versus btrfs on back-end MB output.
[15:47] <nhm> Kioob: glad it made such a big difference!
[15:48] <nhm> ron-slc: that definitely sounds broken!
[15:49] <ron-slc> that's what I thought.. but it was a clean 3.5 kernel, upgraded to 3.7.1 after seeing the issue. All metrics on Ceph were also healthy.
[15:51] <ron-slc> so I've wiped the osd's, and I'm going to try various filesystems one at a time. Though I'm afraid ceph takes a while to start utilizing the snapping functions automatically, so this may take some time.
[15:52] <ron-slc> Can anybody answer WHY ceph uses btrfs snaps? What advantages or safety does it offer? I only see the option for it, and I DO see snaps happening, but there is no documentation as to why.
[15:53] <nhm> ron-slc: there's some discussion on the mailing list archives, let me see if I can find any.
[15:58] <nhm> ron-slc: some high level doc: https://github.com/ceph/ceph/blob/master/doc/dev/filestore-filesystem-compat.rst
[16:01] <ron-slc> nhm: hmm interesting. Thanks for that link, never occurred to me to look at github (outside of the source code.)
[16:02] <nhm> ron-slc: we probably still need to organize this stuff more.
[16:02] <ron-slc> yea, but that's the BORING work.. ;)
[16:02] <nhm> lol
[16:04] <ron-slc> yea so for fun, I'm going to try straight XFS, BTRFS (snaps disabled), and BTRFS as normal.
[16:04] <ron-slc> and take io-stats on disk i/o
[16:10] <nhm> cool, would definitely be interested in seeing the results!
[16:11] <ron-slc> cool, then I'll do it right then.. And put results into a spreadsheet.
[16:12] <nhm> cool. I've got to go afk, but feel free to send me a copy if you'd like! mark.nelson@inktank.com
[16:12] <ron-slc> kk thanks for the links and help!
[16:25] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[16:32] * dweazle is now known as dennis
[16:36] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[17:05] * loicd (~loic@rom26-1-88-170-47-165.fbx.proxad.net) Quit (Quit: Leaving.)
[17:11] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Quit: leaving)
[17:11] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[17:24] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[17:30] <Kioob> nhm: I'm thinking about the problem again. For me, OSD backend performance should impact global OSD performance only when the journal is full... and I have 25GB of journal per OSD (raw partition on a 80GB SSD)
[17:31] <Kioob> so I'm looking in the documentation: I see "journal queue max bytes" = 10MB
[17:32] <Kioob> what I understand is that only 10MB of my 25GB are used... or did I misunderstand that?
[17:34] <Kioob> also, there is "journal queue max ops" which is at 500 by default. With my average size of 7-12KB, a maximum of 6MB will be stored on the journal, right ?
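Kioob's estimate can be sanity-checked: the queue is capped by whichever of the two limits bites first. A sketch with the defaults and his average op size (note this is his reading of the options; the queue limits bound queued ops, which is not necessarily the same thing as journal occupancy):

```python
# Kioob's reading: only a small slice of the 25 GB journal partition
# can ever be queued, because two limits apply simultaneously.
queue_max_bytes = 10 * 1024 * 1024   # "journal queue max bytes" default: 10 MB
queue_max_ops = 500                  # "journal queue max ops" default
avg_op_size = 12 * 1024              # his upper estimate: 12 KB per op

ops_bound = queue_max_ops * avg_op_size        # 500 * 12 KB = 6,144,000 bytes
effective = min(queue_max_bytes, ops_bound)    # the tighter limit wins
print(effective / (1024 * 1024))  # ~5.86 MB -- far below the 25 GB partition
```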
[17:36] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[17:38] <Kioob> I also see that "journal aio" is not enabled by default; is there a problem with enabling it?
[17:48] <Kioob> oh... I've seen that btrfs doesn't handle dio+aio very well. Is that the reason why it's disabled by default?
[18:26] * SpamapS (~clint@xencbyrum2.srihosting.com) has joined #ceph
[18:55] <Kioob> I found a thread on the list about that kind of workload (IO/s oriented, not throughput) : http://www.spinics.net/lists/ceph-devel/msg08480.html
[18:57] <Kioob> so I will try : enable "aio", increase "journal queue max bytes" to 512MB, set "journal queue max ops" to 8192. It's a first step
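For reference, those proposed settings would look like this in a ceph.conf [osd] section (values straight from the discussion, untested here; tune to your own hardware):

```ini
; Sketch of the [osd] settings Kioob plans to try (values from the
; discussion above, untested here; tune before use)
[osd]
    journal aio = true
    journal queue max bytes = 536870912   ; 512 MB
    journal queue max ops = 8192
```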
[18:59] <Kioob> so the journal is freed asynchronously, I suppose, so a big journal flush doesn't imply a long lock, right?
[19:08] <nhm> Kioob: you may also see a benefit increasing the number of osd_op_threads, and disabling in-memory debugging.
[19:09] <nhm> Kioob: I'm actually in the middle right now of doing parametric sweeps over different ceph configurables at different IO sizes and on different backend file systems.
[19:14] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[19:37] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[19:44] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[19:58] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[19:59] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[20:10] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:12] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[20:24] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:25] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[20:32] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:36] * Cube (~Cube@108-221-17-171.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[20:45] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has joined #ceph
[20:45] <CloudGuy> nhm: do you plan to publish those findings anywhere
[20:52] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[20:53] <stxShadow> good evening !
[20:53] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:00] * maxia (~rolson@ has joined #ceph
[21:00] <maxia> any devs around?
[21:04] * Cube (~Cube@108-221-17-171.lightspeed.sntcca.sbcglobal.net) Quit (Quit: Leaving.)
[21:05] <maxia> specifically in PG.cc/Paxos.cc area
[21:08] <maxia> I'm debugging certain areas that cause very long delays in pg mappings
[21:08] <maxia> sometimes 20 hour delays
[21:09] <maxia> 2012-12-31 04:08:57.602327 osd.30 [WRN] slow request 39.768097 seconds old, received at 2012-12-30 13:08:17.834084: osd_sub_op(client.7218.0:53385611 3.d04 2d24ad04/63429307/head//3 [] v 2513'10646 snapset=0=[]:[] snapc=0=[]) v7 currently started
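For scripting against warnings like the one above (e.g. grepping cluster logs for the worst offenders), the OSD id and request age can be pulled out with a small parser; a sketch assuming the 0.55-era log format shown:

```python
import re

# Pull the OSD id and request age out of a "slow request" warning,
# using the line maxia pasted above as the sample.
line = ("2012-12-31 04:08:57.602327 osd.30 [WRN] slow request 39.768097 "
        "seconds old, received at 2012-12-30 13:08:17.834084: "
        "osd_sub_op(...) v7 currently started")

m = re.search(r"(osd\.\d+) \[WRN\] slow request ([\d.]+) seconds old", line)
osd, age = m.group(1), float(m.group(2))
print(osd, age)  # osd.30 39.768097
```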
[21:09] <maxia> they look like that, some are far creepier
[21:10] <maxia> I'm trying to understand what causes that delay exactly
[21:12] <maxia> root@s1:~# ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok version
[21:12] <maxia> {"version":"0.55.1-1-g7c7469a"}
[21:12] <maxia> root@s1:~# ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok git_version
[21:12] <maxia> {"git_version":"7c7469a19b0d563a448486adce9f326c6e5bd66d"}
[21:13] <maxia> we're running a version that had an auth fix backported, that's the only change
[21:14] <maxia> this problem exists outside of the custom backport
[21:16] <maxia> we're stuck in what looks like a 2 week degraded mode
[21:17] <maxia> minimum
[21:18] <maxia> I want to know how to determine if this is clock drift detected in ceph, internally
[21:23] <maxia> after that I want to know how to hack it so recovery can happen faster; our data is immutable aside from non-critical appends
[21:23] <maxia> also when I do ceph -s
[21:23] <maxia> 394 TB data, 31066 GB used, 37673 GB / 68739 GB avail;
[21:24] <maxia> where is 394 TB calculated from?
[21:24] <maxia> is that based on pg quantity?
[21:25] <maxia> I almost started to try to debug that but I didn't care enough and made an assumption.
[21:25] <maxia> we don't have 394 TB in this cluster, I think we only have ~100TB in this one
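On the 394 TB question: the "data" figure in the pgmap summary is aggregated from per-PG object stats, so a stats-accounting bug (or sparse objects) can make it diverge wildly from the raw "used" figure, which matches what maxia is seeing. A sketch for splitting the line apart when scripting checks (the regex is illustrative, matched against the line pasted above):

```python
import re

# Split the "ceph -s" capacity summary into fields; the sample is the
# line maxia pasted above.
line = "394 TB data, 31066 GB used, 37673 GB / 68739 GB avail;"
m = re.match(r"(\d+) (\w+) data, (\d+) (\w+) used, "
             r"(\d+) (\w+) / (\d+) (\w+) avail", line)
data, used, free, total = (int(m.group(i)) for i in (1, 3, 5, 7))
print(data, used, free, total)  # 394 31066 37673 68739
# 394 TB of "data" against only ~31 TB "used" is the inconsistency
# maxia is pointing at.
```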
[21:26] <maxia> I hate holidays.
[21:26] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[21:33] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:39] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:50] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[21:50] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:00] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:01] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:02] * ScOut3R (~ScOut3R@catv-86-101-215-1.catv.broadband.hu) has joined #ceph
[22:04] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:05] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:21] * ScOut3R (~ScOut3R@catv-86-101-215-1.catv.broadband.hu) Quit (Remote host closed the connection)
[22:23] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:24] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:31] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[22:32] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) Quit (Read error: Connection reset by peer)
[22:39] <Kioob> CloudGuy: probably here : http://ceph.com/community/blog/ like previous bench http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/ , http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/ and http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
[22:42] <CloudGuy> Kioob: thank you
[22:42] <CloudGuy> i am still trying to figure out what would be an optimal and recommended setting to start with .
[22:42] <Kioob> nhm: osd_op_threads is 2 by default. Is there a "good practice" for that value? I've 8 OSDs per host, and 2x6-core CPUs; with hyperthreading the system sees 24 "threads". I suppose I can set at least 3 op_threads per OSD
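Kioob's arithmetic, spelled out (purely a sizing heuristic, not an official formula):

```python
# Kioob's sizing check: hardware threads available per OSD daemon.
# Whether osd_op_threads should match this is a judgment call.
hw_threads = 2 * 6 * 2      # 2 sockets x 6 cores x 2 (hyperthreading) = 24
osds_per_host = 8
threads_per_osd = hw_threads // osds_per_host
print(threads_per_osd)  # 3 -- "at least 3 op_threads per OSD"
```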
[22:43] <Kioob> CloudGuy: the problem is that it depends on a lot of parameters, like hardware and usage
[22:45] <CloudGuy> exactly .. i tested out cloudstack over nfs ... and the cloudstack part is working well .. i want to replace nfs with rbd for the primary storage ( i tested rbd over nfs also ) .. but to really get started, i am still trying to figure out what would be a good number to begin with .. like 3 servers with 3x replication .. and slowly add more osds .. also .. 1 osd per server vs 3 osds per server etc
[22:49] <CloudGuy> a basic recommendation would be good .. so that as load/demand grows, more storage can be added to the cluster
[22:50] <Kioob> well, I have 8 OSD per server, and it doesn't seem to be a good idea : if the server goes down, there is a lot of data to balance
[22:50] <Kioob> CloudGuy: but did you read that http://ceph.com/docs/master/install/hardware-recommendations/ ?
[22:51] <Kioob> it's a good starting point
[22:51] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:52] <maxia> Kioob, are you experiencing slow recovery times?
[22:52] <Kioob> yes maxia
[22:52] <maxia> how many pgs?
[22:52] <Kioob> mm, slow *balancing* times
[22:52] <maxia> what does ceph -s show
[22:52] <Kioob> 2328 pgs, for 7007GB used
[22:53] <Kioob> pgmap v1353568: 2328 pgs: 2328 active+clean; 2522 GB data, 7007 GB used, 15229 GB / 22236 GB avail
[22:54] <maxia> you're fine.
[22:54] <Kioob> yes, since it's balancing only, I can do step by step
[22:55] <maxia> no dirty/stale/etc?
[22:55] <Kioob> no
[22:57] <Kioob> nhm: by "disabling in-memory debugging", are you talking about "filestore_debug_omap_check"?
[23:04] <maxia> Kioob, your total pgs count is 2328?
[23:04] <Kioob> yes
[23:04] <maxia> handling 7 TB
[23:04] <Kioob> (it's what I understand)
[23:06] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:06] <maxia> sorry, I'm actually debugging several sections in that area because we end up with stale data and recovery times taking weeks
[23:07] <maxia> health HEALTH_WARN 1098 pgs backfill; 30 pgs backfilling; 3 pgs degraded; 2 pgs recovery_wait; 26 pgs stale; 26 pgs stuck stale; 1130 pgs stuck unclean; recovery 1041813/8450952 degraded (12.328%)
[23:07] <maxia> monmap e1: 3 mons at {a=,b=,c=}, election epoch 4, quorum 0,1,2 a,b,c
[23:07] <maxia> osdmap e2561: 38 osds: 37 up, 37 in
[23:07] <maxia> pgmap v44636: 12452 pgs: 11296 active+clean, 1095 active+remapped+wait_backfill, 2 active+degraded+wait_backfill, 2 active+recovery_wait, 26 stale+active+clean, 30 active+remapped+backfilling, 1 active+degraded+remapped+wait_backfill; 394 TB data, 31143 GB used, 37595 GB / 68739 GB avail; 1041813/8450952 degraded (12.328%)
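The degraded percentage in that pgmap line is just the ratio ceph prints, which can be re-derived from the two counts (useful when tracking recovery progress over time):

```python
# Re-derive the degraded percentage from the two counts in the pgmap
# line above: degraded object instances over total object instances.
degraded, total = 1041813, 8450952
pct = round(degraded / total * 100, 3)
print(pct)  # 12.328 -- matches "(12.328%)"
```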
[23:07] <Kioob> ouch
[23:08] <maxia> 2 days there, "degraded" was 15% 24 hours ago
[23:08] <maxia> clearly we're doing it wrong
[23:10] <maxia> it's our third 48 hour outage and I am just trying to figure out why. I assume it's misconfiguration
[23:11] <Kioob> one of yours OSD is missing, do you know why ?
[23:11] <maxia> only 1 missing is good in our case.
[23:11] <Kioob> ok
[23:12] <Kioob> but if it stays missing, shouldn't its data be rebalanced over the other OSDs?
[23:12] <maxia> maybe.
[23:13] <Kioob> I'm a recent user of ceph, so I may be wrong, but from what I understood, in case of a missing OSD, data is not automatically rebalanced until needed (for replication)
[23:14] <Kioob> so I suppose PGs are rebalanced only if there are writes on them
[23:14] <maxia> I'm not a recent user, only recently tasked with finding out "why"
[23:15] <maxia> I am a ceph n00b
[23:15] <maxia> I just run stuff through gdb
[23:15] <maxia> in our own use case.
[23:16] <maxia> we have a situation, multi-version, where we end up in recovery mode
[23:16] <maxia> for days
[23:16] <maxia> and that's the good version
[23:16] <maxia> I am fairly certain it is misconfiguration at this point
[23:17] * The_Bishop (~bishop@2001:470:50b6:0:3dc2:53c2:436c:6670) Quit (Ping timeout: 480 seconds)
[23:18] <maxia> Kioob, basically adding osds puts it into "rebalance" and it is sooooo slow
[23:18] <Kioob> yes, it was my problem
[23:19] <maxia> same.
[23:19] <maxia> do you change your files much?
[23:19] <maxia> or are they mostly immutable
[23:20] <Kioob> well mostly immutable
[23:21] <Kioob> but there are some MySQL VMs stored on RBD, with frequent small writes
[23:21] <maxia> you use it for db?
[23:21] <Kioob> yes :)
[23:21] <maxia> that is a patently bad idea
[23:21] <maxia> I do not recommend this
[23:22] <Vjarjadian> it's only a bad idea if it doesnt work...
[23:22] <Kioob> for now, it works well... when there is no balancing
[23:22] <maxia> it will always eventually not work, it adds latency, and there is no reason
[23:22] <Vjarjadian> the way it's designed to work could be very good for databases... you have the IOs from many disks
[23:22] <maxia> what????
[23:22] <maxia> no
[23:22] <maxia> what?
[23:22] <maxia> that will never be good
[23:22] <maxia> why would anyone do that
[23:23] <maxia> that is a horrible idea
[23:23] <maxia> don't do that.
[23:23] <Kioob> RBD is used here for virtualization. And people use their VM as "LAMP" servers
[23:24] <Kioob> and yes, it adds latency, of course
[23:24] <maxia> you should absolutely not use anything mentioned in the last few minutes as the direct store for your database backend.......
[23:25] <maxia> you will suffer
[23:25] <Kioob> but until Galera came along recently, MySQL didn't have a good failover solution. So you only had DRBD, which also adds latency
[23:25] <maxia> do not use any network mounted or network based filesystem for your critical database
[23:25] <maxia> ever
[23:25] <maxia> you will never win
[23:26] <Vjarjadian> critical is relative...
[23:26] <maxia> it will crash and burn
[23:26] <maxia> so if you want your database
[23:26] <maxia> at all
[23:26] * The_Bishop (~bishop@2001:470:50b6:0:212f:f61b:4e74:a0a4) has joined #ceph
[23:26] <maxia> why is that relative?
[23:26] <maxia> it's a bad idea
[23:26] <Vjarjadian> different requirements for different situations
[23:27] <maxia> I guess if you require functionality, use disk. If you do not, use Vjarjadian's way
[23:33] <phantomcircuit> maxia, lan latencies are like 50x lower than rotational disk latency
[23:34] <Kioob> but the network can "crash and burn" data too!
[23:34] <Vjarjadian> anything can crash and burn
[23:35] <maxia> phantomcircuit, oh, alright guess I am wrong. Best practice should be to use network mounts in your databases
[23:35] <maxia> cephfs
[23:35] <maxia> the fuse driver
[23:35] <Kioob> it's not a "best practice"; in many cases direct disk will be faster and cheaper
[23:35] <maxia> that's a good place to mount a large partition
[23:36] <Kioob> (and safer)
[23:36] <maxia> try it :)
[23:36] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[23:46] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[23:47] <Kioob> is there a way to see journal "state"? (like amount of ops and bytes stored in it)
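One way to answer Kioob's question: FileStore exposes perf counters over the OSD admin socket (ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump), which in this era include journal queue gauges. The JSON below is a hand-written illustrative sample, not real output, and the counter names are assumptions:

```python
import json

# Parse journal-queue gauges out of a FileStore "perf dump" blob as
# fetched from the OSD admin socket. Illustrative sample only; the
# counter names (journal_queue_ops / journal_queue_bytes) are assumed.
sample = json.loads("""
{
  "filestore": {
    "journal_queue_ops": 12,
    "journal_queue_bytes": 147456
  }
}
""")
fs = sample["filestore"]
print(fs["journal_queue_ops"], fs["journal_queue_bytes"])  # 12 147456
```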

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.