#ceph IRC Log


IRC Log for 2012-11-04

Timestamps are in GMT/BST.

[0:10] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[0:14] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) Quit (Quit: Leaving.)
[0:42] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[0:55] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[0:59] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[1:03] * BManojlovic (~steki@212.200.241.76) Quit (Ping timeout: 480 seconds)
[1:07] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[1:37] * mtk (r9uVl19f_A@panix2.panix.com) Quit (Remote host closed the connection)
[1:47] * jlogan1 (~Thunderbi@2600:c00:3010:1:9b2:ed42:a1f6:a6ec) Quit (Ping timeout: 480 seconds)
[2:04] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[2:13] * Solver (~robert@atlas.opentrend.net) Quit (Remote host closed the connection)
[2:13] * Solver (~robert@atlas.opentrend.net) has joined #ceph
[2:43] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[2:44] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[2:44] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[2:48] * maxim (~pfliu@111.192.251.140) has joined #ceph
[2:51] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[3:17] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[3:19] * loicd (~loic@magenta.dachary.org) has joined #ceph
[4:43] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[4:51] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[4:59] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[5:09] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[5:35] * stp (~stp@dslb-084-056-021-011.pools.arcor-ip.net) has joined #ceph
[5:36] * maxim (~pfliu@111.192.251.140) Quit (Ping timeout: 480 seconds)
[5:42] * scalability-junk (~stp@dslb-084-056-048-012.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[6:05] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[6:17] * gucki (~smuxi@46-127-158-51.dynamic.hispeed.ch) has joined #ceph
[6:18] <gucki> hi
[6:18] <gucki> anybody here?
[6:19] <gucki> it seems like ceph is stuck in degraded state here, not recovering the last few missing data... :(
[6:20] <gucki> this is how my ceph -w output looks like for hours now: http://pastie.org/5179779
[6:20] <gucki> all ceph osds are up, in and idle...
[6:20] <mikeryan> hi gucki, i can take a look
[6:20] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[6:20] <mikeryan> can you paste me the output of ceph pg dump?
[6:22] <gucki> mikeryan: great :). yes, hold on
[6:23] <gucki> btw, do you think i might have been bitten by a crushmap bug? http://www.spinics.net/lists/ceph-devel/msg08361.html
[6:24] <mikeryan> unsure
[6:24] <gucki> mikeryan: http://pastie.org/5179790
[6:24] <mikeryan> ah, that's ceph osd dump
[6:24] <mikeryan> i need ceph pg dump
[6:24] <gucki> http://pastie.org/5179796
[6:25] <gucki> one second
[6:26] <gucki> shit, pastie only allows 64k
[6:26] <mikeryan> ok, that's probably too much info then..
[6:26] <gucki> http://pastebin.com/0b4vFt3T
[6:27] <mikeryan> ok so the PG is 3.174
[6:27] <mikeryan> and looks like 2 should be backfilling to 0
[6:28] <gucki> how can i see the backfilling from 2 to 0? :)
[6:28] <mikeryan> 3.174 229 0 19 0 938524672 489701 489701 active+recovering+remapped+backfill 2012-11-03 23:43:54.279013 939'18539 128'56215 [2,4] [2,4,0] 136'5436 2012-11-03 00:49:43.868818
[6:28] <mikeryan> [2,4] are in the active set, [2,4,0] are in the up set
[6:29] <mikeryan> 2 is primary, 0 is up but not active, so it's the backfill target
[6:29] <mikeryan> and as you can see from status, we're backfilling
[6:29] <gucki> ok, great to know how to read those messages :)
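(For reference, a minimal way to poke at that one PG from the command line, assuming the argonaut-era CLI; the id 3.174 comes from the pg dump paste above:)

    ceph pg 3.174 query          # per-PG detail, including recovery/backfill state
    ceph pg dump | grep 3.174    # the one-line summary quoted above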
[6:29] <gucki> local time here is now 2012-11-04 06:39
[6:29] <gucki> so it's around 7 hours since it failed
[6:30] <mikeryan> just under 1 GB of data
[6:30] <mikeryan> shouldn't take 7 hours to send a gig
[6:31] <gucki> no
[6:31] <gucki> and it's 100% idle
[6:31] <gucki> it was rebalancing a lot because i was adding that osd.4
[6:31] <gucki> but this one doesn't rebalance for hours.. :(
[6:33] <gucki> ah...there's some strange thing i just noticed
[6:34] <gucki> ah no, sorry....everything fine
[6:34] <gucki> i thought one ceph-osd daemon was taking too much cpu, but it's ok
[6:34] <gucki> all idle as seen by top and iostat
[6:35] <gucki> do you have an idea on how i can fix it? i need to remove osd.2, but i don't want to do this before the cluster is back fully healthy?
[6:36] <mikeryan> i'm not sure why it's stuck in this state, so i can't really offer any fixes..
[6:37] <gucki> oh, that's bad news :-( so what shall i do? the data is quite important.. :(
[6:38] <gucki> do you know when the devs will be here?
[6:38] <gucki> probably they can help me..
[6:39] <mikeryan> it's saturday night local time
[6:39] <mikeryan> you probably won't get a response until business hours monday, 9 AM PST (GMT -0800)
[6:40] <gucki> mikeryan: mh ok, too bad that such things always happen at the weekend :(
[6:41] <gucki> mikeryan: do you think removing osd.2 is safe? it looks like the degraded pg is on 2,4 ...so it should still be available when osd.2 is off, right? :)
[6:42] <gucki> mikeryan: probably it also retriggers the rebalance and so fixes the issue? :)
[6:46] <mikeryan> gucki: i *think* it should be ok, but i don't want to guarantee
[6:46] <mikeryan> i believe at this point osd 0 has all the data for sure
[6:46] <gucki> mikeryan: i mean if i take it out (not down) and if it goes wrong, i should be able to just put it back in...right?
[6:47] <mikeryan> again, i think that's right
[6:47] <gucki> mikeryan: do you think adjusting the tunables is risky? i'm running latest argonaut everywhere and only using qemu-rbd...so no old clients
[6:47] <mikeryan> don't know, this is outside my expertise
[6:48] <gucki> mh, too bad. because on http://ceph.com/docs/master/cluster-ops/crush-map/ it says
[6:48] <gucki> For hierarchies with a small number of devices in the leaf buckets, some PGs map to fewer than the desired number of replicas. This commonly happens for hierarchies with “host” nodes with a small number (1-3) of OSDs nested beneath each one.
[6:48] <gucki> sounds exactly like my r15714, which has 2 osds...right?
[6:51] <mikeryan> no, it looks like all your PGs have acting sets with 2 OSDs
[6:52] <mikeryan> this is a problem when you have replication >= 3
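(For context on the tunables question: the usual round trip for inspecting or editing the crush map, sketched only as background, since nobody here is actually recommending a change:)

    ceph osd getcrushmap -o /tmp/crush           # grab the binary crush map
    crushtool -d /tmp/crush -o /tmp/crush.txt    # decompile it; tunables, if set, appear near the top
    # ...edit /tmp/crush.txt...
    crushtool -c /tmp/crush.txt -o /tmp/crush.new
    ceph osd setcrushmap -i /tmp/crush.new       # inject the modified map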
[7:00] <gucki> mikeryan: ok....it's now fixed :)
[7:00] <mikeryan> fantastic!
[7:00] <gucki> mikeryan: i just restarted osd0
[7:00] <mikeryan> sorry i couldn't give you more authoritative answers
[7:01] <gucki> mikeryan: no problem, thank you very much for your help anyway! :)
[7:01] <gucki> mikeryan: and you helped me quite a lot, now i know better what to check and how to read the logs etc.. :)
[7:02] <mikeryan> cool, glad i helped
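(The fix above amounted to restarting one OSD so it would re-peer and resume the backfill; a rough sketch, assuming the classic sysvinit wrapper rather than the upstart jobs:)

    /etc/init.d/ceph restart osd.0   # bounce the daemon; peering restarts and the backfill resumes
    ceph -w                          # watch until the PG leaves active+recovering+remapped+backfill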
[7:37] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[7:53] * loicd (~loic@magenta.dachary.org) Quit (Read error: No route to host)
[7:53] * loicd (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) has joined #ceph
[8:20] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:23] * lofejndif (~lsqavnbok@1RDAAERGB.tor-irc.dnsbl.oftc.net) has joined #ceph
[8:44] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[9:02] * lofejndif (~lsqavnbok@1RDAAERGB.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[9:05] * synapsr (~Adium@c-69-181-244-219.hsd1.ca.comcast.net) has left #ceph
[9:20] * sagelap (~sage@62-50-199-254.client.stsn.net) Quit (Ping timeout: 480 seconds)
[9:38] * sagelap (~sage@62.50.243.161) has joined #ceph
[10:04] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[10:10] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) has joined #ceph
[10:11] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[10:11] * Leseb_ is now known as Leseb
[10:33] * sagelap (~sage@62.50.243.161) Quit (Ping timeout: 480 seconds)
[10:42] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[10:55] <ctrl> hi!
[11:33] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) Quit (Quit: Leaving.)
[11:36] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) has joined #ceph
[11:51] * sagelap (~sage@62-50-218-8.client.stsn.net) has joined #ceph
[12:08] * joao (~JL@62.50.239.160) has joined #ceph
[12:16] * danieagle (~Daniel@186.214.79.76) has joined #ceph
[12:18] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) Quit (Ping timeout: 480 seconds)
[12:26] * loicd (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) Quit (Ping timeout: 480 seconds)
[13:03] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) Quit (Quit: Leaving.)
[13:06] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[13:44] <nhm_> good morning #ceph
[13:44] * nhm_ is now known as nhm
[13:51] <jtang> jefferai: just to comment on them backblaze pods, we found that out recently :P
[13:51] <jtang> we have ~250tb's of disk in them pods
[13:52] <jtang> its 1pm here!
[13:52] <jtang> its afternoon
[13:56] <NaioN> those backblaze pods are nice, but i don't think you can use them in combination with ceph...
[13:58] <NaioN> they are underpowered with cpu and memory for the number of disks you have
[14:02] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) has joined #ceph
[14:02] <jtang> we're finding that
[14:02] <jtang> we might either get another chassis so we can split the disks out to 90tb per unit
[14:03] <jtang> or else get better sata/raid cards
[14:03] <NaioN> yeah and a dual cpu board with 2 faster cpu's and more mem
[14:03] <jtang> we're limited with memory
[14:04] <jtang> i think the motherboards can only take 4gb dimms
[14:04] <NaioN> hmmm that's bad
[14:04] <jtang> we've 8gb per machine right now, i guess we can do 16gb if we need to
[14:04] <NaioN> We had supermicro servers with 24 disks (4U) and 2 4-core xeons and 24 GB mem
[14:05] <NaioN> and even with that setup mem and cpu can be a problem
[14:05] <jtang> well it was a bit of a gamble with the pods
[14:05] <jtang> we're still thinking about what to do with them
[14:05] <NaioN> mainly the mem and mainly with recovering
[14:05] <jtang> we're gonna try one or two more things with ceph before we re-evaluate
[14:06] <jtang> we were using gpfs beforehand, and it was horribly slow on the pods
[14:06] <NaioN> hmmm as far as I know backblaze uses its own layer for replication
[14:07] <NaioN> I assume it's a lot lighter...
[14:07] <jtang> yea they have some rest api
[14:07] <NaioN> in the faq they say they use LVM and JFS and then their layer
[14:08] <jtang> we tried lvm and xfs
[14:08] <jtang> partly because we didn't want to build and maintain modules for the distro we're using
[14:08] <jtang> that wasn't so good
[14:09] <NaioN> well I went to the workshop last friday and I'm now looking for downsizing the nodes :)
[14:10] <jtang> one of the guys at the workshop on friday suggested a few things to try
[14:10] <NaioN> little cheap osd nodes
[14:10] <jtang> heh yea
[14:10] <jtang> im in the middle of writing some ansible playbook for ceph right now
[14:10] <jtang> since im messing with it, i might as well do something useful
[14:11] <jtang> we've pretty much decided to run with ceph for one of the projects we're working on
[14:12] <jtang> which leads to asking if there is an example of the chef cookbooks running in solo mode?
[14:12] <NaioN> i don't know
[14:12] <NaioN> we don't use ceph
[14:12] <NaioN> sorry i mean chef :)
[14:14] <NaioN> we are using it semi-production (second backup)
[14:15] <NaioN> but we made the error of scaling up the nodes too much, and if something goes wrong and you have to recover from a node failure the cluster gets a big hit
[14:15] <jtang> we wont be in production with ceph for at least another 6-12months
[14:15] <jtang> i hope not to make that error ;)
[14:15] <NaioN> well they will be releasing the next stable release somewhere half-way december
[14:16] <NaioN> we only use RBD at the moment and that's considered stable
[14:16] <NaioN> we haven't had any troubles with that
[14:17] <jtang> we're thinking the radosgw or cephfs is what we will want to use
[14:17] <jtang> the rbd component is less interesting for us right now
[14:17] <jtang> though i do know the ops people at my workplace like the idea of rbd
[14:31] <NaioN> well radosgw is also considered stable
[14:32] <NaioN> cephfs not
[14:33] * lennie (~leen@524A9CD5.cm-4-3c.dynamic.ziggo.nl) has joined #ceph
[14:35] <lennie> Hi, I've found a way to crash Ubuntu 12.10 with a cephfs kernel mount
[14:36] <lennie> It is really simple: just do: chroot /cephfsmountpoint /some/binary
[14:41] <lennie> this doesn't seem to be a known bug ?
[14:51] <lennie> just tested with using a lxc container, it has the same problem
[14:53] * Anticimex (anticimex@netforce.csbnet.se) has joined #ceph
[15:03] <sagelap> lennie: what does the stack trace in dmesg output look like?
[15:03] <lennie> sorry, I should have mentioned: hang, not crash
[15:04] <lennie> I'll have an other look if I can get any other information. I forgot to run it on the console
[15:04] <sagelap> that'd be great, thanks!
[15:06] <lennie> yes, it does show a real crash on the console :-)
[15:08] <lennie> here is a screenshot: http://postimage.org/image/thpncvh1l/
[15:08] <lennie> not sure if that gives enough information, haven't played with the kernel in ages :-)
[15:15] <lennie> I'm running it a couple of times to see if I get different results, I found at least a second one: http://postimage.org/image/3v6sme6zl/
[15:15] <nhm> lennie: I poked around looking for similar stacktraces. There's lots of random stuff out there.
[15:16] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) Quit (Quit: Leaving.)
[15:16] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) has joined #ceph
[15:18] <nhm> what kernel version are you on?
[15:19] <lennie> this is just the standard Ubuntu 12.10 install: 3.5.0-27-generic x86_64
[15:20] <lennie> maybe it doesn't have the latest updates, I could do that right now (it is just a test VM)
[15:20] <gucki> hi there
[15:21] <nhm> lennie: maybe. I'm seeing stuff like this for slightly older kernels that might be related to acpi, but not yet any silver bullet. Does that machine have any other strange issues?
[15:22] <gucki> i used to use ext4 and so i had "filestore xattr use omap = true" in my ceph.conf. now i setup new osds with xfs, but i forgot to remove the "filestore xattr use omap = true" from these ceph.conf. does it have a performance impact? can i safely remove this setting and restart the osd?
[15:22] <lennie> nhm: no issues that I know of
[15:22] <nhm> gucki: you can remove it, but I think we are considering making it the default option anyway.
[15:23] <gucki> nhm: ah ok, so it doesn't have any performance impact? so i'll leave it in...
[15:23] <nhm> gucki: It's not needed for xfs, but if I remember right it was ever so slightly faster putting the xattrs in leveldb instead of xfs.
[15:23] <lennie> nhm: it's running under KVM on Ubuntu 12.04.1 LTS
[15:24] <gucki> nhm: ok, then i'll just leave it unchanged. thanks! :)
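(The setting being discussed, as it might sit in ceph.conf; per nhm it is harmless to leave in place on xfs, and it was needed on ext4, whose xattrs are too small for ceph's metadata:)

    [osd]
    filestore xattr use omap = true   # store xattrs in leveldb instead of the backing filesystem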
[15:24] <nhm> lennie: ok. I found an old thread that might be relevant here: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg06056.html
[15:25] <gucki> another question: currently i only have 1 monitor and 8 osds. can i simply follow http://ceph.com/docs/master/cluster-ops/add-or-rm-mons/ to add two new monitors? will there be any service interruption when doing this?
[15:27] <nhm> gucki: I think so, but most of what I do doesn't require modifying existing ceph clusters so I'm probably the wrong person to give advice. ;)
[15:27] <lennie> nhm: guess debootstrap also uses chroot
[15:28] <lennie> nhm: it really is as simple as running chroot look at the top of the second image I posted: http://postimage.org/image/3v6sme6zl/
[15:29] <jefferai> jtang: yeah, sorry you're running into those issues
[15:30] <nhm> lennie: sorry, I'm distracted with a 3 year old trying to divert my attention. :) Might be good to update the bug in that mailing list post with your findings.
[15:30] <sagelap> lennie: crazy. let me see if i can reproduce.
[15:31] <jefferai> I was lucky enough to read enough comments and ask in here beforehand so I avoided that issue...ended up buying a bunch of machines from Silicon Mechanics and got them to sell me those machines without any network cards or drives in them
[15:31] <jefferai> so I could put in all my own
[15:31] <lennie> sagelap / nhm: was just thinking is it possible to run chroot under strace to get a better idea of what system call it is ?
[15:31] <jefferai> saved me a bunch of money...because of ceph's redundancy I'm using consumer grade drives instead of enterprise (and without markup)
[15:32] <jefferai> so I got 4 machines with 20 2.5" slots and 8 machines with 12 3.5" slots, putting in 512GB SSDs and 4TB drives respectively
[15:32] <nhm> jefferai: cool, I'll be curious how well the drives hold up. We had two different storage platforms at my last job that used the exact same enterprise drives. One had 1 drive failure in 5 years. The other was losing 2-3 drives a month.
[15:33] <jefferai> nhm: yeah, hard drives are like that
[15:33] <jefferai> often times you'd see those failures in drives from the same batch
[15:33] <nhm> yeah
[15:33] <jefferai> you now often get shipped drives with batches pre-mixed
[15:33] <nhm> I also suspect that the second platform might have had a lot more vibration.
[15:33] <jefferai> I got boxes of 20 drives and the serial numbers are not sequential
[15:33] <jefferai> ah, could be
[15:34] <nhm> And the raid controllers may have been more finicky.
[15:34] <jefferai> yeah
[15:35] <nhm> the drives might have actually been fine and just shipped off to the next customer. ;)
[15:35] <darkfaded> vendors messing with the firmware also make a lot of difference
[15:35] <jefferai> nhm: I'm also wondering how well they'll hold up, but as long as our admin keeps an eye on the smartctl emails...
[15:35] <jefferai> I'll have to get down the procedure for taking a drive and/or node out of the ceph cluster for that eventuality
[15:36] <nhm> lennie: let me know if you try the strace.
[15:37] <lennie> nhm: I can't scroll up, so I can't see the real information anymore. Any tips on how to get stuff off a VM before a crash ?
[15:39] <lennie> I'll try to run qemu/kvm on the commandline and use a tty for console, maybe that will help ?
[15:39] <nhm> lennie: No idea... Maybe!
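(One way to capture the whole panic from a test VM, following lennie's console idea; a sketch only, assuming qemu is launched by hand, and the disk image name is made up:)

    # guest kernel cmdline: console=ttyS0,115200 console=tty0
    qemu-system-x86_64 -m 1024 -drive file=test.img \
        -serial file:/tmp/guest-console.log    # the full oops/panic text ends up in this file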
[15:39] <nhm> lennie: would it work to simply tee the strace output to a file?
[15:40] <lennie> nhm: it says pretty clearly at the end of the kernel trace: not syncing ;-)
[15:40] <gucki> nhm: argh, i just added the monitors and now everything hangs!! :(
[15:41] <lennie> nhm: not that I didn't try, it didn't work
[15:41] <gucki> nhm: the ceph -w i started before now shows only "0 monclient: hunting for new mon"
[15:41] <nhm> gucki: boo!
[15:41] <gucki> when i now do ceph -w it just hangs :(
[15:42] <gucki> all 3 monitors are running, so i'm not sure why it hangs?
[15:42] <nhm> gucki: Any errors in the logs?
[15:42] <gucki> r15714 says we are not in quorum
[15:43] <gucki> same for r15717
[15:43] <gucki> 15714 and 15717 are the new ones
[15:43] <nhm> gucki: hrm. all of the ips/ports are right and everything?
[15:43] <gucki> yes
[15:44] <gucki> ok it seems all monitors now say they are not in quorum
[15:44] <gucki> should i just disable the two new monitors?
[15:45] <nhm> gucki: Maybe? ;)
[15:47] * danieagle (~Daniel@186.214.79.76) Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[15:49] <gucki> nhm: shit, removing them from the config doesn't help...i killed all 3 and started the old one, but it still says it's not in quorum
[15:50] <nhm> gucki: found some discussion on the IRC channel from about a month ago: http://irclogs.ceph.widodh.nl/index.php?date=2012-09-08
[15:50] <nhm> gucki: look at the stuff MooingLemur posted
[15:52] <gucki> nhm: seems like the same problem
[15:52] <gucki> nhm: but now i need to get my cluster online asap
[15:52] <gucki> nhm: the data is really important
[15:52] <gucki> nhm: the problem is....i cannot do anything with the ceph command anymore, because the only monitor says it's out of quorum :(
[15:53] <gucki> nhm: can i somehow force it into quorum?
[15:54] * sagelap (~sage@62-50-218-8.client.stsn.net) Quit (Ping timeout: 480 seconds)
[15:54] <nhm> gucki: I don't know, I'm going to see if I can get a hold of one of the guys that knows more about the mons.
[15:54] <gucki> nhm: that would be really awesome :)
[15:55] <gucki> nhm: it seems all the osds are still working, i just cannot do anything anymore :(
[15:55] <nhm> gucki: yeah, super annoying. I'm going to log a bug about the doc page too since it seems like at least two people now have had a problem with it.
[15:55] <gucki> nhm: thanks! :)
[15:56] * joao (~JL@62.50.239.160) Quit (Ping timeout: 480 seconds)
[16:00] <gucki> nhm: ok, i found i can at least access the monitor using its admin socket: ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status
[16:01] <gucki> nhm: now i "only" need to know how to bring everything back online :)=
[16:01] <nhm> heh, ok. I haven't had any luck getting a hold of people yet.
[16:01] <gucki> http://pastie.org/5181349
[16:02] <gucki> i just need to remove the new d from there somehow i guess
[16:02] <gucki> new b, not d
[16:02] <gucki> is there anywhere a list of commands i can send over the admin socket?
[16:03] <nhm> gucki: hrm, there may be a command to get the list.
[16:03] <lennie> gucki: can you do these commands ?: ceph get monmap -o output; monmaptool --print output
[16:04] <gucki> nhm: ok, i tried with "help" and then it shows some things..but nothing which seems to help me
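(For the record, the admin socket usage from above; "help" lists whatever the daemon supports, and mon_status is the call that still works while the normal cluster commands hang:)

    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok help          # list available socket commands
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status    # quorum state of this one mon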
[16:04] <gucki> lennie: "ceph get monmap -o output" hangs
[16:05] <gucki> lennie: http://pastie.org/5181364
[16:06] <gucki> lennie: /tmp/monmap is the monitor map i got using "ceph mon getmap -o /tmp/monmap" for adding the two new monitors
[16:08] <lennie> I was hoping you could run it on the admin socket, but seems you can't
[16:09] <gucki> lennie: can i somehow revert to the old monmap?
[16:10] <gucki> lennie: then i would be online again with just 1 monitor as before, right?
[16:10] <lennie> gucki: in theory yes, but I'm no expert (!)
[16:11] <nhm> gucki: added a bug report: http://tracker.newdream.net/issues/3438
[16:12] <gucki> do you know how i can import a monitor map i changed using the monmaptool?
[16:12] <nhm> gucki: I think that's true, but I'm loath to tell you to do it without having Greg or someone verify that it's a good idea.
[16:12] <lennie> gucki: I see this documentation: http://ceph.com/docs/master/cluster-ops/add-or-rm-mons/
[16:13] <lennie> gucki: but I would advise against it for a production environment (!)
[16:13] <gucki> lennie: well, it is production and i need it back online
[16:13] <nhm> lennie: that's the documentation that seems to have screwed him over in the first place. ;)
[16:13] <gucki> nhm: do you see any chance Greg will be online anytime soon?
[16:14] <nhm> gucki: he was online like 2-3 hours ago. Unfortunately it's like 7am in California right now.
[16:14] <lennie> nhm: ahh, I see... ;-)
[16:14] <gucki> lennie: so you're thinking of "ceph-mon -i a --inject-monmap /tmp/monmap", right?
[16:14] <gucki> nhm: mh i see, so probably he's in the middle of his sleep :-)
[16:15] <gucki> the monmap only stores which monitors exist, not any actual pg data etc...right?
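(What the recovery attempted below boils down to, pieced together from the messages above; a sketch, not an endorsed procedure. /tmp/monmap here is the map gucki saved before adding the new monitors, so it only lists the original mon; with only a newer map, monmaptool --rm <name> could trim the half-added entries instead:)

    # with every ceph-mon stopped:
    monmaptool --print /tmp/monmap              # confirm only the original monitor is listed
    ceph-mon -i a --inject-monmap /tmp/monmap   # push the old map into mon.a's store
    # then start mon.a again and check ceph -w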
[16:15] <jmlowe> good thing I checked in, I was thinking about adding a mon tomorrow
[16:17] <lennie> gucki: that is what the documentation says, I don't know exactly if it solves something or makes it worse
[16:17] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[16:18] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[16:18] * sagelap (~sage@62-50-218-8.client.stsn.net) has joined #ceph
[16:19] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[16:20] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[16:21] <lennie> gucki: I had a quick look, ceph-mon is actually the mon process, so at least that part of the documentation is correct. you'd need to shutdown all mon processes first
[16:22] <jtang> jefferai: its okay, it didnt come out of my budget ;)
[16:22] <lennie> gucki: I really don't know what data is kept in the mon directory and how critical it is
[16:24] <lennie> gucki: it does store a pgmap in that directory
[16:25] <lennie> gucki: looking from afar, in theory it should only change the monmap directory if you inject a new monmap as you suggested
[16:25] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[16:25] <gucki> lennie: there only seems to be a list of osds
[16:26] <gucki> lennie: i just saved the current monitor folder and i'm giving the inject a try now
[16:26] <lennie> gucki: good luck with that !
[16:26] <gucki> lennie: i think it should be fine
[16:26] <gucki> lennie: pray for me :-)
[16:28] <nhm> gucki: good luck! I'll see if I can track down any other info if this doesn't work.
[16:28] <gucki> puuhh
[16:28] <gucki> seems like it worked
[16:28] <nhm> woot
[16:28] <gucki> at least ceph -w is back online :)
[16:28] <lennie> 2 thumbs up ! :-)
[16:29] * joao (~JL@62.50.239.160) has joined #ceph
[16:29] <gucki> thanks guys!!!
[16:30] <nhm> gucki: at least we got it back to working, sorry it happened in the first place!
[16:30] <lennie> gucki: no problem, I'm just glad it worked
[16:30] <nhm> gucki: once Greg is around he can probably help get you to 3 mons.
[16:31] <gucki> besides all the fear i had that it had crashed the cluster, i think it was a good experience. ceph seems really good at handling faults :)
[16:31] <lennie> nhm / gucki: so what did you do to end up in this state ? You had 1 monitor and added 1 more ?
[16:32] <nhm> lennie: afaik gucki was trying to add monitors using the process described here: http://ceph.com/docs/master/cluster-ops/add-or-rm-mons/
[16:32] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[16:32] <gucki> yes, i had only 1 monitor. then i wanted to add 2 others. but as soon as i added the second one everything blocked because all monitors were saying "not in quorum"...
[16:33] <nhm> lennie: I created a bug here: http://tracker.newdream.net/issues/3438
[16:33] <lennie> gucki / nhm: I think the documentation does mention it: always have an odd total number of monitors
[16:34] <gucki> lennie: yes, but how can you do this? how to get from 1 to 3?
[16:34] <gucki> lennie: i added both new monitors to the ceph.conf on all nodes before
[16:35] <gucki> lennie: not sure if everything went down on "ceph mon add mon.b 10.0.0.4:6789" or "ceph-mon -i b"
[16:36] <lennie> gucki: because you only had 1, you can't restart the one without taking the mon offline, but if you have 3 you can take one mon offline and add 2 more with the inject-monmap ??
[16:37] <lennie> gucki: yes, maybe that is it: add two mons and only start them after you've added them.
[16:38] <gucki> lennie: yes, with the "raw" tools like inject it should work. but i'm not sure if the new ones will catch up on the data the old one has?
[16:38] <gucki> lennie: i think i'll wait for Greg, one such experience is enough per day ;-)
[16:39] <lennie> gucki: I would wait for Greg as well :-)
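(For completeness, the add-a-monitor sequence from the docs page under discussion, sketched from memory and hedged accordingly; note that nhm's bug report above is about exactly this flow when starting from a single monitor, so treat it as the thing being debated rather than a recipe. The ip/port is the one gucki quoted:)

    ceph auth get mon. -o /tmp/mon.keyring               # monitor keyring
    ceph mon getmap -o /tmp/monmap                       # current monmap
    ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add b 10.0.0.4:6789                         # register the new monitor in the map...
    ceph-mon -i b --public-addr 10.0.0.4:6789            # ...and start it straight away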
[16:39] <gucki> now, i only wonder what monitors are really used for. i mean the cluster of 8 osds stayed online the whole time serving clients without any monitor? ;-)
[16:39] <lennie> gucki: on the issue of the raw tool and if it would catch up on the old one: I think it would work if the old one hadn't changed state before the new one was added.
[16:40] <lennie> gucki: I guess you use it only for VMs with blockstorage ?
[16:40] <gucki> lennie: yes exactly, kvm-rbd
[16:40] <gucki> no cephfs, no kernel rbd clients
[16:41] <lennie> gucki: well, the mon isn't in the 'data path', when the VM has a working connection it has no reason to talk to the mon AFAIK
[16:41] <gucki> lennie: even a rebalance/ recovery worked while the monitor was offline
[16:42] <lennie> gucki: the OSDs talk to each other to do that, a 'gossip' protocol is used
[16:42] <gucki> lennie: ok so, what do the clients need from the monitor on connect? i mean the osds are in the ceph.conf file?
[16:42] <lennie> gucki: this only works as long as there is no hardware failure and network connections stay up
[16:43] <lennie> gucki: the CRUSH-map, the PG-information and the list of addresses
[16:43] <lennie> gucki: AFAIK (!) :-)
[16:44] <gucki> lennie: ok, so i hope the monitor will catch up on the pg changes (recovery/rebalance does pg changes?) the osds did while the mon was down.. :)
[16:44] <lennie> gucki: that I don't know. I aint no expert :-)
[16:45] <gucki> lennie: i think some tool like "fsck" for the cluster would really relax me :)
[16:46] <lennie> gucki: cephFS really needs it, if something breaks the whole fs is offline :-(
[16:46] <gucki> lennie: mh, i always thought cephfs has special mds for this. i didn't even setup thise
[16:46] <gucki> those
[16:46] <lennie> gucki: not sure about the object/block storage or the monitors. I don't know what checks they do
[16:47] <lennie> gucki: I do know they do 'scrubbing' which includes OSDs talking to each other to compare their data
[16:47] <gucki> yes, i saw that scrubbing from time to time in the logs
[16:48] <lennie> gucki: yes, cephfs needs the MDS: Meta Data services
[16:48] <lennie> gucki: at least once a day, I believe
[16:48] <gucki> ok, now recover is done and cluster is again in HEALTH_OK :)
[16:48] <gucki> i'll shutdown some vm and restart it...hopefully it still works :)
[16:49] <lennie> gucki: i guess you could run the netstat -na command and see if any machines are connected to the mons for example
[16:51] <gucki> nhm: i think if some ceph experts have time it'd also be great to have some internals documented. like what a mon does, if the cluster still works when all mons are dead, what happens when the first mon comes back alive ...i think it'd really make people who had problems more relaxed (just like me now, who wonders if the cluster is really fully ok when it says HEALTH_OK) :)
[16:56] <lennie> gucki: I was at the first #cephday in Amsterdam a couple of days ago. Have to say the presentation by sage weil was still the best, this is a video from the same presentation he gave somewhere else: http://vimeo.com/50620695
[16:56] <gucki> lennie: great thanks, going to watch it now :)
[16:57] <gucki> btw, in general i'm really fascinated by ceph :)
[16:57] <lennie> gucki: it is an introduction presentation, but it contains all the information about how it fits together
[16:57] <gucki> i really tried a lot before and ceph is the only solution which really works for me :)
[16:58] <lennie> gucki: if you've used ceph you should probably watch the presentation again to really understand what is going on
[16:58] <gucki> lennie: yeah, i'm already streaming :-)
[16:59] <lennie> gucki: there is a lot in the presentation which people might not understand from watching it the first time (before using ceph)
[16:59] <lennie> gucki: ceph is the only system which uses an algorithm like CRUSH, which is why it can grow and contract easily
[17:01] <lennie> gucki: and the algorithm is the reason you don't depend on the meta data server or mon when getting/storing data, the client connects to the OSD directly.
[17:05] <lennie> gucki: Sage actually worked with Lustre at some educational institute before he worked on the CRUSH algorithm. If you know anything about Lustre, it depends on a central server and that is where the bottleneck is in the system, you can only scale as long as the central server can keep up with handling/storing requests/information
[17:06] <lennie> gucki: the central server obviously has a failover in a Lustre setup, but it can't be loadbalanced I believe. It even depends on using the same shared storage
[17:07] <lennie> gucki: ceph has the better model
[17:07] <nhm> lennie: yeah, you can do active/passive with the lustre MDS.
[17:08] <nhm> lennie: it can be temperamental though.
[17:08] <nhm> lennie: lustre has probably gone about as far as you can go with a single MDS design.
[17:08] <nhm> it's amazing they've scaled it as far as they have.
[17:09] <lennie> That is all thanks to using a hashing algorithm.
[17:10] <lennie> I forgot to thank Sage for coming up with it at the conference ;-)
[17:10] <lennie> Or ask how he came up with it. these algorithms aren't new, but they are in distributed storage so it seems
[17:11] <nhm> lennie: some others have tried to use consistent hashing I think.
[17:11] <gucki> lennie: sorry, need a couple of minutes to get some stuff done (i'm a little behind because of the outage *g*)...i'll answer a little later. hope it's ok :)
[17:12] <lennie> gucki: no worried, I didn't ask a question ;-)
[17:12] <lennie> s/worried/worries/g
[17:12] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:13] <nhm> lennie: I think it just comes down to the fact that there aren't that many people in the world that have the right combination of programming knowledge, algorithm knowledge, and really care about working on distributed filesystems.
[17:13] <lennie> nhm: yeah, it probably needs the right hash and also the placement groups to be effective to allow for the growing/contracting nature
[17:13] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[17:13] <nhm> Not to mention that Sage seems to have this crazy ability to context switch between different things that I only wish I could emulate.
[17:14] <Robe> hehe
[17:14] * joao (~JL@62.50.239.160) Quit (Ping timeout: 480 seconds)
[17:15] <Robe> and the money to keep on working on it!
[17:15] <Robe> can't wait till I can fire up my first few nodes
[17:16] <lennie> Robe: I can wait for a bit longer, I want to get some more experience with it so that I know what works and what doesn't and what to do if something breaks ;-)
[17:17] <Robe> I've got 7000 disks that need liberating!
[17:17] <lennie> can I ask how many machines that adds up to ?
[17:18] <Robe> 250 * 20 & 75 * 36
[17:18] <Robe> probably not all going to be used for rados
[17:18] <nhm> Robe: nice, what are you running now?
[17:18] <Robe> ghetto replication of hell
[17:18] <Robe> some python/twisted/mysql abhorration
[17:18] <Robe> strung together with cronjobs and sql scripts
[17:19] <nhm> Robe: nothing like baling wire and duct tape. :)
[17:19] <lennie> sounds a bit scary and probably is worse than it sounds ;-)
[17:19] <Robe> lennie: you bet
[17:20] <Robe> probably going to start slow, still need to look how the crush semantics look like
[17:20] <Robe> and if we also can replace some of the redis/mysql app-specific data with rados
[17:21] <lennie> Robe: do you intend to use key/value storage of ceph for that ?
[17:21] <Robe> yesh
[17:21] <Robe> rados is essentially a key/value store
[17:22] <lennie> Robe: yes, I guess it is. I wonder how fast it is for really small pieces of data
[17:22] <Robe> for reads it shouldn't be too much of an issues
[17:22] <Robe> -s
[17:22] <lennie> Robe: I can imagine Redis being very efficient
[17:22] <nhm> lennie: probably is going to depend on the version. Sam has been doing a lot of work on our threadpool and locking code.
[17:23] <Robe> writes are a different story since you've got latency between client and primary OSD for the placement group and primary OSD and all other OSDs
[17:23] <nhm> lennie: I think it will make it into bobtail.
[17:23] <Robe> plus local write overhead, though this can be limited by using proper write-back storage
[17:23] <lennie> cool :-)
[17:23] <Robe> nhm: some scalability issues?
[17:24] <Robe> one of the issues I see compared to Riak is that rados doesn't support different values for n and w
[17:24] <nhm> Robe: We were seeing some limits with SSDs and other faster than disk technologies.
[17:24] <Robe> n being amount of replicas and w amount of replicas needed for a successful write
[17:24] <Robe> nhm: oh, ok
[17:24] <Robe> nhm: you working for inktank?
[17:24] <nhm> Robe: yep
[17:24] <Robe> were you in amsterdam too?
[17:25] <lennie> Robe: how about you ? I was :-)
[17:25] <Robe> lennie: I was! ;)
[17:25] <nhm> Robe: naw, The only two engineers from the states that went were Sage and Greg afaik.
[17:25] <Robe> nhm: oh, ok :)
[17:25] <Robe> yeah, met both of them
[17:25] <nhm> Robe: I'm going to SC12 though. :)
[17:26] <Robe> sc12?
[17:26] <nhm> Robe: Supercomputing 2012.
[17:26] <Robe> is greg in here too?
[17:26] <Robe> and sounds neat :)
[17:26] <nhm> Yeah, Greg is often in here. He's probably trying to get his sleep schedule back to normal right now. :)
[17:26] <Robe> probably not my cup of tea though
[17:26] <Robe> heh, what's his nick?
[17:27] <nhm> usually gregf
[17:27] <Robe> ah, great
[17:27] <Robe> thanks
[17:28] <nhm> He can help with any problem. You can tell him I said that. ;)
[17:28] <Robe> I gathered as much from chatting with him :D
[17:28] <nhm> I think he's the 2nd person Sage hired to work on Ceph.
[17:29] <Robe> ah :)
[17:29] <lennie> I was able to ask Greg a question he needed to ask Sage about ;-)
[17:29] <Robe> when was this?
[17:29] <Robe> recentish?
[17:29] <lennie> in Amsterdam
[17:29] <Robe> err, meant the hiring ;)
[17:29] <nhm> Robe: oh, I think Greg has been working on ceph for a couple of years. I started with Inktank about 8 months ago.
[17:30] <nhm> Actually, I started with DreamHost at that point since Inktank didn't exist yet.
[17:30] <Robe> yeah
[17:30] <Robe> nhm: is there an overview of what kind of "complex" operations librados supports nowadays?
[17:30] <Robe> next to get/put/delete
[17:31] <nhm> Robe: probably somewhere. :)
[17:31] <Robe> at least sage hinted that more of the "redis-style" operations are/will be possible
[17:31] <lennie> Robe: I think Sage did mention something about transactions
[17:31] <Robe> and http://ceph.com/docs/master/api/librados/ looks rather lowlevel
[17:31] <nhm> Robe: Hard to know who is doing what now. We've grown pretty rapidly.
[17:31] <Robe> lennie: that, and being able to increment/decrement stuff etc.
[17:31] <Robe> get a scrum/kanban guy!
[17:33] <nhm> Robe: Dev stuff is slightly more well defined. I kind of sit between a bunch of different groups so it's hard to keep track of it all.
[17:33] <Robe> what's your job? :>
[17:34] <nhm> Robe: Performance Engineer. I do everything from profiling code to helping our professional services group with customer engagements.
[17:34] <Robe> ahhh!
[17:35] <nhm> Robe: I've been trying to get a good handle on how various hardware handles our workload too.
[17:35] <Robe> just the man I've been killing to see! (to quote some max payne ;) )
[17:35] <lennie> Robe: what I did understand from another talk (video) is that when using cephfs the mds sends parts of the meta data to the osds: like add file to directory. I guess that is kinda like changing one attribute in one object
[17:35] <nhm> Robe: uh oh. ;)
[17:35] * joao (~JL@62.50.239.160) has joined #ceph
[17:35] <Robe> nhm: well, not necessarily right now ;) but I'm very interested in performance myself and have mostly worked with postgres in the past
[17:36] <Robe> lennie: yep - I'd like to see a stable API (documentation) I can use ;)
[17:36] <nhm> Robe: You might be interested in an article I wrote: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[17:36] <Robe> nhm: we're pushing about 900mb/sec writes over the whole cluster so with the amount of spindles we got this shouldn't be an issue
[17:36] <nhm> I'm working on a followup now looking at non-ssd setups.
[17:36] <nhm> Have a number more planned.
[17:37] <Robe> ahh, great
[17:37] <Robe> have skimmed over it
[17:37] <Robe> not too relevant for me since hardware is already in place
[17:37] <lennie> Robe: my coworker said to me in Amsterdam: how about adding librados to PostgreSQL so it can talk to ceph directly for storage ;-)
[17:37] <Robe> lennie!!
[17:37] <lennie> Robe: not sure if that's such a good idea :-)
[17:37] <nhm> Robe: yeah, after the next one, I'll probably do an argonaut/bobtail comparison with default parameters, then look at tuning parameters.
[17:37] <Robe> perfect idea!
[17:38] <Robe> nhm: nice
[17:38] <lennie> Robe: I'd think the latency for a commit would be pretty high ?
[17:38] <Robe> nhm: it'd be great if you could write an article to explain what OSD operations look like on the POSIX interface
[17:38] <Robe> lennie: yes, you only would want to operate this with synchronous_commit disabled
[17:39] <Robe> but that's ok for most workloads nowadays
[17:39] <Robe> at least in the line of business where I work
[17:39] <nhm> Robe: Like what kind of operations hit the underlying osd filesystems?
[17:39] <Robe> nhm: yes, and what the rest of the resource consumption looks like
[17:39] <Robe> and what the on-disk representation looks like, at least to the level that's of interest from a performance/operations PoV
[17:40] <nhm> Robe: yeah, I've got blktrace data for all of the runs I was doing. I've been meaning to do some analysis of it.
[17:40] <nhm> Too much to do, too little time.
[17:40] <Robe> hah, yeah :)
[17:40] <Robe> nhm: are you familiar with postgresql internals?
[17:40] <lennie> nhm: I think Sage or Greg mentioned that an OSD daemon (so one disk on a server) needs about (safety margin) 1GB of RAM and 1GHz of processor ?
[17:40] <nhm> Robe: naw, I once a long time ago did a bit of postgresql tuning, but that's not really my area.
[17:40] <Robe> lennie: yeah, that's a bit too ballparkish for me ;)
[17:41] <Robe> nhm: I totally loved the style of http://momjian.us/main/presentations/internals.html
[17:41] <Robe> had the pleasure to see bruce talk about the shared memory in person
[17:41] <Robe> this is sort of the missing "DBA documentation" of postgres
[17:41] <Robe> since the official manual is very light on these things
[17:41] <nhm> lennie: It probably depends on the underlying filesystem used, but those are probably reasonable catch-all values.
[17:41] <Robe> whereas e.g. oracle documentation covers these aspects also very in detail
[17:42] <Robe> and those are the things people are interested in when touching new technology
[17:42] <Robe> especially when it's going to be the foundation of a large cluster
[17:43] <Robe> nhm: you on twitter?
[17:44] <lennie> Robe: have you seen http://www.openstack.org/summit/san-diego-2012/openstack-summit-sessions/presentation/how-dreamhost-builds-a-public-cloud-with-openstack ?
[17:45] <Robe> nope!
[17:45] <Robe> thanks
[17:46] <lennie> nhm: I posted the link because I was thinking 1Ghz, what would that look like, I guess it is this: http://www.dell.com/us/enterprise/p/poweredge-c2100/pd
[17:47] <lennie> 6 cores, 12 disks
[17:47] <Robe> does OSD really need that much CPU power?
[17:48] <Robe> dunno what cpu-heavy stuff it does
[17:48] <lennie> Robe: I think only when it needs to converge in case of lots of failures
[17:49] <lennie> Robe: but they did say, they don't really know yet, it is a safety margin
[17:49] <Robe> mhm
[17:49] <Robe> nhm: work faster! :D
[17:49] <Robe> *runs for cover*
[17:49] <nhm> Robe: theoretically yeah. :) I don't use it much. @MarkHPC
[17:50] <Robe> you're a valley startup! you absolutely must use it!
[17:50] <Robe> ;)
[17:51] <lennie> I keep wondering if there are any ARM-cores which have special hardware already to offload the calculations, so the OSDs could just run on ARM.
[17:51] <lennie> I know in Amsterdam they did mention Atom processors
[17:51] <nhm> Robe: for cpu: ceph crc32c, btrfs, various moving of data around, etc.
[17:52] <Robe> crc32 is basically free, isn't it? ;)
[17:52] <Robe> data moveage - don't you use sendfile/splice/etc?
[17:53] <lennie> Robe: I wouldn't say that. crc32c isn't completely cheap
[17:53] <Robe> lennie: judging by nehalem xeons
[17:53] <Robe> l2/3 cache as far as the eye can reach!
[17:54] <Robe> but yeah, don't know how cache-thrashing-prone those hashing functions are
[17:54] * loicd (~loic@31.36.8.41) has joined #ceph
[17:54] <lennie> Robe: just look at the blog post on the ceph website of the testing. You can see the controller tests with btrfs did hit a limit.
[17:54] <nhm> Robe: the implementation we use is 3 cycles per 32bits of data. With the native instructions I think modern CPUs can do 128bits/cycle.
[17:55] * Oliver2 (~oliver1@ip-178-201-146-106.unitymediagroup.de) Quit (Quit: Leaving.)
[17:55] <Robe> lol.
[17:55] <Robe> so it's memory-bound
[17:55] <nhm> Robe: I think we've been getting better about how data gets moved around. Alex was telling me he thought there were some places we were doing copies but that may not have been correct.
[17:55] <Robe> *nods*
[17:56] <Robe> and from OSD perspective when the cluster converges it's just objects that get thrown around, right?
[17:56] <Robe> or is there some computation-heavy stuff also happening?
[17:57] <nhm> lennie: the limit on that node right now with ceph 0.50 is about 15 drives + 5 SSDs. I can do about 1.2GB/s (2.4GB/s if you count journal writes) with crc32c enabled, 1.4GB/s with crc32c disabled. It doesn't really scale past that.
[17:58] <nhm> At least with default configuration parameters.
[17:58] <nhm> I need to try scaling tests with bobtail and actually tweak some things.
[17:58] <lennie> Robe: I believe that CRUSH needs to run to figure out where data needs to be placed ??
[17:59] * loicd (~loic@31.36.8.41) Quit (Quit: Leaving.)
[18:00] <lennie> nhm: I remember this thread, btrfs also uses crc-32c and where its limit was at the time: http://www.spinics.net/lists/linux-btrfs/msg05757.html
[18:05] <lennie> What cpu feature do I need in /proc/cpuinfo to have hardware support, does anyone know ?
[18:06] <Robe> http://en.wikipedia.org/wiki/SSE4#SSE4.2
[18:06] <Robe> apparently
[18:07] <nhm> lennie: yeah, In my tests btrfs is theoretically using the hardware crc32c support.
[18:08] * janet_fg (~G25@94.20.34.125) has joined #ceph
[18:08] <nhm> check /proc/crypto
[18:09] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[18:09] <lennie> judging by an R610 we have at work, in Dell's case it's basically the previous generation.
[18:09] <Robe> ice!
[18:09] <Robe> err
[18:09] <Robe> nice
[18:09] <Robe> didn't know about /proc/crypto
[18:10] <lennie> cool
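(A quick way to check for the hardware crc32c support being discussed; the sse4_2 flag implies the crc32 instruction, and /proc/crypto shows which crc32c driver the kernel has registered:)

    grep -o -m1 sse4_2 /proc/cpuinfo     # prints "sse4_2" if the CPU has the instruction
    grep -B1 -A2 crc32c /proc/crypto     # look for crc32c-intel vs crc32c-generic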
[18:11] <lennie> maybe I'll see you guys around, it's dinner time for me
[18:11] * janet_fg (~G25@94.20.34.125) Quit (autokilled: Spamming. Contact support@oftc.net for further information and assistance. (2012-11-04 17:11:12))
[18:11] <Robe> lennie: we'll lurk ;)
[18:12] <lennie> I have a question: so who is currently using btrfs in production ?
[18:13] <Robe> I don't touch it with a three-foot-pole at the moment
[18:15] <Robe> judging by https://btrfs.wiki.kernel.org/index.php/Changelog you probably don't want to use it for kernels < 3.4
[18:15] <mikeryan> Robe: that's right
[18:16] <lennie> I know only a few things: when you want to use snapshots, mount with: noatime
[18:16] <Robe> mikeryan: you using it?
[18:17] <lennie> noatime is something that should have been made the new default on any Linux (and Unix ?) filesystem if you ask me anyway
[18:17] <mikeryan> Robe: no, i'm a dev
[18:17] <Robe> lennie: yeah... posix moves slow though ;)
[18:17] <Robe> mikeryan: ah
[18:17] <Robe> also with inktank?
[18:17] <mikeryan> definitely don't run btrfs on kernel < 3.4
[18:18] <mikeryan> use it at your own risk on recent kernels
[18:18] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[18:18] <mikeryan> and yep, inktanker
[18:18] <Robe> great!
[18:18] <Robe> so I don't need a support contract after all :D
[18:18] <Robe> (just kidding)
[18:18] <mikeryan> heh
[18:19] <mikeryan> it's an open source project, we do our best to provide support for our users regardless of whether they have a support contract
[18:19] <Robe> well, guessing from the changelog everything >= 3.4 seems harmless enough for OSD use-cases
[18:19] <mikeryan> but if you pay us you'll get better, faster, more directed service ;)
[18:20] * lofejndif (~lsqavnbok@82VAAHOFH.tor-irc.dnsbl.oftc.net) has joined #ceph
[18:20] <Robe> yeah, will hit up your sales this week
[18:20] * loicd (~loic@31.36.8.41) has joined #ceph
[18:22] <Robe> about that debian wheezy gets stamped so that the bpo kernels move to something fresher :)
[18:22] <Robe> err
[18:23] <Robe> about time
[18:27] <lennie> Debian testing is pretty stable, we run on a number of systems
[18:27] <Robe> yeah, but it's still stuck at 3.2 ;)
[18:28] <lennie> I know. :-/
[18:28] <Robe> and I think I compiled my last kernel... 5 years ago?
[18:29] <lennie> Yeah it's been a while... at least for production. I think I did a compile something like a year ago for some testing to play with btrfs.
[18:30] <mikeryan> i'm running an "official unofficial" 3.4 kernel on ubuntu 12.04
[18:30] <mikeryan> for bluez support, but still, btrfs is there too :P
[18:30] <lennie> I checked, I actually built it to test, euh play with, both btrfs and ceph
[18:31] <Robe> bluez is what?
[18:31] <mikeryan> linux bluetooth stack
[18:31] <Robe> ah
[18:31] <lennie> it was July this year, that isn't nearly a year. I guess it is just that so much happened since then ;-)
[18:31] <mikeryan> i was running bluez out of git for another project, and it turns out the kernel interface changed
[18:31] <Robe> mikeryan: I'm curious - what part are you focusing on at the moment at inktank?
[18:32] <mikeryan> OSD dev
[18:32] * loicd (~loic@31.36.8.41) Quit (Ping timeout: 480 seconds)
[18:32] <Robe> ah, ok
[18:32] <Robe> do you know if there's an somewhat uptodate librados API doc for more complex data operations?
[18:33] <mikeryan> hm, let me take a look
[18:34] <mikeryan> the stuff in doc/api is pretty old
[18:34] <mikeryan> but the API has been pretty stable
[18:34] <mikeryan> https://github.com/ceph/ceph/tree/master/doc/api
[18:34] <Robe> I'm just wondering how far rados can be a replacement for redis and mysql
[18:35] <mikeryan> i don't have much experience with redis (or key-value stores in general), but i think librados can be a suitable replacement
[18:36] <mikeryan> librados->rados
[18:36] <mikeryan> you write and retrieve objects by name
[18:36] <Robe> well, I don't want to only be able to read/write a key but also do operations on the values on the OSD itself
[18:36] <mikeryan> and it's pretty snappy, especially for small objects
[18:36] <Robe> e.g. increment a value
[18:36] <Robe> etc.
[18:36] <mikeryan> the API doesn't have support for things like that, that i know of
[18:36] <mikeryan> you'd have to read it, increment it, and write it
[18:37] <Robe> oh, ok
[18:37] <mikeryan> rados never looks at object data (except to CRC it in some cases)
[18:37] <Robe> so at the moment it's just a plain key/value store, only reading/writing bytes
[18:37] <mikeryan> yeah, you can look at it like that
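(A sketch of the read-modify-write pattern mikeryan describes, using the rados CLI rather than librados; the pool and object names are made up:)

    rados -p data put counter /tmp/counter.bin   # store an object by name
    rados -p data get counter /tmp/counter.bin   # read it back
    # any "increment" happens client-side: read, change the bytes locally, put again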
[18:37] <Robe> because sage mentioned that there's stuff on the horizon
[18:38] <mikeryan> heh, from sage's perspective ceph's just the tip of the iceberg :)
[18:38] <Robe> even extending osd with shared objects
[18:38] <Robe> err
[18:38] <Robe> .so's
[18:38] <Robe> extending functionality
[18:38] <mikeryan> hm, a plugin system?
[18:38] <mikeryan> first i've heard of it
[18:38] <Robe> basically
[18:38] <Robe> hahaha
[18:38] <Robe> ok :)
[18:39] <Robe> he should invite the devs too to his talks ;)
[18:39] <Robe> but I guess it was an ad-hoc thing on things for the future
[18:39] <lennie> haha, yeah. When sage presented everything he thought they might add in Amsterdam the audience went kinda quiet ;-)
[18:39] <mikeryan> we're busy implementing all the stuff he talks about!
[18:39] <Robe> mostly interested in the georeplication stuff
[18:40] <mikeryan> ah yeah, that stuff's been on the horizon for a while
[18:40] <mikeryan> most of our code assumes replicas are close (RTT-wise) and connected by a fat pipe
[18:40] <Robe> yesh
[18:41] <Robe> when you could split num-writes-needed-for-return from num-replicas that'd be easy
[18:41] <mikeryan> easi*er*
[18:41] <Robe> easy enough!
[18:41] <Robe> :>
[18:42] <mikeryan> that brings us into the grey realm of correctness enforcement
[18:42] <Robe> just write to local osds, have a local primary for each one, and sync the remotes lazily
[18:42] <mikeryan> and trust me, that's tricky enough as the code currently stands
[18:42] <Robe> hehe, yeah
[18:42] <Robe> I've read the dynamo paper
[18:42] <Robe> and their vector clock approach
[18:42] <Robe> which brings me to
[18:42] <Robe> you don't handle concurrent updates at all at the moment, do you?
[18:43] <mikeryan> define that term for me in this context
[18:43] <Robe> fair enough :)
[18:43] <Robe> Riak checks the ancestor of each write, if there are two writes with the same ancestor you'll get two siblings for the next fetch since they apparently are based off the same version meaning that they _could_ conflict
[18:44] <Robe> for Rados it's just last write wins since each write is coordinate by the primary OSD for a PG?
[18:44] <Robe> +d
[18:44] <mikeryan> that's how RADOS work, yep
[18:44] <mikeryan> works*
[18:45] <mikeryan> i can see how that behavior could be undesirable depending on your use case
[18:45] <Robe> and I guess the primary OSD serializes all writes giving you a consistent object state for each fetch whenever it's scheduled
[18:45] <mikeryan> yep, we explicitly sequence operations
[18:45] <mikeryan> we guarantee that operations will occur in order
[18:45] <mikeryan> but if two requests come at the same instant i'm not sure how we choose which one is first
[18:45] <mikeryan> probably whichever socket we poll first
[18:45] <Robe> *nods*
[18:46] <Robe> hmm
[18:46] <Robe> and for ceph all operations are sync as well I presume?
[18:46] <mikeryan> we have aio interfaces
[18:47] <Robe> just wondering how posix semantics work in the current RADOS world
[18:47] <Robe> e.g. if I could get inconsistent states if I tease the FS hard enough
[18:47] <mikeryan> ah, that is dealt with in cephfs/MDS
[18:48] <mikeryan> last i heard, the way we get around that is by making every operation a complete transaction
[18:48] <Robe> *nods*
[18:48] <Robe> wonder if ceph would be a stable underlying fs for say.. postgresql
[18:49] <Robe> which relies heavily on correct fsync behaviour
[18:51] <mikeryan> i believe our default behavior is strict POSIX compliance
[18:51] <mikeryan> that can be relaxed for improved performance on certain ops
[18:52] <Robe> who's the main ceph guy at inktank?
[18:52] <mikeryan> sage?
[18:52] <Robe> hahaha, ok :)
[18:52] <mikeryan> or you mean cephfs?
[18:52] <Robe> cephfs, yeah
[18:53] <mikeryan> gregaf and slang are working on it right now
[18:53] <mikeryan> i think maybe one other person
[18:53] <mikeryan> we just started spinning up on it again
[18:53] <Robe> ah, ok
[18:53] <Robe> gregaf is the same greg who was in amsterdam?
[18:53] <mikeryan> if you have after-hours questions about cephfs the best place to ask is the mailing list
[18:53] <mikeryan> yep, that's the guy
[18:54] <Robe> ah, great
[18:54] <Robe> since he was kinda not-too-sure how to answer the "can I use it" question ;)
[18:54] <Robe> and I was wondering where the rough spots are/what to look out for
[18:54] <mikeryan> if greg doesn't know, i'm not sure who to ask then :S
[18:55] <mikeryan> you can try sage, because he did much of the initial implementation
[18:55] <Robe> hehehe
[18:55] <Robe> schrödingers reliability
[18:55] <Robe> "you won't know if it's stable or not unless you try it with your workload"
[18:56] <Robe> oh well, dinner time
[18:56] <Robe> thanks for your time :)
[19:10] <mikeryan> np
[19:14] * lofejndif (~lsqavnbok@82VAAHOFH.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[19:16] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[19:25] <nhm> regarding cephfs: with a single MDS it's "mostly stable" :)
[19:26] <lennie> yes, so I heard. It also has no data stored, so if you copy the mds key to another system you can do failover ?
[19:29] <lennie> (as long as you kill any old running process before failover)
[19:30] <nhm> lennie: sounds right.
[19:30] <nhm> lennie: not sure if there is anything that could hang you up in practice.
[19:32] <lennie> I'd think as long as any previous sessions are gone/tcp connections broken. I don't think anything bad could happen...
[19:33] <lennie> euh... OK, delete is delayed. Not sure if that is an MDS function or OSD ?
[19:33] <lennie> OR maybe even the client ?
[19:34] <nhm> haven't thought about it too much yet. Mostly I've been focused on the OSDs.
[19:34] <lennie> it's the MDS, judging by what the documentation says: http://ceph.com/docs/master/dev/delayed-delete/
[19:36] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[19:36] <lennie> it's kinda funny, because we chose not to use glusterfs anymore because files could re-appear in case of a failure
[19:37] <gucki> Robe: a bit late probably, but i switched from debian to latest ubuntu because debian is missing syncfs support.
[19:38] <nhm> gucki: Are you running lots of OSDs per node?
[19:39] <lennie> nhm: anyway the window of delayed delete is much smaller in ceph than the problem I talked about in glusterfs
[19:39] <nhm> lennie: That's good. I haven't actually used glusterfs much. We originally were going to use it on an 8k core supercomputer but ended up switching to lustre.
[19:39] <gucki> nhm: even when running only one osd (e.g. on sdb) and having other services make heavy use of another disk (e.g. sda), the whole system gets slow. this is because sync syncs *all* disks, so sda and sdb... :(
[19:42] <nhm> gucki: yeah, I was running into that a lot last spring before ubuntu got a new enough glibc to do syncfs.
[19:44] <lennie> nhm / gucki: which kernel / glibc version got syncfs ? did it make it into Debian testing ?
[19:44] <gucki> nhm: i'm still wondering why use sync at all. i mean osds are considered machines that can die anytime, because of the nice ceph failover and replication. so why slow everything down with a sync?
[19:45] <gucki> lennie: i think 3.2. but also glibc has to support it, and this support is missing even in latest debian
[19:45] <lennie> gucki: that makes a person kinda sad
[19:45] <gucki> lennie: sorry: syncfs() first appeared in Linux 2.6.39.
[19:45] * loicd (~loic@31.36.8.41) has joined #ceph
[19:45] <gucki> lennie: so only glibc support is missing in debian
[19:46] <lennie> gucki: it would have been easier to upgrade the kernel ;-)
[19:46] <lennie> gucki: glibc probably not so much
[19:46] <gucki> lennie: yeah, but now i'm happy with ubuntu :) thanks to puppet the setup is quite fast :)
[19:47] <gucki> lennie: i mean for the host the distro doesn't matter that much in my case...all other services run in vms, still using squeeze...
[19:47] <nhm> yeah, it's 2.6.39 (or something) and glibc 2.14 I think.
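(A rough workaround sketch for the situation being discussed: if the kernel has syncfs(2) (2.6.39+) but the distro's glibc doesn't expose the wrapper, the raw syscall can be invoked directly. The syscall number below is for x86_64 and the OSD path is made up, so treat this as an assumption-laden illustration rather than a recommendation.)

    import ctypes
    import os

    libc = ctypes.CDLL('libc.so.6', use_errno=True)
    SYS_syncfs = 306   # x86_64 syscall number -- check asm/unistd.h on other arches

    def syncfs(path):
        """Flush only the filesystem containing `path`, unlike sync(2),
        which flushes every mounted filesystem on the box."""
        fd = os.open(path, os.O_RDONLY)
        try:
            if libc.syscall(SYS_syncfs, fd) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
        finally:
            os.close(fd)

    # Hypothetical OSD data directory:
    syncfs('/var/lib/ceph/osd/ceph-0')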
[19:48] <gucki> lennie: http://ceph.com/docs/master/install/os-recommendations/ :)
[19:48] * loicd (~loic@31.36.8.41) Quit ()
[19:49] <gucki> nhm: the link i just posted seems to be wrong. i had latest squeeze installed and it complained about syncfs missing ..
[19:49] <lennie> gucki: when running Linux on Linux, we usually just use lxc if live migration isn't needed (and customers don't need root)
[19:49] <gucki> lennie: yes i used to use openvz, but it's not flexible enough in our case
[19:50] <lennie> gucki: maybe the version in Debian is eglibc derived ?
[19:51] <gucki> lennie: mh no idea...
[19:52] <lennie> gucki: anyway life has gotten a lot easier for distributions: http://lwn.net/Articles/488847/
[19:54] <joao> mikeryan, around?
[19:56] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:56] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[19:56] * Leseb_ is now known as Leseb
[19:56] <lennie> gucki: on the issue of sync, I think the idea of ceph is: don't guess, but write to the journal. For example when all systems go down because of a power failure. You'll never have to deal with some kind of eventual consistency...?
[19:58] <gucki> lennie: i think this is only true for high end hardware with battery backed controllers etc.? otherwise sync doesn't really guarantee anything as far as i know?
[19:58] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[19:58] <lennie> nhm: I looked at your Twitter account, you are correct in saying you don't use it much ;-)
[19:59] <lennie> gucki: I don't think you are right, I believe battery backed controllers can only make sync faster. Sync itself is meant to mean: your data is safe
[20:00] <nhm> lennie: One of our community guys made me put a picture and a bio in. ;)
[20:00] <NaioN> lennie: assuming the disk doesn't lie...
[20:00] <lennie> nhm: Also Twitter suggested, Similar to Mark: Rupert Murdoch :-)
[20:00] <nhm> lennie: I once had another one for a website that I pretty much abandoned once I got to inktank and didn't have time to maintain anymore.
[20:01] <lennie> NaioN: yes, that is correct. But that is something for the kernel guys to worry about ? ;-)
[20:01] <gucki> lennie: ok, not sure. i thought i read somewhere that a lot of cheap hdds simply ack the sync while the data is only in their cache...but i could be wrong. what i really remember is that on some systems with syncing the throughput went from around 60mb/s down to 3mb/s :(
[20:01] <nhm> lennie: yikes!
[20:01] <nhm> gucki: I think newer disks don't lie as much as older ones.
[20:01] <NaioN> lennie: hmmm the kernel guys can't do anything about it
[20:02] <joao> nhm, you have a twitter account?
[20:02] <NaioN> nhm: hmmm don't know, i wouldn't believe the consumer grade disks too much
[20:02] <joao> what would your handle be then?
[20:02] <NaioN> lennie: if the disks lie the disks lie and you don't have any idea if the data has landed
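(At the application level, "sync means your data is safe" boils down to something like the sketch below: the program asks the kernel to flush, and that promise is only as strong as the drive's cache-flush honesty, which is the "lying disks" caveat above. The filename is made up.)

    import os

    fd = os.open('journal-test.bin', os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, b'some payload that must survive a power failure')
        # fsync asks the kernel (and, via a cache flush, the drive) to put
        # the data on stable storage before returning; a drive that acks
        # the flush while the data is still only in its volatile cache
        # silently breaks this guarantee.
        os.fsync(fd)
    finally:
        os.close(fd)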
[20:02] <lennie> gucki / NaioN: let's just say non-consumer stuff is OK ?
[20:02] <nhm> joao: MarkHPC
[20:02] <joao> I'm on a mission to follow everybody at inktank
[20:03] <joao> k thx
[20:03] <NaioN> lennie: yes non-consumer should be ok
[20:03] <nhm> NaioN: there was an article a while back that discussed this. Apparently they did some analysis to determine if it was happening.
[20:04] <nhm> NaioN: I think we questioned their methodology, but thought it was probably happening.
[20:04] <NaioN> nhm: do you have a link? I would be interested!
[20:04] <nhm> NaioN: let me see if I can find it in my email.
[20:06] <nhm> http://research.cs.wisc.edu/wind/Publications/cce-dsn11.ps
[20:07] <NaioN> lennie: one of the points of enterprise grade disks is that they implement those scsi/sata calls
[20:07] <nhm> pdf: http://research.cs.wisc.edu/adsl/Publications/cce-dsn11.pdf
[20:09] <lennie> NaioN: that is what i would expect them to do :-) And don't delay errors, no endless retries
[20:10] <NaioN> nhm: thx!
[20:10] <NaioN> lennie: well that's one of the differences between consumer and enterprise grade disks
[20:11] <lennie> NaioN: price is another ;-)
[20:13] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[20:13] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[20:14] <NaioN> lennie: indeed :)
[20:14] * loicd (~loic@31.36.8.41) has joined #ceph
[20:14] <NaioN> nhm: I scanned through the paper, but it isn't 100% guaranteed, if I'm correct
[20:16] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[20:17] <NaioN> it depends heavily on the way the cache is used...
[20:17] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[20:18] <lennie> NaioN: another problem right now is that the HDD market only has 2.5 manufacturers :-/
[20:18] * loicd (~loic@31.36.8.41) Quit ()
[20:20] <NaioN> hehe yeah that's true...
[20:21] <gucki> nhm: do you know when greg will be online tomorrow? i think i'll have to go pretty soon...
[20:22] <joao> greg should only be in the office by tuesday
[20:22] <joao> he's flying back tomorrow (CET) afternoon
[20:26] <mikeryan> tuesday PST GMT -0800
[20:26] <NaioN> they stayed in Amsterdam for the weekend?
[20:27] <NaioN> it was a very interesting workshop
[20:30] <lennie> NaioN: I agree, it was also nice because we got to meet some people, that is always fun :-)
[20:32] <NaioN> indeed
[20:32] <lennie> NaioN: I think Ed from Inktank said they or he had to go to Spain ?
[20:32] <lennie> so maybe they did another conference
[20:33] <joao> NaioN, Greg, Sage and I stayed around; others went to Spain
[20:34] <nhm> joao: did you make sure Greg had a good time. ;)
[20:34] <nhm> ?
[20:34] <NaioN> Ah yeah I see you're making a European tour...
[20:34] <joao> nhm, I'm sure you'll hear all about it in upcoming weeks :p
[20:34] <nhm> lol
[20:36] <gucki> joao: thanks for the info. is he the only one who knows how monitors work in detail and how to safely expand a 1 mon to a 3 mon cluster? :)
[20:37] <lennie> How is Ed's house, is everything back to normal again ?
[20:37] <joao> well, it's easy to expand it from 1 to 3 if you guarantee that the second monitor doesn't fail before you add the 3rd monitor
[20:37] <nhm> gucki: I'm sure Sage knows too. Some of the other devs probably know to various degrees. ;)
[20:37] <nhm> joao: A couple of people have tried the instructions on our website and ended up out of quorum.
[20:38] <joao> that's weird
[20:38] <nhm> joao: I logged a bug for it: http://tracker.newdream.net/issues/3438
[20:38] <gucki> joao: yeah as soon as i added the second monitor both said "out of quorum"
[20:38] <lennie> gucki: what release are you using ?
[20:38] <gucki> latest argonaut, so 0.48.2 i think
[20:38] <gucki> the one from the ubuntu quantal repos
[20:39] <gucki> i mean i can try to do it again...now i know how to recover :-)
[20:40] <joao> well, there might be a corner case; I will look into it just to make sure
[20:40] <lennie> gucki: you are a brave person :-)
[20:40] <joao> but Greg just told me that the main problem with expanding it from 1 to 3 is if you end up not being able (or you forget) to bring up the second monitor
[20:40] <gucki> lennie: well, i just don't want to run this cluster any longer with only one monitor...if that one dies... :-(
[20:41] <lennie> gucki: yeah, that is a good point :-)
[20:42] <gucki> joao: do i have to complete all the instructions for the 2nd and then the 3rd monitor? or add both new monitors to the config first and then restart both at the same time?
[20:42] <joao> erm
[20:42] <gucki> joao: i mean complete them first for the 2nd monitor, then for the 3rd monitor..
[20:43] <gucki> joao: if you have some time i can give it another try right now :)
[20:45] <joao> gucki, as far as the consensus in the room goes, you won't be able to add the 3rd monitor while the 2nd monitor is not in the quorum
[20:46] <joao> but I will check this out and try to reproduce whatever happened
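(For anyone following along later: the procedure under discussion is roughly the one sketched below, done one monitor at a time -- add and start mon.b, wait for it to join the quorum, then repeat for mon.c. This is a sketch from a reading of the argonaut-era docs; the paths, addresses and init-script syntax are assumptions, so follow the official documentation rather than this verbatim.)

    import subprocess

    def run(cmd):
        print('+ ' + ' '.join(cmd))
        subprocess.check_call(cmd)

    def add_monitor(mon_id, addr):
        """Assumed outline: fetch the monmap and mon. key, mkfs the new
        monitor, register it in the monmap, then start the daemon."""
        monmap, keyring = '/tmp/monmap', '/tmp/mon.keyring'
        run(['ceph', 'mon', 'getmap', '-o', monmap])
        run(['ceph', 'auth', 'get', 'mon.', '-o', keyring])
        run(['ceph-mon', '-i', mon_id, '--mkfs',
             '--monmap', monmap, '--keyring', keyring])
        run(['ceph', 'mon', 'add', mon_id, addr])
        run(['service', 'ceph', 'start', 'mon.' + mon_id])

    # Add the 2nd monitor, wait until 'ceph quorum_status' shows it in the
    # quorum, and only then add the 3rd (placeholder addresses):
    add_monitor('b', '192.168.0.2:6789')
    # ... wait for quorum ...
    add_monitor('c', '192.168.0.3:6789')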
[20:47] <lennie> gucki: how does someone end up with only one mon ?
[20:48] <joao> by creating a single-monitor cluster? ;)
[20:48] <lennie> in that case you didn't read enough/understand how ceph works (yet)
[20:49] <gucki> joao: let me just prepare the commands and pastie them....if you said that it should work this way, i'll execute them :)
[20:49] <gucki> lennie: well, i started with only 1 monitor and didn't expand yet..
[20:49] <gucki> lennie: *shamed*
[20:50] <nhm> lennie: now now, I do a lot of ceph testing with just 1 monitor. ;)
[20:50] <lennie> gucki: maybe it is a documentation problem :-)
[20:50] <lennie> nhm: testing != production
[20:51] <gucki> lennie: no, they clearly state that they recommend 3 monitors in prod.. :)
[20:51] <joao> I can't do any useful testing with just one monitor :(
[20:51] <nhm> lennie: I come from a world where the lines between testing and production are kind of blurred...
[20:53] <lennie> nhm: I can understand that, I just don't want to deploy something I don't feel I understand at least at a logical level :-)
[20:53] <lennie> joao: do you run the monitors on the same machine ?
[20:53] <joao> lennie, depends on what I'm trying to accomplish
[20:54] <joao> for most of what I do, it doesn't really matter
[20:54] <lennie> joao: my test environment runs the osds and mon on the same machine, my desktop, but in separate LXCs
[20:54] <lennie> joao: and no I don't use a kernel mount for cephfs on the same host ;-)
[20:56] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[21:00] <lennie> I wonder if the same problem exists when you go from 3 to 5 mons, I hope that is something people try to do more often than 1 to 3
[21:00] <lennie> although I can't imagine many people need more than 3
[21:10] * loicd (~loic@31.36.8.41) has joined #ceph
[21:12] * rlr219 (62dc9973@ircip1.mibbit.com) has joined #ceph
[21:12] <rlr219> mikeryan: are you on?
[21:16] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[21:18] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:19] <rlr219> anyone around that can help with a crush map error?
[21:19] <Robe> gucki: what's syncfs?
[21:21] <Robe> oh dear
[21:21] <Robe> you kidding me?
[21:21] <Robe> you're calling just sync for each osd? :>
[21:26] * danieagle (~Daniel@186.214.79.76) has joined #ceph
[21:30] <jefferai> jtang: hah, in my case I had to fund raise for 9 months to get the money to build the cluster...but I prioritized machines over space, for now, so we can add a whole lot more hard drives to the existing machines to grow
[21:31] * sagelap (~sage@62-50-218-8.client.stsn.net) Quit (Ping timeout: 480 seconds)
[21:33] <jtang> heh
[21:35] * joao (~JL@62.50.239.160) Quit (Ping timeout: 480 seconds)
[21:36] <lennie> Robe: I was thinking about the DreamHost OpenStack, maybe it is just me: but I don't think many people have had the luxury of running Linux on their switches...? Still just a few vendors, right ?
[21:40] * joao (~JL@62.50.239.160) has joined #ceph
[21:41] <Robe> lennie: haven't watched it yet
[21:41] <Robe> more and more these days
[21:42] <Robe> lennie: have a look at https://twitter.com/terrorobe/status/264274196153131008
[21:42] * Robe checking out for the day, have a good one
[21:43] <lennie> Because I've been thinking, I would rather have dumb(er) switches for my storage network and support for multipath for the OSDs (if it is easy to support)
[21:44] <jtang> jefferai: im still planning and haven't spent any of my budget :)
[21:46] * sagelap (~sage@62-50-199-254.client.stsn.net) has joined #ceph
[21:46] <lennie> Robe: SDN and Linux on the switch aren't the same thing though
[21:55] <masterpe> I was in Amsterdam last Friday for https://www.42on.com/events/, but I was curious: will the presentations come online?
[21:57] <lennie> masterpe: do you mean the slides or video ? because I was there and I only noticed a video camera for the discussion panel (probably the least interesting part of the conference ? because it seemed they hardly talked about storage)
[21:58] <lennie> masterpe: the most interesting was the talk by Sage (of course ?) if you ask me, and the same presentation is also available as video from another conference, for example this one: http://vimeo.com/50620695
[21:59] <masterpe> No the panel was not interesting, that was just "sales bullshit", but I mean the slides
[21:59] <lennie> masterpe: here are the slides from the video I mentioned: http://www.snia.org/sites/default/files2/SDC2012/presentations/File_Systems/SageWeil_Scaling_Storage_Cloud_Ceph-2.pdf
[22:00] <lennie> masterpe: as I don't work for 42on or Inktank that is all I know
[22:00] <gucki> joao: so finally, here's a step by step how i'd expand my mon cluster http://pastie.org/5183980
[22:01] <gucki> do you think it should work this way, or did i do any mistakes? :)
[22:04] <lennie> gucki: why do you run 'ceph-mon -i b' and 'service ceph start mon b'? isn't that the same thing ?
[22:05] <gucki> lennie: ah sure, just remove one. sorry
[22:06] <lennie> gucki: same for c of course
[22:06] <dweazle> mm it sucks syncfs() is not backported to the glibc in rhel/centos/sl 6.3, because it does seem the syscall has been backported to the kernel
[22:06] <gucki> lennie: in any case it shouldn't hurt, as the lock file should prevent duplicate execution
[22:06] <gucki> lennie: yes, thanks..i'll remove :)
[22:06] <lennie> gucki: I wouldn't be surprised if they have a lockfile yes :-)
[22:07] <gucki> dweazle: i had the same issue with debian and switched to ubuntu quantal..
[22:07] <gucki> lennie: ok, but other than that it looks ok to you? :)
[22:07] * rlr219 (62dc9973@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[22:07] <dweazle> gucki: i think backporting the patch to glibc 2.12 is less work than switching to a whole new distro for me
[22:08] <gucki> dweazle: sure it depends. for me it was quite easy, as i use puppet for deployment, and all main services run in vms anyway...
[22:08] <dweazle> http://repo.or.cz/w/glibc.git/commit/81a5726bd231f21a3621007bde58eac9a0f82885 that's all there is
[22:09] <dweazle> gucki: i use puppet as well, but that doesn't mean switching to a new distro is easily done :)
[22:10] <lennie> dweazle: I hope it's easy to backport, in that case maybe we would do it too
[22:10] * noob2 (47f46f24@ircip2.mibbit.com) has joined #ceph
[22:11] <dweazle> lennie: it looks trivial to backport
[22:11] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:11] <dweazle> but i'm in no rush to deploy ceph, will need to do some hardware planning first, figure out our business case and all that
[22:11] <noob2> as far as rhel6 goes do i have to install the ceph package to gain rbd capabilities or are they built in already?
[22:12] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:12] <dweazle> noob2: i think you might need to patch the qemu-kvm package and link it with librbd as well, haven't checked that part yet
[22:12] <lennie> gucki: judging by my mons, you at least have the right keyring :-)
[22:12] <gucki> lennie: hehe
[22:13] <noob2> dweazle: gotcha
[22:13] <gucki> lennie: basically it's all just copied from the wiki and real values filled in...
[22:13] <dweazle> it would also be great if virtio-scsi was ready, because it supports trim, which would come in handy with sparse rbd images
[22:13] <dweazle> perhaps i'll wait for another 6 months after all
[22:14] <noob2> dweazle: out of curiosity has anyone set up ceph on ssds yet?
[22:14] <dweazle> noob2: you mean other than the journal?
[22:14] <noob2> yeah
[22:15] <dweazle> dunno..
[22:15] <noob2> ok
[22:15] <dweazle> but i'm sure it works just fine
[22:15] <noob2> yeah i'm sure also
[22:15] <dweazle> just keep in mind that MLC SSD's wear out quickly
[22:16] <noob2> yeah i know
[22:16] <dweazle> even for the journal i would consider SLC, or perhaps just a couple of 15K SAS disks in RAID-1
[22:16] <lennie> dweazle: someone in Amsterdam mentioned: create a small partition and don't use the rest of the SSD, they'll last longer
[22:16] <noob2> that'd be the downside
[22:16] <lennie> it's a workaround ofcourse :-)
[22:17] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[22:17] <dweazle> lennie: that helps, but even then with a high workload you might have to replace it every year (depends on the amount of writes of course)
[22:17] <noob2> does ceph generally scale in performance with the nodes that you add like gluster does? I haven't seen anyone mention it speeding up as you add more nodes
[22:18] <dweazle> which is extra crappy if all of your journals fail at about the same time (and the osds with them:)
[22:18] <nhm> that's what I'm doing on our test cluster. 180GB SSDs each with 3 10G journal partitions.
[22:18] <noob2> sweet
[22:18] <noob2> nhm: have you built it yet?
[22:19] <nhm> noob2: built and rebuilt, I work for inktank. :)
[22:19] <dweazle> noob2: if you're writing to a single block you'll always be limited to the performance of a single osd
[22:19] <noob2> true
[22:19] <noob2> nhm: that's great! i keep thinking of putting in a resume with inktank for the hell of it
[22:19] <lennie> gucki: it looks fine to me, I couldn't fault it. Just the part about adding the monitors, whether it is possible to add 2 and then start them later
[22:20] <dweazle> nhm: were you at the workshop last friday?
[22:20] <nhm> dweazle: naw, only Sage and Greg were there from the engineering side.
[22:20] <dweazle> ack
[22:20] <dweazle> nhm: it was very informative
[22:21] <nhm> dweazle: so I've heard!
[22:21] <dweazle> nhm: and of course very awesome
[22:21] <nhm> that's great
[22:21] <noob2> what did you end up seeing at the workshop?
[22:21] <gucki> lennie: yeah, i'm not sure about it. probably it's better to wait till tomorrow, have to get some sleep now anyway *g*. did joao leave?
[22:21] <dweazle> well.. there was a free lunch
[22:22] <dweazle> and beer afterwards
[22:22] <noob2> hehe
[22:22] <nhm> noob2: go for it!
[22:22] <dweazle> well it was a general introduction to ceph with a bit of technical background, roadmap, current status, that kind of thing
[22:22] <noob2> nhm: only thing stopping me is i'm on the east coast
[22:23] <nhm> noob2: I'm in Minneapolis
[22:23] <noob2> really
[22:23] <noob2> tell me more :)
[22:23] <nhm> noob2: Yeah, and joao lives in Portugal.
[22:23] <noob2> so inktank has no issues with hiring remote hands
[22:24] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[22:26] * lofejndif (~lsqavnbok@82VAAHOKV.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:26] <nhm> noob2: Nope. I'm thankful for that too since there's pretty slim pickings around here for interesting open source companies to work for. :)
[22:26] <noob2> neat
[22:26] <noob2> well that settles that. i'll throw an app in and see what they say
[22:27] <nhm> noob2: yeah, doesn't hurt to apply!
[22:29] <lennie> noob2: just out of curiosity, what kind of work do you want to do at Inktank
[22:30] <noob2> the systems engineer job appealed to me
[22:30] <noob2> i've been working at my current place building out gluster storage
[22:30] <noob2> i suggested it a while ago and finally got some traction
[22:31] <noob2> the other side of the equation though is ceph for block based storage. i'm petitioning management to build out a ceph cluster. i have a vm demonstration to show them
[22:33] <jtang> how is the professional services team coming along?
[22:34] <jtang> any demand for PS people who have a background in gpfs/lustre ?
[22:34] <noob2> nhm: app submitted
[22:35] <noob2> i'm def inspired by ceph. it's going to be huge
[22:35] <nhm> noob2: cool. Good luck! (I'm not involved in hiring at all)
[22:35] <noob2> nhm: i figured as much.
[22:36] <nhm> jtang: !!!
[22:36] <nhm> jtang: please apply
[22:36] <noob2> so for my rhel6.x clients would i be best served with the fuse client to rbd ?
[22:36] <jtang> nhm: i plan on chatting to some inktank peeps at SC12 first before i decide
[22:36] <gucki> ok guys, have to leave now. have a nice day, see you!
[22:37] <nhm> jtang: I'll be there. :)
[22:37] <nhm> gucki: have a good one!
[22:37] <jtang> heh cool i guess we can have a chat then
[22:37] <jtang> just dont mention it to my boss that im interested ;)
[22:37] <jtang> he might be happy
[22:37] <noob2> lol
[22:37] <nhm> jtang: lol
[22:37] <jtang> might not
[22:38] <jtang> i plan on chatting to a few storage vendors for the project im working anyway
[22:38] <noob2> yeah i'm sure my manager would strangle me
[22:39] * gucki (~smuxi@46-127-158-51.dynamic.hispeed.ch) Quit (Remote host closed the connection)
[22:39] <jtang> my boss's boss would be pretty annoyed if i left :P
[22:39] <noob2> haha
[22:39] <noob2> yeah
[22:40] <jtang> i kinda want to finish what im doing before i leave
[22:40] <noob2> gotta hang on to your best people
[22:40] <jtang> but then again if there is something attractive out there i might just leave
[22:40] <nhm> jtang: before I went to inktank my old boss was telling me there was no way he could give me a raise, then wanted to do a salary match as soon as I told them. :P
[22:40] <jtang> i've had a hard time leaving my current work place :P
[22:40] <lennie> nhm: I've got the strace for the kernel crash (not sure how useful it is): http://pastebin.com/yRK9J7LM
[22:40] <noob2> nhm: they're always doing stupid stuff like that
[22:40] * jtang works for a ~400 year old university
[22:40] <jtang> the campus is cool
[22:41] <nhm> jtang: yeah, I used to work for the University of Minnesota
[22:41] <noob2> jtang: nice. i used to work for a university also. it was hard to leave
[22:41] <lennie> jtang: I've been saying I won't leave before I've completed certain projects at work too. I've been there for over 10 years :-)
[22:41] <noob2> lol
[22:41] <noob2> yeah you get sucked in
[22:42] <jtang> im beginning to feel that i will never leave
[22:42] <nhm> I was at the Supercomputing Institute for 6 years. Realized far too late how much they were underpaying me.
[22:42] <jtang> i even changed jobs internally to make sure i like the place
[22:42] <jtang> i know full well how much more i can get outside in industry
[22:42] <noob2> nhm: i applied to a supercomputing group at university of pennsylvania here and was shocked at their low offer
[22:43] <nhm> jtang: it wasn't just low for industry, it was even low in academia.
[22:43] <jtang> universities dont pay well
[22:43] <noob2> no they don't
[22:43] <noob2> i got a massive raise when i left
[22:43] <nhm> jtang: All of the salaries there are public, so I could see the averages.
[22:43] <jtang> hpc is more pervasive these days, i guess 6yrs of hpc devops would mean a good pay raise for me in industry
[22:43] <jtang> or is that 7yrs
[22:43] <jtang> gah i forget
[22:44] <noob2> jtang: def would be worthwhile to a lot of companies
[22:44] <lennie> nhm: *wave* *wave* did you have a look at the pastebin ? Is there any more information I could provide :-)
[22:44] <nhm> lennie: looking at it now. :)
[22:44] <jtang> still we have a bg/p machine which i can play with on demand
[22:44] <jtang> its kinda hard to leave :)
[22:44] <noob2> i hear ya :)
[22:45] * jtang also expresses some hate for sles10
[22:45] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[22:45] <jtang> and the gigantic mess of a system that ibm has built up to run it (bg/p)
[22:47] <jtang> sadly the public sector in ireland sucks right now
[22:47] <jtang> i've gotten a 10% pay cut since the austerity measures
[22:47] <noob2> ouch
[22:47] <jtang> plus a pension levy
[22:48] <jtang> my pay hasnt grown with the economy
[22:48] * ctrl (~Nrg3tik@78.25.73.250) Quit (Read error: Connection reset by peer)
[22:48] <nhm> jtang: that was the story for me in academia too.
[22:48] * ctrl (~Nrg3tik@78.25.73.250) has joined #ceph
[22:48] <noob2> nhm: yeah my raises were awful in academia. i loved the environment but hated the pay
[22:48] <nhm> jtang: 4 years of no raise, and furloughs (basically they didn't pay for certain days, but I was still on-call anyway).
[22:49] <jtang> heh
[22:49] <jtang> thats bad
[22:49] <jtang> right, i might poke at ansible and ceph during the week before i head off to Salt Lake
[22:49] <nhm> It kind of screws you over for long term raise opportunities too.
[22:50] <jtang> nhm: my partial strategy was to apply for a higher position than my current one ;)
[22:50] <jtang> which ended up with me doing a different job
[22:50] <jtang> now i have minions!
[22:50] <lennie> jtang: ansible ? ohh, that's cool I've been wanting to try it for a long time now
[22:50] <nhm> jtang: I did that successfully once, but the University wised up and forbade hiring existing staff into more senior positions in the same organization.
[22:50] <jtang> lennie: its *cool*
[22:51] <jtang> lennie: it took me a few hours to learn it, it doesnt look hard to use
[22:51] <jtang> its far more pleasant than puppet or chef
[22:51] <jtang> i've been using puppet on and off for years, and so far ansible just blows puppet away
[22:52] <lennie> jtang: I heard greg say recently: I learned chef in 2 weeks and juju in a day. So Ansible in a few hours sounds good to me ;-)
[22:52] <noob2> lol
[22:52] <jtang> yea chef doesnt look too bad, i never really got into that
[22:53] <noob2> my coworker really likes puppet. i don't have any experience with chef yet
[22:53] <jtang> we took the strategy of packaging things up into deb/rpm then rolled changes out automatically with nightly updates or with cfengine/puppet kicking off updates
[22:53] <jtang> puppet gets messy with badly written modules
[22:53] <noob2> yeah i'm sure it does
[22:53] <jtang> or modules that aren't, well, modular
[22:53] <jefferai> jtang: interesting, hadn't heard of ansible -- I didn't like Chef and have been using Salt and Salt is great
[22:53] <lennie> I really don't feel like knifing a chef or being a puppetmaster ;-)
[22:53] <noob2> we haven't made extensive use of it yet but some people just went to training. So i'm sure more use is coming soon
[22:54] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[22:54] <noob2> Salt is the python-based system right? A friend at princeton university really likes that
[22:54] <jtang> jefferai: ansible can be compared to salt apparently
[22:54] <jefferai> noob2: yep
[22:54] <noob2> yeah he was raving about that
[22:54] <jefferai> you can do most things pretty simply (although the docs still need some work)
[22:54] <noob2> i haven't tried it yet
[22:54] <jefferai> but you can also do complex things by writing pure python if you want, which gets very powerful
[22:54] <noob2> awesome :D
[22:54] <jefferai> and I like systems that aren't DB backed
[22:54] <jefferai> so I can store everything relevant on git
[22:55] <noob2> jefferai: i'll have to check that out
[22:55] <jtang> jefferai: you'll like ansible
[22:55] <jtang> it keeps the coding to a minimum
[22:55] <jefferai> Salt does too :-)
[22:55] <jtang> and plugins can be written in anything as long as you output json
[22:55] <jtang> which i thought was pretty neat
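(To illustrate the "anything that outputs json" point: a plugin/module in this style is basically just a program that prints a JSON document on stdout, something like the toy sketch below. The argument handling and the exact keys a given tool expects are glossed over here and should be treated as assumptions.)

    #!/usr/bin/env python
    # Toy "speaks JSON on stdout" module: the orchestration tool runs it,
    # reads the JSON result from stdout, and decides what to do next.
    import json
    import sys

    def main():
        result = {
            'changed': False,
            'msg': 'hello from a JSON-speaking module',
        }
        json.dump(result, sys.stdout)
        sys.stdout.write('\n')

    if __name__ == '__main__':
        main()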
[22:55] <noob2> nice
[22:55] <noob2> i really like json over xml
[22:56] <lennie> 2 thumbs up on the XML :-)
[22:56] <jtang> anyway, the plan is im gonna try and shove a ceph setup into my ansible setup
[22:56] <lennie> or more like 2 thumbs up for json and 2 down for XML
[22:56] <noob2> lol
[22:56] <jtang> cause i plan on using ceph in a years time
[22:56] * lofejndif (~lsqavnbok@82VAAHOKV.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[22:56] <noob2> exactly
[22:56] <jtang> so i may as well have it in my dev/test environment
[22:56] <noob2> jtang: i think that's prob my timeframe also. maybe 6 months
[22:56] <jefferai> oh, ansible and salt are very similar
[22:57] <jefferai> both use yaml
[22:57] <jefferai> both have similar ways to define machines
[22:57] <jtang> noob2: im planning on a demoable system in 6 months' time for management
[22:57] <Robe> lennie: it boils down to having more control of the forwarding plane
[22:57] <jtang> and a pilot system in 12 months
[22:57] <Robe> lennie: couldn't care less which OS runs on a switch
[22:57] <noob2> jtang: on bare metal or vm's?
[22:57] <Robe> next to having a usable tcpdump
[22:57] <jefferai> jtang: yeah, I'm going to try to shove ceph into my salt setup
[22:57] <jefferai> might be room for collaboration there :-)
[22:57] <Robe> but that's not limited to linux
[22:57] <jtang> noob2: the storage will be baremetal
[22:57] <nhm> ansible sounds interesting. I used to do puppet, mostly do more ad hoc stuff now with pdsh.
[22:58] <nhm> lennie: I've been trying to research your bug without much luck.
[22:58] <jtang> jefferai: indeed ;)
[22:58] <noob2> jtang: nice :)
[22:58] <jefferai> jtang: the main difference between salt and ansible seems to be that ansible is more focused on running commands and salt is more focused on describing system state
[22:58] <noob2> jefferai: salt looks to be what i'm evolving towards. I make extensive use of paramiko in python
[22:58] <jtang> i'm expecting about 50tb of data from an imaging project
[22:58] <jefferai> I guess you can do both with both
[22:59] <jtang> jefferai: i havent had a look at salt before
[22:59] <jefferai> noob2: you won't be disappointed
[22:59] <lennie> nhm: well, let's start with recording the problem in the existing or a new bugreport...?
[22:59] <jefferai> :-)
[22:59] <jefferai> jtang: worth your time, although probably not worth switching your setup again, unless you have issues with ansible
[22:59] <noob2> nice :D
[22:59] <jefferai> but worth knowing what's out there
[22:59] <jtang> i might take a look, in a few weeks time
[22:59] <jtang> yea i just looked at ansible today for the fun of it
[23:00] <jtang> it took me 2-3 weeks of messing around with puppet modules to get what i wanted
[23:00] <jtang> it took me an afternoon with ansible
[23:00] <jtang> or two
[23:00] <nhm> lennie: Yeah, please submit something to the tracker. I don't know if it's strictly a ceph issue but it's worth having it recorded.
[23:00] <jefferai> jtang: yeah that's basically my experience with salt
[23:00] <noob2> jtang: i love it when that kinda stuff happens. that's what happened when i was trying to build an openstack cloud and then switched to ovirt. built it in a day
[23:00] <jefferai> it took me a little longer because I wrote a python state just for the fun of it
[23:01] <jefferai> and it took a bit to figure out the syntax you have to return
[23:01] <jefferai> noob2: ah, how's ovirt? I've been thinking I'd go down Ganeti's route
[23:01] <jefferai> ovirt didn't seem to have ceph support
[23:01] <noob2> jefferai: i really like it. everything was easy to setup. openstack was a nightmare compared to ovirt
[23:01] <noob2> it doesn't have ceph support but they're looking to build it in shortly
[23:02] <jtang> noob2: isn't there crowbar for openstack?
[23:02] <jtang> its supposedly quite good
[23:02] <jtang> its all chef though
[23:02] <noob2> jtang: yeah there's a bunch of ways to build it
[23:02] <noob2> stackops, etc
[23:02] <noob2> i had trouble getting openvswitch to work with our networking vlan setup
[23:02] <noob2> i just couldn't get it to play nice so i gave up
[23:02] <jtang> i wish i had more time to play with that kinda stuff
[23:02] <jefferai> noob2: how's ovirt's ceph support?
[23:02] * jtang has no time for it in work
[23:03] <noob2> jefferai: there's no direct support yet that i know of. There is however posix filesystem support. so if you mounted a ceph volume you could use that as your storage
[23:03] <jefferai> I see
[23:03] <jtang> gone are the days when i got to play with networking and storage for days on end
[23:03] <jefferai> so you can manually mount a RBD block device, you mean?
[23:03] <noob2> right
[23:03] <noob2> it's a little roundabout but i think it would be fine
[23:03] <jefferai> I wonder how integrated Ganeti's support is
[23:04] <jefferai> it's definitely a supported backend, but...
[23:04] <noob2> it also allows fibre channel, iscsi, nfs, etc
[23:04] <jefferai> noob2: does migration work?
[23:04] <jefferai> yeah, I know I could export rbd via iscsi
[23:04] <jefferai> or nfs
[23:04] <noob2> jefferai: yeah works great
[23:04] <jefferai> ok, so ceph storage with migration on ovirt works well?
[23:04] <noob2> yeah you'd just need a proxy node right?
[23:04] <jefferai> Um, not sure
[23:04] <noob2> jefferai: oh, i haven't tried migration with ceph yet. just migration in general works fine
[23:04] <jefferai> I know it's theoretically possible, haven't looked at details
[23:04] <jefferai> ah, ok
[23:05] <noob2> i don't see why it wouldn't work
[23:05] <noob2> if you set up the proxy server to export over iscsi and had both ovirt nodes mount that storage i think it could migrate
[23:05] <noob2> or if it didn't, copy the data in between
[23:05] <jefferai> sure
[23:05] <noob2> in the next version of ovirt, they're talking about live storage migration
[23:05] <jefferai> noob2: do you use vde?
[23:06] <noob2> what's vde?
[23:06] <jefferai> seems like the tap-per-vm wouldn't play well with migration
[23:06] <lennie> nhm: append it to the same bugreport ?: http://tracker.newdream.net/issues/2445
[23:06] <jefferai> virtual distributed ethernet
[23:06] <noob2> oh
[23:06] <noob2> no i haven't messed with that
[23:06] <jefferai> because normally you create a bridge, and a tap device per VM, right?
[23:06] <jefferai> so if you want to migrate...
[23:06] * jefferai is a bit new to enterprise level qemu
[23:06] <noob2> correct
[23:07] <jefferai> so ovirt and the like take care of setting those tap devices up?
[23:07] <noob2> i bridge two ports with 802.3ad and then set up vlans on top of it
[23:07] <noob2> yeah ovirt handles that by itself
[23:07] <jefferai> hm
[23:07] <noob2> you setup the management interface and the vm interface and it'll take care of everything else
[23:07] <jefferai> hm
[23:08] <noob2> it built the bonds, bridges and vlan abilities for me
[23:08] <lennie> nhm: probably not as it doesn't seem to be the exact same issue
[23:08] <jefferai> I have a raw network device, was going to put a vlan on top of that, put a bridge on the vlan
[23:08] <noob2> it was nice compared to openstack hacking
[23:08] <jefferai> then have the tap devices created for each vm
[23:08] <noob2> yeah it'll do that
[23:08] <jefferai> cool
[23:08] <noob2> i have the same setup basically
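(For reference, what that raw-device -> vlan -> bridge -> tap-per-VM setup amounts to by hand -- ovirt automates all of this -- is roughly the iproute2 sequence below, wrapped in Python. The interface names, VLAN id and tap name are made up, and the commands need root.)

    import subprocess

    def ip(*args):
        subprocess.check_call(['ip'] + list(args))

    # Assumed names: physical NIC eth0, VLAN 100, bridge br100, one tap per VM.
    ip('link', 'add', 'link', 'eth0', 'name', 'eth0.100', 'type', 'vlan', 'id', '100')
    ip('link', 'add', 'name', 'br100', 'type', 'bridge')
    ip('link', 'set', 'dev', 'eth0.100', 'master', 'br100')

    # One tap device per VM, enslaved to the bridge; qemu attaches to the tap.
    ip('tuntap', 'add', 'dev', 'tap-vm1', 'mode', 'tap')
    ip('link', 'set', 'dev', 'tap-vm1', 'master', 'br100')

    for dev in ('eth0.100', 'br100', 'tap-vm1'):
        ip('link', 'set', 'dev', dev, 'up')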
[23:08] <jefferai> hm
[23:08] <jefferai> ok
[23:08] <jefferai> I'll have to try both out I guess
[23:08] <noob2> there's a checkbox for vlans and you give it the id
[23:08] <noob2> super easy haha
[23:08] <jefferai> I could run ovirt on two nodes and ganeti on two nodes and have a bake-off
[23:09] <noob2> exactly
[23:09] <jefferai> thanks for the push in that direction :-)
[23:09] <noob2> haha no prob
[23:09] <noob2> i really wanted openstack to work badly
[23:09] <jefferai> might come down to whether or not live migration works with the ceph cluster
[23:09] <nhm> lennie: I'd just make a new one and let them decide if they think it's a dupe.
[23:09] <noob2> i just couldn't bend it to my will :D
[23:09] <jefferai> ah, I'd heard openstack is too complicated if you're not providing tenancy
[23:09] <jefferai> if you just need to run a bunch of infrastructure
[23:09] <noob2> it's brutal
[23:09] <noob2> yeah
[23:10] <noob2> that's what i found also
[23:10] <jmlowe> I've been working on an install for a week
[23:10] <noob2> ovirt didn't have everything and the kitchen sink but i didn't need it either
[23:10] <lennie> nhm: when I create an account on tracker.newdream do I have to wait for an email link or something like that ?
[23:10] <jtang> hmm salt looks interesting
[23:10] <noob2> so to get a 'cloud'-like setup i installed foreman on a vm running on ovirt
[23:10] <jmlowe> the docs could use some work and there are a ton of moving parts
[23:10] <jtang> i think ansible will be fine for me to play with
[23:10] <noob2> jmlowe: yeah that's the problem. there's about a thousand moving parts
[23:11] <jtang> openstack doesnt look too bad
[23:11] <jefferai> jtang: yeah, I think both look like good answers to puppet and chef
[23:11] * sagelap (~sage@62-50-199-254.client.stsn.net) Quit (Ping timeout: 480 seconds)
[23:11] <jefferai> even though chef was supposed to be the answer to puppet and cfengine :-)
[23:11] <jmlowe> also if you are using ceph 0.53 drop the '=' in the caps, there is a parsing bug that isn't fixed until 0.54
[23:11] <jmlowe> that cost me a couple of days
[23:11] <jtang> as bad as cfengine was, at least it worked well
[23:12] <jtang> we installed hundreds of machines with it :)
[23:12] <jefferai> jtang: I don't think openstack is bad, you just end up having to write a lot of custom bits for your setup in many cases
[23:12] <noob2> right
[23:12] <noob2> that's what i ran into
[23:12] <noob2> they give you lots of freedom
[23:12] <jtang> actually FAI + cfengine was pretty good
[23:12] <jefferai> oh, FAI was a nightmare for me
[23:12] <jefferai> I gave up and am just using preseeding
[23:12] <jtang> i forgot about FAI, that sort of disappeared when we moved to using RHEL
[23:12] <noob2> jefferai: are there any screenshots out there of the salt interface? i'm curious what it looks like
[23:12] <jefferai> which works well, although I haven't tried out a raid install yet
[23:13] <jefferai> noob2: interface?
[23:13] <nhm> I actually had pretty good luck with puppet, but I didn't do anything really fancy.
[23:13] <noob2> jefferai: yeah they say there's some kinda web interface that goes with it
[23:13] <jefferai> oh
[23:13] <jtang> jefferai: FAI was the only game in town 6-7yrs ago for installing lots of debian machines
[23:13] <jefferai> never seen it
[23:13] <jtang> :P
[23:13] <noob2> lol
[23:13] <jefferai> only ever seen the command line
[23:13] <jefferai> but mostly I just run the highstate
[23:13] <nhm> We used HP's crazy CMU software on our big cluster.
[23:13] <noob2> yeah i doubt i'll be using it
[23:14] <jtang> ibm's stack was horrid
[23:14] <jefferai> I kind of like that salt, unlike chef, doesn't run periodically...it runs only when you tell it to
[23:14] <jefferai> although it keeps info about the machines up to date
[23:14] <jtang> xcat + director or something like that
[23:14] <noob2> that's cool
[23:14] <jtang> we just ignored it and did our own thing
[23:14] <noob2> puppet has that problem also
[23:14] <jefferai> jtang: maybe, but these days, it has two backends -- initramfs-live and dracut -- and neither works
[23:14] <noob2> it wants to keep you in compliance
[23:14] <jefferai> I tried for two days
[23:15] <jefferai> it would create a live environment that failed to have writable tmp for instance
[23:15] <jefferai> all sorts of issues
[23:15] <jtang> heh
[23:15] <jtang> have you seen warewulf for provisioning?
[23:15] <jefferai> also when I put boot=live in the kernel command line, which is necessary (and apparently would have fixed the writable problem), it would cause networking to fail, so it couldn't actually do DHCP and mount the NFS mount to get at the live system
[23:15] <noob2> haven't seen that. i've been using foreman
[23:15] <jefferai> and dracut backend just failed totally
[23:16] <jtang> jefferai: i used to maintain my own kernels for fai
[23:16] <jefferai> jtang: yuck
[23:16] <jtang> well kernel packages
[23:16] <jtang> we used to use debian-amd64 before it got made mainline
[23:17] <jefferai> Haven't used warewulf, but my needs right now are minimal -- my setup is using all debian right now, so no real reason why preseeding doesn't work for me -- especially since when the machine reboots, salt-minion has been installed, and the rest happens in salt
[23:17] <jefferai> which is higher level so less distro-specific things
[23:17] <jtang> so we had to keep our own set of amd64 kernels that had enough bits in it to get it to pxe boot with the tg3 driver which was flaky at the time
[23:17] <jefferai> jtang: yuck
[23:17] <jefferai> this was e1000 driver on vmware -- very standard, no reason it shouldn't work
[23:17] <noob2> nice chatting with you guys :) afk
[23:18] <jtang> actually this brings back lots of memories of FAI and debian
[23:18] * Enigmagic (enigmo@c-50-148-128-194.hsd1.ca.comcast.net) Quit (Quit: bbl)
[23:18] <jtang> and how about 5% of the machines would always somehow end up being different
[23:18] * Cube (~Cube@184-231-35-186.pools.spcsdns.net) has joined #ceph
[23:19] <jefferai> hah
[23:19] <jtang> or the frankenstein of a machine we built (debian + voltaire ib drivers + sles9 kernel)
[23:19] <jefferai> yeah, so preseeding + salt seems like a nicer solution to me
[23:19] <jtang> yea preseed/kickstart is the way to go, then choose whatever automation layer you like
[23:40] <lennie> nhm: you still there ?
[23:44] <lennie> nhm: I did get the activation email now, I'll try creating the bugreport
[23:55] <lennie> nhm: http://tracker.newdream.net/issues/3439
[23:55] <lennie> sage: the kernel crash bug report is here: http://tracker.newdream.net/issues/3439
[23:56] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.