#ceph IRC Log

IRC Log for 2011-02-03

Timestamps are in GMT/BST.

[0:03] * sagelap (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:03] * sagelap (~sage@ip-66-33-206-8.dreamhost.com) has left #ceph
[0:05] <gregaf> DJLee: depends on your hardware and stuff
[0:06] <gregaf> one thing is that writes are all going to complete at streaming speed because of the journal, but you might need to manage random accesses and stuff when reading
[0:07] <gregaf> also, if you're seeing uneven scaling with machines you should check out how each of your disks perform
[0:08] <DJLee> hmm,
[0:08] <DJLee> given the simplest and minimalist setup, i don't understand
[0:08] <gregaf> with some users that have had scaling issues, it turned out that some of their disks perform anywhere from 10%-50% differently, and that has a big impact when replicating since the speed of the slowest OSD holding the PG dominates it
[0:08] <gregaf> DJLee: yeah, I don't really know why reads would be slower than writes, it was just one possibility
[0:09] <gregaf> how are you measuring it?
[0:09] <DJLee> very simple, and minimalist, dd sequential write 40GB, read it back, do it again 4 times
[0:09] <DJLee> for each 1 osd, 3 osd and 6 osd,
[0:09] <DJLee> and then, again with 2x replication
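The test DJLee describes amounts to something like the following sketch; the mount path, block size, and the cache drop between passes are illustrative assumptions, not his exact commands:

    # sketch of the benchmark described above: 40 GB sequential write,
    # then read it back, repeated 4 times per OSD configuration
    for pass in 1 2 3 4; do
        dd if=/dev/zero of=/mnt/ceph/ddtest bs=1M count=40960 conv=fdatasync
        sync && echo 3 > /proc/sys/vm/drop_caches   # keep the read pass from hitting the page cache
        dd if=/mnt/ceph/ddtest of=/dev/null bs=1M
    done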
[0:10] <Tv|work> ceph-booter:/images is full again
[0:10] <Tv|work> as i predicted would happen
[0:10] <gregaf> DJLee: and you get faster writes than reads?
[0:10] <gregaf> probably our pre-fetching just isn't working very effectively, then
[0:11] <gregaf> there's a variety of tunables to make that better but we haven't spent any time on it as we work on features and stability
[0:11] <DJLee> yes, faster write, for all 3 osd, and 6 osd , (1x), but for 1 osd, write is much slower
[0:11] <jantje> i'd love to see you guys testing random read/write performance :-)
[0:11] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:11] <jantje> (if it makes any sense ...)
[0:11] <DJLee> jantje, i have entire results of those random read/write.
[0:12] <Tv|work> jantje: working towards it..
[0:12] <DJLee> however, before i could continue, i went back and am now just trying to understand sequential write first;;
[0:15] <jantje> Tv|work: i'm not an expert, but iozone seems to have nice tests, maybe compare it to a benchmark on a local fs, and see if it scales or not
[0:16] <Tv|work> jantje: yeah more about setting up the framework that does that automatically
[0:16] <Tv|work> alright who wants to keep me from removing /images/ceph-peon64/var/log/ceph
[0:16] <Tv|work> there's huge log files in there from around 12:44
[0:18] * bcherian (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[0:18] <Tv|work> and why are logs on the nfsroot in the first place
[0:18] <Tv|work> that won't work right, with multiple machines appending
[0:19] <gregaf> what do they look like?
[0:19] <DJLee> right, im going to use iozone, for more testings from now on,
[0:19] <DJLee> and dbench
[0:19] <Tv|work> -rw-r--r-- 1 root root 3.1G 2011-02-02 12:43 osd.5.log
[0:19] <DJLee> kinda not using fio anymore;;
[0:19] <gregaf> Tv|work: sjust might have thrown them in there by mistake, there was an mds crash he wanted to save
[0:19] <Tv|work> that's a 28GB partition, you can't put that much in there
[0:19] <gregaf> oh, probably not the osd logs, no
[0:19] <Tv|work> sjust: is this your stuff?
[0:20] <sjust> oops, yep
[0:20] <sjust> cosd
[0:22] <DJLee> guys, anybody explain this simple plot..?
[0:22] <DJLee> http://twiki.esc.auckland.ac.nz/twiki/pub/NDSG/Cephtest/ddplot11.pdf
[0:23] <DJLee> write 1st = dd write 40gb, read 1st = dd read back the 40gb
[0:23] <DJLee> and so on for 4 times;
[0:24] <sjust> so, the reads don't change because they are simply processed by the primary
[0:24] <DJLee> all I can see is the black bar (read) is more or less all similar across all configurations;;;
[0:24] <DJLee> right, that's what i've read;
[0:24] <DJLee> and that's just expected result?
[0:24] <jantje> DJLee: 175MB/s , not just ethernet I guess?
[0:24] <DJLee> yeah ethernet
[0:24] <DJLee> bonding :)
[0:24] <sjust> the writes, on the other hand, get slower because with 2x replication, the primary must wait until the replica has the write before it does its own
[0:25] <DJLee> hm
[0:25] <gregaf> it's a little odd that reads get slower with more replicas
[0:25] <gregaf> err, OSDs, not replicas
[0:26] <gregaf> but again, I bet that's an issue with prefetching
[0:26] <sjust> since the writes are cached, the writes probably get executed in parallel
[0:26] <sjust> gregaf: yeah
[0:27] <DJLee> yeah,, btw, the osds are on just 1 node (for the 1, 3 and 6 hdd disks), and 1 node for mon/mds, and the other for the client, (all machines are powerful Intel xeons, and i dont see much % cpu)
[0:28] <DJLee> so with the reading, there's no way to read from the other osds in parallel?
[0:28] <gregaf> I wonder if running multiple OSDs on a node is causing trouble with the local readahead too
[0:28] <sjust> DJLee: prefetching is supposed to allow for that
[0:28] * jantje is going to run iozone -t 4 -s1G -i 0 -i 2 -i 8 on his 10OSD cluster with no replication (master branch), and regular gigabit, i'll let you know the results tomorrow
[0:29] <jantje> make that 4G instead of 1G
[0:29] <gregaf> DJLee: if you set up multiple threads on the same client doing reads in different locations you'd get more aggregate throughput
[0:30] <gregaf> Ceph tries to prefetch but it's obviously not doing well enough right now
[0:30] <DJLee> how do I do the prefetching, and make it run in separate threads..? sorry
[0:30] * bcherian (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:31] <gregaf> oh, I meant you'd need to run multiple things at once
[0:31] <gregaf> it's doing prefetching by default but it's not well-tuned
[0:31] <DJLee> gregaf: I definitely saw the difference when I was running on 2 nodes, i.e., 1+1 , 2+2 and 3+3 osds, (for writes, but no results for read,, yet)
[0:32] <gregaf> yeah, that's because it's much simpler to make writes go out in parallel — they're buffered and get flushed asynchronously from the client
[0:35] <Tv|work> sjust: so are those files safe to remove?
[0:35] <Tv|work> sjust: rushing you because just about nothing works while that partition is full :-/
[0:35] <jantje> hmm,
[0:35] <jantje> WARNING: at fs/btrfs/inode.c:2143 btrfs_orphan_commit_root+0x7f/0x9b()
[0:35] <DJLee> gregaf, right, but wouldn't it be the same if I were to just run it on a single node with 2, 4 and 6 osds (because the crushmaps are essentially the same..)
[0:36] <jantje> is that bug fixed? if yes, where? because I also need the ino32 patch that's not yet in the kernel git repository
[0:36] <gregaf> DJLee: not sure what you mean
[0:36] <jantje> hmm
[0:36] <jantje> nevermind, i'll look it up tomorrow, just being lazy
[0:36] <gregaf> if you were talking about when I said "I wonder if running multiple OSDs on a node is causing trouble with the local readahead too"
[0:36] <jantje> nite everyone
[0:36] <DJLee> single node: 2, 4, 6 osds == two nodes with 1+1, 2+2, 3+3 osds ?
[0:37] <gregaf> I was referring to the node's kernel/disk/raid card behaviors
[0:37] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[0:37] <sjust> Tv|work: yes, sorry
[0:37] <gregaf> I don't think that should happen, though
[0:37] <gregaf> I'm sorry that our read pre-fetching doesn't play nicely with dd
[0:37] <gregaf> you can go ahead and create a bug
[0:37] <Tv|work> sjust: alright, firing photon torpedos..
[0:37] <gregaf> but I'm pretty sure that's all it is
[0:38] <gregaf> when you ask for a read from the ceph kernel client, it often has to go out to the OSD to fill that, which requires round-trip latency and stuff
[0:38] <Tv|work> gregaf: uhh, dd is probably the easiest to prefetch for, so you're saying prefetch doesn't work at all?-)
[0:38] <gregaf> whereas when you're doing writing it usually gets to write to memory, return success, and then asynchronously flush out the write, hiding the latency from you
[0:38] <gregaf> Tv|work: I dunno, but it's all I can come up with
[0:39] <gregaf> I don't know what the algorithms for it look like
[0:39] <Tv|work> gregaf: having counters for "wanted a block, it wasn't there" vs "wanted a block, found it in cache" would be good for this..
[0:39] <Tv|work> gregaf: i think linux core readahead is pretty darn simple, if you do sequential reads it'll just fetch the next block for you
[0:40] <gregaf> my best guess is that maybe the prefetching is having difficulty with the relative size of the total read versus the local cache
[0:40] <gregaf> but I really don't know how the prefetching works at all so anything I come up with is just speculation
[0:41] <Tv|work> well it might be that the whole mechanism doesn't even trigger, due to ceph being slightly different
[0:42] <DJLee> http://twiki.esc.auckland.ac.nz/twiki/pub/NDSG/Cephtest/ddplot111.pdf
[0:42] <DJLee> guys this one is the same test, but with each HDD osd backed by its own separate SSD journal (2GB)
[0:43] <Tv|work> http://lxr.linux.no/linux+v2.6.37/mm/readahead.c#L294
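If the suspicion about local readahead on the OSD node is worth chasing, the per-device setting is easy to inspect and bump with blockdev; the device name and the value below are only examples:

    # readahead is reported in 512-byte sectors; 256 (128 KB) is a common default
    blockdev --getra /dev/sdb
    blockdev --setra 4096 /dev/sdb   # try 2 MB readahead on an OSD data disk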
[0:43] <DJLee> 2x replication is *still* going, so may take some more hours;;
[0:43] <DJLee> argh, sorry about the scale, 200MB vs 250MB
[0:48] <DJLee> when doing sequential writes, provided that the disks are initially empty, are the objects written from outer edge first? (matters for hdds!) or sort of random
[0:48] <gregaf> that's all handled by the underlying fs, Ceph doesn't know
[0:50] <Anticimex> does any system need write access to a ceph keyring?
[0:50] <Anticimex> eg, is it sufficient to export the ceph.conf read-only via nfs to all nodes?
[0:50] <Anticimex> and have the keyring in the same folder
[0:50] <gregaf> Anticimex: the monitors might need write access, but I'm not sure what their defaults are
[0:51] <Anticimex> written only from the master
[0:51] <Anticimex> ok
[0:51] <Anticimex> where master is the one i run /etc/init.d/ceph start on, which has the ssh key to other nodes, etc
[0:51] <darkfader> Anticimex: you might make yourself unhappy if you make one networked filesystem(ceph) depend on another(nfs) :)
[0:52] <Tv|work> also, private keys on nfs make me chuckle
[0:52] <Anticimex> ok, but i'm really just testing now
[0:52] <darkfader> Tv|work: i think only the conf should be there
[0:52] <Anticimex> so ease of change has priority
[0:52] <darkfader> ok :)
[0:56] <gregaf> Anticimex: not master, monitor
[0:56] <gregaf> you specify those in your ceph.conf, but you don't need to run the system start commands on those machines
[0:57] <Anticimex> that is what i am saying
[0:57] <gregaf> oh, sorry, I guess my statement about the monitors split up your sentence
[0:58] <gregaf> I was interpreting it as a response to me :)
[1:03] <DJLee> so far, would you say it is normal for reads (since they come from the primary) to be slower than writes..?
[1:06] <DJLee> again, going back, can I possibly read from multiple osds (is there a config for that somewhere?)
[1:18] <Tv|work> DJLee: that'd be the "read from replicas" thing we've talked about a lot recently..
[1:24] <DJLee> ..
[1:25] <DJLee> er, okay, let me go ahead and, instead of e.g. 3 osds (one per disk), go raid0 over the 3 disks (and make it 1 osd), and see the results.
[1:25] <DJLee> and i'll also try 2 nodes
[1:29] <DJLee> btw, in the ceph.conf, the osd_pg_bits doesnt seems to be read
[1:32] <gregaf> DJLee: you need to put the osd_pg_bits in the global section, not the OSD sectin
[1:32] <gregaf> *OSD section
[1:40] <DJLee> yeah im sure i did that
[1:40] <DJLee> its where the ceph log and user = root settings are;
[1:41] <gregaf> hmm
[1:50] <gregaf> DJLee: what makes you think the osd_pg_bits isn't read?
[1:51] <gregaf> I'm looking at the code and it looks like I can trace its use...
[1:51] <DJLee> the pgs are still 12336, heh
[1:51] <gregaf> ah
[1:51] <gregaf> did you make a new ceph fs or just restart your current one?
[1:51] <gregaf> it can't be used to reduce the number of PGs in an existing install
[1:53] <DJLee> i did mkcephfs
[1:54] <DJLee> in fact i've never tried to restart ceph without mkceph --clobber old data; heh;
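For reference, what gregaf describes would look roughly like the sketch below in ceph.conf; the option name and value here are assumptions based on this discussion, and as noted above it only takes effect when the filesystem is recreated with mkcephfs:

    # append to the shared config, then re-run mkcephfs;
    # an existing cluster ignores the change
    cat >> /etc/ceph/ceph.conf <<'EOF'
    [global]
            osd pg bits = 5   ; example value controlling the PG count at creation time
    EOF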
[1:54] <DJLee> but atm im more worried about the read scalability; if its a known issue, i dont need to report it on the bugtracker with the plots i've shown here?
[1:55] <gregaf> I don't think it's anything we've actually looked at
[1:55] <Tv|work> DJLee: honestly, we're not quite up to making it perform well yet..
[1:55] <gregaf> it's better to have stuff in the tracker than not
[1:55] <gregaf> it's just I'm not too concerned about single-client read scalability since I believe we've run some tests showing it scales across clients :)
[2:07] <DJLee> gregaf, oh, when you mean single-client, well, i was running like dbench, too
[2:07] <DJLee> with 100 threads
[2:08] <gregaf> DJLee: have you tried using multiple actual clients though?
[2:09] <gregaf> you should see your aggregate bandwidth increasing if you start a couple of different kernel clients doing dds on different files
[2:09] <DJLee> and if i remember correctly, i do see the scalability, but the speed was in about 70MB/s (dbench) with 6 OSDs
[2:09] <DJLee> hmm, i see
[2:09] <DJLee> are there any easy ways to do that? other than 'physically' using multiple clients?
[2:10] <DJLee> or even then, i gotta add all the results together;;
[2:10] <gregaf> yeah, I think you'd at least need to set up some VMs
[2:10] <gregaf> you could run cfuse and the kernel client on the same box
[2:11] <DJLee> are there any real advantages or lacking features of a single kclient (e.g., buffering?)
[2:11] <gregaf> although it won't be quite as nice to analyze because the userspace and kernel clients have somewhat different behaviors
[2:11] <gregaf> not sure what you're asking
[2:12] <gregaf> but I don't think there are disadvantages
[2:12] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[2:12] <DJLee> cuz the way i understand it (correct me), at least when benchmarking, a single client should be able to generate load, or so to call it bombard the mounted space, as much as possible
[2:12] <DJLee> including generating multiple threads, etc
[2:12] <gregaf> it's just that Ceph is designed for serving in systems with lots of clients, not just one client
[2:12] <DJLee> should i try 'dd simultaneously with different threads..?'
[2:12] <gregaf> oh, yes, it should (across its network bandwidth, that is)
[2:13] <gregaf> but things get complicated due to stuff like network latency, etc
[2:13] <gregaf> so you either have to do reads synchronously, and pay a huge penalty for sending requests across the network, waiting for multiple disk seeks, etc
[2:13] <gregaf> or you have to try and predict reads and prefetch it before reads come in
[2:14] <gregaf> and I think the reason the read performance is dropping slightly as you increase the number of OSDs is because the algorithms in use aren't handling the prefetching as well
[2:14] <gregaf> scaling out the number of clients at least lets you see if your cluster is capable of providing more read bandwidth than that
[2:15] <DJLee> that's right,
[2:15] <gregaf> alternatively you could set up a read benchmark that does asynchronous reads in a number of threads and aggregates that into a single bandwidth number
[2:15] <gregaf> which I don't think you said you were doing?
[2:16] <DJLee> cuz i did find the write was great, i.e., all of the OSDs were like 100% busy (iostat) and being bombarded,
[2:16] <DJLee> exactly, so i should do like dd & dd & dd & dd (4 threads)
[2:16] * bcherian (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:17] <gregaf> I'm not certain, but there's a good chance that would work better
[2:18] <gregaf> where by "work better" I mean "show off Ceph to better advantage" ;)
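A minimal sketch of the "several dds at once" idea, assuming four pre-written files on the ceph mount; each stream reads its own file, and the aggregate bandwidth is the sum of the rates dd prints:

    # launch four sequential readers in parallel and wait for all of them
    for i in 1 2 3 4; do
        dd if=/mnt/ceph/testfile.$i of=/dev/null bs=1M &
    done
    wait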
[2:18] <DJLee> sure
[2:18] <DJLee> definitely, that's what we are trying to do
[2:18] <DJLee> :)
[2:19] <DJLee> although, i really gotta read sage's thesis in more detail; ive been skimming through, reading, and skimming through again, but this time i should do it more carefully,
[2:19] <gregaf> heh
[2:19] <gregaf> just don't pay any attention to where it talks about EBOFS
[2:19] <DJLee> right
[2:20] <DJLee> things i'll pay attention to are crush and replication (rados)
[2:21] <gregaf> yep, that's still pretty much the same
[2:21] <gregaf> as is the MDS, actually
[2:21] <gregaf> minor adjustments for new features but I don't think anything big has changed
[2:21] <DJLee> right, thats really good work, given that he got it right already several years ago; normally the concepts change within a few years. heh;
[2:23] <gregaf> anyway, I'm off, bbl
[2:24] <DJLee> ok, thanks bye
[2:28] * eternaleye_ (~eternaley@195.215.30.181) Quit (Remote host closed the connection)
[2:28] * eternaleye_ (~eternaley@195.215.30.181) has joined #ceph
[2:48] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:55] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:04] * cmccabe (~cmccabe@208.80.64.79) Quit (Quit: Leaving.)
[3:21] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:35] * eternaleye_ is now known as eternaleye
[4:33] * baldben (~bencheria@cpe-76-173-232-163.socal.res.rr.com) has joined #ceph
[4:52] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[4:53] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[5:10] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[5:16] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[5:26] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[5:49] * midnightmagic (~mm@S0106000102ec26fe.gv.shawcable.net) has joined #ceph
[5:50] <midnightmagic> uh.. so I'm trying to register in the redmine tracker.. and even though I've not only registered, but also reset my password, when i try to log in to the front entrance it insists my login credentials are bad: "Invalid user or password"
[5:50] <sage> what user?
[5:50] <midnightmagic> I have tried to log in with my username (midnightmagic) but also my email address. it doesn't let me in at all.
[5:51] <midnightmagic> I am not using openid.
[5:51] <sage> try now?
[5:52] <midnightmagic> got in. does redmine require admin approval these days?
[5:52] <midnightmagic> been a while since i upgraded my own redmine tracker.
[5:52] <midnightmagic> anyway, thanks,.
[5:52] <sage> not normally, but i've had to do this a handful of times. not sure what makes it picky.
[5:52] <sage> np
[5:52] <midnightmagic> what is "this" if you don't mind me asking?
[5:52] <sage> click the 'activate' link on the user page
[5:52] <midnightmagic> got it. thanks!
[5:53] <midnightmagic> :-)
[5:55] <iggy> does standalone work back to 2.6.32?
[6:09] <sage> iggy: through .27 i think
[6:09] <sage> well, it compiles; we don't test on older kernels
[7:07] <jantje> morning
[7:11] <jantje> sage: [ 925.266903] WARNING: at fs/btrfs/inode.c:2143 btrfs_orphan_commit_root+0x7f/0x9b()
[7:11] <jantje> is it fixed in the kernel tree? or just the ceph-client tree?
[7:12] <jantje> that kernel is somewhat 'old', but I'm not sure where the right fix is, and what it is
[7:13] <jantje> (the ticket just got closed...)
[7:16] <jantje> i probably just need some way to merge the upstream kernel into it
[8:13] * baldben (~bencheria@cpe-76-173-232-163.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[8:52] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:53] <Anticimex> ugg
[8:53] <Anticimex> debians 2.6.37 experimental
[8:54] <Anticimex> getting lots of kernel issues with btrfs
[8:54] <Anticimex> such as
[8:54] <Anticimex> Feb 3 08:47:43 ceph-node-1 kernel: [274025.872085] BUG: unable to handle kernel NULL pointer dereference at (null)
[8:54] <Anticimex> Feb 3 08:47:44 ceph-node-1 kernel: [274025.872842] IP: [<c1020340>] kmap_atomic_prot+0x12/0xdd
[8:54] <Anticimex> Feb 3 08:47:44 ceph-node-1 kernel: [274025.873620] *pde = 00000000
[8:54] <Anticimex> Feb 3 08:47:44 ceph-node-1 kernel: [274025.874398] Oops: 0000 [#1] SMP
[8:54] <Anticimex> Feb 3 08:47:44 ceph-node-1 kernel: [274025.875203] last sysfs file: /sys/module/btrfs/initstate
[9:08] <jantje> Didn't see that one yet
[9:08] <jantje> try to get the git tree
[9:09] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[10:09] * Yoric (~David@213.144.210.93) has joined #ceph
[10:56] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[12:48] * shdb (~shdb@80-219-123-230.dclient.hispeed.ch) has joined #ceph
[13:44] * johnl_ (~johnl@109.107.34.14) Quit (Remote host closed the connection)
[13:55] * baldben (~bencheria@cpe-76-173-232-163.socal.res.rr.com) has joined #ceph
[14:11] * johnl (~johnl@109.107.34.14) has joined #ceph
[16:09] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:05] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[17:23] * baldben (~bencheria@cpe-76-173-232-163.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[17:55] * greglap (~Adium@166.205.139.76) has joined #ceph
[18:21] * Yoric (~David@213.144.210.93) Quit (reticulum.oftc.net magnet.oftc.net)
[18:21] * Anticimex (anticimex@netforce.csbnet.se) Quit (reticulum.oftc.net magnet.oftc.net)
[18:21] * jantje (~jan@paranoid.nl) Quit (reticulum.oftc.net magnet.oftc.net)
[18:21] * raso (~raso@debian-multimedia.org) Quit (reticulum.oftc.net magnet.oftc.net)
[18:21] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (reticulum.oftc.net magnet.oftc.net)
[18:22] * Yoric (~David@213.144.210.93) has joined #ceph
[18:22] * jantje (~jan@paranoid.nl) has joined #ceph
[18:22] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[18:22] * raso (~raso@debian-multimedia.org) has joined #ceph
[18:22] * Anticimex (anticimex@netforce.csbnet.se) has joined #ceph
[18:37] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:42] * greglap (~Adium@166.205.139.76) Quit (Quit: Leaving.)
[18:47] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[18:53] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:05] <yehudasa> wido: are you there?
[19:10] * cmccabe (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) has joined #ceph
[19:12] * alexxy (~alexxy@79.173.81.171) Quit (Remote host closed the connection)
[19:16] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:17] * alexxy (~alexxy@79.173.81.171) has joined #ceph
[19:26] <Tv|work> meet at 10:30?
[19:27] <sagewk> yep
[19:29] * midnightmagic (~mm@S0106000102ec26fe.gv.shawcable.net) Quit (Remote host closed the connection)
[19:31] * MK_FG (~MK_FG@188.226.51.71) Quit (Quit: o//)
[19:32] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[19:53] <Tv|work> sagewk: why is ceph-kvm2.ceph.dreamhost.com in dns but gitbuilder. and autotest. are not?
[19:54] <Tv|work> sagewk: i don't understand the dh machine admin thingie well enough yet :-/
[19:54] <sagewk> dns is generated via the machine table, not ip table.
[19:54] <sagewk> hmm
[19:55] <Tv|work> ok i've never touched that
[19:55] <Tv|work> ah because it seems to be about physical machines
[19:56] <sagewk> yeah
[19:56] <sagewk> give me a few
[19:56] <Tv|work> no worries just seeing if i can get dhcp etc going
[19:56] <Tv|work> i can hardcode ips just as well..
[19:58] <sagewk> let's do that for now. we can set up manual dns records, but i'm not sure it's worth tracking the vms in our database
[20:09] <Tv|work> biggest stumbling blocks left with autotest: 1) configure it to have actual mysql passphrases and all 2) protect its web ui with usernames/passwords because it's effectively a "run any code you want" service
[20:10] * bcherian (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:17] <sagewk> k
[20:17] <sagewk> just need an .htaccess?
[20:17] <Tv|work> yeah
[20:17] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[20:17] <Tv|work> sagewk: if there's existing dh stuff we can piggyback off, that'd be good
[20:17] <Tv|work> sagewk: otherwise i'll throw in a little mysql table for mod_auth_somethingsomething
[20:28] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:43] * jantje (~jan@paranoid.nl) Quit (Read error: Connection reset by peer)
[20:43] * jantje (~jan@paranoid.nl) has joined #ceph
[21:09] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) has joined #ceph
[21:31] * gregorg (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[21:31] * gregorg (~Greg@78.155.152.6) has joined #ceph
[21:32] * gregorg_taf (~Greg@78.155.152.6) has joined #ceph
[21:32] * gregorg (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[22:25] <johnl> hi, can someone explain what blacklist is and what it's for?
[22:25] <johnl> note on the wiki asks for an explanation, thought I'd update it
[22:27] <gregaf> johnl: if you have "some daemon" (in practice, always an MDS) that talks to the OSDs, and the monitors determine that the daemon needs to get replaced, that daemon goes on the blacklist to prevent it from accessing the OSDs
[22:27] <gregaf> so if you have a laggy MDS and an available standby
[22:28] <gregaf> the monitors will put the laggy MDS in the blacklist and bring up the standby to take over
[22:29] <gregaf> and since the MDS is in the blacklist, the OSDs won't let it make changes to data, preventing divergent data sets
[22:29] <johnl> right
[22:30] <johnl> so any daemon can be blacklisted though? does it always mean the osds won't speak to it?
[22:30] <gregaf> it always means the OSDs won't speak to it
[22:30] <johnl> so osds are the only daemon to consult blacklists?
[22:30] <gregaf> yes, the blacklist is part of the OSD map
[22:30] <gregaf> in practice only MDSes get blacklisted
[22:30] <johnl> other osds can be blacklisted then too?
[22:30] <johnl> right ok.
[22:31] <gregaf> other OSDs don't get blacklisted, no
[22:31] <gregaf> all the stuff with the OSDs is handled in other pieces
[22:31] <gregaf> the blacklist is a means to maintain data consistency in systems built on top of Ceph
[22:31] <gregaf> *built on top of RADOS
[22:31] <johnl> ah, so the "ceph osd blacklist" command just manipulates the blacklist - it's not *for* blacklisting osds. I see
[22:31] <gregaf> at present that's only Ceph
[22:31] <gregaf> yes
[22:33] <johnl> thanks greg
[22:33] <gregaf> np
[22:33] <wido> yehudasa: here now
[22:37] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[22:40] <wido> I'm going afk
[22:50] <johnl> hey gregaf, how come the blacklist command can take an expiry time? under what circumstances would a daemon need to be blacklisted for just a bit?
[22:50] <johnl> you said if the "daemon needs to get replaced"
[22:50] <gregaf> johnl: not sure that you need to remove it, necessarily
[22:50] <gregaf> but that way a long-running instance doesn't have a huge blacklist
[22:51] <gregaf> and if the entry expires after one day it's a pretty good bet that the MDS has learned it's down, even if it still exists ;)
[22:52] <johnl> can an mds recover from a situation where it was blacklisted?
[22:53] <johnl> i.e., once it knows it's down I assume it can resync or whatever.
[22:53] <johnl> ah, so blacklist fencing. just clicked :)
[22:53] <johnl> blacklisting is fencing
[22:54] <Tv|work> gaaah dhcp
[22:54] <Tv|work> not having any fun with networking today :(
[22:55] <gregaf> johnl: if an MDS gets blacklisted it kills itself
[22:55] <gregaf> or maybe it respawns, I don't recall exactly
[22:55] <gregaf> but if a daemon respawns then it gets a different identifier, so the new instance isn't blacklisted
[22:57] <johnl> by identifier do you mean the N in mdsN ? or something else?
[22:57] <gregaf> something else
[22:58] <gregaf> if you look at the daemons when they start up they print out something like 127.0.0.1:6789/bignum
[22:58] <gregaf> errr, wait, that's not the id either
[22:58] <gregaf> there's an internal entity_addr_t (or one of its related structs)
[22:59] <gregaf> that contains the IP and a big number which I believe is a randomly-generated 32 or 64-bit
[22:59] <gregaf> so when they get restarted they'll have the same name and the same IP, but a different big number
[22:59] <gregaf> so they're identified as a different instance
[22:59] <johnl> right
[22:59] <johnl> the wiki says "ceph osd blacklist" takes an "address" argument
[23:00] <johnl> is that an IP address?
[23:02] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[23:02] <gregaf> not sure the interface, I'll ask sage when he's available
[23:03] <johnl> source code seems to suggest ip address
[23:03] <johnl> in OSD.cc: if (osdmap->is_blacklisted(op->get_source_addr()))
[23:04] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[23:05] <gregaf> yeah, it's the IP address basically but I'm not sure if it can translate between the big random numbers and not and stuff
[23:06] <johnl> k
[23:09] <johnl> am wondering how the osd differentiates between a client connecting to an osd and an mds connecting to an osd when checking for blacklisting
[23:10] <johnl> I could see how an mds might get blacklisted, but if that mds is also a client machine then it'll break stuff
[23:10] <gregaf> thus the importance of the extra random number
[23:10] <gregaf> they have different entity_addr_t; it contains more than just the IP
[23:10] <johnl> not many production systems that have an mds and client on the same host I'd guess
[23:11] <johnl> but then you're suggesting that blacklisting is *just* for mdss
[23:11] <gregaf> at this point it is
[23:11] <johnl> you seemed to imply it was for any daemon but most usually mds
[23:11] <gregaf> ah
[23:12] <gregaf> it's a general RADOS mechanism that is only used for MDSes
[23:12] <johnl> do all daemons have those other identifiers?
[23:12] <johnl> sorry for the grilling!
[23:12] <gregaf> yeah, every daemon and all the clients have an entity_addr_t that is unique from all other daemons/clients
[23:20] <gregaf> johnl: okay, so with the manual blacklisting you can either specify just the IP, in which case all connections from that IP will be blacklisted
[23:20] <gregaf> or you can specify the full IP:port:"nonce"
[23:20] <gregaf> which just blacklists that specific daemon
[23:20] <gregaf> only specific daemons get automatically blacklisted
[23:21] <gregaf> if something's blacklisted then any messages it sends to the OSD are replied to with an error code telling the sender that it's blacklisted
[23:21] <yehudasa> wido: shouldn't the namespace in your fix be 'ceph' and not 'librados'?
[23:21] <gregaf> and obviously the OSD just ignores the op
[23:24] <johnl> ah excellent
[23:24] <johnl> I assume the automatic blacklist uses the nonce
[23:24] <gregaf> yeah
[23:25] <gregaf> that's how it keeps it to a specific daemon, and not the IP
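For the wiki note johnl is writing, the manual command discussed here looks roughly like the sketch below; the exact CLI wording is an assumption and may differ by version, but the two address forms (a whole host vs. a specific ip:port/nonce instance) are the ones gregaf describes:

    # blacklist everything from one host, or only one daemon instance, for an hour
    # (addresses and expiry are made-up examples)
    ceph osd blacklist add 192.168.0.10 3600
    ceph osd blacklist add 192.168.0.10:6802/54321 3600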
[23:25] <johnl> I was just digging through the code to try and find where "ceph osd blacklist" is handled
[23:25] <johnl> to find that out :) thanks.
[23:25] <gregaf> :)
[23:25] <johnl> I couldn't find it though, any pointers?
[23:25] <gregaf> it's in mon/OSDMonitor.cc::prepare_command
[23:26] <gregaf> with the ceph command, the second word (here, "osd") tells it which system monitor to direct the request to
[23:26] <johnl> ah there we are
[23:27] <gregaf> so the monitor glue code uses that to direct the "MMonCommand" message to the correct Monitor object's implementation of the "prepare_command" function
[23:27] <johnl> addr.parse
[23:27] <gregaf> yep
[23:29] <johnl> tucked away in that parse function :)
[23:30] <johnl> thanks for the help greg. helped me find my away around the code a bit too now
[23:30] <gregaf> welcome
[23:31] <gregaf> better code documentation is something I've been wanting to do for a while
[23:32] <gregaf> except who wants to spend time doing that? ;)
[23:32] <johnl> heh. I quite like it, hehe. if I'm in the mood
[23:32] <johnl> there is some docs here
[23:33] <johnl> actually, I don't see the nonce parsing in this function
[23:33] <johnl> the port is handled
[23:34] <johnl> am on master branch.
[23:35] <gregaf> hmmm
[23:35] <sagewk> johnl: hmm, right you are
[23:35] <johnl> port is enough to differentiate between mds/client though I suppose
[23:36] <johnl> though is it source port?
[23:36] <sagewk> should be there for completeness, though. this path isn't used when the monitor blacklists things internally/automatically
[23:36] <sagewk> i'll fix it up
[23:37] <johnl> want a redmine ticket?
[23:37] <sagewk> naw
[23:37] <johnl> k, np.
[23:37] <johnl> I'm lazy, I like people to write my redmine tickets for me :)
[23:38] <gregaf> heh
[23:38] <gregaf> we're lazy and don't always write redmine tickets if we fix it fast enough
[23:39] <johnl> heh
[23:39] <johnl> oh man, not much better than closing a redmine ticket. I sometimes open a ticket after I've fixed something, just so I can close it!
[23:39] <johnl> ;)
[23:43] <johnl> is the port source port?
[23:43] <johnl> I assume source port will change with each connection, no?
[23:43] <gregaf> depends on your assignment policy and stuff
[23:44] <gregaf> you can set ports to use in your config file if you like, otherwise the daemon doesn't ask for any in particular
[23:44] <johnl> an outgoing tcp connection will usually be assigned a free source port by the OS
[23:44] <gregaf> but at least on my machine it tends to get sequential numbers based on what's already been assigned, I think?
[23:45] <johnl> heh, it should be random actually, due to a security issue :)
[23:45] <johnl> but I mean, it's difficult to actually fence a given node by IP and port and the port will change each time
[23:46] <johnl> sorry, ...and port IF the port will change each time...
[23:47] <johnl> would guess that the source port would be different for each connection to each osd too
[23:47] <johnl> I bet nobody actually uses this command do they, lol
[23:47] <johnl> all this examination
[23:47] <gregaf> yeah, I've never heard of it being used
[23:47] <johnl> hehe
[23:47] <gregaf> can be useful for failure testing and stuff though
[23:48] <gregaf> if you don't have a connection handy to just kill the daemon
[23:49] <johnl> I think i'll be a little vague in the wiki and suggest it's only used for testing
[23:54] <johnl> thanks again.

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.