#ceph IRC Log


IRC Log for 2012-07-19

Timestamps are in GMT/BST.

[0:02] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[0:03] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[0:09] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:13] <elder> joshd, looking at the one that remains, e38d3e084
[0:15] <elder> Should snapc->seq ever be modified in this way? It seems like the whole bit of "follow_seq" logic in there may be a mistake.
[0:17] <elder> I.e., instead of this, we should just copy over snapc->seq unconditionally
[0:19] <jefferai> Hey there. For the majority of its traffic, does Ceph use UDP or TCP?
[0:20] <jefferai> I have a couple of Arista switches and I'm trying to decide how to set up bonding on the hosts
[0:24] <sjust> tcp
[0:25] <jefferai> ah
[0:25] <jefferai> so balance-rr may not be a good choice due to out-of-order packets
[0:25] * Ryan_Lane (~Adium@ Quit (Read error: Connection reset by peer)
[0:25] * Ryan_Lane (~Adium@ has joined #ceph
[0:26] <joshd> elder: yes, I agree. the new snap_seq that we read is authoritative
[0:27] <elder> OK.
[0:27] <joshd> I think I kept getting confused because of that
[0:27] <elder> I've got a patch that does that, and will tack it on to the end of your patches.
[0:27] <joshd> sounds good
[0:28] <elder> I'm going to test them but I'll most likely post them to the list for review tonight. I've already marked the ones from you as reviewed by me.
[0:28] <elder> I already validated that the sizes and snapshot sizes display properly...
[0:28] <joshd> great
[0:28] <joshd> so does kernel.sh pass now?
[0:29] <elder> I haven't checked. I'll have to wait an hour for the kernel to build...
[0:29] <elder> I'm testing with UML.
[0:29] <elder> I'll try it later and let you know.
[0:30] <joshd> ok, thanks
[0:35] * LarsFronius (~LarsFroni@2a02:8108:380:90:992f:e637:b392:68e) Quit (Quit: LarsFronius)
[0:36] <Tv_> jefferai: even with most UDP protocols, rr hurts
[0:36] <jefferai> hm
[0:36] <Tv_> jefferai: most on-top-of-UDP protocols include half of a TCP stack in the protocol implementation ;)
[0:36] <jefferai> Tv_: hah, yes, I know that
[0:37] <jefferai> Do you have suggestions? I've read the bonding article on linuxfoundation.org carefully, but it doesn't leave me feeling informed, because I don't quite fit into any of their examples
[0:37] <jefferai> rather, I *could*, but don't have some of the constraints they assume
[0:37] <Tv_> jefferai: hash on TCP/UDP ports
[0:37] <Tv_> jefferai: well, hash on something that *includes* ports
[0:37] <jefferai> Tv_: using LACP?
[0:38] <Tv_> jefferai: LACP etc are just about handshaking that wires A and B are used together
[0:38] <jefferai> sure
[0:38] <jefferai> but helps with the failover
[0:38] <Tv_> jefferai: you can trunk independently of that etc; I don't care much about Cisco terminology
[0:39] <Tv_> jefferai: oh sure, don't keep sending on a dead link
[0:39] <jefferai> however, the 802.3ad mode of bonding lets you change the xmit_hash_policy option
[0:39] <Tv_> jefferai: but you want hashing, not active-backup
[0:39] <jefferai> right, so my understanding is that 802.3ad mode (which is LACP, terminology aside) does both hashing for aggregation and does fault tolerance
[0:39] <Tv_> jefferai: i really don't believe in network devices dictating to others what to do; that way tends to lead to security issues (yeah, you should *totally* send me all traffic for all VLANs!)
[0:40] <jefferai> or perhaps 802.3ad is not LACP, but regardless, my switch supports it
[0:40] <jefferai> so we'll call it 802.3ad
[0:41] <jefferai> so this would be bonding driver mode 4
[0:41] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:41] <jefferai> with a xmit_hash_policy of layer3+4
[0:42] <Tv_> jefferai: xmit_hash_policy layer3+4 is what matters
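Note: a minimal sketch of why a port-aware hash avoids the reordering problem raised above. It assumes the layer3+4 formula described in older Linux bonding documentation, ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count; newer kernels may compute the hash differently. The function name and the sample addresses are hypothetical.

```python
import ipaddress


def layer34_slave(src_ip, dst_ip, src_port, dst_port, n_slaves):
    """Return the slave index for one flow (assumed older layer3+4 formula)."""
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % n_slaves


# Hypothetical flows: one client host talking to one server port from
# several source ports. Each flow is pinned to one link, so packets within
# a TCP connection are never spread across links.
flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 6789) for i in range(4)]
for flow in flows:
    print(flow, "-> slave", layer34_slave(*flow, n_slaves=2))
```

Because the result depends only on the address and port tuple, a single connection never exceeds one link's bandwidth; aggregation only helps when there are many concurrent flows.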
[0:42] <jefferai> sure
[0:42] <jefferai> Tv_: are you the guy that uses Arista here? I just ask because I first heard about them from this channel, but don't remember from whom
[0:43] <jefferai> I checked them out, was impressed, got good pricing, bought two
[0:43] <Tv_> jefferai: i'm more a linux networking hacker than any switch person.. the DreamHost ops have run proof of concepts on just about all the new fancy switches, I was involved there mostly when it involved virtual networking (think OpenStack)
[0:44] <jefferai> Ah
[0:44] <jefferai> so things like OpenVSwitch
[0:44] <Tv_> the lab we run has Force10s, cisco 2960s, 4948
[0:44] <Tv_> jefferai: things like reading the VXLAN draft and saying i can't find much wrong in it ;)
[0:44] <jefferai> :-)
[0:46] * loicd (~loic@ has joined #ceph
[0:54] <elder> joshd, ceph-client/wip-rbd-josh building now
[1:07] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:08] * Cube (~Adium@ Quit (Read error: Operation timed out)
[1:12] * James_259 (~James259@ Quit ()
[1:18] <joshd> elder: unrelated to that branch, but rbd_header_set_snap is using snap_seq entirely incorrectly - it's treating it as though it were snap_id, when it shouldn't be changed
[1:19] <joshd> snapc->seq should always be set to the latest header->snap_seq
[1:21] <joshd> and in the second half of that function, rbd_dev->snap_id should be set to the appropriate id, not snapc->seq
[1:22] * loicd (~loic@ Quit (Quit: Leaving.)
[1:23] <joshd> and snapc is an unused parameter in rbd_req_sync_read
[1:23] * fghaas (~florian@ has joined #ceph
[1:23] * fghaas (~florian@ Quit ()
[1:23] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[1:30] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[1:35] * loicd (~loic@ has joined #ceph
[2:06] * Tv_ (~tv@2607:f298:a:607:b435:f9f6:cf25:1ca2) Quit (Quit: Tv_)
[2:15] <elder> joshd, that was confusing to me too.
[2:15] <elder> I will look into that and fix it.
[2:15] <elder> I have a lot of patches in my series right now, no harm in adding one more...
[2:18] <elder> Looking at rbd_header_set_snap(): Looks like I just delete the assignment of snapc->seq and that should be good for the non-snapshot block. But for the snapshot case we really want to just assign the snapshot we looked up by name to snap_id.
[2:18] <elder> I'll create a patch.
[2:19] <elder> Oh yeah, pretty much what you just said..
[2:20] <elder> I think I'm going to kill that function, or change it a lot anyway. It's sort of a mish mash.
[2:20] <joshd> yeah, that sounds like a good idea
[2:27] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:29] <elder> joshd, after a "snap_add" method is called, won't we get notified to update the header?
[2:30] <elder> Right now the function that does that (rbd_header_add_snap()) forcefully sets snapc->seq after calling that. And I think we should not do that, but should wait and get it from a header refresh instead.
[2:31] * loicd (~loic@ Quit (Quit: Leaving.)
[2:32] <elder> Actually, I don't see where snapc->seq is getting assigned at all otherwise. What a mess. I'm leaving for a bit, will check back in a few hours.
[2:32] <joshd> ok
[2:32] <joshd> that sounds fine to me
[2:33] <joshd> but we should be setting it whenever we read the header
[2:34] * kfranklin (~kfranklin@adsl-99-64-33-43.dsl.pltn13.sbcglobal.net) has left #ceph
[2:36] <joshd> I think I might have figured out what was happening before rbd_dev->snap_id existed: snapc->seq was overloaded to be the snap_id for a device mapped to a snapshot, since the snapc is only used for writes
[2:36] <joshd> and if a device is mapped to a snapshot, it's read-only
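Note: a toy model of the separation discussed above, not the kernel rbd code. It assumes only what the conversation states: the snapshot context sequence is copied unconditionally from the freshly read header, while the id of a mapped snapshot lives in its own field and makes the device read-only. All class and method names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SnapContext:
    seq: int = 0                        # latest snap_seq seen in the header
    snaps: List[int] = field(default_factory=list)


@dataclass
class ToyRbdDev:
    snapc: SnapContext = field(default_factory=SnapContext)
    snap_id: Optional[int] = None       # None models "mapped to the head"
    read_only: bool = False

    def refresh_header(self, header_snap_seq, header_snaps):
        # The header is authoritative: copy the sequence unconditionally.
        self.snapc.seq = header_snap_seq
        self.snapc.snaps = list(header_snaps)

    def map_snapshot(self, name, snap_ids_by_name):
        # The mapped snapshot id is tracked separately from snapc.seq.
        self.snap_id = snap_ids_by_name[name]
        self.read_only = True
```

Keeping the two fields apart means a later header refresh can update the sequence without disturbing which snapshot the device is mapped to.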
[3:05] * asadpanda (~asadpanda@2001:470:c09d:0:20c:29ff:fe4e:a66) Quit (Ping timeout: 480 seconds)
[3:12] * asadpanda (~asadpanda@2001:470:c09d:0:20c:29ff:fe4e:a66) has joined #ceph
[3:21] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[3:25] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[3:45] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[4:09] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[4:21] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[4:26] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: No route to host)
[4:55] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:31] * deepsa (~deepsa@ has joined #ceph
[5:55] <elder> yehudasa, do you know if what josh said above is right? Was the snapc->seq value used to hold the snapshot id for snapshot images at one time?
[6:01] * dmick (~dmick@2607:f298:a:607:651d:75bd:9b07:b19e) Quit (Quit: Leaving.)
[6:43] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[7:00] * LarsFronius (~LarsFroni@95-91-243-240-dynip.superkabel.de) has joined #ceph
[7:05] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Ping timeout: 480 seconds)
[7:07] * LarsFronius (~LarsFroni@95-91-243-240-dynip.superkabel.de) Quit (Quit: LarsFronius)
[7:17] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[8:05] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[8:38] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[8:51] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:13] * glowell (~glowell@dhcp6-19.nersc.gov) Quit (Quit: Leaving)
[9:16] * BManojlovic (~steki@ has joined #ceph
[9:28] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[9:38] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) Quit (Ping timeout: 480 seconds)
[10:18] * nhm_ (~nh@184-97-241-232.mpls.qwest.net) has joined #ceph
[10:23] * nhm (~nh@184-97-254-223.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[10:25] * glowell (~glowell@c-98-210-226-131.hsd1.ca.comcast.net) has joined #ceph
[11:01] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[11:16] * deepsa (~deepsa@ Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[11:20] * deepsa (~deepsa@ has joined #ceph
[11:46] * stxShadow (~Jens@ip-78-94-238-69.unitymediagroup.de) has joined #ceph
[11:47] * stxShadow (~Jens@ip-78-94-238-69.unitymediagroup.de) has left #ceph
[12:00] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:06] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[12:07] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) Quit ()
[12:40] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[13:30] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[13:40] <nhm_> joao: ping
[13:42] <joao> pong
[13:42] <nhm_> joao: where does the workload generator live?
[13:42] <joao> test/filestore
[13:43] <nhm_> joao: cool. Can I compile it by itself?
[13:43] <joao> make test_filestore_workloadgen I believe
[13:46] <nhm_> joao: were you getting pretty believable io patterns out of it compared to ceph?
[13:48] <joao> nhm_, we tried to mimic the filestore's access pattern
[13:48] <joao> not sure if that answers your question :\
[13:48] <joao> (overslept, feeling a bit numb still; need coffee badly)
[13:51] <nhm_> joao: ok, cool. I'm finally to the point where I'm seeing ok performance on the underlying disk with these controllers I think. Ceph performance is still bad, so I'm trying to replicate the performance outside of ceph using the same technique that sage used a couple months back with dd, appending a couple of bytes to a file, and syncing. I'm thinking maybe the workload generator would do a better job.
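Note: a rough sketch of the "append a couple of bytes and sync" pattern mentioned above, for timing small synchronous appends on a given disk. It is not the original dd-based test, and the function name, payload, and iteration count are assumptions chosen for illustration.

```python
import os
import time


def append_and_sync(path, payload=b"xy", iterations=100):
    """Append a few bytes and fsync after every write; return elapsed seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    start = time.perf_counter()
    try:
        for _ in range(iterations):
            os.write(fd, payload)
            os.fsync(fd)    # force each tiny append to stable storage
    finally:
        os.close(fd)
    return time.perf_counter() - start
```

Running it against a file on the disk under test and dividing the iteration count by the returned time gives a crude synced-appends-per-second figure. As the discussion notes, a single-file pattern like this may not reproduce what the filestore does.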
[14:05] * Leseb (~Leseb@ has joined #ceph
[14:06] <Leseb> hi all!
[14:15] <nhm_> good morning leseb
[14:15] <joao> nhm_, iirc, sage was mimicking the same behavior that the workload generator did
[14:16] <nhm_> joao: the problem I'm having is that I can't seem to recreate the behavior of the filestore doing it that way, at least at small IO sizes.
[14:20] <Leseb> nhm_:morning? it's afternoon here :D
[14:21] <nhm_> Leseb: to each their own timezone. :)
[14:21] <Leseb> yep :)
[14:26] * gregorg (~Greg@ has joined #ceph
[14:36] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[15:03] <joao> nhm_, words to live by
[15:03] * Qu310 (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Read error: Connection reset by peer)
[15:03] * Qten (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[15:20] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[15:34] * lofejndif (~lsqavnbok@04ZAAELPB.tor-irc.dnsbl.oftc.net) has joined #ceph
[15:47] * Tv (~tv@cpe-24-24-131-250.socal.res.rr.com) has joined #ceph
[15:52] <joao> nhm_, sorry, I missed your last phrase back there
[15:52] <joao> what do you mean you can't recreate the filestore's behavior?
[15:52] <joao> how off is it?
[15:55] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[15:55] <joao> is the tracker down for anyone else besides me?
[15:59] <dspano> joao: I can't get to ceph.com at all.
[15:59] <Leseb> +1
[16:00] <Leseb> http://www.downforeveryoneorjustme.com/http://ceph.com/
[16:00] <dspano> Lol.
[16:03] <Tv> i think it's the database connection again..
[16:05] <joao> good
[16:06] <joao> at least it isn't some severed transatlantic connection
[16:08] <Tv> not blaming this one on squids
[16:09] <joao> usually, "they" blame ship anchors
[16:10] <joao> such as when three ships decided to drop their anchors on top of the three top submarine cables in the Mediterranean (was it?)
[16:10] <joao> all in the same week
[16:10] <Tv> government coverup
[16:11] <joao> I like to believe so
[16:11] <joao> makes life more interesting
[16:11] <Tv> to hide our new sea dwelling overlords
[16:11] <joao> lol
[16:13] <joao> so, what's the best way to figure out in which branch I stashed my work?
[16:13] <joao> any ideas?
[16:14] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[16:14] <joao> nevermind, found it
[16:15] <joao> apparently 'git stash list' shows all the stashes in all branches
[16:16] <dspano> Tv: lol
[16:16] <Tv> joao: stashes aren't really on branches
[16:16] <joao> yeah, had no idea
[16:17] <joao> I thought they would be kept on a per-branch basis
[16:17] <joao> I guess this makes more sense tbh
[16:17] <joao> got to stash it on a branch and pop it onto the newer, rebased one
[16:19] * f4m8 (f4m8@kudu.in-berlin.de) has joined #ceph
[16:19] * f4m8_ (f4m8@kudu.in-berlin.de) Quit (reticulum.oftc.net kilo.oftc.net)
[16:19] * f4m8_ (f4m8@kudu.in-berlin.de) has joined #ceph
[16:21] * f4m8_ (f4m8@kudu.in-berlin.de) Quit (Ping timeout: 480 seconds)
[16:29] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Ping timeout: 480 seconds)
[16:31] * lofejndif (~lsqavnbok@04ZAAELPB.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[16:38] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:39] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit ()
[16:39] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:41] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit ()
[16:42] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:57] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:00] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[17:01] * BManojlovic (~steki@ has joined #ceph
[17:15] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:20] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) Quit (Read error: Connection reset by peer)
[17:42] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) has joined #ceph
[18:02] * BManojlovic (~steki@ has joined #ceph
[18:03] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:07] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[18:07] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:07] <nhm_> joao: can I turn off xattr operations entirely with the workload generator?
[18:09] <joao> I think so, let me check
[18:10] <joao> try with --test-suppress-ops oc
[18:11] <joao> 'o' for 'write xattr on objects' and 'c' for 'write xattr on colls'
[18:12] <nhm_> sweet, thanks
[18:12] <joao> nhm_, check out '--help'
[18:12] <joao> I think it should output the usage
[18:13] <nhm_> joao: oh duh, I failed to read the section at the bottom and was trying to figure out what ops I could suppress. :)
[18:14] <joao> lol :p
[18:14] * loicd (~loic@173-12-167-177-oregon.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[18:14] <joao> well, coffee run
[18:14] <joao> and gotta turn the A/C on
[18:15] <joao> looks like summer arrived in full force
[18:15] <joao> :\
[18:15] <nhm_> joao: hrm, seems to be hanging
[18:28] * Cube (~Adium@ has joined #ceph
[18:30] <joao> nhm_, where?
[18:31] <joao> haven't touched it in a while, no idea if something's changed, but will take a look at it
[18:33] <joao> nhm_, for some reason, if a cli option is wrong, it will hang trying to parse it
[18:33] <nhm_> joao: it appears to hang if you mispel suppress. ;)
[18:33] <nhm_> joao: yep
[18:33] <joao> yeah, I have no idea where that comes from
[18:35] <nhm_> interesting, even after suppressing ocl, I still see bad behavior.
[18:41] * loicd (~loic@ has joined #ceph
[18:42] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[18:44] <nhm_> joao: btw, thanks for writing this. It's going to help dramatically. Have you ever profiled it while it's running?
[18:45] <joao> no, I didn't
[18:45] <joao> I created some movies when we were in LA
[18:45] <joao> but tbh, have no idea if I still got them
[18:45] * mdxi (~mdxi@74-95-29-182-Atlanta.hfc.comcastbusiness.net) has joined #ceph
[18:46] <joao> can take a look, if you'd like to have them
[18:48] <nhm_> joao: don't worry about it. One of the interesting things here though is that there is performance degradation and high iowait no matter if I suppress ocl, or if I suppress d.
[18:49] <nhm_> and this is with a journal on a second drive.
[18:49] * aliguori (~anthony@ has joined #ceph
[18:50] <joao> nhm_, I've exposed these concerns back then, and I still can't but wonder if that's a problem with the test itself or not
[18:50] <nhm_> joao: I don't think so. It matches what I see in ceph.
[18:50] <joao> okay then
[18:50] <joao> that's a bummer though
[18:51] <nhm_> joao: what hardware were you testing on?
[18:51] <joao> planas
[18:51] <nhm_> joao: the good news though is that it seems to be specific to the filestore.
[18:51] <joao> nhm_, that's not very reassuring though :p
[18:51] <joao> the filestore is one of the cornerstones
[18:52] <joao> but at least it is pinpointed to one component ;)
[18:53] <nhm_> joao: I think if it came down to it, I'd rather it be in the filestore than in the messenger or some other higher up layer.
[18:54] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[18:54] <joao> true
[18:55] <joao> or be scattered throughout the code base
[18:56] * loicd (~loic@ Quit (Quit: Leaving.)
[18:56] <nhm_> joao: let me tell you again that this thing is fantastic. I should have started using it earlier.
[18:56] <nhm_> joao++
[18:58] * lxo (~aoliva@83TAAHI0Z.tor-irc.dnsbl.oftc.net) Quit (Read error: No route to host)
[19:02] <joao> I'm glad it's being put to use :)
[19:02] * nymous (~darthampe@93-181-194-222.pppoe.yaroslavl.ru) has joined #ceph
[19:02] <nymous> hello people
[19:03] <nymous> tried cephfs 0.48 for live migration on openstack
[19:03] <nymous> can't even start an instance lol
[19:03] <nhm_> nymous: doh!
[19:04] <gregaf> I didn't think that OpenStack supported live migration yet
[19:04] <nymous> whole cluster acts unresponsive with several answers per 5 minute
[19:04] <nymous> got lots of slow request warnings in the logs
[19:04] <gregaf> okay, that sounds more like a Ceph problem :) :(
[19:05] <nymous> that's weird, because rados benchmark went normal and show average ~40 MB/s
[19:05] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:07] <nymous> i've looked into maillist and found that the problem was likely introduced in 0.48
[19:07] <gregaf> nymous: what does your cluster look like, etc?
[19:07] <nymous> should i try to downgrade?
[19:08] <sagewk> um, i guess i'm upgrading redmine now
[19:08] <sjust> nymous: no, there was a disk format change
[19:08] <nymous> gregaf: 4 nodes, core i7 + 32 GB RAM, 2 disks xfs formatted as osds on each
[19:08] * dmick (~dmick@ has joined #ceph
[19:08] <nymous> gigabit ethernet network interconnect
[19:08] <nymous> so, 3 mons, 4 mds, 8 osds
[19:09] <gregaf> okay, not going to help at all but if you're just using RBD you don't need any MDSes :)
[19:09] <gregaf> what's the rados bench summary look like?
[19:09] <sjust> nymous: can you add debug optracker = 20?
[19:10] <sjust> if you can then reproduce the slowness, the logs should be able to tell me what's going on
[19:10] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[19:11] * nhm_ wonders how he was missing kcachegrind
[19:11] <sjust> oh, also debug journal = 10 and debug filestore = 20
[19:11] <nhm_> sjust: btw, I may have a lot of work for you soon. :P
[19:11] <sjust> nhm_: going on vacation next week :)
[19:11] <nhm_> sjust: that's what you think! ;)
[19:12] <nhm_> sjust: unless you want me playing in the FS. >:)
[19:12] <sjust> heh
[19:12] <sjust> what's up?
[19:13] <nymous> sjust: yes, i can, but currently whole cluster is just stuck
[19:13] <nymous> it is happening right now)
[19:13] <sjust> k
[19:13] * joshd (~joshd@ has joined #ceph
[19:13] <nhm_> sjust: I think the controller issue was a red herring. It seems to have been related to an interaction with the drive cache and the controller. I managed to get fio performance up, but it didn't fix ceph. So I decided to try to replicate what sage was doing with dd, xattr, and small file appends. I could get tiny glimpses of the problem we were seeing before, but nothing like what ceph would produce.
[19:14] <sjust> hmm
[19:14] <nhm_> sjust: so this morning I got the workload generator up and going, and have been able to produce the same problem with ease. Interestingly it happens despite suppressing xattr operations and pg_log appends.
[19:14] <nhm_> sjust: so only data operations are going through.
[19:15] <nymous> oh, it just responded me with HEALTH_OK
[19:15] <sjust> what does ceph -s output?
[19:15] <nymous> but stuck on next command)
[19:15] <nhm_> sjust: so now I'm going to profile it and see what the hell it's doing.
[19:15] <sjust> ok
[19:16] <nymous> i got lots of
[19:16] <nymous> [WRN] slow request 167.119426 seconds old, received at 2012-07-19 18:45:23.021527: osd_sub_op(client.5508.0:16991 0.91 da78ea91/1000000053f.000003e2/head//0 [] v 83'27 snapset=0=[]:[] snapc=0=[]) v7 currently started
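Note: a small sketch for pulling the age and operation type out of slow request warnings like the one above, for example to see whether they keep growing. The regular expression is an assumption based only on this one sample line; other Ceph versions may format the warning differently.

```python
import re

# Pattern inferred from the single sample line in this log (an assumption).
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, "
    r"received at (?P<received>\S+ \S+): (?P<op>\w+)\("
)


def parse_slow_request(line):
    """Return (age_seconds, received_timestamp, op_type), or None if no match."""
    match = SLOW_REQUEST.search(line)
    if match is None:
        return None
    return float(match.group("age")), match.group("received"), match.group("op")


sample = (
    "[WRN] slow request 167.119426 seconds old, received at "
    "2012-07-19 18:45:23.021527: osd_sub_op(client.5508.0:16991 0.91 "
    "da78ea91/1000000053f.000003e2/head//0 [] v 83'27 snapset=0=[]:[] "
    "snapc=0=[]) v7 currently started"
)
print(parse_slow_request(sample))
```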
[19:16] <nhm_> sjust: also interesting, that with the workload generator, and no xattr or pg_log operations, we can't get past about 5MB for 4KB operations.
[19:16] <nhm_> sorry 5MB/s
[19:21] <nymous> still non responsive
[19:21] <nymous> last time i had to reboot all nodes to solve such behaviour
[19:21] <nhm_> nymous: everyone in a meeting for about 20mins
[19:25] <nymous> what's your local time?
[19:26] * chuanyu_ (chuanyu@linux3.cs.nctu.edu.tw) has joined #ceph
[19:27] * chuanyu (chuanyu@linux3.cs.nctu.edu.tw) Quit (Ping timeout: 480 seconds)
[19:29] * newtontm (~jsfrerot@charlie.mdc.gameloft.com) has joined #ceph
[19:31] <newtontm> Hi, Is this a good place here to get help to setup a ceph test setup ?
[19:32] <newtontm> I have currently installed and configured 3 servers with ceph, every servers currently runs osd and mon, and 1 server only is running mds
[19:32] <newtontm> when I run "ceph health" it just hangs there without saying anything
[19:32] <newtontm> any clue ?
[19:33] <Leseb> start your daemons?
[19:33] <newtontm> they are started...
[19:33] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Quit: LarsFronius)
[19:33] <newtontm> /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /tmp/ceph.conf.8712
[19:33] <Leseb> no offense, the last time someone told me that I didn't start his OSDs
[19:33] <newtontm> /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /tmp/ceph.conf.8712
[19:33] <Leseb> *he didn't
[19:34] <sjust> nymous: what does ceph -s output?
[19:34] <nymous> did you wait for response?
[19:34] <newtontm> how long should I wait ?
[19:34] <joao> any "hunting for mon" messages?
[19:34] <newtontm> i figured that after a minute, it's not going to reply anything
[19:34] <nymous> sjust: i've rebooted whole cluster, because it stuck and didn't reply to anything, even ceph -s
[19:35] <Leseb> you should not have to wait for an answer
[19:35] <nymous> now it says HEALTH_OK
[19:35] <newtontm> i'm currently running "ceph -s" and i have no reply
[19:35] <sjust> hmm, that's not related to the slowness that's been reported on the list, actually
[19:35] <nymous> newtontm: sometimes it takes minutes to respond in my case(
[19:36] <joshd> ceph -s and ceph health just connect to the monitor cluster, they shouldn't take long
[19:36] <joshd> are your monitors running and accessible from where you're running ceph health?
[19:36] <newtontm> speaking about mon process, i see this in the logs
[19:36] <newtontm> 2012-07-19 17:36:37.773207 7fdcc9641700 1 mon.b@0(probing) e0 discarding message auth(proto 0 26 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
[19:38] <newtontm> and I can reach port 6789 from/to all my ceph servers
[19:38] <gregaf> newtontm: okay, that looks like your monitors aren't forming a quorum together; how did you set them up?
[19:38] <gregaf> and do they all have the same ceph.conf, and what does it look like? (pastebin, please!)
[19:38] <newtontm> nc -vz ceph01 6789
[19:38] <newtontm> ceph01.mdc.gameloft.org [] 6789 (?) open
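Note: a minimal Python equivalent of the nc -vz reachability check above, useful for looping over several monitor hosts. The function name and timeout are assumptions; 6789 is the monitor port mentioned in the conversation.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, port_open("ceph01", 6789) mirrors the nc check. An open port only shows that a monitor process is listening; it does not show that the monitors have formed a quorum.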
[19:39] <nhm_> sjust: What's LFNIndex?
[19:39] <nymous> sjust: i could try to reproduce with more debug verbosity
[19:39] <Leseb> ceph mon stat ?
[19:39] <sjust> nhm_: it does a translation to handle objects with names longer than the filesystem allows
[19:39] <nymous> just tell me where should i put this ops
[19:40] <newtontm> gregaf: I used the mkcephfs script
[19:40] <sjust> for normal objects, it doesn't really come into play
[19:40] <newtontm> I figured out how to successfully launch it
[19:40] <nhm_> sjust: granted this is under valgrind so it may not be a problem under normal use, but it seems that lfn_open is taking up a lot of CPU time.
[19:40] <sjust> that is, objects with small names
[19:40] * chutzpah (~chutz@ has joined #ceph
[19:41] <newtontm> gregaf: yes they do have the same config... (what is pastebin ?)
[19:41] <gregaf> it's a website :)
[19:41] <sjust> nymous: those options go in the ceph.conf osd section
[19:42] <newtontm> http://pastebin.com/B9wuZxvK ?
[19:42] <gregaf> newtontm: can you look at the logs for your monitors and see if they output a line that includes something like "init fsid fe9b9e44-aece-4a21-ae1f-421dbac677cd"?
[19:42] <sjust> nhm_: how many files are you creating?
[19:43] <newtontm> gregaf: mon.c@-1(probing) e0 init fsid c8b1ffc0-b189-4415-94f8-69c8f8018290
[19:43] <newtontm> mon.c@-1(probing) e0 my rank is now 1 (was -1)
[19:44] <nymous> sjust: should i update all nodes or just one is enough?
[19:44] <sjust> nymous: one should be adequate
[19:45] <sjust> nhm_: try turning filestore_split_multiple to 4000
[19:45] <gregaf> newtontm: do they all have the same fsid listed?
[19:45] * ninkotech (~duplo@ Quit (Remote host closed the connection)
[19:45] <newtontm> gregaf: let me check
[19:45] <nhm_> sjust: I just ran for a while and then did a ctrl+c, so I don't have a really good idea yet. I'll do more refined tests in a bit. For these two runs, it looks like during a 128k bs test it was called 41,782 times, and on a 4k test it was called 38,728 times.
[19:45] <sjust> it should be called once per write, I think
[19:46] <nymous> [osd]
[19:46] <nymous> debug optracker = 20
[19:46] <nymous> debug journal = 10
[19:46] <nymous> debug filestore = 20
[19:46] <nymous> right?
[19:46] <sjust> yeah
[19:46] <gregaf> newtontm: oh, and this just occurred to me: are the clocks synchronized on your monitors?
[19:46] <newtontm> gregaf: they all have the same fsid
[19:46] <nymous> ok
[19:46] <nhm_> sjust: the hash is growing for every write?
[19:46] <newtontm> gregaf: yes they are
[19:47] <sjust> growing?
[19:48] <sjust> it's the call that gets the file handle for the file
[19:48] <gregaf> newtontm: okay, I don't see anything obvious in the config. Can you add "debug ms = 1" and "debug mon = 20" lines to your ceph.conf and restart the monitors?
[19:48] <sjust> it also handles file creation and updating the collection structure
[19:48] <newtontm> gregaf: sure
[19:48] <gregaf> that will generate a lot of output, including all the messages they're passing, that should tell us why they aren't forming a quorum
[19:48] * ninkotech (~duplo@ has joined #ceph
[19:48] <nymous> HEALTH_OK
[19:49] <newtontm> gregaf: in the global config right ?
[19:49] <nymous> starting an instance on fs
[19:49] <sjust> k
[19:49] <gregaf> or the mon section, yeah
[19:49] <newtontm> k
[19:49] <nymous> i've started ceph -w to monitor...
[19:50] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[19:50] <nymous> 0 monclient: hunting for new mon
[19:51] <nymous> hm... /var/log/ceph/ceph.log doesn't look more verbose
[19:51] <sjust> it would be osd.?.log
[19:51] <nhm_> sjust: sorry, I saw it was calling created a bunch and thought it was create. there's a decent amount of time spent in complete_split, which sounded like a growing hash.
[19:51] <sjust> ah, it is
[19:51] <sjust> it's splitting a bucket
[19:51] <newtontm> gregaf: how do you restart just the mon processes? I'm going to use service ceph -a stop/start
[19:51] <sjust> ok, so these are all file creations/
[19:51] <sjust> ?
[19:52] <sjust> yeah, try the config I mentioned above, it'll prevent it from splitting until a ludicrous number of objects build up
[19:52] <nymous> cluster went unresponsive
[19:52] <gregaf> newtontm: yeah, I think that should do it (if it doesn't, you may need to do it locally on each node)
[19:52] <sjust> by cluster, do you mean admin commands?
[19:53] <gregaf> oh, just the mons. hrm, I think you can just say "service ceph -a restart mon"?
[19:53] <newtontm> gregaf: done
[19:53] <nymous> yes, both admin commands and file ops like ls or df
[19:53] <nymous> on mounted fs
[19:53] <sjust> oh, not rbd then
[19:53] <sjust> or yes rbd inside the vm?
[19:54] <gregaf> newtontm: okay, first let's make sure you can't connect to them: run "ceph -s -m IP", where IP is the address for one of your mons
[19:54] <gregaf> (actually, try all three)
[19:54] <gregaf> but don't wait for it to time out, just let it sit for a second to make sure it's not working
[19:54] <nymous> i'm using rbd for image storage... but instance is created on fs
[19:54] <gregaf> then we can go look through one of the logs and see what messages are exchanged
[19:54] <gregaf> I can walk you through it or you can pastebin it or zip it up and email me (depending on size)
[19:55] <nymous> because openstack doesn't support rbd live migration yet
[19:55] <newtontm> gregaf: doesn't seem to work...
[19:55] <nymous> oh, osd.log is now sooo verbose
[19:55] <sjust> yeah, but the problem isn't there
[19:55] <sjust> if ceph -s isn't working than nothing else is intereseting
[19:55] <sjust> *interesting
[19:56] <nymous> ceph -w is still alive
[19:56] <newtontm> i'll try to use pastebin
[19:56] <gregaf> k
[19:56] <sjust> oh, ok
[19:56] <sjust> can you pastebin the output of ceph -s?
[19:56] <nymous> it says [WRN] slow request 37.099792 seconds old
[19:56] <nymous> then i issued ceph -s, new line 0 monclient: hunting for new mon appeared
[19:56] <newtontm> gregaf: want the logs of the 3 servers or just one ?
[19:57] <gregaf> just one should be enough
[19:57] <sjust> gregaf: any ideas?
[19:57] <gregaf> all the symptoms I've noticed from your conversation are "I get slowness warnings, and then ceph -s stops working eventually"
[19:57] <gregaf> was there anything else?
[19:58] <sjust> not that I've seen
[19:58] <gregaf> from those symptoms... are clocks synchronized? are your monitors and OSDs sharing disks? since they're on xfs, are your nodes new enough to have syncfs?
[19:58] <nymous> http://pastebin.com/PWvDVVk8
[19:59] <gregaf> for a more targeted approach, turn on mon logging and find out why the monitors aren't responding to a ceph -s command; are they dropping out of quorum?
[19:59] <nymous> sometimes it says about laggy mds, but sometimes it just reply with HEALTH_OK
[19:59] <sjust> gregaf: looks like ceph -s is working
[19:59] <sjust> ok, can you convey that verbose osd log to use?
[20:00] <gregaf> anyway, so far it looks like everything being slow, which makes me think that's why the monitors are unhappy too, so I'd look into whether the nodes are just overloaded
[20:00] <gregaf> do they have a high iowait?
[20:00] <newtontm> gregaf: http://pastebin.com/q8yCKDXj
[20:00] <gregaf> or cpu/mem usage?
[20:00] <sjust> *to us
[20:00] <newtontm> gregaf: brb in 5 minutes
[20:03] <nymous> monitors got re-elected, mds became up, slow request is still slowering
[20:03] <nymous> now it's 190 secs
[20:03] <nhm_> sjust: ok, changes dramatically lowered that lfn stuff as a hotspot, now all the cpu time is spent in crc32c
[20:04] <sjust> where?
[20:04] <nhm_> in prepare_single_write
[20:04] <sjust> journal
[20:04] <sjust> ok
[20:04] <nhm_> not sure any of this actually matters though.
[20:04] <sjust> that's a good sign
[20:04] <sjust> did it actually improve throughput?
[20:04] <nhm_> in valgrind yes, for real, no.
[20:04] <sjust> ok, that's interesting
[20:05] <sjust> valgrind might actually nuke cpu performance enough to make it matter
[20:05] <nhm_> yeah, that's my thought too.
[20:05] <nhm_> so if we have people running on ancient hardware, we can tell them to watch out for it. ;)
[20:06] <sjust> however, that config option should have just about taken the file hashing out of the equation,
[20:06] <sjust> so it should have had an effect on the non-valgrind throughput
[20:06] <nhm_> if anything the non-valgrind throughput degraded faster, though it's tough to get consistent results.
[20:07] <nhm_> might have just been noise.
[20:09] <newtontm> gregaf: back
[20:09] <Tv> nhm_: i run out of my depth with out btrfs performance on the latest email from Calvin Morrow, "Poor read performance in KVM", can you step in?
[20:09] <Tv> *our
[20:10] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[20:11] <nymous> looks like mds is constantly crashing and recovering after a while
[20:13] <nhm_> Tv: sure
[20:13] <nymous> http://pastebin.com/d3B8B4y1 my current config
[20:16] <nymous> rebooted whole cluster again...
[20:17] * loicd (~loic@ has joined #ceph
[20:19] * Ryan_Lane (~Adium@ has joined #ceph
[20:19] * wido_ (~wido@rockbox.widodh.nl) has joined #ceph
[20:20] * loicd (~loic@ Quit ()
[20:20] * BManojlovic (~steki@ has joined #ceph
[20:21] <gregaf> newtontm: sorry, I'm looking but there's something very bizarre happening here
[20:21] <gregaf> can you also put up the logs from the other two?
[20:21] <wido_> quit
[20:21] * wido_ (~wido@rockbox.widodh.nl) Quit ()
[20:22] * widodh (~widodh@minotaur.apache.org) Quit (Quit: leaving)
[20:22] <gregaf> the monitors successfully form a quorum, but it looks like mon.b forgets about it somehow and then they all get into a fight over it
[20:23] * wido_ (~wido@rockbox.widodh.nl) has joined #ceph
[20:23] * wido_ is now known as widodh
[20:23] * widodh is now known as wido
[20:24] <newtontm> gregaf: sure
[20:26] <gregaf> also, what version are you running? I'd expect to see it printed out at the top of the log but it's not there
[20:28] <newtontm> gregaf: http://pastebin.com/3LCz1etV
[20:28] <newtontm> gregaf: 0.48argonaut
[20:29] <newtontm> gregaf: http://pastebin.com/ExsCRFj2
[20:29] <sagewk> tracker is back
[20:30] <sagewk> i hate ruby
[20:30] <sagewk> that is all
[20:31] <Tv> sagewk: can i interest you in some chef/crowbar work?-)
[20:31] * loicd (~loic@ has joined #ceph
[20:31] <sagewk> so tempting..
[20:33] <sjust> nymous: osd.1 appears not to be the slow one, can you do the same to all osds and upload all of the logs?
[20:33] <sjust> also, add debug osd = 20
[20:33] * loicd (~loic@ Quit ()
[20:33] * loicd (~loic@ has joined #ceph
[20:33] <nymous> hm... they should be exactly same...
[20:33] <sjust> yeah, all of the slow requests on osd.1 were waiting on other osds, I need to logs from those
[20:34] <sjust> *need the logs
[20:34] <nymous> which osds is it waiting?
[20:35] * LarsFronius (~LarsFroni@95-91-243-240-dynip.superkabel.de) has joined #ceph
[20:36] <sjust> what is the output of 'ceph pg map 0.cf' ?
[20:37] <nymous> osdmap e159 pg 0.cf (0.cf) -> up [1,7,2] acting [1,7,2]
[20:37] <sjust> ok, I need at least 7 and 2 as well
[20:38] <nymous> ok, it would take couple of minutes
[20:38] <sjust> ok, thanks
[20:41] <gregaf> newtontm: hmm, it looks like not all the messages that mon.c is sending to mon.b are reaching it
[20:41] <dspano> If anyone can answer, it's just for my own curiosity. When I removed an OSD, I was only able to re-add it without the cluster locking up if the weight was different from the other OSD. I.E. one was 1.0 and the other was 2.0.
[20:41] <gregaf> can you check out that network connectivity more carefully?
[20:41] <gregaf> bbiab, lunchtime
[20:41] <dspano> After adding it, I could re-weight to match the other.
[20:44] <joshd> dspano: that sounds very strange could you file a bug with the commands you ran to remove and re-add it?
[20:45] <joshd> and by 'cluster locking up', did it do that because it ran out of space on the remaining osds?
[20:46] <dspano> joshd: It would get stuck on peering with half the pgs, and not really log anything else.
[20:47] <dspano> joshd: I was using the packages that came with Ubuntu 12.04 (0.41). I'm using 0.48argonaut now. I'll tell you if it happens with the new packages.
[20:47] <joshd> if there were writes in progress to the osd when you removed it that would make sense
[20:48] <joshd> but only making progress with a different weight is odd
[20:48] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[20:49] <dspano> joshd: I was beginning to suspect earlier that it was because I shut the OSD down before removing it from the crush map.
[20:50] <joshd> yeah, usually you should remove it from the crush map first
[20:51] <joshd> I thought we had checks for that
[20:52] <joshd> you can also gradually lower its weight until it's zero to slowly migrate data off it
[20:57] <newtontm> gregaf: ok, seems i see some packet dropped too, let investage on this and i'll come back to you
[20:57] <dspano> joshd: Must be something with the package from Ubuntu, it works fine with 0.48argonaut
[20:57] <newtontm> gregaf: *let me investigate
[20:58] <joshd> dspano: glad it works in argonaut
[20:58] <dspano> joshd: The way ceph -w logs to screen with the new version is much better.
[20:58] <dspano> joshd: It's much clearer what's happening with the pgs than with 0.41
[20:59] <joshd> good to hear
[21:00] <joshd> it's hard for me to remember what 0.41 was like. 6 months is a long time
[21:01] <dspano> joshd: I totally understand. That's why I was saying it was just out of curiosity.
[21:02] <dspano> joshd: If I were to change the weight on one from 1.0 to say 0.5 then 0, it would migrate all the data away from that particular OSD? I'm just beginning to read about CRUSH.
[21:02] <joshd> yeah
[21:02] <dspano> joshd: Would there be any indication that it had moved it in the logs?
[21:03] <joshd> yeah, although it's more clear just by looking at the data directories
[21:03] <dspano> joshd: That's awesome.
[21:03] <joshd> setting the weight to 0 is the same as marking it out (but not down)
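Editor's note: joshd's suggestion is to lower an OSD's weight in stages until it reaches zero, so data drains off gradually, and dspano's example is 1.0, then 0.5, then 0. A minimal sketch of generating such a schedule follows. The function name `weight_steps` and the fixed step size are assumptions made for illustration; the log does not give the command used to apply each weight, so none is shown here.

```python
def weight_steps(start, step):
    """Return the successive weights to apply, ending at 0.0.

    `start` is the OSD's current weight and `step` is how much to lower it
    each round; both are illustrative parameters, not Ceph settings.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    steps = []
    weight = start
    while weight > 0:
        # Round to avoid float drift and clamp so we never go below zero.
        weight = max(0.0, round(weight - step, 6))
        steps.append(weight)
    return steps


if __name__ == "__main__":
    # dspano's example: from 1.0 down to 0.5, then to 0.
    print(weight_steps(1.0, 0.5))
```

In practice one would wait for the cluster to finish migrating data after each step before applying the next weight.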
[21:10] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:10] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[21:10] * Leseb_ is now known as Leseb
[21:11] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:11] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[21:11] * Leseb_ is now known as Leseb
[21:13] <nymous> sjust: uploaded finally
[21:15] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:15] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[21:15] * Leseb_ is now known as Leseb
[21:16] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:16] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[21:16] * Leseb_ is now known as Leseb
[21:22] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:22] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[21:22] * Leseb_ is now known as Leseb
[21:28] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[21:28] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[21:28] * Leseb_ is now known as Leseb
[21:30] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[21:33] * loicd (~loic@ Quit (Read error: Connection reset by peer)
[21:33] * loicd1 (~loic@ has joined #ceph
[21:48] <sjust> nymous: looking
[21:53] * loicd (~loic@ has joined #ceph
[21:53] * loicd1 (~loic@ Quit (Read error: Connection reset by peer)
[21:56] <gregaf> newtontm: let me know if clearing up the network doesn't fix it; if you're seeing packets get dropped then the monitors are essentially doing what they're supposed to
[21:57] <newtontm> gregaf: I'm having a mtu issue, my setup is complicated we are using jumbo frame with xen, vlan and bridges... I haven't fixed it yet, but will soon !
[21:58] <gregaf> okay, glad to hear we probably don't have a bug on our end :)
[21:58] <gregaf> and it's good to know that this is the kind of thing we should try and report more clearly if possible
[21:59] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:01] <Tv> so with a decently-sized vm on vercoi, a full ceph build is ~4min, full debian package build is ~5m45s
[22:01] * lofejndif (~lsqavnbok@anon.xmission.com) has joined #ceph
[22:02] <joshd> wow, that would speed up the test cycle a lot
[22:03] <Tv> on current gitbuilders, that's ~25min and ~10min (not on the same vm so those two numbers aren't comparable)
[22:03] <Tv> so i'm actually thinking of not even doing ccache etc ;)
[22:03] <gregaf> can we set up enough of those VMs on vercoi?
[22:04] <Tv> i gave it 8gig and 8 cores, out of 24 & 24
[22:04] <Tv> so we could max out at 3 per box -> 24, right now
[22:04] <Tv> with no other work being done etc
[22:04] <Tv> but part of this is asking whether we can work smarter not harder with how the gitbuilders are set up
[22:05] <Tv> i see we have 10 ceph gitbuilders in the old system
[22:05] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:05] <Tv> poking around more, just figured i'd share good news
[22:06] * loicd (~loic@ Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * acaos (~zac@209-99-103-42.fwd.datafoundry.com) Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * sdouglas (~sdouglas@c-24-6-44-231.hsd1.ca.comcast.net) Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * jantje (~jan@paranoid.nl) Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * Solver (~robert@atlas.opentrend.net) Quit (resistance.oftc.net synthon.oftc.net)
[22:06] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:06] * loicd (~loic@ has joined #ceph
[22:06] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[22:06] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[22:06] * Solver (~robert@atlas.opentrend.net) has joined #ceph
[22:06] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[22:06] * sdouglas (~sdouglas@c-24-6-44-231.hsd1.ca.comcast.net) has joined #ceph
[22:06] * acaos (~zac@209-99-103-42.fwd.datafoundry.com) has joined #ceph
[22:06] * jantje (~jan@paranoid.nl) has joined #ceph
[22:06] * cclien (~cclien@ec2-50-112-123-234.us-west-2.compute.amazonaws.com) has joined #ceph
[22:06] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:08] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:08] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:08] * Leseb_ is now known as Leseb
[22:11] <nymous> sjust: it's late night here already, so i better off to bed. but i will be back :)
[22:11] <sjust> ok
[22:11] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[22:11] * nymous (~darthampe@93-181-194-222.pppoe.yaroslavl.ru) Quit (Quit: RAGING AXE! RAGING AXE!)
[22:11] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[22:14] * loicd1 (~loic@ has joined #ceph
[22:17] <dmick> Tv: nice. Does that mean it's running -j 16?
[22:18] <Tv> yeah
[22:18] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[22:18] <dmick> that seems to work great on my 4-core box FWIW. memory is the limitation
[22:19] <dmick> watching system perf by eye, it seemed to get close to 6GB, never swapping
[22:19] <Tv> yeah 6GB was not enough for -j16
[22:20] <Tv> on some runs; it depends on what runs in parallel
[22:20] <dmick> true
[22:23] <dmick> yeah, maybe 6.5 or so this run
[22:23] <dmick> that's total, when baseline is 3.2 with my current X session
[22:25] <dmick> must have been about 4 minutes or so for me as well (just "make"). so if you need cores for something, you can probably get by with fewer
[22:25] <Tv> i have a vm with no swap, it dies when it runs out -- makes measuring waterline easy ;)
[22:26] <dmick> oh that's also with debug; may make a difference
[22:29] * loicd1 (~loic@ Quit (Ping timeout: 480 seconds)
[22:47] * lofejndif (~lsqavnbok@1RDAAC4LI.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[22:48] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) Quit (Ping timeout: 480 seconds)
[22:49] * sagewk (~sage@ has joined #ceph
[22:49] * loicd (~loic@ has joined #ceph
[22:51] * sagewk (~sage@ Quit (Remote host closed the connection)
[22:57] * thisnickistaken (~toor@ has joined #ceph
[23:02] <thisnickistaken> hello, I am considering ceph for an HA storage solution, I am trying to see if it is a good fit for a 2 server solution, in other words, I have two storage servers that I would like to make highly available to at least 2 KVM hosts
[23:03] <thisnickistaken> trying to eliminate spof's in our VM solution
[23:07] <Tv> thisnickistaken: running ceph-mon on just 2 machines is futile, because of the paxos majority requirement
[23:07] <Tv> thisnickistaken: for HA, you really want 3
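Editor's note: Tv's point about the paxos majority requirement comes down to simple arithmetic. A quorum needs a strict majority of monitors, so two monitors tolerate no failures while three tolerate one. A minimal sketch follows; the function names are hypothetical and this models only the majority rule, not how ceph-mon actually runs elections.

```python
def quorum_size(num_monitors):
    """Smallest number of monitors that forms a strict majority."""
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors):
    """How many monitors can be lost while a majority still remains."""
    return num_monitors - quorum_size(num_monitors)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(
            "%d mons: quorum needs %d, tolerates %d failure(s)"
            % (n, quorum_size(n), tolerated_failures(n))
        )
```

With two monitors the quorum size is also two, which is why a second monitor adds no fault tolerance over one.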
[23:09] <gregaf> are you looking to use RBD or the filesystem? (CephFS isn't production-ready yet, but RBD should be good)
[23:09] <thisnickistaken> ok, next question, supposing I have 3 servers with 2TB space on each (for example sake), running mon, mds, and osd on each, how much space could be available with the option of having any node in the cluster beat with a hammer at any point? @TB?
[23:09] <thisnickistaken> *2
[23:10] <newtontm> gregaf: ok I fixed my issue, now I got feed back from ceph health !
[23:10] <newtontm> HEALTH_ERR 576 pgs stuck inactive; 576 pgs stuck unclean; no osds
[23:10] <gregaf> newtontm: are your OSDs actually running?
[23:10] <thisnickistaken> gregaf: to answer your question, the other thing I'm looking at is ZVOLs exported via AoE and raided at client end for HA.......in other words svr1 exports zvol1, svr2 exports zvol2 raid1 on client side so could lose either server and still run
[23:11] <Tv> thisnickistaken: you'd need replication level N>=2 for that, available space is roughly 3*2/N
[23:11] <newtontm> gregaf: nope trying to get them working ;) I know they are not
[23:11] <gregaf> (ceph -s outputs a lot more useful information than ceph health, which is really just for ringing alarm bells)
[23:11] <gregaf> okay, so PGs live on OSDs; if the OSDs aren't working, you'll get reports of stale PGs (and creating, if they haven't yet been made anywhere)
[23:12] <gregaf> thisnickistaken: okay, but you want block storage, right?
[23:12] <thisnickistaken> Tv: so, is N in that case the sum of storage on all servers? or the storage on a single server?
[23:12] <thisnickistaken> gregaf: yes, but I would have use for both block and fs level storage if I could swing it
[23:13] <Tv> thisnickistaken: your case would be either N=2 for 3*2/2 TB = 3 TB, or for more robust storage, N=3 for 3*2/3 = 2 TB
[23:13] <Tv> thisnickistaken: total storage divided by replication factor
[23:14] <gregaf> Ceph lets you choose the number of copies of data you want :)
[23:14] <thisnickistaken> ok in the N=2 case, could I lose any of the servers but only 1? and N=3 could lose 2 of them?
[23:14] <gregaf> (but there's no erasure coding/raid5-like system)
[23:14] <gregaf> that's correct, yes
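Editor's note: Tv's rule of thumb is total raw storage divided by the replication factor N, and the exchange above confirms that N copies let the data survive the loss of N-1 servers. A minimal sketch of that arithmetic for the 3 x 2 TB example follows. The function names are hypothetical, and the numbers ignore filesystem and journal overhead as well as the separate monitor quorum requirement.

```python
def usable_capacity_tb(servers, tb_per_server, replicas):
    """Rough usable space: total raw storage divided by replication factor."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return servers * tb_per_server / replicas


def copies_that_can_be_lost(replicas):
    """With N copies of each object, N-1 can be lost and one still remains."""
    return replicas - 1


if __name__ == "__main__":
    for n in (2, 3):
        print(
            "N=%d: about %.1f TB usable, data survives losing %d server(s)"
            % (n, usable_capacity_tb(3, 2, n), copies_that_can_be_lost(n))
        )
```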
[23:15] <thisnickistaken> ok, im starting to get it I think lol
[23:15] <gregaf> newtontm: do you need help with your OSDs, or are you set? :)
[23:15] <thisnickistaken> when you say its not stable yet.....how unstable are we talkin? people losing storage pools?
[23:16] <newtontm> gregaf: i'm rebuilding them with mkcephfs
[23:16] <newtontm> gregaf: i'll let you know if I need help, and btw thx for your time
[23:16] <gregaf> thisnickistaken: depending on your workload, it might be just fine, but there are known bugs that if you hit there's not presently a fix for
[23:17] <gregaf> no loss of data that I'm aware of, but loss of easy access, yes
[23:17] <gregaf> you're welcome, newtontm; thanks for trying out Ceph ;)
[23:17] <thisnickistaken> and in the case of block devices, does each server export them? say I have 2 KVM hosts and each needs to simultaneously see the block devices in order to migrate VMs at the same time, the KVM hosts need to be able to still see them whether or not one of the servers has died
[23:18] <gregaf> Ceph works a little differently than some systems: what happens is you give clients a list of the monitors, and then they go talk to one of them and find out what servers hold the data they need
[23:19] <gregaf> if a server dies, responsibility for the data is shifted to servers which are still alive
[23:20] <gregaf> so, yes, the KVM host could continue accessing disk images without a restart or anything, although depending on timeouts there might be a stall in disk access
[23:20] <thisnickistaken> ok, nice
[23:21] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) has joined #ceph
[23:21] <sagewk> elder: how are you reproducing the rbd EIO thing?
[23:21] <sagewk> not seeing it with a trivial test
[23:21] <thisnickistaken> are the fs components and block device components stored side by side...........in other words, if youre familiar with ZFS, is it like having ZFS fs along side zvols?
[23:22] <gregaf> I think so, but somebody else more familiar with zfs will need to answer that... dmick?
[23:22] <dmick> not sure about the ZFS analogy, but
[23:23] <dmick> the "block device" is just an abstraction that uses the underlying RADOS cluster
[23:23] <dmick> as is the "ceph filesystem"
[23:23] <Tv> thisnickistaken: everything is stored as objects, including cephfs file contents, directory contents, and block device images
[23:23] <elder> sagewk, just a minute, I'll send you my test script.
[23:23] <thisnickistaken> ok, guess I need to do some reading into rados as well then
[23:23] <thisnickistaken> ceph is built on top of rados or the other way around?
[23:23] <Tv> thisnickistaken: but you can think of it as just having pools of storage
[23:24] <Tv> thisnickistaken: it's just instead of each pool being chunks of disk, it's pools of objects
[23:24] <Tv> my new desktop box arrived
[23:24] <thisnickistaken> ok, thats inline with my zfs logic....i think....at least at the high level
[23:24] <dmick> "the ceph filesystem" is built on top of rados. Ceph is the opensource project that encompasses most of these pieces. I usually call "the POSIX filesystem" "cephfs" to distinguish it
[23:24] <Tv> the ubuntu installer is "wiping swap space for security".. on a box with 64GB RAM
[23:25] <thisnickistaken> lol
[23:25] <dmick> but yes, cephfs, rbd (the block device), and other clients are "on top of" the RADOS cluster
[23:25] <dmick> RADOS just stores objects (reliably, redundantly, across many hosts)
[23:26] <elder> sagewk, it's not reproducing now.
[23:26] <sagewk> elder: naturally :)
[23:26] <dmick> is there a quantum physicist in the house?
[23:26] <elder> Actually, I think I might have a clue on it though.
[23:27] <Tv> dmick: yes|no
[23:27] <elder> Looking at what I was using yesterday, I created the snapshot in a script, immediately after creating the image.
[23:27] <thisnickistaken> does rados do hashing etc? and dedup or anything of that nature?
[23:27] <elder> But I now have a short delay to allow udev to catch up.
[23:27] <elder> The problem may have been due to that.
[23:27] <elder> However the problem persisted, so there could still be something wrong, just not as bad as it seemed.
[23:28] <elder> I will try again without my delay.
[23:28] <elder> I'll send you my script by e-mail anyway.
[23:28] <sagewk> k
[23:28] <sagewk> thanks
[23:28] <gregaf> thisnickistaken: there's no dedup; there's hashes all over the place but if you're talking about dedup hashes and content-addressable storage, no
[23:28] <dspano> There's no dedup, but ceph is webscale though.
[23:29] <thisnickistaken> ok I can live without all that, as long as it knows that bad data/metadata is bad :)
[23:29] <thisnickistaken> webscale?
[23:30] <dspano> I was just joking. I was referring to this: http://www.youtube.com/watch?v=b2F-DItXtZs
[23:30] <thisnickistaken> ok lol
[23:31] <thisnickistaken> does it have a scrub-type feature?
[23:31] <gregaf> ah, I see what you were asking
[23:32] <dmick> thisnickistaken: there is a scrub. gregaf, I think it currently compares replica size/attributes, but not contents, and flags but does not repair, correct?
[23:33] <gregaf> right
[23:33] <gregaf> you can do a repair, but it's...weak
[23:33] <dmick> "the primary must be correct"?
[23:33] <gregaf> right now you're really dependent on the underlying filesystem for good data integrity
[23:33] <gregaf> yeah
[23:33] <thisnickistaken> rados lies on top of another filesystem?
[23:34] <dspano> thisnickistaken: http://ceph.com/uncategorized/scrubbing/
[23:34] <dspano> Yes.
[23:34] <gregaf> rados stores its objects in regular ol' linux filesystems: ext4, xfs, btrfs
[23:34] <gregaf> haha, this mongodb video is hilarious
[23:34] <thisnickistaken> ooook, so I could put it ontop of zfs?
[23:35] <dspano> thisnickistaken: Most people use xfs from what I understand.
[23:35] <dmick> we really ought to have an architecture slide deck up by now
[23:35] <dmick> but yes, short answer: most of rados is "daemons running on hosts, accessing disks, storing things with native filesystems". We really only do Linux well at this point, so zfs is a little difficult, but in theory yes
[23:35] <dspano> gregaf: I watched that video twice. People at work must've thought something was wrong with me because I was laughing so hard.
[23:40] <elder> sagewk, I am now unable to reproduce the problem.
[23:40] <elder> Kind of scary, it was very reliable yesterday. Even with old code.
[23:40] <elder> And I don't think I've restarted my servers either.
[23:40] <thisnickistaken> ok thanks for indulging my numerous questions! I'm sure I'll be back
[23:40] <dmick> thisnickistaken: np, that's why we're here
[23:40] <thisnickistaken> bye yall
[23:41] * thisnickistaken (~toor@ Quit (Quit: knowledge hunger satisfied)
[23:42] * adjohn (~adjohn@ has joined #ceph
[23:42] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[23:46] * Tv (~tv@cpe-24-24-131-250.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[23:46] * loicd1 (~loic@ has joined #ceph
[23:47] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[23:48] <nhm_> what's syscall_306?
[23:53] <dmick> nhm_: context?
[23:53] <nhm_> dmick: something strace doesn't understand.
[23:53] <dmick> sigh outofdate Ubuntu
[23:54] <dmick> 64bit seems to be syncfs
[23:54] <dmick> which would match our prior experience
[23:54] <gregaf> oh hey, that's ours! :D
[23:55] <dmick> (from arch/x86/syscalls/syscall_64.tbl)
[23:55] <nhm_> ah, good deal
[23:55] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[23:56] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[23:56] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) Quit (Read error: Connection reset by peer)
[23:57] * Tv (~tv@cpe-24-24-131-250.socal.res.rr.com) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.