#ceph IRC Log


IRC Log for 2012-04-20

Timestamps are in GMT/BST.

[0:00] <alo> nhm: only a bad note... i'm using ubuntu and the packages in precise are too old, and on i686 it crashes when i run mkcephfs. With 0.45 we didn't see this problem.
[0:02] <nhm> alo: good to know. Some of the folks in here that are involved with the precise rollout might be interested in that.
[0:03] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) has joined #ceph
[0:03] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[0:05] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[0:05] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[0:06] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[0:07] * mkampe (~markk@aon.hq.newdream.net) Quit (Remote host closed the connection)
[0:08] * loicd (~loic@204.16.154.194) has joined #ceph
[0:08] * mkampe (~markk@aon.hq.newdream.net) has joined #ceph
[0:20] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Quit: LarsFronius)
[0:21] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) has joined #ceph
[0:25] <alo> node added! :)
[0:26] <alo> but i was copying while the new node started to sync... and the client... kernel panic on libceph :(
[0:27] <alo> here is too late. thanks again and see you tomorrow.
[0:29] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[0:30] <gregaf> cya tomorrow alo
[0:35] * alo (~alo@host90-43-static.242-95-b.business.telecomitalia.it) Quit (Quit: Sto andando via)
[0:36] * loicd (~loic@204.16.154.194) Quit (Quit: Leaving.)
[0:39] <sagewk> joao: i can't reproduce :(
[0:39] <sagewk> the no sync option also isn't working for me.. not sure yet what i'm doing wrong
[0:39] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[0:40] <sagewk> oh, i see
[0:42] * MoXx (~Spooky@fb.rognant.fr) Quit (Read error: No route to host)
[0:42] * BManojlovic (~steki@212.200.243.246) Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:42] * joao-phone (~JL@97.104.166.178.rev.vodafone.pt) has joined #ceph
[0:43] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[0:45] * lofejndif (~lsqavnbok@09GAAE4D7.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[0:50] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[0:53] <dmick> woot. vercoi 10G interfaces can talk to all back and ipmi host interfaces
[0:53] <dmick> (and 1G can talk to front)
[0:53] <sjust> cool!
[0:56] <elder> When do we get our 100G interfaces?
[0:57] <dmick> I'll wire you an account number elder
[1:00] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[1:01] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[1:01] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Connection reset by peer)
[1:02] <elder> OC-768 (about 40 Gbit/sec) is out there. It's still the ironic case that you can get bits across the country at a higher data rate than you can get them across the room.
[1:04] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[1:15] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[1:19] * joao-phone (~JL@97.104.166.178.rev.vodafone.pt) Quit (Remote host closed the connection)
[1:20] <sagewk> joao: sigh, still not able to reproduce (altho i got the journal thing working). pushing some commits that make diff more informative (show _all_ differences, not just the first one).. can you run that on your result dirs??
[1:22] * brambles (brambles@79.133.200.49) Quit (Ping timeout: 480 seconds)
[1:25] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[1:26] * brambles (brambles@79.133.200.49) has joined #ceph
[1:32] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[1:42] * Tv (~tv@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[1:49] * bchrisman (~Adium@108.60.121.114) Quit (Quit: Leaving.)
[1:49] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[1:51] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[1:56] * chutzpah (~chutz@216.174.109.254) Quit (Quit: Leaving)
[2:05] * ivan\ (~ivan@108-213-76-179.lightspeed.frokca.sbcglobal.net) Quit (Quit: ERC Version 5.3 (IRC client for Emacs))
[2:06] * ivan\ (~ivan@108-213-76-179.lightspeed.frokca.sbcglobal.net) has joined #ceph
[2:13] * imjustmatthew (~imjustmat@pool-74-110-201-39.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[2:22] * ferai (~quassel@quassel.jefferai.org) has joined #ceph
[2:26] * jefferai (~quassel@quassel.jefferai.org) Quit (Ping timeout: 480 seconds)
[2:30] * jantje (jan@paranoid.nl) Quit (Read error: Connection reset by peer)
[2:30] * yoshi (~yoshi@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:30] * jantje (jan@paranoid.nl) has joined #ceph
[2:31] <joao> sagewk, oh, now I see why you flipped the returns...
[2:31] <joao> it's true if they are different
[2:31] <joao> and false otherwise
[2:31] <joao> makes sense
[2:32] <joao> and to think I changed a whole method you had previously changed because it had made no sense back then :p
[2:34] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Ping timeout: 480 seconds)
[2:36] <nhm> anyone still around?
[2:36] * Tv (~tv@204-16-154-194-static.ipnetworksinc.net) Quit (Ping timeout: 480 seconds)
[2:40] <joao> sagewk, if we disable journal sync on both stores, then it remains with only a single mismatch; if we disable on only one of them, a whole lot more are showing up now
[2:43] <joao> oh...
[2:43] <joao> sagewk, I forgot to tell you... it's running only over 5 collections
[2:43] <joao> that's probably why you couldn't reproduce it...
[2:44] <nhm> joao: I know I keep harping on this, but how many hours of sleep do you get a night? ;)
[2:46] <dmick> hey he doesn't have to be up until 10:15 PST :)
[2:46] <dmick> PDT, whatevs
[2:47] <nhm> dmick: btw, have you had any time to do more journal testing?
[2:47] <dmick> I have had time, but have been spending it on other things :)
[2:47] <nhm> fair enough
[2:48] <nhm> dmick: Any idea if those SSDs are in one of the hotswap bays?
[2:49] <joao> nhm, I'm usually fine with 6/7 hours
[2:50] <nhm> joao: I used to be able to do that, about 7 is my minimum these days. :(
[2:50] <joao> tonight I must sleep a bit less because I'll have to pick up my passport early in the morning
[2:51] * loicd (~loic@99-7-168-244.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[2:51] <nhm> joao: That's right. Alex and I are trying to figure out when to go out there too. Hopefully we'll overlap and can meet in person.
[2:51] <dmick> nhm: I believe they are; when they were installed, Robert added them to carriers and shoved them all in in about an hour. I can't imagine you could do that with anything but hotswap bays. I'm hard pressed to believe he did it with anything else.
[2:52] <nhm> dmick: excellent. I'm thinking if we temporarily stole some SSDs from other nodes we could set 2 nodes up with 4 SSDs each and try to approach 10G now that you've got the network setup.
[2:52] <joao> nhm, meeting everybody in one go would be great :p
[2:52] <joao> so far I've only met those who went to Germany
[2:52] <joao> in person, I mean
[2:53] <nhm> joao: when do you fly back home?
[2:56] <joao> June 5th, I expect
[2:56] <nhm> oh wow, we'll definitely be there when you are there then.
[3:22] <joao> sagewk, @plana25: ~ubuntu/test_idempotent/run-me.sh
[3:22] <joao> it reproduces what I obtained
[3:22] <joao> and places the results in logs/
[3:23] <joao> well, 'night everyone
[3:23] <joao> o/
[3:23] * joao (~JL@89-181-153-140.net.novis.pt) Quit (Quit: Leaving)
[3:42] * brambles (brambles@79.133.200.49) Quit (Quit: leaving)
[3:43] * brambles (brambles@79.133.200.49) has joined #ceph
[4:15] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[5:28] * Tv (~tv@m900536d0.tmodns.net) has joined #ceph
[5:51] * aa (~aa@r186-52-162-217.dialup.adsl.anteldata.net.uy) has joined #ceph
[5:52] * Tv (~tv@m900536d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[6:29] * wido (~wido@rockbox.widodh.nl) Quit (Remote host closed the connection)
[7:00] -coulomb.oftc.net- *** Looking up your hostname...
[7:00] -coulomb.oftc.net- *** Checking Ident
[7:00] -coulomb.oftc.net- *** No Ident response
[7:00] -coulomb.oftc.net- *** Found your hostname
[7:00] * CephLogBot (~PircBot@rockbox.widodh.nl) has joined #ceph
[7:34] * cattelan is now known as cattelan_away
[8:31] * Tv (~tv@mda0536d0.tmodns.net) has joined #ceph
[9:05] * Tv (~tv@mda0536d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[9:08] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:14] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[9:25] * MoXx (~Spooky@fb.rognant.fr) has joined #ceph
[9:29] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) has joined #ceph
[9:37] * LarsFronius (~LarsFroni@95-91-243-252-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[10:02] * f4m8_ is now known as f4m8
[10:12] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[10:19] * gregaf1 (~Adium@aon.hq.newdream.net) has joined #ceph
[10:26] * gregaf (~Adium@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[10:29] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:41] * LarsFronius_ (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[10:41] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[10:41] * LarsFronius_ is now known as LarsFronius
[11:47] * wido (~wido@rockbox.widodh.nl) has joined #ceph
[11:56] * yoshi (~yoshi@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:32] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[12:41] * MK_FG (~MK_FG@188.226.51.71) Quit (Ping timeout: 480 seconds)
[13:00] * joao (~JL@89.181.153.140) has joined #ceph
[13:33] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[13:39] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Quit: LarsFronius)
[13:40] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[13:56] * andresambrois (~aa@r186-52-131-194.dialup.adsl.anteldata.net.uy) has joined #ceph
[14:02] <joao> nhm, around?
[14:03] * aa (~aa@r186-52-162-217.dialup.adsl.anteldata.net.uy) Quit (Ping timeout: 480 seconds)
[14:11] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[14:31] * mfoemmel (~mfoemmel@chml01.drwholdings.com) Quit (Read error: Operation timed out)
[14:32] * mfoemmel (~mfoemmel@chml01.drwholdings.com) has joined #ceph
[14:52] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[14:58] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[15:23] * joao-phone (~JL@51.127.166.178.rev.vodafone.pt) has joined #ceph
[15:36] * joao-phone (~JL@51.127.166.178.rev.vodafone.pt) Quit (Remote host closed the connection)
[15:47] * f4m8 is now known as f4m8_
[15:52] * cattelan_away is now known as cattelan
[15:52] * andresambrois (~aa@r186-52-131-194.dialup.adsl.anteldata.net.uy) Quit (Remote host closed the connection)
[16:08] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[16:17] * cattelan is now known as cattelan_away
[16:18] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[16:44] <nhm> joao: Sorry, wasn't paying attention. What's up?
[16:47] * cattelan_away is now known as cattelan
[16:49] * BManojlovic (~steki@91.195.39.5) Quit (Remote host closed the connection)
[17:33] * deam (~deam@dhcp-077-249-088-048.chello.nl) Quit (Ping timeout: 480 seconds)
[17:37] * deam (~deam@dhcp-077-249-088-048.chello.nl) has joined #ceph
[18:05] * loicd (~loic@99-7-168-244.lightspeed.sntcca.sbcglobal.net) Quit (Quit: Leaving.)
[18:12] * gregaf1 (~Adium@aon.hq.newdream.net) Quit (Quit: Leaving.)
[18:12] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Remote host closed the connection)
[18:17] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[18:26] <sagewk> joao: around?
[18:28] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[18:33] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit ()
[18:34] <joao> sagewk, yeah
[18:35] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[18:37] * bchrisman (~Adium@108.60.121.114) has joined #ceph
[18:37] <yehudasa> ubuntu 12.04 LTS, codename Nostalgia (TM): when you start feeling nostalgic about the good old days of win 3.1/95/whatever, when you only needed to reboot your desktop twice a day
[18:37] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[18:38] <sagewk> joao: nm, found your logs, looking through them now.
[18:38] <joao> okay :)
[18:38] <joao> I'll be looking into the teuthology task you did
[18:39] * loicd1 (~loic@204.16.154.194) has joined #ceph
[18:39] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[18:40] <joao> I've this crazy idea of having a script that takes a .yaml and performs the test I want on my local machine
[18:41] <joao> yehudasa, I lol'd
[18:42] <yehudasa> joao: I'm crying
[18:43] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[18:46] * matthew_ (~smuxi@pool-96-228-59-72.rcmdva.fios.verizon.net) has joined #ceph
[18:47] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[18:47] * loicd1 (~loic@204.16.154.194) Quit (Quit: Leaving.)
[18:48] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[18:48] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Connection reset by peer)
[18:50] * Tv (~tv@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[18:51] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[18:52] * imjustmatthew (~imjustmat@pool-96-228-59-72.rcmdva.fios.verizon.net) has joined #ceph
[18:53] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[18:55] <imjustmatthew> gregaf: Are you still looking for logs for #2218? A clean cluster just started having mismatch issues.
[18:57] <gregaf> imjustmatthew: do you have logs of it occurring?
[18:57] <gregaf> I haven't fully delved into the last set you provided, but if you've got logs of it happening on a clean cluster that'll be a lot easier to work with
[18:58] <imjustmatthew> Probably, I have mds debug 20 since the cluster was created yesterday except for a brief period when the MDS failed over
[18:58] <gregaf> awesome, yes please!
[18:58] <imjustmatthew> k, I'll upload them this evening
[18:58] <gregaf> thanks
[19:03] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[19:05] <The_Bishop> i've a question: added storage to the pool, reweighted the OSDs, scrubbed all OSDs, but still: pg v155643: 594 pgs: 18 active, 18 active+clean, 547 active+remapped, 11 active+degraded; 423 GB data, 837 GB used, 2370 GB / 3250 GB avail; 2764/239300 degraded (1.155%)
[19:05] * MK_FG (~MK_FG@188.226.51.71) Quit (Remote host closed the connection)
[19:06] <The_Bishop> how to get the cluster de-degraded?
[19:06] <gregaf> The_Bishop: is it changing over time?
[19:07] <The_Bishop> nope, it stays at these numbers - otherwise i would not ask
[19:07] <gregaf> hrmm, what's the full output of ceph -s?
[19:07] <The_Bishop> 2012-04-20 19:07:35.802690 pg v155643: 594 pgs: 18 active, 18 active+clean, 547 active+remapped, 11 active+degraded; 423 GB data, 837 GB used, 2370 GB / 3250 GB avail; 2764/239300 degraded (1.155%)
[19:07] <The_Bishop> 2012-04-20 19:07:35.811299 mds e16031: 2/2/2 up {0=0=up:active,1=1=up:active}
[19:07] <The_Bishop> 2012-04-20 19:07:35.813084 osd e2857: 5 osds: 5 up, 5 in
[19:07] <The_Bishop> 2012-04-20 19:07:35.815063 log 2012-04-20 19:02:52.912847 osd.3 192.168.32.185:6800/2499 687 : [INF] 2.4f scrub ok
[19:07] <The_Bishop> 2012-04-20 19:07:35.816712 mon e2: 1 mons at {0=192.168.32.177:6789/0}
[19:08] * MK_FG (~MK_FG@188.226.51.71) has joined #ceph
[19:08] <sagewk> joao: pushed a fix.. can you test?
[19:09] <joao> sure
[19:10] <gregaf> The_Bishop: how long has it been like that? and you're sure none of the numbers are changing?
[19:10] <gregaf> and how many OSDs did you add?
[19:11] <The_Bishop> i added osd.4 with 240G
[19:12] <The_Bishop> the numbers are steady for around 30min. started scrub on all osd but after scrubbing i see the same numbers
[19:13] * chutzpah (~chutz@216.174.109.254) has joined #ceph
[19:14] <gregaf> you started a full scrub while it was peering?
[19:14] <gregaf> have all the scrubs actually finished?
[19:14] <gregaf> you might have just added enough extra load that things have slowed down a lot
[19:15] <gregaf> bbiab
[19:15] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[19:16] <joao> sagewk, everything's okay :)
[19:16] <The_Bishop> i started the scrub as the cluster was idle and the numbers froze
[19:17] <The_Bishop> right now there is no client connected to ceph
[19:19] <The_Bishop> and the scrub should be finished, no more "[INF] 1.51 scrub ok" messages
[19:20] <The_Bishop> i dont know what the cluster is waiting for
[19:22] * Tv (~tv@204-16-154-194-static.ipnetworksinc.net) Quit (Ping timeout: 480 seconds)
[19:23] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[19:24] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[19:29] <sagewk> joao: still there?
[19:30] <sagewk> joao: i'd prefer filestore_journal_sync_enabled be named something like filestore_plug_sync or block_sync or something that is more clearly a test/dev tool and nothing you'd actually want to do in real life
[19:31] <sagewk> i'm also wondering if we can get the same effect by just setting filestore_min_sync_interval to something very big... did you test that by any chance?
[19:35] <sagewk> joao: i'm also curious why changing the journal flush behavior has any effect on the clean 'b' store, which never experiences a failure...
[19:36] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[19:36] <joao> let me read everything
[19:37] <joao> I'm also quite against using 'filestore_journal_sync_enabled', both as the option name and the variable
[19:37] <joao> it was one of those things for which I couldn't come up with anything better
[19:38] <joao> haven't tested the filestore_min_sync_interval
[19:38] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[19:38] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[19:39] <joao> sagewk, I guess that the effect on 'b' is the same as on 'a', except we usually fail to apply a transaction onto the store on 'a' (but it is applied onto the journal)
[19:40] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Connection reset by peer)
[19:40] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[19:40] <joao> if you replay from far back enough, you may replay operations that weren't guarded
[19:40] <sagewk> b is never replayed, though...
[19:40] <sagewk> oooh, it is by diff. got it.
[19:40] <joao> it is, on the diff
[19:40] <sagewk> :)
[19:42] <joao> I'll take a look on the filestore_{min,max}_sync_interval
[19:42] <joao> but first, got to brew some coffee and drink it all
[19:42] <joao> brb
[19:43] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit ()
[19:48] <sagewk> hehe
[19:48] <gregaf> The_Bishop: okay, can you get a fresh ceph -s dump for me?
[19:48] <gregaf> after that I think we'll need to go to logs, but let me check with sjust
[19:49] <sjust> logs would help
[19:51] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[19:52] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Connection reset by peer)
[19:52] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[19:54] <yehudasa> gregaf: sagewk: shouldn't we register the dump_ops_in_flight admin command in the messenger code, and not in the osd?
[19:54] <gregaf> uh, what?
[19:54] <gregaf> ops in flight are OSD ops, not incoming messages
[19:54] <yehudasa> ah
[19:55] <yehudasa> then here's another question
[19:55] <gregaf> it keeps track of some of the messenger data as well, but it's for tracking progress through the OSD specifically, not through the Messenger
[19:55] <yehudasa> shouldn't we have requests in-flight admin op?
[19:55] <gregaf> I'm not sure what you mean
[19:55] <gregaf> something to list all the Messages going through dispatch?
[19:55] <yehudasa> yep
[19:55] <yehudasa> like what we have in the kernel client
[19:56] <gregaf> the Messenger-observable lifecycle for most of them is uselessly short
[19:56] <sagewk> on the kclient its request structures, like objecter_dump.. not raw messages
[19:56] <gregaf> the kclient stuff doesn't work via the kernel messenger either
[19:57] <gregaf> it's the kind of thing that each daemon needs to implement on its own
[19:57] <yehudasa> well.. we need to be able to see all outstanding rados requests
[19:57] <gregaf> (or at least, going through the Messenger won't provide a more usable interface than just writing a generic OpTracker that the daemons can feed stuff into)
[19:57] <yehudasa> requests that weren't acked
[19:58] <yehudasa> or requests that are alive anyway
[19:58] <gregaf> the Messenger can't associate a request message and an ack message
[19:58] <yehudasa> yeah, but the rados layer can
[19:58] <sagewk> we have that, objecter_dump on the radosgw side
[19:58] <gregaf> yes, so that needs to be implemented in each user independently
[19:59] <gregaf> (where the Objecter is a user)
[20:00] <gregaf> sagewk: yehudasa: I think you mean objecter_requests?
[20:00] <yehudasa> hmm.. that's not showing up
[20:01] <gregaf> yehudasa: it's "objecter_requests", not "objecter_dump"
[20:01] <gregaf> (at least in my current checkout)
[20:02] <sagewk> 'help' will always list what's there
[20:04] <yehudasa> skinny:/more/yehuda/ceph/src# ./ceph --admin-daemon /tmp/radosgw.adsock help
[20:04] <yehudasa> help list available commands
[20:04] <yehudasa> perfcounters_dump dump perfcounters value
[20:04] <yehudasa> perfcounters_schema dump perfcounters schema
[20:04] <yehudasa> version get protocol version
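A minimal sketch of the invocation under discussion, assuming the objecter's commands are registered on that admin socket (the socket path is the illustrative one from the paste above):

    ./ceph --admin-daemon /tmp/radosgw.adsock objecter_requests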
[20:04] <The_Bishop> i weighted the big disk (osd.3) 1.0 and all other disks 0.01
[20:05] <The_Bishop> could this be the cause?
[20:05] <gregaf> The_Bishop: oh, yes, that could definitely do it
[20:05] <gregaf> CRUSH has some issues we haven't ironed out yet when individual buckets are too small
[20:06] <sagewk> yehudasa: oh, i bet it's not sharing the cct with librados.
[20:06] <sagewk> yehudasa: we can fix that
[20:06] <gregaf> and if you've got 3 OSDs that total 3% of one of the others in a flat crush pool, the placing isn't going to work out properly
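For context, a sketch of one way to bring the CRUSH item weights back in proportion to disk capacity, using the standard crushmap decompile/edit/recompile round-trip (file paths here are illustrative):

    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
    # edit the item weights in /tmp/crushmap.txt so they reflect relative capacity
    crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
    ceph osd setcrushmap -i /tmp/crushmap.new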
[20:06] <yehudasa> sagewk: figured it out
[20:06] <yehudasa> sagewk: the ceph command was overriding it
[20:06] <yehudasa> sharing the same ceph.conf
[20:06] <sagewk> heh right
[20:07] <gregaf> you mean overriding the admin socket?
[20:07] <yehudasa> gregaf: yes
[20:07] <gregaf> yeah, that's really annoying
[20:07] <gregaf> took me forever to figure out when I was doing the objectcacher perfcounters last week
[20:08] <yehudasa> well.. maybe we should have a different setting for the ceph command
[20:08] <yehudasa> ?
[20:09] <yehudasa> or better, have some config variable like $cmd?
[20:09] <yehudasa> so: admin socket = /var/run/ceph/$cmd.$cluster.asok
[20:09] <yehudasa> or something like that
[20:14] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Connection reset by peer)
[20:14] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[20:23] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[20:24] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: No route to host)
[20:31] <sagewk> yehudasa: right now it's $name.asok.. so at least normally radosgw runs as something other than client.admin. but, yeah.
[20:34] <yehudasa> sagewk: anyway, $cmd can be useful
[20:34] <yehudasa> also, maybe $[A-Z] should be used to reference env variables?
[20:34] <yehudasa> (not related)
[20:34] <yehudasa> $[A-Z]*
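A minimal ceph.conf sketch of the $name-based socket naming sagewk describes, which keeps a one-off ceph invocation (running as client.admin) from colliding with a long-running radosgw that runs under its own name; the section name is illustrative:

    [client.radosgw.gateway]
        admin socket = /var/run/ceph/$name.asok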
[20:44] * loicd (~loic@204.16.154.194) has joined #ceph
[20:47] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:48] * loicd (~loic@204.16.154.194) Quit ()
[20:59] <elder> I'm not going to Journal Club (in case anyone cares)
[21:02] * BManojlovic (~steki@212.200.243.246) has joined #ceph
[21:09] <joao> sagewk, it appears that (taking the last commit prior to your fixes) the mismatches pop up when running with "--filestore-max-sync-interval 30 --filestore-min-sync-interval 29"
[21:09] <sagewk> joao: cool. i like that better than adding a new option in there. we can make that 300 or something instead of 30, too, in the scripts that do this.
[21:10] <gregaf> elder: neither is anybody else outside of Aon, apparently, so we moved to 4pm Pacific if you prefer that
[21:10] <gregaf> or if you don't! ;)
[21:10] <joao> sagewk, this approach does not survive a "force_sync" though
[21:10] <sagewk> yeah, but the tester never does that
[21:10] <elder> I just haven't read the paper(s). I don't expect to before then either...
[21:11] <elder> But if I finish what I'm doing I might join in.
[21:11] <gregaf> the PNUTS one is actually pretty interesting; it's definitely an engineering effort but they have a consistency model a little different than I've seen before
[21:11] <joao> sagewk, a sync() would suffice to get force_sync = true
[21:12] <joao> oh
[21:12] <joao> yeah, "our tester" never does that
[21:12] <joao> I was thinking of a "tester" in the broader, more generic, sense of the word
[21:13] <The_Bishop> now i have set the weights normalized and resync+backfill starts... let's see
[21:14] <sagewk> joao: yeah.
[21:14] <joao> sagewk, also, relying on --filestore-{max,min}-sync-interval will imply that we must define *huge* intervals for long runs
[21:14] <sagewk> teuthology is just running run_seed_to_range.sh.. let's update that with two cases for each failure point (with and without the min_sync_interval on the failure run)
[21:14] <sagewk> yeah, just make it a very large number.
[21:15] <joao> sagewk, kay, looking into that
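A sketch of the override being discussed, with deliberately oversized values standing in for sage's "very large number" so the filestore sync effectively never fires during a failure run (the exact wiring into run_seed_to_range.sh is left out):

    --filestore-min-sync-interval 35999 --filestore-max-sync-interval 36000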
[21:16] <elder> sagewk, or someone, can you tell me what exactly happens to the snap_follows field in a CEPH_MSG_CLIENT_CAPS message when it is received?
[21:17] <sagewk> elder: aie... that's a long story.
[21:17] <elder> It turns out it's a snapshot id, and what I'm really interested to know is what happens if its value is 0.
[21:17] <elder> Or whether that's important.
[21:18] <sagewk> it's only important with respect to the client<->mds interaction
[21:18] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[21:18] <sagewk> i would ignore it for now
[21:19] <elder> Well what I'd like to do is *not* assign 0 to it. Or, if 0 is meaningful, give that 0 a symbolic name.
[21:20] <elder> It's the only place in the entire codebase where 0 is used as a snapshot id.
[21:21] <sagewk> ah. i can't remember offhand whether/when the mds sets it to 0, or what it means, without digging through the code.
[21:21] <elder> OK. I'll look myself.
[21:21] <elder> Just looking for insight.
[21:22] <sagewk> i would wait until we look at the mds code closely and leave it for now.
[21:22] <elder> OK. I've made a note of it for now.
[21:22] <The_Bishop> hmmm, building ceph with "-flto" fails :(
[21:26] <gregaf> elder: sagewk: I'm pretty sure 0 is uninitialized
[21:27] <gregaf> and in fact that member is only filled in by the client (the MDS just fills it in from data the client sends)
[21:28] <gregaf> but it needs to stay at 0 since otherwise the client will try and do things like delete that snapshot, etc
[21:28] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[21:32] * aa (~aa@200.211.176.5) has joined #ceph
[21:33] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[21:41] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[21:44] * Tv (~tv@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[21:45] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[21:46] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[21:46] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Read error: Connection reset by peer)
[21:48] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Quit: LarsFronius)
[21:51] * aa (~aa@200.211.176.5) Quit (Quit: Konversation terminated!)
[21:51] * andresambrois (~aa@200.211.176.5) has joined #ceph
[22:08] * loicd1 (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[22:21] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:44] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[22:47] * Tv (~tv@204-16-154-194-static.ipnetworksinc.net) Quit (Ping timeout: 480 seconds)
[22:58] <elder> sage, the ceph-client amd64 gitbuilder isn't seeing my commit again. How long ago did you push your wip-lpg branch?
[22:59] <sagewk> an hour or so
[22:59] <elder> What was it yesterday? My build completed quickly after you looked into it. BTW, I have to go deliver my son to practice. Back in about an hour.
[23:01] * Tv (~tv@204-16-154-194-static.ipnetworksinc.net) has joined #ceph
[23:16] * lofejndif (~lsqavnbok@28IAAD3A3.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:18] * loicd (~loic@204-16-154-194-static.ipnetworksinc.net) Quit (Quit: Leaving.)
[23:35] * loicd (~loic@99-7-168-244.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[23:51] * joshd (~jdurgin@108-89-24-20.lightspeed.irvnca.sbcglobal.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.