#ceph IRC Log


IRC Log for 2012-06-13

Timestamps are in GMT/BST.

[0:01] <Tv_> "The weekday is present to make it easy for a human to tell when a lease expires - it's specified as a number from zero to six, with zero being Sunday." -- because all humans are accustomed to using numbers for weekdays, and Everyone starts their week on Sunday
[0:01] <Tv_> *sigh*
[0:08] <joao> I am accustomed to that, but I usually go at it using a Fibonacci sequence to map numbers into weekdays
[0:08] <Tv_> must be tough to deal with 1, 1
[0:08] <joao> it's odd how sunday and monday always end up with the same number though
[0:08] <Tv_> at least the week gets more and more perfect as the weekend approaches
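
The convention Tv_ quotes (weekday as a number from zero to six, zero being Sunday) differs from Python's own, where `date.weekday()` counts from Monday. The following is a minimal sketch of the conversion; the function name `lease_weekday` and the `WEEKDAY_NAMES` list are made up for illustration and are not taken from any lease-file parser.

```python
from datetime import date

# Names indexed by the "0 = Sunday" convention quoted above.
WEEKDAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday"]


def lease_weekday(d):
    """Return the weekday of d as 0..6 with 0 meaning Sunday."""
    # date.weekday() uses 0 = Monday, so shift by one and wrap around.
    return (d.weekday() + 1) % 7


log_day = date(2012, 6, 13)  # the date of this log
print(lease_weekday(log_day), WEEKDAY_NAMES[lease_weekday(log_day)])
```
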
[0:39] * gregorg_taf (~Greg@ has joined #ceph
[0:39] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[0:41] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:13] * mtk (~mtk@ool-44c358a7.dyn.optonline.net) Quit (Read error: Connection reset by peer)
[1:25] * Tv_ (~tv@2607:f298:a:607:bd15:990e:65cd:46db) Quit (Quit: Tv_)
[1:37] * lofejndif (~lsqavnbok@09GAAF39I.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[2:02] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:30] * yehudasa__ (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:57] * adjohn (~adjohn@ Quit (Quit: adjohn)
[3:04] * yoshi (~yoshi@p37158-ipngn3901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:11] <thafreak> say i have a test machine...with 4 disks....can I run one monitor and 4 osd's on that one box as a test?
[3:12] * yehudasa (~yehudasa@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[3:15] <jmlowe> thafreak: I run mon, mds, and 6 osd's on one box
[3:15] <jmlowe> I've got 2 of those and I run a couple of dozen vm's backed by it
[3:23] <thafreak> awesome
[3:23] <thafreak> was there any docs that helped you set up something like that?
[3:24] <thafreak> anything odd about the config? just leave out the host field for every entry since they're all localhost?
[3:35] * renzhi_away is now known as renzhi
[4:17] <jmlowe> just repeat the host with the fqdn, it will auto pick the ports for everything except the mon
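
A rough sketch of what jmlowe describes: one monitor plus several OSD sections that all repeat the same fully qualified host name, with only the monitor given an explicit address. The host name, the address and the helper `render_single_host_conf` are placeholders invented for this example, and a real ceph.conf needs more than this (data and journal paths, for instance).

```python
def render_single_host_conf(fqdn, mon_addr, n_osds):
    """Build ceph.conf-style text for one mon and n_osds OSDs on one host."""
    # Only the monitor gets an explicit address; the OSD sections just
    # repeat the host, as suggested in the conversation above.
    lines = ["[mon.a]",
             "    host = %s" % fqdn,
             "    mon addr = %s" % mon_addr,
             ""]
    for i in range(n_osds):
        lines += ["[osd.%d]" % i, "    host = %s" % fqdn, ""]
    return "\n".join(lines)


# Placeholder values: a documentation-range address and a made-up host name.
conf = render_single_host_conf("testbox.example.com", "192.0.2.10:6789", 4)
print(conf)
```
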
[4:18] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:29] * renzhi (~renzhi@ Quit (Ping timeout: 480 seconds)
[4:38] * renzhi (~renzhi@ has joined #ceph
[4:51] * Ryan_Lane1 (~Adium@dslb-178-000-112-155.pools.arcor-ip.net) has joined #ceph
[4:58] * Ryan_Lane (~Adium@dslb-188-106-110-073.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[5:32] * yehudasa (~yehudasa@99-48-179-68.lightspeed.irvnca.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[6:20] * cattelan_away is now known as cattelan_away_away
[6:26] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[7:00] * CephLogBot (~PircBot@rockbox.widodh.nl) has joined #ceph
[7:57] <renzhi> Hi, I have 6 osd, two on each rack. And the pool replica size is set to 3. How can I make sure that each replica goes to either one of the osd on each rack?
[7:57] <renzhi> this seems like an easy thing to do, but couldn't find the pointer to the document description
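
The constraint renzhi is asking about can be stated precisely: with three racks and a replica size of 3, every placement should use exactly one OSD from each rack. The toy check below only illustrates that constraint; it is not CRUSH, and the OSD-to-rack table `RACK_OF` is an assumed layout matching "6 osd, two on each rack".

```python
from itertools import product

# Assumed layout: OSD ids 0..5, two per rack. Rack names are made up.
RACK_OF = {0: "rack1", 1: "rack1", 2: "rack2", 3: "rack2", 4: "rack3", 5: "rack3"}


def spans_all_racks(replicas, rack_of=RACK_OF):
    """True if the replicas sit in distinct racks and cover every rack."""
    racks = {rack_of[osd] for osd in replicas}
    return len(racks) == len(replicas) == len(set(rack_of.values()))


def valid_placements(rack_of=RACK_OF):
    """Enumerate every way to pick one OSD from each rack."""
    by_rack = {}
    for osd, rack in sorted(rack_of.items()):
        by_rack.setdefault(rack, []).append(osd)
    return [list(combo) for combo in product(*(by_rack[r] for r in sorted(by_rack)))]


print(spans_all_racks([0, 2, 4]))   # one OSD per rack
print(spans_all_racks([0, 1, 4]))   # two replicas share rack1
print(len(valid_placements()))
```
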
[8:24] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[8:24] * mikiem (~mike@cpe-24-165-6-26.san.res.rr.com) has joined #ceph
[8:27] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[8:36] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[8:39] * iggy (~iggy@theiggy.com) Quit ()
[8:40] * iggy (~iggy@theiggy.com) has joined #ceph
[8:52] * gregorg_taf (~Greg@ Quit (Quit: Quitte)
[9:05] * Ryan_Lane (~Adium@dslb-178-000-112-155.pools.arcor-ip.net) has joined #ceph
[9:05] * Ryan_Lane1 (~Adium@dslb-178-000-112-155.pools.arcor-ip.net) Quit (Read error: Connection reset by peer)
[9:05] * fghaas (~florian@ has joined #ceph
[9:05] * RupS (~rups@panoramix.m0z.net) has joined #ceph
[9:11] * Ryan_Lane1 (~Adium@dslb-178-000-112-155.pools.arcor-ip.net) has joined #ceph
[9:11] * Ryan_Lane (~Adium@dslb-178-000-112-155.pools.arcor-ip.net) Quit (Read error: Connection reset by peer)
[9:20] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:24] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[9:28] * BManojlovic (~steki@ has joined #ceph
[9:35] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:52] * Qten (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) Quit (Read error: Connection reset by peer)
[9:52] * Qten (~qgrasso@ip-121-0-1-110.static.dsl.onqcomms.net) has joined #ceph
[10:14] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[10:14] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[10:50] * hijacker (~hijacker@ Quit (Remote host closed the connection)
[11:42] * yoshi (~yoshi@p37158-ipngn3901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:51] * fghaas (~florian@ Quit (Ping timeout: 480 seconds)
[12:14] * maje (~anonymous@office.siteworkers.nl) Quit (Remote host closed the connection)
[12:14] * maarten (~maarten@office.siteworkers.nl) has joined #ceph
[12:14] * maarten (~maarten@office.siteworkers.nl) has left #ceph
[12:15] * maje (~maarten@office.siteworkers.nl) has joined #ceph
[12:21] * renzhi is now known as renzhi_away
[13:06] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[13:20] * lofejndif (~lsqavnbok@9KCAAF7RX.tor-irc.dnsbl.oftc.net) has joined #ceph
[13:24] * stass (stas@ssh.deglitch.com) Quit (Remote host closed the connection)
[13:31] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[14:11] * mtk (~mtk@ool-44c358a7.dyn.optonline.net) has joined #ceph
[14:17] * lofejndif (~lsqavnbok@9KCAAF7RX.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[14:26] * lofejndif (~lsqavnbok@82VAAEGPU.tor-irc.dnsbl.oftc.net) has joined #ceph
[14:31] * lofejndif (~lsqavnbok@82VAAEGPU.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[14:31] * rosco (~r.nap@ Quit (Read error: Connection reset by peer)
[14:31] * exel (~pi@ Quit (Read error: Connection reset by peer)
[14:32] * lofejndif (~lsqavnbok@09GAAF44X.tor-irc.dnsbl.oftc.net) has joined #ceph
[14:53] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[14:54] * brambles (brambles@ Quit (Remote host closed the connection)
[14:55] * brambles (brambles@ has joined #ceph
[15:00] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[15:05] * lofejndif (~lsqavnbok@09GAAF44X.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[15:11] * Dieter_b1 (~Dieterbe@dieter2.plaetinck.be) Quit (Quit: leaving)
[15:16] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[15:27] * rosco (~r.nap@ has joined #ceph
[15:31] <elder> sage I should have grabbed your teuthology branch last night. I can't get to the repository at the moment.
[15:32] <elder> This despite them reporting on their status site "all systems are operational" and "24 hours uptime" or something like that.
[15:33] <elder> Ah, back again. 24 hours uptime it is!
[15:35] * jmlowe (~Adium@140-182-210-232.dhcp-bl.indiana.edu) has joined #ceph
[15:37] * DLange (~DLange@dlange.user.oftc.net) Quit (Quit: a reboot a day keeps the bugs away)
[15:39] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[16:09] <elder> sage, "make deb-pkg" took significantly longer than "make all"
[16:09] <elder> It looked like it may have rebuilt everything, so I should do one or the other.
[16:10] <elder> It also generated four distinct deb packages: linux-firmware-image, linux-headers, linux-libc-dev, and linux-image
[16:11] <elder> I am only installing the linux-image one, but it makes me wonder whether we might need or want the others too.
[16:11] <jerker> renzhi_away: http://ceph.com/wiki/Custom_data_placement_with_CRUSH#Example_crush_map Does that help you maybe?
[16:12] <elder> When building with a separate directory for object files (O=<dir>) it places the debian files in "<dir>/.."
[16:14] <elder> Seems to be working so far, though my upload speed from home may counteract some of the time benefit versus gitbuilder. Perhaps I should upload once to a machine closer to the target hardware...
[16:20] * jmlowe (~Adium@140-182-210-232.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[16:23] <elder> Wow, the kernel debian package is 450 MB. No wonder it takes so damned long. Can we please trim that down a bit?
[16:34] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[16:36] <jmlowe> I could use some hand holding
[16:36] <jmlowe> ceph health
[16:36] <jmlowe> HEALTH_WARN 585 pgs backfill; 314 pgs degraded; 4 pgs down; 6 pgs peering; 711 pgs recovering; 5 pgs stuck inactive; 717 pgs stuck unclean; recovery 241867/1705774 degraded (14.179%); 2/535256 unfound (0.000%); 1 mons down, quorum 0,2
[16:37] <nhm> jmlowe: any idea which mon is down?
[16:40] <jmlowe> I appear to have 3 of 3 running
[16:41] <jmlowe> I have 2 osd's that crash constantly
[16:41] <jmlowe> leveldb seems to be corrupt on them
[16:42] * diggalabs (~jrod@cpe-72-177-238-137.satx.res.rr.com) Quit (Quit: Jrod has left the building.)
[17:00] * gregorg (~Greg@ has joined #ceph
[17:06] * lofejndif (~lsqavnbok@09GAAF5BW.tor-irc.dnsbl.oftc.net) has joined #ceph
[17:14] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:18] * fridge_ (~matt@34-232-181-180.cpe.skymesh.net.au) has joined #ceph
[17:19] * fridge_ (~matt@34-232-181-180.cpe.skymesh.net.au) has left #ceph
[17:21] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:30] * cattelan_away_away (~cattelan@c-66-41-26-220.hsd1.mn.comcast.net) Quit (Ping timeout: 480 seconds)
[17:31] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[17:33] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:56] <jmlowe> anybody know anything about leveldb and ceph?
[17:56] * cattelan_away_away (~cattelan@2001:4978:267:0:21c:c0ff:febf:814b) has joined #ceph
[17:56] <nhm> jmlowe: doh, sorry. I got distracted. That doesn't sound good.
[17:57] <nhm> jmlowe: I don't know much about leveldb, sorry.
[17:57] <nhm> jmlowe: I assume the situation hasn't changed much in the past hour?
[17:58] <jmlowe> I naively tried to repair and open with the python leveldb module and that seems ok
[17:59] <nhm> Has the cluster health improved?
[17:59] <jmlowe> not really
[17:59] <nhm> does it still think one of your mons is down?
[18:00] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[18:00] <jmlowe> I have about 6 pg's that I really need to recover so that I can get healthy again
[18:01] <nhm> joshd: Were you involved much with leveldb?
[18:02] <joshd> no, that was all sam
[18:02] <nhm> joshd: ah, ok. jmlowe is trying to recover his cluster and said he thinks it might be leveldb corruption.
[18:03] <nhm> I'm probably not the right person to try and diagnose this. :)
[18:06] * BManojlovic (~steki@ has joined #ceph
[18:14] <elder> github is not responding again. But let's not count that against their 24 hour uptime, mkay?
[18:15] <elder> Back again. See? 24 hours!
[18:25] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) Quit (Remote host closed the connection)
[18:26] <joao> damn, using a vnc client from the laptop to the desktop sure sucks
[18:32] <joao> okay, it's truly unbearable to vnc to the desktop
[18:32] <joao> brb
[18:37] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) has joined #ceph
[18:37] <jmlowe> is there a manual way to recover a pg?
[18:38] <sjust> not really
[18:39] <jmlowe> sjust: oh good, you are here
[18:39] <sjust> yeah, sorry, my connection is weird
[18:40] * joao-laptop (~Adium@89-181-148-114.net.novis.pt) has joined #ceph
[18:41] <sjust> so what did you attempt since yesterday?
[18:42] <jmlowe> found python-leveldb in precise, installed, leveldb.RecoverDB('omap')
[18:43] <jmlowe> churns for a few minutes, can open without error and paranoid checks on
[18:43] <sjust> that's interesting
[18:43] <sjust> did that allow the osd to start?
[18:44] <jmlowe> osd start still crashes
[18:44] <jmlowe> 15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cdec2]
[18:44] <jmlowe> 16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6ce6c0]
[18:44] <jmlowe> 17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cf168]
[18:44] <jmlowe> 18: /usr/bin/ceph-osd() [0x6e811f]
[18:44] <sjust> ah
[18:44] <sjust> that's actually encouraging, probably not corruption
[18:44] <sjust> one
[18:44] <sjust> sec
[18:44] <jmlowe> if I can just get it up long enough to copy the 4 down'ed pg's off I'll be in relatively good shape
[18:45] * Tv_ (~tv@2607:f298:a:607:2464:f4d0:d7cc:999a) has joined #ceph
[18:45] <sjust> ext4?
[18:46] <jmlowe> btrfs
[18:46] <sjust> ok
[18:47] <jmlowe> current state: 2012-06-13 12:47:03.762859 pg v2786552: 2304 pgs: 2 active, 1941 active+clean, 1 active+recovering+replay+remapped+backfill, 1 active+recovering+replay+degraded+backfill, 19 active+clean+replay, 115 active+recovering+degraded+remapped+backfill, 4 down+peering, 22 active+recovering, 144 active+recovering+remapped+backfill, 55 active+recovering+degraded+backfill; 2021 GB data, 4815 GB used, 15773 GB / 21424 GB avail; 128950/
[18:47] <sjust> ah
[18:52] <jmlowe> hmm, also, I think the ceph health report might have a bug
[18:52] <jmlowe> ceph mon stat
[18:52] <jmlowe> e3: 3 mons at {alpha=,beta=,gwbmgmt=}, election epoch 4, quorum 0,2
[18:52] <jmlowe> ceph health
[18:52] <jmlowe> HEALTH_WARN 298 pgs backfill; 164 pgs degraded; 4 pgs down; 5 pgs peering; 319 pgs recovering; 4 pgs stuck inactive; 326 pgs stuck unclean; recovery 122371/1666261 degraded (7.344%); 3/535801 unfound (0.001%); 1 mons down, quorum 0,2
[18:52] <jmlowe> stat shows 3 good mon's but health shows one down
[18:52] <jmlowe> afaik
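
To compare the two reports side by side it can help to split the health line into its parts. This is a small sketch that assumes nothing beyond the shape of the line jmlowe pasted (a status word, then items separated by semicolons); `parse_health` is a made-up helper, and the text format is not a stable interface.

```python
import re

# The health line as pasted above.
SAMPLE = ("HEALTH_WARN 298 pgs backfill; 164 pgs degraded; 4 pgs down; "
          "5 pgs peering; 319 pgs recovering; 4 pgs stuck inactive; "
          "326 pgs stuck unclean; recovery 122371/1666261 degraded (7.344%); "
          "3/535801 unfound (0.001%); 1 mons down, quorum 0,2")


def parse_health(line):
    """Split a health line into status, per-state PG counts and raw items."""
    status, _, detail = line.partition(" ")
    items = [part.strip() for part in detail.split(";") if part.strip()]
    pg_counts = {}
    for item in items:
        # Items such as "4 pgs down" carry a count and a state name.
        m = re.fullmatch(r"(\d+) pgs (.+)", item)
        if m:
            pg_counts[m.group(2)] = int(m.group(1))
    return {"status": status, "pg_counts": pg_counts, "items": items}


report = parse_health(SAMPLE)
print(report["status"], report["pg_counts"]["down"], report["items"][-1])
```
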
[18:54] * bchrisman (~Adium@ has joined #ceph
[19:05] <joao-laptop> are you guys around?
[19:07] <gregaf> no
[19:08] <joao-laptop> the sprint planning is today, is it not?
[19:08] <gregaf> it got moved to this afternoon
[19:08] <gregaf> (our afternoon)
[19:08] <joao-laptop> oh
[19:08] <elder> What time?
[19:08] <joao-laptop> cool then
[19:08] <gregaf> uh, 2pm pacific
[19:09] <gregaf> you should've got a calendar update?
[19:09] <elder> 4pm my time.
[19:09] <elder> dark-o'clock Joao's time.
[19:09] <elder> Well there it is after all.
[19:09] <joao-laptop> well, the one good thing out of the whole change is that now I can watch the game
[19:09] <elder> I wondered why I didn't see it (at noon)
[19:10] * chutzpah (~chutz@ has joined #ceph
[19:10] <elder> Do we still have a standup?
[19:10] <joao-laptop> and so far we're winning against denmark (hurray!)
[19:10] <joao-laptop> yeah, what about the standup?
[19:11] <gregaf> yes
[19:11] <joao-laptop> oh yeah??? I changed my calendar's timezone to gmt
[19:12] <joao-laptop> so, that '10' is not 10am PST, but 10pm GMT
[19:12] <joao-laptop> that explains it
[19:15] <joao-laptop> will the standup be now or during the afternoon meeting?
[19:15] <gregaf> now
[19:15] <joao-laptop> okay
[19:28] * joao-laptop (~Adium@89-181-148-114.net.novis.pt) Quit (Quit: Leaving.)
[19:29] <joao> be back in approx. 50 minutes
[19:30] <gregaf> joao: we should chat when you get back (if I'm not at lunch) :)
[19:30] <sagewk1> elder: if the problem(s) are hiding for now, lets get the testing-next updated (if it isn't already) so that we're hitting it in the nightlies
[19:30] <elder> OK.
[19:30] <elder> I was thinking that too.
[19:30] <sagewk1> are there any outstanding patches that aren't in testing yet?
[19:30] <elder> I have a few.
[19:30] <elder> I have to sort of look at the state of things because I feel like it's become a bit chaotic...
[19:31] <sagewk1> yeah.
[19:31] <elder> But I'll get on top of it and will confer with you in a little while about my plan.
[19:31] <sagewk1> what i'd like to do is get what's been done already in testing and semi-wrapped up, and shift gears for a bit to pick up some of the other things that need attention, like the failing xfstests
[19:31] <elder> OK.
[19:32] <sagewk1> at this point those are triggering more often than the sneaky msgr thing.. which may be a red herring anyway
[19:32] * allsystemsarego (~allsystem@ has joined #ceph
[19:32] <elder> Well, maybe. I'm *sure* I saw it. But then again, I think many, many possible errors show up with that signature because of the visibility (or not) of symbols in the image.
[19:32] <elder> try_write+<something> covers a lot of area, it turns out.
[19:33] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (Remote host closed the connection)
[19:38] <sagewk1> yeah
[19:39] * sagewk1 is now known as sagewk
[19:39] * adjohn (~adjohn@ has joined #ceph
[19:39] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[19:41] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[19:49] <nhm> doh, sorry, I thought we would piggyback the standup with the planning meeting.
[19:49] <sjust> jmlowe: ok, I can now reproduce locally
[19:50] <sagewk> nhm: no worries
[19:52] <nhm> sagewk: had lunch with an old friend that went from the defense industry to doing R&D at Toro on lawn mowers. He said it's super fun. :)
[19:52] <sagewk> hehe
[19:52] <joao> gregaf, I'm back
[19:52] <gregaf> right, so about this monitor stuck stuff
[19:53] <joao> the osd is the one that's stuck
[19:53] <joao> waiting for some message
[19:53] <sagewk> tv_: https://github.com/ceph/ceph/commit/8fdc8f57b0416c040383fa0fbbf374b4a6ae8ee4
[19:53] <gregaf> joao: which one?
[19:54] <Tv_> sagewk: reading
[19:54] <joao> gregaf, it gets stuck on this line
[19:55] <Tv_> sagewk: the first two bullet points don't make sense to me..
[19:55] <joao> argh, pasting from gvim always fails me
[19:55] <joao> client_messenger->wait();
[19:55] <joao> this one
[19:55] <nhm> sjust: are you reproducing the mon report from health or something else?
[19:55] <sjust> sorry, just the leveldb thing
[19:55] <joao> gets stuck here, waiting for something
[19:56] <gregaf> joao: that's not a bug; that's the OSD starting up successfully and the main thread waiting until it's supposed to shut down
[19:56] <joao> and since it is never considered up (or down), I'm guessing that the monitor is messing something up
[19:57] <gregaf> have you looked through the logs to see if they're connecting to each other?
[19:58] <joao> gregaf, been looking at them, but haven't found anything useful
[19:58] <joao> had to crank up --debug-ms to 20
[19:58] <gregaf> ie, is the problem that the monitors fail to commit a map update, or before that
[19:58] <joao> but then I hit a bug I introduced on the AuthMonitor, and got to get at it again
[19:58] <joao> *and have to
[20:00] <gregaf> "but then"? you weren't hitting it before and you did when you turned up ms_debug?
[20:00] <jmlowe> sjust: back now, let me know if there is anything I can help with including access to my broken cluster
[20:00] <joao> gregaf, I killed the monitor to crank it up
[20:00] <joao> and then, the bug popped up
[20:00] <sjust> jmlowe: shouldn't be necessary, the leveldb crash appears to be independent of ceph
[20:00] <gregaf> ah
[20:01] <joao> being the bug on update_from_paxos(), my guess is that something went awfully wrong when encode_pending()
[20:01] <gregaf> okay; so that may well be similar to the bug you're hitting with the OSDs too
[20:02] <gregaf> I still haven't gotten around to reviewing the implementation changes
[20:02] <gregaf> have you verified that the monitors can commit any new paxos transactions, and that it also works to update the maps?
[20:02] <joao> gregaf, sorry, just a sec (phone ringing and it won't stop for some reason)
[20:03] <gregaf> brb myself
[20:05] <joao> gregaf, it works; stuff is proposed, applied, updated, repeat
[20:06] <joao> last night I fixed a stupid bug that consisted in writing the wrong bufferlist to a version, during encode_pending(); when we updated from paxos, things went boom
[20:07] <sagewk> nhm: any insight into the oneiric 0.46 vs precise 0.47.2 thing?
[20:07] <joao> so I'm now looking for something similar on this one
[20:07] <joao> although it appears to be a case of reaching the end of a buffer (i.e., we wrote a bufferlist of length zero) instead of having stuff there
[20:08] <joao> which I didn't consider because I wasn't aware it could be possible (hence the problem may be somewhere else, or I may be trying to read the wrong version)
[20:09] <joao> before going at it though, I'm going to create a simple class to read the store and output it, so I can see what is happening before starting making assumptions and burning time proving me right/wrong
[20:10] <joao> s/me/myself
[20:10] <nhm> sagewk: Not yet. Yesterday was really unproductive. Should have some numbers tonight/tomorrow.
[20:12] <gregaf> joao: okay, that sounds good
[20:22] * lofejndif (~lsqavnbok@09GAAF5BW.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[20:51] <elder> sagewk, what is your assessment of recent ceph-client/testing teuthology test results?
[20:51] <elder> That branch is based on 3.5-rc1.
[20:52] <elder> Any surprises since Friday?
[20:52] <elder> If not, I will update the master branch to match it.
[20:57] <nhm> elder: I think he had to run for a bit.
[20:57] <elder> No problem.
[20:57] <elder> I'm not in a huge hurry...
[21:05] <dmick> he's at lunch
[21:05] <dmick> ENOSEAT so I came back to the desk
[21:07] <nhm> dmick: desklunch = productivity!
[21:08] <dmick> in theory
[21:08] <elder> nosleep = productivity!
[21:09] <dmick> or brooklyn, or something
[21:09] <jmlowe> elder: you misplaced your !, should be nosleep != productivity
[21:09] <elder> Maybe nhm did too
[21:10] <dmick> electrical engineering: http://i.imgur.com/gs9x5.jpg
[21:10] <elder> Is the umbrella the "engineering" part?
[21:11] <nhm> elder: I think that's to pretect the little box on the floor from the eventual rain of sparks/death.
[21:11] <nhm> s/pretect/protect
[21:12] <elder> It doesn't actually look like the whole setup would draw much current even when fully utilized.
[21:12] <elder> Unless the umbrella tipped over.
[21:19] <sagewk> elder: no surprises, same issues as before
[21:19] <elder> So OK with you for us to move forward to be 3.5-rc1 based?
[21:19] <sagewk> elder: there was consistent xfstest breakage, #2522
[21:19] <elder> Right.
[21:19] <sagewk> but it looks better now
[21:19] <sagewk> ?
[21:20] <sagewk> maybe because you just updated testing the other day?
[21:20] <elder> I looked at that one.
[21:20] <elder> The problem is that the output of repquota was not producing what was expected.
[21:21] <elder> I don't know if that had to do with a newer version of repquota (Precise?)
[21:21] <sagewk> it failed some 6 or 7 runs in a row.. pretty consistent
[21:21] <sagewk> oh, maybe
[21:21] <elder> And/or a newer version of xfs and/or xfstests (3.5-rc?)
[21:21] <elder> Or some combination of all of that.
[21:22] <elder> We probably ought to be very careful (in the future anyway) about marking exactly when changes like that occur with respect to our various test results.
[21:22] <sagewk> well, let's see it not happen one more time and then close it, i guess. not important aside from the noise it generates
[21:22] <elder> I will do a quick test run with the testing branch that runs test 219 repeatedly.
[21:22] <dmick> "let's see it not happen". I want to see that happen. :)
[21:23] <elder> You never will.
[21:23] <elder> Or won't you?
[21:23] <dmick> or...won't I? JINX
[21:23] <dmick> you have a sick, sick mind, elder. I can tell because it jumps to the same conclusions as mine, often
[21:23] <elder> I was concluding the same thing about you.
[21:24] <dmick> see?
[21:24] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Quit: LarsFronius)
[21:24] <nhm> I'd say to get a room, but your in one.
[21:24] <nhm> bah, you're.
[21:24] * nhm slinks back into his hole
[21:25] <jmlowe> sjust: any news?
[21:25] * lofejndif (~lsqavnbok@04ZAADUBX.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:26] <dmick> jmlowe: he's lunching, should be back shortly
[21:29] <jmlowe> ah
[21:31] <jmlowe> one more non sequitur for you guys:
[21:31] <jmlowe> 2012-06-13 15:30:13.874487 pg v2794977: 2304 pgs: 2235 active+clean, 59 active+clean+replay, 4 down+peering, 2 active+recovering, 2 active+recovering+remapped+backfill, 2 active+recovering+degraded+backfill; 2026 GB data, 4808 GB used, 15779 GB / 21424 GB avail; 33/1611760 degraded (0.002%); 4/537058 unfound (0.001%)
[21:31] <jmlowe> 2012-06-13 15:30:14.940841 pg v2794978: 2304 pgs: 2235 active+clean, 59 active+clean+replay, 4 down+peering, 2 active+recovering, 2 active+recovering+remapped+backfill, 2 active+recovering+degraded+backfill; 2026 GB data, 4808 GB used, 15779 GB / 21424 GB avail; -4/1611760 degraded (-0.000%); 4/537058 unfound (0.001%)
[21:32] <jmlowe> how can you have a negative count of degraded objects?
[21:33] <jmlowe> wait, I see why
[21:33] <jmlowe> 2012-06-13 15:32:32.582138 osd.1 2586 : [ERR] 2.b1 backfill osd.15 stat mismatch on finish: num_bytes 2107219968 != expected 3149824
[21:34] <jmlowe> don't know what that means
[21:43] * ninkotech (~duplo@ Quit (Remote host closed the connection)
[21:44] * lofejndif (~lsqavnbok@04ZAADUBX.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[21:45] * lofejndif (~lsqavnbok@9KCAAF8C4.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:55] <elder> sagewk, I'm getting the test 219 failure consistently using the current testing branch.
[21:55] <sagewk> k. can we fix it? or just disable that test?
[21:55] <sagewk> the error makes it look like bad bash .. a bash vs dsh syntax issue, maybe
[21:56] <sagewk> s/dsh/dash/
[21:57] <elder> Why do we not have bash?
[21:58] <sjust> jmlowe: still working on it, haven't worked a whole lot with this bit of leveldb
[21:58] <Tv_> elder: /bin/sh is not bash
[21:58] <Tv_> elder: you want bash, you ask for it (and pay the performance cost)
[21:58] <elder> These tests all specify #!/bin/bash
[21:58] <elder> So does that mean we get bash?
[21:58] <Tv_> yeah
[21:59] <elder> OK.
[21:59] <Tv_> if they're executed like that and not via "sh filename" ;)
[21:59] <elder> Well within xfstests I'm pretty sure they're all executed directly, but I'll check that.
[21:59] <Tv_> yeah that part should be a given
[21:59] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:00] <Tv_> just playing devil's advocate
[22:02] <elder> Yes, they get executed directly. I wasn't sure though, because they're run from a driver script. And I know in the past we've updated a few tests to set the executable bit.
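
Tv_'s point is that the `#!` line only matters when the file is executed directly; run as "sh filename", the shebang is just a comment. A quick way to check both conditions elder mentions (the interpreter line and the executable bit) is sketched below. The helper names are invented for this example and have nothing to do with the xfstests driver script.

```python
import os
import tempfile


def shebang_interpreter(path):
    """Return the interpreter named on the #! line, or None if there is none."""
    with open(path, "rb") as f:
        first = f.readline().decode("utf-8", "replace").strip()
    if not first.startswith("#!"):
        return None
    parts = first[2:].split()
    return parts[0] if parts else None


def runs_directly(path):
    """True if the file has a shebang and the caller may execute it."""
    return os.access(path, os.X_OK) and shebang_interpreter(path) is not None


# Demo on a throwaway script; the temporary file starts without execute bits.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as tmp:
    tmp.write("#!/bin/bash\necho hello\n")
    demo_path = tmp.name

print(shebang_interpreter(demo_path))
print(runs_directly(demo_path))
os.chmod(demo_path, 0o755)  # set the executable bit, as was done for a few tests
print(runs_directly(demo_path))
os.remove(demo_path)
```
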
[22:03] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) Quit (Quit: Leaving.)
[22:03] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[22:03] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[22:03] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[22:03] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[22:04] * sjust (~sam@aon.hq.newdream.net) Quit ()
[22:04] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[22:09] <Tv_> sagewk: wtf http://tracker.newdream.net/issues/2247
[22:09] <Tv_> sagewk: single-word ticket, really?
[22:09] <Tv_> do ALL the things
[22:10] <dmick> if you don't understand vercoi, you shouldn't mess with vercoi
[22:10] <dmick> (I am a living counterexample)
[22:10] <Tv_> also related, "you mess with the bull, you get the horns"
[22:11] <Tv_> anyone can change the networking config; whether you walk away from the incident is up to you ;)
[22:11] <nhm> Tv_: at first I missed the single word
[22:12] <nhm> Tv_: btw, I'm thinking about smart ways to integrate megacli into teuthology. I think the only really safe way to do it is have the nodes reboot after the controller is reconfigured.
[22:12] <dmick> title: it's broken. content: fix it.
[22:12] <Tv_> nhm: can't touch rootfs on the running machine -> do via pxeboot -> put into node provisioning, not into teuthology
[22:14] <nhm> Tv_: theoretically we don't need to touch rootfs. We could put it into provisioning, but that means reprovisioning every time journals change.
[22:14] <nhm> Can we schedule reprovisions as part of teuthology tasks right now?
[22:15] <Tv_> nhm: nope
[22:15] <dmick> wait I'll say it: "this needs to be chef"
[22:15] <Tv_> nhm: what do you mean by "journals change"?
[22:16] <Tv_> dmick: no chef in this part of provisioning please..
[22:16] <nhm> Tv_: If I want to test a separate raid group for the journal vs separate partition on the same raid group, vs dedicating the entire raid group to an OSD vs a journal on an entirely different disk.
[22:17] <Tv_> nhm: well, that sounds like you got some coding to do
[22:17] <nhm> Tv_: Eventually we may want to have automated testing also include various raid configs behind OSDs.
[22:18] <Tv_> nhm: the good news is, it sounds like you can live with not touching root = 1 disk
[22:18] <nhm> Tv_: yeah, I think so. I've got scripts that do this, but it's probably time to start thinking about how to do it right (well, better).
[22:19] <nhm> Tv_: yeah, the only thing that sucks is that we can't do a full controller reset that way, but it looks like we can probably individually set everything back into a more or less clean state ourselves.
[22:23] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:24] <Tv_> nhm: not sure when a full controller reset would be wanted anyway, and how that is different from just rebooting
[22:24] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:25] <nhm> Tv_: When I say reset, I mean resetting the controller back to all of the factory original default settings for things like cache pinning and flush intervals.
[22:25] <Tv_> nhm: oh.. wouldn't removing a raid group & recreating it do that for you?
[22:26] <Tv_> nhm: if that's controller-wide, you probably can't change it anyway while running your OS from the controller
[22:26] <nhm> Tv_: I think it's controller wide. Not sure if it can be overridden at the raid group level.
[22:27] <nhm> Tv_: Not sure. Megacli doesn't like to give very useful error messages (assuming you get one at all).
[22:27] <Tv_> nhm: yeah good luck
[22:28] <Tv_> sagewk: why can't i see #2550 on http://tracker.newdream.net/rb/master_backlogs/ceph ?
[22:29] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: No route to host)
[22:29] <sagewk> tv_ it's a bug
[22:29] <sagewk> only features, docs, and cleanups show up
[22:29] <Tv_> ah, ok
[22:32] <dmick> agh. so if I click on a backlog line by mistake, how do I get out of edit mode without changing the issue?
[22:32] <sagewk> escape
[22:33] * ninkotech (~duplo@ has joined #ceph
[22:39] <gregaf> nhm: you've got like 14 machines in plana locked across 3 boxes that haven't been upgraded to precise; are you actually using them?
[22:39] <gregaf> similar to you sagewk and joshd
[22:39] <nhm> gregaf: yes, those are the aging cluster nodes
[22:39] <gregaf> I'm trying to loop a 3-machine test and keep on getting errors on failing to lock because there's only 5 machines unlocked right now (when I unlock mine)
[22:39] <gregaf> oh, I thought all those nodes were in burnupi :(
[22:40] <nhm> gregaf: osds are burnupi, clients and mons are plana
[22:40] <gregaf> ...why?
[22:40] <nhm> gregaf: clients and mons don't need lots of disks
[22:41] <gregaf> clients and mons don't need anything -- stick them all on one disk together ;)
[22:41] <gregaf> we really need to get these machines turned into VMs so we aren't limited to 4 each when the nightlies are running
[22:41] <nhm> gregaf: Don't the mons cause some nasty seek behavior without syncfs?
[22:42] <gregaf> I dunno, but in an aging cluster the contention wouldn't matter anyway I don't think
[22:43] <nhm> gregaf: Ok, well, I'm cool with reducing the number of plana nodes I'm using for those tests.
[22:44] * The_Bishop (~bishop@2a01:198:2ee:0:c8d8:5171:ce12:71c1) Quit (Read error: Connection reset by peer)
[22:44] <nhm> gregaf: Not sure who is actually in charge of deciding that. ;)
[22:44] * The_Bishop (~bishop@2a01:198:2ee:0:c8d8:5171:ce12:71c1) has joined #ceph
[22:45] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:50] <jmlowe> can somebody refresh my memory on what active+clean+inconsistent means?
[22:53] <sagewk> jmlowe: scrub found an inconsistency :(
[22:53] <sagewk> you should see a log message about it in the cluster log, $mon_data/log.err
[22:53] <sagewk> (soon to be moved to /var/log/ceph/$cluster.log in 0.48)
[22:57] <dmick> and as I've recently had it explained to me, scrub compares objects, primary to replica(s)?...
[22:57] <dmick> at least metadata about them, not sure which
[22:58] <elder> What a fricken nightmare.
[22:59] <Tv_> elder: which part?
[22:59] <elder> My kernel builds are now failing.
[22:59] <elder> Sorry, unrelated.
[22:59] <Tv_> ah, heh
[22:59] <Tv_> planning time
[22:59] <elder> Yup.
[23:17] * adjohn (~adjohn@ Quit (Ping timeout: 480 seconds)
[23:17] * The_Bishop (~bishop@2a01:198:2ee:0:c8d8:5171:ce12:71c1) Quit (Ping timeout: 480 seconds)
[23:52] <elder> sagewk, the problem with the builds I was seeing I think affects only UML, with the updated (3.5-rc1 based) kernel code.
[23:58] <jmlowe> What would happen if given this message: 2012-06-13 17:31:39.142128 osd.16 11603 : [ERR] 2.262 osd.5 missing c104de62/rb.0.3c.0000000088d4/head
[23:58] <jmlowe> I copied ./current/2.262_head/DIR_2/DIR_6/rb.0.3c.0000000088d4__head_C104DE62 from osd.16 to osd.5
[23:58] <jmlowe> ?
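
The error message and the file path in jmlowe's question line up in a visible pattern: the object name, "head" and the upper-cased hash form the file name, and the directory levels are the trailing hex digits of the hash taken from the end. The sketch below reproduces only that pattern as seen in these two lines. The function `object_path_guess` and the fixed `depth=2` are assumptions made for illustration (the real nesting depth varies and names may need escaping), and it does not answer whether copying the file by hand is safe.

```python
def object_path_guess(pgid, name, snap, hash_hex, depth=2):
    """Rebuild the on-disk path pattern seen above from the log-message fields."""
    digits = hash_hex.upper()
    # Directory levels use the hash digits from the end: ...E62 -> DIR_2/DIR_6.
    dirs = ["DIR_%s" % digits[-(i + 1)] for i in range(depth)]
    filename = "%s__%s_%s" % (name, snap, digits)
    return "/".join(["./current", "%s_head" % pgid] + dirs + [filename])


# Fields taken from "2.262 osd.5 missing c104de62/rb.0.3c.0000000088d4/head".
print(object_path_guess("2.262", "rb.0.3c.0000000088d4", "head", "c104de62"))
```
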

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.