#ceph IRC Log

IRC Log for 2012-03-22

Timestamps are in GMT/BST.

[0:06] * lofejndif (~lsqavnbok@19NAAHI40.tor-irc.dnsbl.oftc.net) Quit (Quit: Leaving)
[0:16] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[0:16] * The_Bishop (~bishop@178-17-163-220.static-host.net) Quit (Quit: Who the hell is this Peer? If I catch him, I'll reset his connection!)
[0:20] * perplexed (~perplexed@mobile-198-228-210-160.mycingular.net) Quit (Quit: perplexed)
[0:23] * Tv|work (~Tv_@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[0:27] * Kioob (~kioob@luuna.daevel.fr) Quit (Ping timeout: 480 seconds)
[0:38] * gregorg_taf (~Greg@78.155.152.6) has joined #ceph
[0:38] * gregorg (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[0:49] * Tv__ (~tv@cpe-24-24-131-250.socal.res.rr.com) has joined #ceph
[0:59] * yoshi (~yoshi@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:03] * The_Bishop (~bishop@178-17-163-220.static-host.net) has joined #ceph
[1:24] * lofejndif (~lsqavnbok@exit-01c.noisetor.net) has joined #ceph
[2:07] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:08] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[2:08] * jmlowe1 (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[2:11] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Read error: Operation timed out)
[2:37] * BManojlovic (~steki@212.200.240.216) Quit (Ping timeout: 480 seconds)
[3:22] * chutzpah (~chutz@216.174.109.254) Quit (Quit: Leaving)
[3:25] * lofejndif (~lsqavnbok@19NAAHI9K.tor-irc.dnsbl.oftc.net) Quit (Quit: Leaving)
[3:54] * imjustmatthew (~imjustmat@pool-96-228-59-130.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[4:06] * perplexed (~perplexed@c-76-21-85-168.hsd1.ca.comcast.net) has joined #ceph
[4:20] * perplexed (~perplexed@c-76-21-85-168.hsd1.ca.comcast.net) Quit (Quit: perplexed)
[4:36] * perplexed (~perplexed@c-76-21-85-168.hsd1.ca.comcast.net) has joined #ceph
[4:38] * imjustmatthew (~imjustmat@pool-96-228-59-130.rcmdva.fios.verizon.net) has joined #ceph
[4:46] * perplexed (~perplexed@c-76-21-85-168.hsd1.ca.comcast.net) Quit (Quit: perplexed)
[5:02] * d405 (~nobody@un.interestingsh.it) Quit (Read error: Operation timed out)
[5:39] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) Quit (Quit: Leaving.)
[5:39] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[5:40] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has left #ceph
[5:42] * d405 (~nobody@2001:470:1f11:45b::11) has joined #ceph
[5:47] * cattelan is now known as cattelan_away
[6:09] * wlbilljg (~wgallaghe@nat-204-14-239-208-sfo.net.salesforce.com) Quit (Read error: Operation timed out)
[6:54] * tnt_ (~tnt@194.11-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[7:46] * Tv__ (~tv@cpe-24-24-131-250.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[8:02] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[8:52] * tnt_ (~tnt@ptr-91-87-217-70.mobistar.be) has joined #ceph
[9:15] * tnt_ (~tnt@ptr-91-87-217-70.mobistar.be) Quit (Ping timeout: 480 seconds)
[9:21] * BManojlovic (~steki@212.200.240.216) has joined #ceph
[10:05] * Theuni (~Theuni@195.62.106.110) has joined #ceph
[10:07] * The_Bishop (~bishop@178-17-163-220.static-host.net) Quit (Quit: Who the hell is this Peer? If I catch him, I'll reset his connection!)
[10:31] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[10:32] <lxo> should there be leveldb tarballs, or should leveldb code be in the ceph tarball?
[10:36] <lxo> hmm... odd. it looks like leveldb *is* part of the tarball, but the rpm build fails as if it wasn't
[10:36] <NaioN> lxo: dependencies?
[10:37] <NaioN> I've noticed some mails on the mailing list concerning some dependencies, if I remember correctly
[10:39] <lxo> no, it fails to include leveldb headers. lemme look into that
[10:39] <rz> is ceph doing round-robin when a volume is distributed, or is there a specific algorithm that's used to spread the file over the nodes so one node can be full without write problems?
[10:42] <NaioN> rz: it uses an algorithm
[10:42] <lxo> placement of file fragments is controlled by crushmap rules. files can be distributed across multiple nodes, but if any single node becomes full, the entire filesystem is full, so you'd better have crushmap rules that take node size into account
[10:44] <lxo> NaioN, it appears to be some make -j issue
[10:44] <NaioN> hmmm yeah
[10:52] <lxo> nope. that's not it. very odd. if I get into the rpm build dir and type make, it works, but the make invocation by rpmbuild fails, trying to build src/leveldb/db/ stuff but not finding include/leveldb headers
[10:53] <lxo> aaah, it seems like a misuse of CFLAGS/CXXFLAGS in leveldb/Makefile.am. it should be using CPPFLAGS or AM_CPPFLAGS for -I flags, rather than the updir-overridden CFLAGS
[10:53] <rz> lxo / NaioN : ok so nodes MUST have the same volume size
[10:54] <rz> or nodes can be heterogeneous
[10:57] <lxo> rz, they can be heterogeneous, but ceph won't rearrange the crush rules so that the size differences are taken into account. if you're going to have enough data that some of the nodes would be filled up with the default weights, you should set up your own crush rules so that fewer PGs are placed on them
[10:59] * Guest7150 (Q@ppp59-167-157-24.static.internode.on.net) Quit (Read error: Connection reset by peer)
[11:00] <Azrael> is there a default/standard size of a PG?
[11:00] <lxo> no
[11:00] <Azrael> is it configurable? or automatically selected?
[11:00] <lxo> but the standard weight is 1.0 for all osds, and placement of PGs is controlled by weight
[11:02] <lxo> the fraction that each osd's weight amounts to WRT the total sum of weights is a good first approximation of the fraction of total space that that osd will use WRT the total space used in the ceph object store
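
The weight arithmetic lxo describes can be sketched in a few lines of Python. This is only an illustration of the first approximation mentioned above, not of what CRUSH actually places. The OSD names, the capacities, and the helper `weights_from_capacity` are made up for the example.

```python
def expected_shares(weights):
    """Return each OSD's weight as a fraction of the total weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {osd: w / total for osd, w in weights.items()}


def weights_from_capacity(capacity_gb, reference_gb):
    """Hypothetical helper: weight 1.0 for an OSD of reference_gb, scaled linearly."""
    return {osd: gb / reference_gb for osd, gb in capacity_gb.items()}


# Made-up cluster: two 1000 GB OSDs and one 500 GB OSD.
capacity_gb = {"osd.0": 1000, "osd.1": 1000, "osd.2": 500}

# Default weights (1.0 everywhere) versus weights proportional to capacity.
default_shares = expected_shares({osd: 1.0 for osd in capacity_gb})
sized_shares = expected_shares(weights_from_capacity(capacity_gb, 1000))
```

With the default weights the small OSD is expected to receive about a third of the data; with capacity-proportional weights it is expected to receive about a fifth, which is why heterogeneous nodes call for adjusted weights.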
[11:02] <Azrael> i read in one of the ceph slidesets/presentations that a 1.0 release would be out spring 2011. i take it that's not on schedule anymore?
[11:03] <lxo> dunno about that, I'm just a user and occasional contributor
[11:04] <Azrael> ok
[11:10] * yoshi (~yoshi@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:18] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[11:20] <rz> lxo: thanks for your answers
[11:26] * BManojlovic (~steki@212.200.240.216) Quit (Ping timeout: 480 seconds)
[11:33] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[11:45] * Theuni (~Theuni@195.62.106.110) Quit (Quit: Leaving.)
[11:46] * Theuni (~Theuni@195.62.106.91) has joined #ceph
[12:02] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Quit: Ex-Chat)
[12:58] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[13:28] * tnt_ (~tnt@91-64-60-154-dynip.superkabel.de) has joined #ceph
[13:32] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[13:33] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[13:45] <Azrael> hmm
[13:45] <Azrael> i'm getting errors like this:
[13:46] <Azrael> log 2012-03-22 13:45:26.581567 mon.0 172.26.0.10:6789/0 109 : [INF] osd.0 172.26.0.30:6800/3324 failed (by osd.2 172.26.0.32:6800/1601)
[13:46] <Azrael> i just added osd.0 to my pool (i started with osd.1 and osd.2 and forgot about 0)
[13:46] <Azrael> i adjusted the crushmap
[13:46] <Azrael> and replication is occurring, but not to osd.0
[13:46] <Azrael> just been 1 and 2
[13:46] <Azrael> nothing has been placed on osd.0
[13:46] <Azrael> and... osd.0 for some reason stops listening on 6800/6801/6802
[13:47] <Azrael> log 2012-03-22 13:46:55.901923 osd.0 172.26.0.30:6800/3324 4 : [WRN] map e167 wrongly marked me down or wrong addr
[13:47] <Azrael> what's that about too?
[13:48] * wlbilljg (~wgallaghe@nat-204-14-239-208-sfo.net.salesforce.com) has joined #ceph
[14:29] * Theuni1 (~Theuni@195.62.106.110) has joined #ceph
[14:32] * Theuni (~Theuni@195.62.106.91) Quit (Ping timeout: 480 seconds)
[15:11] * lofejndif (~lsqavnbok@9YYAAERTK.tor-irc.dnsbl.oftc.net) has joined #ceph
[15:14] * tnt_ (~tnt@91-64-60-154-dynip.superkabel.de) Quit (Remote host closed the connection)
[15:29] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:29] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit ()
[15:30] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:46] * cattelan_away is now known as cattelan
[15:49] * f4m8 (~f4m8@lug-owl.de) has left #ceph
[15:52] * jmlowe1 (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[16:32] <Azrael> looks like it was a version compatibility issue. 0.43 couldn't talk to an 0.44 osd.
[16:37] * sagelap (~sage@HSI-KBW-46-237-223-250.hsi.kabel-badenwuerttemberg.de) has joined #ceph
[16:37] <sagelap> elder: there?
[16:37] <elder> Yes.
[16:37] <sagelap> just replied to your email :)
[16:38] <sagelap> why not rebase on v3.3? since we have to do it anyway, may as well use a stable release
[16:38] <elder> Not a bad idea I guess.
[16:38] <elder> Linus has in the past not liked to see things rebased like that.
[16:38] <elder> But that's more like in-between releases.
[16:38] <Azrael> 2012-03-22 16:38:39.732619 log 2012-03-22 16:38:34.604020 osd.2 172.26.0.32:6800/3666 5957 : [WRN] old request osd_op(mds.0.2:595 200.0000000d [write 1706575~21926] 1.11cbcd3) v4 received at 2012-03-22 16:38:04.590265 currently waiting for sub ops
[16:38] <sagelap> really?
[16:38] <sagelap> yeah
[16:38] <Azrael> weeeh heh
[16:39] <elder> I'll do that. It's pretty simple.
[16:39] <elder> My XFS practice was to wait to rebase until -rc1.
[16:39] <elder> Anyway, I'm building with 3.2-rc3 as a base right now. I'll rebase that for testing.
[16:40] <elder> (i.e., testing branch based on 3.3)
[16:40] <sagelap> hmm
[16:40] <sagelap> yeah ok.. sounds good!
[16:40] <elder> How's Germany?
[16:40] <sagelap> phew, glad that's sorted out.. annoying mystery
[16:40] <sagelap> pretty good!
[16:40] <elder> It's been bugging me for a long time.
[16:40] <elder> (Not Germany)
[16:41] <sagelap> my only real complaints are tmobile's data roaming charges (criminal) and lack of any international data plan (sad). i'll be moving back to at&t next chance i get
[16:41] <Azrael> i think i'll be visiting germany in early april
[16:41] <Azrael> since it's a skip away
[16:41] <elder> I almost dropped one more patch when I updated... But caught it because I want to be pretty damned careful. I have an international phone that we bought for my daughter when she went to Greece last year.
[16:42] <elder> It sits idle until we need it.
[16:42] <elder> I'd love to need it...
[16:42] <Azrael> at&t had a plan you could tack onto your current phone plan for data + voice at an ok rate
[16:42] <Azrael> i mean still expensive, but not 2012-03-22 16:38:39.732619 log 2012-03-22 16:38:34.604020 osd.2 172.26.0.32:6800/3666 5957 : [WRN] old request osd_op(mds.0.2:595 200.0000000d [write 1706575~21926] 1.11cbcd3) v4 received at 2012-03-22 16:38:04.590265 currently waiting for sub ops
[16:42] <Azrael> whoops
[16:42] <Azrael> but not expensive compared to if you didn't sign up for the plan
[16:43] <sagelap> azrael: as long as it's cheaper than $700 after ~1 hr of google navigation :)
[16:44] <Azrael> whoooooooooooooaaaaaaaah
[16:44] <elder> Yikes
[16:44] <Azrael> yah it was like $60/mo for 125mb of data
[16:44] <elder> No WiFi at the conference?
[16:44] <elder> Oh, navigation...
[16:44] <elder> you mean GPS data.
[16:44] <sagelap> there is, but that's not helpful on the train or driving...
[16:45] <elder> I mean maps. You should buy a map.
[16:45] <Azrael> i relied on that at&t world traveler plan until i finally got a danish cellphone
[16:45] <elder> Old skool.
[16:45] <sagelap> also deutsche bahn doesn't have wifi on most trains, which is disappointing.
[16:45] <elder> First world problem.
[16:45] <sagelap> elder: yeah :)
[16:45] <Azrael> heh
[16:46] <Azrael> i'm untarring the linux kernel onto ceph mounted via FUSE. it's going at about 5 MB/s and i'm seeing lots of messages printed by ceph -w like so:
[16:46] <Azrael> 2012-03-22 16:45:49.654424 log 2012-03-22 16:45:44.860506 osd.2 172.26.0.32:6800/3666 17375 : [WRN] old request osd_op(mds.0.2:2440 200.0000002a [write 1656690~28343] 1.5bf49b38) v4 received at 2012-03-22 16:45:14.698814 currently waiting for sub ops
[16:46] <Azrael> yet if i move just one large file onto the same mount, i have no problems.
[16:47] <Azrael> is this an issue? or to be expected?
[16:48] <sagelap> the message just means the osds are backed up. probably having trouble with lots of small requests.
[16:48] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[16:48] <sagelap> those code paths need to be tuned/optimized
[16:48] <sagelap> you can make the warnings go away by changing the threshold for the timeout... doesn't "fix" it but makes it shut up
[16:49] <Azrael> ahh interesting
[16:49] <sagelap> 'osd op complaint time = 60' or something
[16:49] <Azrael> there are no integrity issues though, right? it's just being a whiny baby but it will eventually succeed?
[16:50] <Azrael> hmm. that osd op complaint command. pass to the ceph cli, yeah?
[16:51] <sagelap> right
[16:51] <sagelap> stick it in ceph.conf
[16:52] <sagelap> and/or to inject it you can 'ceph osd tell \* injectargs "--osd-op-complaint-time 60"'
[16:52] <Azrael> nice
[16:53] <Azrael> thanks sagelap
[16:53] <sagelap> np
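
To see how long a request had been waiting when one of these warnings was printed, the two timestamps in the message can be subtracted. The Python sketch below does that for the warning Azrael pasted at 16:46. The regular expression is an assumption fitted to that one sample line, not a description of Ceph's log format in general.

```python
import re
from datetime import datetime

TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}"
# Assumed layout: "<logged> osd.N ... [WRN] old request ... received at <received>"
OLD_REQUEST = re.compile(
    rf"(?P<logged>{TS}) (?P<osd>osd\.\d+) .*\[WRN\] old request .* "
    rf"received at (?P<received>{TS})"
)


def request_age_seconds(line):
    """Return (osd, age in seconds) for an 'old request' warning, else None."""
    match = OLD_REQUEST.search(line)
    if match is None:
        return None
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    logged = datetime.strptime(match.group("logged"), fmt)
    received = datetime.strptime(match.group("received"), fmt)
    return match.group("osd"), (logged - received).total_seconds()


line = (
    "2012-03-22 16:45:49.654424 log 2012-03-22 16:45:44.860506 osd.2 "
    "172.26.0.32:6800/3666 17375 : [WRN] old request "
    "osd_op(mds.0.2:2440 200.0000002a [write 1656690~28343] 1.5bf49b38) v4 "
    "received at 2012-03-22 16:45:14.698814 currently waiting for sub ops"
)
osd, age = request_age_seconds(line)
```

For this sample the request is a little over 30 seconds old, which is consistent with a complaint threshold of about 30 seconds. That threshold is an inference from the log, not something stated in the channel.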
[16:53] * sagelap (~sage@HSI-KBW-46-237-223-250.hsi.kabel-badenwuerttemberg.de) has left #ceph
[16:53] * sagelap (~sage@HSI-KBW-46-237-223-250.hsi.kabel-badenwuerttemberg.de) has joined #ceph
[16:55] <Azrael> btw... what is "rjenkins"
[16:55] <Azrael> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 lpg_num 2 lpgp_num 2 last_change 78 owner 0 crash_replay_interval 45
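
The question goes unanswered in the channel. `object_hash rjenkins` most likely names the Robert Jenkins hash function used to map object names to placement groups. As a small aid for reading such lines, the Python sketch below splits the pasted pool line into fields. The field layout is inferred from this single sample and may not hold for other versions.

```python
import re

POOL_LINE = re.compile(r"^pool (?P<id>\d+) '(?P<name>[^']+)' (?P<rest>.*)$")


def parse_pool_line(line):
    """Parse one pool line into a dict; the layout is inferred from the sample."""
    match = POOL_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a pool line")
    # "rep size" is the only two-word key in the sample; join it so the
    # remainder is a flat sequence of key/value pairs.
    tokens = match.group("rest").replace("rep size", "rep_size").split()
    if len(tokens) % 2:
        raise ValueError("expected key/value pairs")
    fields = {"id": int(match.group("id")), "name": match.group("name")}
    for key, value in zip(tokens[::2], tokens[1::2]):
        fields[key] = int(value) if value.isdigit() else value
    return fields


pool = parse_pool_line(
    "pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins "
    "pg_num 192 pgp_num 192 lpg_num 2 lpgp_num 2 last_change 78 owner 0 "
    "crash_replay_interval 45"
)
```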
[17:01] * lofejndif (~lsqavnbok@9YYAAERTK.tor-irc.dnsbl.oftc.net) Quit (Quit: Leaving)
[17:07] * Theuni1 (~Theuni@195.62.106.110) Quit (Ping timeout: 480 seconds)
[17:09] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[17:23] <iggy> sagelap: most people I've talked to have a world phone and just get a prepaid sim when they land
[17:24] <sagelap> iggy: having to use a different phone number sort of defeats the purpose of having a cell phone.
[17:24] <sagelap> i guess i could use a different phone for data..
[17:24] <iggy> call forward?
[17:24] <sagelap> mostly i'm just frustrated that it's 2012 and I'm no better off than 10 years ago
[17:24] <iggy> I don't know how they solved that
[17:25] * Tv|work (~Tv_@aon.hq.newdream.net) has joined #ceph
[17:25] <sagelap> and want to complain loudly about tmobile sucking :)
[17:25] <iggy> the people I'm referring to were mostly interested in data
[17:25] <iggy> don't blame you on that last one
[17:34] * aliguori (~anthony@32.97.110.59) has joined #ceph
[17:34] * perplexed (~ncampbell@216.113.168.141) has joined #ceph
[17:39] <perplexed> Hi, I'm working on compiling/installing Ceph on RHEL 6.1 (2.6.32). Anything I should be aware of? Also, is ceph-0.44 the recommended version currently? I'm assuming I don't -need- to patch the kernel (?)
[17:41] <sagelap> you should be okay on xfs or ext4 (obviously no btrfs on that kernel) for the OSDs
[17:46] <elder> Tv|work, my attempts to boot a new kernel on a teuthology test are failing. They seem to be looking in the wrong place for the kernel (oneiric, not maverick?)
[17:46] <sagelap> what url is it fetching from?
[17:47] <sagelap> you may need to git pull, the location changed recently
[17:47] <elder> http://gitbuilder.ceph.com/kernel-deb-oneiric-x86_64-basic/sha1/53ea8b149e86005868876c1bc29665766df38ff2/
[17:47] <elder> I did git pull and rebuilt teuthology
[17:47] <elder> Or do I need to rebuild something else.
[17:47] <elder> ?
[17:47] <Tv|work> new sepia is oneiric, old sepia was maverick
[17:48] <sagelap> oh, it's just bilding... see http://ceph.newdream.net/gitbuilder.cgi
[17:48] <sagelap> building
[17:48] <perplexed> Thx. I'd been looking at http://ceph.newdream.net/download/ for the source
[17:49] <sagelap> perplexed: that's the place to look (unless you want debs or the git tree)
[17:50] <perplexed> I'll see if I can get the OS changed to something that will support btrfs too. Any recommendations?
[17:51] <perplexed> Was thinking Ubuntu 11.10 might be a better option on a number of fronts
[17:52] <elder> Maybe gitbuilder-kernel-amd64.ceph.newdream.net should point to the one that's going to be used then. (That's where my bookmark went)
[17:53] <elder> And since there are no more old sepia machines, well, could that machine be put to better use on oneiric builds?
[17:54] <elder> No, wait.
[17:55] <elder> I guess I don't know where to look to see if my build is done.
[17:56] <Tv|work> elder: gitbuilder names are mostly historical at this point and will get redone as part of the move to new vm hosting
[17:56] <sagelap> elder that's the right place actually
[17:56] <Tv|work> oh wow, didn't expect to see this: http://www.debian.org/misc/children-distros#stonegate
[17:56] <elder> OK, well that shows wip-master is done. My yaml file specified that, but failed. Also tried sha1 53ea8b149e86005868876c1bc29665766df38ff2 but no.
[17:56] <Tv|work> (i built that thing!)
[17:57] * wlbilljg (~wgallaghe@nat-204-14-239-208-sfo.net.salesforce.com) has left #ceph
[17:57] <sagelap> elder: checking..
[17:58] <elder> biab
[17:59] <sagelap> elder: looks like it's there
[18:01] * BManojlovic (~steki@212.200.240.216) has joined #ceph
[18:04] * The_Bishop (~bishop@178-17-163-220.static-host.net) has joined #ceph
[18:09] <elder> OK, it seems to be working now. Any thoughts on why the web interface showed it built, but the package wasn't available when I tried to use it?
[18:09] * sagelap (~sage@HSI-KBW-46-237-223-250.hsi.kabel-badenwuerttemberg.de) Quit (Read error: Operation timed out)
[18:10] * tjikkun_ (~tjikkun@82-169-255-84.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[18:13] * chutzpah (~chutz@216.174.109.254) has joined #ceph
[18:15] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[18:25] <elder> Tv|work, is it possible plana33, plana34, and plana93 are now dead? I just initiated installation of the 53ea8b149... kernel mentioned above in my yaml file, and I now have "no route to host" when I try to ssh in. I restarted my VPN just in case.
[18:30] <Tv|work> elder: sounds like the kernel didn't boot
[18:33] <elder> So how can I resolve that?
[18:33] <Tv|work> elder: i don't have the RAM to start a remote console now, sorry
[18:34] <elder> So... lock some more nodes?
[18:39] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[18:41] * adjohn (~adjohn@50.56.129.169) has joined #ceph
[18:47] * sagelap (~sage@HSI-KBW-46-237-223-250.hsi.kabel-badenwuerttemberg.de) has joined #ceph
[18:47] <elder> Tv|work, I don't want to consume your time. But is it likely that attempting to boot the same kernel on other nodes will make them die as well? I don't want to kill off systems, but I'm stuck.
[18:48] <Tv|work> elder: if it happened three times already, it sounds likely it's a bad kernel
[18:49] <elder> And without a console there's no way to see more, right?
[18:50] <elder> I have a different kernel I wanted to test. Any objection to trying that on a different plana machine?
[18:51] <joshd> elder: maybe try it on just one machine first?
[18:51] <elder> Right. I will.
[18:54] <elder> Is this a reasonable minimal roles specification: - [mon.a, osd.0, mds.a, client.0]
[18:55] <joshd> elder: you can just do - [client.0], and have only the interactive task and the kernel install
[18:55] <elder> Thank you.
[18:56] <elder> That's the only task, master.yaml:- interactive:
[18:56] <elder> ?
[18:56] <elder> (no master)
[18:57] <joshd> yeah, there's no requirement that you run ceph or anything like that
[18:57] <dmick> elder: I can try to look at the dead machines; I wanted to anyway
[18:58] <elder> I would if I knew how.
[18:58] <dmick> you don't want to know how
[18:58] <elder> OK, then I won't then.
[18:58] <Tv|work> these boxes have embedded Cthulhu.. looking makes you insane
[18:58] <dmick> and you're happy about the insanity
[18:58] <elder> You've been looking at them then, The_Bishop
[18:58] <elder> Tv|Work?
[18:59] <elder> (Whoops)
[18:59] <dmick> awesome. even the ipmi is down. whee.
[19:00] <Tv|work> oh i was crazy to take the task in the first place
[19:00] <Tv|work> nothing to lose here
[19:12] <elder> I'm sorry to report that plana16 may now be dead as well.
[19:14] <elder> I had oddities an hour ago with the build server. Do you suppose it's related? It would be nice if someone else could try to (not) kill a machine by specifying an alternate kernel in a yaml file.
[19:15] <elder> Meanwhile, my own machine is acting up, I'm going to reboot.
[19:22] <dmick> elder: so 33, at least, appears to be in the kernel, but there were some startup failures (maybe including networking)
[19:22] <dmick> trying to get more info
[19:25] * sagelap (~sage@HSI-KBW-46-237-223-250.hsi.kabel-badenwuerttemberg.de) Quit (Ping timeout: 480 seconds)
[19:29] * tnt_ (~tnt@91-64-60-154-dynip.superkabel.de) has joined #ceph
[19:30] * tnt_ (~tnt@91-64-60-154-dynip.superkabel.de) Quit ()
[19:36] <elder> dmick, I'm back online again.
[19:36] <elder> Any more info on the startup failures?
[19:39] * adjohn (~adjohn@50.56.129.169) Quit (Quit: adjohn)
[19:40] <dmick> not yet
[19:40] <dmick> resetting the DRAC on 33
[19:43] <dmick> elder, did you see I'd said so 33, at least, appears to be in the kernel, but there were some startup failures (maybe including networking)
[19:43] <elder> Yes I saw that.
[19:43] <dmick> it is still printing messages, so not hung (got some usb warnings)
[19:43] <dmick> (probably from me resetting the drac)
[19:43] <dmick> will try powercycling
[19:43] <elder> What's a drac?
[19:44] <dmick> the parasitic system controller computer
[19:47] <dmick> ok, power cycling seems to have worked
[19:47] <dmick> I shall watch its booting
[19:49] * tjikkun_ (~tjikkun@82-169-255-84.ip.telfort.nl) has joined #ceph
[19:50] <dmick> bnx2 complained about loading a firmware file
[19:50] <dmick> network config...
[19:51] <dmick> still waiting for network config to succeed...
[19:52] <dmick> Starting automatic crash report generation failed
[19:52] <dmick> Starting ISC DHCP server dhcpd failed
[19:52] <dmick> now at same point it was
[19:53] <elder> So not the kernel, it's some sort of DHCP problem
[19:53] <dmick> so yeah, the problem is the networking is boned, but I don't know why
[19:53] <darkfader> dmick: i have read a lot of people crying about bnx2 being broken in rhel5.8/6.2
[19:53] <dmick> well it's acting like it has not configured the network devices
[19:54] <dmick> maybe I can get it to give me a serial console if I can get into Grub
[19:55] <dmick> if some recent kernel change broke the network driver, that would explain this I guess
[19:55] <dmick> grub. cool.
[20:01] <dmick> so, one problem is that serial console should be configured for ttyS1, not ttyS0
[20:02] <elder> OK. That's a teuthology scripting thing, right?
[20:02] <dmick> I dunno; does teuthology set up Grub?
[20:03] <elder> I think so--when I request a kernel by branch or sha1 or tag, it has to set up grub to boot the specified kernel.
[20:04] <dmick> it may be inheriting the serial console setup tho
[20:04] <dmick> but the console doesn't appear to be helping; still stuck at
[20:04] <dmick> * Stopping System V runlevel compatibility [ OK ]
[20:09] <dmick> ok, got to the console through VT switch. Indeed, networking is not up
[20:10] <Tv|work> what made that box use dhcp in the first place, if elder only replaced the kernel?
[20:10] <dmick> I don't think it is
[20:11] <dmick> that was just a guess from not much output
[20:24] * imjustmatthew (~imjustmat@pool-96-228-59-130.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[20:24] <dmick> ok. eth0 will not come up because bnx2 demands firmware version 1b
[20:25] <dmick> bnx2-mips-09-6.2.1b.fw
[20:25] <dmick> so if you're going to run that driver you have to install that fw as well it seems
[20:25] <dmick> ^ elder
[20:25] <elder> OK, well, how do I do that? gitbuilder has the config file that I need I guess.
[20:26] <elder> I don't know anything about the hardware...
[20:26] <dmick> gitbuilder may well have built the fw packages
[20:26] <dmick> but typically they're not upgraded IIRC
[20:26] <elder> I'm just trusting that when I specify it by name in the yaml file, it will do all the magic things that are needed.
[20:26] <dmick> which is a thing we should probably fix with chef. generally it's not a problem
[20:26] <dmick> I wondered about this, because /lib/firmware is not versioned
[20:27] <dmick> but apparently it's kept as a strictly-expanding set
[20:27] <elder> Any reason why this has started happening now?
[20:27] <dmick> because bnx2 changed
[20:27] <dmick> and now it requires the newer fw
[20:27] <dmick> (the newer fw file)
[20:27] <dmick> maybe I should make that clear: this is all in /lib/firmware, and installed by the drivers when they need it
[20:28] <elder> OK. So what is it I need to do? Or is this something something else needs to fix?
[20:28] <dmick> you might be able to roll back bnx2 in your kernel build too, but that could get ugly
[20:28] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[20:28] <dmick> somehow, someone needs to update the firmware package, or use an earlier bnx2 driver
[20:29] <iggy> I think there's a linux-firmware package on kernel.org somewhere
[20:30] <elder> OK, the first kernel I tried was based on Linux 3.3-rc3. The second one was based on 3.3 (final). Prior to that, kernels were based on 3.2. Did the change in the bnx2 driver requirements occur then sometime between 3.2 (final) and 3.3-rc2?
[20:30] <elder> I mean 3.3-rc3
[20:31] <NaioN> elder: we have the same problem
[20:32] <NaioN> dmick: where to get that fw?
[20:32] * imjustmatthew (~imjustmat@pool-96-228-59-130.rcmdva.fios.verizon.net) has joined #ceph
[20:33] <NaioN> the firmware shipped with 3.3 doesn't seem the right one... we have trouble with that one
[20:33] <elder> linux-firmware_1.60_all.deb
[20:33] <elder> (That is from the Oneiric package list)
[20:35] <NaioN> hmmm but oneiric doesn't ship the 3.3 kernel...
[20:35] <elder> Right.
[20:36] <NaioN> or do you have a ppa?
[20:36] <elder> I'm just saying, the machine is running oneiric, I'm trying to load a 3.3 kernel.
[20:36] <elder> http://marc.info/?l=linux-netdev&m=132616185703573&w=2
[20:36] <jmlowe> http://kernel.ubuntu.com/~kernel-ppa/mainline/
[20:37] <NaioN> jmlowe: are the fw's included in that package?
[20:37] <jmlowe> beats me, don't use them
[20:38] <NaioN> elder: thanks... so we have to get it from the git :)
[20:38] <elder> I don't know.
[20:39] <iggy> http://git.kernel.org/?p=linux/kernel/git/firmware/linux-firmware.git;a=blob;f=bnx2/bnx2-mips-09-6.2.1b.fw;h=8bd1e7992f55ae876d88a8af73eb4c0b4ed5972f;hb=HEAD
[20:39] <elder> I pasted that link before I had finished doing research. I think I have an updated git repository link that I'm downloading now.
[20:39] <NaioN> iggy: thx
[20:39] <elder> Yes, that one.
[20:41] <elder> So dmick do you know what to do with the firmware file?
[20:45] <elder> Can we simply update what's found under /lib/firmware on the target system so it includes the contents of the bnx2 directory in the linux-firmware git tree?
[20:46] <NaioN> I would think so
[20:47] <NaioN> the driver in the kernel looks for a certain fw file in /lib/firmware
[20:47] <NaioN> so if it's the correct one it will load that file into the card
[20:48] <jmlowe> http://packages.debian.org/squeeze/firmware-bnx2 would seem to indicate that it's universal
[20:48] <NaioN> different kernels with different driver versions can refer to the same fw file
[20:49] <elder> dmick, Tv|work, at this point I'm dependent on you guys to get those down machines working. Meanwhile I may try to grab one more and manually update /lib/firmware before trying once more to reboot into a 3.0 kernel.
[20:49] <NaioN> but it seems that the driver in the 3.3 kernel refers to a newer fw file
[20:49] <NaioN> elder: you can have as many fw files beside each other as you want
[20:49] <jmlowe> yep, as long as it doesn't live under /lib/firmware/$(uname -r) then it can be used with any kernel
[20:51] <NaioN> jmlowe: yep, as long as the driver refers to that fw file/version
[20:54] <elder> These appear in the git repository but not on plana63:/lib/firmware: bnx2-mips-06-6.2.3.fw
[20:54] <elder> bnx2-mips-09-6.2.1b.fw
[20:56] <elder> Oh and there's a bunch under directory bnx2x also.
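
The comparison elder is doing by hand, checking which firmware files from the git tree are absent on the target machine, can be sketched in Python. The file names come from the discussion (placing the second one under `bnx2/` is an assumption), and `missing_firmware` is a hypothetical helper, not part of teuthology or any distribution tool.

```python
from pathlib import Path

# Names taken from the discussion; the bnx2/ prefix on the second is assumed.
REQUIRED = ["bnx2/bnx2-mips-09-6.2.1b.fw", "bnx2/bnx2-mips-06-6.2.3.fw"]


def missing_firmware(required, firmware_root="/lib/firmware"):
    """Return the required firmware files that are absent under firmware_root."""
    root = Path(firmware_root)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_firmware(REQUIRED)` on a machine lists whatever still needs to be copied into `/lib/firmware` before booting a kernel whose driver requests those files.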
[21:00] <dmick> to repair them, we need to update /lib/firmware
[21:00] <Tv|work> elder, dmick: so some brave volunteer needs to change teuthology/task/kernel.py to also install the linux-firmware-image deb
[21:00] <dmick> I believe we can do that by rebooting them into a different kernel and installing the appropriate package by hand
[21:00] <elder> I'm going to try it manually on plana63 right now.
[21:00] <dmick> ideally, yes, what Tv|work just said
[21:01] <dmick> one should probably open an issue on the problem, one should
[21:02] <darkfader> or just sleep for 2 weeks and wait till it shows up in generic distro kernels :)
[21:04] <dmick> so the firmware package is on the gitbuilder AFAICT
[21:04] <dmick> http://gitbuilder.ceph.com/kernel-deb-oneiric-x86_64-basic/ref/v3.3/linux-firmware-image_3.3.0-ceph-1_amd64.deb
[21:04] <dmick> I think is right
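
The two gitbuilder URLs pasted in this log share one shape: a base, then `sha1/<hash>/` or `ref/<name>/`. A minimal Python sketch of that pattern follows. The layout is inferred only from those two URLs, and `package_dir_url` is a hypothetical helper.

```python
BASE = "http://gitbuilder.ceph.com/kernel-deb-oneiric-x86_64-basic"


def package_dir_url(kind, value, base=BASE):
    """Build a directory URL of the form <base>/<kind>/<value>/."""
    if kind not in ("sha1", "ref"):
        raise ValueError("kind must be 'sha1' or 'ref'")
    return f"{base}/{kind}/{value}/"


# The firmware package dmick points at, rebuilt from its parts.
firmware_deb = package_dir_url("ref", "v3.3") + "linux-firmware-image_3.3.0-ceph-1_amd64.deb"
```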
[21:05] <elder> So installing that should be nearly equivalent to manually updating them from the git tree?
[21:05] <dmick> however
[21:06] <dmick> that package does not contain the fw file in question
[21:06] <dmick> hmm
[21:06] * rosco_ (~r.nap@188.205.52.204) Quit (Ping timeout: 480 seconds)
[21:06] <dmick> mismerge in the ceph kernel tree?...looking
[21:07] <elder> Do you know what version of the firmware is required?
[21:07] <dmick> yes, see above
[21:07] <dmick> or here's the message again
[21:07] <dmick> [ 1438.040847] bnx2: Can't load firmware file "bnx2/bnx2-mips-09-6.2.1b.fw"
[21:07] * nhorman_ (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[21:07] * nhorman_ (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit ()
[21:08] <elder> I've got all the updated ones loaded on plana63.
[21:08] <elder> Well damn, that's there...
[21:08] <elder> I'm going to try my test again on that updated machine.
[21:08] <dmick> which branch?
[21:08] <elder> wip-testing
[21:09] * rosco (~r.nap@188.205.52.204) has joined #ceph
[21:10] <elder> Here goes.
[21:11] <dmick> ah, I was looking at the gitbuilder for 3.3
[21:13] <dmick> still confused; the gitbuilder output from wip-testing doesn't seem to have that file either
[21:14] <dmick> nor in fact does the source tree
[21:14] <elder> Oh. Sorry. No I installed the firmware files manually from the linux-firmware.git tree on plana63, and then requested it install my kernel.
[21:14] <dmick> there's a separate repo for fw?
[21:15] <dmick> I'm looking in the kernel tree, in firmware/
[21:15] <elder> http://git.kernel.org/?p=linux/kernel/git/firmware/linux-firmware.git;a=summary
[21:15] <elder> I think this is a binary dumping ground.
[21:15] <elder> That way, proprietary firmware can be stuffed there.
[21:15] <dmick> so is this an upstream problem, in that the kernel tree doesn't have it yet?
[21:16] <elder> Without being part of the Linux kernel proper. Maybe.
[21:16] <elder> Maybe it's an upstream problem, I just don't know.
[21:16] <elder> I hardly understand what's going on as it is:>
[21:16] <dmick> (firmware/ is that sort of dumping ground AFAICT)
[21:16] <elder> plana63 came back, running my new kernel.
[21:16] <elder> I.e., adding the missing firmware files worked.
[21:17] <elder> For now we could run a script that updated them all as a short-term fix. Longer term we ought to somehow ensure the right stuff is installed.
[21:17] <dmick> indeed, the file is not in http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git
[21:17] <dmick> and I think it's supposed to be
[21:18] <dmick> https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=786937
[21:18] <elder> Are you able to get anything into the machines I got stuck before? Can you boot an older kernel?
[21:19] <dmick> http://marc.info/?l=linux-netdev&m=132616185703573&w=2
[21:19] <dmick> I should be able to
[21:19] <dmick> rather painfully, but I should be able to
[21:20] <elder> There are four machines now I think. plana33, plana34, plana63, and plana16
[21:20] <elder> If you can get them booted, I can install the firmware on them.
[21:20] <dmick> roger
[21:21] <elder> And once those are cleaned up we can decide what to do about updating firmware on everything.
[21:21] <iggy> surprised this hasn't been hit before... I mean 1 RH bug with a few comments seems low
[21:21] <dmick> I think the right answer is that the new fw needs to be in the kernel tree
[21:21] <dmick> perhaps you could research what Linus's plan is there?...
[21:22] * adjohn (~adjohn@50.56.129.169) has joined #ceph
[21:22] <elder> I'll look into how it's supposed to happen.
[21:22] * LarsFronius (~LarsFroni@f054101151.adsl.alicedsl.de) has joined #ceph
[21:23] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[21:25] <dmick> I believe the newer fw file should be in the kernel firmware/ directory (alongside the older ones). I just don't know why it's not yet.
[21:25] <dmick> since clearly the driver now requires it.
[21:25] <elder> I mean, I'm not sure who is responsible for getting the firmware into the kernel repository.
[21:25] <elder> I'm looking now.
[21:25] <dmick> as in
[21:25] <dmick> http://marc.info/?l=linux-netdev&m=132616185703573&w=2
[21:25] <iggy> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=c2c20ef43d00b1439631e603f8dcee9a803cd8b3
[21:26] <elder> There you go. Thanks.
[21:26] <dmick> http://lists.openwall.net/netdev/2012/02/15/24 is interesting
[21:26] <iggy> so mchan@broadcom.com
[21:26] <dmick> "firmware is no longer distributed with kernel source, there is a separate firmware repo:"
[21:26] <iggy> that is interesting
[21:27] <dmick> I wonder if Harvey speaks with authority
[21:27] <iggy> I hadn't heard that anywhere
[21:28] * sagelap (~sage@HSI-KBW-46-237-223-250.hsi.kabel-badenwuerttemberg.de) has joined #ceph
[21:28] <iggy> http://lwn.net/Articles/294308/
[21:29] <iggy> looks like a license issue that caught on
[21:29] <elder> I saw that article earlier.
[21:29] <iggy> I have a slightly better feeling about trusting woodhouse
[21:29] <elder> That's where I got the impression the purpose was to allow other firmware to be dropped in without full inclusion.
[21:30] <iggy> so looks like that needs to be a kernel dep and be built from... git?
[21:31] <elder> I think, though, it must be just a place to get pre-built firmware for people who don't want to check out the kernel source and build it.
[21:31] <elder> (That's an article from August 2008)
[21:31] <Tv|work> most of that firmware is binary blobs, there is no build
[21:31] <iggy> yeah, you couldn't build that from the kernel souce
[21:31] <iggy> *source
[21:31] <dmick> yeah. I guess the operative thing is: how different is firmware/ from firmware.git
[21:32] <iggy> firmware/bnx2 hasn't been touched in almost 2 years
[21:32] <iggy> so I'd guess at least broadcom has switched to using fw.git
[21:32] <elder> I summarized the changes above. Two new ones under bnx2/ and a bunch more under bnx2x
[21:33] <dmick> ah
[21:34] <iggy> debian has a rec on linux-image for firmware-linux-free
[21:37] <dmick> http://lists.debian.org/debian-kernel/2011/12/msg00432.html et seq would seem to indicate that Canonical is building that way now
[21:37] <dmick> (although I don't know that Tim is a buildmeister so I'm just inferring)
[21:38] <elder> Yep, saw that too. That's where I got the git repository.
[21:39] <elder> I'm sending a note to mchan@broadcom.com right now about this.
[21:39] <elder> As I write it I think maybe it would be handled properly if I were doing a package-based install of the kernel.
[21:40] <elder> The package could specify a prerequisite version of a separate linux-firmware package.
[21:40] <elder> But since I'm just doing a manual kernel install on top of an older os install (oneiric) the dependency isn't resolved for me.
[21:40] <elder> But I'm just speculating.
[21:41] <dmick> but it seems like this is necessarily now a cross-repo dependency, so you'd have to (in the worst case) have built kernel packages and fw packages from two different repos
[21:42] <dmick> hey, git submodule :)
[21:42] <elder> I wonder if Linus knows how to use tyat.
[21:42] <elder> (that0
[21:42] <elder> (I need a new keyboard)
[21:53] * sagelap (~sage@HSI-KBW-46-237-223-250.hsi.kabel-badenwuerttemberg.de) Quit (Ping timeout: 480 seconds)
[22:00] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[22:03] * lofejndif (~lsqavnbok@9YYAAER5G.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:04] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[22:41] * aliguori (~anthony@32.97.110.59) Quit (Remote host closed the connection)
[22:44] * yehudasa_ (~yehudasa@aon.hq.newdream.net) has joined #ceph
[22:44] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[22:44] * dmick1 (~dmick@aon.hq.newdream.net) has joined #ceph
[22:45] * joshd1 (~joshd@aon.hq.newdream.net) has joined #ceph
[22:45] * dmick (~dmick@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[22:45] * gregaf1 (~Adium@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[22:45] * mkampe (~markk@aon.hq.newdream.net) has joined #ceph
[22:46] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[22:49] * Tv|work (~Tv_@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[22:49] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[22:50] * joshd (~joshd@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[22:51] * sjust1 (~sam@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[22:51] * yehudasa__ (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[22:51] * mkampe1 (~markk@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[22:51] * sagewk1 (~sage@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[22:51] <elder> dmick1, it appears you guys are having network problems?
[22:53] <gregaf> elder: I think we just flapped once, dunno why
[22:53] <dmick1> yeah
[22:54] <dmick1> lost me my remote console connection mid-edit; it was awesome
[22:54] <dmick1> reconnecting, I couldn't type
[22:54] * joshd1 is now known as joshd
[22:59] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[23:20] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: Connection reset by peer)
[23:20] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[23:20] * andret (~andre@pcandre.nine.ch) Quit (Ping timeout: 480 seconds)
[23:30] <elder> Got a response from Michael Chan. The firmware not being up-to-date is intentional. He said "They want the firmware to be completely separated from the kernel tree."
[23:32] * andret (~andre@pcandre.nine.ch) has joined #ceph
[23:32] <dmick1> well isn't that special
[23:32] * dmick1 is now known as dmick
[23:33] <dmick> so it's now even easier to make nonbootable kernels. Progress!
[23:33] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:38] * LarsFronius (~LarsFroni@f054101151.adsl.alicedsl.de) Quit (Quit: LarsFronius)
[23:58] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.