#ceph IRC Log


IRC Log for 2012-07-31

Timestamps are in GMT/BST.

[0:02] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[0:06] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:06] * loicd (~loic@magenta.dachary.org) has joined #ceph
[0:07] * EmilienM (~EmilienM@ has left #ceph
[0:15] * Cube (~Adium@ Quit (Quit: Leaving.)
[0:24] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:40] * chutzpah (~chutz@ Quit (Quit: Leaving)
[0:44] * eightyeight (~atoponce@pinyin.ae7.st) Quit (Ping timeout: 480 seconds)
[0:46] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[0:49] <elder> joshd, are you planning to offer a review of "rbd: __rbd_init_snaps_header() bug"?
[0:50] <joshd> yeah, I'm in the process of doing so
[0:50] <elder> OK.
[0:50] <elder> Just wanted to be sure before I go too far with committing stuff.
[0:50] <elder> I'll wait for your review before I start testing.
[0:50] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:52] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[0:53] <Kioob> Warning: Don't use rbd kernel driver on the osd server. Perhaps it will freeze the rbd client and your osd server. <== oh, not fun :(
[0:53] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[1:02] * jluis (~JL@89-181-146-118.net.novis.pt) has joined #ceph
[1:07] * joao (~JL@ Quit (Ping timeout: 480 seconds)
[1:12] * lofejndif (~lsqavnbok@9KCAAAJ4C.tor-irc.dnsbl.oftc.net) has joined #ceph
[1:17] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:17] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:25] * eightyeight (~atoponce@pinyin.ae7.st) has joined #ceph
[1:32] * Leseb_ (~Leseb@ has joined #ceph
[1:33] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Ping timeout: 480 seconds)
[1:38] * Leseb_ (~Leseb@ Quit (Quit: Leseb_)
[1:38] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[1:41] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[1:42] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[1:43] * Tv_ (~tv@ Quit (Quit: Tv_)
[1:48] <iggy> Kioob: that is true of the kernel rbd driver (also of the kernel cephfs driver fwiw)
[1:49] <iggy> Kioob: but userspace (qemu's built-in librbd support, etc.) is safe
[1:53] <Kioob> ok... I was trying to mount a Xen setup, on one uniq server for testing
[1:53] <Kioob> but if I can't have real block devices...
[1:55] <iggy> the same is effectively true of any kernel driver that talks to a user space daemon on the same system
[1:57] <Kioob> so, no chance to be fixed soon
[1:58] <dmick> well, it's one of those "you can't really fix this in any way you'd like"
[1:58] <Kioob> ok ;)
[1:58] <Kioob> thanks for clarification
[1:58] <dmick> the nature of kernel-through-userland that's... dangerous.
[2:19] <Kioob> thanks ! ++
[2:19] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[2:28] <elder> sagewk, I finished reviewing your patches. I have 7 patches reviewed and ready to commit to testing. They are sitting now in (updated) branch "wip-rbd-cleanup".
[2:28] <elder> I've been running my set of tests for sanity, including xfstests over rbd, and all is fine.
[2:29] <elder> I can commit now, but I don't want to mess up your plans with your series.
[2:29] <elder> I'm headed out shortly though.
[2:29] <elder> dmick, is Sage still in?
[2:31] <dmick> I think he's still here, but meeting
[2:31] <dmick> AFK
[2:33] <elder> OK. Looks like he messed with the testing branch while I wasn't looking. I'll have him pull in my changes whenever he's ready.
[2:37] <elder> sagewk, I sent you an e-mail regarding ceph-client.git/wip-rbd-cleanup
[3:00] * bshah (~bshah@sproxy2.fna.fujitsu.com) Quit (Ping timeout: 480 seconds)
[3:01] * adjohn (~adjohn@ Quit (Quit: adjohn)
[3:05] * adjohn (~adjohn@ has joined #ceph
[3:06] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[3:09] * dmick (~dmick@2607:f298:a:607:e09a:fd61:bb0e:84f9) Quit (Quit: Leaving.)
[3:10] * MarkDude (~MT@c-98-210-253-235.hsd1.ca.comcast.net) has joined #ceph
[3:16] <sagewk> elder: still there?
[3:16] <sagewk> elder: there are 2 patches that were there before that aren't in your new branch.. 'rbd: kill rbd_init_watch_dev()' and 'rbd: __rbd_init_snaps_header() bug'
[3:17] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[3:17] <jluis> sjust, sagewk, still around?
[3:18] <sagewk> jluis: i am
[3:19] <jluis> sagewk, what does it mean, in the PGMap context, to have a 'last_clean_epoch'?
[3:20] <sagewk> you mean in pg_stat_t?
[3:20] <sagewk> ah, last_epoch_clean
[3:21] <jluis> the function I'm looking at (PGMap::calc_min_last_epoch_clean) returns an epoch_t
[3:21] <sagewk> each pg's pg_stat_t has the last osdmap epoch it was clean
[3:21] <jluis> yeah, last_epoch_clean; sorry about that
[3:21] <sagewk> that returns a lower bound for the whole cluster
[3:21] <sagewk> its used to trim the osdmaps when we know they won't be needed for peering or for finding object replicas
[3:22] <jluis> oh, I see.
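[Editor's note: a minimal Python sketch of the lower-bound logic sagewk describes above. This is not the actual C++ `PGMap::calc_min_last_epoch_clean`, just an illustration; the dictionary shape and epoch values are invented for the example.]

```python
def calc_min_last_epoch_clean(pg_stats):
    """Return the minimum last_epoch_clean across all PGs: the oldest
    osdmap epoch that may still be needed for peering or for finding
    object replicas.  pg_stats maps pgid -> last_epoch_clean (an int).
    Returns 0 when there are no PG stats yet (nothing is trimmable)."""
    if not pg_stats:
        return 0
    return min(pg_stats.values())

# Osdmaps with epochs below this cluster-wide floor can be trimmed.
floor = calc_min_last_epoch_clean({"0.0": 120, "0.1": 115, "1.0": 130})
```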
[3:23] * atoponce (~atoponce@pinyin.ae7.st) has joined #ceph
[3:23] <jluis> and the single paxos, with services keeping their own versions, is biting us in the ass (I think)
[3:24] <jluis> because the osdmon will make the decision to trim based on the pgmon's version
[3:25] <jluis> well, I'll look into it in the morning :)
[3:25] <jluis> thanks
[3:25] <jluis> good night #ceph !
[3:26] * eightyeight (~atoponce@pinyin.ae7.st) Quit (Ping timeout: 480 seconds)
[3:27] * Datfabpeak1 (~udWopyaca@ has joined #ceph
[3:28] <sagewk> jluis: 'night!
[3:46] <jluis> just a quick, totally off-topic question before going to bed: what does 'TS' stand for on the all@hq list?
[3:47] <jluis> this has been bugging me for months, and I can't think of anything besides "teamspeak" :p
[3:50] * adjohn (~adjohn@ Quit (Quit: adjohn)
[3:58] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[4:00] * atoponce is now known as eightyeight
[4:02] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[4:05] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[4:15] * SuperSonicSound (~SuperSoni@83TAAHSUS.tor-irc.dnsbl.oftc.net) has joined #ceph
[4:19] * themgt (~themgt@24-181-215-214.dhcp.hckr.nc.charter.com) has joined #ceph
[4:26] <themgt> I'm trying to start radosgw on ubuntu 12.04 after following http://ceph.com/docs/master/radosgw/config/ , and just getting this in /var/log/ceph/radosgw.log (after running /etc/init.d/radosgw start):
[4:26] <themgt> 2012-07-31 02:25:11.521450 7f7d933f7700 -1 Initialization timeout, failed to initialize
[4:27] <themgt> the process (/usr/bin/radosgw -n client.radosgw.gateway) does start running, but dies after maybe 20-30 seconds, and just leaves that in the log
[4:31] <iggy> themgt: does the cluster look okay otherwise (ceph -s or -w or whatever it is)
[4:34] <themgt> ahh duhh, thanks iggy. looks like I broke it during the install. the processes were all still running, didn't realize there was a problem
[4:35] * glowell_ (~glowell@ Quit (Ping timeout: 480 seconds)
[4:45] * lofejndif (~lsqavnbok@9KCAAAJ4C.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[4:59] * deepsa (~deepsa@ has joined #ceph
[5:41] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[5:41] * loicd (~loic@magenta.dachary.org) has joined #ceph
[5:46] <elder> sage, I presume you got my e-mail.
[5:47] <elder> jluis, Tech Support?
[6:05] * darkfader (~floh@ has joined #ceph
[6:05] * darkfaded (~floh@ Quit (Read error: Connection reset by peer)
[6:27] * gregaf1 (~Adium@2607:f298:a:607:1dd5:5769:ce20:fcdc) has joined #ceph
[6:33] * gregaf (~Adium@2607:f298:a:607:a1bd:67f8:e7c0:213a) Quit (Ping timeout: 480 seconds)
[6:37] * glowell (~glowell@ip-64-134-166-114.public.wayport.net) has joined #ceph
[6:39] * SuperSonicSound (~SuperSoni@83TAAHSUS.tor-irc.dnsbl.oftc.net) Quit (Quit: Leaving)
[7:41] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[8:11] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:34] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[8:35] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[9:00] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:13] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:16] * BManojlovic (~steki@ has joined #ceph
[9:18] * MarkDude (~MT@c-98-210-253-235.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[9:19] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[9:19] * MarkDude (~MT@c-98-210-253-235.hsd1.ca.comcast.net) has joined #ceph
[9:28] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[9:29] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:31] * Leseb (~Leseb@ has joined #ceph
[9:35] * EmilienM (~EmilienM@ has joined #ceph
[9:41] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[9:41] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:46] * MarkDude (~MT@c-98-210-253-235.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[9:49] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:54] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:58] * loicd (~loic@ has joined #ceph
[10:08] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[10:09] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:09] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[10:37] <jluis> elder, damn, that makes a whole lot of sense :x
[11:18] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:32] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[11:46] * loicd (~loic@ has joined #ceph
[11:47] * fc (~fc@ Quit (Quit: leaving)
[11:47] * fc (~fc@ has joined #ceph
[12:20] * allsystemsarego (~allsystem@ has joined #ceph
[12:53] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[12:56] * ninkotech_ (~duplo@ has joined #ceph
[12:56] * ninkotech (~duplo@ Quit (Read error: Connection reset by peer)
[14:29] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[14:30] * deepsa_ (~deepsa@ has joined #ceph
[14:30] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[14:30] * deepsa_ is now known as deepsa
[14:32] * glowell (~glowell@ip-64-134-166-114.public.wayport.net) Quit (Remote host closed the connection)
[14:34] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[15:07] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[15:15] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[15:23] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[15:29] * newtontm (~jsfrerot@charlie.mdc.gameloft.com) has joined #ceph
[15:30] <newtontm> Hi, I have this mds process running but seems that mon doesn't see it, anyone can give me a hand with this?
[15:58] * Datfabpeak1 (~udWopyaca@ Quit (Ping timeout: 480 seconds)
[16:06] * Datfabpeak1 (~udWopyaca@25-186.77-83.cust.bluewin.ch) has joined #ceph
[17:08] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:15] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[17:21] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[17:22] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:26] <themgt> hmm, radosgw is still failing to launch with "Initialization timeout, failed to initialize" in the log. ceph -s says HEALTH_OK and client.radosgw.gateway is in the auth list
[17:27] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[17:29] <nhm> KABOOM
[17:32] <jluis> nhm, what have you done?
[17:33] <jluis> should we, the world, be concerned?
[17:33] <nhm> jluis: naw, I just drank a spiked coffee so I'm feeling spunky. ;)
[17:33] <jluis> oh, carry on then! :p
[17:34] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[17:35] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:37] <themgt> hmm yeah, launching in the foreground w/ "/usr/bin/radosgw -d -n client.radosgw.gateway" and it takes about 30 seconds then says 'Initialization timeout'. is there some way to get more debug information about what it's trying to do when it times out?
[17:46] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Quit: Ex-Chat)
[17:47] <sagewk> --log-to-stderr --debug-rados 10 --debug-ms 1 --debug-monc 10
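[Editor's note: putting sagewk's flags together with the foreground invocation themgt used earlier gives a command along these lines. The client name is the one from this conversation; this needs a reachable cluster to actually run.]

```shell
# Run radosgw in the foreground (-d) with verbose client-side debugging,
# using the flags sagewk suggests above to see where initialization hangs.
/usr/bin/radosgw -d -n client.radosgw.gateway \
    --log-to-stderr --debug-rados 10 --debug-ms 1 --debug-monc 10
```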
[17:47] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:48] * loicd (~loic@magenta.dachary.org) has joined #ceph
[17:54] * loicd1 (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) has joined #ceph
[17:56] <themgt> sagewk: ahh thanks, that's helpful. looks like maybe some sort of routing issue
[17:56] <sagewk> elder: did you look at that atomic_open patch?
[17:57] <elder> I'm looking at it now.
[17:57] <elder> Just trying to find the original thread.
[17:59] * loicd (~loic@magenta.dachary.org) Quit (Ping timeout: 480 seconds)
[18:04] * Tv_ (~tv@ has joined #ceph
[18:06] <sagewk> elder: k. no reply from miklos, but it's passing my tests and looks right to me
[18:07] <elder> OK.
[18:07] <elder> I expect it's fine, but am really trying to do a little research on the context before I review your change.
[18:07] <sagewk> otherwise ready to send the pull request to linus. several annoying conflicts for him to fix, sadly, but oh well.
[18:07] <elder> I tried it this morning and yes I saw the conflicts.
[18:07] <sagewk> we should put any additional patches on top of his tree
[18:07] <sagewk> instead of back-merging.
[18:07] <elder> Probably my fault with all the refactoring I'm doing.
[18:08] <jluis> sagewk, do we really need to keep 500 pgmap versions around at all times?
[18:08] <elder> I promise it will stop.
[18:08] <sagewk> hehe
[18:08] <elder> Eventually, when everything is beautiful.
[18:08] <sagewk> jluis: not really.
[18:08] <sagewk> jluis: but if it's causing problems then i'm worried, because often we need to keep several thousand osdmaps
[18:09] <jluis> it's not causing problems; I've just been trying to trigger the trim on the different monitor services
[18:09] <sagewk> ok cool
[18:09] <sagewk> yeah, set the config value to whatever you want for testing
[18:10] <jluis> and the pgmap and osdmap trims are only triggered after 500 proposals of their own, which goes as high as 1.4k paxos proposals overall
[18:10] <jluis> yeah, going to do that; but wanted to make sure everything was working before I started to mess with the options ;)
[18:11] * BManojlovic (~steki@ has joined #ceph
[18:13] * aliguori (~anthony@ has joined #ceph
[18:17] * Cube (~Adium@ has joined #ceph
[18:22] <themgt> it seems like radosgw is using the wrong IP for the other node for some reason: http://pastebin.com/raw.php?i=daqGwUBz
[18:22] <themgt> the IP should be but it looks like it's trying to talk to
[18:22] <themgt> ceph itself is seemingly connected/working fine
[18:23] <themgt> this is a bit of an odd config in that the two nodes are VMs on different boxes. so is the internal IP of the other box, but I can't figure out where radosgw is getting that from
[18:24] <sagewk> themgt: ceph.conf on the radosgw node probably
[18:24] <sagewk> themgt: or maybe the osds are registering with the wrong ips.. see 'ceph osd dump'
[18:25] <themgt> ahh yeah... it must see that as the source IP of the connection
[18:26] <themgt> is there some way to force it to identify as the VM IP?
[18:32] <sagewk> is it the osdmap ips?
[18:33] <elder> Whoops. Just replied all instead of sender, despite my efforts to do otherwise.
[18:33] * loicd1 (~loic@2a01:e35:2eba:db10:120b:a9ff:feb7:cce0) Quit (Quit: Leaving.)
[18:34] * glowell (~glowell@ has joined #ceph
[18:37] <themgt> yeah, osd.1 on the other node is "up" but it's got the wrong IP address
[18:37] <sagewk> in [osd] section, set 'public network =' and restart ceph-osd daemons.. the osdmap should then reflect the new ips
[18:37] <sagewk> you maybe need to set cluster network = the same thing if the other ip is needed for the VMs to talk to each other
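[Editor's note: sagewk's advice, sketched as a hypothetical ceph.conf fragment. The subnet is a placeholder; substitute the network the VMs actually share.]

```ini
; Sketch of the [osd] settings described above.
; is a placeholder -- use the VMs' shared network.
[osd]
    public network =
    ; only needed if the same network also carries inter-OSD traffic:
    cluster network =
```

After restarting the ceph-osd daemons, 'ceph osd dump' should show the new addresses in the osdmap.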
[18:40] <sagewk> elder: atomic_open look ok?
[18:45] <elder> I'm finally getting what it's doing.
[18:45] <elder> If you're in a hurry, just go for it. You can have my reviewed-by...
[18:46] <elder> It looks fine to me, I'm just working through all the details of how both lookup_open() and atomic_open() were used.
[18:48] <themgt> sagewk - awesome, got it. I have mds on the other node too so that needed the public/cluster config too, but now radosgw is working. thanks!
[18:49] <sagewk> elder: i can wait, thanks
[18:49] <elder> It won't be long.
[18:49] <sagewk> elder: it took me a few tries to get it.. see my msg to miklos for where i ended up
[18:54] <nhm> joao: Can you push the changes you made for the workload generator last week?
[18:55] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[19:08] <newtontm> hi, i need help with the mds process and I also have a problem with pg map
[19:09] <newtontm> pgmap v523: 192 pgs: 192 stale+active+degraded; 0 bytes data, 2075 MB used, 38864 MB / 40940 MB avail
[19:09] <newtontm> how can I fix that, all my osds are up
[19:09] <newtontm> ceph pg 0.0 query: i don't have pgid 0.0
[19:11] * chutzpah (~chutz@ has joined #ceph
[19:11] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[19:14] <elder> sagewk, what branch has your atomic open change included?
[19:14] <sagewk> testing
[19:14] <sagewk> top patch
[19:15] <sagewk> test = (master + linus/master merge + atomic_open patch)
[19:15] <elder> Sorry, I needed to update
[19:15] <elder> I made a separate repository so I could look at them side-by-side and neglected to update the "old" one.
[19:17] * Leseb (~Leseb@ Quit (Quit: Leseb)
[19:17] <nhm> joshd: oops, my mistake, I think you already did that.
[19:33] * Ryan_Lane (~Adium@ has joined #ceph
[19:34] * dmick (~dmick@2607:f298:a:607:94e5:5ddf:1fa9:93f4) has joined #ceph
[19:36] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:42] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:48] <sagewk> elder: ah, nightly found a d_rehash() BUG_ON. will fix that first.
[19:48] <elder> I'll keep looking anyway.
[20:03] * loicd (~loic@ has joined #ceph
[20:06] <sagewk> elder: bah, seems to be calling atomic_open even on hashed dentries :(
[20:07] <sagewk> weird semantic.
[20:11] <elder> If this is just an improvement over what Miklos did I don't think you should hold up the pull request for it.
[20:12] <elder> BTW I'm going to fire off a linux-3.4.7-ceph build and test.
[20:13] <elder> I should update it with more critical fixes. I've only been re-basing things on the new stable releases as they come.
[20:13] <sagewk> elder: it's a huge improvement, and fixes at least one bug
[20:14] <sagewk> anyway, i think it's fixed now
[20:14] <elder> OK.
[20:14] <sagewk> i was assuming atomic_open is only called on unhashed dentries (like ->lookup) but it's also called on hashed negative dentries
[20:14] <elder> I'm just having a hard time getting the focus on it today. Lots of distractions, and they're actually distracting me for some reason.
[20:14] <sagewk> no problem, i think i have it covered
[20:14] <sagewk> thanks!
[20:15] <elder> Yes it looks like it's called to finish all of the opens if it's available.
[20:15] <tnt> sagewk: speaking of critical fixes #2846 would qualify ? I mean kernel panic triggered by non privileged user in any kernel that happens to have rbd loaded ...
[20:18] <sagewk> tnt: ah, yeah. i'll fix that.
[20:19] <sagewk> tnt: is that your patch btw?
[20:19] <tnt> yes
[20:19] <sagewk> can i add your Signed-off-by: to the kernel patch?
[20:19] <tnt> yes sure.
[20:20] <sagewk> what's your full name and email?
[20:21] <tnt> Sylvain Munaut <tnt@246tNt.com>
[20:25] <newtontm> btw, i just found out my problem with my mds installation, now I still have my problem with OSD: pgmap v528: 192 pgs: 192 stale+active+degraded;
[20:25] <sagewk> stale usually means all your osds are down
[20:27] * aliguori (~anthony@ Quit (Ping timeout: 480 seconds)
[20:28] <newtontm> osdmap e27: 2 osds: 2 up, 2 in
[20:30] <newtontm> so the osds are up, any ideas ?
[20:32] <sagewk> elder: i'm going to send the pull without atomic_open, and then a second pull on top of his merge tomorrow
[20:33] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[20:34] <joshd> newtontm: can you access the osds? i.e. does something like 'rados -p data ls' work?
[20:34] <newtontm> well one thing I know is this, running: "ceph pg dump_stuck stale" then taking one of the pgs: "ceph pg 0.0", I get:
[20:34] <newtontm> i don't have pgid 0.0
[20:35] <elder> Sounds like a good plan, sagewk.
[20:35] <joshd> newtontm: can you pastebin that pg dump?
[20:36] <newtontm> http://pastebin.com/B1ara8XE
[20:37] <newtontm> "rados -p data ls" doesn't seem to work
[20:37] <newtontm> it hangs
[20:38] <newtontm> btw, i'm trying to not use mkcephfs and building my own puppet class, so I think I may have missed a step... i it can help
[20:38] <newtontm> if*
[20:39] <joshd> it sounds like the pgs haven't been created on the osds
[20:40] <newtontm> how would you create them?
[20:41] <joshd> try 'ceph pg force_create_pg <pgid>'
[20:42] <newtontm> pgid beeing one of the problematic pgid i've got?
[20:42] <joshd> yeah
[20:42] <newtontm> pgmap v539: 192 pgs: 1 active+clean, 191 stale+active+degraded;
[20:43] <joshd> sounds like that's the problem then
[20:43] <newtontm> it worked, so what did I miss?
[20:43] <newtontm> because this would be created automatically by some script/commands right?
[20:44] <joshd> yeah, probably
[20:45] <joshd> normally the monitors send the creation messages by themselves
[20:46] <newtontm> is there any command I should issue to trigger the mons to start creating the pgs ?
[20:46] <joshd> are your scripts creating the monitors, then adding osds to the crush and osd maps?
[20:46] <joshd> no, it should be done automatically
[20:46] <newtontm> i'll paste bin all the commands I did, actually i followed the steps from the doc site, let me find the page again
[20:49] <sagewk> tnt: thanks, fixes applied to kernel and userland trees
[20:50] <newtontm> http://ceph.com/docs/master/ops/manage/grow/mon/
[20:50] <newtontm> this is for mons
[20:52] <newtontm> and this one for osd I think, http://ceph.com/wiki/OSD_cluster_expansion/contraction
[20:53] <tnt> sagewk: great thanks.
[20:55] <newtontm> joshd: so I did all the steps from the last link, and I think my crush map is set correctly
[20:56] <Tv_> newtontm: the wiki is often out of date.. http://ceph.com/docs/master/ops/manage/grow/osd/
[20:59] <newtontm> ok, this looks simpler, but still the steps are very similar, and I don't get where the pgs get created...
[21:00] <joshd> newtontm: it looks like the old instructions on the wiki might not be adding the osd to the osdmap
[21:01] <newtontm> ok, but when running this: ceph osd stat
[21:01] <newtontm> show me this: e32: 2 osds: 2 up, 2 in
[21:01] <newtontm> doesn't this mean it's already in ?
[21:02] <newtontm> osdmap e32: 2 osds: 2 up, 2 in
[21:03] <joshd> what's 'ceph osd dump' and 'ceph osd tree' show?
[21:03] <joshd> (that's the osdmap and crush map respectively)
[21:03] <joshd> the pgs should get created when the osdmap is populated
[21:03] <newtontm> http://pastebin.com/SPyN8E9X
[21:04] <newtontm> so if I remove osd.1 from the osdmap and add it again, this should create the pgs?
[21:07] <elder> tnt, sagewk that patch looks good to me (the "fix crypto key null deref" one). The way it works aligns well with one or two other key_type->destroy methods I glanced at.
[21:08] <joshd> newtontm: hmm, actually it looks like it's triggered when the monitors go active
[21:09] <tnt> elder: yeah that's where I got my inspiration from :p
[21:11] <joshd> newtontm: I suspect it has to do with how your osdmap was being initialized, but i'm not sure where. if you try again with 'debug mon = 20' you can look for lines containing check_osd_map and register_new_pgs, and see why it's skipping register them
[21:12] <newtontm> when you say, try again, you mean destroying the whole cluster and recreate from scratch?
[21:31] <joshd> yeah
[21:31] <joshd> newtontm: logging on the osds might help too, if you want to be certain to catch everything (debug osd = 20)
[21:35] * aliguori (~anthony@ has joined #ceph
[21:49] * MarkDude (~MT@adsl-75-37-34-244.dsl.pltn13.sbcglobal.net) has joined #ceph
[21:57] <newtontm> I re-created everything and it worked now, but I added the "ceph osd create" this time...
[22:00] <joshd> yeah, it looks like setmaxosd doesn't set some flags in the osd map (CEPH_OSD_EXISTS | CEPH_OSD_NEW), so that might be the problem
[22:03] <gregaf1> joshd: why would setmaxosd be setting flags on OSDs?
[22:05] <joshd> gregaf1: it's not, that's the problem
[22:05] <joshd> gregaf1: the wiki was out of date
[22:21] * EmilienM (~EmilienM@ has left #ceph
[22:41] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:58] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:00] <dmick> sigh, tracker is once again moribund
[23:00] <dmick> silly ruby
[23:00] <mikeryan> sigh, should have used mongodb for our web scale project
[23:02] <gregaf1> *chortle* stop reminding me of that video; it's distracting
[23:02] <mikeryan> watch it again sometime, it's aged well
[23:07] <Tv_> "aged like a sock"
[23:12] * MarkDude (~MT@adsl-75-37-34-244.dsl.pltn13.sbcglobal.net) Quit (Read error: Connection reset by peer)
[23:13] * aliguori (~anthony@ Quit (Remote host closed the connection)
[23:13] * MarkDude (~MT@adsl-75-37-34-244.dsl.pltn13.sbcglobal.net) has joined #ceph
[23:15] * danieagle (~Daniel@ has joined #ceph
[23:16] <dmick> tracker is backer
[23:17] <mikeryan> dmick: what actually happens to tracker?
[23:18] <mikeryan> is it ruby or the db losing its mnind?
[23:18] <mikeryan> mind*
[23:18] <dmick> I don't really know
[23:18] <dmick> aspersions have been cast on the plugin that syncs with github
[23:18] <Tv_> historically, it's been the mysql (connection?)
[23:19] <dmick> sage uttered something along the lines of "there are two worker (tasks? threads?) and when they're both used by the github syncing, it won't serve other requests". or at least that's what I interpreted
[23:20] <dmick> I give that a 30% confidence rating
[23:25] <mikeryan> what exactly's being sync'd to github?
[23:26] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:27] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[23:30] <dmick> there's a magic connection between SHA1 and tracker ID
[23:31] <dmick> if a commit has Fixes:#NNNN (I can't remember if it needs the space after : or not) it will be viewable from the tracker
[23:32] <dmick> I can't remember if "commit:<sha1>" in the tracker does anything magic, or if it's just convention (for the other direction)
[23:32] <mikeryan> ah, iirc it does
[23:32] <mikeryan> but that's cheap
[23:32] <mikeryan> getting the data from github to tracker is less cheap
[23:34] <dmick> what does commit:<sha1> do, exactly?
[23:34] <gregaf1> I don't know if it still does, but once upon a time it put in a link to that commit
[23:35] <mikeryan> trac's git plugin does that, and i assume they copied it from redmine
[23:35] <dmick> oh, makes sense. I don't think I've ever seen it actually work
[23:35] <gregaf1> mikeryan: anyway, in order to do any of the linking the tracker needs to know that the sha1 is a valid commit, and to pull commits into bugs (which it does) it needs to scan them... thus, rsync
[23:36] <mikeryan> yep, it's all clear to me now
[23:36] <dmick> http://tracker.newdream.net/issues/2866, for example
[23:37] <dmick> appears to have both directions
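[Editor's note: an illustration of the convention dmick and gregaf1 describe. The subject and body here are invented for the example; the issue number is the one from the tracker link above.]

```
rbd: fix crypto key null dereference

<commit body describing the fix>

Fixes: #2866
```

When the tracker's sync job scans this commit, the `Fixes: #2866` line makes the commit viewable from issue 2866, and a `commit:<sha1>` reference in the tracker links back the other way.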
[23:41] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[23:54] * MarkDude (~MT@adsl-75-37-34-244.dsl.pltn13.sbcglobal.net) Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.