#ceph IRC Log

IRC Log for 2014-03-30

Timestamps are in GMT/BST.

[0:18] * Discovery (~Discovery@109.235.55.69) Quit (Read error: Connection reset by peer)
[0:38] * nrs_ (~nrs@ool-435376d0.dyn.optonline.net) has joined #ceph
[1:01] * mattt (~textual@S010690724001c795.vc.shawcable.net) Quit (Quit: Computer has gone to sleep.)
[1:02] * nrs_ (~nrs@ool-435376d0.dyn.optonline.net) Quit (Quit: My MacBook has gone to sleep. ZZZzzz…)
[1:08] * scuttlemonkey (~scuttlemo@c-107-5-193-244.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[1:16] * MACscr (~Adium@c-98-214-103-147.hsd1.il.comcast.net) Quit (Ping timeout: 480 seconds)
[1:19] * danieagle (~Daniel@179.186.125.231.dynamic.adsl.gvt.net.br) Quit (Quit: Thanks for Everything! :-) See you later! :-))
[1:40] * dgbaley27 (~matt@c-76-120-64-12.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:44] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[1:47] * zack_dolby (~textual@p852cae.tokynt01.ap.so-net.ne.jp) Quit (Quit: My MacBook has gone to sleep. ZZZzzz…)
[1:55] * shimo (~A13032@122x212x216x66.ap122.ftth.ucom.ne.jp) Quit (Ping timeout: 480 seconds)
[3:05] * rotbeard (~redbeard@aftr-37-24-149-176.unity-media.net) Quit (Quit: Leaving)
[3:17] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[3:22] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[3:22] * mattt_ (~textual@92.52.76.140) has joined #ceph
[3:33] * sputnik1_ (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) has joined #ceph
[3:36] * sputnik__ (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) has joined #ceph
[3:36] * sputnik1_ (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) Quit (Read error: Connection reset by peer)
[3:43] * mattt_ (~textual@92.52.76.140) Quit (Read error: Connection reset by peer)
[3:55] * sputnik__ (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) Quit (Quit: My MacBook has gone to sleep. ZZZzzz…)
[3:56] * sputnik1_ (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) has joined #ceph
[3:57] * sputnik1_ (~sputnik13@wsip-68-105-248-60.sd.sd.cox.net) Quit ()
[4:13] * devicenull (sid4013@id-4013.ealing.irccloud.com) has joined #ceph
[4:21] * Koma (~Koma@0001c112.user.oftc.net) Quit (Ping timeout: 480 seconds)
[4:56] * Guest4834 (~knoppix@cpe-76-179-140-89.maine.res.rr.com) has joined #ceph
[5:00] * Guest4834 (~knoppix@cpe-76-179-140-89.maine.res.rr.com) Quit (Quit: Leaving)
[5:00] * Koma (~Koma@0001c112.user.oftc.net) has joined #ceph
[5:07] * Vacum_ (~vovo@88.130.192.126) has joined #ceph
[5:14] * Vacum (~vovo@88.130.192.6) Quit (Ping timeout: 480 seconds)
[5:23] * zack_dolby (~textual@e0109-114-22-14-183.uqwimax.jp) has joined #ceph
[5:31] * zack_dolby (~textual@e0109-114-22-14-183.uqwimax.jp) Quit (Ping timeout: 480 seconds)
[5:49] * rmoe (~quassel@173-228-89-134.dsl.static.sonic.net) Quit (Ping timeout: 480 seconds)
[5:53] * mattt (~textual@S010690724001c795.vc.shawcable.net) has joined #ceph
[5:54] * rmoe (~quassel@173-228-89-134.dsl.static.sonic.net) has joined #ceph
[6:20] * leochill (~leochill@nyc-333.nycbit.com) Quit (Ping timeout: 480 seconds)
[6:28] * zack_dolby (~textual@e0109-114-22-14-183.uqwimax.jp) has joined #ceph
[6:33] * zack_dolby (~textual@e0109-114-22-14-183.uqwimax.jp) Quit ()
[6:34] * scuttlemonkey (~scuttlemo@c-107-5-193-244.hsd1.mi.comcast.net) has joined #ceph
[6:34] * ChanServ sets mode +o scuttlemonkey
[6:58] * mattt (~textual@S010690724001c795.vc.shawcable.net) Quit (Quit: Computer has gone to sleep.)
[7:01] * nrs_ (~nrs@ool-435376d0.dyn.optonline.net) has joined #ceph
[7:41] * mattt (~textual@S010690724001c795.vc.shawcable.net) has joined #ceph
[7:43] * mattt (~textual@S010690724001c795.vc.shawcable.net) Quit ()
[7:58] * nrs_ (~nrs@ool-435376d0.dyn.optonline.net) Quit (Quit: My MacBook has gone to sleep. ZZZzzz…)
[7:59] * wschulze (~wschulze@p5B3FCA71.dip0.t-ipconnect.de) has joined #ceph
[7:59] * wschulze (~wschulze@p5B3FCA71.dip0.t-ipconnect.de) Quit ()
[8:43] * kiwigeraint (~kiwigerai@208.72.139.54) Quit (Remote host closed the connection)
[8:48] * kiwigeraint (~kiwigerai@208.72.139.54) has joined #ceph
[9:14] * kiwigera_ (~kiwigerai@208.72.139.54) has joined #ceph
[9:14] * kiwigeraint (~kiwigerai@208.72.139.54) Quit (Read error: Connection reset by peer)
[9:23] * kiwigera_ (~kiwigerai@208.72.139.54) Quit (Read error: Connection reset by peer)
[9:23] * kiwigeraint (~kiwigerai@208.72.139.54) has joined #ceph
[9:26] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[9:32] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) Quit (Quit: Leaving.)
[9:56] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[10:04] * thomnico (~thomnico@2a01:e35:8b41:120:f1a9:cbfa:3fec:1668) has joined #ceph
[10:09] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[10:09] * rotbeard (~redbeard@2a02:908:df10:6f00:22cf:30ff:fe1b:43c1) has joined #ceph
[10:42] * kiwigera_ (~kiwigerai@208.72.139.54) has joined #ceph
[10:42] * kiwigeraint (~kiwigerai@208.72.139.54) Quit (Read error: Connection reset by peer)
[10:47] <athrift> We had a host fail in our small 4 host cluster today, everything was OK until we brought the failed host back up and now all IO to the ceph cluster has stopped.
[10:48] <athrift> Is this fairly normal ?
[10:49] <athrift> the host that failed has a huge amount of read IO, and a small amount of write, while all other hosts have primarily write IO happening.
[11:04] <Fruit> does ceph health give any reason?
[11:12] <athrift> just the expected active+remapped+backfilling
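
For readers following along: when a rejoined host triggers backfill that starves client IO like this, the usual first step is the detailed health output and the OSD tree, and on Dumpling-era releases the backfill throttles can be lowered at runtime. A minimal sketch (the values are illustrative, not a prescription):

    ceph -s                # overall state and recovery progress
    ceph health detail     # which pgs are remapped/backfilling, and where
    ceph osd tree          # locate the rejoined host sourcing all the reads

    # give client IO a bigger share of the disks during backfill
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
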
[11:17] * BillK (~BillK-OFT@58-7-59-126.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[11:32] * carif (~mcarifio@ANice-651-1-354-94.w86-205.abo.wanadoo.fr) has joined #ceph
[11:38] * carif_ (~mcarifio@ANice-651-1-373-59.w83-201.abo.wanadoo.fr) has joined #ceph
[11:39] * thomnico (~thomnico@2a01:e35:8b41:120:f1a9:cbfa:3fec:1668) Quit (Quit: Ex-Chat)
[11:39] <aarontc> can anyone help me understand why pgs are consistently staying active+degraded?
[11:39] <aarontc> I have been scouring docs and I can't figure out how to coerce OSDs into replicating the missing data (which is my understanding of the cause of persistent +degraded state)
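
This is the classic "degraded but not recovering" case; a sketch of how one would usually narrow it down (the pg id below is a placeholder):

    ceph pg dump_stuck unclean    # list pgs that are not active+clean
    ceph pg 2.5f query            # substitute one of the stuck pg ids

In the query output, comparing the "up" and "acting" sets usually reveals whether CRUSH simply cannot pick enough OSDs for the replication level (too few hosts, or a too-strict rule), which leaves pgs permanently active+degraded.
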
[11:40] * carif (~mcarifio@ANice-651-1-354-94.w86-205.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[11:46] * MACscr (~Adium@c-98-214-103-147.hsd1.il.comcast.net) has joined #ceph
[11:57] * zack_dolby (~textual@p852cae.tokynt01.ap.so-net.ne.jp) has joined #ceph
[11:59] * carif__ (~mcarifio@ANice-651-1-422-148.w83-201.abo.wanadoo.fr) has joined #ceph
[12:03] * carif_ (~mcarifio@ANice-651-1-373-59.w83-201.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[12:03] * Cataglottism (~Cataglott@dsl-087-195-030-184.solcon.nl) has joined #ceph
[12:07] * carif__ (~mcarifio@ANice-651-1-422-148.w83-201.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[12:22] * MACscr (~Adium@c-98-214-103-147.hsd1.il.comcast.net) Quit (Ping timeout: 480 seconds)
[12:28] * Cataglottism (~Cataglott@dsl-087-195-030-184.solcon.nl) Quit (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[12:38] * leseb (~leseb@185.21.172.77) Quit (Killed (NickServ (Too many failed password attempts.)))
[12:38] * leseb (~leseb@185.21.172.77) has joined #ceph
[12:50] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[13:03] * Sysadmin88 (~IceChat77@176.254.32.31) Quit (Quit: The early bird may get the worm, but the second mouse gets the cheese)
[13:05] * ircuser-1 (~ircuser-1@35.222-62-69.ftth.swbr.surewest.net) Quit (Read error: Operation timed out)
[13:06] * MACscr (~Adium@c-50-158-183-38.hsd1.il.comcast.net) has joined #ceph
[13:12] * MACscr1 (~Adium@c-50-158-183-38.hsd1.il.comcast.net) has joined #ceph
[13:12] * MACscr (~Adium@c-50-158-183-38.hsd1.il.comcast.net) Quit (Read error: Connection reset by peer)
[13:17] * yguang11 (~yguang11@vpn-nat.corp.tw1.yahoo.com) Quit (Ping timeout: 480 seconds)
[13:25] * i_m (~ivan.miro@nat-5-carp.hcn-strela.ru) has joined #ceph
[13:28] * MACscr (~Adium@c-50-158-183-38.hsd1.il.comcast.net) has joined #ceph
[13:28] * fdmanana (~fdmanana@bl5-6-132.dsl.telepac.pt) has joined #ceph
[13:29] * MACscr1 (~Adium@c-50-158-183-38.hsd1.il.comcast.net) Quit (Ping timeout: 480 seconds)
[13:38] * cfreak200 (~cfreak200@p4FF3EAE4.dip0.t-ipconnect.de) has joined #ceph
[13:40] * cfreak201 (~cfreak200@p4FF3F7EA.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[13:43] * yguang11 (~yguang11@vpn-nat.corp.tw1.yahoo.com) has joined #ceph
[13:53] * ircuser-1 (~ircuser-1@35.222-62-69.ftth.swbr.surewest.net) has joined #ceph
[14:02] <loicd> leseb: reping ;-) how is the brag.ceph.com going ? Do you need anything ?
[14:16] <joao> oh hi
[14:16] <joao> o/
[14:16] <kraken> \o
[14:19] * thomnico (~thomnico@2a01:e35:8b41:120:f1a9:cbfa:3fec:1668) has joined #ceph
[14:31] * leseb (~leseb@185.21.172.77) Quit (Killed (NickServ (Too many failed password attempts.)))
[14:31] * leseb (~leseb@185.21.172.77) has joined #ceph
[14:33] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:39] <Fruit> who admins the debian apt repos? there seems to be something wonky with the emperor sources
[14:48] * yguang11_ (~yguang11@123.114.134.161) has joined #ceph
[14:49] * yguang11_ (~yguang11@123.114.134.161) Quit (Remote host closed the connection)
[14:50] * yguang11_ (~yguang11@vpn-nat.corp.tw1.yahoo.com) has joined #ceph
[14:50] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[14:55] * yguang11 (~yguang11@vpn-nat.corp.tw1.yahoo.com) Quit (Ping timeout: 480 seconds)
[15:05] * rotbeard (~redbeard@2a02:908:df10:6f00:22cf:30ff:fe1b:43c1) Quit (Quit: Leaving)
[15:23] * BillK (~BillK-OFT@58-7-59-126.dyn.iinet.net.au) has joined #ceph
[15:27] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:29] <loicd> !norris aarontc
[15:29] <kraken> aarontc can slam revolving doors.
[15:31] * thomnico (~thomnico@2a01:e35:8b41:120:f1a9:cbfa:3fec:1668) Quit (Quit: Ex-Chat)
[15:32] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:35] * Sysadmin88 (~IceChat77@176.254.32.31) has joined #ceph
[16:05] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[16:13] * fghaas (~florian@216-75-224-178.static.wiline.com) has joined #ceph
[16:19] * sprachgenerator (~sprachgen@c-67-167-211-254.hsd1.il.comcast.net) has joined #ceph
[16:23] * fghaas (~florian@216-75-224-178.static.wiline.com) Quit (Quit: Leaving.)
[16:29] * The_Bishop__ (~bishop@g229101187.adsl.alicedsl.de) Quit (Quit: Who the hell is this Peer? If I catch him I'll reset his connection!)
[16:30] * The_Bishop (~bishop@g229101187.adsl.alicedsl.de) has joined #ceph
[16:50] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[17:03] * Cataglottism (~Cataglott@dsl-087-195-030-184.solcon.nl) has joined #ceph
[17:10] * fghaas (~florian@216-75-224-178.static.wiline.com) has joined #ceph
[17:29] <devicenull> where would I start when troubleshooting poor performance? I have three OSDs running in VMs (on different hosts), and one monitor. they're all running in rather small instances (1 cpu, 768mb ram), but I only have a tiny dataset atm, so I was assuming that was okay.
[17:30] <Gugge-47527> start by defining what poor performance is
[17:30] <devicenull> I've got a single volume mounted through qemu that I'm benchmarking in. inside that qemu instance, I only see around 150 iops, despite each of the OSDs being able to achieve around 3000 iops individually
[17:30] <devicenull> yea, I was still typing that :)
[17:30] * thanhtran (~thanhtran@113.172.209.10) has joined #ceph
[17:30] <Gugge-47527> what is the latency on a single io?
[17:30] <Gugge-47527> what queue depth do you benchmark with?
[17:31] <Gugge-47527> if its queue depth 1, 150 iops is more than i would expect for a system with virtual machines running osds :)
[17:33] <devicenull> ahha, it is indeed queue depth 1
[17:34] <Gugge-47527> that's 6.6ms per io
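
The arithmetic behind that figure: at queue depth 1 every IO waits for the previous one, so IOPS ≈ 1000 / per-IO latency in ms, and 150 IOPS works out to roughly 6.7 ms each. Benchmarking with a deeper queue shows aggregate capability instead; a hedged fio sketch (the device path and duration are assumptions for illustration):

    fio --name=qd32 --filename=/dev/vdb --direct=1 --ioengine=libaio \
        --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based
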
[17:42] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[17:43] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[17:47] * Cataglottism (~Cataglott@dsl-087-195-030-184.solcon.nl) Quit (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[17:48] <jks> trying to upgrade ceph from dumpling to emperor on Fedora 18... during yum upgrade of ceph, all osds were shutdown by the upgrade process.. is that to be expected? (it hasn't happened during any other upgrade)
[17:48] <jks> after the yum upgrade finished, the osds cannot start... they give error: "failed: 'timeout 10 /usr/bin/ceph [...]'
[17:48] <jks> anyone seen that before?
[17:49] <devicenull> hmm, not sure that was it. I raised the iodepth to 256, and am seeing virtually identical performance. `ioping /` in the guest is showing between 1.4ms and 2.5 ms
[17:49] <devicenull> the network does seem to be the bottleneck though, so I'll look at that some more
[17:51] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[17:54] <devicenull> so far the only real performance information I've found is at http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ .. is there somewhere else I should be looking?
[17:57] <Fruit> devicenull: you could also try fio (the latest version supports rbd) to rule out qemu as the source of your performance troubles
[17:58] <devicenull> ah that's good to know. I'll try that, eliminating the qemu layer would be good
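
For reference, the rbd engine Fruit mentions talks to the cluster through librbd, taking qemu out of the path entirely; a sketch assuming a test image named "bench" in pool "rbd" and the default admin keyring (requires an fio build with rbd support):

    fio --name=rbd-direct --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=bench --rw=randread --bs=4k --iodepth=32 --runtime=30
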
[18:03] * fghaas (~florian@216-75-224-178.static.wiline.com) has left #ceph
[18:03] <jks> enabled debugging on my upgrade problem... it shows: mon_command_ack([{"prefix": "osd crush create-or-move", "args": ["root=default", "host=stor5"], "id": 16, "weight": 1.82}]=-22 (22) Invalid argument v10239) v1
[18:04] <jks> anyone got an idea what it could be about? (searching the mailinglist did not turn up anything that seems related)
[18:05] <jks> the rest of the cluster running dumpling is working fine... only emperor osds that cannot start
[18:07] <jks> it is hard to find any information on "create-or-move"... is that something added to emperor after the first emperor release, which breaks upgrading from earlier releases?
[18:07] <Fruit> are your mons emperor?
[18:07] <jks> Fruit: not yet
[18:07] <jks> I have one mon on emperor, 4 on dumpling
[18:08] <Fruit> does the upgrade documentation say anything about that? perhaps you need to upgrade your mons first, and only then the osds
[18:08] <jks> the upgrade documentation does not say to upgrade mons first... it says rolling upgrade is possible
[18:08] <jks> (whereas the documentation for upgrading cuttlefish to dumpling, for example, says to upgrade mons first)
[18:09] <jks> my initial plan was to upgrade mons first... but as simply running "yum upgrade" made it shut down all the osds unexpectedly - that plan went down the drain ;)
[18:09] <jks> (I have osds on my mon servers)
[18:09] <Fruit> heh, yeah. the same would happen on debian
[18:10] <jks> it just never happened for me during any other update, so it was unexpected for me
[18:10] <jks> but not much of a problem, if only I could start the osds again :)
[18:10] <Fruit> except on debian I would know how to prevent that from happening :)
[18:10] <jks> my guess is that "crush create-or-move" is some new command that my dumpling system does not support, and therefore fails
[18:10] <jks> but as I cannot start the emperor daemons without it, that seems to be a 'deadlock'
[18:11] * Cataglottism (~Cataglott@dsl-087-195-030-184.solcon.nl) has joined #ceph
[18:12] <devicenull> Fruit: thanks! fio on the hypervisor shows 10x more iops (1200 average), so I'll look carefully at qemu/guest
[18:12] <jks> Fruit, how do you prevent it on debian? - perhaps I can do the same on Fedora
[18:13] <jks> devicenull, which version of qemu are you running?
[18:14] <Gugge-47527> jks: osd crush create-or-move was added in bobtail (according to the release notes)
[18:14] <devicenull> jks: 1.7.0
[18:14] <jks> Gugge-47527, yeah, just looked at a dumpling init script, and it is used there too... wonder why it would give an invalid value error on emperor
[18:15] * leseb (~leseb@185.21.172.77) Quit (Killed (NickServ (Too many failed password attempts.)))
[18:15] * leseb (~leseb@185.21.172.77) has joined #ceph
[18:15] <Fruit> jks: very debian-specific, but I'd create a /usr/sbin/policy-rc.d file that always exits with status 101
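
Spelled out, the Debian guard Fruit describes is a two-line script (mark it executable, and remove it again once the upgrade is done):

    #!/bin/sh
    # /usr/sbin/policy-rc.d: exit status 101 tells invoke-rc.d to deny
    # service starts/restarts triggered by package maintainer scripts
    exit 101
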
[18:15] <Gugge-47527> if you remove it from the init script, does the osd start?
[18:15] <jks> devicenull, okay, I was thinking that it might be caused by a bug in an earlier version of qemu, but with 1.7.0 that's not the case :)
[18:15] <jks> Gugge-47527, yep
[18:15] * thanhtran_ (~thanhtran@113.172.190.53) has joined #ceph
[18:15] <devicenull> I probably just have something configured wrong within libvirt, the documentation for that is.. minimal
[18:16] <Gugge-47527> "However, you cannot run any of the ???ceph osd pool set??? commands while your monitors are running separate versions"
[18:16] * thanhtran (~thanhtran@113.172.209.10) Quit (Read error: Operation timed out)
[18:16] * thanhtran_ is now known as thanhtran
[18:16] * kiwigera_ (~kiwigerai@208.72.139.54) Quit (Remote host closed the connection)
[18:16] <Gugge-47527> maybe you cant run the ceph crush create-or-move commands either :)
[18:16] <jks> Gugge-47527, yeah, but this is a ceph crush command... not osd pool?
[18:17] <jks> Gugge-47527, but that sounds weird... how would you then ever do a rolling upgrade... unless you manually change ceph-deploy or the init scripts
[18:17] <Gugge-47527> ive seen bugs in software before :)
[18:18] <jks> hehe, yeah - I just guess I can't be the first person doing a dumpling => emperor upgrade
[18:20] <Gugge-47527> you could disable osd_crush_update_on_start in ceph.conf :)
[18:20] <jks> hmm, yeah of course - but it sounds weird that it is not mentioned in the docs
[18:21] <Gugge-47527> agree
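
For reference, the knob Gugge-47527 means lives in ceph.conf on the OSD hosts; with it off, the init script skips the "osd crush create-or-move" call and leaves the CRUSH map untouched:

    [osd]
    osd crush update on start = false
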
[18:21] <jks> I wonder if this will always fail.. or it will somehow start to work again after the full cluster has been updated
[18:21] <jks> it would be nice if it said which of the arguments it considers invalid
[18:21] <Gugge-47527> my guess is it will work again when all monitors are upgraded
[18:21] <jks> it is worth a try I guess
[18:22] <jks> this cluster has been upgraded all the way from argonaut or earlier, so it might be somehow different than "standard" clusters
[18:23] <Gugge-47527> maybe :)
[18:23] <Gugge-47527> im still only running cuttlefish :)
[18:24] <jks> haven't got any cuttlefish clusters left... all dumpling ;)
[18:24] <jks> ceph osd perf is damn handy in dumpling
[18:27] <Gugge-47527> i guess i should upgrade, but it just works :)
[18:35] <jks> now my monitors are running emperor... but it still fails during start with that invalid argument error!
[18:35] * wrale (~wrale@cpe-107-9-20-3.woh.res.rr.com) has joined #ceph
[18:36] * rotbeard (~redbeard@2a02:908:df10:6f00:22cf:30ff:fe1b:43c1) has joined #ceph
[18:36] <Gugge-47527> what is the full create-or-move command its trying to run?
[18:37] <jks> osd crush create-or-move -- 16 1.82 root=default host=stor5
[18:37] <jks> (for example)
[18:38] <Gugge-47527> and if you run that manually you get the same invalid argument error?
[18:38] <jks> yes
[18:39] <wrale> I'm setting up a new Ceph pool for the first time. I want to use a replica factor of 3. Is there any reason to set a min_size of anything other than 1? Do I understand correctly that replica 3 with min_size 3 is a more synchronous method, over min_size 1. With the former, data would need to reach three OSDs before sync was complete? With the latter, only one OSD would need to receive the object data before the sync command might
[18:39] <wrale> return?
[18:39] <Gugge-47527> jks: and if you remove the -- part?
[18:40] <jks> wrale: for safety reasons?
[18:40] <jks> Gugge-47527, I'll try!
[18:40] <wrale> jks: yes
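
(A hedged aside on wrale's question: as generally understood, RADOS acknowledges a write only once every replica in the acting set has it, regardless of min_size; min_size instead controls how many replicas must be up for the pg to accept IO at all, so min_size 1 trades durability margin for availability.) Adjusting it per pool looks like:

    ceph osd pool set <poolname> min_size 2
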
[18:41] <jks> Gugge-47527, no difference
[18:42] <Gugge-47527> jks: no idea then
[18:42] <jks> Gugge-47527, I'm wondering why it doesn't specify the rack... only the root and the host
[18:43] <Gugge-47527> ohh, you have a rack defined too
[18:43] <jks> if I add a rack to the command, it works fine
[18:43] <jks> but the init scripts should do that automatically, I'm pretty sure?
[18:43] <jks> I can't be only one with a multi-rack setup? :)
[18:44] <Gugge-47527> im looking at my cuttlefish script, it just does root= host= and whatever is defined in the osd_crush_location config
[18:44] <Gugge-47527> maybe you should set osd_crush_update_on_start to 0 when you have a non standard location :)
[18:44] <Gugge-47527> i dont know :)
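
For a multi-rack map like jks's, the location announced at startup can be spelled out through the option Gugge-47527 names; a sketch with example bucket names:

    [osd.16]
    osd crush location = root=default rack=rack1 host=stor5
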
[18:53] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[19:01] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[19:03] * The_Bishop_ (~bishop@f055071121.adsl.alicedsl.de) has joined #ceph
[19:09] * kiwigeraint (~kiwigerai@208.72.139.54) has joined #ceph
[19:09] * kiwigeraint (~kiwigerai@208.72.139.54) Quit (Remote host closed the connection)
[19:09] * kiwigeraint (~kiwigerai@208.72.139.54) has joined #ceph
[19:10] * The_Bishop (~bishop@g229101187.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[19:14] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[19:17] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[19:17] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[19:19] * Cataglottism (~Cataglott@dsl-087-195-030-184.solcon.nl) Quit (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[19:27] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[19:34] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[19:35] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[19:42] * The_Bishop_ (~bishop@f055071121.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[19:43] * Cataglottism (~Cataglott@dsl-087-195-030-170.solcon.nl) has joined #ceph
[19:44] * fedgoat (~fedgoat@cpe-24-28-22-21.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[19:47] * thomnico (~thomnico@2a01:e35:8b41:120:f1a9:cbfa:3fec:1668) has joined #ceph
[19:50] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[19:51] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[19:54] * Cataglottism (~Cataglott@dsl-087-195-030-170.solcon.nl) Quit (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[19:57] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[19:59] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[20:02] * mattt (~textual@S010690724001c795.vc.shawcable.net) has joined #ceph
[20:07] <jks> what kind of fs_commit_latency numbers should you expect to see when having journals on SSD? (just ballpark figures)
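
For reference, those figures come from the command jks praised earlier in the log:

    ceph osd perf
    # columns: osd  fs_commit_latency(ms)  fs_apply_latency(ms)

Anecdotally, SSD-backed journals are usually expected to keep fs_commit_latency in the low single-digit milliseconds under light load; treat that as a rough rule of thumb rather than a spec.
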
[20:08] * mattt (~textual@S010690724001c795.vc.shawcable.net) Quit (Remote host closed the connection)
[20:08] * mattt (~textual@92.52.76.140) has joined #ceph
[20:09] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[20:09] * mattt (~textual@92.52.76.140) Quit (Read error: Connection reset by peer)
[20:10] * leseb (~leseb@185.21.172.77) Quit (Killed (NickServ (Too many failed password attempts.)))
[20:10] * leseb (~leseb@185.21.172.77) has joined #ceph
[20:25] * mattt (~textual@S010690724001c795.vc.shawcable.net) has joined #ceph
[20:26] <jks> Gugge-47527, this upgrade really isn't going well :) now I cannot restart mons... they just hang when starting :-|
[20:26] * andreask (~andreask@h081217016175.dyn.cm.kabsi.at) has joined #ceph
[20:26] * ChanServ sets mode +v andreask
[20:26] * andreask (~andreask@h081217016175.dyn.cm.kabsi.at) Quit ()
[20:27] * Cataglottism (~Cataglott@dsl-087-195-030-170.solcon.nl) has joined #ceph
[20:29] * The_Bishop_ (~bishop@2001:470:50b6:0:c59d:46b8:673b:e0b7) has joined #ceph
[20:33] * thanhtran (~thanhtran@113.172.190.53) Quit (Quit: Going offline, see ya! (www.adiirc.com))
[20:33] * mattt (~textual@S010690724001c795.vc.shawcable.net) Quit (Quit: Computer has gone to sleep.)
[20:42] * fghaas (~florian@205.158.164.101.ptr.us.xo.net) has joined #ceph
[20:46] * fedgoat (~fedgoat@cpe-24-28-22-21.austin.res.rr.com) has joined #ceph
[20:52] * Cataglottism (~Cataglott@dsl-087-195-030-170.solcon.nl) Quit (Ping timeout: 480 seconds)
[21:01] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[21:03] * wrale (~wrale@cpe-107-9-20-3.woh.res.rr.com) Quit (Quit: Leaving...)
[21:06] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[21:07] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[21:10] * kuu (~kuu@virtual362.tentacle.fi) Quit (Quit: leaving)
[21:10] * zidarsk8 (~zidar@89-212-142-10.dynamic.t-2.net) has joined #ceph
[21:11] * zidarsk8 (~zidar@89-212-142-10.dynamic.t-2.net) has left #ceph
[21:18] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[21:18] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[21:21] <joao> jks, what version are you upgrading from and to?
[21:24] <jks> joao, I upgraded from cuttlefish to dumpling, and then from dumpling to emperor
[21:25] <joao> and the mons got stuck upgrading from dumpling to emperor?
[21:25] <jks> joao, after the emperor upgrade finished everything was running okay... all mons, osds and mds was running the latest emperor
[21:25] <jks> joao, then I installed the ceph-extras repository which updated curl, libcurl and leveldb
[21:25] <jks> joao, then I tried restarting the mon, and it just hangs
[21:26] <jks> so I'm not sure if the cause is the ceph-extras repository or the upgrade to emperor, I'm afraid... but the initial start of the mon after upgrading to emperor worked fine
[21:26] <joao> can you please set 'debug mon = 10' on the mons and point me to the resulting logs?
[21:26] <jks> joao, I have pushed all debug settings to 20 and it only logs one line:
[21:26] <jks> 2014-03-30 20:31:31.829959 7f3c989877c0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 21310
[21:26] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[21:26] <joao> wth
[21:27] <joao> that doesn't sound right at all
[21:27] <jks> no, not at all :-|
[21:27] <joao> there's a lot of stuff that would usually be printed in-between that and something that would get stuck
[21:27] <joao> can you please check if the monitor is running at all?
[21:27] <jks> the process is there, yes
[21:28] <jks> when I strace it, it shows it in a futex call
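
For reference, the debug levels joao asked for above go into ceph.conf before restarting the daemon; a sketch:

    [mon]
    debug mon = 20
    debug ms = 1
    debug paxos = 20
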
[21:28] <joao> are you able to run a 'ceph daemon mon.ID mon_status' on the mon's server?
[21:28] <jks> it says invalid command?
[21:29] <joao> jks, only thing that comes to mind would be to attach gdb to the process and get a backtrace
[21:29] <jks> is that the right syntax for that mon_status command?
[21:29] <jks> I can run "ceph mon_status", which gives pages after pages of debug output :)
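
On a wedged daemon, the older explicit admin-socket form (which talks only to the local process rather than the cluster) is the thing to try; a sketch assuming the default socket path and mon id "e":

    ceph --admin-daemon /var/run/ceph/ceph-mon.e.asok mon_status
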
[21:29] <joao> uh
[21:30] <joao> do you get a reply?
[21:30] <jks> what do you mean by "a reply" specifically?
[21:31] <joao> does the command return, or does it just sit there hanging?
[21:31] <jks> it returns
[21:31] <jks> but it is "ceph mon_status"... not a request specifically for mon.e
[21:31] <joao> then there must be a quorum
[21:31] <jks> yes, there's a quorum alright
[21:31] <joao> 'ceph -s' should return as well
[21:31] <joao> so the monitors are fine?
[21:32] <jks> sure, everything works - just not this specific monitor that I tried to restart
[21:32] <joao> ah
[21:32] <jks> I haven't dared restart the other mons
[21:32] <joao> one monitor
[21:32] <joao> ah, kay
[21:32] <joao> got it
[21:32] <joao> thought all the mons were suffering from that
[21:32] <jks> it's just odd that nothing really is logged
[21:32] <jks> I'll try attaching with gdb to get a backtrace
[21:32] <joao> okay
[21:33] <joao> so, let's try it this way
[21:34] <joao> kill the ailed mon, and start it under gdb
[21:34] <joao> should go something like 'gdb --args ceph-mon -i ID -d'
[21:34] <joao> run it, and once it hangs interrupt and get a backtrace
[21:35] <jks> I'll do that, thanks!
[21:35] <joao> we should be able to figure out more once we know where it is getting stuck
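
Spelled out, the session joao describes looks roughly like this (mon id "e" assumed):

    gdb --args ceph-mon -i e -d
    (gdb) run
    # ... wait for the hang, then interrupt with Ctrl-C ...
    (gdb) thread apply all bt
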
[21:38] * jjgalvez (~jjgalvez@ip98-167-16-160.lv.lv.cox.net) has joined #ceph
[21:48] <jks> joao, sorry for the long wait - had to get gdb installed, etc :)
[21:48] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[21:48] <jks> joao, it hangs in #0 0x00007ffff73e712d in read () from /lib64/libpthread.so.0
[21:48] <jks> joao, backtrace shows safe_read(), safe_read_exact(), Preforker::parent_wait() and main() (in that order)
[21:48] <jks> so I'm guessing gdb is attaching to the wrong subprocess or something?
[21:56] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[21:57] <jks> joao, okay, turned off detach-on-fork and switched to the secondary process and let it run until it hangs
[21:58] <jks> joao, that backtrace is: __lll_lock_wait(), _L_lock_886(), pthread_mutex_lock(), leveldb::port::Mutex::Lock (port_posix.cc:26), MutexLock (../util/mutexlock.h:27), leveldb::DBImpl::Get() (db_impl.cc:1098), LevelDBStore::_get_iterator(), KeyValueDB::get_iterator(), MonitorDBStore::exists(), main()
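
For anyone reproducing this, the fork-following steps jks just described look roughly like:

    (gdb) set detach-on-fork off
    (gdb) run
    # parent blocks in Preforker::parent_wait(); interrupt with Ctrl-C
    (gdb) info inferiors
    (gdb) inferior 2
    (gdb) continue
    # ... once it hangs, interrupt again ...
    (gdb) thread apply all bt
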
[22:12] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[22:13] * rotbeard (~redbeard@2a02:908:df10:6f00:22cf:30ff:fe1b:43c1) Quit (Quit: Leaving)
[22:19] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[22:20] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[22:35] * fghaas (~florian@205.158.164.101.ptr.us.xo.net) Quit (Quit: Leaving.)
[22:35] * diegows_ (~diegows@190.190.5.238) has joined #ceph
[22:41] * thomnico (~thomnico@2a01:e35:8b41:120:f1a9:cbfa:3fec:1668) Quit (Quit: Ex-Chat)
[22:42] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[22:54] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[23:00] * jjgalvez (~jjgalvez@ip98-167-16-160.lv.lv.cox.net) Quit (Read error: Connection reset by peer)
[23:01] * gregsfortytwo (~Adium@2607:f298:a:607:5c6d:7d97:be10:c78c) Quit (Ping timeout: 480 seconds)
[23:15] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[23:20] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[23:20] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[23:25] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[23:26] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has left #ceph
[23:41] * mattt (~textual@S010690724001c795.vc.shawcable.net) has joined #ceph
[23:47] * mattt (~textual@S010690724001c795.vc.shawcable.net) Quit (Quit: Computer has gone to sleep.)
[23:49] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[23:49] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) has joined #ceph
[23:54] * MACscr1 (~Adium@c-50-158-183-38.hsd1.il.comcast.net) has joined #ceph
[23:56] * gregsfortytwo (~Adium@2607:f298:a:607:99f3:4afe:a112:d22b) has joined #ceph
[23:57] * thuc (~thuc@c-71-198-202-49.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[23:59] * MACscr (~Adium@c-50-158-183-38.hsd1.il.comcast.net) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.