#ceph IRC Log


IRC Log for 2012-02-02

Timestamps are in GMT/BST.

[0:00] <nhm> bbl, heading home
[0:00] <sage> sjust ^^
[0:00] <sjust> sage: yeah, skimming now
[0:00] <sage> i was running into linker hell getting ceph-dencoder to use the types in PG.h
[0:01] <sjust> ah, right
[0:01] <sjust> seems reasonable to me
[0:01] <sage> k
[0:01] * andresambrois (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[0:02] * andresambrois (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[0:02] * adjohn (~adjohn@ma30536d0.tmodns.net) Quit (Read error: Connection reset by peer)
[0:03] * adjohn (~adjohn@ma30536d0.tmodns.net) has joined #ceph
[0:06] * andresambrois (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[0:08] * BManojlovic (~steki@ Quit (Read error: Connection reset by peer)
[0:09] * BManojlovic (~steki@ has joined #ceph
[0:09] * adjohn (~adjohn@ma30536d0.tmodns.net) Quit (Read error: Connection reset by peer)
[0:09] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[0:12] * andresambrois (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[0:19] * andresambrois (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Read error: Connection reset by peer)
[0:31] <elder> joshd, OK, thanks. I'm going to take the liberty of assuming bonnie doesn't work... From my naïve understanding it doesn't look like a problem related to what I have worked on.
[0:33] <elder> sage, I have a few more small changes, then I'm going to commit my wip-rbd-new-fixes branch contents to master later this evening. It includes rework on a few of the commits that are already there.
[0:34] <sage> elder: ok cool
[0:51] <Tv|work> FYI staff: if you trouble with plana /tmp/cephtest/archive/syslog access modes, try again -- should be fixed
[0:52] <Tv|work> erf
[0:52] <Tv|work> *if you had trouble
[0:52] <Tv|work> my brain has checked out already for the day
[0:52] <Tv|work> and so has the networking :(
[1:01] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Remote host closed the connection)
[1:02] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[1:05] * joao (~joao@89-181-157-105.net.novis.pt) Quit (Ping timeout: 480 seconds)
[1:14] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[1:27] * izdubar (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[1:34] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[1:34] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit ()
[1:37] * Mike (~Mike@awlaptop1.esc.auckland.ac.nz) has joined #ceph
[1:38] * Mike is now known as Guest1271
[1:38] <Guest1271> Hi again, I was previously Guest1164
[1:38] <Guest1271> I have located two disks osd3 and osd6 on 2 different servers that are down
[1:39] * yoshi (~yoshi@p11133-ipngn3402marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:39] <Guest1271> This happened after an iozone test hung ceph
[1:39] <Guest1271> When I used df -h on the server holding osd3 the device that is osd3 is no longer there (/dev/sde)
[1:40] <Guest1271> Being a Linux storage new-b (as well as ceph new-b) I wanted to know of the best way
[1:40] <Guest1271> to determine whether this disk has failed or if I can restore it (reformatting is ok, it is a test system)
[1:41] <Guest1271> and then get it back into ceph with a cosd command
[1:41] <Guest1271> Any advice?
[1:43] <Tv|work> Guest1271: /dev/sde disappeared?
[1:44] <Tv|work> Guest1271: is it like an external usb disk or something..
[1:44] <Guest1271> I can't see it in df -h, but it is there in fdisk -l
[1:44] <Tv|work> sounds like it's unmounted
[1:44] <Guest1271> Nope, a 2TB hard drive
[1:45] <Guest1271> I didn't think I needed to mount the dives if I am using btrfs with ceph?
[1:45] <Guest1271> Sorry I mean drives
[1:45] <Guest1271> Also 2TB HDD
[1:46] * amichel (~amichel@ has left #ceph
[1:46] <Guest1271> ss1:~ # df -h
[1:46] <Guest1271> Filesystem Size Used Avail Use% Mounted on
[1:46] <Guest1271> /dev/disk/by-id/scsi-36003048000848c6014dd6ab30d1718f7-part2
[1:46] <Guest1271> 144G 22G 116G 16% /
[1:46] <Guest1271> devtmpfs 8.9G 232K 8.9G 1% /dev
[1:46] <Guest1271> tmpfs 8.9G 4.0K 8.9G 1% /dev/shm
[1:46] <Guest1271> /dev/sdb 1.9T 9.1G 1.8T 1% /data/osd.11
[1:46] <Guest1271> /dev/sdc 1.9T 9.5G 1.8T 1% /data/osd.12
[1:46] <Guest1271> /dev/sdd 1.9T 9.8G 1.8T 1% /data/osd.13
[1:46] <Guest1271> /dev/sdf 1.9T 8.7G 1.8T 1% /data/osd.15
[1:46] <Guest1271> /dev/sdg 1.9T 8.6G 1.8T 1% /data/osd.16
[1:47] <Guest1271> Partial result from fdisk -l
[1:47] <Guest1271> Disk /dev/sde: 1999.0 GB, 1998998994944 bytes
[1:47] <Guest1271> 255 heads, 63 sectors/track, 243031 cylinders
[1:47] <Guest1271> Units = cylinders of 16065 * 512 = 8225280 bytes
[1:47] <Guest1271> Sector size (logical/physical): 512 bytes / 512 bytes
[1:47] <Guest1271> I/O size (minimum/optimal): 512 bytes / 512 bytes
[1:47] <Guest1271> Disk identifier: 0x00000000
[1:47] <Guest1271> Disk /dev/sde doesn't contain a valid partition table
[1:47] <Guest1271> Disk /dev/sdf: 1999.0 GB, 1998998994944 bytes
[1:47] <Guest1271> 255 heads, 63 sectors/track, 243031 cylinders
[1:47] <Guest1271> Units = cylinders of 16065 * 512 = 8225280 bytes
[1:47] <Guest1271> Sector size (logical/physical): 512 bytes / 512 bytes
[1:47] <Guest1271> I/O size (minimum/optimal): 512 bytes / 512 bytes
[1:47] <Guest1271> Disk identifier: 0x00000000
[1:47] <Guest1271> Disk /dev/sdf doesn't contain a valid partition table
[1:50] <Tv|work> sounds like you got corrupted disk contents and/or broken disks
[1:51] <Tv|work> unless your fdisk is e.g. just too stupid to understand GPT partition tables
[1:51] <Tv|work> ohhhhh
[1:51] <Tv|work> you're mounting the disks directly
[1:51] <Tv|work> without partition tables
[1:51] <Tv|work> that's kinda nasty
[1:52] <Tv|work> fdisk will never make heads or tails of that
[1:52] <Guest1271> They are just raw disks on the server and then ceph builds btrfs OSDs on them
[1:52] <Tv|work> sure but then you need to stop using tools like fdisk
[1:53] <Guest1271> So I guess what I'd like to do is "wipe" them and then bring them back into ceph so it can build btrfs again
[1:53] <Tv|work> try mounting /dev/sde again
[1:53] <Guest1271> But I need to make sure they have not failed or I can then go and pull them out and ask Seagate for replacements
[1:53] <Guest1271> ok
[1:53] <Tv|work> ignore my mention of corrupted disks etc, that was before i saw you're not using partitions
[1:54] <Tv|work> the disk contents look like garbled partitions to fdisk, for a good reason
[1:54] <Guest1271> ss1:~ # mount /dev/sde
[1:54] <Guest1271> mount: can't find /dev/sde in /etc/fstab or /etc/mtab
[1:54] <Tv|work> cat /etc/fstab
[1:54] <Guest1271> ss1:~ # cat /etc/fstab
[1:54] <Guest1271> /dev/disk/by-id/scsi-36003048000848c6014dd6ab30d1718f7-part1 swap swap defaults 0 0
[1:54] <Guest1271> /dev/disk/by-id/scsi-36003048000848c6014dd6ab30d1718f7-part2 / ext4 acl,user_xattr 1 1
[1:54] <Guest1271> proc /proc proc defaults 0 0
[1:54] <Guest1271> sysfs /sys sysfs noauto 0 0
[1:54] <Guest1271> debugfs /sys/kernel/debug debugfs noauto 0 0
[1:54] <Guest1271> devpts /dev/pts devpts mode=0620,gid=5 0 0
[1:55] <Tv|work> heh, irc rate limiting
[1:55] <Tv|work> was devpts the last line?
[1:55] <Guest1271> Yes, sorry
[1:55] <Tv|work> pastebin is good for >3 lines
[1:55] <Tv|work> how did you mount the others?
[1:55] <Tv|work> if you don't have entries in fstab, mount won't know what to do with the block device
[1:56] <Tv|work> you can do things manually, "mount /dev/sde /data/osd.42", or add entries to fstab etc
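[Editor's note: if one went the fstab route Tv mentions, an entry might look like the sketch below. The device is the one from this conversation; the `/data/osd.14` mount point is a guess inferred from the `df -h` output above (osd.11–osd.16 on sdb–sdg), not something stated in the log.]

```
/dev/sde  /data/osd.14  btrfs  defaults  0  0
```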
[1:56] <Guest1271> I don't mount them, I let ceph build btrfs on them and then mkcephfs does the rest
[1:56] <Tv|work> but i don't see sde being any different from the others, there
[1:56] <Tv|work> oh
[1:56] <Tv|work> you might want to avoid "btrfs devs"
[1:56] <Guest1271> As far as I understand it...
[1:56] <Tv|work> it's a convenience feature to make Sage's life better
[1:56] <Guest1271> All the others are working ok though
[1:56] <Tv|work> but if you choose to use it, the ceph init script should mount it
[1:56] <Tv|work> i guess
[1:56] <Guest1271> I think I just had a double point of failure
[1:56] <Guest1271> With one disk on two servers both going down
[1:57] <Guest1271> So I'd like to "clean" the disks and add them back in
[1:57] <Tv|work> "btrfs devs" in ceph.conf isn't very flexible
[1:57] <Tv|work> it knows how to that only at mkcephfs time
[1:57] <Tv|work> as i said, you may wish to avoid the feature
[1:58] <Tv|work> mkcephfs doesn't understand the concept of rebuilding just one osd
[1:58] <Tv|work> it'll re-do the whole cluster
[1:58] <Tv|work> that's also a very limited tool, the replacement for it just isn't ready yet
[1:58] <Guest1271> Yes, but if I can "clean" the dev I can then either go through the steps for adding the disk back?
[1:59] <Tv|work> i don't know what you mean by "clean" in this context
[1:59] <Guest1271> Or run cosd and bring the OSD back up and ceph will self-heal?
[1:59] <Guest1271> Don't need the data, just want the disk back if possible
[2:01] <sage> you can ceph-osd -i 123 --mkfs and it'll reinitialize.. then start it back up. you probably want to mkfs.btrfs and remount manually first
[2:01] <sage> (well, init-ceph will mount, so you can skip that part)
[2:03] <Tv|work> but it needs to be mounted before the ceph-osd --mkfs run
[2:03] <Tv|work> init script time is too late
[2:03] <Tv|work> you don't want to ceph-osd --mkfs the mount point
[2:03] <Tv|work> also ceph-osd --mkfs means you'll need to re-shuffle authentication keys, to make it part of the cluster
[2:03] <Tv|work> sorry, gotta run
[2:04] <Guest1271> No problem, actually gotta go too
[2:04] <Guest1271> Any last command?
[2:04] <Guest1271> Happy to mess up the disk's data at this stage
[2:05] <Guest1271> I'm thinking mkfs.btrfs /dev/sde ?
[2:05] <Guest1271> Then try cosd -i 3 -c /etc/ceph/ceph.conf
[2:05] <Guest1271> Or am I missing a step?
[2:06] <sage> mount btrfs, then run the --mkfs command.
[2:06] <sage> then start ceph-osd and it'll rejoin the cluster
[2:07] <Guest1271> ok, so 1) mkfs.btrfs /dev/sde
[2:07] <Guest1271> 2) mount /dev/sde /data/osd14
[2:07] <Guest1271> 3) ceph-osd -i 3 --mkfs
[2:08] <Guest1271> 4) cosd -i 3 -c /etc/ceph/ceph.conf
[2:08] <Guest1271> Does that sound right?
[2:08] <sage> s/cosd/ceph-osd/, but yeah, should do it
[2:08] <Guest1271> Great, thanks Sage, I'll try that when I get back
[2:08] <Guest1271> Most appreciated
[2:10] <sage> np
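[Editor's note: for readability, here is the whole single-OSD re-initialization sequence sage and Tv describe above, collected into one sketch. The device, mount point, and osd id are this conversation's values; substitute your own. The key ordering point from the discussion: the filesystem must be mounted *before* `ceph-osd --mkfs`. `RUN=echo` makes this a dry run; clear it to actually execute, as root, on your own risk.]

```shell
#!/bin/sh
# Re-initializing one wiped OSD, per the discussion above (hypothetical values).
DEV=/dev/sde       # the disk that dropped out
MNT=/data/osd.14   # mount point guessed from the df output in this log
ID=14              # osd id for that mount point
RUN=echo           # dry run by default; set RUN='' to really execute

$RUN mkfs.btrfs "$DEV"                          # 1. wipe: fresh empty btrfs
$RUN mount "$DEV" "$MNT"                        # 2. mount it BEFORE --mkfs
$RUN ceph-osd -i "$ID" --mkfs                   # 3. re-initialize the OSD store
$RUN ceph-osd -i "$ID" -c /etc/ceph/ceph.conf   # 4. start; it rejoins and recovers
```

As Tv notes above, with authentication enabled the re-created OSD will also need its keys sorted out before the cluster accepts it.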
[2:12] * Tv|work (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:13] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[2:16] * adjohn is now known as Guest1279
[2:16] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[2:17] * Guest1279 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Operation timed out)
[2:24] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:29] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[3:07] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[3:58] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:03] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[4:09] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) Quit (Quit: Who the hell is this Peer? If I catch him I'll reset his connection!)
[6:04] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[6:06] * adjohn (~adjohn@50-0-164-170.dsl.dynamic.sonic.net) has joined #ceph
[6:08] * adjohn (~adjohn@50-0-164-170.dsl.dynamic.sonic.net) Quit ()
[7:09] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[7:33] * Meths_ (rift@ has joined #ceph
[7:38] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[7:54] * yoshi (~yoshi@p11133-ipngn3402marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[9:19] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[11:06] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Read error: Operation timed out)
[11:12] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[11:22] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[11:33] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[11:34] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[11:35] * gregaf (~Adium@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[11:36] * sjust (~sam@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[11:36] * sagewk (~sage@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[11:42] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:18] * joao (~joao@ has joined #ceph
[12:20] * yehudasa (~yehudasa@aon.hq.newdream.net) has joined #ceph
[12:22] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[12:23] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[12:25] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[12:33] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[12:56] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:19] * joao (~joao@ Quit (Quit: joao)
[13:27] * hijacker (~hijacker@ Quit (Quit: Leaving)
[13:58] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[13:58] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[13:58] * fronlius_ is now known as fronlius
[14:03] * fghaas (~florian@ has joined #ceph
[14:03] * fghaas (~florian@ Quit ()
[14:04] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) has joined #ceph
[14:13] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:20] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:28] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[14:28] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[14:28] * fronlius_ is now known as fronlius
[14:28] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[14:28] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[14:50] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[14:58] * hijacker (~hijacker@ has joined #ceph
[15:19] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:30] <lxo> why would lots of pgs oscillate in ceph pg dump between active with $degr == $objects and whatever their actual state is (active+clean with $degr == 0, or active with the correct number of degraded objects), without any such changes reported by ceph -w, and without any changes in the osdmap other than a slow decrease in the number of degraded objects and active pgs?
[15:32] <lxo> this is new in 0.41 AFAICT
[15:35] <lxo> when this odd, unstable state hits (the $degr == $objects), it doesn't last very long (but long enough for ceph pg dump to capture it, sometimes more than once), and when it does, the $mip count also fluctuates up to $objects for pgs whose primary is not one of the preloaded osds
[16:11] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[16:13] * BManojlovic (~steki@ has joined #ceph
[16:14] * joao (~joao@ has joined #ceph
[16:36] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[16:41] * joao (~joao@ Quit (Quit: joao)
[16:44] * joao (~joao@89-181-144-188.net.novis.pt) has joined #ceph
[17:03] <Kioob`Taff> hi
[17:04] <Kioob`Taff> I know Ceph is not ready for production, but I have a lot of "hangs", and I see no way to debug that
[17:05] <Kioob`Taff> A simple "ls" on ceph clients hang. What can I do to find where is the problem ?
[17:10] <nhm> Kioob: Do the hangs ever resolve?
[17:13] <Kioob`Taff> yes
[17:14] <nhm> Kioob`Taff: do they hang for consistent amounts of time or does it vary a lot?
[17:14] <Kioob`Taff> mm I don't know
[17:15] <Kioob`Taff> the "load" varies a lot for now, devs are doing tests and benchmarks
[17:15] <Kioob`Taff> and Ceph hang near every day
[17:16] <nhm> Kioob`Taff: Yeah, I'm wondering if something isn't responding or maybe timeouts of some kind.
[17:17] <nhm> Kioob`Taff: when you say it hangs every day, is this the same thing you were talking about when you say that ls hangs?
[17:17] <Kioob`Taff> yes
[17:18] <Kioob`Taff> all ceph process seem running, but read on ceph mounts doesn't respond
[17:19] <nhm> Kioob`Taff: do writes also stop?
[17:20] <Kioob`Taff> a "touch x" works
[17:20] <Kioob`Taff> echo "xxx" >> x
[17:20] <Kioob`Taff> works also
[17:20] <Kioob`Taff> mmm
[17:21] <Kioob`Taff> "cat x" works oO
[17:21] <Kioob`Taff> mmm now ls works too
[17:21] <nhm> ok, so looks like it resolved itself at some point... how long do you think it was hung for?
[17:21] <Kioob`Taff> but the other "ls" in another ssh session is still running
[17:21] <nhm> same client?
[17:22] <Kioob`Taff> yes
[17:22] <Kioob`Taff> the ls has been running for several minutes
[17:22] <Kioob`Taff> (near 20 minutes)
[17:23] <nhm> is the hung ls still hung?
[17:23] <Kioob`Taff> oh : "ls" works. But not "ls -l"
[17:23] <nhm> ah
[17:23] <lxo> as to the oscillating pg dumps, I'm now guessing one of my mons is lagging behind the others, because an experiment (bringing down one of the osds) showed oscillation between its being listed and not. I'm guessing it's the one whose osd partitions are being most written to. now, is a mon supposed to give out outdated info like that?
[17:23] <Kioob`Taff> so... a MDS problem ?
[17:23] <nhm> sounds like it
[17:23] <nhm> same thing if you ls -l that "x" file you created?
[17:25] <Kioob`Taff> yes, "ls -l x" works
[17:25] <nhm> ok. How about if you repeat the original ls -l in the new terminal?
[17:27] <Kioob`Taff> so... stupid aliases. "ls -l" works very well. But not "ls -al"
[17:28] <Kioob`Taff> so, the "stat" on the "." or ".." maybe ?
[17:28] <nhm> seems like decent guess.
[17:28] <Kioob`Taff> mm no, "stat" works too
[17:28] <nhm> huh
[17:28] <nhm> maybe go through the directory one by one?
[17:29] <Kioob`Taff> mmm, I didn't understand :$
[17:30] <nhm> so you are doing ls -l <some dir> vs ls -al <some dir> right? One hangs, one does not?
[17:30] <Kioob`Taff> yes
[17:30] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[17:31] <nhm> I wonder if you tried doing a ls -l on everything in the directory individually (both hidden and not) if you could trigger it...
[17:33] <Kioob`Taff> wonderful :D I have a ".boot.inc.php" file in that directory, a "cat" of it doesn't answer
[17:33] <nhm> hrm
[17:33] <nhm> so I wonder what's up with that file.
[17:34] <Kioob`Taff> and... how can I check that ?
[17:35] <nhm> Kioob: good question. Thinking. One of the ceph devs might have better answers too.
[17:35] <Kioob`Taff> ok, thanks :)
[17:36] <nhm> any idea if that file was previously accessible?
[17:36] <Kioob`Taff> yes it was
[17:36] <Kioob`Taff> near all scripts of the devs use it
[17:37] <Kioob`Taff> so PHP made a lot of "stat()" and read() of that file
[17:41] <nhm> Kioob`Taff: if you restart the mds, you could try "ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'"
[17:43] <Kioob`Taff> I have two mds, should I restart both ?
[17:47] <Kioob`Taff> mmm the restart of one MDS made all my ls answer
[17:50] * lxo uses his ancient distributed systems knowledge to realize that neither the lagging-behind mon nor the ceph client can realize that they're dealing with outdated information unless they talk to other mons, or ceph saves earlier state to discard older info
[18:00] <Kioob`Taff> mmm, I have a "frozen" rsync now
[18:00] <Kioob`Taff> (state "D")
[18:00] <Kioob`Taff> and a restart of both MDS doesn't solve anything
[18:01] <Kioob`Taff> (I also do a "strace -p xxx" on the rsync process and didn't see any progress)
[18:02] <Kioob`Taff> select(1, [0], [], NULL, {16, 76126}) = 0 (Timeout)
[18:08] <gregaf> Kioob`Taff: there are a few possibilities as to what's going wrong with stat'ing that file; unfortunately figuring it out will require reasonably detailed MDS logs (do you have some)
[18:09] <gregaf> options are either that another client is currently working on that file and can't/won't give up its capabilities, or the MDS hit a bug and that request is effectively lost
[18:09] <Kioob`Taff> I have some logs.... but I'm not able to understand them :/
[18:09] <gregaf> actually if a restart of the MDS made it work…probably the second option
[18:10] <Kioob`Taff> what should I search on that logs ?
[18:10] <gregaf> you want to go to the logs from before you restarted the MDS
[18:11] <gregaf> and look for the request that hung
[18:12] <gregaf> probably your best bet is to grep for the filename and pipe that into a grep for incoming messages; then check out the last of those that is a stat
[18:12] <gregaf> that will include the request id and you want to try and track what happened to that request
[18:12] <gregaf> if you zip them up and post them somewhere/email them (if they're small enough) somebody will dig through them eventually; we don't want requests to get lost ;)
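[Editor's note: a hypothetical shape for the grep pipeline gregaf suggests. The log path is an assumption, and the `client_request` pattern assumes incoming requests are logged that way with `--debug-ms 1`; verify both against your own log before relying on this.]

```shell
#!/bin/sh
# Dig a hung request out of the MDS log, per gregaf's advice above.
LOG=/var/log/ceph/mds.a.log   # assumed log location; adjust to your setup
FILE='.boot.inc.php'          # the file whose stat hung in this conversation

if [ -r "$LOG" ]; then
    # Lines mentioning the file, narrowed to incoming client requests.
    # The last stat-type request carries the request id to trace onward.
    grep -F "$FILE" "$LOG" | grep 'client_request'
fi
```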
[18:13] <sagewk> elder: did you see Jim Schutt's messenger patch?
[18:14] <gregaf> lxo: I don't think I followed your explanation of what you were seeing with the ceph pg dump...
[18:15] <gregaf> but I don't believe you should be able to connect to an out-of-date monitor for long; if they're not in the quorum they start refusing connections
[18:17] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[18:18] <lxo> I was seeing information in pg dumps that (after some investigation) was clearly outdated info
[18:19] <lxo> the mon is in the quorum AFAICT, it's just taking a while to commit state to disk
[18:19] <gregaf> lxo: actually, no…I forgot we didn't reject new connections, we just boot them quickly if we can't get back in
[18:19] <gregaf> but not quickly enough to avoid giving a pg dump
[18:19] <gregaf> argh
[18:21] <lxo> hmm, maybe it's not as simple as slow commits. that mon's log has tons of “mon.2@2(peon).pg v283678 update_from_paxos: error parsing incremental update: buffer::end_of_buffer” messages, all the way to the last logrotate. maybe I should resync the mon state from one of the active mons...
[18:21] <Kioob`Taff> gregaf: does that file can help ? http://bool.boolsite.net/ceph-mds.log.gz
[18:21] <Kioob`Taff> I tried to filter a little
[18:22] <lxo> v304841 is the current pg map. wow! talk about lagging behind ;-)
[18:22] <gregaf> Kioob`Taff: looking
[18:23] <Kioob`Taff> thanks gregaf
[18:23] <gregaf> lxo: yikes
[18:23] <gregaf> I wonder how that could have happened :/
[18:25] <lxo> my first guess would be one of those btrfs bugs that have been zeroing files out
[18:25] <lxo> I've got pginfo and pglog files zeroed out like that, but this is a first in the mon filesystem
[18:26] <gregaf> I didn't know btrfs had zero-file bugs :(
[18:29] <Kioob`Taff> (I had a lot of "zero-file" bugs before too)
[18:30] <gregaf> Kioob`Taff: I think I see it being accessed on replay, but what you've given me doesn't seem to have any of the incoming messages before restart
[18:30] <gregaf> unless I'm misunderstanding what I'm seeing here
[18:34] <Kioob`Taff> I filtered like that : grep mds.db1 /var/log/syslog | grep 'Feb 2 17:'
[18:34] <Kioob`Taff> maybe it's wrong ?
[18:35] <gregaf> oh, hrm…what was going on from 17:14 to 17:45? I don't see any incoming messages at all in that period
[18:35] <gregaf> (17:14 is the start of the log)
[18:36] <Kioob`Taff> between 17:14 and 17:45 I was doing tests with nhm, here
[18:36] <gregaf> wait, actually, no messages until 17:51 — did you turn up the logging at some point in there?
[18:37] <Kioob`Taff> I run "ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'" and "ceph mds tell 1 injectargs '--debug-mds 20 --debug-ms 1'", near 17:50
[18:37] <gregaf> ah
[18:37] <gregaf> okay, we're unlikely to pick anything useful out of the older logs then :(
[18:37] <Kioob`Taff> ok, sorry :p
[18:37] <gregaf> np
[18:37] <lxo> not a zero-file bug, FWIW. pgmap/last_consumed was 283678, even though {first,last}_committed were way ahead of it, and pgmap/283678 no longer existed
[18:37] <Kioob`Taff> so, I have to wait for the next "hang" ?
[18:38] <gregaf> yeah, unfortunately
[18:38] <Kioob`Taff> ok, np
[18:38] <Kioob`Taff> thanks gregaf
[18:38] <gregaf> thank you!
[18:38] <sagewk> elder: did you mean to lock 20 sepia nodes?
[18:38] <elder> sage, CRAP
[18:38] <sagewk> tv: you have 14
[18:38] <elder> Just a sec.
[18:39] <gregaf> lxo: so something happened that prevented it from being consumed, but we can't tell what it was any more :(
[18:39] <gregaf> sagewk: he's not in right now
[18:39] <elder> But to your earlier note, I hadn't looked at the list this morning. I'll look at it after lunch (in a few hours), I'm leaving shortly.
[18:40] <sagewk> elder: k
[18:43] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:43] <elder> sagewk, shouldn't a run unlock the nodes when they're done if I get them locked dynamically?
[18:43] <lxo> gregaf, yeah :-( if I had to guess, I'd say it was the partial rollback attempt that got 0.41 all confused
[18:43] <elder> Mostly I've been using the same 3 nodes but I ran a test or two recently where I allocated them on the fly.
[18:44] <elder> Or maybe I did something else wrong. Anyway, I unlocked the machines I had in use.
[18:44] <sjust> lxo: yeah... I've been trying to make sense of your logs, some of the pgs do appear to be making progress
[18:44] <sagewk> elder: it doesn't if there's a failure.. cleanup is sort of tedious. nuke-on-error: true helps some
[18:44] <gregaf> lxo: I don't think that should have been able to confuse the pg map, at least not in a way that would work on 2 monitors but not the third...
[18:44] <sagewk> i usually lock/unlock them explicitly
[18:45] <elder> OK. I have used nuke-on-error but maybe I missed it.
[18:45] <lxo> actually, no...
[18:45] <lxo> (it wasn't what I thought of)
[18:45] <gregaf> Kioob`Taff: oh, you still have a hung rsync?
[18:45] <elder> About to forcefully update the master branch on ceph-client.
[18:45] <gregaf> that we ought to be able to work out if it got hung after you changed the logging level
[18:45] <Kioob`Taff> yes
[18:46] <elder> (finally)
[18:46] <lxo> that mon went down after this: 2012-02-01 08:15:40.262154 7fcf71bc4700 mon.2@2(peon).pg v283678 PGMonitor::preprocess_pg_stats: no monitor session!
[18:46] <lxo> mon/Paxos.cc: In function 'void Paxos::handle_begin(MMonPaxos*)' thread 7fcf71bc4700 time 2012-02-01 08:15:45.390468
[18:46] <lxo> mon/Paxos.cc: 412: FAILED assert(begin->last_committed == last_committed)
[18:46] <lxo> I get this assertion failure on it sometimes
[18:46] <Kioob`Taff> yes gregaf, but rsync do a lot of stats, so logs are huge
[18:46] <lxo> now, does that version number look familiar?
[18:46] * aliguori (~anthony@ has joined #ceph
[18:46] <gregaf> heh
[18:47] <elder> No, I forgot to re-merge in Linus master. It'll have to wait until after lunch.
[18:47] <gregaf> Kioob`Taff: well I've got bandwidth to download if you've got room to post :)
[18:47] <lxo> I brought it back up several hours later, and that's when it started printing that message
[18:47] <Kioob`Taff> ok
[18:48] <gregaf> Kioob`Taff: also if you could give me the list of MDS requests currently in-flight on that node running the rsync
[18:48] <gregaf> which you can pull out of…argh, I don't remember what fake filesystem it's in
[18:48] <gregaf> is there a Ceph subdir in /sys/kernel/debug?
[18:49] <gregaf> or maybe it's /proc/sys/kernel...
[18:49] <gregaf> sagewk: what fake file has the list of mds requests in flight?
[19:50] <sagewk> /sys/kernel/debug/ceph/*/mdsc
[18:50] <sagewk> may need to mount -t debugfs none /sys/kernel/debug first
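[Editor's note: sagewk's two steps, as one sketch. Mounting debugfs needs root, so the mount is left as a dry-run echo here; drop the `echo` to really mount. The `mdsc` file lists the kernel client's in-flight MDS requests, one directory per client instance.]

```shell
#!/bin/sh
# Inspect in-flight MDS requests from a kernel client, per sagewk above.
MDSC_GLOB='/sys/kernel/debug/ceph/*/mdsc'

# debugfs may need mounting first (dry run here; remove 'echo' to execute)
grep -qs debugfs /proc/mounts || \
    echo mount -t debugfs none /sys/kernel/debug

# dump pending requests for every mounted ceph client instance, if any
for f in $MDSC_GLOB; do
    if [ -r "$f" ]; then
        cat "$f"
    fi
done
```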
[18:50] <gregaf> Kioob`Taff: that one, if it exists
[18:51] <lxo> these were the first messages it printed after connecting to other mons, requesting and election and logging the osdmap updates
[18:57] <gregaf> lxo: yeah, looking
[19:02] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[19:03] <Kioob`Taff> gregaf: http://bool.boolsite.net/ceph-mds2.log.gz
[19:03] <Kioob`Taff> there was only the rsync
[19:06] <gregaf> cool
[19:06] <gregaf> I'm doing lxo's thing right now; I'll get to it sometime today :)
[19:06] * chutzpah (~chutz@ has joined #ceph
[19:07] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[19:07] <Kioob`Taff> and I haven't got the debug mountpoint
[19:07] <lx0> sjust, yeah, the pgs were making progress indeed, but not in the +backfill state. I'm not sure how much that matters, it just looked odd that it was used for a few pgs only
[19:08] <sjust> lx0: their logs actually extend back to (0,0) so they would not need to backfill
[19:08] <lx0> sjust, BTW, those logs are all from before the mon got into this odd state, so don't worry about that
[19:09] <lx0> aah. I guess that somehow makes sense to you. me, I guess I'll have to read up on what backfill amounts to ;-)
[19:10] <lx0> it doesn't surprise me that the logs extend back to (0,0); the osds were initialized while a third osd was missing. I think this prevents logs from trimming or something
[19:13] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[19:13] <sjust> lx0: it's actually pretty simple, when the primary's log overlaps the replica it is recovering, it can just walk the log from the replica's last update to the pg's last update to get all objects that need to be recovered
[19:13] <sjust> when they don't overlap, we need to scan the pg on disk collections
[19:13] <sjust> the latter is backfill
[19:14] <sjust> so in this case, the logs overlap (even with empty osds)
[19:14] <sjust> and we don't need to backfill
[19:15] <lx0> oh, so backfill is the worse case, rather than the new shiny optimized recovery?
[19:18] <lx0> so I guess the few pgs that did undergo backfill did so because they managed to replicate to the first osd that came up, trim their logs, and then when other osds joined the group, the primary no longer had full logs to do the replication and had to resort to backfilling. does that sound right?
[19:20] * lx0 is now known as lxo
[19:24] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[19:54] <sjust> lxo: sounds plausible
[19:55] <lxo> well, then... sorry about wasting your time on a non-issue :-(
[19:55] <sjust> lxo: backfill is sort of the shiny new optimized worst case
[19:55] <sjust> lxo: not at all
[19:55] <sjust> lxo: let me know if your pgs don't make it all the way to active+clean
[19:56] <lxo> I'm 2/3 there already; hopefully I'll have a fully clean cluster by the end of the week
[19:56] <sjust> lxo: how much data is there?
[19:56] <lxo> about 1TB
[19:57] <sjust> lxo: hmm, that's pretty absurd for 1TB of data
[19:57] <lxo> what is?
[19:57] <sjust> the amount of time
[19:58] <sjust> should really be recovering faster than that...
[19:59] * Meths_ is now known as Meths
[19:59] <lxo> well, it's all coming out of two disks, that's seeking a lot because so many osds are trying to get data from it
[20:00] <sjust> lxo: yeah, but it's still quite a while
[20:00] <sjust> how many servers?
[20:00] <lxo> well, I'm not surprised. I kind of got used to it
[20:00] <lxo> 3 servers, two with two disks each, the other with the rest
[20:01] <sjust> osd.0 and osd.2 are on different servers?
[20:01] <lxo> yep
[20:02] <lxo> and each of the servers has one of the osds in an external disk, but those are mostly unused ATM, and have already fully synced up. so we're speaking of osds 5 to 10 syncing from osd0 alone (osd2 has already completed syncing its stuff; it should be the same as osd0, but for some reason it's faster)
[20:02] <lxo> actually, I brought osd0 down (but still in), so that osd2 would take over the syncing
[20:03] <lxo> so now it's all coming out of a single SATA spinning disk
[20:03] <sjust> lxo: hmm, it's possible we do something silly like always choose the lower numbered location of a missing object
[20:03] <lxo> (my guess is that osd0 is slower because it also runs a mail server, that constantly syncs to the same disk)
[20:03] <sjust> lxo: ah, that wouldn't help
[20:04] <lxo> before I brought osd0 down, it was all still coming out of the same disk, just the slower of the two ;-)
[20:04] <sjust> lxo: that's what I'm thinking
[20:05] <sjust> lxo: yeah, in pretty much every case where we could have pulled from either osd.0 or osd.2, I think we were pulling from osd.0
[20:06] <lxo> you sure? I got the impression that, given [a,b,c], when a needed an object, it would try b and then c, regardless of their numbers
[20:07] <sjust> lxo: maybe, but if b and c don't have it it pulls from the first up member of the set of osds that have the object (which is not necessarily related to the set of acting osds)
[20:08] <sjust> lxo: it's usually not so acute a problem, but we should get around to fixing it at some point :)
[20:09] <lxo> aah. well, that's not really relevant for my case. osd.0 and osd.2 are acting for 95%+ of the pgs here, for all the other disks are much smaller
[20:09] <sjust> lxo: actually, it doesn't consider acting at all, it just dumps all osds which might have the object into an stl set and pulls from the first one that's up
[20:10] <lxo> besides, they were the only ones that had data in the first place
[20:10] <lxo> oh, now *that* would explain why osd0 lags behind osd2 in recovery
[20:10] <sjust> lxo: yeah, but if we were choosing randomly, osd0 and osd2 would split the load
[20:10] <sjust> osd2 is only being used for those pulls where the primary doesn't know about a copy on osd0
[20:11] <lxo> or those that have osd2 itself as primary, right?
[20:11] <sjust> lxo: right
[20:11] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[20:11] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[20:11] * fronlius (~fronlius@testing78.jimdo-server.com) Quit ()
[20:12] <sjust> lxo: since they don't need to recover the primary in the first place
[20:12] <lxo> ok, that kind of sucks for the way I chose to initialize the cluster, but I agree it's not such a big deal in general, although certainly not ideal
[20:12] <sjust> lxo: well, I'm going to make a bug for when we get a bit of time, it won't actually be much work to fix
[20:13] <lxo> yeah, just adding a random_shuffle before picking the first should do ;-)
[20:13] <lxo> (that's the stupid way to pick any random member of the set, you know ;-)
[20:14] <sjust> lxo: certainly, I just need to make sure that it doesn't foul up the fail over (we move on to the rest of the set if the first fails)
[20:19] <lxo> maybe I should take this opportunity to try out some patch. can you point me at the location in the code where the primary chooses where to fetch an object from?
[20:19] <sjust> ReplicatedPG::pull
[20:20] <lxo> yay, I'd just opened that file as a first guess ;-)
[20:23] <sjust> yeah, I think that randomizing that selection is adequate
[20:24] <lxo> eek, it's a set, not a vector
[20:25] <sjust> lxo: yeah, but just choosing an element at random is enough, I believe
[20:27] <lxo> hmm... we need some way to stop if all elements turn out to be down
[20:30] <sjust> copy to vector and permute?
[20:31] <lxo> yeah. shouldn't be a big vector, anyway
[20:33] <lxo> otherwise it would be kind of silly to copy and shuffle the whole thing instead of picking at random and removing or so
[20:43] * lxo is reminded of why C++ is his favorite language, and sighs for not writing more code in it ;-)
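[Editor's note: the copy-to-vector-and-shuffle idea discussed above, with the fallback when every candidate turns out to be down, could be sketched roughly as follows. This is a minimal illustration, not the actual Ceph patch; `pick_random_up` and the `is_up` predicate are hypothetical names.]

```cpp
#include <algorithm>
#include <random>
#include <set>
#include <vector>

// Hypothetical helper: instead of always pulling from the lowest-numbered
// OSD in the set, copy the candidates to a vector, shuffle, and take the
// first one that is up. Returns -1 if all candidates are down, so the
// caller can still detect the "no source available" case.
template <typename Pred>
int pick_random_up(const std::set<int>& have, Pred is_up) {
    std::vector<int> candidates(have.begin(), have.end());
    std::shuffle(candidates.begin(), candidates.end(),
                 std::mt19937{std::random_device{}()});
    for (int osd : candidates)
        if (is_up(osd))
            return osd;   // first up OSD in the shuffled order
    return -1;            // every element of the set was down
}
```

With two up replicas (say osd.0 and osd.2) the pulls now split between them at random, while the shuffled-order walk preserves the fail-over behavior sjust mentions: if the first choice is down, the remaining candidates are still tried.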
[20:44] * fronlius (~fronlius@f054108045.adsl.alicedsl.de) has joined #ceph
[20:47] <lxo> sjust, did you open the bug yet? I'd attach the patch there once I'm done testing, if so. otherwise I'll just post it to the list
[20:48] <sjust> http://tracker.newdream.net/issues/2016
[20:48] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[20:51] <lxo> sjust, FYI, my wife is going to hate you for bringing more load onto her desktop (that holds osd.2 ;-)
[20:55] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[21:02] <lxo> it works! patch attached
[21:11] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[21:15] * alexxy (~alexxy@ has joined #ceph
[21:26] * adjohn is now known as Guest1372
[21:26] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[21:32] * Guest1372 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Ping timeout: 480 seconds)
[22:01] <sjust> lxo: heh
[22:04] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) has joined #ceph
[22:07] * jclendenan (~jclendena@ Quit (Remote host closed the connection)
[22:12] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:16] * diegows (~diegows@ has joined #ceph
[22:19] * diegows (~diegows@ Quit (Remote host closed the connection)
[22:32] * fronlius_ (~fronlius@e176052045.adsl.alicedsl.de) has joined #ceph
[22:38] * fronlius (~fronlius@f054108045.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[22:38] * fronlius_ is now known as fronlius
[22:39] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:41] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:30] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) Quit (Quit: Ex-Chat)
[23:35] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[23:48] * joao is now known as Guest1381
[23:48] * joao_ (~joao@89-181-144-188.net.novis.pt) has joined #ceph
[23:48] * joao_ is now known as joao
[23:54] * Guest1381 (~joao@89-181-144-188.net.novis.pt) Quit (Ping timeout: 480 seconds)
[23:56] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[23:58] * joao (~joao@89-181-144-188.net.novis.pt) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.