The following describes a drive failure I had on an Ubuntu Linux server running LVM on top of a Linux software RAID 5 volume, how I diagnosed it, and how I went about fixing it. The server had four 2TB drives in software RAID 5.

When checking kernel messages, here is an example of the bad sectors:

# dmesg -T
[Sun Jul 21 13:36:30 2013] ata4.00: status: { DRDY ERR }
[Sun Jul 21 13:36:30 2013] ata4.00: error: { UNC }
[Sun Jul 21 13:36:30 2013] ata4.00: configured for UDMA/133
[Sun Jul 21 13:36:30 2013] ata4: EH complete
[Sun Jul 21 13:36:32 2013] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
[Sun Jul 21 13:36:32 2013] ata4.00: irq_stat 0x40000008
[Sun Jul 21 13:36:32 2013] ata4.00: failed command: READ FPDMA QUEUED
[Sun Jul 21 13:36:32 2013] ata4.00: cmd 60/20:00:2c:eb:7e/00:00:08:00:00/40 tag 0 ncq 16384 in
[Sun Jul 21 13:36:32 2013] res 41/40:00:2f:eb:7e/00:00:08:00:00/40 Emask 0x409 (media error) <F>
[Sun Jul 21 13:36:32 2013] ata4.00: status: { DRDY ERR }
[Sun Jul 21 13:36:32 2013] ata4.00: error: { UNC }
[Sun Jul 21 13:36:32 2013] ata4.00: configured for UDMA/133
[Sun Jul 21 13:36:32 2013] sd 3:0:0:0: [sdd] Unhandled sense code
[Sun Jul 21 13:36:32 2013] sd 3:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sun Jul 21 13:36:32 2013] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[Sun Jul 21 13:36:32 2013] Descriptor sense data with sense descriptors (in hex):
[Sun Jul 21 13:36:32 2013] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Sun Jul 21 13:36:32 2013] 08 7e eb 2f
[Sun Jul 21 13:36:32 2013] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[Sun Jul 21 13:36:32 2013] sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 08 7e eb 2c 00 00 20 00
[Sun Jul 21 13:36:32 2013] end_request: I/O error, dev sdd, sector 142535471
[Sun Jul 21 13:36:32 2013] ata4: EH complete

In this case the messages make it clear that the problem was with sdd, but it’s not always obvious which ata# in dmesg matches up with which drive, so I followed this askubuntu guide to confirm that ata4.00 really was /dev/sdd.

In my case, the problem was indeed with /dev/sdd (ata4).
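A quick way to double-check that mapping yourself (this is just a generic sanity check, not something from the guide) is to look at the drive’s sysfs path, which includes the ATA port number:

# ls -l /sys/block/sdd

The symlink target should contain the port somewhere in the middle, something like .../ata4/host3/target3:0:0/..., confirming that /dev/sdd is the drive on ata4.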

I have mdadm configured to email me if there’s a problem with the array, but it never sent anything, even though I confirmed it was configured to do so.
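If you want to verify your own alerting setup, the things to check (on Ubuntu, at least) are the MAILADDR line in /etc/mdadm/mdadm.conf, that the mdadm monitor daemon is running, and that local mail delivery works at all. mdadm can also send a test alert for each array; this is a generic check, not something from my original notes:

# grep MAILADDR /etc/mdadm/mdadm.conf
# mdadm --monitor --scan --oneshot --test

The second command should generate a TestMessage email for every array it finds.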

I checked to make sure the RAID looked normal:

# cat /proc/mdstat

The output looked normal (all U’s for the drives).
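For reference, a healthy four-disk RAID 5 entry in /proc/mdstat looks roughly like this (illustrative, not my actual output; the part that matters is the [4/4] [UUUU] at the end):

md0 : active raid5 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      ... blocks ... [4/4] [UUUU]

A failed member shows up as an underscore in that bracket, e.g. [4/3] [UU_U], with an (F) next to the device name.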

The output of mdadm looked fine as well:

# mdadm --detail /dev/md0

The problem, according to dmesg, was that the kernel was having trouble reading specific sectors on [sdd], so I knew I needed to start checking that drive specifically.

There are some good guides available about dealing with bad blocks, such as the “bad block HOWTO”.

Also useful is the FAQ for smartmontools (the package that contains the smartctl program).

The problem with the bad block HOWTO was that it didn’t cover my case, which is RAID/LVM. You definitely don’t want to get the dd command wrong, and I was nervous about trying to get it right on my system. A better approach may be to use hdparm as described here (forcing a hard disk to reallocate bad sectors). However, I didn’t do this, for reasons explained below.

The crux of that last page is looking for the bad sector in dmesg, and then confirming it with this command (substituting your own sector number):

# hdparm --read-sector 1261069669234239432572396425 /dev/sdb

You should get:

/dev/sdb: Input/Output error

The drive can’t be part of the array when you do this, so after taking it out of the array you do a:

# hdparm --write-sector 1261069669234239432572396425 /dev/sdb

followed by a forced assemble to get the drive back into the array.
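For completeness, here’s roughly what that whole sequence would look like on my array, as a sketch only (I didn’t actually run this). The sector number is the one from my dmesg output, the sdX2 member partitions other than sdd2 are assumptions about my layout, everything on top of the array (LVM, filesystems) would need to be deactivated first, and hdparm refuses --write-sector unless you also pass --yes-i-know-what-i-am-doing:

# mdadm --stop /dev/md0
# hdparm --read-sector 142535471 /dev/sdd
# hdparm --write-sector 142535471 --yes-i-know-what-i-am-doing /dev/sdd
# mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2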

However, I didn’t want to do all that, for two reasons: I didn’t know the extent of the bad sectors (whether they were isolated or the whole drive was going bad), and the command writes 0’s into the sector, which means you will lose data if you’re not careful (since the array was otherwise still functioning, I had to be sure not to write 0’s onto the good drives). So I decided to try something else instead.

I started with smartctl, which reported that the individual drive was healthy. It turns out that a drive can be on the verge of failure and SMART will still report it as healthy; the SMART health status is more an indicator that things may be going bad than a definitive way to tell:

# smartctl -Hc /dev/sdd
SMART overall-health self-assessment test result: PASSED

Then I checked the vendor-specific SMART attributes using -a. The thing that stands out here is “197 Current_Pending_Sector” being higher than 0 (in my case 40), meaning there are sectors pending reallocation, which strongly suggests bad sectors.

# smartctl -a /dev/sdd

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 6200
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 23
5 Reallocated_Sector_Ct 0x0033 188 188 140 Pre-fail Always - 89
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 5693
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 21
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 18
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 4
194 Temperature_Celsius 0x0022 080 074 000 Old_age Always - 72
196 Reallocated_Event_Count 0x0032 113 113 000 Old_age Always - 87
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 40
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

I decided to run a short test:

# smartctl -t short /dev/sdd

And then 60 seconds later I checked the status:

# smartctl -l selftest /dev/sdd

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 2 Short offline Completed: read failure 90% 5691 71314612

That confirmed there was a bad sector. At this point, I needed to make a judgment call on whether the drive was going bad, or whether there were just a few bad sectors I could deal with and move on. You can deal with a bad sector by writing 0’s over it; the disk firmware will then either repair it in place or reallocate it to a spare sector. The reason for this is explained in the smartmontools FAQ mentioned earlier:

If the disk can read the sector of data a single time, and the damage is permanent, not transient, then the disk firmware will mark the sector as ‘bad’ and allocate a spare sector to replace it. But if the disk can’t read the sector even once, then it won’t reallocate the sector, in hopes of being able, at some time in the future, to read the data from it. A write to an unreadable (corrupted) sector will fix the problem. If the damage is transient, then new consistent data will be written to the sector. If the damage is permanent, then the write will force sector reallocation. Please see Bad block HOWTO for instructions about how to force this sector to reallocate (Linux only).

The disk still has passing health status because the firmware has not found other signs of trouble, such as a failing servo.

Such disks can often be repaired by using the disk manufacturer’s ‘disk evaluation and repair’ utility. Beware: this may force reallocation of the lost sector and thus corrupt or destroy any file system on the disk. See Bad block HOWTO for generic Linux instructions.

The problem, again, is that the “bad block HOWTO” guide I mentioned earlier doesn’t cover my scenario, RAID/LVM. I’m sure you could dig in and find exactly which sector to write to, but I didn’t want to risk it. So I was about to track down a Western Digital disk evaluation and repair utility when I ran across a post suggesting I could just do a RAID sync (a “repair” on older kernels). To initiate it, you run:

# echo 'check' > /sys/block/md0/md/sync_action

Then monitor the status of the check with:

# cat /proc/mdstat
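If you’d rather have it refresh on its own than re-run that by hand, something simple like this works:

# watch -n 5 cat /proc/mdstat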

In my case, it was going really slowly, so I first did what I could to shut down unnecessary activity on the drives, and then ran through the suggestions from here.

The main thing that sped things up was raising the stripe cache size well above the default of 256:

# echo 32768 > /sys/block/md0/md/stripe_cache_size
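The other knobs that commonly come up for speeding up md checks and rebuilds are the sync speed limits. I’m noting them here for completeness; the values are just examples, not recommendations from that page:

# echo 50000 > /proc/sys/dev/raid/speed_limit_min
# echo 500000 > /proc/sys/dev/raid/speed_limit_max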

As it was doing the check, lots of errors about the drive were showing up in dmesg, so I knew this wasn’t an isolated incident I’d be able to fix by using a drive utility to mark bad sectors; the whole drive would need to be replaced.

As I was monitoring the RAID status, the check got through about 5% and then md removed the drive from the array and stopped its work. Here’s what dmesg said:

[Sun Jul 21 17:14:29 2013] md/raid:md0: Disk failure on sdd2, disabling device.
[Sun Jul 21 17:14:29 2013] md/raid:md0: Operation continuing on 3 devices.
[Sun Jul 21 17:14:29 2013] md: md0: data-check done.

So now I have a degraded array and need to get a new drive ASAP to replace it and rebuild the array.

To do this, I had to make sure I pulled the right physical drive by matching the serial number on its label against the serial number the system reported (more on that at the end). Also, since the drive was no longer showing up in my system, I had to issue one of the following commands to remove it from the array:

mdadm /dev/md0 -r failed

or

mdadm /dev/md0 -r detached

The man page says:

“The first causes all failed devices to be removed. The second causes any device which is no longer connected to the system (i.e. an ‘open’ returns ENXIO) to be removed. This will only succeed for devices that are spares or have already been marked as failed.”
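For the serial-number matching mentioned above: since sdd had already dropped off the system I couldn’t query it directly, but checking the surviving drives and working by elimination gets you the same answer. Either of these shows serial numbers (a generic approach, not lifted from my notes):

# ls -l /dev/disk/by-id/ | grep -v part
# smartctl -i /dev/sda | grep -i serial

And once the replacement drive is physically in, the rebuild itself is roughly this, assuming the new drive shows up as sdd again and has been partitioned to match the old layout (sdd2 here mirrors the old member partition name):

# mdadm /dev/md0 --add /dev/sdd2

After that, cat /proc/mdstat should show the array recovering onto the new drive.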